
Monitoring

Health checks, doctor command, and observability surfaces.

Koda exposes every check an operator needs through HTTP probes, the doctor command, and the dashboard's Operations view. No custom agents, no sidecars — the monitoring surface is the same stack you installed.

Health probes

Two HTTP endpoints are safe to hit from any orchestrator:

  • GET /api/runtime/readiness — runtime readiness. 200 when accepting tasks, 503 otherwise.
  • GET /health — lightweight liveness probe on the app service. Returns 200 if the process is alive.
bash
# -sS hides curl's progress meter when piping, but still prints errors
curl -sS https://koda.example.com/health | jq
curl -sS https://koda.example.com/api/runtime/readiness | jq
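For scripts that need the status code rather than the response body, a minimal Python sketch of the same two probes might look like the following. The function names are illustrative. Treating any answer other than 200 or 503 as "unreachable" is an assumption, not documented Koda behavior.

```python
import urllib.error
import urllib.request
from typing import Optional


def probe(url: str, timeout: float = 5.0) -> Optional[int]:
    """Return the HTTP status code for a GET on url, or None if unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urllib raises on 4xx/5xx; the status code is still what we want.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, TLS error.
        return None


def readiness_state(status: Optional[int]) -> str:
    """Interpret a status code from GET /api/runtime/readiness."""
    if status == 200:
        return "ready"       # accepting tasks
    if status == 503:
        return "not-ready"   # reachable but not accepting tasks
    return "unreachable"     # ASSUMPTION: anything else is treated as down
```

Calling `readiness_state(probe("https://koda.example.com/api/runtime/readiness"))` yields one of the three strings. The liveness check is simply `probe(".../health") == 200`.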

Doctor

koda doctor runs a broader set of checks than the HTTP probes: bootstrap config, secret hygiene, storage connectivity, dashboard reachability, and control-plane reachability. It's what the installer runs on first boot and what the update path runs before committing a new release.

bash
koda doctor --json

--json makes the output machine-readable, suitable for CI, alerting, or a scheduled cron job. Each check returns a status (pass, warn, or fail) and a human-readable message.

Schedule the doctor
A cron job that runs koda doctor --json every few minutes catches slow drift (a rotated provider key, a nearly full volume, a missing session secret) before a user hits it. Pipe the JSON into whatever alerting you already have.
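One way to connect a scheduled run to alerting is a small wrapper that runs the doctor, keeps only the non-passing checks, and returns a non-zero exit code so cron or CI can react. The report shape used here is an assumption for illustration: either a JSON list of checks or an object with a `checks` list, where each check carries `name`, `status`, and `message`. This page does not show the real schema, so adjust the keys to match what `koda doctor --json` actually prints.

```python
import json
import subprocess
import sys


def failing_checks(report_text: str) -> list:
    """Return the checks whose status is not "pass".

    ASSUMPTION: the report is a JSON list of checks, or an object with a
    "checks" list; each check has "name", "status", and "message" keys.
    """
    report = json.loads(report_text)
    checks = report["checks"] if isinstance(report, dict) else report
    return [c for c in checks if c.get("status") != "pass"]


def main() -> int:
    # The doctor's own exit code is not relied on here, because this page
    # does not describe it; only the JSON on stdout is inspected.
    proc = subprocess.run(
        ["koda", "doctor", "--json"],
        capture_output=True, text=True, check=False,
    )
    problems = failing_checks(proc.stdout)
    for check in problems:
        print(
            f'{check.get("status")}: {check.get("name")}: {check.get("message")}',
            file=sys.stderr,
        )
    return 1 if problems else 0
```

A script entry point would call `sys.exit(main())`. Both warn and fail count as problems here, which matches the "alert on non-pass" guidance.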

Operations dashboard

The dashboard's Operations view aggregates the same data the APIs return, with two extras:

  • Integration health history — recent verification results per provider and integration. A pattern of failing verifications after a key rotation shows up here before an agent task fails.
  • Audit feed — live tail of security.* events. Useful for spotting rejected-message storms or lockout patterns.

Logs

  • koda logs tails the combined compose log. Pass service names to filter: koda logs web app.
  • Structured logs — the control plane and runtime emit structured JSON on stderr. Pipe into journald, Loki, or whatever log aggregator you already run.
  • Audit events — the security.* event family appears in both the logs and the audit store; treat these events as the canonical record.
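As a sketch of how the structured stream could be filtered for the security.* family before it reaches an aggregator, the generator below reads JSON lines and yields only matching records. The field name `event` is hypothetical, because the log schema is not shown here. Pass a different `key` if the real records name it differently.

```python
import json
from typing import Iterable, Iterator


def security_events(lines: Iterable[str], key: str = "event") -> Iterator[dict]:
    """Yield parsed log records whose event name starts with "security."."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        # ASSUMPTION: the event name lives under `key` in a JSON object.
        if isinstance(record, dict) and str(record.get(key, "")).startswith("security."):
            yield record
```

Feeding it `sys.stdin` lets it sit at the end of a pipe. Because the services log to stderr, that stream needs redirecting first.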

Metrics

Koda emits Prometheus metrics from the app service at /metrics. The repository also ships Prometheus alert rules and a Grafana dashboard that reference the metric names emitted by koda/services/metrics.py.

  • Scrape :8090/metrics for runtime, queue, cost, tools, memory, RunGraph, proposals, skills, channels, quality cockpit, and ops benchmark metrics.
  • Keep blackbox probes for /health and /api/runtime/readiness so liveness/readiness failures are visible even when app metrics are degraded.
  • Parse structured logs with Promtail or Vector for narrative incident context that metrics cannot carry.
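For ad hoc checks outside Prometheus, the text returned by /metrics can be read with a few lines of Python. This is a deliberately minimal reader of the Prometheus text exposition format, not a full parser. The metric names in the usage note are hypothetical. The real names are the ones emitted by koda/services/metrics.py.

```python
import re

# name, optional {labels}, value; an optional trailing timestamp is ignored.
_SAMPLE = re.compile(
    r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)(?:\s+-?\d+)?$'
)


def parse_samples(text: str) -> dict:
    """Map 'name{labels}' to a float value for each sample line."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and HELP/TYPE comments
        match = _SAMPLE.match(line)
        if not match:
            continue  # skip anything that is not a sample line
        name, labels, value = match.groups()
        try:
            samples[name + (labels or "")] = float(value)
        except ValueError:
            continue  # value was not a number
    return samples
```

Given a body containing a line such as `koda_queue_depth{queue="default"} 7` (a made-up metric), the result maps that full series string to `7.0`.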

Monitoring checklist

  • Liveness probe: HTTP GET /health on :8090.
  • Readiness probe: HTTP GET /api/runtime/readiness.
  • Metrics scrape: HTTP GET /metrics on :8090.
  • Doctor on a schedule: koda doctor --json every few minutes, alert on non-pass.
  • Postgres: standard metrics (connections, replication lag, transaction rate). Alert when the most recent backup is more than 24 h old.
  • Object storage: bucket exists, write succeeds, free space headroom.
  • Disk: ${STATE_ROOT_DIR} and ${ARTIFACT_STORE_DIR} free space.
  • Session secret: present. Missing WEB_OPERATOR_SESSION_SECRET in production is a boot failure.
  • TLS: certificate expiry at the reverse proxy.
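The host-level items on this checklist (disk headroom and the session secret) are easy to cover with a small script. The sketch below assumes the three variables are visible in the environment the script runs in. The 10% free-space threshold is an arbitrary example, not a Koda recommendation.

```python
import os
import shutil


def free_fraction(path: str) -> float:
    """Fraction of the filesystem holding path that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total if usage.total else 0.0


def host_findings(env: dict, min_free: float = 0.10) -> list:
    """Return human-readable problems; an empty list means nothing was found."""
    findings = []
    for var in ("STATE_ROOT_DIR", "ARTIFACT_STORE_DIR"):
        path = env.get(var)
        if not path or not os.path.isdir(path):
            findings.append(f"{var} is unset or not a directory")
        elif free_fraction(path) < min_free:
            findings.append(f"{var} has less than {min_free:.0%} free")
    # Presence only; the secret's value is never printed or inspected.
    if not env.get("WEB_OPERATOR_SESSION_SECRET"):
        findings.append("WEB_OPERATOR_SESSION_SECRET is missing")
    return findings
```

Calling `host_findings(dict(os.environ))` on the host returns a list that can be forwarded to the same alerting channel as the doctor output.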
