Monitoring

Health checks, doctor command, and observability surfaces.

Koda exposes every check an operator needs through HTTP probes, the doctor command, and the dashboard's Operations view. No custom agents, no sidecars — the monitoring surface is the same stack you installed.

Health probes

Three HTTP endpoints are safe to hit from any orchestrator:

GET /api/runtime/ready — runtime readiness. 200 when accepting tasks, 503 otherwise.
GET /api/control-plane/health — control-plane health. Returns a component-level payload (database, object storage, gRPC services).
GET /health — lightweight liveness probe on the app service. Returns 200 if the process is alive.

bash

$ curl -s https://koda.example.com/api/control-plane/health | jq
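
When an orchestrator or deploy script only needs the status code, a pair of small shell functions can be enough. This is a minimal sketch that assumes curl is installed; KODA_URL, PROBE_INTERVAL, and both function names are illustrative and not part of Koda.

```shell
# Base URL of the deployment (illustrative; point it at your own host).
KODA_URL="${KODA_URL:-https://koda.example.com}"

# Print the HTTP status code for a path; curl prints 000 when no response arrives.
probe_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 5 "${KODA_URL}$1" || true
}

# Poll the readiness endpoint until it returns 200, giving up after N attempts.
wait_until_ready() {
  attempts="${1:-30}"
  while [ "$attempts" -gt 0 ]; do
    [ "$(probe_status /api/runtime/ready)" = "200" ] && return 0
    attempts=$((attempts - 1))
    sleep "${PROBE_INTERVAL:-2}"
  done
  return 1
}
```

A deploy script could call wait_until_ready 30 and abort the rollout if it returns non-zero.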

Doctor

koda doctor runs a broader set of checks than the HTTP probes: bootstrap config, secret hygiene, storage connectivity, dashboard reachability, and control-plane reachability. It's what the installer runs on first boot and what the update path runs before committing a new release.

bash

$ koda doctor --json

--json makes the output machine-readable, which suits CI, alerting, or a scheduled cron job. Each check returns a status (pass, warn, or fail) and a human-readable message.
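
To gate a CI step on the result, the statuses can be tallied from a saved report. The exact JSON layout is not specified here, so this sketch only assumes that each check serialises its status as "status": "pass", "warn", or "fail" somewhere in the output; the function names are hypothetical. If jq is available and you know the real schema, a jq filter would be more robust than grep.

```shell
# Count the checks that reported a given status in doctor JSON read from stdin.
# Assumption: statuses appear as "status": "<pass|warn|fail>" string pairs.
count_status() {
  { grep -oE "\"status\"[[:space:]]*:[[:space:]]*\"$1\"" || true; } | wc -l | tr -d ' '
}

# Print a one-line summary of a saved report; fail if any check is warn or fail.
summarise_report() {
  report="$1"
  pass=$(count_status pass < "$report")
  warn=$(count_status warn < "$report")
  fail=$(count_status fail < "$report")
  echo "pass=$pass warn=$warn fail=$fail"
  [ "$warn" -eq 0 ] && [ "$fail" -eq 0 ]
}
```

For example: koda doctor --json > doctor.json, then summarise_report doctor.json as the final step of the job.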

Schedule the doctor

A cron job running koda doctor --json every few minutes catches slow drift (a rotated provider key, a nearly full volume, a missing session secret) before a user hits it. Pipe the JSON into whatever alerting you already have.
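
One possible shape for that cron job is sketched below. ALERT_CMD is a placeholder for your own notifier, the script path in the crontab comment is made up, and the check relies on the JSON body rather than on the exit code of koda doctor, since exit-code behaviour is not described here.

```shell
# Run the doctor once and pass the report to an alert hook on any non-pass check.
# ALERT_CMD is a placeholder notifier that receives the report path as its argument.
run_doctor_check() {
  report="$(mktemp)"
  koda doctor --json > "$report" 2>/dev/null || true   # judge by the JSON, not the exit code
  # An empty report is treated as a failure: the doctor itself could not run.
  if [ ! -s "$report" ] || grep -qE '"status"[[:space:]]*:[[:space:]]*"(warn|fail)"' "$report"; then
    "${ALERT_CMD:-cat}" "$report"
    rm -f "$report"
    return 1
  fi
  rm -f "$report"
}

# Example crontab entry (every 5 minutes), assuming this file is saved as
# /usr/local/bin/koda-doctor-check.sh and ends with a call to run_doctor_check:
# */5 * * * * /usr/local/bin/koda-doctor-check.sh
```

With the default ALERT_CMD of cat, the failing report goes to standard output, which cron typically mails to the job owner.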

Operations dashboard

The dashboard's Operations view aggregates the same data the APIs return, with two extras:

Integration health history — recent verification results per provider and integration. A pattern of failing verifications after a key rotation shows up here before an agent task fails.
Audit feed — live tail of security.* events. Useful for spotting rejected-message storms or lockout patterns.

Logs

koda logs tails the combined compose log. Pass service names to filter: koda logs web app.
Structured logs — the control plane and runtime emit structured JSON on stderr. Pipe into journald, Loki, or whatever log aggregator you already run.
Audit events — the security.* event family is written to both the logs and the audit store; treat these events as the canonical record.
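
For a quick look at security.* events without opening the dashboard, the structured log stream can be filtered with grep. This sketch assumes one JSON object per log line and that the event name appears as a JSON string value beginning with security.; which field carries it is not assumed, and the function names are hypothetical.

```shell
# Keep only log lines that contain a security.* event name as a JSON string.
# Assumption: one JSON object per line; the field holding the name is not assumed.
filter_security_events() {
  grep -E '"security\.[A-Za-z0-9_.-]+"' || true
}

# Count matching lines on stdin, e.g. to spot a rejected-message storm.
count_security_events() {
  filter_security_events | wc -l | tr -d ' '
}
```

For a live view, pipe koda logs app 2>&1 into filter_security_events; for a count, feed count_security_events a saved log file on standard input.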

Metrics

A metrics surface is not part of the default install. Existing deployments typically scrape the JSON health endpoints and derive metrics from structured logs. If you need full Prometheus-style scraping, the typical pattern is:

Poll /api/control-plane/health with your Prometheus blackbox_exporter.
Parse structured logs with Promtail or Vector to derive koda_task_*, koda_auth_*, and koda_provider_* metrics.
Build dashboards on the familiar trio: request volume, error rate, queue depth.
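
Where blackbox_exporter is not already running, a comparable up/down signal can be produced with a short script that writes Prometheus text exposition format for the node_exporter textfile collector. This is an alternative to the blackbox approach, not something Koda ships: the metric name koda_probe_up, the KODA_URL variable, and the output path in the comment are all made up for illustration.

```shell
# Base URL of the deployment (illustrative; point it at your own host).
KODA_URL="${KODA_URL:-https://koda.example.com}"

# Emit one gauge sample per path given as an argument, in Prometheus text format:
# 1 when the endpoint answers 200, 0 otherwise. koda_probe_up is a made-up name.
emit_probe_metrics() {
  echo '# HELP koda_probe_up Whether the Koda endpoint returned HTTP 200.'
  echo '# TYPE koda_probe_up gauge'
  for path in "$@"; do
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 5 "${KODA_URL}${path}" || true)
    if [ "$code" = "200" ]; then up=1; else up=0; fi
    echo "koda_probe_up{endpoint=\"${path}\"} ${up}"
  done
}

# Typical use from cron: write to a temp file, then rename so the collector
# never reads a half-written file (the directory is whatever your
# node_exporter --collector.textfile.directory flag points at):
# emit_probe_metrics /api/runtime/ready /api/control-plane/health > "$dir/koda.prom.$$" \
#   && mv "$dir/koda.prom.$$" "$dir/koda.prom"
```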

Monitoring checklist

Liveness probe: HTTP GET /health on :8090.
Readiness probe: HTTP GET /api/runtime/ready.
Doctor on a schedule: koda doctor --json every few minutes, alert on non-pass.
Postgres: standard metrics (connections, replication lag, transaction rate). Alert when the most recent backup is more than 24 h old.
Object storage: bucket exists, write succeeds, free space headroom.
Disk: ${STATE_ROOT_DIR} and ${ARTIFACT_STORE_DIR} free space.
Session secret: present. Missing WEB_OPERATOR_SESSION_SECRET in production is a boot failure.
TLS: certificate expiry at the reverse proxy.
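
Several of these items can be covered with ordinary host tooling. The sketch below uses df, the shell, and openssl; the function names and the default thresholds (15 percent free, 14 days to expiry) are assumptions to tune, and the TLS check assumes the reverse proxy listens on port 443.

```shell
# Succeed when the filesystem holding a directory has at least min_pct percent free.
check_disk_free() {
  dir="$1"; min_pct="${2:-15}"
  used=$(df -P "$dir" 2>/dev/null | awk 'NR==2 { sub("%", "", $5); print $5 }')
  [ -n "$used" ] && [ $((100 - used)) -ge "$min_pct" ]
}

# Succeed when the named environment variable is set and non-empty.
check_env_set() {
  eval "value=\${$1:-}"
  [ -n "$value" ]
}

# Succeed when the certificate served by host:443 is still valid N days from now.
check_tls_expiry() {
  host="$1"; days="${2:-14}"
  openssl s_client -connect "${host}:443" -servername "$host" </dev/null 2>/dev/null \
    | openssl x509 -noout -checkend $((days * 86400)) >/dev/null
}
```

For example: check_disk_free "$STATE_ROOT_DIR" 15, check_env_set WEB_OPERATOR_SESSION_SECRET, and check_tls_expiry koda.example.com 14, each followed by whatever alert action you use on a non-zero exit.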

Next steps

Troubleshooting — how to respond when one of the checks above fires.
Security — the audit event taxonomy.