Monitoring
Health checks, doctor command, and observability surfaces.
Koda exposes every check an operator needs through HTTP probes, the doctor command, and the dashboard's Operations view. No custom agents, no sidecars — the monitoring surface is the same stack you installed.
Health probes
Three HTTP endpoints are safe to hit from any orchestrator:
- GET /api/runtime/ready: runtime readiness. 200 when accepting tasks, 503 otherwise.
- GET /api/control-plane/health: control-plane health. Returns a component-level payload (database, object storage, gRPC services).
- GET /health: lightweight liveness probe on the app service. Returns 200 if the process is alive.
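As a rough illustration of how an external watchdog might interpret these probes, the Python sketch below maps the documented status codes to coarse states. Only the three paths come from this page; the helper names, the state labels, and the idea of passing a single base URL are assumptions made for the example (the checklist further down places /health on port 8090, so the liveness probe may need a different base URL).

```python
import urllib.error
import urllib.request

# Probe paths as listed above; everything else here is a hypothetical helper.
PROBES = {
    "liveness": "/health",
    "readiness": "/api/runtime/ready",
    "control_plane": "/api/control-plane/health",
}


def classify(status_code):
    """Map an HTTP status code to a coarse state label (labels are our own)."""
    if status_code == 200:
        return "ok"
    if status_code == 503:
        return "unavailable"  # e.g. the runtime is not accepting tasks
    return "unexpected"


def probe(base_url, path, timeout=5.0):
    """Return the state for one endpoint; network failures count as 'unreachable'."""
    url = base_url.rstrip("/") + path
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return classify(resp.status)
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx responses, so a 503 arrives here.
        return classify(err.code)
    except (urllib.error.URLError, OSError):
        return "unreachable"


def probe_all(base_url):
    """Probe every endpoint against one base URL and return {probe_name: state}."""
    return {name: probe(base_url, path) for name, path in PROBES.items()}
```

Calling `probe_all("https://koda.example.com")` would return one state per probe, which a scheduler could compare against the previous run before alerting.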
curl https://koda.example.com/api/control-plane/health | jq
Doctor
koda doctor runs a broader set of checks than the HTTP probes: bootstrap config, secret hygiene, storage connectivity, dashboard reachability, and control-plane reachability. It's what the installer runs on first boot and what the update path runs before committing a new release.
koda doctor --json
--json makes the output machine-readable, suitable for CI, alerting, or a scheduled cron job. Each check returns a status (pass, warn, fail) and a human-readable message.
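A minimal sketch of turning that output into an alert decision follows. The exact JSON layout is not specified on this page, so the code assumes a JSON array of objects with `name`, `status`, and `message` keys; treat that shape, the function names, and the exit-code convention as hypothetical and adjust them to the real output.

```python
import json

# Severity order for the three statuses named above.
SEVERITY = {"pass": 0, "warn": 1, "fail": 2}


def non_passing(doctor_json):
    """Return (status, name, message) tuples for every check that is not 'pass'.

    Assumes a JSON array of objects with 'name', 'status' and 'message' keys;
    that shape is a guess, not the documented format.
    """
    checks = json.loads(doctor_json)
    problems = [
        (c.get("status", "fail"), c.get("name", "?"), c.get("message", ""))
        for c in checks
        if c.get("status") != "pass"
    ]
    # Worst first, so 'fail' entries lead the alert text.
    return sorted(problems, key=lambda p: SEVERITY.get(p[0], 2), reverse=True)


def exit_code(problems):
    """0 if clean, 1 if only warnings, 2 if any failure (our own convention)."""
    if not problems:
        return 0
    return max(SEVERITY.get(status, 2) for status, _, _ in problems)
```

In a scheduled job, the string passed to `non_passing` could come from `subprocess.run(["koda", "doctor", "--json"], capture_output=True, text=True).stdout`, with `exit_code` deciding whether the job pages anyone.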
Running koda doctor --json every few minutes catches slow drift (a rotated provider key, a nearly full volume, a missing session secret) before a user hits it. Pipe the JSON into whatever alerting you already have.
Operations dashboard
The dashboard's Operations view aggregates the same data the APIs return, with two extras:
- Integration health history: recent verification results per provider and integration. A pattern of failing verifications after a key rotation shows up here before an agent task fails.
- Audit feed: live tail of security.* events. Useful for spotting rejected-message storms or lockout patterns.
Logs
- koda logs tails the combined compose log. Pass service names to filter: koda logs web app.
- Structured logs: the control plane and runtime emit structured JSON on stderr. Pipe into journald, Loki, or whatever log aggregator you already run.
- Audit events: the security.* event family is both in the logs and in the audit store; treat them as the canonical record.
Metrics
A metrics surface is not part of the default install. Existing deployments typically scrape the JSON health endpoints and derive metrics from structured logs. If you need full Prometheus-style scraping, the typical pattern is:
- Poll /api/control-plane/health with your Prometheus blackbox_exporter.
- Parse structured logs with Promtail or Vector to derive koda_task_*, koda_auth_*, and koda_provider_* metrics.
- Build dashboards on the familiar trio: request volume, error rate, queue depth.
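To make the log-derived step concrete, here is a small Python sketch of the counting that a Promtail or Vector pipeline would normally do. It is not how Koda or those tools implement it: the `event` field name and the `koda_events_total` metric name are hypothetical, and the real structured logs may use different keys.

```python
import json
from collections import Counter


def count_events(lines, field="event"):
    """Count log records per event family (the text before the first dot).

    `field` is a hypothetical key name; use whatever key the real logs carry.
    Lines that are not JSON objects are skipped.
    """
    counts = Counter()
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict):
            continue
        name = record.get(field)
        if isinstance(name, str) and name:
            counts[name.split(".", 1)[0]] += 1
    return counts


def to_exposition(counts, prefix="koda_"):
    """Render counts as Prometheus text-format sample lines, sorted by family."""
    return [
        f'{prefix}events_total{{family="{family}"}} {n}'
        for family, n in sorted(counts.items())
    ]
```

Feeding it the stderr stream of the control plane would yield one counter per event family, which is enough to chart volume and error rate before investing in a full pipeline.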
Monitoring checklist
- Liveness probe: HTTP GET /health on :8090.
- Readiness probe: HTTP GET /api/runtime/ready.
- Doctor on a schedule: koda doctor --json every few minutes, alert on non-pass.
- Postgres: standard metrics (connections, replication lag, transaction rate). Alert when the most recent backup is more than 24 h old.
- Object storage: bucket exists, write succeeds, free space headroom.
- Disk: $${STATE_ROOT_DIR} and $${ARTIFACT_STORE_DIR} free space.
- Session secret: present. Missing WEB_OPERATOR_SESSION_SECRET in production is a boot failure.
- TLS: certificate expiry at the reverse proxy.
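The disk item in the checklist above can be approximated with a few lines of Python. This is only a sketch: it assumes STATE_ROOT_DIR and ARTIFACT_STORE_DIR are available as environment variables to the checking process, and the 15% threshold and function names are arbitrary choices for illustration.

```python
import os
import shutil


def free_fraction(path):
    """Fraction of the filesystem holding `path` that is still free (0.0 to 1.0)."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return usage.free / usage.total


def low_disk(env_names=("STATE_ROOT_DIR", "ARTIFACT_STORE_DIR"), min_free=0.15, environ=None):
    """Return {env_name: free_fraction} for directories below `min_free`.

    The 15% default is an example value. Unset variables or missing
    directories are reported with None so they are not silently ignored.
    """
    environ = os.environ if environ is None else environ
    findings = {}
    for name in env_names:
        path = environ.get(name)
        if not path or not os.path.isdir(path):
            findings[name] = None
            continue
        fraction = free_fraction(path)
        if fraction < min_free:
            findings[name] = fraction
    return findings
```

An empty result means both directories have headroom; anything else can be forwarded to the same alerting channel as the doctor output.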
Next steps
- Troubleshooting — how to respond when one of the checks above fires.
- Security — the audit event taxonomy.