Troubleshooting

Common issues and where to look first.

When something goes wrong, work through the checks below in order. Most issues are visible from three places: koda doctor, the dashboard Operations view, and the container logs.

First line of defence

Always run koda doctor first. It takes a few seconds, rules out bootstrap problems, and tells you whether the issue is local (env, storage, secrets) or downstream (provider, integration, runtime task).

Doctor is red

Bootstrap configuration failed

Missing session secret. Set WEB_OPERATOR_SESSION_SECRET to 32+ random bytes in .env. Restart the stack.
Conflicting env profile. KODA_ENV=production with CONTROL_PLANE_AUTH_MODE=development or ALLOW_LOOPBACK_BOOTSTRAP=true is refused at boot. Fix the .env and restart.
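
The two fixes above can be sketched as shell helpers. This is a minimal sketch, not part of the koda CLI: the function names are hypothetical, it assumes the secret is accepted as a hex string, and it assumes .env holds plain unquoted KEY=value lines.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; none of these are koda commands.

# Print 32 random bytes as 64 hex characters, usable as a value for
# WEB_OPERATOR_SESSION_SECRET (assuming a hex string is accepted).
gen_session_secret() {
  head -c 32 /dev/urandom | od -An -v -tx1 | tr -d ' \n'
  echo
}

# Print the value of KEY from an env file. Assumes simple KEY=value
# lines without quotes; the last assignment wins.
env_value() {
  local file=$1 key=$2
  sed -n "s/^${key}=//p" "$file" | tail -n 1
}

# Return non-zero (and explain on stderr) if the file combines
# KODA_ENV=production with a development-only setting.
check_env_profile() {
  local file=$1 status=0
  [ "$(env_value "$file" KODA_ENV)" = "production" ] || return 0
  if [ "$(env_value "$file" CONTROL_PLANE_AUTH_MODE)" = "development" ]; then
    echo "conflict: CONTROL_PLANE_AUTH_MODE=development with KODA_ENV=production" >&2
    status=1
  fi
  if [ "$(env_value "$file" ALLOW_LOOPBACK_BOOTSTRAP)" = "true" ]; then
    echo "conflict: ALLOW_LOOPBACK_BOOTSTRAP=true with KODA_ENV=production" >&2
    status=1
  fi
  return "$status"
}
```

With these loaded, `gen_session_secret` prints a value to paste into .env, and `check_env_profile .env` reports either conflict before you restart the stack.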

Storage connectivity failed

Postgres unreachable. Check koda logs postgres — look for disk-full or out-of-memory kills. Confirm KNOWLEDGE_V2_POSTGRES_DSN is correct.
Object storage missing bucket. seaweedfs-init should create it on first boot. If it didn't, run koda down && koda up to rerun the init container. If that fails, the compose logs show the AWS CLI's exact complaint.

Dashboard unreachable

koda logs web usually shows a build or runtime error at the top of recent output.
Check the reverse proxy is forwarding / to 127.0.0.1:3000 and that the cert is valid.
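
One way to separate a dashboard problem from a proxy problem is to request the upstream directly and then through the proxy. The helper below is a sketch using standard curl options; the function name is hypothetical, and the public hostname in the usage note is a placeholder.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the HTTP status code for a URL.
# curl reports connection or TLS errors on stderr (-S) and prints 000
# as the status code when no response was received.
http_status() {
  curl -sS -o /dev/null --max-time 5 -w '%{http_code}\n' "$1"
}
```

Run `http_status http://127.0.0.1:3000/` on the host, then `http_status https://koda.example.com/` with your own hostname. If the first succeeds and the second fails, the fault is most likely in the proxy configuration or the certificate rather than in the web container.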

Authentication issues

Account lockout. Five failed logins lock the account for 15 minutes by default. Wait it out, or adjust CONTROL_PLANE_OPERATOR_LOGIN_LOCKOUT_SECONDS (the value is in seconds, so 15 minutes is 900).
Session cookie rejected. In production the cookie must be Secure and served over HTTPS. Check DevTools → Application → Cookies.
Clock skew. If the host's clock is far off, JWT and session validation can fail. Run timedatectl to check the system time and whether the clock is synchronized.

Forgot password with no SMTP

Recovery codes are the recovery path when SMTP isn't configured.

Navigate to /forgot-password.
Enter the email and one of the recovery codes saved at registration.
Set a new password. All remaining recovery codes are invalidated, so regenerate a fresh set from Settings → Security.

Lost recovery codes

There is no self-service recovery without a code. Regenerating requires host access: stop the stack, run the bootstrap-code flow, re-register the owner. Treat this as a disaster-recovery procedure, not a routine one.

Provider issues

Provider verification fails

Read the raw error surfaced in the dashboard — it comes straight from the provider. The usual suspects: rotated key, wrong scope, missing project/org, billing block.
If the verify button stays greyed out, the control plane is rate-limiting repeated verification attempts. Wait about a minute and try again.

Provider fails mid-task

The runtime attempts to resume on another provider if one is configured. If you see tasks failing without a fallback, configure at least one peer provider.

Agent issues

Agent doesn't reply

Telegram allow list. Unapproved users are silently dropped; the audit feed shows security.telegram.rejected. Add the user ID / username in Settings → Telegram access.
Draft agent. An unpublished agent won't accept tasks. Publish it from the dashboard.
Provider credential missing. Agents inherit the provider default. If no default is set, creation was incomplete — walk the provider wizard again.

Tool loop never terminates

The runtime caps iterations. If you see a task stuck, open the trace view — the tool dispatcher logs every iteration with a cycle-detection note. Usually the agent is repeating a failed command without updating its approach; fix the underlying failure (missing tool, blocked command, unreadable path) rather than the loop cap.

Storage issues

Postgres disk full

Check which tables are largest: the knowledge_* and retrieval_traces tables grow with use.
Lower MEMORY_MAX_PER_USER to make maintenance prune more aggressively, then re-run the maintenance job.
If you retain every retrieval trace, schedule a truncation after N days. Traces are inspection aids, not durable state.
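
To see which tables are largest, a query over PostgreSQL's standard statistics views is usually enough. The sketch below only prints the SQL; the function name is hypothetical, and piping it to psql assumes psql is installed and that KNOWLEDGE_V2_POSTGRES_DSN is a connection string psql accepts.

```shell
#!/usr/bin/env bash
# Hypothetical helper: emit a query listing the ten largest user tables.
# pg_total_relation_size counts table data, indexes, and TOAST storage.
largest_tables_sql() {
  cat <<'SQL'
SELECT relname AS table_name,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;
SQL
}
```

For example: `largest_tables_sql | psql "$KNOWLEDGE_V2_POSTGRES_DSN"`. If the knowledge_* or retrieval_traces tables lead the list, the pruning and truncation steps above are the place to start.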

Object storage over quota

Artifacts accumulate over time. Audit artifact_manifests and delete entries older than your retention window.
If you're on AWS S3, enable lifecycle rules on the Koda bucket.
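
A lifecycle rule for the S3 case might look like the sketch below. The function name, the rule ID, and the 90-day figure in the usage note are placeholders; the JSON follows the shape expected by the AWS CLI's put-bucket-lifecycle-configuration command, and the empty prefix applies the rule to every object in the bucket.

```shell
#!/usr/bin/env bash
# Hypothetical helper: emit an S3 lifecycle configuration that expires
# all objects N days after creation.
lifecycle_json() {
  local days=$1
  cat <<JSON
{
  "Rules": [
    {
      "ID": "expire-old-artifacts",
      "Status": "Enabled",
      "Filter": { "Prefix": "" },
      "Expiration": { "Days": ${days} }
    }
  ]
}
JSON
}
```

Write the file with `lifecycle_json 90 > lifecycle.json`, then apply it with `aws s3api put-bucket-lifecycle-configuration --bucket <your-koda-bucket> --lifecycle-configuration file://lifecycle.json`. Keep the expiry in line with the retention window you use for artifact_manifests, otherwise manifests may point at objects that no longer exist.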

Upgrade went wrong

koda update rolls back automatically if the post-update doctor run is red. If that didn't fire (rare, and usually because an external dependency changed), roll back manually:

```bash
koda down
# restore the previous .env from backup if you changed it
koda install --manifest ~/.koda/previous-release.yaml
```

Getting help

GitHub Discussions — the Discussions tab on the main repo is the right place for questions.
GitHub Issues — for reproducible bugs, with the output of koda doctor --json and relevant log snippets.
Security reports — see SECURITY.md in the repo. Do not open public issues for vulnerabilities.
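
Gathering the material for a bug report can be scripted. This is a sketch under assumptions: the function name is hypothetical, the 200-line cutoff is arbitrary, and it assumes koda logs prints recent output and exits rather than streaming.

```shell
#!/usr/bin/env bash
# Hypothetical helper: collect doctor output and recent log lines
# into one directory for attaching to an issue.
collect_report() {
  local out=$1 svc
  mkdir -p "$out"
  koda doctor --json > "$out/doctor.json" 2>&1 || true
  for svc in web postgres; do
    # Keep only the last 200 lines of each service log.
    koda logs "$svc" 2>&1 | tail -n 200 > "$out/$svc.log" || true
  done
}
```

Run `collect_report ./koda-report` and review the files before attaching them; logs can contain secrets or user data that should be redacted first.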

Next steps

Monitoring — probes and checks that catch issues before they page you.
Security — the full hardening and audit model.