Troubleshooting
Common issues and where to look first.
When something goes wrong, work through the checks below in the order a typical operator would. Most issues are visible from three places: koda doctor, the dashboard's Operations view, and the container logs.
Always run koda doctor first. It takes a few seconds, rules out bootstrap problems, and tells you whether the issue is local (env, storage, secrets) or downstream (provider, integration, runtime task).
Doctor is red
Bootstrap configuration failed
- Missing session secret. Set WEB_OPERATOR_SESSION_SECRET to 32+ random bytes in .env, then restart the stack.
- Conflicting env profile. KODA_ENV=production combined with CONTROL_PLANE_AUTH_MODE=development or ALLOW_LOOPBACK_BOOTSTRAP=true is refused at boot. Fix the .env and restart.
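Both bootstrap failures can be caught before you restart. The sketch below is a minimal pre-flight check. It assumes a simple KEY=VALUE .env with no quoting or export prefixes, and it assumes the session secret is stored hex-encoded, so 32 bytes means 64 characters. The helper names (parse_env, bootstrap_problems, new_session_secret) are hypothetical, and this is not Koda's own validation logic.

```python
import secrets


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments (no quote handling)."""
    env: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        env[key.strip()] = value.strip()
    return env


def bootstrap_problems(env: dict[str, str]) -> list[str]:
    """Flag the two bootstrap failures described above."""
    problems = []
    # Assumption: the secret is hex-encoded, so 32 bytes means at least 64 characters.
    if len(env.get("WEB_OPERATOR_SESSION_SECRET", "")) < 64:
        problems.append("WEB_OPERATOR_SESSION_SECRET is missing or shorter than 32 hex-encoded bytes")
    if env.get("KODA_ENV") == "production":
        if env.get("CONTROL_PLANE_AUTH_MODE") == "development":
            problems.append("KODA_ENV=production conflicts with CONTROL_PLANE_AUTH_MODE=development")
        if env.get("ALLOW_LOOPBACK_BOOTSTRAP", "").lower() == "true":
            problems.append("KODA_ENV=production conflicts with ALLOW_LOOPBACK_BOOTSTRAP=true")
    return problems


def new_session_secret() -> str:
    """32 random bytes from the OS CSPRNG, hex-encoded (64 characters)."""
    return secrets.token_hex(32)
```

To use it, pass the contents of your .env to parse_env and then to bootstrap_problems. If the secret is the problem, paste the output of new_session_secret() into .env.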
Storage connectivity failed
- Postgres unreachable. Check koda logs postgres for disk-full or out-of-memory kills, and confirm that KNOWLEDGE_V2_POSTGRES_DSN is correct.
- Object storage bucket missing. seaweedfs-init should create it on first boot. If it didn't, run koda down && koda up to rerun the init container. If that fails too, the compose logs show the AWS CLI's exact complaint.
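To confirm that KNOWLEDGE_V2_POSTGRES_DSN is at least well-formed, and that its host accepts TCP connections, a small check like the following can help. It is a sketch that assumes a URL-style DSN (postgresql://user:password@host:port/dbname). libpq key=value strings are not handled, and the helper names are hypothetical. The password is deliberately left out of the output.

```python
import socket
from urllib.parse import urlsplit


def describe_dsn(dsn: str) -> dict:
    """Split a URL-style Postgres DSN into parts; raise ValueError if it looks wrong."""
    parts = urlsplit(dsn)
    # Accept driver-qualified schemes such as "postgresql+psycopg" as well.
    if parts.scheme.split("+")[0] not in ("postgres", "postgresql"):
        raise ValueError(f"unexpected scheme {parts.scheme!r}")
    if not parts.hostname:
        raise ValueError("no host in DSN")
    database = parts.path.lstrip("/")
    if not database:
        raise ValueError("no database name in DSN")
    return {
        "host": parts.hostname,
        "port": parts.port or 5432,  # libpq's default port
        "user": parts.username,
        "database": database,
    }


def tcp_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A reachable port does not prove the credentials or database name are right. It does separate "Postgres is down or firewalled" from "the DSN is wrong".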
Dashboard unreachable
- koda logs web usually shows a build or runtime error at the top of the recent output.
- Check that the reverse proxy forwards / to 127.0.0.1:3000 and that the TLS certificate is valid.
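One way to tell a proxy problem from a web-container problem is to probe the upstream directly from the host and compare the result with the public URL. The sketch below uses only the standard library. It bypasses any HTTP(S)_PROXY environment variables so the request goes straight to the target. The function name and verdict strings are hypothetical.

```python
import urllib.error
import urllib.request

# Ignore HTTP(S)_PROXY environment variables so the probe hits the target directly.
_OPENER = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def probe(url: str, timeout: float = 5.0) -> str:
    """Return a one-line verdict for a GET against url."""
    try:
        with _OPENER.open(url, timeout=timeout) as resp:
            return f"ok: HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered with an error status, so the process is up.
        return f"responding: HTTP {exc.code}"
    except urllib.error.URLError as exc:
        # Refused connections, DNS failures and certificate errors land here.
        return f"unreachable: {exc.reason}"
    except OSError as exc:
        # For example, a timeout while reading the response.
        return f"unreachable: {exc}"
```

Run probe("http://127.0.0.1:3000/") on the host, then probe your public https URL. If the first is ok and the second is not, look at the reverse proxy or the certificate rather than the web container.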
Authentication issues
Can't log in
- Account lockout. Five failed logins lock the account for 15 minutes by default. Wait it out, or adjust CONTROL_PLANE_OPERATOR_LOGIN_LOCKOUT_SECONDS.
- Session cookie rejected. In production the cookie must be Secure and served over HTTPS. Check DevTools → Application → Cookies.
- Clock skew. If the host's clock is far off, JWT and session validation can fail. Run timedatectl to check the time and whether NTP synchronization is active.
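The lockout rule above behaves like a per-account failure counter with a timed lock. The following is a minimal model of those documented defaults, not Koda's implementation. It makes three assumptions: the counter resets on a successful login, state clears once the lock expires, and only the lockout duration (not the five-attempt threshold) comes from CONTROL_PLANE_OPERATOR_LOGIN_LOCKOUT_SECONDS, falling back to 900 seconds.

```python
import os
import time

MAX_FAILURES = 5  # documented default: five failed logins
# Documented default of 15 minutes; the env var name comes from this page.
LOCKOUT_SECONDS = int(os.environ.get("CONTROL_PLANE_OPERATOR_LOGIN_LOCKOUT_SECONDS", "900"))


class LoginThrottle:
    """Illustrative per-account lockout: N consecutive failures lock the account for a window."""

    def __init__(self, max_failures=MAX_FAILURES, lockout_seconds=LOCKOUT_SECONDS, clock=time.monotonic):
        self.max_failures = max_failures
        self.lockout_seconds = lockout_seconds
        self.clock = clock
        self.failures: dict[str, int] = {}
        self.locked_until: dict[str, float] = {}

    def is_locked(self, account: str) -> bool:
        until = self.locked_until.get(account)
        if until is None:
            return False
        if self.clock() >= until:
            # Lock expired: clear state so the account starts fresh.
            del self.locked_until[account]
            self.failures.pop(account, None)
            return False
        return True

    def record(self, account: str, success: bool) -> None:
        if success:
            self.failures.pop(account, None)
            return
        count = self.failures.get(account, 0) + 1
        self.failures[account] = count
        if count >= self.max_failures:
            self.locked_until[account] = self.clock() + self.lockout_seconds
```

The injectable clock keeps the model testable without waiting out a real 15-minute window.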
Forgot password with no SMTP
Recovery codes are the recovery path when SMTP isn't configured.
- Navigate to /forgot-password.
- Enter the email and one of the recovery codes saved at registration.
- Set a new password. All remaining recovery codes are invalidated, so regenerate a fresh set from Settings → Security.
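The single-use behaviour above can be illustrated with a small model. Codes are stored only as hashes, and one successful reset clears the whole set. This is a sketch under stated assumptions, not Koda's storage scheme. The SHA-256 hashing, the 10-hex-character code format, the input normalisation and the class name RecoveryCodes are all hypothetical.

```python
import hashlib
import hmac
import secrets


def _digest(code: str) -> str:
    # Normalise what users type (case, dashes, spaces) before hashing.
    normalised = code.replace("-", "").replace(" ", "").lower()
    return hashlib.sha256(normalised.encode()).hexdigest()


class RecoveryCodes:
    def __init__(self, count: int = 10):
        self.count = count
        self.hashes: list[str] = []

    def regenerate(self) -> list[str]:
        """Issue a fresh set; only hashes are kept, and the plaintext is shown once."""
        codes = [secrets.token_hex(5) for _ in range(self.count)]
        self.hashes = [_digest(c) for c in codes]
        return codes

    def reset_password(self, code: str) -> bool:
        """Accept one valid code, then invalidate every remaining code."""
        candidate = _digest(code)
        # compare_digest keeps each individual comparison constant-time.
        if not any(hmac.compare_digest(candidate, h) for h in self.hashes):
            return False
        self.hashes = []  # documented behaviour: all remaining codes are invalidated
        return True
```

After a reset the hash list is empty. That is why the page tells you to regenerate a fresh set from Settings.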
Lost recovery codes
There is no self-service recovery without a code. Recovering the account requires host access: stop the stack, run the bootstrap-code flow, and re-register the owner. Treat this as a disaster-recovery procedure, not a routine one.
Provider issues
Provider verification fails
- Read the raw error surfaced in the dashboard; it comes straight from the provider. The usual suspects are a rotated key, a wrong scope, a missing project or org, or a billing block.
- If the verify button stays greyed out, the control plane is rate-limiting repeated verification attempts. Wait a minute.
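A greyed-out verify button behaves like a per-provider cooldown. The model below assumes a 60-second window, taken from "wait a minute" above. The control plane's real limit and window are not documented on this page, and the class name is hypothetical.

```python
import time


class VerifyCooldown:
    """Illustrative cooldown: at most one verification attempt per provider per window."""

    def __init__(self, window_seconds: float = 60.0, clock=time.monotonic):
        self.window = window_seconds
        self.clock = clock
        self.last_attempt: dict[str, float] = {}

    def seconds_remaining(self, provider: str) -> float:
        last = self.last_attempt.get(provider)
        if last is None:
            return 0.0
        return max(0.0, last + self.window - self.clock())

    def try_verify(self, provider: str) -> bool:
        """Allow and record the attempt if the window has passed; otherwise refuse."""
        if self.seconds_remaining(provider) > 0:
            return False
        self.last_attempt[provider] = self.clock()
        return True
```

Under this model, each early click is refused without resetting the window, so waiting is the only way forward.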
Provider fails mid-task
The runtime attempts to resume on another provider if one is configured. If you see tasks failing without a fallback, configure at least one peer provider.
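Conceptually, falling back means trying providers in a configured order and moving on when one fails. The minimal sketch below shows that pattern with hypothetical names (ProviderError, run_with_fallback). It is not Koda's runtime code, which also has to carry task state across providers.

```python
from typing import Callable, Sequence


class ProviderError(Exception):
    """Raised by a provider call that failed (quota, auth, outage, and so on)."""


def run_with_fallback(task: str, providers: Sequence[tuple[str, Callable[[str], str]]]) -> tuple[str, str]:
    """Try each (name, call) in order; return (provider_name, result) from the first success."""
    errors: list[str] = []
    for name, call in providers:
        try:
            return name, call(task)
        except ProviderError as exc:
            errors.append(f"{name}: {exc}")
    # With only one provider configured, this is where the task ends up failing.
    raise ProviderError("all providers failed; " + "; ".join(errors))
```

With a single entry in providers, the first error ends the task, which matches the "failing without a fallback" symptom above.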
Agent issues
Agent doesn't reply
- Telegram allow list. Unapproved users are silently dropped; the audit feed shows security.telegram.rejected. Add the user ID or username in Settings → Telegram access.
- Draft agent. An unpublished agent won't accept tasks. Publish it from the dashboard.
- Provider credential missing. Agents inherit the provider default. If no default is set, setup was incomplete; run through the provider wizard again.
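The three causes above act as independent gates that a message has to clear before an agent replies. The sketch checks them in order and reports the first failure. The Agent fields, the allow-list shape (usernames stored lowercase without "@") and the function name are hypothetical. Only the security.telegram.rejected event name comes from this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Agent:
    published: bool
    provider: Optional[str] = None  # None means "inherit the default provider"


def why_no_reply(user_id: int, username: Optional[str], allow_ids: set[int], allow_usernames: set[str],
                 agent: Agent, default_provider: Optional[str]) -> Optional[str]:
    """Return the first reason an agent would not reply, or None if all gates pass."""
    # Assumption: allow_usernames holds lowercase names without a leading "@".
    name_ok = username is not None and username.lstrip("@").lower() in allow_usernames
    if user_id not in allow_ids and not name_ok:
        return "security.telegram.rejected: user not on the Telegram allow list"
    if not agent.published:
        return "agent is a draft: publish it from the dashboard"
    if (agent.provider or default_provider) is None:
        return "no provider credential: rerun the provider wizard to set a default"
    return None
```

The order matters for diagnosis. A rejected user never reaches the later checks, so the audit feed is the first place to look.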
Tool loop never terminates
The runtime caps tool-loop iterations. If a task looks stuck, open the trace view: the tool dispatcher logs every iteration with a cycle-detection note. Usually the agent is repeating a failed command without changing its approach. Fix the underlying failure (missing tool, blocked command, unreadable path) rather than raising the loop cap.
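A repeated failed command shows up in a trace as the same tool call producing the same error several times in a row. The sketch below shows that kind of check. It assumes each iteration is recorded as (tool, args, outcome) with "ok" marking success. The record shape, the threshold of three and the function name are hypothetical, not the dispatcher's actual format.

```python
from typing import NamedTuple, Optional


class Step(NamedTuple):
    tool: str
    args: str
    outcome: str  # "ok" or an error message


def find_stuck_call(trace: list[Step], repeats: int = 3) -> Optional[Step]:
    """Return the step that repeated with the same failing outcome `repeats` times in a row."""
    run = 1
    for prev, cur in zip(trace, trace[1:]):
        if cur == prev and cur.outcome != "ok":
            run += 1
            if run >= repeats:
                return cur
        else:
            run = 1
    return None
```

The step it returns names the real fix, such as a path the agent cannot read. That fix is what the advice above points at instead of a higher cap.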
Storage issues
Postgres disk full
- Check which tables are largest: the knowledge_* and retrieval_traces tables grow with use.
- Lower MEMORY_MAX_PER_USER so maintenance prunes more aggressively, then re-run the maintenance job.
- If you retain every retrieval trace, schedule a truncation after N days. Traces are inspection aids, not durable state.
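For the first and last points, Postgres's pg_total_relation_size reports a table's size including its indexes and TOAST data, and a retention job only needs a cutoff timestamp. The sketch assumes that retrieval_traces has a created_at timestamp column, which this page does not confirm. The %s placeholder is psycopg-style. Run the SQL with psql or your driver of choice.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Ten largest user tables, including indexes and TOAST data.
LARGEST_TABLES_SQL = """
SELECT relname,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;
"""

# Hypothetical column name: adjust created_at to the real schema.
PRUNE_TRACES_SQL = "DELETE FROM retrieval_traces WHERE created_at < %s;"


def retention_cutoff(days: int, now: Optional[datetime] = None) -> datetime:
    """Timestamp before which traces are eligible for deletion."""
    if days < 1:
        raise ValueError("retention must be at least one day")
    if now is None:
        now = datetime.now(timezone.utc)
    return now - timedelta(days=days)
```

With psycopg this would be cur.execute(PRUNE_TRACES_SQL, (retention_cutoff(14),)). A plain DELETE frees space for reuse inside the table rather than returning it to the filesystem. VACUUM FULL reclaims that space, but it locks the table while it runs.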
Object storage over quota
- Artifacts accumulate over time. Audit artifact_manifests and drop entries older than your retention window.
- If you're on AWS S3, enable lifecycle rules on the Koda bucket.
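An S3 lifecycle rule is a small JSON document. The sketch builds one in the shape accepted by boto3's put_bucket_lifecycle_configuration and by aws s3api put-bucket-lifecycle-configuration. The bucket name, rule ID and prefix are placeholders. Whether Koda keeps artifacts under a single prefix is an assumption.

```python
import json


def expiration_rule(days: int, prefix: str = "", rule_id: str = "koda-artifact-retention") -> dict:
    """Lifecycle configuration that expires objects under `prefix` after `days` days."""
    if days < 1:
        raise ValueError("S3 expiration must be at least 1 day")
    return {
        "Rules": [
            {
                "ID": rule_id,
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


if __name__ == "__main__":
    config = expiration_rule(90)
    print(json.dumps(config, indent=2))
    # With boto3 installed and credentials configured ("koda-artifacts" is a placeholder):
    # boto3.client("s3").put_bucket_lifecycle_configuration(
    #     Bucket="koda-artifacts", LifecycleConfiguration=config)
```

It is probably worth aligning the expiration window with your artifact_manifests cleanup, so manifest rows are not left pointing at expired objects.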
Upgrade went wrong
koda update rolls back automatically if the post-update doctor run is red. If that didn't happen (rare, and usually because an external dependency changed), roll back manually:
```shell
koda down
# Restore the previous .env from backup if you changed it.
koda install --manifest ~/.koda/previous-release.yaml
```

Getting help
- GitHub Discussions. The Discussions tab on the main repo is the right place for questions.
- GitHub Issues. Use these for reproducible bugs, and include the output of koda doctor --json plus relevant log snippets.
- Security reports. See SECURITY.md in the repo. Do not open public issues for vulnerabilities.
Next steps
- Monitoring — probes and checks that catch issues before they page you.
- Security — the full hardening and audit model.