Monitoring Model

Clawdie monitoring is split into distinct layers so “process is running” is not confused with “system is healthy”.

The running process writes state into:

  • data/health/host.json
  • data/health/pipeline.json
  • data/health/jail.json

These health files are inspected by just doctor.

host.json tracks: process startup, database init, channel connection, IPC watcher, message loop heartbeat, scheduler heartbeat.

Answers: is the main process alive and making progress?

pipeline.json tracks: Telegram connected, last inbound received, last message routed, last jailed run started/finished, last reply sent, last pipeline failure.

Answers: are messages actually flowing?

jail.json tracks: last jail run started/finished, last success, last failure, last failure code and message, last duration.

Answers: is the isolated executor working?

The Watchdog class in src/watchdog.ts runs two timers:

  • health timer (60s) — reads free memory (sysctl vm.stats.vm.v_free_count) and throttles queue concurrency to 1 when free memory falls below the threshold
  • control plane timer (5 min) — runs runControlPlaneChecks(), stores the latest ControlPlaneReport
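A minimal sketch of the health timer's memory probe, assuming FreeBSD sysctls; the 256 MB threshold is illustrative (the real value lives in the watchdog configuration):

```shell
# Free-memory probe sketch (FreeBSD sysctls; falls back to 0/4096 elsewhere).
PAGES=$(sysctl -n vm.stats.vm.v_free_count 2>/dev/null || echo 0)
PAGE_SIZE=$(sysctl -n hw.pagesize 2>/dev/null || echo 4096)
FREE_MB=$(( PAGES * PAGE_SIZE / 1048576 ))
THRESHOLD_MB=256   # assumption: the actual threshold comes from watchdog config
if [ "$FREE_MB" -lt "$THRESHOLD_MB" ]; then
    echo "low memory: throttling jail queue concurrency to 1"
fi
```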

The watchdog listens for IPC on a Unix socket at ${AGENT_TMP_DIR}/ipc/{agent}-watchdog.sock; set AGENT_TMP_DIR to relocate it. Send {"cmd":"status"}\n to get the current mode, throttle state, memory, active/queued jails, and the latest control plane report.
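As a sketch, the status query can be issued with nc; the agent name clawdie and the /tmp/clawdie default directory are assumptions here:

```shell
# Query the watchdog over its Unix socket (agent name and default dir assumed).
SOCK="${AGENT_TMP_DIR:-/tmp/clawdie}/ipc/clawdie-watchdog.sock"
if [ -S "$SOCK" ]; then
    printf '{"cmd":"status"}\n' | nc -U -w 2 "$SOCK"
else
    echo "no watchdog socket at $SOCK"
fi
```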

src/controlplane.ts checks service jails and system state. Runs at startup (before initDatabase()) and every 5 minutes via the watchdog timer.

Checks:

Check                Method                 Fix if failing
hostd reachable      TCP connect to socket  none (can’t self-fix)
{agent}-db running   jls -q name            hostd('bastille-start')
{agent}-git running  jls -q name            hostd('bastille-start')
{agent}-cms running  jls -q name            hostd('bastille-start')
PF enabled           pfctl -s info          hostd('pf-enable')

Severity:

  • Jail failures → fail (a down db jail means the agent cannot start)
  • PF disabled, hostd unreachable → warn (agent can run, degraded)
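A sketch of the jail-running check from the table above, assuming the literal jail name clawdie-db:

```shell
# Is a named jail running? Mirrors the jls -q name check described above.
jail_running() {
    jls -q name 2>/dev/null | grep -qx "$1"
}
if jail_running "clawdie-db"; then
    echo "clawdie-db: ok"
else
    echo "clawdie-db: fail (severity: cannot start without the db jail)"
fi
```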

When the metrics server is enabled, it exposes two lightweight HTTP endpoints:

  • /metrics — Prometheus text-format counters and gauges for scraping
  • /healthz — minimal liveness probe that returns ok

Use them for different purposes:

  • /healthz answers: is the metrics listener up?
  • /metrics answers: what counters and gauges is the runtime exposing?
  • just doctor answers: is the system actually healthy?

Do not treat /healthz as a replacement for just doctor. A live metrics listener does not guarantee that the pipeline, jails, control plane, or service checks are healthy.
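A sketch of probing both endpoints from the host; the port 9464 is an assumption (set METRICS_PORT to whatever your deployment uses):

```shell
# Probe the metrics listener; failures are reported, not fatal.
PORT="${METRICS_PORT:-9464}"
curl -fsS --max-time 2 "http://127.0.0.1:${PORT}/healthz" \
    || echo "healthz: listener not reachable on :${PORT}"
curl -fsS --max-time 2 "http://127.0.0.1:${PORT}/metrics" \
    || echo "metrics: listener not reachable on :${PORT}"
```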

just doctor

Reports (in order):

  • overall status
  • latest host heartbeats
  • latest Telegram and pipeline activity
  • latest jail success/failure
  • Stripe status
  • watchdog mode, memory, active/queued jails
  • control plane check results per service
  • split-brain DB availability and row counts

Exit codes:

  • STATUS: ok → exit 0
  • STATUS: warn → exit 0 (degraded but running)
  • STATUS: error → exit 1 (action required)

Note: a missing built-in knowledge artifact is expected during development and is reported as warn (built-in knowledge unavailable) rather than failing the entire health check.
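The exit-code contract above can be expressed as a small sketch, useful when wiring just doctor into cron or CI (the function name is illustrative):

```shell
# Map doctor STATUS values to exit codes (ok/warn pass, error fails).
doctor_exit() {
    case "$1" in
        ok|warn) return 0 ;;   # warn: degraded but running
        *)       return 1 ;;   # error: action required
    esac
}
for s in ok warn error; do
    if doctor_exit "$s"; then echo "$s -> exit 0"; else echo "$s -> exit 1"; fi
done
```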

Session safety:

  • Pi sessions (groups/<group>/sessions/*.jsonl) grow over time. When a session file gets too large, the model can hit a context window limit and fail to respond.
  • Use AGENT_SESSION_MAX_BYTES to cap session size; the runtime will start a fresh session automatically when exceeded (silent by default).
  • Additional prompt guardrails limit resource abuse:
    • AGENT_MAX_INBOUND_CHARS — truncates inbound messages exceeding this length
    • AGENT_MAX_BACKLOG_MESSAGES — caps the number of historical messages included in a prompt
    • AGENT_MAX_BACKLOG_CHARS — caps total character count of the backlog
    • AGENT_MAX_PROMPT_CHARS — hard limit on the final assembled prompt size
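An illustrative environment block for the guardrails above; the values are examples, not shipped defaults:

```shell
# Illustrative guardrail values (tune per deployment; these are not defaults).
export AGENT_SESSION_MAX_BYTES=1048576    # ~1 MiB before a fresh session starts
export AGENT_MAX_INBOUND_CHARS=8000       # truncate longer inbound messages
export AGENT_MAX_BACKLOG_MESSAGES=40      # history items included per prompt
export AGENT_MAX_BACKLOG_CHARS=24000      # total backlog characters
export AGENT_MAX_PROMPT_CHARS=60000       # hard cap on the assembled prompt
```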

Timestamps are printed in European format (DD.mmm.YYYY HH:MM).
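That format corresponds to this strftime pattern (a sketch; the runtime's own formatter may differ):

```shell
# DD.mmm.YYYY HH:MM, with an abbreviated month name (C locale assumed for %b).
LC_TIME=C date +"%d.%b.%Y %H:%M"
```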

A running PID can hide real failures:

  • Telegram intake dead
  • scheduler stalled
  • jail execution failing
  • service jails down
  • PF disabled (no public web traffic)

Each layer catches a different class of failure.

Bastille monitor and Clawdie doctor solve different problems:

  • Bastille monitor — jail service watchdog at the OS level
  • Clawdie doctor — application, pipeline, and control plane health

Use both; don’t confuse them.

Beyond the runtime health files above, the agent exposes a family of structured reports for operator inspection on demand. Each report has a matching Telegram slash command and follows the same Observed / Interpretation / Operator Notes template — see Structured Reports for the design and the full list.

Report       Command         What it answers
System       /report         Are services + jails + controlplane healthy?
Disk         /disk           What is consuming ZFS pool space and snapshots?
Tasks        /tasks          What is in the controlplane task queue?
Budget       /budgetreport   Token budgets and burn analytics
Publish      /publishreport  Tenant publish/content state
Test/Build   /testreport     Was the last build/test run green?

/testreport is fed by scripts/write-test-build-status.sh, not by the running process — invoke the wrapper from CI, a hook, or by hand to refresh its status files. The pre-commit and post-commit hooks run it automatically so each commit message footer reflects what was passing at commit time.

For the full operator command reference (status, sessions, admin actions, free-text routing), see Operator Commands.

When the configured LLM provider is in cooldown (e.g. zAI usage cap), the agent transparently routes to the operator-defined fallback. Active cooldowns are visible in /policy and as structured logger.warn lines on every fallback-active run. See Provider Fallback for configuration, manual release (/clearcooldown), and the configured / effective / actual observability triple.