Monitoring Model

Clawdie monitoring is split into distinct layers so “process is running” is not confused with “system is healthy”.

The running process writes state into:

  • data/health/host.json
  • data/health/pipeline.json
  • data/health/jail.json

These health files are inspected by just doctor.

host.json tracks: process startup, database init, channel connection, IPC watcher, message loop heartbeat, scheduler heartbeat.

Answers: is the main process alive and making progress?

pipeline.json tracks: Telegram connected, last inbound received, last message routed, last jailed run started/finished, last reply sent, last pipeline failure.

Answers: are messages actually flowing?

jail.json tracks: last jail run started/finished, last success, last failure, last failure code and message, last duration.

Answers: is the isolated executor working?

The Watchdog class in src/watchdog.ts runs two timers:

  • health timer (60s) — reads free memory (sysctl vm.stats.vm.v_free_count) and throttles queue concurrency to 1 when free memory falls below the threshold
  • control plane timer (5 min) — runs runControlPlaneChecks(), stores the latest ControlPlaneReport
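A minimal sketch of the health timer's memory probe, assuming FreeBSD sysctls; the 256 MB threshold is illustrative (the real value lives in the watchdog configuration):

```shell
# Free-memory probe sketch (FreeBSD sysctls; falls back to 0/4096 elsewhere).
PAGES=$(sysctl -n vm.stats.vm.v_free_count 2>/dev/null || echo 0)
PAGE_SIZE=$(sysctl -n hw.pagesize 2>/dev/null || echo 4096)
FREE_MB=$(( PAGES * PAGE_SIZE / 1048576 ))
THRESHOLD_MB=256   # assumption: the actual threshold comes from watchdog config
if [ "$FREE_MB" -lt "$THRESHOLD_MB" ]; then
    echo "low memory: throttling jail queue concurrency to 1"
fi
```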

The watchdog listens for IPC on a Unix socket at ${AGENT_TMP_DIR}/ipc/{agent}-watchdog.sock; set AGENT_TMP_DIR to relocate it. Send {"cmd":"status"}\n to get the current mode, throttle state, memory, active/queued jails, and the latest control plane report.
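As a sketch, the status query can be issued with nc; the agent name clawdie and the /tmp/clawdie default directory are assumptions here:

```shell
# Query the watchdog over its Unix socket (agent name and default dir assumed).
SOCK="${AGENT_TMP_DIR:-/tmp/clawdie}/ipc/clawdie-watchdog.sock"
if [ -S "$SOCK" ]; then
    printf '{"cmd":"status"}\n' | nc -U -w 2 "$SOCK"
else
    echo "no watchdog socket at $SOCK"
fi
```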

src/controlplane.ts checks service jails and system state. Runs at startup (before initDatabase()) and every 5 minutes via the watchdog timer.

Checks:

Check                Method                 Fix if failing
hostd reachable      TCP connect to socket  none (can’t self-fix)
{agent}-db running   jls -q name            hostd('bastille-start')
{agent}-git running  jls -q name            hostd('bastille-start')
{agent}-cms running  jls -q name            hostd('bastille-start')
PF enabled           pfctl -s info          hostd('pf-enable')

Severity:

  • Jail failures → fail (a down db jail means the agent cannot start)
  • PF disabled, hostd unreachable → warn (agent can run, degraded)
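A sketch of the jail-running check from the table above, assuming the literal jail name clawdie-db:

```shell
# Is a named jail running? Mirrors the jls -q name check described above.
jail_running() {
    jls -q name 2>/dev/null | grep -qx "$1"
}
if jail_running "clawdie-db"; then
    echo "clawdie-db: ok"
else
    echo "clawdie-db: fail (severity: cannot start without the db jail)"
fi
```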

When the metrics server is enabled, it exposes two lightweight HTTP endpoints:

  • /metrics — Prometheus text-format counters and gauges for scraping
  • /healthz — minimal liveness probe that returns ok

Use them for different purposes:

  • /healthz answers: is the metrics listener up?
  • /metrics answers: what counters and gauges is the runtime exposing?
  • just doctor answers: is the system actually healthy?

Do not treat /healthz as a replacement for just doctor. A live metrics listener does not guarantee that the pipeline, jails, control plane, or service checks are healthy.
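A sketch of probing both endpoints from the host; the port 9464 is an assumption (set METRICS_PORT to whatever your deployment uses):

```shell
# Probe the metrics listener; failures are reported, not fatal.
PORT="${METRICS_PORT:-9464}"
curl -fsS --max-time 2 "http://127.0.0.1:${PORT}/healthz" \
    || echo "healthz: listener not reachable on :${PORT}"
curl -fsS --max-time 2 "http://127.0.0.1:${PORT}/metrics" \
    || echo "metrics: listener not reachable on :${PORT}"
```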

just doctor

Reports (in order):

  • overall status
  • latest host heartbeats
  • latest Telegram and pipeline activity
  • latest jail success/failure
  • Stripe status
  • watchdog mode, memory, active/queued jails
  • control plane check results per service
  • split-brain DB availability and row counts

Exit codes:

  • STATUS: ok → exit 0
  • STATUS: warn → exit 0 (degraded but running)
  • STATUS: error → exit 1 (action required)

Note: a missing built-in knowledge artifact is expected during development and is reported as warn (built-in knowledge unavailable) rather than failing the entire health check.
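The exit-code contract above can be expressed as a small sketch, useful when wiring just doctor into cron or CI (the function name is illustrative):

```shell
# Map doctor STATUS values to exit codes (ok/warn pass, error fails).
doctor_exit() {
    case "$1" in
        ok|warn) return 0 ;;   # warn: degraded but running
        *)       return 1 ;;   # error: action required
    esac
}
for s in ok warn error; do
    if doctor_exit "$s"; then echo "$s -> exit 0"; else echo "$s -> exit 1"; fi
done
```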

Session safety:

  • Pi sessions (groups/<group>/sessions/*.jsonl) grow over time. When a session file gets too large, the model can hit a context window limit and fail to respond.
  • Use AGENT_SESSION_MAX_BYTES to cap session size; the runtime will start a fresh session automatically when exceeded (silent by default).
  • Additional prompt guardrails limit resource abuse:
    • AGENT_MAX_INBOUND_CHARS — truncates inbound messages exceeding this length
    • AGENT_MAX_BACKLOG_MESSAGES — caps the number of historical messages included in a prompt
    • AGENT_MAX_BACKLOG_CHARS — caps total character count of the backlog
    • AGENT_MAX_PROMPT_CHARS — hard limit on the final assembled prompt size
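An illustrative environment block for the guardrails above; the values are examples, not shipped defaults:

```shell
# Illustrative guardrail values (tune per deployment; these are not defaults).
export AGENT_SESSION_MAX_BYTES=1048576    # ~1 MiB before a fresh session starts
export AGENT_MAX_INBOUND_CHARS=8000       # truncate longer inbound messages
export AGENT_MAX_BACKLOG_MESSAGES=40      # history items included per prompt
export AGENT_MAX_BACKLOG_CHARS=24000      # total backlog characters
export AGENT_MAX_PROMPT_CHARS=60000       # hard cap on the assembled prompt
```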

Timestamps are printed in European format (DD.mmm.YYYY HH:MM).
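That format corresponds to this strftime pattern (a sketch; the runtime's own formatter may differ):

```shell
# DD.mmm.YYYY HH:MM, with an abbreviated month name (C locale assumed for %b).
LC_TIME=C date +"%d.%b.%Y %H:%M"
```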

A running PID can hide real failures:

  • Telegram intake dead
  • scheduler stalled
  • jail execution failing
  • service jails down
  • PF disabled (no public web traffic)

Each layer catches a different class of failure.

Bastille monitor and Clawdie doctor solve different problems:

  • Bastille monitor — jail service watchdog at the OS level
  • Clawdie doctor — application, pipeline, and control plane health

Use both; don’t confuse them.

Beyond the runtime health files above, the agent exposes a family of structured reports for operator inspection on demand. Each report has a matching Telegram slash command and follows the same Observed / Interpretation / Operator Notes template — see Structured Reports for the design and the full list.

Report       Command         What it answers
System       /report         Are services + jails + controlplane healthy?
Disk         /disk           What is consuming ZFS pool space and snapshots?
Tasks        /tasks          What is in the controlplane task queue?
Budget       /budgetreport   Token budgets and burn analytics
Publish      /publishreport  Tenant publish/content state
Test/Build   /testreport     Was the last build/test run green?

/testreport is fed by scripts/write-test-build-status.sh, not by the running process — invoke the wrapper from CI, a hook, or by hand to refresh its status files. The pre-commit and post-commit hooks run it automatically so each commit message footer reflects what was passing at commit time.

For the full operator command reference (status, sessions, admin actions, free-text routing), see Operator Commands.

When the configured LLM provider is in cooldown (e.g. zAI usage cap), the agent transparently routes to the operator-defined fallback. Active cooldowns are visible in /policy and as structured logger.warn lines on every fallback-active run. See Provider Fallback for configuration, manual release (/clearcooldown), and the configured / effective / actual observability triple.