Monitoring Model
Clawdie monitoring is split into distinct layers so “process is running” is not confused with “system is healthy”.
Runtime Health Files
The running process writes state into:
- data/health/host.json
- data/health/pipeline.json
- data/health/jail.json
These files are inspected by just doctor.
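As a rough illustration of how an external script could look at the same files, the sketch below reads all three and reports which ones are missing or unparseable. The JSON schema of these files is not described here, so the payload is left as `unknown`; `readHealthFiles` and `HealthFileResult` are hypothetical names, not part of Clawdie.

```typescript
import { readFileSync } from "node:fs";
import { join } from "node:path";

// File names come from the list above; their contents are not documented
// here, so parsed data is typed as `unknown`.
const HEALTH_FILES = ["host.json", "pipeline.json", "jail.json"] as const;

type HealthFileResult =
  | { file: string; ok: true; data: unknown }
  | { file: string; ok: false; error: string };

// Read every health file under `dir`, reporting missing or unparseable
// files instead of throwing.
function readHealthFiles(dir: string = join("data", "health")): HealthFileResult[] {
  return HEALTH_FILES.map((file): HealthFileResult => {
    try {
      const raw = readFileSync(join(dir, file), "utf8");
      return { file, ok: true, data: JSON.parse(raw) };
    } catch (err) {
      const error = err instanceof Error ? err.message : String(err);
      return { file, ok: false, error };
    }
  });
}

for (const result of readHealthFiles()) {
  console.log(result.file, result.ok ? "readable" : `unreadable: ${result.error}`);
}
```

This only shows whether the files exist and parse; interpreting their fields is the job of just doctor.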
Monitoring Layers
1. Host Health
Tracks: process startup, database init, channel connection, IPC watcher, message loop heartbeat, scheduler heartbeat.
Answers: is the main process alive and making progress?
2. Pipeline Health
Tracks: Telegram connected, last inbound received, last message routed, last jailed run started/finished, last reply sent, last pipeline failure.
Answers: are messages actually flowing?
3. Jail Health (Warden)
Tracks: last jail run started/finished, last success, last failure, last failure code and message, last duration.
Answers: is the isolated executor working?
4. Watchdog
The Watchdog class in src/watchdog.ts runs two timers:
- health timer (60s): reads free memory (sysctl vm.stats.vm.v_free_count) and throttles queue concurrency to 1 if it is below the threshold
- control plane timer (5 min): runs runControlPlaneChecks() and stores the latest ControlPlaneReport
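The health timer's decision can be sketched as follows. On FreeBSD, vm.stats.vm.v_free_count is a page count, so it has to be multiplied by the page size (hw.pagesize) to get bytes. The threshold value and the names `MIN_FREE_BYTES`, `readSysctlNumber`, and `targetConcurrency` are assumptions for illustration; the real logic lives in src/watchdog.ts and may differ.

```typescript
import { execFileSync } from "node:child_process";

// Hypothetical threshold; the actual value is not stated in this document.
const MIN_FREE_BYTES = 512 * 1024 * 1024;

// Read a numeric sysctl value (FreeBSD). `sysctl -n` prints the value only.
function readSysctlNumber(name: string): number {
  return Number(execFileSync("sysctl", ["-n", name], { encoding: "utf8" }).trim());
}

// Drop to a single concurrent job when free memory is below the threshold,
// otherwise keep the normal concurrency.
function targetConcurrency(
  freePages: number,
  pageSize: number,
  normalConcurrency: number,
  minFreeBytes: number = MIN_FREE_BYTES,
): number {
  return freePages * pageSize < minFreeBytes ? 1 : normalConcurrency;
}

// On a FreeBSD host the inputs would come from:
//   readSysctlNumber("vm.stats.vm.v_free_count")
//   readSysctlNumber("hw.pagesize")
console.log(targetConcurrency(100_000, 4096, 4)); // about 410 MB free, below threshold
console.log(targetConcurrency(200_000, 4096, 4)); // about 819 MB free, above threshold
```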
The watchdog listens for IPC on a Unix socket at ${AGENT_TMP_DIR}/ipc/{agent}-watchdog.sock.
Set AGENT_TMP_DIR to relocate the socket if needed. Send {"cmd":"status"}\n to get the current mode, throttle state, memory,
active/queued jails, and the latest control plane report.
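A minimal client for the status command might look like the sketch below. The socket path layout and the request payload come from the text above; the reply framing (one JSON document terminated by a newline) is an assumption, and `watchdogSocketPath`, `encodeCommand`, and `queryWatchdog` are hypothetical helper names.

```typescript
import { createConnection } from "node:net";
import { posix } from "node:path";

// Build ${AGENT_TMP_DIR}/ipc/{agent}-watchdog.sock from its parts.
function watchdogSocketPath(tmpDir: string, agent: string): string {
  return posix.join(tmpDir, "ipc", `${agent}-watchdog.sock`);
}

// Requests are a JSON object followed by a newline.
function encodeCommand(cmd: string): string {
  return JSON.stringify({ cmd }) + "\n";
}

// Send {"cmd":"status"} and resolve with the parsed reply.
// Assumption: the reply is a single newline-terminated JSON document.
function queryWatchdog(socketPath: string, timeoutMs = 2000): Promise<unknown> {
  return new Promise((resolve, reject) => {
    const socket = createConnection(socketPath);
    let buffer = "";
    socket.setEncoding("utf8");
    socket.setTimeout(timeoutMs, () => socket.destroy(new Error("watchdog status timed out")));
    socket.on("connect", () => socket.write(encodeCommand("status")));
    socket.on("data", (chunk) => {
      buffer += String(chunk);
      const newline = buffer.indexOf("\n");
      if (newline === -1) return;
      try {
        resolve(JSON.parse(buffer.slice(0, newline)));
      } catch (err) {
        reject(err);
      }
      socket.end();
    });
    socket.on("error", reject);
    // No effect if the promise has already settled.
    socket.on("close", () => reject(new Error("socket closed before a reply arrived")));
  });
}
```

A caller would combine the pieces as `queryWatchdog(watchdogSocketPath(tmpDir, agent))`, where `tmpDir` is the value of AGENT_TMP_DIR and `agent` is the agent name.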
5. Control Plane
src/controlplane.ts checks service jails and system state. It runs at startup
(before initDatabase()) and every 5 minutes via the watchdog timer.
Checks:
| Check | Method | Fix if failing |
|---|---|---|
| hostd reachable | TCP connect to socket | none (can’t self-fix) |
| {agent}-db running | jls -q name | hostd('bastille-start') |
| {agent}-git running | jls -q name | hostd('bastille-start') |
| {agent}-cms running | jls -q name | hostd('bastille-start') |
| PF enabled | pfctl -s info | hostd('pf-enable') |
Severity:
- Jail failures → fail (db down = cannot start)
- PF disabled, hostd unreachable → warn (agent can run, degraded)
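The severity rules above can be expressed as a small mapping. This is a sketch under assumptions: the types `CheckKind` and `CheckResult` and the function names are hypothetical, and the actual shape of ControlPlaneReport in src/controlplane.ts is not shown in this document.

```typescript
type Severity = "ok" | "warn" | "fail";
type CheckKind = "hostd" | "jail" | "pf";

interface CheckResult {
  name: string;
  kind: CheckKind;
  passed: boolean;
}

// Jail failures are fatal; PF and hostd failures only degrade the agent.
function severityOf(check: CheckResult): Severity {
  if (check.passed) return "ok";
  return check.kind === "jail" ? "fail" : "warn";
}

// The overall severity is the worst severity of any single check.
function overallSeverity(checks: CheckResult[]): Severity {
  const severities = checks.map(severityOf);
  if (severities.includes("fail")) return "fail";
  if (severities.includes("warn")) return "warn";
  return "ok";
}

const report: CheckResult[] = [
  { name: "hostd reachable", kind: "hostd", passed: true },
  { name: "db jail running", kind: "jail", passed: true },
  { name: "PF enabled", kind: "pf", passed: false },
];
console.log(overallSeverity(report)); // PF disabled alone yields "warn"
```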
Metrics Endpoints
When the metrics server is enabled, it exposes two lightweight HTTP endpoints:
- /metrics: Prometheus text-format counters and gauges for scraping
- /healthz: minimal liveness probe that returns ok
Use them for different purposes:
- /healthz answers: is the metrics listener up?
- /metrics answers: what counters and gauges is the runtime exposing?
- just doctor answers: is the system actually healthy?
Do not treat /healthz as a replacement for just doctor. A live metrics
listener does not guarantee that the pipeline, jails, control plane, or service
checks are healthy.
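To make the distinction concrete, the sketch below works on response bodies that have already been fetched from the two endpoints. `isListenerUp` and `parseMetrics` are hypothetical helpers, and the parser is deliberately simplified: it keeps any label set as part of the key and ignores optional timestamps. The metric names in the sample are made up for illustration.

```typescript
// /healthz is documented to return `ok`; tolerating surrounding whitespace
// is an assumption.
function isListenerUp(healthzBody: string): boolean {
  return healthzBody.trim() === "ok";
}

// Prometheus text format spells infinities as +Inf and -Inf.
function toNumber(raw: string): number {
  if (raw === "+Inf") return Infinity;
  if (raw === "-Inf") return -Infinity;
  return Number(raw);
}

// Parse Prometheus text-format samples into a "name{labels}" -> value map.
// Comment lines (# HELP, # TYPE) and blank lines are skipped.
function parseMetrics(text: string): Map<string, number> {
  const samples = new Map<string, number>();
  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "" || trimmed.startsWith("#")) continue;
    const match = /^([a-zA-Z_:][a-zA-Z0-9_:]*(?:\{.*\})?)\s+(\S+)/.exec(trimmed);
    if (!match) continue;
    const [, key, raw] = match;
    if (key === undefined || raw === undefined) continue;
    samples.set(key, toNumber(raw));
  }
  return samples;
}

// Hypothetical scrape output; real metric names are not listed in this document.
const sample = [
  "# HELP example_messages_total Example counter",
  "# TYPE example_messages_total counter",
  "example_messages_total 42",
  'example_queue_depth{queue="jail"} 1.5 1700000000',
].join("\n");

console.log(isListenerUp("ok\n"), parseMetrics(sample));
```

Neither helper says anything about pipeline or jail health; that still requires just doctor.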
Doctor Command
Run just doctor. It reports (in order):
- overall status
- latest host heartbeats
- latest Telegram and pipeline activity
- latest jail success/failure
- Stripe status
- watchdog mode, memory, active/queued jails
- control plane check results per service
- split-brain DB availability and row counts
Exit codes:
- STATUS: ok → exit 0
- STATUS: warn → exit 0 (degraded but running)
- STATUS: error → exit 1 (action required)
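For scripts that wrap just doctor, the exit-code rule can be sketched as below. The `STATUS: <value>` text comes from the list above; that it appears on its own line of the output is an assumption, and `parseStatus` and `exitCodeFor` are hypothetical names.

```typescript
type DoctorStatus = "ok" | "warn" | "error";

// Find the first "STATUS: <value>" line in captured doctor output.
function parseStatus(output: string): DoctorStatus | undefined {
  const match = /^STATUS:\s*(ok|warn|error)\s*$/m.exec(output);
  return match ? (match[1] as DoctorStatus) : undefined;
}

// Only `error` is a failing exit code; `warn` means degraded but running.
function exitCodeFor(status: DoctorStatus): number {
  return status === "error" ? 1 : 0;
}

const status = parseStatus("STATUS: warn\nhost heartbeat: recent\n");
console.log(status, status === undefined ? "unknown" : exitCodeFor(status));
```

Because warn also exits 0, a wrapper that wants to alert on degraded states has to inspect the status text rather than the exit code.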
Note: a missing built-in knowledge artifact is expected during development and is reported as warn (built-in knowledge unavailable) rather than failing the entire health check.
Session safety:
- Pi sessions (groups/<group>/sessions/*.jsonl) grow over time. When a session file gets too large, the model can hit a context window limit and fail to respond.
- Use AGENT_SESSION_MAX_BYTES to cap session size; the runtime will start a fresh session automatically when exceeded (silent by default).
- Additional prompt guardrails limit resource abuse:
  - AGENT_MAX_INBOUND_CHARS: truncates inbound messages exceeding this length
  - AGENT_MAX_BACKLOG_MESSAGES: caps the number of historical messages included in a prompt
  - AGENT_MAX_BACKLOG_CHARS: caps total character count of the backlog
  - AGENT_MAX_PROMPT_CHARS: hard limit on the final assembled prompt size
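One way these limits could compose is sketched below. Only the variable names and their one-line descriptions come from the list above; the order of application, keeping the newest backlog entries, and trimming the front of an oversized prompt are all assumptions, and `PromptLimits`, `sessionNeedsRotation`, and `buildPrompt` are hypothetical names.

```typescript
interface PromptLimits {
  maxInboundChars: number;    // AGENT_MAX_INBOUND_CHARS
  maxBacklogMessages: number; // AGENT_MAX_BACKLOG_MESSAGES
  maxBacklogChars: number;    // AGENT_MAX_BACKLOG_CHARS
  maxPromptChars: number;     // AGENT_MAX_PROMPT_CHARS
}

// A fresh session is started once the file exceeds AGENT_SESSION_MAX_BYTES.
function sessionNeedsRotation(fileSizeBytes: number, maxBytes: number): boolean {
  return fileSizeBytes > maxBytes;
}

// Assemble a prompt from backlog plus the inbound message, applying each cap.
function buildPrompt(inbound: string, backlog: string[], limits: PromptLimits): string {
  // 1. Truncate the inbound message.
  const message = inbound.slice(0, limits.maxInboundChars);
  // 2. Keep at most N of the most recent backlog messages (assumed: newest win).
  const recent = backlog.slice(Math.max(0, backlog.length - limits.maxBacklogMessages));
  // 3. Walk from newest to oldest until the character budget is spent.
  const kept: string[] = [];
  let used = 0;
  for (let i = recent.length - 1; i >= 0; i--) {
    const entry = recent[i];
    if (entry === undefined || used + entry.length > limits.maxBacklogChars) break;
    kept.unshift(entry);
    used += entry.length;
  }
  // 4. Hard limit: keep the tail so the inbound message survives (assumption).
  const prompt = [...kept, message].join("\n");
  return prompt.length > limits.maxPromptChars
    ? prompt.slice(prompt.length - limits.maxPromptChars)
    : prompt;
}

const limits: PromptLimits = {
  maxInboundChars: 5,
  maxBacklogMessages: 2,
  maxBacklogChars: 6,
  maxPromptChars: 100,
};
console.log(JSON.stringify(buildPrompt("hello world", ["aaaa", "bbbb", "cc"], limits)));
```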
Timestamps are printed in European format (DD.mmm.YYYY HH:MM).
Why This Split Exists
A running PID can hide real failures:
- Telegram intake dead
- scheduler stalled
- jail execution failing
- service jails down
- PF disabled (no public web traffic)
Each layer catches a different class of failure.
Bastille’s Role
Bastille monitor and Clawdie doctor solve different problems:
- Bastille monitor — jail service watchdog at the OS level
- Clawdie doctor — application, pipeline, and control plane health
Use both; don’t confuse them.
Operator-Facing Reports
Beyond the runtime health files above, the agent exposes a family of
structured reports for operator inspection on demand. Each report has a
matching Telegram slash command and follows the same Observed /
Interpretation / Operator Notes template — see
Structured Reports for the design and the full list.
| Report | Command | What it answers |
|---|---|---|
| System | /report | Are services + jails + controlplane healthy? |
| Disk | /disk | What is consuming ZFS pool space and snapshots? |
| Tasks | /tasks | What is in the controlplane task queue? |
| Budget | /budgetreport | Token budgets and burn analytics |
| Publish | /publishreport | Tenant publish/content state |
| Test/Build | /testreport | Was the last build/test run green? |
/testreport is fed by scripts/write-test-build-status.sh, not by the
running process — invoke the wrapper from CI, a hook, or by hand to refresh
its status files. The pre-commit and post-commit hooks run it automatically
so each commit message footer reflects what was passing at commit time.
For the full operator command reference (status, sessions, admin actions, free-text routing), see Operator Commands.
Provider Fallback Health
When the configured LLM provider is in cooldown (e.g. zAI usage cap), the
agent transparently routes to the operator-defined fallback. Active
cooldowns are visible in /policy and as structured logger.warn lines on
every fallback-active run. See Provider Fallback for
configuration, manual release (/clearcooldown), and the
configured / effective / actual observability triple.