Provider Fallback

When the primary LLM provider returns a “usage cap reached” error, the agent keeps replying instead of looping on 429s — it transparently switches to a configured fallback until the cap window passes, then automatically returns to the primary.

  • Some LLM providers (notably zAI) impose rolling 5-hour usage caps. When you hit one, every request fails until the reset.
  • Without fallback, the bot would retry the capped provider on every message and stay broken for hours.
  • Fallback puts the capped provider in a “cooldown” until the reset timestamp, routes new runs through your operator-chosen alternative (a paid stable OpenRouter model is the recommended target — see “Choosing a fallback model” below), and resumes the primary the moment the cooldown expires.
  • The cooldown survives a process restart so a quick service bounce inside the cap window does not re-trip the cap.

Set in .env:

Variable                                Required                         Example
LLM_FALLBACK_PROVIDER                   yes (when fallback is desired)   openrouter
LLM_FALLBACK_MODEL                      recommended                      openai/o3 (paid, stable)
LLM_FALLBACK_DEFAULT_COOLDOWN_SECONDS   optional (default 3600)          1800
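
A complete example (the key value is a placeholder; the key requirement is explained below):

LLM_FALLBACK_PROVIDER=openrouter
LLM_FALLBACK_MODEL=openai/o3
LLM_FALLBACK_DEFAULT_COOLDOWN_SECONDS=1800
OPENROUTER_API_KEY=<your OpenRouter key>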

The default cooldown is used only when the cap message has no parseable reset timestamp. Real zAI cap errors include the reset timestamp, so the cooldown matches the reset exactly.

The fallback provider’s API key (OPENROUTER_API_KEY for openrouter, ZAI_API_KEY for zai, etc.) must also be set. The agent verifies this at startup and warns in the logs if it is missing — the warning is the only notice you will get before the fallback fails for real.

The point of the fallback is to keep the bot replying when the primary provider is unavailable. That means the fallback model must itself be reliable enough for interactive chat. Two specific cautions:

  • Free-tier models are not appropriate as fallback targets. They rate-limit aggressively and unpredictably, and the failure mode is silent (the request returns nothing, the operator sees no reply). We observed this in production with meta-llama/llama-3.3-70b-instruct:free — during a zAI cap, the fallback to the free Llama model produced silent no-reply situations indistinguishable from “agent is dead.” Use a paid OpenRouter model (e.g. openai/o3, anthropic/claude-3.5-sonnet) or another stable provider entirely.
  • The fallback should match the primary’s capability class where possible — a fallback that cannot use tools or has a much smaller context window will degrade chat behavior in ways that look like regressions but are actually the model swap.

Free-tier models can still be useful as ad-hoc per-chat overrides via /model for low-priority chats, where a silent no-reply is recoverable. They should not be the system default fallback.

Compaction follows the same routing by default: when AGENT_COMPACTION_PROVIDER / AGENT_COMPACTION_MODEL are unset, compaction uses whatever the chat runtime is using, including the fallback when a cooldown is active. This is the validated default.

If you pin compaction to a separate provider, it stays on that provider regardless of chat fallback state. Useful when you want compaction to keep using a cheap stable model while chat may swap to a more expensive fallback, or vice versa. The trade-off is one more API key to manage and one more cost surface to monitor.
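
For example, to pin compaction to a cheap stable model while chat keeps its fallback behavior (the model choice here is illustrative):

AGENT_COMPACTION_PROVIDER=openrouter
AGENT_COMPACTION_MODEL=anthropic/claude-3.5-sonnet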

What happens when a cap trips:

  1. A run fails with 429 Usage limit reached for 5 hour. Your limit will reset at YYYY-MM-DD HH:MM:SS.
  2. The runner parses the reset timestamp (treated as local time) and stores { provider: 'zai', until: <reset>, reason: <message> } in memory and on disk.
  3. Every subsequent run consults the cooldown map before spawning pi. If the configured provider is in cooldown, the spawn args swap to the fallback provider/model.
  4. The cooldown auto-expires at the reset timestamp. Next run uses the primary again.
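
A condensed TypeScript sketch of steps 2 through 4. Only the persisted shape ({ provider, until, reason }), the local-time parsing, and the zAI message wording come from this page; the names onRunFailure, resolveRuntime, and CooldownEntry are hypothetical, and the regex only approximates the production pattern in parseProviderCapError() (see the notes further down).

// Hypothetical sketch; names and signatures are illustrative.
interface CooldownEntry {
  provider: string;
  until: Date;    // reset timestamp parsed from the cap message (local time)
  reason: string; // original error text
}

const cooldowns = new Map<string, CooldownEntry>();

// Step 2: match the zAI cap signature and record the cooldown.
function onRunFailure(provider: string, message: string): void {
  const m = message.match(
    /Usage limit reached.*?reset at (\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})/,
  );
  if (!m) return; // generic 429s and transport errors never trip a cooldown
  // A space-separated stamp parses as local time in Node, matching step 2.
  cooldowns.set(provider, { provider, until: new Date(m[1]), reason: message });
  // ...then persist to provider-cooldowns.json so a restart keeps the cooldown
}

// Steps 3 and 4: consult the map before each spawn; expired entries fall away.
function resolveRuntime(provider: string, model: string) {
  const entry = cooldowns.get(provider);
  if (entry && entry.until > new Date()) {
    return {
      provider: process.env.LLM_FALLBACK_PROVIDER ?? provider, // pass through if unset
      model: process.env.LLM_FALLBACK_MODEL ?? model,
    };
  }
  cooldowns.delete(provider);
  return { provider, model };
}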

The cooldown file now resolves with the same project-local precedence used by other operator artifacts:

  • AGENT_STATUS_DIR/provider-cooldowns.json
  • otherwise CLAWDIE_VAR_DIR/provider-cooldowns.json (legacy compatibility)
  • otherwise repo-local tmp/state/provider-cooldowns.json

Expired entries are dropped on load.
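
A sketch of the resolution order and load-time expiry, assuming AGENT_STATUS_DIR and CLAWDIE_VAR_DIR are environment variables and the file maps provider name to entry; the helper names are hypothetical.

import fs from "node:fs";
import path from "node:path";

function cooldownFilePath(): string {
  const base =
    process.env.AGENT_STATUS_DIR ??
    process.env.CLAWDIE_VAR_DIR ?? // legacy compatibility
    path.join(process.cwd(), "tmp", "state");
  return path.join(base, "provider-cooldowns.json");
}

function loadCooldowns(): Record<string, { until: string }> {
  let entries: Record<string, { until: string }>;
  try {
    entries = JSON.parse(fs.readFileSync(cooldownFilePath(), "utf8"));
  } catch {
    return {}; // a missing or unreadable file means no active cooldowns
  }
  // Expired entries are dropped on load.
  const now = new Date();
  for (const [provider, entry] of Object.entries(entries)) {
    if (new Date(entry.until) <= now) delete entries[provider];
  }
  return entries;
}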

/policy shows active cooldowns under the runtime line:

Default runtime: zai / glm-4.6
Provider cooldown: zai until 2026-04-25T19:00:59 → fallback openrouter/openai/o3

When no cooldowns are active, the line is omitted — runtime looks normal.

Logs include structured warnings on every fallback-active run:

{ originalProvider: 'zai', fallbackProvider: 'openrouter', cooldownUntil: '...' } Provider fallback active — preferred provider is in cooldown

And on the run that trips the cooldown:

{ provider: 'zai', until: '2026-04-25T19:00:59', reason: '429 Usage limit reached; resets ...' } Provider cap detected — marking cooldown

If you know the cap was lifted early or want to retry the primary before the parsed reset time, clear the cooldown manually:

/clearcooldown # lists active cooldowns and prints usage
/clearcooldown zai # clears one
/clearcooldown all # clears every active cooldown

The command is admin-only and ops-chat-gated. It persists immediately so the cleared state survives restart.

Every agent activity row now records three provider/model values:

Field           Meaning
configured_*    What .env says (PI_TUI_PROVIDER / PI_TUI_MODEL)
effective_*     What was actually passed to pi (after fallback swap)
actual_*        What pi reports having used (parsed from session JSONL)

When fallback is active, configured_* and effective_* differ. actual_* should match effective_* for a successful run; a divergence suggests pi rewrote the model selection internally.
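
For example, during the zAI cooldown shown above, a row would read (exact column names beyond the three prefixes are assumptions):

{ configured_provider: 'zai', configured_model: 'glm-4.6',
  effective_provider: 'openrouter', effective_model: 'openai/o3',
  actual_provider: 'openrouter', actual_model: 'openai/o3' }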

  • Per-chat overrides (group.jailConfig.provider / .model) are not touched by the cooldown layer. If you have explicitly set a chat to a specific provider, only that provider’s cooldowns affect it.
  • Cap detection is conservative — the parser only matches the specific zAI cap signature, not generic 429s, transport errors, or rate-limit responses from other providers. This is intentional to avoid false positives. If you need the same behavior for another provider, the pattern lives in parseProviderCapError() in src/provider-fallback.ts.

If a primary provider hits its cap and LLM_FALLBACK_PROVIDER is unset:

  • The cooldown is still tracked.
  • Runs continue to use the primary and continue to fail until reset.
  • Logs include a clear warning: Provider in cooldown but no fallback configured; passing through.
  • /policy will show the cooldown line without a fallback target.

This is intentional — the fallback is opt-in. Without it, you fail visibly rather than silently routing to a wrong provider.