Health Engine

The health engine gives Mudabbir self-awareness of its own system state. It validates configuration, tests LLM connectivity, persists errors to disk, and injects diagnostic context into the agent’s system prompt so it can help users troubleshoot problems.

Overview

Here’s the problem the health engine solves: you start a Deep Work project, it fails halfway through because your API key expired, the error flashes as a toast notification, you miss it, and now there’s no way to find out what went wrong. You ask the agent “what happened?” and it has no idea — the error vanished with the page refresh.

The health engine fixes this by:

Running checks on startup and every 5 minutes to catch config/connectivity issues early
Persisting every error to an append-only log on disk that survives restarts
Injecting health state into the agent’s system prompt so it knows about problems before you ask
Giving the agent diagnostic tools so it can look up errors and suggest fixes itself

The entire health engine is LLM-independent. All checks are pure Python and never ask the model to generate anything; the single connectivity check only probes the provider's API endpoint. If the LLM is down, the health engine still works. The agent is a consumer of health state, not a producer.

Architecture

┌──────────────────────────────────────────────────────┐
│                    HealthEngine                       │
│                                                       │
│  run_startup_checks()    run_connectivity_checks()   │
│         │                         │                   │
│         ▼                         ▼                   │
│  ┌─────────────┐          ┌─────────────┐            │
│  │  10 Startup  │          │ 1 Connectivity│           │
│  │   Checks     │          │    Check      │           │
│  │  (sync/fast) │          │  (async/5s)   │           │
│  └──────┬──────┘          └──────┬──────┘            │
│         │                         │                   │
│         └───────────┬─────────────┘                   │
│                     ▼                                 │
│           ┌─────────────────┐                         │
│           │ overall_status  │                         │
│           │ healthy/degraded│                         │
│           │ /unhealthy      │                         │
│           └────────┬────────┘                         │
│                    │                                  │
│    ┌───────────────┼───────────────┐                  │
│    ▼               ▼               ▼                  │
│ Dashboard     System Prompt    ErrorStore             │
│ Health Dot    Injection        (JSONL on disk)        │
│ + Modal       (when degraded)                         │
└──────────────────────────────────────────────────────┘

The HealthEngine class lives in mudabbir.health.engine and is accessed as a singleton via get_health_engine(). It orchestrates all checks, computes overall status, delegates error storage to ErrorStore, and provides the prompt injection block.

Health Checks

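The engine runs 11 checks in two groups, 10 startup checks and 1 connectivity check, spread across three categories (config, storage, and connectivity):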

Startup Checks (sync, fast)

These run synchronously at startup and during each heartbeat. They are quick local checks that make no network calls.

Check ID	Name	Category	Severity	What it checks
`config_exists`	Config File	config	warning	`~/.mudabbir/config.json` exists
`config_valid_json`	Config JSON Valid	config	critical	Config file parses as valid JSON
`config_permissions`	Config Permissions	config	warning	File permissions are 600 (Unix only)
`api_key_primary`	Primary API Key	config	critical	API key exists for the selected backend
`api_key_format`	API Key Format	config	warning	API keys match expected prefix patterns (`sk-ant-` for Anthropic, `sk-` for OpenAI)
`backend_deps`	Backend Dependencies	config	critical	Required Python packages are importable for the selected backend
`secrets_encrypted`	Secrets Encrypted	config	warning	`secrets.enc` exists and contains a valid Fernet token
`disk_space`	Disk Space	storage	warning	`~/.mudabbir/` directory is under 500 MB
`audit_log_writable`	Audit Log Writable	storage	warning	`audit.jsonl` can be opened for append
`memory_dir_accessible`	Memory Directory	storage	warning	`~/.mudabbir/memory/` exists and is a directory
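
As an illustration of how small a startup check can be, here is a minimal sketch of `config_permissions`. The function name, the dict-shaped result, and the rule that any group or other permission bit counts as "too open" are assumptions made for this example, not the engine's actual code.

```python
import os
import stat
import sys
import tempfile


def check_config_permissions(path: str) -> dict:
    """Illustrative config_permissions check (hypothetical helper)."""
    result = {
        "check_id": "config_permissions",
        "name": "Config Permissions",
        "category": "config",
        "status": "ok",
        "message": "Config file permissions are restricted to the owner",
        "fix_hint": "",
    }
    if sys.platform == "win32":
        # The check is documented as Unix only.
        result["message"] = "Permission check skipped (Unix only)"
        return result
    mode = stat.S_IMODE(os.stat(path).st_mode)  # keep permission bits only
    if mode & 0o077:  # assumption: any group/other access is too open
        result["status"] = "warning"
        result["message"] = (
            f"Config file permissions too open: {oct(mode)} (should be 600)"
        )
        result["fix_hint"] = f"Run: chmod 600 {path}"
    return result


# Demo on a throwaway file rather than the real ~/.mudabbir/config.json.
with tempfile.TemporaryDirectory() as tmp:
    cfg = os.path.join(tmp, "config.json")
    with open(cfg, "w", encoding="utf-8") as fh:
        fh.write("{}")
    os.chmod(cfg, 0o644)
    too_open = check_config_permissions(cfg)
    os.chmod(cfg, 0o600)
    locked_down = check_config_permissions(cfg)
```

The check touches only local file metadata, which is why it fits in the synchronous group.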

Connectivity Checks (async, background)

These make network calls and can be slow (5-second timeout).

Check ID	Name	Category	Severity	What it checks
`llm_reachable`	LLM Reachable	connectivity	critical	LLM API responds (Anthropic: hits `/v1/models`; Ollama: hits `/api/tags`)
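
The 5-second timeout can be sketched with `asyncio.wait_for`. The `probe` argument is a hypothetical stand-in for whatever coroutine performs the real HTTP request, and the result shape and messages are assumptions; the point is only that a slow or failing probe becomes a `critical` result instead of hanging the caller.

```python
import asyncio
from typing import Awaitable, Callable


async def check_llm_reachable(
    probe: Callable[[], Awaitable[None]], timeout: float = 5.0
) -> dict:
    """Run a reachability probe with a hard timeout (illustrative sketch)."""
    result = {"check_id": "llm_reachable", "status": "ok",
              "message": "LLM API responded"}
    try:
        await asyncio.wait_for(probe(), timeout=timeout)
    except asyncio.TimeoutError:
        result["status"] = "critical"
        result["message"] = f"LLM API did not respond within {timeout:g}s"
    except Exception as exc:  # connection refused, DNS failure, etc.
        result["status"] = "critical"
        result["message"] = f"Cannot reach LLM API: {exc}"
    return result


async def _slow_probe() -> None:
    await asyncio.sleep(1)  # simulates an unresponsive endpoint


async def _fast_probe() -> None:
    return None  # simulates an immediate successful response


timed_out = asyncio.run(check_llm_reachable(_slow_probe, timeout=0.05))
reachable = asyncio.run(check_llm_reachable(_fast_probe))
```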

Each check returns a HealthCheckResult dataclass with check_id, name, category, status (ok/warning/critical), message, fix_hint, and timestamp.
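
A dataclass with those fields might look like the following sketch. The field names come from the description above; the types, the empty default for `fix_hint`, and the auto-filled timestamp are assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class HealthCheckResult:
    """Sketch of one check result; not the engine's actual definition."""
    check_id: str
    name: str
    category: str        # "config", "storage", or "connectivity"
    status: str          # "ok", "warning", or "critical"
    message: str
    fix_hint: str = ""   # assumed to be empty when the check passes
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


result = HealthCheckResult(
    check_id="config_permissions",
    name="Config Permissions",
    category="config",
    status="warning",
    message="Config file permissions too open: 0o644 (should be 600)",
    fix_hint="Run: chmod 600 ~/.mudabbir/config.json",
)
```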

Status Computation

The engine computes overall_status from individual check results using a simple priority rule:

if any check has status "critical":
    overall_status = "unhealthy"
elif any check has status "warning":
    overall_status = "degraded"
else:
    overall_status = "healthy"

Three possible states:

Status	Meaning	Dashboard indicator	Prompt injection
healthy	All checks passed	Green dot	None (saves context window)
degraded	At least one warning, no criticals	Yellow dot	Issues injected into system prompt
unhealthy	At least one critical failure	Red dot	Issues injected into system prompt
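
The priority rule translates directly into a small runnable function. The name `compute_overall_status` is hypothetical; the logic follows the rule and table above.

```python
from typing import Iterable


def compute_overall_status(statuses: Iterable[str]) -> str:
    """Map individual check statuses to an overall status (sketch)."""
    seen = set(statuses)
    if "critical" in seen:
        return "unhealthy"   # a single critical outranks any warnings
    if "warning" in seen:
        return "degraded"
    return "healthy"


all_ok = compute_overall_status(["ok", "ok", "ok"])
one_warning = compute_overall_status(["ok", "warning", "ok"])
mixed = compute_overall_status(["warning", "critical", "ok"])
```

Note that a mix of warnings and criticals is reported as unhealthy, never degraded.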

Info

When the system is healthy, the health engine adds nothing to the system prompt. This is intentional — it preserves context window space for actual conversation.

Persistent Error Log

The ErrorStore persists errors to ~/.mudabbir/health/errors.jsonl as append-only JSONL. Errors survive page refresh, browser close, and server restart.

Error entry format

Each line is a JSON object:

{
  "id": "a1b2c3d4e5f6",
  "timestamp": "2025-03-15T10:30:00+00:00",
  "source": "agent_loop",
  "severity": "error",
  "message": "Anthropic API returned 401: Invalid API key",
  "traceback": "Traceback (most recent call last):\n  ...",
  "context": { "session_id": "abc123", "backend": "claude_agent_sdk" }
}

Field	Type	Description
`id`	string	12-character hex identifier (e.g. `a1b2c3d4e5f6`)
`timestamp`	string	ISO 8601 UTC timestamp
`source`	string	Where the error originated (e.g. `agent_loop`, `deep_work`, `tool_execution`)
`severity`	string	`error` or `warning`
`message`	string	Human-readable error description
`traceback`	string	Python traceback (if available)
`context`	object	Additional metadata (session ID, backend, tool name, etc.)
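
Appending an entry in this format takes only a few lines. The sketch below assumes a helper named `append_error` and derives the 12-character id from a random UUID; both are illustrative choices, and the demo writes to a temporary directory rather than `~/.mudabbir/health/`.

```python
import json
import tempfile
import uuid
from datetime import datetime, timezone
from pathlib import Path


def append_error(path, source, message, severity="error",
                 traceback="", context=None):
    """Append one error entry as a JSON line (hypothetical helper)."""
    entry = {
        "id": uuid.uuid4().hex[:12],  # assumption: truncated random UUID
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source": source,
        "severity": severity,
        "message": message,
        "traceback": traceback,
        "context": context or {},
    }
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("a", encoding="utf-8") as fh:  # append-only
        fh.write(json.dumps(entry) + "\n")
    return entry


with tempfile.TemporaryDirectory() as tmp:
    log_path = Path(tmp) / "health" / "errors.jsonl"
    first = append_error(
        log_path, "agent_loop",
        "Anthropic API returned 401: Invalid API key",
        context={"backend": "claude_agent_sdk"},
    )
    append_error(log_path, "deep_work", "sample warning", severity="warning")
    stored = [json.loads(line)
              for line in log_path.read_text(encoding="utf-8").splitlines()]
```

Because every entry is a complete line, a reader can process the file line by line and skip any line that fails to parse.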

Rotation

When errors.jsonl exceeds 10 MB, the store rotates:

errors.jsonl → errors.jsonl.1
errors.jsonl.1 → errors.jsonl.2
… up to errors.jsonl.5 (oldest gets deleted)

This keeps disk usage bounded without losing recent history.
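
The rotation scheme can be sketched as follows. `rotate_if_needed` is a hypothetical name, and the demo uses a tiny size limit so the effect is visible; the defaults mirror the 10 MB threshold and five backups described above.

```python
import tempfile
from pathlib import Path


def rotate_if_needed(path: Path, max_bytes: int = 10 * 1024 * 1024,
                     backups: int = 5) -> bool:
    """Shift errors.jsonl -> .1 -> .2 ... and drop the oldest (sketch)."""
    if not path.exists() or path.stat().st_size <= max_bytes:
        return False
    oldest = path.with_name(f"{path.name}.{backups}")
    if oldest.exists():
        oldest.unlink()  # the oldest backup is deleted
    # Shift from the highest index down so nothing is overwritten early.
    for i in range(backups - 1, 0, -1):
        src = path.with_name(f"{path.name}.{i}")
        if src.exists():
            src.replace(path.with_name(f"{path.name}.{i + 1}"))
    path.replace(path.with_name(f"{path.name}.1"))
    return True


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "errors.jsonl"
    log.write_text("x" * 20, encoding="utf-8")
    log.with_name("errors.jsonl.1").write_text("older", encoding="utf-8")
    rotated = rotate_if_needed(log, max_bytes=10)
    names = sorted(p.name for p in Path(tmp).iterdir())
    moved = log.with_name("errors.jsonl.2").read_text(encoding="utf-8")
    rotated_again = rotate_if_needed(log, max_bytes=10)
```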

Agent Diagnostic Tools

The health engine registers three tools that the agent can use to diagnose problems:

Tool	Description	Parameters
`health_check`	Run system health diagnostics. Returns check results with status and fix hints.	`include_connectivity` (bool, default: false) — also run connectivity checks (slower)
`error_log`	Read recent errors from the persistent error log.	`limit` (int, default: 10) — max errors to return; `search` (string) — filter by text
`config_doctor`	Validate configuration with playbook-backed diagnosis. Returns a report with symptoms, causes, and fix steps.	`section` (string) — focus area: `api_keys`, `backend`, `storage`, or empty for all
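
To make the `error_log` parameters concrete, here is a sketch of a reader that applies `limit` and `search` to a JSONL file. The function name, the newest-first ordering, and matching `search` case-insensitively against the raw line are all assumptions for this example.

```python
import json
import tempfile
from pathlib import Path


def read_recent_errors(path, limit=10, search=""):
    """Return up to `limit` matching entries, newest first (sketch)."""
    path = Path(path)
    if limit <= 0 or not path.exists():
        return []
    needle = search.lower()
    matches = []
    for line in path.read_text(encoding="utf-8").splitlines():
        if needle and needle not in line.lower():
            continue
        try:
            matches.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # skip blank or corrupt lines
    return matches[-limit:][::-1]


with tempfile.TemporaryDirectory() as tmp:
    log = Path(tmp) / "errors.jsonl"
    log.write_text(
        '{"id": "000000000001", "source": "agent_loop", "message": "first"}\n'
        'not json\n'
        '{"id": "000000000002", "source": "deep_work", "message": "second"}\n'
        '{"id": "000000000003", "source": "deep_work", "message": "third"}\n',
        encoding="utf-8",
    )
    newest = read_recent_errors(log, limit=2)
    deep_work = read_recent_errors(log, search="deep_work")
    missing = read_recent_errors(Path(tmp) / "absent.jsonl")
```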

Self-diagnosis flow

When a user says “something went wrong” or “why did that fail?”, the agent can:

Call health_check to see if any checks are failing
Call error_log to find the specific error with traceback
Call config_doctor to get playbook-backed fix suggestions
Explain the problem and suggest concrete steps to resolve it

User: "My Deep Work project failed and I don't know why"

Agent: [calls health_check(include_connectivity=true)] → sees "LLM Reachable: critical"
Agent: [calls error_log(search="deep_work")] → finds "Anthropic API returned 401: Invalid API key"
Agent: "Your Anthropic API key appears to be invalid. The error log shows
        the API returned a 401 error. Go to Settings > API Keys and paste
        a fresh key, then retry your Deep Work project."

Info

The agent doesn’t need to be told about these tools — they’re automatically registered in the tool registry alongside all other built-in tools.

System Prompt Injection

When the system is degraded or unhealthy, the health engine injects a block into the agent’s system prompt via the AgentContextBuilder. This happens in bootstrap/context_builder.py:

from mudabbir.health import get_health_engine

health_block = get_health_engine().get_health_prompt_section()
if health_block:
    parts.append(health_block)

The injected block looks like:

# System Health Status
System is currently: UNHEALTHY

Known issues:
- [WARNING] Config Permissions: Config file permissions too open: 0o644 (should be 600)
  Fix: Run: chmod 600 ~/.mudabbir/config.json
- [CRITICAL] LLM Reachable: Cannot reach Anthropic API: Connection timed out
  Fix: Check your internet connection or https://status.anthropic.com

If the user reports problems, check these issues first.
Use the `health_check` tool for diagnostics and `error_log` for recent errors.

This means the agent is pre-loaded with knowledge of system problems before the user even asks. When healthy, nothing is injected — zero context window overhead.

Warning

The prompt injection is wrapped in a try-except that silently catches all errors. Health engine failures never break prompt building or block the agent from starting.

Repair Playbooks

Playbooks are pure data mappings from check_id to diagnostic information. They live in mudabbir.health.playbooks and are used by the config_doctor tool and the dashboard health modal.

Each playbook contains:

Field	Description
`symptom`	What the user experiences (e.g. “Agent fails to respond or returns authentication errors”)
`causes`	List of possible root causes
`fix_steps`	Ordered list of steps to resolve the issue
`auto_fixable`	Whether the system can fix this automatically (currently only `config_permissions`)

Playbooks exist for: api_key_primary, llm_reachable, config_valid_json, backend_deps, disk_space, config_permissions, and secrets_encrypted.

Example playbook:

"llm_reachable": {
    "symptom": "Agent times out or returns network errors",
    "causes": [
        "Internet connection is down",
        "Anthropic API is experiencing an outage",
        "Firewall or proxy blocking API requests",
        "Ollama is not running (if using ollama backend)",
    ],
    "fix_steps": [
        "Check your internet connection",
        "Visit https://status.anthropic.com for API status",
        "If using Ollama: run 'ollama serve' in a terminal",
        "Check if a firewall/VPN is blocking api.anthropic.com",
    ],
    "auto_fixable": False,
}
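
Because playbooks are plain data, turning failing checks into a report is a dictionary lookup. The sketch below uses an abridged copy of the entry above; the `diagnose` function and the report shape are assumptions, not the actual `config_doctor` implementation.

```python
# Abridged copy of the llm_reachable playbook, for illustration only.
PLAYBOOKS = {
    "llm_reachable": {
        "symptom": "Agent times out or returns network errors",
        "causes": ["Internet connection is down"],
        "fix_steps": ["Check your internet connection"],
        "auto_fixable": False,
    },
}


def diagnose(failing_check_ids):
    """Return playbook entries for the failing checks that have one."""
    report = []
    for check_id in failing_check_ids:
        playbook = PLAYBOOKS.get(check_id)
        if playbook is None:
            continue  # not every check has a playbook
        report.append({"check_id": check_id, **playbook})
    return report


# audit_log_writable has no playbook, so it is skipped.
report = diagnose(["llm_reachable", "audit_log_writable"])
```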

Dashboard UI

Health dot

The sidebar displays a small colored dot next to the Mudabbir logo:

Green — healthy (all checks pass)
Yellow — degraded (warnings present)
Red — unhealthy (critical failures)

Health modal

Clicking the health dot opens a modal showing:

Overall status with a colored badge
List of all check results with status icons
For failing checks: the error message and fix hint
A Fix Issues button that opens relevant settings (e.g. API Keys if the key is missing)

WebSocket updates

Health status changes are broadcast to all connected WebSocket clients as:

{
  "type": "health_update",
  "data": {
    "status": "degraded",
    "check_count": 11,
    "issues": [...],
    "last_check": "2025-03-15T10:30:00+00:00"
  }
}

The dashboard listens for health_update messages and updates the dot color and modal content in real time — no polling needed.

Configuration

Setting	Default	Description
`health_check_on_startup`	`true`	Run startup checks when Mudabbir launches and print a colored summary to the terminal

When enabled, you’ll see output like this on startup:

  [OK]   Config File: Config file exists at /home/user/.mudabbir/config.json
  [OK]   Config JSON Valid: Config file is valid JSON
  [WARN] Config Permissions: Config file permissions too open: 0o644 (should be 600)
         Run: chmod 600 ~/.mudabbir/config.json
  [OK]   Primary API Key: Anthropic API key is configured
  [OK]   API Key Format: API key formats look correct
  [OK]   Backend Dependencies: All dependencies available for claude_agent_sdk
  [OK]   Secrets Encrypted: Encrypted secrets file is valid (256 bytes)
  [OK]   Disk Space: Data directory: 12.3 MB
  [OK]   Audit Log Writable: Audit log is writable
  [OK]   Memory Directory: Memory directory is accessible

  System: DEGRADED

Heartbeat

The health engine registers a background job via APScheduler that runs every 5 minutes:

Runs all startup checks (sync)
Runs connectivity checks (async, 5-second timeout)
Compares the new overall_status against the previous status
If the status changed (e.g. healthy → degraded), broadcasts a health_update to all WebSocket clients
Logs status transitions to the server log

The heartbeat reuses the existing APScheduler instance from the ProactiveDaemon — no additional scheduler process is created.

Info

The heartbeat only broadcasts on status transitions, not every 5 minutes. If the system stays healthy, no WebSocket messages are sent. This avoids unnecessary network traffic and UI updates.
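
The transition-only rule can be sketched as a small class that remembers the last status. `HeartbeatNotifier`, its `initial_status` argument, and the use of check IDs as `issues` values are assumptions; the message shape follows the `health_update` example in the WebSocket section.

```python
from typing import Callable, List


class HeartbeatNotifier:
    """Broadcast a health_update only when the status changes (sketch)."""

    def __init__(self, broadcast: Callable[[dict], None],
                 initial_status: str = "healthy") -> None:
        self._broadcast = broadcast
        self._previous = initial_status

    def on_tick(self, status: str, issues: list, check_count: int,
                last_check: str) -> bool:
        if status == self._previous:
            return False  # no transition, so stay silent
        self._previous = status
        self._broadcast({
            "type": "health_update",
            "data": {
                "status": status,
                "check_count": check_count,
                "issues": issues,
                "last_check": last_check,
            },
        })
        return True


sent: List[dict] = []
notifier = HeartbeatNotifier(sent.append)
notifier.on_tick("healthy", [], 11, "2025-03-15T10:30:00+00:00")
notifier.on_tick("degraded", ["config_permissions"], 11,
                 "2025-03-15T10:35:00+00:00")
notifier.on_tick("degraded", ["config_permissions"], 11,
                 "2025-03-15T10:40:00+00:00")
```

Three ticks produce a single message here, because only the second tick changes the status.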

REST API

The health engine exposes four REST endpoints, summarized below. See the API Reference for full details.

GET /api/health — Get current health status summary
GET /api/health/errors — Query the persistent error log
POST /api/health/check — Trigger a full health check run
DELETE /api/health/errors — Clear the persistent error log

Related

Agent Loop: How the agent processes messages and uses diagnostic tools
Security Model: Guardian AI, audit logging, and the security layer
API Reference: Full REST endpoint documentation

Last updated: February 21, 2026
