Back to blog
2026-03-06 · 13 min

Adding a Heartbeat to Polaris: Building an Always-On AI Agent Daemon

AI · daemons · systems · Python · orchestration

Polaris started as a conversation-scoped tool. You open Claude Code, Polaris orchestrates your work, and when you close the session, everything stops. The state persists in Polaris.md and .polaris/loop-state.yaml, so you can pick up where you left off, but nothing runs while you're not looking.

That's fine for development work. It's not fine for things that need to happen on a schedule: scanning job boards every 6 hours, monitoring a deployment for drift, running nightly test suites against a staging environment. For those you need a daemon.

The Chyros Model

I studied how Claude Code's internal scheduling system works (the source is called "Chyros" in some contexts). The key design decisions:

  • 1-second polling interval for the scheduler tick. Cheap enough to be responsive, but all the real work is gated behind cron expression evaluation so you're not burning CPU.
  • File-based task persistence via a single JSON file. No database. The scheduler reads .claude/scheduled_tasks.json, evaluates each task against the current time, and fires those that are due.
  • Lock management to prevent double-firing when multiple sessions are open. A PID-based lock file with liveness probing (every 5 seconds, non-owning sessions check if the owner's PID is still alive and take over if it's dead).
  • Jitter on fire times to prevent thundering herd at :00 boundaries. Random offset within a configurable window.
  • Missed task detection on restart. One-shot tasks that missed their fire time get surfaced for confirmation. Recurring tasks just fire on the next tick.

I borrowed the parts that make sense for a background daemon and dropped the parts that are specific to interactive REPL use.
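Of the list above, the jitter is the smallest idea to borrow. A sketch of what a jittered fire time looks like (independent of either implementation; the function name is mine):

```python
import random

def jittered_fire_time(base_ts: float, jitter_window: float = 30.0) -> float:
    """Offset a scheduled fire time by a random amount within the window,
    so tasks scheduled at :00 boundaries don't all fire at once."""
    return base_ts + random.uniform(0.0, jitter_window)
```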

Polaris Daemon Architecture

```mermaid
graph TD
    A[polaris daemon start] --> B[PID Check]
    B -->|Already running| C[Exit with error]
    B -->|Clear| D[Write PID file]
    D --> E[Load task queue]
    E --> F[Start webhook server]
    F --> G[Heartbeat Loop]
    G --> H{Check cron tasks}
    H -->|Due| I[Execute via polaris-ask]
    H -->|Not due| J{Check watch tasks}
    J -->|Files changed| I
    J -->|No changes| K[Sleep interval]
    I --> L[Update task state]
    L --> M[Send notifications]
    M --> K
    K --> G
    N[SIGTERM/SIGINT] --> O[Graceful shutdown]
    O --> P[Remove PID file]
    O --> Q[Save state]
```

Three Task Types

Cron tasks fire on a schedule defined by a standard 5-field cron expression. The implementation is a minimal cron parser that evaluates minute/hour/day/month/weekday fields against the current time. After firing, the next fire time is calculated and stored on the task. No catch-up logic (if the daemon was down when a cron task was due, it fires on the first tick after restart, not N times for N missed intervals).
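A minimal evaluator along these lines might look like the following. This is a sketch, not Polaris's actual code: it handles `*`, `*/N`, comma lists, and plain numbers, and ignores ranges and cron's day-of-month/day-of-week OR rule.

```python
import datetime

def _field_matches(field: str, value: int) -> bool:
    """Match one cron field: supports *, */N, comma lists, plain numbers."""
    for part in field.split(","):
        if part == "*":
            return True
        if part.startswith("*/"):
            if value % int(part[2:]) == 0:
                return True
        elif int(part) == value:
            return True
    return False

def cron_due(expr: str, now: datetime.datetime) -> bool:
    """Evaluate a 5-field cron expression against the current minute."""
    minute, hour, day, month, weekday = expr.split()
    return (_field_matches(minute, now.minute)
            and _field_matches(hour, now.hour)
            and _field_matches(day, now.day)
            and _field_matches(month, now.month)
            and _field_matches(weekday, now.isoweekday() % 7))  # cron: 0 = Sunday
```

On each tick, the scheduler calls something like `cron_due(task.expr, now)` and fires the task only when the current minute matches, which is what keeps the 1-second heartbeat cheap.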

Watch tasks monitor file or directory paths for changes. The watcher uses mtime checks (not inotify/fsevents) because it's simpler and cross-platform. Each heartbeat tick, the watcher compares current mtimes against stored values. For directories, it takes the max mtime of all files recursively. Changed paths trigger the associated task.
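The mtime comparison is only a few lines. A sketch with hypothetical names (Polaris's internals may differ):

```python
import os

def tree_mtime(path: str) -> float:
    """Mtime of a file, or the max mtime of all files under a directory."""
    if os.path.isfile(path):
        return os.path.getmtime(path)
    latest = 0.0
    for root, _dirs, files in os.walk(path):
        for name in files:
            try:
                latest = max(latest, os.path.getmtime(os.path.join(root, name)))
            except FileNotFoundError:
                pass  # file vanished between listing and stat; skip it
    return latest

def changed_paths(watched: dict[str, float]) -> list[str]:
    """Return watched paths whose mtime moved past the stored baseline,
    advancing the baseline for each one that changed."""
    changed = []
    for path, baseline in watched.items():
        current = tree_mtime(path)
        if current > baseline:
            watched[path] = current
            changed.append(path)
    return changed
```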

Webhook tasks listen on an HTTP endpoint. The daemon runs a lightweight async HTTP server (raw asyncio, no framework) that accepts POST requests to /trigger/{path}. When a request comes in, the matching task is queued for immediate execution on the next tick.

Adding Tasks

# Scan job boards every 6 hours
polaris daemon add "job-scan" \
  --type cron \
  --cron "0 */6 * * *" \
  --model gemini \
  --prompt "Scan all configured job boards and generate a digest"
 
# Rebuild index when source files change
polaris daemon add "auto-reindex" \
  --type watch \
  --watch ./src \
  --watch ./lib \
  --model claude \
  --prompt "Reindex the codebase, focusing on changed files"
 
# Trigger deployment review via webhook
polaris daemon add "deploy-review" \
  --type webhook \
  --webhook-path /deploy-review \
  --model claude \
  --prompt "Review the latest deployment diff and flag potential issues"

Task Execution

Every task executes through the same path: polaris-ask {model} "{prompt}". This is the same delegation mechanism that the interactive Polaris orchestrator uses. The daemon doesn't need its own model integration layer because polaris-ask already handles routing to Claude, GPT, Gemini, and Kimi.

The execution is async (asyncio.create_subprocess_exec), with a configurable timeout per task (default: 300 seconds). Stdout is captured as the task result. Non-zero exit codes are treated as failures.
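A sketch of that execution path. I've parameterized the command as an argv list so the helper is testable with any binary; the daemon would call it as `run_command(["polaris-ask", model, prompt])` (the helper name and exact flags are illustrative):

```python
import asyncio

async def run_command(argv: list[str], timeout: float = 300.0) -> tuple[bool, str]:
    """Run one task subprocess, capturing stdout as the result.
    Kills the process on timeout; non-zero exit is a failure."""
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, _stderr = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        proc.kill()
        await proc.wait()  # reap the killed process
        return False, "timed out"
    return proc.returncode == 0, stdout.decode()
```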

Failed tasks increment a retry counter. After max_retries (default 3) consecutive failures, the task stays in failed status until manually reset. Cron tasks reschedule regardless of success/failure (you don't want a flaky task to permanently stop its schedule).
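The retry bookkeeping can be that simple. A sketch encoding those semantics (names are mine):

```python
def record_result(task: dict, ok: bool, max_retries: int = 3) -> None:
    """Update a task's status after a run: success resets the counter,
    the Nth consecutive failure parks it in 'failed' until manual reset."""
    if ok:
        task["retries"] = 0
        task["status"] = "ok"
        return
    task["retries"] = task.get("retries", 0) + 1
    task["status"] = "failed" if task["retries"] >= max_retries else "retrying"
```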

State and Recovery

The daemon persists three files in .polaris/daemon/:

tasks.json: The task queue. Contains all task definitions, their current state, last run time, next run time, run count, and error history. Written atomically via temp file + rename to prevent corruption on crash.
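The atomic-write pattern, sketched (assuming the JSON layout above; the helper name is mine):

```python
import json
import os
import tempfile

def save_tasks_atomic(tasks: list[dict], path: str) -> None:
    """Write the task queue via temp file + rename so a crash mid-write
    never leaves a half-written tasks.json behind."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(tasks, f, indent=2)
            f.flush()
            os.fsync(f.fileno())  # ensure the bytes hit disk before the rename
        os.replace(tmp, path)     # atomic on POSIX; readers see old or new, never partial
    except BaseException:
        os.unlink(tmp)
        raise
```

The temp file must live in the same directory as the target, since `os.replace` is only atomic within a filesystem.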

daemon-state.json: Global daemon state. Tracks the last heartbeat timestamp, total tasks run, started_at time, and the last 100 errors. This is what polaris daemon status reads.

daemon.pid: PID file for single-instance enforcement. On startup, the daemon checks if the PID in this file corresponds to a running process (os.kill(pid, 0)). If yes, it refuses to start. If the process is dead (stale PID file), it cleans up and proceeds.
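The liveness probe is the classic signal-0 trick:

```python
import os

def pid_alive(pid: int) -> bool:
    """Check whether a PID belongs to a running process without signaling it."""
    try:
        os.kill(pid, 0)   # signal 0: existence/permission check only
    except ProcessLookupError:
        return False      # stale PID file: safe to clean up and start
    except PermissionError:
        return True       # process exists but is owned by another user
    return True
```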

On restart after a crash:

  1. PID file is stale, gets cleaned up
  2. Task queue loads from tasks.json
  3. Cron tasks with next_run_at in the past fire on the first tick
  4. Watch tasks reset their mtime baselines (first tick establishes current state, second tick detects changes)
  5. Daemon state loads previous error history and run counts

Notification Channels

Three channels, all optional:

Stdout (always on): Structured logging through Python's logging module. Every task execution, completion, and failure gets logged with timestamps and task IDs.

Discord/Slack webhook: POST to a webhook URL with an embedded message. Discord gets rich embeds (color-coded by severity, with timestamps and footers). Slack gets a simpler text format with emoji indicators. The webhook URL can be set via CLI flag or environment variable (POLARIS_DISCORD_WEBHOOK / POLARIS_SLACK_WEBHOOK).

SMTP email: For environments where webhook isn't available. Configurable host, port, TLS, and credentials. Less real-time than webhooks, but works everywhere.

The notification for each task includes: task name, model used, run number, and a preview of the result (truncated to 500 chars for webhooks, full text for email). Failures include the error message and retry count.

The Webhook Server

The webhook server is deliberately minimal. Raw asyncio TCP server, manual HTTP parsing, no framework dependencies. It supports three endpoints:

  • GET /health returns {"ok": true} (for monitoring)
  • GET /status returns the full task queue summary
  • POST /trigger/{path} triggers matching webhook tasks

The server binds to localhost by default (port 9876). For external access, you'd put it behind a reverse proxy. The lack of authentication is intentional for localhost use. If you expose it externally, your reverse proxy should handle auth.

Why not use FastAPI or Flask? Because adding a web framework as a daemon dependency felt wrong. The webhook server is maybe 100 lines of code. It doesn't need routing, middleware, serialization libraries, or any of the other things frameworks provide. It needs to accept a POST, match a path, and update a task status.
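For flavor, here's roughly what a framework-free handler looks like. This is a standalone sketch, not Polaris's actual code: it ignores request bodies and most of HTTP, and the in-memory `PENDING` list stands in for the daemon's real task queue.

```python
import asyncio
import json

PENDING: list[str] = []  # hypothetical stand-in for the daemon's task queue

async def handle(reader: asyncio.StreamReader, writer: asyncio.StreamWriter) -> None:
    request_line = (await reader.readline()).decode()
    method, path, _ = request_line.split(" ", 2)
    while (await reader.readline()) not in (b"\r\n", b""):
        pass  # skip headers; this sketch never reads a body
    if method == "GET" and path == "/health":
        body, status = json.dumps({"ok": True}), "200 OK"
    elif method == "POST" and path.startswith("/trigger/"):
        PENDING.append(path[len("/trigger/"):])  # queue for the next tick
        body, status = json.dumps({"queued": True}), "200 OK"
    else:
        body, status = json.dumps({"error": "not found"}), "404 Not Found"
    writer.write(
        f"HTTP/1.1 {status}\r\nContent-Type: application/json\r\n"
        f"Content-Length: {len(body)}\r\nConnection: close\r\n\r\n{body}".encode()
    )
    await writer.drain()
    writer.close()

async def serve(port: int = 9876) -> None:
    server = await asyncio.start_server(handle, "127.0.0.1", port)  # localhost only
    async with server:
        await server.serve_forever()
```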

Integration with the Job Pipeline

The daemon was designed specifically to enable the job hunting pipeline (Phase 5 of the portfolio buildout). The pipeline runs as a cron task:

polaris daemon add "job-discovery" \
  --type cron \
  --cron "0 */6 * * *" \
  --model gemini \
  --prompt "Run the job discovery pipeline: scan Greenhouse/Ashby APIs for target companies, score roles, generate application materials for top matches, write digest to ~/polaris-jobs/"

Gemini handles the research-heavy scraping and scoring. For high-scoring roles that need cover letters, a follow-up task routes to Claude:

polaris daemon add "application-gen" \
  --type watch \
  --watch ~/polaris-jobs/pending/ \
  --model claude \
  --prompt "Generate tailored cover letter and resume for each role in ~/polaris-jobs/pending/"

The watch task triggers whenever the discovery task writes new pending roles. Two tasks, two models, fully automated.

What I'd Do Differently

The cron parser is brute-force. next_fire_time checks minute by minute for up to 48 hours. For expressions like 0 0 1 1 * (once a year), this fails because it only looks 48 hours ahead. A proper cron library (like croniter) would handle this correctly, but I wanted zero external dependencies for the scheduler core. The tradeoff is acceptable for the hourly/daily schedules I actually use.

Watch tasks should debounce. A git pull that touches 50 files triggers 50 mtime changes, but the watch task only needs to fire once. The current implementation fires once per tick (since it batches all changes), but if the heartbeat interval is very short (say 5 seconds), rapid file changes could trigger the task multiple times before the first execution finishes.

The webhook server should support task-specific payloads. Right now, a POST to /trigger/deploy-review triggers the task with its static prompt. It would be more useful if the POST body could inject context into the prompt (like the deployment diff URL). This is a simple change but I haven't needed it yet.