2025-10-17 · 18 min

Polaris: Building a Multi-Model Orchestration Engine

AI · LLM · orchestration · RAG · MCP · Python

Claude is good at architecture. GPT is good at scaffolding. Kimi destroys frontend work. Gemini can chew through 2M tokens of context without breaking a sweat.

I kept switching between them manually, copying context back and forth, losing state, re-explaining what I needed. It was like managing a team of brilliant interns who all have amnesia. I wanted one system that knew which model to call for which task, maintained persistent state across iterations, and could run autonomously until the work was done.

That's Polaris.

What It Actually Is

Polaris turns Claude Code into an autonomous development engine. You describe a feature, Polaris breaks it into phases with independently testable tasks, routes work to specialist models, runs two-pass QA on every phase, and loops until the whole thing ships. The state persists in a Polaris.md file that lives in your repo (so it shows up in PRs), and a vector index tracks your codebase for semantic search.

The core components:

%%MERMAID_START%%graph TD
    A[User Task] --> B[Phase Orchestrator]
    B --> C[Task Classifier]
    C --> D{Model Router}
    D -->|Architecture/Debug| E[Claude]
    D -->|UI/Frontend| F[Kimi]
    D -->|Scaffolding/DevOps| G[GPT-4o]
    D -->|Research/Long-context| H[Gemini]
    E --> I[QA Gate Pass 1: Automated]
    F --> I
    G --> I
    H --> I
    I --> J[QA Gate Pass 2: Adversarial]
    J --> K{Phase Approved?}
    K -->|Yes| L[Next Phase]
    K -->|No| B
    L --> B
    M[AST Chunker] --> N[Voyage Code 3]
    N --> O[LanceDB Index]
    O --> P[Hybrid Search]
    P --> Q[Reranker]
    Q --> R[Search Results]
    S[Drift Detector] -->|>5% drift| T[Auto-Reindex]
    T --> O%%MERMAID_END%%

The Routing Problem

The naive approach to multi-model orchestration is just "send everything to the best model." That works until you realize the best model depends entirely on what you're asking it to do.

I built a strength matrix based on actual benchmark data and my own testing. It lives in polaris/routing/router.py:

DEFAULT_STRENGTH_MATRIX = {
    "debug": {
        "claude": 92,   # SWE-Bench Verified: 82.1%
        "codex": 80,    # SWE-Bench Pro: 56.8%
        "gemini": 78,
        "kimi": 72,     # SWE-Bench Verified: 76.8%
    },
    "ui_frontend": {
        "kimi": 90,     # Tested: glassmorphism, animations, responsive
        "claude": 75,
        "gemini": 70,
        "codex": 65,
    },
    "research": {
        "gemini": 95,   # 2M context, GPQA Diamond 94.3%
        "claude": 78,
        "kimi": 70,
        "codex": 65,
    },
    "algorithm": {
        "kimi": 95,     # LiveCodeBench: 83-85%
        "gemini": 78,
        "claude": 75,
        "codex": 70,
    },
    # ... 12 task types total
}

The TaskClassifier in polaris/routing/classifier.py analyzes the prompt and any mentioned file paths to determine the task type (debug, refactor, scaffold, ui_frontend, architecture, test_gen, explain, research, image_to_code, terminal_cli, algorithm, general). Then the ModelRouter looks up scores and picks the winner.

There's also a file ownership layer. If your project config says "*.tsx" -> kimi, that overrides the classifier entirely. In practice this matters because some files consistently need the same model regardless of what you're doing to them.

What Didn't Work: LLM-Based Routing

My first attempt at routing used Claude itself to classify tasks. You'd think an LLM would be great at understanding what kind of work a prompt describes. It was, but the latency was brutal. Adding 2-3 seconds to every single task delegation just to decide who to delegate to killed the whole flow. The regex + keyword classifier runs in under a millisecond and gets it right ~90% of the time. Good enough.
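That fast path is easy to sketch. The keyword lists below are illustrative, not the actual rules in polaris/routing/classifier.py:

```python
import re

# Illustrative keyword -> task-type rules; first match wins.
RULES: list[tuple[str, str]] = [
    (r"\b(bug|traceback|stack trace|fix|crash)\b", "debug"),
    (r"\b(component|css|responsive|layout|dark mode)\b", "ui_frontend"),
    (r"\b(research|compare|docs|summarize)\b", "research"),
    (r"\b(big-?o|algorithm|complexity|optimi[sz]e)\b", "algorithm"),
]

def classify(prompt: str) -> str:
    text = prompt.lower()
    for pattern, task_type in RULES:
        if re.search(pattern, text):
            return task_type
    return "general"

print(classify("Fix the crash in the login traceback"))  # debug
print(classify("Build a responsive nav component"))      # ui_frontend
```

No LLM in the loop means no latency tax: the whole classification is a handful of regex scans.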

The Search Layer (Where CHEMMRAG Lives)

The search system is the foundation everything else builds on. When Polaris needs to understand your codebase (for context injection, drift detection, or answering "where is X implemented?"), it hits a LanceDB vector index populated by AST-aware chunks.

Why AST Chunking Matters

Most RAG systems chunk code by line count or token count. Split every 500 tokens, maybe with some overlap. This is fine for prose and absolutely terrible for code.

Think about what happens when you split a 200-line class at line 100. The first chunk has the constructor and half the methods. The second chunk has the other half. Neither chunk makes semantic sense on its own. The embedding model sees half a class in each and produces mediocre vectors for both.

The chunker in polaris/chunker/bridge.py uses tree-sitter to parse actual ASTs for 20+ languages. It walks the syntax tree and extracts semantic units: functions, methods, classes, declarations. Each chunk is a complete, meaningful code entity.

from dataclasses import dataclass

@dataclass
class Chunk:
    content_hash: str      # SHA256 for change detection
    stable_id: str         # path:nodeType:name:startLine:endLine
    content: str           # The actual code
    file_path: str
    language: str
    start_line: int
    end_line: int
    chunk_type: str        # function, method, class, declaration, block
    name: str | None       # Function/class name if applicable
    graph: GraphMetadata | None  # Parent/child/call relationships

The stable_id is deterministic (file path + node type + name + line range), which means you can detect exactly which chunks changed between indexing passes without re-embedding everything.
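The change-detection idea, sketched with hypothetical helpers (not the bridge code itself):

```python
import hashlib

def stable_id(path: str, node_type: str, name: str, start: int, end: int) -> str:
    # Deterministic: the same code entity maps to the same id across runs.
    return f"{path}:{node_type}:{name}:{start}:{end}"

def content_hash(code: str) -> str:
    return hashlib.sha256(code.encode()).hexdigest()

# Re-embed only the chunks whose content hash moved since the last pass.
previous = {stable_id("auth.py", "function", "login", 10, 42): content_hash("def login(): ...")}
current = {stable_id("auth.py", "function", "login", 10, 42): content_hash("def login():  # tweaked")}

to_reembed = [cid for cid, digest in current.items() if previous.get(cid) != digest]
print(to_reembed)  # ['auth.py:function:login:10:42']
```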

The GraphMetadata is the thing I'm most proud of. Each chunk tracks its parent (the class it belongs to), children (methods inside it), and call relationships (functions it calls, functions that call it). This powers the graph expansion in search, where finding one relevant function automatically surfaces the class it lives in and the functions it calls.

Embeddings and Search Pipeline

The actual search pipeline in polaris/core/index.py runs six stages:

%%MERMAID_START%%graph LR
    A[Query] --> B[HyDE Rewrite]
    B --> C[Embed with voyage-code-3]
    C --> D[Vector Search + FTS]
    D --> E[RRF Fusion]
    E --> F[Voyage Rerank 2.5]
    F --> G[Graph Expansion]
    G --> H[Results]%%MERMAID_END%%

  1. HyDE rewrite: Uses Claude Haiku to generate a hypothetical code snippet that would answer the query. This synthetic document gets embedded instead of (well, alongside) the raw query. It dramatically improves recall for natural language queries like "how does the auth middleware work?" because the embedding of a hypothetical auth middleware implementation is closer in vector space to the actual auth middleware than the question itself.

  2. Dual embedding: The HyDE output gets embedded with voyage-code-3 (1024-dim, specifically trained for code). The model distinguishes between "document" and "query" input types, which matters for asymmetric retrieval.

  3. Hybrid search: Vector similarity search runs against LanceDB, and a full-text search runs in parallel. Both return their top candidates.

  4. RRF fusion: Reciprocal Rank Fusion combines the two result sets. The formula is dead simple: for each document, sum 1/(k + rank) across both result lists. k=60 is standard. Documents that rank high in both searches bubble to the top.

  5. Reranking: Voyage Rerank 2.5 does a cross-attention pass over the query and each candidate. This is the expensive step but it's running on maybe 20 candidates, not thousands, so it's fine.

  6. Graph expansion: For each result, pull in parent classes, called functions, and callers from the graph metadata. If you searched for "login handler" and found the function, you also get the auth middleware class it lives in and the session validator it calls.
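The RRF step in particular really is dead simple — a sketch with the standard k=60, not the pipeline's actual code:

```python
def rrf(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, doc in enumerate(results, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["auth.py", "session.py", "utils.py"]
fts_hits = ["auth.py", "config.py", "session.py"]
print(rrf([vector_hits, fts_hits]))
```

auth.py ranks first in both lists, so it dominates; session.py beats config.py because appearing in both lists outscores one strong placement.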

The Numbers

LanceDB runs in-process (no external database to manage) with PyArrow backing. For a ~50K line codebase, initial indexing takes about 90 seconds. Incremental updates (via drift detection) typically process in under 5 seconds because they only re-embed changed chunks.

The voyage-code-3 embeddings produce 1024-dimensional float32 vectors. For 10K chunks, that's ~40MB of vector data. LanceDB handles this without breaking a sweat.
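Quick sanity check on that figure (float32 = 4 bytes per dimension):

```python
# Back-of-envelope storage estimate for the vector index.
chunks, dims, bytes_per_float32 = 10_000, 1024, 4
total_bytes = chunks * dims * bytes_per_float32
print(total_bytes / 1e6, "MB")  # 40.96 MB
```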

Autonomous Loops

The orchestrator in polaris/core/orchestrator.py implements what I call "Ralph-style" loops (named after the first time I got Claude to iterate autonomously on a task without human intervention).

The loop state persists in .polaris/loop-state.yaml:

active: true
iteration: 7
max_iterations: 50
task: "Implement OAuth2 PKCE flow for the dashboard"
completion_promise: "TASK_COMPLETE"
started_at: "2026-04-10T14:30:00Z"
polaris:
  last_reindex_iteration: 5
  current_drift: 2.3
  tool_usage:
    search_code: 12
    start_task: 3
    complete_task: 2
  delegation_history:
    - model: kimi
      task: "Build OAuth callback component"
      iteration: 4

Each iteration, the system:

  1. Checks drift against the manifest (two-phase: fast mtime check, then deep content hash for changed files)
  2. Auto-reindexes if drift exceeds 5%
  3. Evaluates whether the completion promise has been satisfied
  4. Generates iteration context for Claude with current phase, task progress, and search results
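The two-phase drift check in step 1 can be sketched like this, assuming a manifest that records mtime and a SHA256 per tracked file (the real manifest format may differ):

```python
import hashlib
import os

def detect_drift(manifest: dict[str, dict], root: str = ".") -> float:
    """Two-phase drift check: cheap mtime compare first, content hash only
    for files whose mtime moved. Returns percent of tracked files changed."""
    changed = 0
    for rel_path, entry in manifest.items():
        path = os.path.join(root, rel_path)
        if not os.path.exists(path):
            changed += 1  # deleted files count as drift
            continue
        if os.path.getmtime(path) == entry["mtime"]:
            continue  # phase 1 fast path: mtime untouched, skip hashing
        with open(path, "rb") as f:  # phase 2: deep content hash
            if hashlib.sha256(f.read()).hexdigest() != entry["sha256"]:
                changed += 1
    return 100.0 * changed / max(len(manifest), 1)
```

The mtime fast path is what keeps the scan cheap: hashing only happens for files that look touched.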

The completion promise is a simple pattern. The model outputs <promise>TASK_COMPLETE</promise> when it believes the work is done. This triggers QA passes, and if those pass, the loop terminates. If QA fails, the loop continues with the QA feedback injected into context.
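Detecting the promise is a one-regex affair; a sketch of what the check might look like:

```python
import re

# Completion promise: the worker signals done-ness in-band, in its output text.
PROMISE_RE = re.compile(r"<promise>(.*?)</promise>", re.DOTALL)

def promise_satisfied(output: str, expected: str = "TASK_COMPLETE") -> bool:
    match = PROMISE_RE.search(output)
    return match is not None and match.group(1).strip() == expected

print(promise_satisfied("Done.\n<promise>TASK_COMPLETE</promise>"))  # True
print(promise_satisfied("Still iterating on token refresh"))         # False
```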

Phase Orchestration

Complex tasks get decomposed into phases, each with independently testable tasks. The state lives in Polaris.md:

# Polaris Plan
 
## Phase 1: Database Schema [APPROVED]
- [x] Create user model with OAuth fields
- [x] Add session table with PKCE verifier storage
- [x] Migration scripts
 
## Phase 2: Auth Flow [IN_PROGRESS]
- [x] PKCE challenge generation
- [ ] Token exchange endpoint  (IN_PROGRESS)
- [ ] Refresh token rotation
 
## Phase 3: Dashboard Integration [PENDING]
- [ ] Login component
- [ ] Protected route wrapper
- [ ] Session management hook
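Parsing that file back into state is straightforward; a sketch against the format shown above (the real file may carry extra annotations):

```python
import re

PHASE_RE = re.compile(r"^## (.+?) \[(\w+)\]$")   # "## Name [STATUS]"
TASK_RE = re.compile(r"^- \[( |x)\] (.+)$")      # "- [x] task"

def parse_plan(text: str) -> list[dict]:
    phases: list[dict] = []
    for line in text.splitlines():
        line = line.strip()
        if m := PHASE_RE.match(line):
            phases.append({"name": m.group(1), "status": m.group(2), "tasks": []})
        elif (m := TASK_RE.match(line)) and phases:
            phases[-1]["tasks"].append({"done": m.group(1) == "x", "task": m.group(2)})
    return phases

plan = parse_plan("""## Phase 2: Auth Flow [IN_PROGRESS]
- [x] PKCE challenge generation
- [ ] Token exchange endpoint
""")
print(plan[0]["status"], sum(t["done"] for t in plan[0]["tasks"]))  # IN_PROGRESS 1
```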

Two QA passes are required before a phase gets approved:

  • Pass 1: Automated testing (pytest, mypy, ruff). Does the code compile? Do the tests pass? Are there type errors?
  • Pass 2: Adversarial verification. A separate agent tries to break the implementation. Boundary conditions, error paths, concurrency issues, security probes. It outputs a VERDICT: PASS/FAIL/PARTIAL.

I've had Pass 2 catch real bugs that Pass 1 missed. One time it found a race condition in session cleanup that only manifested under concurrent requests. The automated tests all passed because they ran sequentially.

Multi-Model Delegation

The actual delegation happens through polaris-ask, a CLI tool that wraps each model's interface. The execution model is simple: spawn a subprocess, pipe the prompt in, capture stdout.

# Single delegation
polaris-ask kimi "Build a responsive nav component with dark mode toggle"
 
# Parallel delegation (single shell command)
polaris-ask kimi "Build the UI" & \
polaris-ask codex "Write the Dockerfile" & \
polaris-ask gemini "Research the API docs" & \
wait
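On the coordinator side, that execution model is plain subprocess plumbing. A minimal Python sketch (the `binary` parameter is only there so the sketch is testable; polaris-ask is the real wrapper):

```python
import subprocess

def delegate(model: str, prompt: str, binary: str = "polaris-ask", timeout: int = 600) -> str:
    """Spawn a worker CLI, pipe the prompt to stdin, capture stdout."""
    result = subprocess.run(
        [binary, model],
        input=prompt,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    result.check_returncode()  # surface worker failures to the coordinator
    return result.stdout
```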

The team/runtime.py module handles tmux-based worker orchestration for longer tasks. Each worker gets its own tmux pane, runs independently, and streams JSON output that the coordinator collects.

Why Not Function Calling?

I tried using each model's native function calling / tool use API for structured delegation. The problem is that every model implements tools differently. Claude uses XML-ish tool blocks. GPT uses JSON function calls. Gemini has its own thing. Maintaining four different tool schema formats and parsing four different response formats was a nightmare.

The polaris-ask approach is dumber and better. The worker model just outputs text. The coordinator (Claude) reads the text and decides what to do with it. Natural language as the integration layer. It's more tokens but way less code and way fewer parsing bugs.

The HTTP Proxy (For Analytics)

Polaris includes an HTTP/2 proxy (polaris/proxy/server.py) that sits between Claude Code and the Anthropic API. It intercepts every request and logs:

  • Input/output token counts
  • Request/response sizes
  • Latency per request
  • Cost estimates

There's also a response cache keyed by prompt hash. If you ask the same question twice (which happens more than you'd think during development), the second request returns instantly from cache. The cache is TTL-based and invalidates when the codebase changes (via drift detection).

The proxy handles HTTP/2 with custom TLS, which was a pain to get right. The httpx library with h2 does most of the heavy lifting, but managing the TLS certificates for MITM required wrapping the cryptography library's certificate generation.

The Dashboard

A React 18 dashboard (dashboard-ui/) provides real-time visibility into:

  • Current plan phases and task status
  • Search interface for the vector index
  • Token usage analytics (charts via Recharts)
  • Terminal output (xterm.js)
  • Agent notes

The dashboard connects via WebSocket and updates in real-time as the orchestration loop runs.

IDE Extensions

Polaris ships with VS Code and IntelliJ plugins. The VS Code extension auto-configures everything: writes the CLAUDE.md, sets up MCP, configures permissions. The IntelliJ plugin takes a different approach, running its own HTTP proxy server for API interception and analytics.

The interesting architectural difference: VS Code launches Claude Code as a terminal subprocess and communicates through its stdout. IntelliJ intercepts HTTP traffic through a proxy, which gives it response caching (which the VS Code extension doesn't have) but makes the setup more complex.

What I'd Do Differently

The MCP server grew too large. The mcp/server.py file has 80+ tool schemas. I should have split it into domain-specific servers (search, orchestration, team management) from the beginning. Refactoring a monolithic MCP server is painful because every Claude Code session needs the whole thing available.

Drift detection should be event-based, not polling. Right now it runs a full mtime scan every iteration. On large codebases this takes 1-2 seconds. A file watcher (inotify/fsevents) would make it instant, but introduces complexity around watch limits and cross-platform support.

The strength matrix is static. It should learn from delegation outcomes. If Kimi keeps producing UI code that fails QA but Claude's attempts pass, the matrix should update. I have the delegation history and QA results to do this. Just haven't built the feedback loop yet.

LanceDB's lack of native table rename means the atomic swap for full rebuilds has to copy the entire table. For large indices this is slow. A different vector store might handle this better, but LanceDB's zero-infrastructure setup (no external database process) is too convenient to give up for a minor annoyance.