CHEMMRAG: Why Naive Chunking Kills Your RAG Pipeline
Every RAG tutorial starts the same way. Take your documents, split them into chunks of N tokens with some overlap, embed each chunk, store the vectors, search by cosine similarity. For blog posts and documentation this works fine. For code it's a disaster.
The Problem With Token-Based Chunking
Here's a Python class that's about 600 tokens long:
```python
class SessionManager:
    def __init__(self, db: Database, ttl: int = 3600):
        self.db = db
        self.ttl = ttl
        self._cache: dict[str, Session] = {}

    def create(self, user_id: str) -> Session:
        session = Session(user_id=user_id, expires_at=time.time() + self.ttl)
        self.db.insert("sessions", session.to_dict())
        self._cache[session.id] = session
        return session

    def validate(self, token: str) -> Session | None:
        if token in self._cache:
            session = self._cache[token]
            if session.expires_at > time.time():
                return session
            del self._cache[token]
        row = self.db.find_one("sessions", {"token": token})
        if row and row["expires_at"] > time.time():
            session = Session.from_dict(row)
            self._cache[token] = session
            return session
        return None

    def revoke(self, token: str) -> bool:
        self._cache.pop(token, None)
        return self.db.delete("sessions", {"token": token})
```

With a 400-token chunk size, a naive chunker splits this somewhere around the validate method. Chunk 1 gets the constructor and create. Chunk 2 gets validate and revoke. Neither chunk contains the complete class, and neither chunk's embedding accurately represents what SessionManager does.
If someone searches "how does session validation work?", the embedding for chunk 2 (which contains validate but not the constructor that shows the cache and TTL strategy) is only a partial match. The embedding model doesn't know that self._cache was initialized in the constructor because the constructor isn't in this chunk.
Overlap helps a little. With 50-token overlap, maybe the tail of create leaks into chunk 2. But you're still splitting semantic units arbitrarily.
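To make the failure mode concrete, here's a minimal sketch of the naive strategy. Whitespace splitting stands in for a real tokenizer (an assumption for brevity); the point is that the split position is determined purely by token count:

```python
def naive_chunk(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    # Whitespace split stands in for a real tokenizer.
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

The class boundary never enters into it: the cut lands wherever token 400 happens to fall, which for the class above is mid-way through validate.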
AST-Based Chunking
Tree-sitter gives you the actual syntax tree. For the class above, the AST looks roughly like:
```mermaid
graph TD
    A[class_definition: SessionManager] --> B[function_definition: __init__]
    A --> C[function_definition: create]
    A --> D[function_definition: validate]
    A --> E[function_definition: revoke]
```
The chunker in polaris/chunker/bridge.py walks this tree and produces five chunks:
- The full SessionManager class (as a "class" chunk type)
- The __init__ method
- The create method
- The validate method
- The revoke method
Each chunk is a complete semantic unit. The embedding for the validate method captures everything about validation. The embedding for the full class captures the architectural overview. When someone searches for session validation, both the method-level and class-level chunks are relevant, and their embeddings are accurate representations of their content.
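The idea can be sketched without tree-sitter. This illustration uses Python's stdlib ast module instead (an assumption made so the snippet is self-contained; the real chunker in polaris/chunker/bridge.py walks a tree-sitter tree and so works across languages):

```python
import ast

def ast_chunks(source: str) -> list[tuple[str, str]]:
    """Emit (chunk_type, source_text) pairs: one per class, one per function."""
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            # The full class body becomes its own "class" chunk.
            chunks.append(("class", ast.get_source_segment(source, node)))
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Each method/function also becomes a standalone chunk.
            chunks.append(("function", ast.get_source_segment(source, node)))
    return chunks
```

Run against the SessionManager source, this yields exactly the five chunks listed above: one class chunk plus four method chunks, each a complete syntactic unit.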
Language Support
Tree-sitter has grammars for basically every language that matters. The chunker supports 20+ languages through their file extensions:
```python
SUPPORTED_EXTENSIONS = {
    ".ts": "typescript", ".tsx": "tsx",
    ".js": "javascript", ".jsx": "jsx",
    ".py": "python", ".java": "java",
    ".rs": "rust", ".go": "go",
    ".cpp": "cpp", ".c": "c",
    ".cs": "csharp", ".rb": "ruby",
    # ... and more
}
```

Each language has its own config defining which AST node types count as functions, containers, and top-level declarations. Python's function_definition and class_definition map cleanly. TypeScript has function_declaration, method_definition, class_declaration, interface_declaration, and a few others. The config is explicit per language because tree-sitter node types aren't standardized across grammars.
The Graph Layer
Here's where it gets interesting. Each chunk doesn't just contain its own code. It also tracks relationships:
```python
@dataclass
class GraphMetadata:
    stable_id: str           # Deterministic: file + type + start line
    node_type: str           # Raw tree-sitter node type
    parent_id: str           # Parent chunk's stable_id
    children_ids: list[str]  # Child chunks
    calls: list[str]         # Functions/methods called inside this chunk
    called_by: list[str]     # Chunks that call this one (filled post-process)
```

The parent_id means every method knows which class it belongs to. The children_ids means every class knows its methods. The calls list is populated by walking the AST for identifier nodes inside function call expressions.
This powers graph expansion during search. If vector search returns the validate method, graph expansion automatically pulls in the SessionManager class (parent), the create and revoke methods (siblings), and any functions that validate calls. You asked about one function, you get the architectural context for free.
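A minimal sketch of that expansion step, operating on a dict of chunk metadata keyed by stable_id (field names follow GraphMetadata; the function name and dict representation are assumptions for illustration):

```python
def expand(hit_id: str, chunks: dict[str, dict]) -> set[str]:
    """Given one search hit, pull in its parent, siblings, and resolved callees."""
    hit = chunks[hit_id]
    expanded = {hit_id}
    parent_id = hit.get("parent_id")
    if parent_id and parent_id in chunks:
        expanded.add(parent_id)                             # containing class
        expanded.update(chunks[parent_id]["children_ids"])  # sibling methods
    for callee in hit.get("calls", []):
        if callee in chunks:  # only calls that resolve to an indexed chunk
            expanded.add(callee)
    return expanded
```

One vector hit on validate fans out to the whole SessionManager neighborhood before anything reaches the reranker.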
Stable IDs and Incremental Updates
Every chunk gets a deterministic ID: filepath:nodeType:name:startLine:endLine. This is critical for incremental indexing.
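The scheme is just deterministic concatenation (function name is a hypothetical stand-in; the format follows the one stated above):

```python
def stable_id(filepath: str, node_type: str, name: str,
              start_line: int, end_line: int) -> str:
    # Same chunk in the same place always produces the same ID,
    # so unchanged chunks match across indexing runs.
    return f"{filepath}:{node_type}:{name}:{start_line}:{end_line}"
```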
When you edit session_manager.py and change the validate method, the drift detector (in polaris/core/drift.py) does a two-phase check:
- Fast phase: Compare file mtime and size against the manifest. If both match, the file hasn't changed. This takes microseconds per file.
- Deep phase: For files that failed the fast check, compute a content hash and diff the chunks against the index.
If validate changed but create didn't, only validate's chunk gets re-embedded. The other chunks keep their existing vectors. On a codebase with 10K chunks where you edited 3 files touching 15 chunks, re-indexing embeds 15 chunks instead of 10K. That's the difference between 2 seconds and 90 seconds.
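The two phases above can be sketched as a pair of checks against a stored manifest (the manifest's field names are assumptions; the real detector lives in polaris/core/drift.py):

```python
import hashlib
import os

def needs_deep_check(path: str, manifest: dict) -> bool:
    # Fast phase: mtime + size comparison. Both matching means "unchanged".
    st = os.stat(path)
    entry = manifest.get(path)
    return (entry is None
            or entry["mtime"] != st.st_mtime
            or entry["size"] != st.st_size)

def content_changed(path: str, manifest: dict) -> bool:
    # Deep phase: content hash, only computed when the fast phase flags the file.
    with open(path, "rb") as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    entry = manifest.get(path)
    return entry is None or entry.get("sha256") != digest
```

Files that pass the fast phase never get read at all, which is what keeps the common case at microseconds per file.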
The churn rate (percentage of chunks affected) feeds back into the orchestrator. If drift exceeds 5%, a full reindex triggers automatically. Below that, incremental updates handle it.
Embedding Choice: Why voyage-code-3
I tested three embedding models:
- text-embedding-3-large (OpenAI, 3072 dim): General purpose, not code-optimized
- voyage-code-3 (Voyage AI, 1024 dim): Trained specifically on code
- cohere-embed-english-v3 (Cohere, 1024 dim): Good general purpose
For code retrieval, voyage-code-3 won by a significant margin. The key difference is asymmetric retrieval: voyage-code-3 distinguishes between "document" embeddings (the code chunks) and "query" embeddings (the natural language question). This means "how does authentication work?" maps to a different region of embedding space than the code itself, but the model is trained so that the query region overlaps with relevant document regions.
The 1024-dim vectors are a good tradeoff. Half the size of OpenAI's 3072-dim vectors (so half the storage and faster similarity computation) with better code retrieval quality.
The Reranking Step
Initial vector search returns ~20 candidates. These get reranked by Voyage Rerank 2.5, which does a full cross-attention pass between the query and each candidate. Cross-attention is expensive (quadratic in sequence length), which is why you only run it on 20 candidates instead of 10K.
The reranker catches things that embedding similarity misses. Two functions might have similar structure (both take a dict and return a bool) but do completely different things. The embeddings might be close, but the reranker reads the actual content and can distinguish between "validate a session token" and "validate a webhook signature."
Hybrid Search: Vectors Aren't Enough
Pure vector search misses exact string matches. If someone searches for SessionManager.validate, vector search might not rank the actual validate method first because the embedding captures the semantic meaning, not the literal identifier.
The solution is hybrid search: run vector similarity and full-text search in parallel, then combine with Reciprocal Rank Fusion:
```python
def rrf_fusion(vector_results, fts_results, k=60):
    scores = {}
    for rank, doc in enumerate(vector_results):
        scores[doc["stable_id"]] = scores.get(doc["stable_id"], 0) + 1 / (k + rank + 1)
    for rank, doc in enumerate(fts_results):
        scores[doc["stable_id"]] = scores.get(doc["stable_id"], 0) + 1 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: -x[1])
```

Documents that rank high in both searches get boosted. Documents that only rank high in one still appear but lower. The k=60 constant dampens the impact of exact rank position, which prevents a single outlier from dominating.
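A quick worked example of how the fusion behaves (rrf_fusion repeated from above so the snippet runs standalone):

```python
def rrf_fusion(vector_results, fts_results, k=60):
    scores = {}
    for rank, doc in enumerate(vector_results):
        scores[doc["stable_id"]] = scores.get(doc["stable_id"], 0) + 1 / (k + rank + 1)
    for rank, doc in enumerate(fts_results):
        scores[doc["stable_id"]] = scores.get(doc["stable_id"], 0) + 1 / (k + rank + 1)
    return sorted(scores.items(), key=lambda x: -x[1])

# "validate" is #2 in vector search but #1 in full-text search;
# "create" and "revoke" each appear in only one list.
vector = [{"stable_id": "create"}, {"stable_id": "validate"}]
fts = [{"stable_id": "validate"}, {"stable_id": "revoke"}]
fused = rrf_fusion(vector, fts)
```

Appearing in both lists outweighs a single first-place rank: validate comes out on top even though it never ranked first in vector search.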
HyDE: The Sneaky Trick
Hypothetical Document Embedding is the single biggest quality improvement in the pipeline. Before embedding the user's query, Claude Haiku generates a hypothetical code snippet that would answer the question:
Query: "how does the auth middleware work?"
HyDE generates something like:
```python
def auth_middleware(request):
    token = request.headers.get("Authorization")
    session = session_manager.validate(token)
    if not session:
        raise HTTPException(401)
    request.state.user = session.user_id
```

This synthetic code gets embedded alongside the original query. The combined embedding is much closer in vector space to the actual auth middleware implementation than the bare question would be.
The cost is one Haiku call per search (~$0.0003), and it takes about 500ms. For interactive searches this is noticeable. For background searches during autonomous loops it's invisible. I considered caching HyDE results but the hit rate was too low to justify the complexity.
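The post doesn't specify how the query embedding and the HyDE snippet embedding are combined; one common approach, shown here purely as an assumption, is a weighted average of the two vectors, renormalized to unit length:

```python
import math

def combine_embeddings(query_vec: list[float], hyde_vec: list[float],
                       alpha: float = 0.5) -> list[float]:
    # Weighted average of the two embeddings, then renormalize so
    # cosine similarity against document vectors stays well-behaved.
    mixed = [alpha * q + (1 - alpha) * h for q, h in zip(query_vec, hyde_vec)]
    norm = math.sqrt(sum(x * x for x in mixed)) or 1.0
    return [x / norm for x in mixed]
```

With alpha at 0.5 the synthetic code pulls the search vector halfway toward "what the answer looks like" while keeping the original question's intent in the mix.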
What I'd Do Differently
Language-specific chunking heuristics need more tuning. Python's AST maps cleanly to semantic units. TypeScript is messier because of arrow functions, destructured exports, and JSX. I've seen cases where a complex JSX component gets chunked in a way that separates the JSX from the data-fetching logic above it. I need to add JSX-aware merging for components under a certain size.
The call graph is incomplete. I extract direct function calls from the AST, but I don't resolve imports. If validate() calls check_expiry() which is imported from another file, the graph edge exists for check_expiry as a string identifier but isn't linked to the actual chunk. Resolving cross-file references requires something closer to a language server, which is a whole different level of complexity.
I should track embedding model versions. If I upgrade from voyage-code-3 to a future voyage-code-4, all existing embeddings are incompatible. Right now this requires a full reindex. Storing the model version per chunk and auto-triggering reindex on model change would be cleaner.