
CHEMMRAG

Active

AST-aware RAG pipeline with voyage-code-3 embeddings for code-aware retrieval

Python, LanceDB, Voyage AI, tree-sitter

Architecture

AST-aware chunking keeps each function and class intact instead of splitting it across chunks. Chunks are embedded with voyage-code-3 (1024-dimensional vectors), stored in LanceDB, and queried with hybrid retrieval.
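
As a rough illustration of boundary-preserving chunking, the sketch below emits one chunk per function, class, and method. It is not the project's chunker: it swaps tree-sitter for Python's standard-library `ast` module so that it runs without extra dependencies, and `Chunk` and `extract_units` are hypothetical names chosen for this example.

```python
import ast
from dataclasses import dataclass


@dataclass
class Chunk:
    kind: str        # "function", "class", or "method"
    name: str        # qualified name, e.g. "Parser.parse"
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive
    text: str        # exact source text of the unit


def extract_units(source: str) -> list[Chunk]:
    """Return one chunk per top-level function, class, and method.

    Stand-in for tree-sitter parsing: uses the stdlib ast module,
    so it only understands Python source.
    """
    chunks: list[Chunk] = []
    func_types = (ast.FunctionDef, ast.AsyncFunctionDef)

    def add(node, kind, name):
        # get_source_segment returns the node's text from its first
        # to its last character, so a unit is never cut in half.
        text = ast.get_source_segment(source, node) or ""
        chunks.append(Chunk(kind, name, node.lineno, node.end_lineno, text))

    for node in ast.parse(source).body:
        if isinstance(node, func_types):
            add(node, "function", node.name)
        elif isinstance(node, ast.ClassDef):
            add(node, "class", node.name)
            for item in node.body:
                if isinstance(item, func_types):
                    add(item, "method", f"{node.name}.{item.name}")
    return chunks
```

Code outside any function or class (imports, module-level statements) is ignored here; a fuller chunker would need a policy for that remainder.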

Data Pipeline

tree-sitter AST parsing -> semantic unit extraction -> embedding -> vector store
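
The last two stages (embedding and vector store) can be sketched as follows, assuming semantic units have already been extracted as name and text pairs. Everything here is a placeholder: `stub_embed` is a hashed bag-of-tokens function standing in for voyage-code-3, and `InMemoryStore` is a hypothetical stand-in for the LanceDB table. Only the 1024-dimensional vector size is taken from the description above.

```python
import hashlib
import math
import re

DIM = 1024  # same dimensionality as the voyage-code-3 vectors


def stub_embed(text: str) -> list[float]:
    """Placeholder embedder: hashed bag of identifiers, L2-normalised.

    NOT voyage-code-3; it only gives the sketch deterministic vectors.
    """
    vec = [0.0] * DIM
    for token in re.findall(r"[A-Za-z_]\w*", text):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "big") % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class InMemoryStore:
    """Hypothetical stand-in for the vector store (a plain Python list)."""

    def __init__(self):
        self.rows = []

    def add(self, name: str, text: str) -> None:
        # Embed once at ingest time and keep the vector next to the text.
        self.rows.append({"name": name, "text": text, "vector": stub_embed(text)})

    def search(self, query: str, k: int = 3):
        # Vectors are unit length, so the dot product is cosine similarity.
        q = stub_embed(query)
        scored = [
            (sum(a * b for a, b in zip(q, row["vector"])), row["name"])
            for row in self.rows
        ]
        return sorted(scored, reverse=True)[:k]
```

Because each stored row is one whole semantic unit, a hit always returns a complete function or class rather than a fragment.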

Novel Approaches

  • Code-aware chunking via tree-sitter AST parsing instead of naive line- or token-based splitting
  • Semantic unit preservation (functions, classes, methods as atomic chunks)
  • Hybrid retrieval: vector similarity + AST-level reranking
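
The page does not say which AST signals drive the reranking step, so the sketch below is only one possible reading. It assumes each vector hit carries its symbol name and unit kind, and it boosts hits whose name or kind is mentioned in the query. `Candidate`, `rerank`, and both boost weights are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str            # qualified symbol name from the AST
    kind: str            # "function", "class", or "method"
    vector_score: float  # similarity from the vector search, higher is better


def rerank(query: str, candidates: list[Candidate],
           name_boost: float = 0.2, kind_boost: float = 0.1) -> list[Candidate]:
    """Re-order vector hits using AST metadata (assumed signals, not the project's)."""
    words = set(query.lower().replace(".", " ").split())

    def score(c: Candidate) -> float:
        s = c.vector_score
        # Boost when the query mentions the symbol's own (unqualified) name.
        if c.name.split(".")[-1].lower() in words:
            s += name_boost
        # Boost when the query names the kind of unit it is looking for.
        if c.kind in words:
            s += kind_boost
        return s

    return sorted(candidates, key=score, reverse=True)
```

For example, with hits `Parser` (class, 0.80), `Parser.parse` (method, 0.75), and `tokenize` (function, 0.70), the query "parse method" moves `Parser.parse` to the top even though its raw vector score is lower than that of `Parser`.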

Key Files

polaris/chunker/bridge.py
polaris/chunker/smart.py