CHEMMRAG
Active
AST-aware RAG pipeline with voyage-code-3 embeddings for code-aware retrieval
Python, LanceDB, Voyage AI, tree-sitter
Architecture
AST-aware chunking preserves function and class boundaries. Each chunk is embedded with voyage-code-3 into a 1024-dimensional vector and stored in LanceDB, which is queried with hybrid retrieval.
Data Pipeline
tree-sitter AST parsing -> semantic unit extraction -> embedding -> vector store
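The first two stages (parsing and semantic unit extraction) can be illustrated with a minimal sketch. It is not the project's chunker: to stay dependency-free it swaps tree-sitter for Python's standard-library `ast` module, so it only handles Python source, and it only extracts top-level functions and classes. The `Chunk` type, `extract_units`, and `SAMPLE` are hypothetical names introduced for illustration.

```python
import ast
from dataclasses import dataclass


@dataclass
class Chunk:
    kind: str        # "function" or "class"
    name: str        # symbol name taken from the AST node
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive
    text: str        # exact source text of the unit


def extract_units(source: str) -> list[Chunk]:
    """Return each top-level function or class as one atomic chunk.

    Assumption: stdlib `ast` stands in for tree-sitter here.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, ast.ClassDef):
            kind = "class"
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = "function"
        else:
            continue  # imports, assignments, etc. are skipped in this sketch
        # Include decorator lines so the chunk starts where the unit starts.
        start = min([node.lineno] + [d.lineno for d in node.decorator_list])
        end = node.end_lineno
        text = "\n".join(lines[start - 1:end])
        chunks.append(Chunk(kind, node.name, start, end, text))
    return chunks


SAMPLE = '''import os

def load(path):
    with open(path) as f:
        return f.read()

class Index:
    def add(self, item):
        self.items.append(item)
'''

chunks = extract_units(SAMPLE)
for c in chunks:
    print(c.kind, c.name, c.start_line, c.end_line)
```

Because each chunk boundary comes from an AST node rather than a fixed line or token budget, a function body is never split across two chunks. Each `Chunk.text` would then be passed to the embedding stage.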
Novel Approaches
- Code-aware chunking via tree-sitter AST parsing vs naive line/token splitting
- Semantic unit preservation (functions, classes, methods as atomic chunks)
- Hybrid retrieval: vector similarity + AST-level reranking
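One way to picture the hybrid retrieval step is a two-stage search: rank by vector similarity, then rerank the shortlist with a signal taken from AST metadata. The sketch below is an assumption about how such a reranker could look, not the project's implementation. The source does not say which AST signal is used, so the symbol-name bonus, the `hybrid_search` function, and the toy 3-dimensional vectors (the real embeddings are 1024-dimensional) are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hybrid_search(query_vec, query_text, chunks, k=2, n_candidates=10, name_bonus=0.2):
    """Stage 1: shortlist by cosine similarity. Stage 2: rerank with AST metadata.

    Hypothetical rerank rule: add `name_bonus` when the chunk's symbol name
    (recorded from the AST at chunking time) appears as a word in the query.
    """
    # Stage 1: vector similarity over all chunks, keep the best candidates.
    scored = [(cosine(query_vec, c["vector"]), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    shortlist = scored[:n_candidates]

    # Stage 2: AST-level rerank of the shortlist.
    tokens = set(query_text.lower().split())
    reranked = []
    for score, chunk in shortlist:
        if chunk["name"].lower() in tokens:
            score += name_bonus
        reranked.append((score, chunk["name"]))
    reranked.sort(key=lambda pair: pair[0], reverse=True)
    return reranked[:k]


# Toy data: each chunk carries its AST-derived name and kind plus a vector.
CHUNKS = [
    {"name": "load", "kind": "function", "vector": [1.0, 0.0, 0.0]},
    {"name": "parse", "kind": "function", "vector": [0.9, 0.1, 0.0]},
    {"name": "save", "kind": "function", "vector": [0.0, 1.0, 0.0]},
]

results = hybrid_search([1.0, 0.0, 0.0], "how does parse work", CHUNKS)
names = [name for _, name in results]
print(names)
```

In this toy run `load` has the highest raw similarity, but `parse` moves to the top once the name bonus is applied, which shows how a structural signal can reorder results that are close in embedding space.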
Key Files
polaris/chunker/bridge.py
polaris/chunker/smart.py