1 change: 0 additions & 1 deletion .gitignore
Expand Up @@ -21,7 +21,6 @@ build/
*.iml
.idea/

blog.md

# Local config (use config.example.yaml as a template)
config.yaml
304 changes: 304 additions & 0 deletions blog.md
@@ -0,0 +1,304 @@
# Blog Post Plan: How semcode Builds a RAG System for Code Search

## Context

This blog post explains the RAG (retrieval-augmented generation) pipeline behind
[**semcode**](https://github.com/GoodbyePlanet/semcode), an MCP server that does
semantic code search across your GitHub repositories. It covers both parts of the pipeline. The **ingestion** side
covers how repositories are found, how code is parsed into symbols with Tree-sitter, how the dense and sparse embedding
inputs are constructed, and how points land in Qdrant incrementally. The **retrieval** side covers how queries are
encoded into both dense and sparse vectors and fused server-side with RRF (Reciprocal Rank Fusion). Along the way we'll
cover why a hybrid dense+sparse approach beats either one alone for code, and why the *payload* stored next to each
vector matters as much as the vector itself.

Audience: engineers familiar with RAG, embeddings, and vector DBs, curious about applying RAG to source code
specifically (not prose).

---

## Section 1 — Why RAG for code is different from RAG for documents

Most RAG systems are built around prose: PDFs, internal documentation, wikis. The content is natural language written
for humans, and meaning is carried in sentences. Semantic search over plain text works well, and adding a second-stage
retriever (a reranker) gives you a system that can answer questions with high confidence.

Source code is different: it is structured and symbolic, written for compilers and interpreters. Meaning is
distributed across structure, not sentences:

- A function name (retryWithBackoff) carries intent
- The signature (attempts: int, delay_ms: int) carries contract
- The body carries implementation details
- Annotations (@Retryable, @CircuitBreaker) carry framework behavior
- The class it belongs to (OrderProcessingService) carries domain context

None of that is a sentence. You can't chunk code by paragraph — you chunk by symbol (function, class, method).
Let's see how that is implemented in **semcode**.

---

## Section 2 — From source files to Code Symbols - Tree-sitter parsing

What is an AST?

An Abstract Syntax Tree is a tree representation of the grammatical structure of source code: its logical parts and
how they relate to each other. Every construct in your code (a function definition, a class, an if statement, a
variable assignment) becomes a node in the tree, where parent-child relationships express nesting and ownership.

For clarity, below is a pruned AST, just to give you a mental model of how a parser sees
a function: a decorated async definition with typed parameters, a return annotation, and a body containing a
docstring and a single return.

```text
@app.get("/users")
async def list_users(db: Session) -> list[User]:
    """Return all users."""
    return db.query(User).all()

module
└── decorated_definition
    ├── decorator → "@app.get("/users")"
    └── function_definition
        ├── name → "list_users"
        ├── parameters → "(db: Session)"
        ├── return_type → "list[User]"
        └── body
            ├── expression_statement
            │   └── string → '"""Return all users."""'
            └── return_statement
```

What is Tree-sitter?

Tree-sitter is a parser generator tool and an incremental parsing library. It can build a concrete syntax tree for a
source file and efficiently update the syntax tree as the source file is edited.
[Tree-sitter official documentation](https://tree-sitter.github.io/tree-sitter/)
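
To make the "walk the tree, ask each node what it is" idea concrete without installing Tree-sitter bindings, here is a
minimal sketch that uses Python's built-in `ast` module as a stand-in. This is not what semcode does (it uses
Tree-sitter, which works across languages); the function name `describe_functions` and the dictionary keys are
assumptions chosen for illustration.

```python
import ast

SOURCE = '''\
@app.get("/users")
async def list_users(db: Session) -> list[User]:
    """Return all users."""
    return db.query(User).all()
'''


def describe_functions(source: str) -> list[dict]:
    """Walk the syntax tree and collect basic facts about each function node."""
    tree = ast.parse(source)
    found = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            found.append({
                "name": node.name,
                "is_async": isinstance(node, ast.AsyncFunctionDef),
                # ast.unparse turns a sub-tree back into source text
                "decorators": [ast.unparse(d) for d in node.decorator_list],
                "return_type": ast.unparse(node.returns) if node.returns else None,
                "docstring": ast.get_docstring(node),
                # lineno points at the `def` line, not at the decorator
                "start_line": node.lineno,
                "end_line": node.end_lineno,
            })
    return found


for info in describe_functions(SOURCE):
    print(info)
```

The same pattern (visit nodes, filter by node type, read name and position) carries over to a Tree-sitter tree, only
with grammar-specific node type names instead of Python `ast` classes.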

What is a Code Symbol in **semcode**?

A symbol is one named, self-contained unit of code that a language considers meaningful — a function, a class, a method,
an interface, a React component, a hook... In **semcode** a symbol is a CodeSymbol dataclass,
which captures everything needed to search, understand, and locate it without reading the surrounding file.

What a `CodeSymbol` carries:

**name / symbol_type / language**: Together these identify what kind of thing this is (for example `save`,
`method`, `java`), so retrieval can filter by language or type before even looking at embeddings.

**signature**: The declaration line only, e.g. *def save(self, db: Session) -> User*. This is what you'd see in an
IDE's autocomplete popup, compact enough to show in search results without including the full body.

**source**: The complete raw text of the symbol, from the start of its declaration to the end of its body. This is the
text that feeds the embedding input, giving the model the implementation context, and it is returned with the chunk
when it is retrieved.

**start_line / end_line**: Position recorded by Tree-sitter during parsing, used to link a search result back
to an exact location in the file.

**parent_name / package**: Structural context. **parent_name** says which class owns this method; **package** says
which Java package or Python module the file belongs to. Without these, two methods both named `save` in different
services are indistinguishable.

**annotations / extras**: Language-specific enrichment. A Java `@GetMapping("/users")` lands in annotations; the
extracted HTTP route string (`GET /users`) lands in extras. For TypeScript, extras flags whether a component uses
hooks, or whether a function matches the React component signature pattern.

Example:

```python
CodeSymbol(
    name="list_users",
    symbol_type="api_route",
    language="python",
    source="async def list_users(db: Session) -> list[User]:\n ...",
    file_path="auth-service/routers/users.py",
    start_line=2,
    end_line=4,
    parent_name=None,
    package="auth-service.routers.users",
    annotations=["app.get(\"/users\")"],
    signature="async def list_users(db: Session) -> list[User]",
    docstring='"""Return all users."""',
    extras={"is_async": True, "http_method": "GET", "http_route": "/users"},
)
```
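
For readers who want to see the shape of the container itself, here is a rough sketch of what such a dataclass could
look like. The field names come from the example above; the types, the defaults, and the `qualified_name` helper are
assumptions for illustration and may not match semcode's actual definition.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class CodeSymbol:
    # Field names follow the example above; types and defaults are guesses.
    name: str
    symbol_type: str
    language: str
    source: str
    file_path: str
    start_line: int
    end_line: int
    parent_name: Optional[str] = None
    package: Optional[str] = None
    annotations: list[str] = field(default_factory=list)
    signature: str = ""
    docstring: Optional[str] = None
    extras: dict[str, Any] = field(default_factory=dict)

    @property
    def qualified_name(self) -> str:
        """Hypothetical helper: package + parent + name, skipping missing parts."""
        parts = [p for p in (self.package, self.parent_name, self.name) if p]
        return ".".join(parts)


order_save = CodeSymbol("save", "method", "java", "...", "OrderService.java", 10, 20,
                        parent_name="OrderService", package="com.shop.orders")
user_save = CodeSymbol("save", "method", "java", "...", "UserService.java", 5, 12,
                       parent_name="UserService", package="com.shop.users")
print(order_save.qualified_name)  # com.shop.orders.OrderService.save
print(user_save.qualified_name)   # com.shop.users.UserService.save
```

The two `save` methods share a name, type, and language; only the structural context tells them apart, which is the
point made about `parent_name` and `package` above.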

So the full pipeline is:
Tree-sitter parses code into an AST. The parser goes through that AST node by node, asks each node where it starts and
ends and what it contains, and puts all of that into a **CodeSymbol**: one symbol per meaningful language construct.

---

## Section 3 — Building the embedding input

Now that we know what a **CodeSymbol** is, we can build the inputs for the vector database. **semcode** uses
[Qdrant](https://qdrant.tech/) to store vectors, and each symbol produces two types of input: dense and sparse.

What are dense embeddings?

**Dense embeddings** encode the *meaning* of text into a fixed-size vector of floating-point numbers — typically
hundreds or thousands of dimensions depending on which embedding provider is chosen. Two pieces of text that express the
same idea will land close together in that vector space even if they share no words in common. For code search this
means a query like "find the method that handles payment retries" can surface `retryWithBackoff()`
without those words appearing anywhere in the source.

```shell
dense = [0.2, 0.3, 0.5, 0.7, ...] # several hundred floats
```
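
"Close together" is usually measured with cosine similarity. The sketch below uses made-up three-dimensional vectors
purely for illustration; real embeddings have hundreds of dimensions and come from a model, and the variable names
here are hypothetical.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, invented for this example (not produced by any model).
query_vec = [0.9, 0.1, 0.0]    # "find the method that handles payment retries"
retry_vec = [0.8, 0.2, 0.1]    # retryWithBackoff()
render_vec = [0.0, 0.1, 0.9]   # an unrelated rendering function

print(cosine_similarity(query_vec, retry_vec) > cosine_similarity(query_vec, render_vec))
```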

What are sparse embeddings?

**Sparse embeddings** work the opposite way: instead of capturing meaning, they represent text as a large vocabulary
vector where almost every entry is zero and only the terms that actually appear get a non-zero weight. BM25 is the
algorithm behind this — it scores each token by how often it appears in a document relative to how common it
is across the whole corpus. This makes sparse embeddings excellent at exact keyword matching: if you search for
`PlaceOrderRequest` or `@Transactional`, BM25 will find every document that contains those tokens precisely.

```shell
# Taken from Qdrant docs
sparse = [{331: 0.5}, {14136: 0.7}] # only the non-zero entries are stored (two key-value pairs shown)
# The numbers 331 and 14136 map to specific tokens in the vocabulary e.g. ['Transactional', 'PlaceOrderRequest'].
# Every other vocabulary entry is zero. This is why it's called a sparse vector.
```
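
To show where those weights come from, here is a small sketch of textbook Okapi BM25 scoring in pure Python. It is an
illustration of the formula only: the function name `bm25_scores`, the toy corpus, and the parameter defaults
(`k1=1.2`, `b=0.75`) are assumptions, and fastembed's `Qdrant/bm25` model may differ in its details.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs, k1=1.2, b=0.75):
    """Score every tokenized document in `docs` against the query tokens."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(t for d in docs for t in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            # Rare terms get a high idf, common terms a low one.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency, dampened and normalized by document length.
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


docs = [
    ["@Transactional", "public", "void", "placeOrder", "PlaceOrderRequest"],
    ["public", "void", "cancelOrder", "CancelOrderRequest"],
    ["public", "String", "toString"],
]
print(bm25_scores(["PlaceOrderRequest"], docs))
```

Only the first document contains the query token, so it is the only one with a non-zero score. A common token such as
`public` would match all three documents but contribute very little, because its idf is low.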

How does **semcode** build the dense input?

The whole `CodeSymbol` object is not embedded directly. It is first serialized into a single text string, and that
string is what the embedding model sees. One symbol produces one string, which produces one vector: an array of
floating-point numbers (e.g. 768 or 3072 floats depending on the provider). The `CodeSymbol` fields that carry
*meaning* go into that string.

The string starts with a human-readable preamble that names the language, symbol type, parent class, and owning
service, then layers in framework-specific metadata (Spring stereotypes, HTTP method and route, annotations), followed
by a truncated docstring and the full signature. Finally, the raw source body is appended, capped at ~6,000 characters
(~1,500 tokens). The goal is to give the embedding model everything it would need to understand the symbol's role, not
just its implementation.

The fields that are useful for *displaying or filtering* results (such as `start_line`, `file_path`, `parent_name`,
and `package`) are stored separately as the Qdrant **payload**: they sit next to the vector and come back with every
search hit, independent of what went into the embedded string.
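
A serialization step along these lines could look like the sketch below. The function name `build_dense_input`, the
exact line format, and the 300-character docstring cap are assumptions made for illustration; only the ordering of
the parts and the ~6,000-character source cap come from the description above.

```python
MAX_SOURCE_CHARS = 6000      # source cap described in the text
MAX_DOCSTRING_CHARS = 300    # assumed value; the text only says "truncated"


def build_dense_input(symbol: dict, service: str) -> str:
    """Serialize a symbol (given as a plain dict here) into one embedding string."""
    owner = f" in {symbol['parent_name']}" if symbol.get("parent_name") else ""
    lines = [
        f"{symbol['language']} {symbol['symbol_type']} {symbol['name']}{owner} (service: {service})"
    ]
    extras = symbol.get("extras", {})
    if extras.get("http_method") and extras.get("http_route"):
        lines.append(f"HTTP: {extras['http_method']} {extras['http_route']}")
    if symbol.get("annotations"):
        lines.append("Annotations: " + ", ".join(symbol["annotations"]))
    if symbol.get("docstring"):
        lines.append("Doc: " + symbol["docstring"][:MAX_DOCSTRING_CHARS])
    lines.append("Signature: " + symbol["signature"])
    # Raw source goes last and is truncated so huge symbols stay within budget.
    lines.append(symbol["source"][:MAX_SOURCE_CHARS])
    return "\n".join(lines)


example = {
    "name": "list_users",
    "symbol_type": "api_route",
    "language": "python",
    "parent_name": None,
    "annotations": ['app.get("/users")'],
    "signature": "async def list_users(db: Session) -> list[User]",
    "docstring": "Return all users.",
    "source": "async def list_users(db: Session) -> list[User]:\n    return db.query(User).all()",
    "extras": {"http_method": "GET", "http_route": "/users"},
}
print(build_dense_input(example, service="auth-service"))
```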

How does **semcode** build the sparse input?

The BM25 text input is deliberately minimal: it concatenates only the signature, docstring, and raw source, with no
metadata. The tokenizer splits camelCase and snake_case identifiers into their component words while keeping the
original form alongside. A token like `PlaceOrderRequest` is kept and also expanded to `Place Order Request`, so BM25
can match the exact identifier *and* a natural-language query like "place order request" that doesn't use the original
casing.
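
The identifier expansion can be sketched with two regular expressions. This is an illustrative approximation, not
semcode's tokenizer: the function name `expand_identifiers` and both patterns are assumptions, and the sketch drops
punctuation (including `@`) for simplicity.

```python
import re

# An identifier: letters, digits, underscores, not starting with a digit.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
# Word parts inside an identifier: acronym runs, capitalized or lowercase words, digits.
_PARTS = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def expand_identifiers(text: str) -> str:
    """Keep each identifier and append its camelCase / snake_case parts."""
    out = []
    for token in _IDENTIFIER.findall(text):
        out.append(token)
        parts = _PARTS.findall(token)
        if len(parts) > 1:  # only expand when there is something to split
            out.extend(parts)
    return " ".join(out)


print(expand_identifiers("PlaceOrderRequest"))   # PlaceOrderRequest Place Order Request
print(expand_identifiers("retry_with_backoff"))  # retry_with_backoff retry with backoff
```

Lowercasing is left to the BM25 tokenizer in this sketch, so the split parts keep their original casing.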

So the full picture is:
Every `CodeSymbol` produces two inputs. The dense input is wide and context-rich — it tells the model the symbol's
place in the system. The sparse input is narrow and literal — it gives BM25 the exact tokens to match against. Both
are computed in the same pipeline step and stored together as a single point in Qdrant.

---

## Section 4 — The sparse side: BM25 with code-aware tokenization

- BM25 input is intentionally coarser: signature + docstring + source only
  - Reference: `server/indexer/pipeline.py:94-101`
- Identifier expansion: `CamelCase` and `snake_case` are split so BM25 can match partial queries
  - Both original and split forms kept → "PlaceOrderRequest" matches exact lookups *and* "place order"
  - Reference: `server/embeddings/code_tokenizer.py:6-16`
- Implementation: fastembed's `Bm25("Qdrant/bm25")`, stored as a native sparse vector in Qdrant
  - Reference: `server/embeddings/bm25.py`
- What BM25 solves that dense doesn't:
  - Exact symbol-name lookups
  - Rare tokens (vocabulary mismatch: domain jargon, project-specific names)
  - Queries that are *literal* references rather than intent descriptions

---

## Section 5 — The dense side: pluggable embedding providers

- Five providers, all behind one interface: Jina API (hosted), self-hosted Jina via TEI, OpenAI, Voyage, Ollama
  - Reference: `server/embeddings/{jina_api,jina,openai,voyage,ollama}.py`
- Why pluggable matters for code: dimensions vary (768 → 3072), code-tuned models (jina-code-embeddings, voyage-code-3)
  outperform general-purpose ones
- Optional callout: the factory pattern refactor (commit `cd778ee`), in which each provider self-registers on import,
  so adding a new one doesn't touch `factory.py` (OCP)
  - Reference: `server/embeddings/__init__.py`, `server/embeddings/factory.py`
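
A self-registering provider factory can be sketched in a few lines. Everything named here (`EmbeddingProvider`,
`register_provider`, `create_provider`, `FakeProvider`) is hypothetical and only illustrates the pattern; the real
code in `server/embeddings/` may be organized differently.

```python
from typing import Callable, Protocol


class EmbeddingProvider(Protocol):
    """The single interface every provider is expected to satisfy (assumed shape)."""
    dimensions: int

    def embed(self, texts: list[str]) -> list[list[float]]: ...


_REGISTRY: dict[str, Callable[[], EmbeddingProvider]] = {}


def register_provider(name: str):
    """Class decorator: importing a provider module is enough to register it."""
    def decorator(cls):
        _REGISTRY[name] = cls
        return cls
    return decorator


def create_provider(name: str) -> EmbeddingProvider:
    if name not in _REGISTRY:
        raise ValueError(f"Unknown embedding provider {name!r}; known: {sorted(_REGISTRY)}")
    return _REGISTRY[name]()


@register_provider("fake")
class FakeProvider:
    """Stand-in provider so the sketch runs without any API key."""
    dimensions = 4

    def embed(self, texts: list[str]) -> list[list[float]]:
        return [[float(len(t))] * self.dimensions for t in texts]


provider = create_provider("fake")
print(provider.embed(["ab"]))  # [[2.0, 2.0, 2.0, 2.0]]
```

The factory function never changes when a provider is added: a new module decorates its class and gets imported, which
is the open/closed property the bullet above refers to.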

---

## Section 6 — What goes into Qdrant: the named-vector schema

- One collection (`code_symbols`) with **two named vectors per point**:
  - `text-dense`: cosine, provider-dependent dims
  - `text-sparse`: Qdrant native BM25 sparse index
  - Reference: `server/store/qdrant.py:47-62`
- The payload (the underappreciated half of every vector DB):
  - Identity: `symbol_name`, `symbol_type`, `language`, `service`, `file_path`, `package`, `parent_name`
  - Display: `signature`, `source`, `docstring`, `start_line`, `end_line`
  - Filtering: `annotations`, `chunk_tier`, framework `extras` (HTTP method, route, Spring stereotype)
  - Bookkeeping: `file_hash` (for incremental reindex), `indexed_at`
  - Reference: `server/indexer/pipeline.py:104-125`
- Keyword payload indexes on the most frequently filtered fields → fast `language=python AND service=catalog` style
  filters
- Separate `git_commits` collection: dense-only, message + diff metadata
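
To visualize the anatomy of one point, here is a sketch that assembles it as a plain dictionary: an id, two named
vectors, and a payload. The vector names and payload keys follow the list above (only a subset is shown); the
function name `build_point` and the deterministic UUID scheme for the id are assumptions, not necessarily how semcode
assigns ids.

```python
import uuid
from datetime import datetime, timezone


def build_point(symbol: dict, service: str, file_hash: str,
                dense: list[float], sparse: dict[int, float]) -> dict:
    """Assemble one point: id + named vectors + payload (illustrative shape)."""
    # Assumed id scheme: stable across reindexes for the same symbol location.
    key = f"{service}/{symbol['file_path']}#{symbol['name']}:{symbol['start_line']}"
    return {
        "id": str(uuid.uuid5(uuid.NAMESPACE_URL, key)),
        "vector": {
            "text-dense": dense,
            # Sparse vectors are stored as parallel index / value lists.
            "text-sparse": {"indices": list(sparse), "values": list(sparse.values())},
        },
        "payload": {
            "symbol_name": symbol["name"],
            "symbol_type": symbol["symbol_type"],
            "language": symbol["language"],
            "service": service,
            "file_path": symbol["file_path"],
            "signature": symbol["signature"],
            "start_line": symbol["start_line"],
            "end_line": symbol["end_line"],
            "file_hash": file_hash,
            "indexed_at": datetime.now(timezone.utc).isoformat(),
        },
    }


symbol = {"name": "list_users", "symbol_type": "api_route", "language": "python",
          "file_path": "routers/users.py", "start_line": 2, "end_line": 4,
          "signature": "async def list_users(db: Session) -> list[User]"}
point = build_point(symbol, "auth-service", "abc123", [0.2, 0.3, 0.5], {331: 0.5, 14136: 0.7})
print(point["vector"]["text-sparse"])
```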

---

## Section 7 — Hybrid retrieval at query time (RRF in one Qdrant call)

- The query goes through *both* encoders: dense (full model) and sparse (tokenizer + BM25)
- One Qdrant `query_points` call does the fusion server-side:
  ```
  from qdrant_client.models import Fusion, FusionQuery, Prefetch

  # `client`, `dense_vec`, `sparse_vec`, and `K` come from the surrounding search code
  client.query_points(
      collection_name="code_symbols",
      prefetch=[
          Prefetch(query=dense_vec, using="text-dense", limit=K * 2),
          Prefetch(query=sparse_vec, using="text-sparse", limit=K * 2),
      ],
      query=FusionQuery(fusion=Fusion.RRF),
  )
  ```
  - Reference: `server/store/qdrant.py:203-223`
- How RRF works in one paragraph: each retriever returns a ranked list, and RRF scores each doc by `Σ 1/(k + rank_i)`,
  where `rank_i` is the doc's position in list `i` and `k` is a smoothing constant; ties broken by combined rank. No
  tuning of weights needed.
- Why this beats weighted sum: scale-free, doesn't depend on score calibration between dense cosine and BM25
  - Reference: `server/tools/search.py:20-78`
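
Qdrant performs the fusion server-side, but the arithmetic is small enough to sketch by hand. The version below is an
illustration only: the function name is hypothetical, `k=60` is the value commonly used in the RRF literature and may
not be Qdrant's default, and breaking ties by id is an assumption made to keep the output deterministic.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each doc scores sum(1 / (k + rank)) over the lists."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties fall back to the id for stable ordering.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


dense_hits = ["retryWithBackoff", "scheduleRetry", "PaymentClient"]
sparse_hits = ["PaymentClient", "retryWithBackoff", "parseConfig"]
print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
```

`retryWithBackoff` ranks first because it sits near the top of both lists, even though neither retriever's raw scores
were ever compared. Only ranks enter the formula, which is why no calibration between cosine similarity and BM25 is
needed.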

---

## Section 8 — Indexing flow: incremental, content-addressed

- Walk the repo (GitHub API or local), apply excludes
- For each file: compute blob SHA → compare against payload's `file_hash` → skip if unchanged
- Parse → build dense + sparse inputs → batch-embed → upsert (delete-then-insert per file path)
- Cleanup pass removes stale symbols for files no longer in the repo
- Reference: `server/indexer/pipeline.py:128-249`
- Why this matters: embedding API costs amortize across reindexes; large monorepos stay tractable
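
The skip-if-unchanged check can be sketched with the standard library alone. Git's blob SHA is the SHA-1 of a
`blob <size>\0` header followed by the file bytes, so it can be computed locally and compared with the stored
`file_hash`. The function names and the dictionary-based inputs below are assumptions for illustration, not semcode's
actual interfaces.

```python
import hashlib


def git_blob_sha(content: bytes) -> str:
    """Compute the same SHA-1 that git assigns to a file's blob object."""
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()


def plan_reindex(repo_files: dict[str, bytes],
                 indexed_hashes: dict[str, str]) -> tuple[list[str], list[str]]:
    """Return (paths to re-embed, stale paths to delete from the index)."""
    to_index = [
        path for path, content in repo_files.items()
        if indexed_hashes.get(path) != git_blob_sha(content)
    ]
    stale = [path for path in indexed_hashes if path not in repo_files]
    return sorted(to_index), sorted(stale)


repo = {"a.py": b"print('a')\n", "b.py": b"print('b v2')\n", "c.py": b"print('c')\n"}
index = {
    "a.py": git_blob_sha(b"print('a')\n"),   # unchanged, will be skipped
    "b.py": git_blob_sha(b"print('b')\n"),   # content changed since last index
    "old.py": git_blob_sha(b"gone\n"),       # file no longer in the repo
}
print(plan_reindex(repo, index))  # (['b.py', 'c.py'], ['old.py'])
```

Only `b.py` (changed) and `c.py` (new) would be parsed and embedded again, which is where the savings in embedding API
calls come from.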

---

## Section 9 — Bonus: indexing git history as a second RAG corpus

- Separate pipeline embeds **commit messages + file deltas** into the `git_commits` collection
- Dense-only (commit messages are short, sparse adds little)
- Enables "when was retry logic introduced?" style queries
- Reference: `server/indexer/git_history.py:24-63`, `server/tools/history.py`

---

## Section 10 — What I'd do differently / open questions

- Re-ranker on top of RRF (cross-encoder) — worth the latency?
- Per-language collections vs single collection — when does the trade-off flip?
- Embedding the *call graph* (cross-symbol relationships), not just symbols in isolation
- Tuning the 6000-char source cap per language

---

## Section 11 — Takeaways

- Symbol-level chunking + rich, language-aware embedding inputs are the foundation
- Hybrid dense+sparse with RRF gives you both "intent" and "exact name" search for free, server-side
- The payload is half the system — invest in it
- Incremental indexing via blob SHAs is what makes this affordable at repo scale

---

## Appendix — Suggested diagrams

1. Pipeline overview: file → Tree-sitter → `CodeSymbol` → dense input + sparse input → Qdrant
2. Qdrant point anatomy: two named vectors + payload fields, annotated
3. Query-time RRF: query → two encoders → two ranked lists → fused result

## Reference

https://qdrant.tech/articles/sparse-vectors/