An AI-powered assistant that ingests GitHub repositories, stores code embeddings in a vector database, and enables contextual Q&A over a codebase using a local LLM — fully local, no data leaves your machine.
The system implements a two-stage Retrieval-Augmented Generation (RAG) pipeline:
- Fetch Python files from a GitHub repository via the GitHub API (no cloning)
- Parse each file with AST to extract functions and classes with full signatures
- Embed each chunk with `all-MiniLM-L6-v2` (SentenceTransformer) and store in Qdrant
- On query: retrieve candidate chunks by vector similarity (bi-encoder), then rerank with a Cross-Encoder for higher precision
- Build a token-aware context window and generate an answer via a local Ollama LLM
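
The AST parsing step can be illustrated with a short sketch. This is not the project's chunker: the `Chunk` dataclass and the `chunk_source` helper are hypothetical names, and only module-level definitions plus class methods are collected, as the pipeline description suggests.

```python
import ast
from dataclasses import dataclass


@dataclass
class Chunk:
    symbol: str     # e.g. "Greeter.greet"
    kind: str       # "function" or "class"
    signature: str  # reconstructed definition header
    source: str     # source text of the definition


def _signature(node: ast.AST) -> str:
    """Rebuild a header, keeping annotations, *args/**kwargs, and base classes."""
    if isinstance(node, ast.ClassDef):
        bases = ", ".join(ast.unparse(b) for b in node.bases)
        return f"class {node.name}({bases})" if bases else f"class {node.name}"
    prefix = "async def" if isinstance(node, ast.AsyncFunctionDef) else "def"
    returns = f" -> {ast.unparse(node.returns)}" if node.returns else ""
    return f"{prefix} {node.name}({ast.unparse(node.args)}){returns}"


def _chunk(node: ast.AST, symbol: str, kind: str, source: str) -> Chunk:
    text = ast.get_source_segment(source, node) or ""
    return Chunk(symbol, kind, _signature(node), text)


def chunk_source(source: str) -> list[Chunk]:
    """Collect module-level functions, classes, and the methods of each class."""
    funcs = (ast.FunctionDef, ast.AsyncFunctionDef)
    chunks: list[Chunk] = []
    for node in ast.parse(source).body:
        if isinstance(node, funcs):
            chunks.append(_chunk(node, node.name, "function", source))
        elif isinstance(node, ast.ClassDef):
            chunks.append(_chunk(node, node.name, "class", source))
            for item in node.body:
                if isinstance(item, funcs):
                    symbol = f"{node.name}.{item.name}"
                    chunks.append(_chunk(item, symbol, "function", source))
    return chunks
```

`ast.unparse()` requires Python 3.9 or newer. Calling `chunk_source()` on a file's text yields one chunk per symbol, each of which could then be embedded separately.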
| Component | Role |
|---|---|
| FastAPI | Async REST API |
| Qdrant | Vector database (cosine similarity search) |
| Ollama | Local LLM inference |
| SentenceTransformers | Bi-encoder embeddings (all-MiniLM-L6-v2, 384-dim) + Cross-Encoder reranking (ms-marco-MiniLM-L-6-v2) |
| aiohttp | Async GitHub API client |
| Docker Compose | Three-service local deployment |
- GitHub repo ingestion: fetches files via the GitHub API with async concurrency (semaphore=20, exponential backoff)
- AST-based chunking: extracts functions and classes using `ast.unparse()` for complete signatures, including type annotations, `*args`, `**kwargs`, and base classes
- Idempotent ingest: point IDs are deterministic SHA256 hashes of `(repo_id, file_path, symbol)`, so re-ingesting the same repo is a no-op; `force=true` triggers a full re-index
- Two-stage retrieval: vector search fetches `limit×3` candidates, a Cross-Encoder reranks them to `limit`; each result carries both `score` (cosine) and `rerank_score`
- Token-aware context: context is truncated to 8 000 characters (2 000 per chunk) before being sent to the LLM, avoiding context window overflows without a tokenizer dependency
- Query rewriting: optional LLM-powered query rewrite before retrieval (`adapt_user_query=true`)
- Observability: structured `key=value` log lines at every pipeline stage; `/health` and `/readiness` endpoints
- Graceful error handling: `VectorDBError` → 503, `LLMError` → 502, `PermanentGitHubError` (401/403) is not retried
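
The idempotent-ingest idea can be sketched as follows. The `point_id` helper, the separator, and the `owner/repo@branch` form of `repo_id` are assumptions made for illustration. Qdrant accepts unsigned integers or UUIDs as point IDs, so this sketch formats the first 128 bits of the SHA256 digest as a UUID; the project may derive its IDs differently.

```python
import hashlib
import uuid


def point_id(repo_id: str, file_path: str, symbol: str) -> str:
    """Derive a stable point ID from (repo_id, file_path, symbol)."""
    # A unit-separator character keeps ("a", "bc") distinct from ("ab", "c").
    key = "\x1f".join((repo_id, file_path, symbol))
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    # Format the first 32 hex characters (128 bits) as a UUID string.
    return str(uuid.UUID(hex=digest[:32]))
```

Because the ID depends only on the three inputs, upserting the same symbol twice overwrites the same point instead of creating a duplicate.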
GitHub API (HTTP)
↓
GitHubParser — async fetch, base64 decode, exponential backoff
↓
AST Chunker — ast.unparse() signatures, module-level + class methods
↓
SentenceTransformer (bi-encoder, 384-dim)
↓
Qdrant — deterministic point IDs, batched upsert (100 pts/batch)
↓
Vector Search — cosine similarity, score_threshold filter
↓
Cross-Encoder — ms-marco-MiniLM-L-6-v2 reranking
↓
Context Builder — char-based truncation (8k chars)
↓
Ollama LLM — local inference, 150s timeout
↓
Answer
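
The Context Builder stage in the diagram above can be approximated with a character budget. The following is a minimal sketch that assumes chunks arrive already ranked and are joined with a blank line; the `build_context` name and the separator are illustrative, while the 8 000 and 2 000 limits come from the feature list.

```python
MAX_CONTEXT_CHARS = 8_000  # total budget for the context
MAX_CHUNK_CHARS = 2_000    # cap applied to each chunk
SEPARATOR = "\n\n"


def build_context(
    chunks: list[str],
    max_total: int = MAX_CONTEXT_CHARS,
    max_chunk: int = MAX_CHUNK_CHARS,
) -> str:
    """Concatenate ranked chunks without exceeding the character budget."""
    parts: list[str] = []
    used = 0
    for text in chunks:
        sep_len = len(SEPARATOR) if parts else 0
        remaining = max_total - used - sep_len
        if remaining <= 0:
            break  # budget exhausted; lower-ranked chunks are dropped
        piece = text[:max_chunk][:remaining]
        parts.append(piece)
        used += sep_len + len(piece)
    return SEPARATOR.join(parts)
```

Counting characters rather than tokens is only an approximation of the model's context limit, but it avoids loading a tokenizer.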
cp .env.example .env # set GITHUB_TOKEN and optionally LLM_MODEL
make up-local # start FastAPI (:8000), Qdrant (:6333), Ollama (:11434)
make pull-model # pull the model specified in .env (once after first start)

To stop: `make down-local`
Fetches a GitHub repository, parses Python files with AST, embeds each chunk, and stores in Qdrant.
{ "owner": "tiangolo", "repo": "fastapi", "branch": "master", "force": false }

`force: true`: deletes existing vectors for this repo before re-indexing (handles renamed/deleted symbols)
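
The fetch phase of ingestion is described as bounded concurrency with exponential backoff. A minimal sketch of that pattern with `asyncio` is shown below. The `fetch` callable, `PermanentError`, and the retry defaults are placeholders, not the project's `GitHubParser` API.

```python
import asyncio


class PermanentError(Exception):
    """Stand-in for a non-retryable failure such as HTTP 401/403."""


async def fetch_with_retry(fetch, path, sem, retries=4, base_delay=0.5):
    """Call fetch(path) under the semaphore, doubling the delay after each failure."""
    async with sem:  # the slot is held for all attempts, including backoff sleeps
        for attempt in range(retries):
            try:
                return await fetch(path)
            except PermanentError:
                raise  # retrying cannot fix an authorization failure
            except Exception:
                if attempt == retries - 1:
                    raise  # out of attempts: surface the last error
                await asyncio.sleep(base_delay * 2 ** attempt)


async def fetch_all(fetch, paths, concurrency=20, retries=4, base_delay=0.5):
    """Fetch every path with at most `concurrency` requests in flight."""
    sem = asyncio.Semaphore(concurrency)
    tasks = (fetch_with_retry(fetch, p, sem, retries, base_delay) for p in paths)
    return await asyncio.gather(*tasks)
```

Results come back in the same order as `paths`, since `asyncio.gather` preserves input order.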
Vector-searches the index and returns matching code chunks with similarity scores.
?query=how does dependency injection work&owner=tiangolo&repo=fastapi&branch=master&limit=5&score_threshold=0.3
Response includes items, total, and score_threshold_used.
Two-stage retrieval + LLM answer generation.
?query=how does dependency injection work&owner=tiangolo&repo=fastapi&adapt_user_query=false
`adapt_user_query=true`: rewrites the query via Ollama before retrieval
Response:
{ "answer": "...", "context_found": true }

`context_found: false`: no matching code was found in the vector DB; the LLM answered from general knowledge without a RAG context constraint
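
The retrieval behind this endpoint fetches `limit×3` candidates by cosine similarity and then reranks them down to `limit`. The sketch below shows that control flow in pure Python. The in-memory `index` stands in for Qdrant and `rerank_fn` stands in for the Cross-Encoder; both, along with the function names, are assumptions for illustration.

```python
import math
from typing import Callable


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def two_stage_search(
    query_vec: list[float],
    query_text: str,
    index: list[tuple[str, list[float]]],    # (chunk text, embedding) pairs
    rerank_fn: Callable[[str, str], float],  # scores a (query, chunk) pair
    limit: int = 5,
    score_threshold: float = 0.3,
) -> list[dict]:
    # Stage 1: cheap vector similarity, over-fetching limit * 3 candidates.
    hits = [{"text": t, "score": cosine(query_vec, v)} for t, v in index]
    hits = [h for h in hits if h["score"] >= score_threshold]
    candidates = sorted(hits, key=lambda h: h["score"], reverse=True)[: limit * 3]
    # Stage 2: score each (query, chunk) pair jointly and keep the best `limit`.
    for hit in candidates:
        hit["rerank_score"] = rerank_fn(query_text, hit["text"])
    return sorted(candidates, key=lambda h: h["rerank_score"], reverse=True)[:limit]
```

Over-fetching gives the reranker room to promote a chunk that the bi-encoder ranked lower, at the cost of scoring three times as many pairs.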
Same as /ask but streams tokens as they are generated (use curl -N).
The X-Context-Found: true/false response header signals whether vector DB results were used.
Liveness and readiness probes (readiness checks Qdrant connectivity).
All settings are in .env (see .env.example):
| Variable | Default | Description |
|---|---|---|
| `GITHUB_TOKEN` | (none) | GitHub personal access token (required) |
| `LLM_MODEL` | `qwen2.5-coder:1.5b` | Ollama model name |
| `QDRANT_HOST` | `qdrant` | Qdrant service hostname |
| `LLM_HOST` | `ollama` | Ollama service hostname |
| `RERANKER_ENABLED` | `true` | Enable Cross-Encoder reranking |
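
For illustration, the settings in the table could be read from the environment as sketched below. The `Settings` dataclass and `load_settings` function are hypothetical; the project may well use a settings library instead. Only the variable names and defaults are taken from the table.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    github_token: str
    llm_model: str
    qdrant_host: str
    llm_host: str
    reranker_enabled: bool


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read settings from a mapping (os.environ by default), applying defaults."""
    env = os.environ if env is None else env
    token = env.get("GITHUB_TOKEN")
    if not token:
        raise ValueError("GITHUB_TOKEN is required")
    flag = env.get("RERANKER_ENABLED", "true").strip().lower()
    return Settings(
        github_token=token,
        llm_model=env.get("LLM_MODEL", "qwen2.5-coder:1.5b"),
        qdrant_host=env.get("QDRANT_HOST", "qdrant"),
        llm_host=env.get("LLM_HOST", "ollama"),
        reranker_enabled=flag in {"1", "true", "yes"},
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests.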
make check # ruff lint + mypy type check + pytest
make format # auto-format with ruff
make test # run pytest only

GitHub Actions runs `make check` (lint → typecheck → tests) on every push and pull request to master. The workflow installs only `requirements-dev.txt`, with no runtime dependencies, since tests cover pure-Python utilities and mypy is configured with `ignore_missing_imports = true`.