Retrieval-Augmented Generation is the backbone of "chat with your documents" systems. This project implements one end to end, runs fully on-device for privacy, and measures whether it actually works.
It is a conversational assistant that answers questions about your documents (PDF, Word) through a complete RAG pipeline, with source citations and quantitative evaluation of answer quality.
Its defining feature is a dual LLM provider design: it runs 100% locally by default (embeddings, vector search and generation), so no data ever leaves your machine. This is a privacy-by-design approach well-suited to GDPR-sensitive documents, while a cloud provider (Google Gemini) can be switched on in one click when raw performance matters more than data locality.
AskMyDocs lets you query a document in natural language. You ask a question, it retrieves the relevant passages from the document and generates a sourced answer, without hallucinating, relying solely on the provided content.
The project implements the full RAG (Retrieval-Augmented Generation) chain, from document ingestion to answer generation, plus an evaluation harness to objectively measure its performance.
The standout feature is that you choose where inference happens, directly from the UI:
- Ollama (default, local): embeddings, vector search and the LLM all run on your machine. Documents, questions and answers never leave the device. This is the GDPR-compliant mode, and it's the default for a reason: the project's own test document is the GDPR regulation itself.
- Google Gemini (optional, cloud): a more powerful model for when answer quality matters more than data locality. A clear in-app warning reminds you that data is sent to a third party.
This isn't a hidden config flag: the trade-off between confidentiality and performance is surfaced to the user as an explicit, conscious choice, which is privacy by design in practice.
Architecturally, both providers sit behind a single interface. Swapping between them touches one module only (src/askmydocs/llm/); the rest of the pipeline is completely unaware of which backend answered. This loose coupling is the design decision I'm most proud of in this project.
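As a rough illustration of the single-interface idea, here is a minimal sketch of a provider selector. The name `generate_answer` comes from this README; the `PROVIDERS` registry, the stub backends and the exact signature are assumptions made for the example, not the project's actual code.

```python
from typing import Callable, Dict, List

# Hypothetical signature: a provider takes the question plus the retrieved
# passages and returns the answer text.
Provider = Callable[[str, List[str]], str]


def _ollama_stub(question: str, passages: List[str]) -> str:
    # Stand-in for a local model call; this sketch makes no network request.
    return f"[ollama] {question} ({len(passages)} passages)"


def _gemini_stub(question: str, passages: List[str]) -> str:
    # Stand-in for a cloud model call.
    return f"[gemini] {question} ({len(passages)} passages)"


# Hypothetical registry: adding a backend means adding one entry here.
PROVIDERS: Dict[str, Provider] = {"ollama": _ollama_stub, "gemini": _gemini_stub}


def generate_answer(question: str, passages: List[str], provider: str = "ollama") -> str:
    """Route the request to the selected backend; callers never import a backend directly."""
    try:
        backend = PROVIDERS[provider]
    except KeyError:
        raise ValueError(f"Unknown provider: {provider!r}") from None
    return backend(question, passages)
```

Under these assumptions, the rest of the pipeline only ever calls `generate_answer(question, passages, provider="gemini")`, so swapping backends does not ripple outward.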
- Ingestion of PDF and Word (.docx) documents
- Smart chunking with metadata preservation (source, page)
- Semantic search via multilingual embeddings (optimized for French)
- Dual LLM provider with one-click switching: Ollama (local, GDPR) or Gemini (cloud, performance)
- 100% local generation by default: no data leaves the machine
- Answers strictly grounded in the retrieved context (no hallucination)
- Source citations (document + page) for every answer
- Chat interface with history (Streamlit)
- Evaluation harness: annotated dataset, retrieval and generation metrics
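To make "chunking with metadata preservation" concrete, the sketch below splits page text into overlapping windows and tags each chunk with its source and page. It is a simplified stand-in: the project itself uses `langchain-text-splitters`, and the function name, parameters and default values here are hypothetical. The `min_length` filter mirrors the minimum-length filter described in the lessons learned.

```python
from typing import Dict, List, Tuple


def chunk_pages(
    pages: List[Tuple[int, str]],
    source: str,
    chunk_size: int = 500,
    overlap: int = 50,
    min_length: int = 20,
) -> List[Dict[str, object]]:
    """Split each page into overlapping windows, tagging every chunk with its origin."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: List[Dict[str, object]] = []
    for page_number, text in pages:
        text = text.strip()
        for start in range(0, len(text), step):
            piece = text[start:start + chunk_size].strip()
            # Skip near-empty chunks such as bare page numbers.
            if len(piece) < min_length:
                continue
            chunks.append({"text": piece, "source": source, "page": page_number})
    return chunks
```

Because every chunk carries `source` and `page`, a citation can be produced later without going back to the original file.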
Q: "What is the deadline to notify a personal data breach?"
A: Under Article 33 of the GDPR, a personal data breach must be notified to the supervisory authority without undue delay and, where feasible, no later than 72 hours after becoming aware of it [page 17].
Sources: RGPD.pdf, p.17
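One plausible way to obtain page-tagged answers like the example above is to prefix each retrieved chunk with its page number before it reaches the model, then build the sources line from the same metadata. The helper names and the chunk dictionary shape below are assumptions for illustration only.

```python
from typing import Dict, List


def build_context(chunks: List[Dict[str, object]]) -> str:
    """Concatenate retrieved chunks, prefixing each with a page tag the model can cite."""
    return "\n\n".join(f"[page {c['page']}] {c['text']}" for c in chunks)


def format_sources(chunks: List[Dict[str, object]]) -> str:
    """Build a deduplicated sources line, ordered by document then page."""
    seen = sorted({(str(c["source"]), int(c["page"])) for c in chunks})
    return "Sources: " + "; ".join(f"{src}, p.{page}" for src, page in seen)
```

Two chunks from the same page collapse into a single citation, which keeps the sources line short when retrieval returns neighbouring passages.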
    PDF / DOCX
       │
       ▼
    Loader ──────► Splitter ──────► Embedder ──────► ChromaDB
    (extraction)   (chunking)       (vectorization)  (vector storage)
                                                        │
                                                        ▼
                                Question ──► Semantic search (top-K)
                                                        │
                                                        ▼
                                        Generation via provider interface
                                             ┌──────────┴──────────┐
                                             ▼                     ▼
                                   Ollama (local, GDPR)     Gemini (cloud)
                                             └──────────┬──────────┘
                                                        ▼
                                             Answer + cited sources
The pipeline is exposed through two high-level functions in `rag.py`:
- `ingest(file_path)`: loads, chunks and indexes a document
- `ask(question, provider=...)`: retrieves the relevant passages and generates the answer with the chosen provider
The layered design keeps the LLM provider behind a single interface: switching between Ollama and Gemini only touches one module; the rest of the pipeline is untouched.
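The two entry points can be pictured with a toy, in-memory version of the pipeline. Everything below is a sketch under stated assumptions: a bag-of-words vector stands in for the sentence-transformers embedding, a plain list stands in for ChromaDB, `ingest` takes pre-chunked text rather than a file path, and `ask` receives the generation function as an argument instead of a provider name.

```python
import math
from collections import Counter
from typing import Callable, Dict, List

_INDEX: List[Dict[str, object]] = []  # toy in-memory stand-in for the vector store


def _embed(text: str) -> Counter:
    # Toy bag-of-words vector; not the embedding model the project uses.
    return Counter(text.lower().split())


def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def ingest(chunks: List[Dict[str, object]]) -> int:
    """Index pre-chunked text and return the index size."""
    for chunk in chunks:
        _INDEX.append({**chunk, "vector": _embed(str(chunk["text"]))})
    return len(_INDEX)


def ask(
    question: str,
    generate: Callable[[str, List[str]], str],
    top_k: int = 3,
) -> Dict[str, object]:
    """Retrieve the top_k most similar chunks, then delegate generation."""
    query = _embed(question)
    ranked = sorted(_INDEX, key=lambda c: _cosine(query, c["vector"]), reverse=True)[:top_k]
    answer = generate(question, [str(c["text"]) for c in ranked])
    return {"answer": answer, "pages": [c["page"] for c in ranked]}
```

The point of the shape is that `ask` knows nothing about the backend: any callable with the same signature can generate the answer.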
| Layer | Tool | Why this choice |
|---|---|---|
| Language | Python 3.11+ | Standard for ML/data work |
| Dependency management | uv | Fast, modern, reproducible locks |
| PDF / DOCX extraction | pypdf, python-docx | Lightweight, no system deps |
| Chunking | langchain-text-splitters | Robust recursive splitting |
| Embeddings | sentence-transformers (paraphrase-multilingual-MiniLM-L12-v2) | Local, free, multilingual (French) |
| Vector store | ChromaDB | Zero-config local persistence |
| LLM (default) | Ollama (llama3.2), local | 100% on-device, no data leaves the machine (GDPR) |
| LLM (optional) | Google Gemini | Cloud alternative when performance matters more than locality |
| UI | Streamlit | Fast Python-native UI |
| Evaluation | custom annotated dataset + custom metrics | Full control over what's measured |
| Tests | pytest | Coverage of core logic and edge cases |
The local model is configurable in one line (`OLLAMA_MODEL` in `config.py`). `llama3.2` is the default; switching to `mistral` improves French-language output at the cost of more RAM, exactly the kind of swap the loose-coupling design makes trivial.
- Python 3.11+ and uv
- ~6 GB of RAM to run a local 7B model (e.g. `mistral`) comfortably alongside the embedding model. Lighter models such as `llama3.2` (3B) work on less. On WSL, allocate memory explicitly in `.wslconfig` to avoid out-of-memory kills.
- (Optional) a Google AI Studio API key, only if you want to use the Gemini provider.
    # Clone the repository
    git clone https://github.com/liliandoublet/askmydocs.git
    cd askmydocs

    # Install dependencies with uv
    uv sync

    # Install Ollama (macOS / Linux)
    curl -fsSL https://ollama.com/install.sh | sh

    # Pull a model
    ollama pull llama3.2

    # Make sure the Ollama server is running (it listens on localhost:11434)
    ollama serve

No API key is required; everything runs locally.
For the optional Gemini provider, create a `.env` file and add your key:

    cp .env.example .env
    # Edit .env and add your Google AI Studio key

Get a free API key from Google AI Studio.

To launch the app:

    uv run streamlit run app.py

Upload a document in the sidebar, index it, pick your provider (Ollama or Gemini), then ask your questions.
    # Run the full pipeline on a document
    uv run python -m askmydocs.rag data/uploads/my_document.pdf "My question?"

The project includes an evaluation harness that measures the pipeline's performance on a dataset of annotated questions (with reference pages and expected keywords).

    uv run python -m askmydocs.eval.runner data/uploads/RGPD.pdf

The harness decouples retrieval from generation to pinpoint the source of any failure: a low hit rate points to a retrieval problem, while a good hit rate paired with refusals points to a prompt or generation problem.
| Metric | What it measures |
|---|---|
| Hit rate | Does retrieval find at least one relevant page? |
| Precision | What proportion of retrieved chunks is relevant? |
| Keyword recall | Does the generated answer contain the expected facts? |
| Refusal rate | How often the LLM responds "I can't find this" |
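The four metrics in the table above could be computed along these lines. This is a minimal sketch, not the contents of `metrics.py`: the function names, the page-based relevance check and the refusal marker phrases are all assumptions.

```python
from typing import Sequence, Set


def hit(retrieved_pages: Sequence[int], relevant_pages: Set[int]) -> bool:
    """True when at least one retrieved chunk comes from a relevant page."""
    return any(p in relevant_pages for p in retrieved_pages)


def precision(retrieved_pages: Sequence[int], relevant_pages: Set[int]) -> float:
    """Share of retrieved chunks whose page is relevant."""
    if not retrieved_pages:
        return 0.0
    return sum(p in relevant_pages for p in retrieved_pages) / len(retrieved_pages)


def keyword_recall(answer: str, expected_keywords: Sequence[str]) -> float:
    """Share of expected keywords found (case-insensitively) in the answer."""
    if not expected_keywords:
        return 1.0
    text = answer.lower()
    return sum(k.lower() in text for k in expected_keywords) / len(expected_keywords)


def is_refusal(
    answer: str,
    markers: Sequence[str] = ("i can't find", "i cannot find"),  # hypothetical phrases
) -> bool:
    """Flag answers in which the model declines to answer."""
    text = answer.lower()
    return any(m in text for m in markers)
```

Hit rate and refusal rate are then the averages of `hit` and `is_refusal` over the annotated questions. Keeping the first two functions free of any LLM call is what lets retrieval be scored independently of generation.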
Because the two providers share the exact same retrieval pipeline, retrieval metrics (hit rate, precision) are identical across them; only the generation metrics (keyword recall, refusal rate) reflect the model. The table below reports both so the comparison stays honest.
| Metric | Ollama (llama3.2, local) | Gemini (cloud) |
|---|---|---|
| Hit rate | 50.0% | TBD |
| Precision | 11.7% | TBD |
| Keyword recall | 58.3% | TBD |
| Refusal rate | 2/10 | TBD |
Reading the numbers: precision is expected to be low on this corpus. With only 1 to 2 relevant pages out of 88, even perfect retrieval cannot score high. The meaningful signals here are hit rate (is the right page found?) and keyword recall (is the answer factually correct?).
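The ceiling on precision can be made explicit with a little arithmetic. The sketch below assumes that each relevant page contributes at most one retrieved chunk and uses a hypothetical top-K of 5; neither value is stated in this README.

```python
def max_precision(top_k: int, relevant_available: int) -> float:
    """Upper bound on precision when at most `relevant_available` retrieved chunks can be relevant."""
    if top_k <= 0:
        raise ValueError("top_k must be positive")
    return min(top_k, relevant_available) / top_k


# With a hypothetical top_k of 5:
one_relevant = max_precision(5, 1)   # 1 relevant chunk out of 5 retrieved
two_relevant = max_precision(5, 2)   # 2 relevant chunks out of 5 retrieved
```

Under those assumptions, even a flawless retriever tops out at 20% to 40% precision, which is why hit rate is the more informative retrieval signal on this corpus.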
The local trade-off, measured honestly: running fully on-device is not free. A small local model is slower (tens of seconds per answer on CPU) and produces rougher French than a frontier cloud model. The dual-provider design exists precisely to make that trade-off a deliberate choice rather than a hidden cost: privacy by default, performance on demand.
A few real problems solved while building this, and what they taught me:
- Embedding model / language mismatch: the initial English-centric model poorly separated French text (a discriminative gap of only ~0.08 between related and unrelated sentence pairs). Diagnosing this and switching to a multilingual model more than doubled the gap. Lesson: match the embedding model to the language of your corpus, and measure it rather than assume.
- Extraction noise leading to index pollution: figure-heavy pages produced 1-character chunks (bare page numbers) that polluted the vector index and surfaced as irrelevant top results. Added a minimum-length filter, covered by a unit test. Lesson: in RAG, ingestion quality matters as much as the model. Garbage in, garbage out.
- External API resilience: the cloud LLM API intermittently returned 429 (rate limit) and 503 (overload) responses during batch evaluation. Added retry-with-backoff covering both, plus graceful degradation so a single failure doesn't discard the whole run. This also motivated the dual-provider design: the local Ollama backend removes the external dependency entirely.
- Loose coupling that paid off: the original project ran only on Gemini. Adding a fully local provider meant writing one new module behind a shared `generate_answer` interface; `rag.py`, the UI and the eval harness were never touched. When an architecture decision lets you swap a core component by adding a single file, you know the seams are in the right place.
- Operational constraints are real: running a 7B model locally exposed hard RAM limits (OOM kills under WSL's default memory allocation). Diagnosing it with `free -h` and raising the WSL memory/swap budget fixed it. Local inference isn't just a code decision; it's a resource-management one.
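The retry-with-backoff and graceful-degradation pattern from the API resilience lesson above might look roughly like this. `ApiError`, the attempt count and the delay schedule are hypothetical; a real client library raises its own exception types, which would need to be mapped to status codes.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class ApiError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


RETRYABLE = {429, 503}  # rate limit and overload, as described above


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on 429/503 with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ApiError as err:
            last_attempt = attempt == max_attempts - 1
            if err.status not in RETRYABLE or last_attempt:
                raise
            sleep(base_delay * 2 ** attempt)
    raise RuntimeError("max_attempts must be >= 1")


def safe_call(
    fn: Callable[[], T],
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[T]:
    """Graceful degradation: a failed item becomes None instead of aborting the batch."""
    try:
        return call_with_retry(fn, sleep=sleep)
    except ApiError:
        return None
```

Injecting `sleep` keeps the function testable without real waiting, and returning `None` from `safe_call` lets a batch evaluation record one failed question and carry on.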
The notebook notebooks/01_exploration.ipynb projects the corpus embedding space into 2D (PCA). It shows the thematic clustering of chunks and the position of a query among the relevant passages.
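A numeric complement to the 2D projection is the discriminative gap mentioned in the lessons learned. The sketch below assumes the gap is defined as the mean cosine similarity of related pairs minus that of unrelated pairs; the README does not spell out the exact definition, and the vectors are expected to come from whichever embedding model is being evaluated.

```python
import math
from statistics import mean
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[Vector, Vector]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def discriminative_gap(related: List[Pair], unrelated: List[Pair]) -> float:
    """Mean similarity of related pairs minus mean similarity of unrelated pairs."""
    related_mean = mean(cosine(a, b) for a, b in related)
    unrelated_mean = mean(cosine(a, b) for a, b in unrelated)
    return related_mean - unrelated_mean
```

A gap close to zero means the model scores related and unrelated sentences almost alike, which is the symptom reported for the initial English-centric model on French text.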
    askmydocs/
    ├── src/askmydocs/
    │   ├── config.py           # Centralized configuration
    │   ├── types.py            # Shared types (TypedDict)
    │   ├── loader.py           # PDF / DOCX extraction
    │   ├── splitter.py         # Chunking
    │   ├── embedder.py         # Embedding generation
    │   ├── vectorstore.py      # Storage and search (ChromaDB)
    │   ├── rag.py              # Pipeline orchestration
    │   ├── llm/                # Answer generation (provider interface)
    │   │   ├── __init__.py     # Provider selector (routes to ollama / gemini)
    │   │   ├── ollama.py       # Local LLM (default, privacy / GDPR)
    │   │   ├── gemini.py       # Cloud LLM (optional, performance)
    │   │   └── prompt.py       # Shared system prompt used by both providers
    │   └── eval/               # Evaluation harness
    │       ├── metrics.py
    │       └── runner.py
    ├── notebooks/              # Exploration and visualization
    ├── tests/                  # Unit tests (pytest)
    ├── data/                   # Documents and evaluation dataset
    └── app.py                  # Streamlit interface
    uv run pytest -v

- Benchmark Gemini against Ollama on the full evaluation set (table above)
- Try `mistral` as the local model for better French output
- Result re-ranking with a cross-encoder
- Hybrid search (semantic + keyword / BM25)
- Expand the evaluation dataset
- OCR support for scanned PDFs
This project is licensed under the MIT License; see the LICENSE file for details.




