📚 AskMyDocs

Retrieval-Augmented Generation is the backbone of "chat with your documents" systems. This project implements one end to end, runs fully on-device for privacy, and measures whether it actually works.

A conversational assistant that answers questions about your documents (PDF, Word) through a complete RAG pipeline, with source citations and quantitative evaluation of answer quality.

Its defining feature is a dual LLM provider design: it runs 100% locally by default (embeddings, vector search and generation), so no data ever leaves your machine. This is a privacy-by-design approach well-suited to GDPR-sensitive documents, while a cloud provider (Google Gemini) can be switched on in one click when raw performance matters more than data locality.

[Screenshot: AskMyDocs interface]

🎥 Showcase

How to run Streamlit

Load your document

How it works

🎯 Overview

AskMyDocs lets you query a document in natural language. You ask a question; it retrieves the relevant passages from the document and generates a sourced answer that relies solely on the provided content, to avoid hallucination.

The project implements the full RAG (Retrieval-Augmented Generation) chain, from document ingestion to answer generation, plus an evaluation harness to objectively measure its performance.

🔀 The dual-provider design

The standout feature is that you choose where inference happens, directly from the UI:

  • 🔒 Ollama (default, local): embeddings, vector search and the LLM all run on your machine. Documents, questions and answers never leave the device. This is the GDPR-compliant mode, and it's the default for a reason: the project's own test document is the GDPR itself.
  • ☁️ Google Gemini (optional, cloud): a more powerful model for when answer quality matters more than data locality. A clear in-app warning reminds you that data is sent to a third party.

This isn't a hidden config flag: the trade-off between confidentiality and performance is surfaced to the user as an explicit, conscious choice. That is privacy by design in practice.

Architecturally, both providers sit behind a single interface. Swapping between them touches one module only (src/askmydocs/llm/); the rest of the pipeline is completely unaware of which backend answered. This loose coupling is the design decision I'm most proud of in this project.
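As a rough illustration of what "a single interface" can look like, here is a minimal sketch of a provider selector. The function signature, the `PROVIDERS` registry and the two stand-in backends are assumptions made for this example; the README only confirms that a shared `generate_answer` interface exists, not how it is written.

```python
from typing import Callable, Dict

# Hypothetical signature: (question, context) -> answer.
# The real parameters of generate_answer are not shown in this README.
GenerateFn = Callable[[str, str], str]


def _ollama_generate(question: str, context: str) -> str:
    """Stand-in for the local backend (no real model call is made here)."""
    return f"[ollama] {question}"


def _gemini_generate(question: str, context: str) -> str:
    """Stand-in for the cloud backend (no real API call is made here)."""
    return f"[gemini] {question}"


# Registry mapping a provider name to its backend function.
PROVIDERS: Dict[str, GenerateFn] = {
    "ollama": _ollama_generate,
    "gemini": _gemini_generate,
}


def generate_answer(question: str, context: str, provider: str = "ollama") -> str:
    """Route the request to the chosen backend; callers never import a backend."""
    try:
        backend = PROVIDERS[provider]
    except KeyError:
        raise ValueError(f"Unknown provider: {provider!r}") from None
    return backend(question, context)
```

With this shape, adding a third backend would mean writing one function and registering it; nothing that calls `generate_answer` would need to change.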

✨ Features

  • 📄 Ingestion of PDF and Word (.docx) documents
  • ✂️ Smart chunking with metadata preservation (source, page)
  • 🔍 Semantic search via multilingual embeddings (optimized for French)
  • 🔀 Dual LLM provider with one-click switching: Ollama (local, GDPR) or Gemini (cloud, performance)
  • 🔒 100% local generation by default: no data leaves the machine
  • 🤖 Answers strictly grounded in the retrieved context (no hallucination)
  • 📌 Source citations (document + page) for every answer
  • 💬 Chat interface with history (Streamlit)
  • 📊 Evaluation harness: annotated dataset, retrieval and generation metrics
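To make "chunking with metadata preservation" concrete, the sketch below cuts page text into overlapping windows and tags each chunk with its source file and page. It is not the project's splitter: the project uses langchain-text-splitters, while this example swaps in plain fixed-size windows to stay dependency-free. The function name, the parameters and the `min_chars` threshold are all hypothetical.

```python
from typing import Dict, List, Tuple, Union

Chunk = Dict[str, Union[str, int]]


def chunk_pages(
    pages: List[Tuple[int, str]],
    source: str,
    chunk_size: int = 500,
    overlap: int = 50,
    min_chars: int = 20,
) -> List[Chunk]:
    """Cut each page into overlapping windows, tagging every chunk with its origin."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks: List[Chunk] = []
    for page_number, text in pages:
        for start in range(0, len(text), step):
            piece = text[start:start + chunk_size].strip()
            # Drop near-empty chunks such as a bare page number.
            if len(piece) < min_chars:
                continue
            chunks.append({"text": piece, "source": source, "page": page_number})
    return chunks
```

Because every chunk carries `source` and `page`, a citation such as "RGPD.pdf, p.17" can be rebuilt from whichever chunks were retrieved.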

💬 Example

Q: "What is the deadline to notify a personal data breach?"

A: Under Article 33 of the GDPR, a personal data breach must be notified to the supervisory authority without undue delay and, where feasible, no later than 72 hours after becoming aware of it [page 17].

📄 Sources: RGPD.pdf, p. 17

🏗️ Architecture

PDF / DOCX
    │
    ▼
  Loader  ──────►  Splitter  ──────►  Embedder  ──────►  ChromaDB
(extraction)     (chunking)      (vectorization)    (vector storage)
                                                            │
                                                            ▼
                              Question ──► Semantic search (top-K)
                                                            │
                                                            ▼
                                  Generation via provider interface
                                       ┌──────────┴──────────┐
                                       ▼                     ▼
                               Ollama (local, GDPR)   Gemini (cloud)
                                       └──────────┬──────────┘
                                                  ▼
                                      Answer + cited sources

The pipeline is exposed through two high-level functions in rag.py:

  • ingest(file_path): loads, chunks and indexes a document
  • ask(question, provider=...): retrieves the relevant passages and generates the answer with the chosen provider

The layered design keeps the LLM provider behind a single interface: switching between Ollama and Gemini touches only one module, and the rest of the pipeline is untouched.
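The "Semantic search (top-K)" step in the diagram can be pictured with a small, dependency-free sketch. In the project this work is delegated to ChromaDB; the code below only illustrates the idea of ranking chunk vectors against a query vector. Cosine similarity is an assumption here, since the README does not say which distance metric the collection uses, and the function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    chunk_vecs: Sequence[Sequence[float]],
    k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (chunk index, score) pairs for the k most similar chunks, best first."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(chunk_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The returned indices would then be used to look up the chunk text and its page metadata before the prompt is assembled.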

🛠️ Tech Stack

| Layer | Tool | Why this choice |
| --- | --- | --- |
| Language | Python 3.11+ | Standard for ML/data work |
| Dependency management | uv | Fast, modern, reproducible locks |
| PDF / DOCX extraction | pypdf, python-docx | Lightweight, no system deps |
| Chunking | langchain-text-splitters | Robust recursive splitting |
| Embeddings | sentence-transformers (paraphrase-multilingual-MiniLM-L12-v2) | Local, free, multilingual (French) |
| Vector store | ChromaDB | Zero-config local persistence |
| LLM (default) | Ollama (llama3.2), local | 100% on-device, no data leaves the machine (GDPR) |
| LLM (optional) | Google Gemini | Cloud alternative when performance matters more than locality |
| UI | Streamlit | Fast Python-native UI |
| Evaluation | Custom annotated dataset + custom metrics | Full control over what's measured |
| Tests | pytest | Coverage of core logic and edge cases |

💡 The local model is configurable in one line (OLLAMA_MODEL in config.py). llama3.2 is the default; switching to mistral improves French-language output at the cost of more RAM, which is exactly the kind of swap the loose-coupling design makes trivial.

⚙️ Requirements

  • Python 3.11+ and uv
  • ~6 GB of RAM to run a local 7B model (e.g. mistral) comfortably alongside the embedding model. Lighter models such as llama3.2 (3B) work on less. On WSL, allocate memory explicitly in .wslconfig to avoid out-of-memory kills.
  • (Optional) a Google AI Studio API key, only if you want to use the Gemini provider.

🚀 Installation

# Clone the repository
git clone https://github.com/liliandoublet/askmydocs.git
cd askmydocs

# Install dependencies with uv
uv sync

Default: local LLM with Ollama (recommended)

# Install Ollama (macOS / Linux)
curl -fsSL https://ollama.com/install.sh | sh

# Pull a model
ollama pull llama3.2

# Make sure the Ollama server is running (it listens on localhost:11434)
ollama serve

No API key is required; everything runs locally.
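Before launching the app, it can help to confirm that something is actually listening on localhost:11434. The helper below is a small sketch, not part of the project: `ollama_is_up` is a hypothetical name, and it assumes the Ollama server answers a plain GET request on its root URL with a 200 status.

```python
import urllib.error
import urllib.request


def ollama_is_up(base_url: str = "http://localhost:11434", timeout: float = 2.0) -> bool:
    """Return True if an HTTP server answers at base_url with a 200 status."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure or an HTTP error status.
        return False


if __name__ == "__main__":
    print("Ollama reachable:", ollama_is_up())
```

If the check fails, starting the server with `ollama serve` and running it again is the first thing to try.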

Optional: cloud LLM with Gemini

cp .env.example .env
# Edit .env and add your Google AI Studio key

Get a free API key from Google AI Studio.

💻 Usage

Run the application

uv run streamlit run app.py

In the sidebar, upload a document, index it and pick your provider (Ollama or Gemini), then ask your questions.

Command line

# Run the full pipeline on a document
uv run python -m askmydocs.rag data/uploads/my_document.pdf "My question?"

📊 Evaluation

The project includes an evaluation harness that measures the pipeline's performance on a dataset of annotated questions (with reference pages and expected keywords).

uv run python -m askmydocs.eval.runner data/uploads/RGPD.pdf

Metrics measured

The harness decouples retrieval from generation, to precisely diagnose the source of any failure: a low hit rate points to a retrieval problem, while a good hit rate paired with refusals points to a prompt or generation problem.

| Metric | What it measures |
| --- | --- |
| Hit rate | Does retrieval find at least one relevant page? |
| Precision | What proportion of retrieved chunks is relevant? |
| Keyword recall | Does the generated answer contain the expected facts? |
| Refusal rate | How often the LLM responds "I can't find this" |
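The four metrics can be sketched per question as below. This is one plausible reading of the definitions in the table, not the code in eval/metrics.py: the function names, the page-based notion of relevance and the refusal marker string are assumptions made for illustration.

```python
from typing import Iterable, List, Set

# Hypothetical refusal phrase; the project's actual wording may differ.
REFUSAL_MARKER = "i can't find this"


def hit(retrieved_pages: Iterable[int], relevant_pages: Set[int]) -> bool:
    """True if at least one retrieved chunk comes from a reference page."""
    return any(page in relevant_pages for page in retrieved_pages)


def precision(retrieved_pages: List[int], relevant_pages: Set[int]) -> float:
    """Share of retrieved chunks whose page is a reference page."""
    if not retrieved_pages:
        return 0.0
    return sum(page in relevant_pages for page in retrieved_pages) / len(retrieved_pages)


def keyword_recall(answer: str, expected_keywords: List[str]) -> float:
    """Share of expected keywords found in the answer (case-insensitive)."""
    if not expected_keywords:
        return 0.0
    text = answer.lower()
    return sum(keyword.lower() in text for keyword in expected_keywords) / len(expected_keywords)


def is_refusal(answer: str) -> bool:
    """True if the answer contains the refusal marker."""
    return REFUSAL_MARKER in answer.lower()
```

Averaging `hit` and `is_refusal` over the dataset would give the hit rate and refusal rate, and averaging the other two would give precision and keyword recall.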

Results on the GDPR document (88 pages, 626 chunks)

Because the two providers share the exact same retrieval pipeline, retrieval metrics (hit rate, precision) are identical across them; only the generation metrics (keyword recall, refusal rate) reflect the model. The table below reports both so the comparison stays honest.

| Metric | Ollama (llama3.2, local) | Gemini (cloud) |
| --- | --- | --- |
| Hit rate | 50.0% | TBD |
| Precision | 11.7% | TBD |
| Keyword recall | 58.3% | TBD |
| Refusal rate | 2/10 | TBD |

Reading the numbers: precision is expected to be low on this corpus. With only 1 to 2 relevant pages out of 88, even perfect retrieval cannot score high. The meaningful signals here are hit rate (is the right page found?) and keyword recall (is the answer factually correct?).

The local trade-off, measured honestly: running fully on-device is not free. A small local model is slower (tens of seconds per answer on CPU) and produces rougher French than a frontier cloud model. The dual-provider design exists precisely to make that trade-off a deliberate choice rather than a hidden cost: privacy by default, performance on demand.

🧠 Engineering notes

A few real problems solved while building this, and what they taught me:

  • Embedding model / language mismatch: the initial English-centric model poorly separated French text (a discriminative gap of only ~0.08 between related and unrelated sentence pairs). Diagnosing this and switching to a multilingual model more than doubled the gap. Lesson: match the embedding model to the language of your corpus, and measure it rather than assume.
  • Extraction noise leading to index pollution: figure-heavy pages produced 1-character chunks (bare page numbers) that polluted the vector index and surfaced as irrelevant top results. Added a minimum-length filter, covered by a unit test. Lesson: in RAG, ingestion quality matters as much as the model. Garbage in, garbage out.
  • External API resilience: the cloud LLM API intermittently returned 429 (rate limit) and 503 (overload) responses during batch evaluation. Added retry-with-backoff covering both, plus graceful degradation so a single failure doesn't discard the whole run. This also motivated the dual-provider design: the local Ollama backend removes the external dependency entirely.
  • Loose coupling that paid off: the original project ran only on Gemini. Adding a fully local provider meant writing one new module behind a shared generate_answer interface; rag.py, the UI and the eval harness were never touched. When an architecture decision lets you swap a core component by adding a single file, you know the seams are in the right place.
  • Operational constraints are real: running a 7B model locally exposed hard RAM limits (OOM kills under WSL's default memory allocation). Diagnosing it with free -h and raising the WSL memory/swap budget fixed it. Local inference isn't just a code decision; it's a resource-management one.
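The retry-with-backoff idea from the API resilience note can be sketched as follows. This is a generic illustration, not the project's code: `TransientAPIError`, `call_with_backoff` and their parameters are hypothetical, and a real client library would raise its own exception types.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientAPIError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed call."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


# Rate limit and overload responses are worth retrying; other errors are not.
RETRYABLE_STATUSES = {429, 503}


def call_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on 429/503 with exponentially growing pauses."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientAPIError as error:
            last_attempt = attempt == max_attempts - 1
            if error.status not in RETRYABLE_STATUSES or last_attempt:
                raise
            # Exponential backoff (1x, 2x, 4x, ...) plus a little jitter.
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.1 * base_delay)
            sleep(delay)
    raise ValueError("max_attempts must be at least 1")
```

For graceful degradation, a batch loop could catch the exception that is finally re-raised and record that one question as failed instead of aborting the whole run.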

🔬 Embedding visualization

The notebook notebooks/01_exploration.ipynb projects the corpus embedding space into 2D (PCA). It shows the thematic clustering of chunks and the position of a query among the relevant passages.

[Figure: PCA visualization of embeddings]

📁 Project structure

askmydocs/
├── src/askmydocs/
│   ├── config.py          # Centralized configuration
│   ├── types.py           # Shared types (TypedDict)
│   ├── loader.py          # PDF / DOCX extraction
│   ├── splitter.py        # Chunking
│   ├── embedder.py        # Embedding generation
│   ├── vectorstore.py     # Storage and search (ChromaDB)
│   ├── rag.py             # Pipeline orchestration
│   ├── llm/               # Answer generation (provider interface)
│   │   ├── __init__.py    # Provider selector (routes to ollama / gemini)
│   │   ├── ollama.py      # Local LLM (default, privacy / GDPR)
│   │   ├── gemini.py      # Cloud LLM (optional, performance)
│   │   └── prompt.py      # Shared system prompt used by both providers
│   └── eval/              # Evaluation harness
│       ├── metrics.py
│       └── runner.py
├── notebooks/             # Exploration and visualization
├── tests/                 # Unit tests (pytest)
├── data/                  # Documents and evaluation dataset
└── app.py                 # Streamlit interface

🧪 Tests

uv run pytest -v

🔭 Possible improvements

  • Benchmark Gemini against Ollama on the full evaluation set (table above)
  • Try mistral as the local model for better French output
  • Result re-ranking with a cross-encoder
  • Hybrid search (semantic + keyword / BM25)
  • Expand the evaluation dataset
  • OCR support for scanned PDFs

📄 License

This project is licensed under the MIT License; see the LICENSE file for details.
