Internship project (Laboratoire MICS) on Retrieval-Augmented Generation (RAG) in the medical domain, combining:
- Visual document retrieval with ColPali / BiQwen2.5 (nomic-ai/nomic-embed-multimodal-3b),
- a generative LLM (Google Gemini) to answer multiple-choice questions,
- and the integration of a medical Knowledge Graph (the Mondo ontology) to enrich the context.
Evaluation is performed on a subset of the medical MIRAGE benchmark. The full report is in docs/internship_report.pdf and the slides in docs/presentation.pdf. References are in docs/CITATIONS.md.
Note: the report uses three corpora; this repository includes the code for corpus A (vidore/syntheticDocQA_healthcare_industry_test).
The pipeline supports four modes, selected via the KG_MODE constant in scripts/run_mirage.py:
| kg | Mode | Description |
|---|---|---|
| 1 | LLM only | LLM with chain-of-thought, no retrieval. |
| 2 | RAG only | Visual retrieval of relevant pages + LLM. |
| 3 | RAG + KG context | RAG + textual context extracted from the Knowledge Graph. |
| 4 | RAG + KG retrieve | The KG is also used to retrieve additional documents. |
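
As a rough illustration of the retrieval step in mode 2, page retrieval can be viewed as scoring every page embedding against the query embedding, dropping pages below a score threshold, and keeping at most k of the rest. The sketch below assumes single-vector embeddings compared by cosine similarity; the function name `retrieve_top_k` and the toy vectors are hypothetical and are not taken from src/kgcolpali/functions.py.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(
    query: Sequence[float],
    pages: Sequence[Sequence[float]],
    k: int = 3,
    threshold: float = 0.25,
) -> List[Tuple[int, float]]:
    """Return up to k (page_index, score) pairs with score >= threshold, best first."""
    scored = [(i, cosine(query, p)) for i, p in enumerate(pages)]
    kept = [(i, s) for i, s in scored if s >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return kept[:k]


# Toy 2-D embeddings, for illustration only.
pages = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
hits = retrieve_top_k([1.0, 0.0], pages, k=2, threshold=0.25)
print(hits)
```

With a threshold in place, a query can return fewer than k pages, or none at all, in which case the generator would presumably fall back to answering without retrieved context.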
Evaluated on a 1,000-question subset of MIRAGE (200 questions per dataset, fixed seed 42),
with Gemini 2.5 Flash-Lite as the generator backbone. The table reports the best
configuration of each family (mean accuracy); the full per-configuration sweep — score
thresholds, k, and the three retrieval corpora — is in Table I of
docs/internship_report.pdf.
| Configuration | MMLU | MedQA | MedMCQA | PubMedQA | BioASQ | Average |
|---|---|---|---|---|---|---|
| LLM only (CoT) | 0.870 | 0.730 | 0.635 | 0.500 | 0.840 | 0.715 |
| RAG + KG context (best, thr=0.2) | 0.860 | 0.730 | 0.645 | 0.490 | 0.850 | 0.715 |
| RAG only (best: Corpus C, k=5, thr=0.25) | 0.860 | 0.775 | 0.615 | 0.500 | 0.885 | 0.727 |
| RAG + KG retrieve (best, k=3) | 0.860 | 0.780 | 0.680 | 0.480 | 0.870 | 0.730 |
The full hybrid (RAG + KG retrieve) gives the best overall accuracy, 0.730 vs. 0.715 for the LLM alone. The gain is modest and concentrated on the clinical / exam datasets (MedQA 0.730 → 0.780, MedMCQA 0.635 → 0.680), while MMLU and PubMedQA dip slightly (0.870 → 0.860 and 0.500 → 0.480). PubMedQA stays the weak point across every configuration (~0.48–0.50), suggesting that a literature-oriented corpus (e.g. PubMed) would be needed there. Comparing the two KG modes, using the KG to retrieve extra evidence (0.730) helps more than only injecting KG text as context (0.715).
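
The per-dataset differences between the LLM-only row and the RAG + KG retrieve row can be recomputed directly from the table. The snippet below is only a convenience sketch; the numbers are copied from the table and the variable names are arbitrary.

```python
datasets = ["MMLU", "MedQA", "MedMCQA", "PubMedQA", "BioASQ"]
llm_only = [0.870, 0.730, 0.635, 0.500, 0.840]
rag_kg_retrieve = [0.860, 0.780, 0.680, 0.480, 0.870]

# Accuracy difference (hybrid minus LLM only), rounded to the table's precision.
deltas = {
    name: round(hybrid - base, 3)
    for name, base, hybrid in zip(datasets, llm_only, rag_kg_retrieve)
}

for name, delta in deltas.items():
    print(f"{name:9s} {delta:+.3f}")
```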
The evaluation parameters k, thresholdrag, and thresholdkg are exposed on the
CLI (see Usage) to run the per-configuration sweep.
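
One way to organize such a sweep is to generate one scripts/run_mirage.py command per parameter combination. The sketch below only builds the command lines from the flags documented in this README (--kg, --k, --thresholdrag, --thresholdkg); the helper `build_sweep_commands` and the grid values are hypothetical, not the grid used in the report.

```python
import itertools
import sys
from typing import Dict, Iterable, List


def build_sweep_commands(grid: Dict[str, Iterable]) -> List[List[str]]:
    """Build one scripts/run_mirage.py command per combination of flag values."""
    flags = list(grid)
    commands = []
    for values in itertools.product(*(grid[flag] for flag in flags)):
        cmd = [sys.executable, "scripts/run_mirage.py"]
        for flag, value in zip(flags, values):
            cmd += [f"--{flag}", str(value)]
        commands.append(cmd)
    return commands


# Hypothetical grid; the values are illustrative only.
grid = {"kg": [4], "k": [3, 5], "thresholdrag": [0.2, 0.25], "thresholdkg": [0.2]}
commands = build_sweep_commands(grid)

for cmd in commands:
    print(" ".join(cmd[1:]))
    # To launch a run from the repo root: subprocess.run(cmd, check=True)
```

Since predictions are written to prediction/<dataset>_predictions.json, successive runs presumably overwrite each other, so it is safer to evaluate (or copy the prediction folder) after each configuration.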
.
├── pyproject.toml # package metadata and dependencies
├── requirements.txt # pinned dependencies (alternative to pip install -e .)
├── README.md LICENSE .env.example .gitignore
├── src/kgcolpali/ # installable library package
│ ├── api.py # Gemini client, prompt registry, retry helper
│ ├── colpali.py # loads the BiQwen2.5 (ColPali) model and processor
│ ├── embeddings.py # loads the dataset and precomputed image embeddings
│ ├── functions.py # RAG logic: retrieval, image prep, `medrag_answer`
│ ├── kg.py # Mondo Knowledge Graph: entity linking, SPARQL, context
│ ├── templates.py # prompt templates (system + user) for each mode
│ └── utils.py # `QADataset` and answer-parsing utilities
├── scripts/ # entry points
│ ├── generate_embeddings.py # build and save the corpus image embeddings
│ ├── run_mirage.py # generate answers over the MIRAGE subset
│ └── evaluate.py # compute accuracy of the generated answers
└── docs/
    ├── internship_report.pdf
    ├── presentation.pdf
    └── CITATIONS.md
- Python 3.10+
- A CUDA GPU is recommended (the BiQwen2.5 model has ~3B parameters). There is a CPU fallback, but it will be slow.
- A Google AI Studio (Gemini) API key and a Hugging Face token.
git clone https://github.com/mlahozy21/Medical-RAG-Knowledge-Graphs.git
cd Medical-RAG-Knowledge-Graphs
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -e . # installs the kgcolpali package and its dependencies
`pip install -e .` reads the dependencies from pyproject.toml. Adjust the torch installation to your CUDA version following https://pytorch.org/get-started/locally/.
Keys are not hardcoded. Create a .env file in the project root (it is in .gitignore, so it is never committed):
GOOGLE_API_KEY=your_google_ai_studio_key
HF_TOKEN=your_hugging_face_token
A template is provided in .env.example.
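
For a quick check that both keys are visible to Python before launching a long run, a small stdlib-only loader is enough. This is a minimal sketch, assuming a plain KEY=VALUE file; the helpers `load_env_file` and `missing_keys` are hypothetical and do not describe how the kgcolpali package itself reads the keys.

```python
import os
from pathlib import Path


def load_env_file(path: str = ".env") -> dict:
    """Parse simple KEY=VALUE lines and export them to os.environ."""
    loaded = {}
    env_path = Path(path)
    if not env_path.exists():
        return loaded
    for line in env_path.read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue  # skip blanks, comments, and malformed lines
        key, value = line.split("=", 1)
        key, value = key.strip(), value.strip().strip('"').strip("'")
        os.environ.setdefault(key, value)  # existing environment variables win
        loaded[key] = value
    return loaded


def missing_keys(required=("GOOGLE_API_KEY", "HF_TOKEN")) -> list:
    """Return the required variable names that are unset or empty."""
    return [name for name in required if not os.environ.get(name)]


load_env_file()
print("missing keys:", missing_keys())
```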
These resources are heavy and/or external, so they are not versioned. Place them in the project root (the scripts resolve paths relative to the current working directory, so run them from the repo root):
- Document corpus: the vidore/syntheticDocQA_healthcare_industry_test dataset (Hugging Face), downloaded to ./syntheticDocQA_healthcare_industry_test.
- Image embeddings: image_embeddings.pt. If missing, generate it with scripts/generate_embeddings.py. Point src/kgcolpali/embeddings.py (colpali_embeddings_dir) to its location.
- Mondo ontology: mondo.nt (N-Triples). Download Mondo from https://mondo.monarchinitiative.org / https://github.com/monarch-initiative/mondo/releases and convert to N-Triples if needed (e.g. with robot convert or rdflib).
- MIRAGE benchmark: benchmark.json, available in the official MIRAGE repository (https://github.com/Teddy-XiongGZ/MIRAGE), used by kgcolpali.utils.QADataset.
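
Because the scripts resolve these paths relative to the working directory, a short pre-flight check can save a failed run. The sketch below assumes all four resources sit in the project root under the names listed above (image_embeddings.pt may live elsewhere if colpali_embeddings_dir points to another folder); the helper `missing_resources` is hypothetical.

```python
from pathlib import Path

# Resource names as listed in this README; assumed to sit in the project root.
REQUIRED = [
    "syntheticDocQA_healthcare_industry_test",
    "image_embeddings.pt",
    "mondo.nt",
    "benchmark.json",
]


def missing_resources(root: str = ".") -> list:
    """Return the required files or folders that do not exist under root."""
    base = Path(root)
    return [name for name in REQUIRED if not (base / name).exists()]


print("missing resources:", missing_resources())
```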
Run all commands from the repository root.
1. (Optional) Generate the corpus embeddings if you do not have image_embeddings.pt:
python scripts/generate_embeddings.py
2. Generate the predictions over the MIRAGE subset (200 questions per dataset, fixed seed = 42). The mode and the retrieval hyper-parameters (k, thresholdrag, thresholdkg) are configurable on the command line (defaults reproduce the best hybrid config):
# defaults: --kg 1 --k 3 --thresholdrag 0.25 --thresholdkg 0.2
python scripts/run_mirage.py --kg 4 --k 3 --thresholdrag 0.25 --thresholdkg 0.2
# sweep example: RAG-only over fewer datasets
python scripts/run_mirage.py --kg 2 --k 5 --datasets mmlu pubmedqa --n-samples 50
Predictions are saved to prediction/<dataset>_predictions.json.
3. Evaluate the accuracy:
python scripts/evaluate.py
It prints the mean accuracy per dataset (mmlu, medqa, medmcqa, pubmedqa, bioasq) and the overall mean.
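
The reported metric is plain accuracy: the fraction of questions whose predicted option matches the gold option, averaged per dataset and then across datasets. The sketch below illustrates that computation on toy data; it does not reproduce scripts/evaluate.py, and the in-memory format (lists of predicted/gold letter pairs) is an assumption, not the layout of the prediction files.

```python
from statistics import mean
from typing import Dict, List, Tuple


def accuracy(pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (predicted, gold) pairs that match, ignoring case and whitespace."""
    if not pairs:
        return 0.0
    return mean(
        1.0 if pred.strip().upper() == gold.strip().upper() else 0.0
        for pred, gold in pairs
    )


def summarize(results: Dict[str, List[Tuple[str, str]]]) -> Dict[str, float]:
    """Per-dataset accuracy plus the unweighted mean over datasets."""
    per_dataset = {name: accuracy(pairs) for name, pairs in results.items()}
    per_dataset["average"] = mean(per_dataset.values())
    return per_dataset


# Toy predictions, for illustration only.
toy = {
    "mmlu": [("A", "A"), ("B", "C")],
    "pubmedqa": [("a", "A"), ("B", "B")],
}
summary = summarize(toy)
print(summary)
```

The overall figure here is an unweighted mean over datasets, which coincides with the per-question mean when every dataset contributes the same number of questions, as in the 200-per-dataset subset.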
See docs/CITATIONS.md (ColPali, ViDoRe Benchmark V2, and MIRAGE).
Released under the MIT License — see LICENSE.