RAG-MCP Knowledge Agent: Semantic Retrieval System

Overview

This project implements a Retrieval-Augmented Generation (RAG) pipeline integrated with the Model Context Protocol (MCP), allowing an external Large Language Model (LLM) client (like Claude Desktop) to query a custom knowledge base of lecture materials (PDF slides) in real-time.

The architecture is built for traceability, speed, and modularity, using containerization (Docker) for state management and specialized Python libraries for high-performance vector search.

Technologies and Tech-Stack

Component	Technology	Rationale & Contribution
Orchestration	Python (AsyncIO) / `uv`	Manages the asynchronous data ingestion pipeline and serves the custom application logic. `uv` is used for ultra-fast dependency management.
Vector Embedding	Qwen3-Embedding-0.6B (`SentenceTransformer`)	Utilized a specialized, state-of-the-art embedding model to generate high-quality vector representations, ensuring superior semantic relevance over generic models.
Vector Database	ChromaDB (Dockerized)	Provides scalable, persistent storage for vector embeddings, original text chunks, and metadata (source tracing). Deployed in client-server mode for isolation.
API/Tooling	FastMCP (Model Context Protocol)	Implements a standardized interface, allowing any MCP-compliant LLM (e.g., Claude) to consume the custom RAG functionality as a callable tool (`get_documents_from_RAG`).
Observability	OpenTelemetry (OTEL) & Zipkin	Integrated tracing infrastructure to monitor request flow from the MCP proxy, through the Python server, to the ChromaDB database, aiding in performance analysis and debugging.
Deployment	Docker Compose	Used to manage and deploy the distributed service architecture (ChromaDB, OTEL Collector, Zipkin) within an isolated network.

Limitations of the System

Retrieval Latency: The RAG server must embed the query, run a vector search (in this case, ChromaDB), and fetch the top-K documents. For large indexes (millions of vectors), this can become a noticeable latency bottleneck.
Semantic Fragmentation: Concepts split across multiple slides/chunks can break continuity. The retriever may return only part of a section (e.g., the start of a conclusion) and miss follow-up content, leading to incomplete/improvised answers.
Context Irrelevance ("Garbage In"): The retrieval system may return documents that are technically similar by vector distance but semantically irrelevant to the user's question. If the LLM is fed irrelevant context, it can become distracted, leading to poor or generalized answers.
Prompt Dilution: Sending too much retrieved text (known as "context stuffing") can dilute the LLM's focus, suppress its existing internal knowledge, and lead to less confident or less accurate responses, despite having the correct information available.
Token Limits: Every LLM has a maximum token limit (context window). The combined length of the user query, the retrieved documents, and the prompt template/instructions must fit within this limit. If the retrieved documents exceed the limit, the RAG server must truncate them, potentially cutting off vital information.
Context Window Cost: Passing large amounts of retrieved context to a commercial LLM (like GPT-4) increases the token cost of the API call, making the RAG system significantly more expensive to operate at scale compared to sending only a short user query.

Architecture Highlights

This project demonstrates proficiency in designing and implementing modern, distributed micro-architectures:

Robust RAG Pipeline Implementation

Custom Data Ingestion: Developed a dedicated ingestion script (main.py & data_extractor.py) that handles PDF extraction, robust text cleaning (merging line wraps, removing artifacts), and semantic chunking.

Traceability: Every document chunk stored in ChromaDB is indexed with a unique, traceable ID and metadata (source_file), allowing the LLM to cite the origin of the retrieved context.

Idempotency: Utilizes the ChromaDB .upsert() method to ensure the ingestion pipeline can be rerun safely without creating duplicate data entries, crucial for maintenance and data integrity.

Microservice and Interface Design

Protocol Implementation: Successfully implemented the Model Context Protocol (MCP) using FastMCP, showcasing ability to extend third-party LLMs with custom enterprise/domain logic.

Decoupled Services: Separated the core logic: The ChromaDB Server handles state (Docker) while the Embedding/RAG logic (server.py) runs on the host, achieving modularity and allowing independent scaling.

Complex Tool Execution: Resolved intricate execution path issues using the uvx mcpo proxy to correctly bridge the host Python environment with the Stdio/MCP standard on a Linux host.

Asynchronous Networking

The Python application uses asyncio and the chromadb.AsyncHttpClient (built on httpx) to handle vector database connections non-blockingly, demonstrating knowledge of high-concurrency network communication.

Setup and Running the System

Prerequisites

Docker and Docker Compose installed.
Ollama installed and running locally on port 11434 (required for Open WebUI).
Python 3.10+ and the uv package manager.

Start the Docker Infrastructure

This command launches the vector database (ChromaDB), the OTEL collector, and Zipkin.

docker compose up -d

Start the Virtual environment

python3 -m venv .venv
source ./venv/bin/activate
pip install -r requirements.txt

In the root of the project, create a .env file and paste the following:

CHROMA_DB_HOST=<YOUR_CHROMA_DB_IP>
PORT=<PORT_WHERE_CHROMA_DB_IS_RUNNING>

Place some PDF files into the folder datasets

This folder contains the folders that are to be converted to text documents, and saved in the vector database as embeddings, along with metadata.

Run the Data Ingestion Pipeline

This loads, cleans, embeds (using Qwen3), and stores all PDF files in the datasets/ folder into ChromaDB.

python pipeline.py

Run the MCP Server (The Tool)

This command starts the local MCP server, which listens for requests from the LLM client (or Open WebUI/FastAPI proxy).

uvx mcpo --port 8000 -- /home/nadman/Desktop/Personal_Projects/MCP_Server/.venv/bin/python server.py

Run the Open WebUI Client

This provides a UI for testing, connecting to your Ollama LLM and your running MCP server.

sudo docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Access the client UI at http://localhost:3000.

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
data_cleaning		data_cleaning
database		database
datasets		datasets
docker_setup		docker_setup
documentations		documentations
embedding		embedding
.gitignore		.gitignore
.python-version		.python-version
LICENSE		LICENSE
README.md		README.md
pipeline.py		pipeline.py
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt
server.py		server.py
uv.lock		uv.lock

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

RAG-MCP Knowledge Agent: Semantic Retrieval System

Overview

Technologies and Tech-Stack

Limitations of the System

Architecture Highlights

Setup and Running the System

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

RAG-MCP Knowledge Agent: Semantic Retrieval System

Overview

Technologies and Tech-Stack

Limitations of the System

Architecture Highlights

Setup and Running the System

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages