Skip to content

NadmanFaisal/Knowledge_MCP_RAG

Repository files navigation

RAG-MCP Knowledge Agent: Semantic Retrieval System

Overview

This project implements a Retrieval-Augmented Generation (RAG) pipeline integrated with the Model Context Protocol (MCP), allowing an external Large Language Model (LLM) client (like Claude Desktop) to query a custom knowledge base of lecture materials (PDF slides) in real-time.

The architecture is built for traceability, speed, and modularity, using containerization (Docker) for state management and specialized Python libraries for high-performance vector search.

Technologies and Tech-Stack

Component Technology Rationale & Contribution
Orchestration Python (AsyncIO) / uv Manages the asynchronous data ingestion pipeline and serves the custom application logic. uv is used for ultra-fast dependency management.
Vector Embedding Qwen3-Embedding-0.6B (SentenceTransformer) Utilized a specialized, state-of-the-art embedding model to generate high-quality vector representations, ensuring superior semantic relevance over generic models.
Vector Database ChromaDB (Dockerized) Provides scalable, persistent storage for vector embeddings, original text chunks, and metadata (source tracing). Deployed in client-server mode for isolation.
API/Tooling FastMCP (Model Context Protocol) Implements a standardized interface, allowing any MCP-compliant LLM (e.g., Claude) to consume the custom RAG functionality as a callable tool (get_documents_from_RAG).
Observability OpenTelemetry (OTEL) & Zipkin Integrated tracing infrastructure to monitor request flow from the MCP proxy, through the Python server, to the ChromaDB database, aiding in performance analysis and debugging.
Deployment Docker Compose Used to manage and deploy the distributed service architecture (ChromaDB, OTEL Collector, Zipkin) within an isolated network.

Limitations of the System

  • Retrieval Latency: The RAG server must embed the query, run a vector search (in this case, ChromaDB), and fetch the top-K documents. For large indexes (millions of vectors), this can become a noticeable latency bottleneck.
  • Semantic Fragmentation: Concepts split across multiple slides/chunks can break continuity. The retriever may return only part of a section (e.g., the start of a conclusion) and miss follow-up content, leading to incomplete/improvised answers.
  • Context Irrelevance ("Garbage In"): The retrieval system may return documents that are technically similar by vector distance but semantically irrelevant to the user's question. If the LLM is fed irrelevant context, it can become distracted, leading to poor or generalized answers.
  • Prompt Dilution: Sending too much retrieved text (known as "context stuffing") can dilute the LLM's focus, suppress its existing internal knowledge, and lead to less confident or less accurate responses, despite having the correct information available.
  • Token Limits: Every LLM has a maximum token limit (context window). The combined length of the user query, the retrieved documents, and the prompt template/instructions must fit within this limit. If the retrieved documents exceed the limit, the RAG server must truncate them, potentially cutting off vital information.
  • Context Window Cost: Passing large amounts of retrieved context to a commercial LLM (like GPT-4) increases the token cost of the API call, making the RAG system significantly more expensive to operate at scale compared to sending only a short user query.

Architecture Highlights

System_diagram

This project demonstrates proficiency in designing and implementing modern, distributed micro-architectures:

  1. Robust RAG Pipeline Implementation

Custom Data Ingestion: Developed a dedicated ingestion script (main.py & data_extractor.py) that handles PDF extraction, robust text cleaning (merging line wraps, removing artifacts), and semantic chunking.

Traceability: Every document chunk stored in ChromaDB is indexed with a unique, traceable ID and metadata (source_file), allowing the LLM to cite the origin of the retrieved context.

Idempotency: Utilizes the ChromaDB .upsert() method to ensure the ingestion pipeline can be rerun safely without creating duplicate data entries, crucial for maintenance and data integrity.

  1. Microservice and Interface Design

Protocol Implementation: Successfully implemented the Model Context Protocol (MCP) using FastMCP, showcasing ability to extend third-party LLMs with custom enterprise/domain logic.

Decoupled Services: Separated the core logic: The ChromaDB Server handles state (Docker) while the Embedding/RAG logic (server.py) runs on the host, achieving modularity and allowing independent scaling.

Complex Tool Execution: Resolved intricate execution path issues using the uvx mcpo proxy to correctly bridge the host Python environment with the Stdio/MCP standard on a Linux host.

  1. Asynchronous Networking

The Python application uses asyncio and the chromadb.AsyncHttpClient (built on httpx) to handle vector database connections non-blockingly, demonstrating knowledge of high-concurrency network communication.

Setup and Running the System

Prerequisites

  • Docker and Docker Compose installed.
  • Ollama installed and running locally on port 11434 (required for Open WebUI).
  • Python 3.10+ and the uv package manager.
  1. Start the Docker Infrastructure

This command launches the vector database (ChromaDB), the OTEL collector, and Zipkin.

docker compose up -d
  1. Start the Virtual environment
python3 -m venv .venv
source ./venv/bin/activate
pip install -r requirements.txt
  1. In the root of the project, create a .env file and paste the following:
CHROMA_DB_HOST=<YOUR_CHROMA_DB_IP>
PORT=<PORT_WHERE_CHROMA_DB_IS_RUNNING>
  1. Place some PDF files into the folder datasets

This folder contains the folders that are to be converted to text documents, and saved in the vector database as embeddings, along with metadata.

  1. Run the Data Ingestion Pipeline

This loads, cleans, embeds (using Qwen3), and stores all PDF files in the datasets/ folder into ChromaDB.

python pipeline.py
  1. Run the MCP Server (The Tool)

This command starts the local MCP server, which listens for requests from the LLM client (or Open WebUI/FastAPI proxy).

uvx mcpo --port 8000 -- /home/nadman/Desktop/Personal_Projects/MCP_Server/.venv/bin/python server.py
  1. Run the Open WebUI Client

This provides a UI for testing, connecting to your Ollama LLM and your running MCP server.

sudo docker run -d \
  -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  -e OLLAMA_BASE_URL=http://host.docker.internal:11434 \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main

Access the client UI at http://localhost:3000.

About

Distributed RAG Agent using FastMCP to augment LLMs with custom knowledge (PDFs). Features Qwen3 embeddings, Dockerized ChromaDB for idempotent storage, and integrated OpenTelemetry tracing.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages