A retrieval-augmented generation (RAG) agent that reads documents from the data/ directory and answers questions based solely on those documents. Everything runs locally: no external API calls, no internet required once the models are downloaded.
When you start the script, it reads every text file in data/ and splits them into chunks of roughly 300 words each. Each chunk is passed through an embedding model (nomic-embed-text) which converts it into a 768-dimensional vector. All vectors are stored in a numpy matrix in memory.
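The chunking step can be sketched as follows. This is a minimal illustration, not the code in rag_agent.py: the function name `chunk_words` and the choice to split on whitespace without overlap are assumptions.

```python
def chunk_words(text: str, chunk_size: int = 300) -> list[str]:
    """Split text into consecutive chunks of at most chunk_size words.

    Assumption: words are whitespace-separated and chunks do not overlap.
    The real script may split differently.
    """
    words = text.split()
    return [
        " ".join(words[i:i + chunk_size])
        for i in range(0, len(words), chunk_size)
    ]


# A 650-word document yields two full chunks and one shorter tail chunk.
sample = " ".join(f"w{i}" for i in range(650))
chunks = chunk_words(sample)
print([len(c.split()) for c in chunks])  # [300, 300, 50]
```

Each chunk would then be sent to the embedding model, and the resulting vectors stacked into one matrix with a row per chunk.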
When you type a question:
- The same embedding model converts your question into a vector
- Cosine similarity (computed as a dot product, which is equivalent when the vectors are unit length) finds the top 3 chunks closest to your question
- Those 3 chunks are inserted into a prompt that instructs the LLM to answer only from the provided context
- The model (qwen2.5-coder:1.5b-instruct-q4_K_M) streams the answer back
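The retrieval step above can be sketched with numpy alone. This is an illustration under assumptions, not the script's code: the helper names `normalize` and `top_k` are hypothetical, and the toy 2-dimensional vectors stand in for the 768-dimensional embeddings.

```python
import numpy as np


def normalize(matrix: np.ndarray) -> np.ndarray:
    """Scale each vector to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(matrix, axis=-1, keepdims=True)
    return matrix / np.clip(norms, 1e-12, None)  # guard against zero vectors


def top_k(query_vec, chunk_matrix, k: int = 3) -> list[tuple[int, float]]:
    """Return (row index, cosine score) for the k chunks most similar to the query."""
    q = normalize(np.asarray(query_vec, dtype=np.float64))
    m = normalize(np.asarray(chunk_matrix, dtype=np.float64))
    scores = m @ q  # one cosine score per chunk row
    k = min(k, len(scores))
    order = np.argsort(scores)[::-1][:k]  # highest score first
    return [(int(i), float(scores[i])) for i in order]


# Toy data: four "chunk" vectors and one "question" vector.
chunk_matrix = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.1]
print([i for i, _ in top_k(query, chunk_matrix)])  # [0, 2, 1]
```

Normalizing the chunk matrix once at startup, rather than on every question, would avoid repeated work; it is done per call here only to keep the sketch self-contained.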
The prompt explicitly tells the model not to use its own knowledge, only the context it receives.
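A grounding prompt of this kind might be assembled as shown here. The wording and the `build_prompt` name are hypothetical; the actual prompt text in rag_agent.py may differ.

```python
def build_prompt(question: str, chunks: list[str]) -> str:
    """Combine retrieved chunks and the question into one grounded prompt.

    Assumption: the instruction wording below is illustrative only.
    """
    # Number each chunk so the context boundaries are visible to the model.
    context = "\n\n".join(f"[{n}] {c}" for n, c in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "Do not use any outside knowledge. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


prompt = build_prompt(
    "What port does the service use?",
    ["The service listens on port 8080.", "Logs are written to /var/log/app."],
)
print(prompt)
```

Asking the model to say when the context lacks an answer gives it an alternative to guessing, though a small model may not always follow the instruction.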
ollama pull qwen2.5-coder:1.5b-instruct-q4_K_M
ollama pull nomic-embed-text:latest
pip install ollama numpy
python3 rag_agent.py
Place .txt files in data/ before running. Type your questions at the prompt. Type quit to exit.
LLM: qwen2.5-coder:1.5b-instruct-q4_K_M — 4-bit quantized, roughly 1GB. Selected for its small footprint and strong instruction following relative to its size.
Embedding model: nomic-embed-text — 274MB, 768-dimensional output. Purpose-built for semantic similarity tasks. Using a dedicated embedding model rather than the LLM itself is more efficient and produces better retrieval results.
You can swap either model by changing the EMBED_MODEL and LLM_MODEL variables at the top of the script.
Standard chatbots have no mechanism to restrict their answers to a specific set of documents. They draw from their training data, which makes them unsuitable for querying private or domain-specific information. This agent solves that by:
- Keeping all data local (privacy)
- Using retrieval to select only relevant context (precision)
- Structuring the prompt to restrict the LLM's output (grounding)
rag/
├── rag_agent.py
├── data/ # source documents (txt files)
└── README.md
- Only supports plain text files currently
- Embeddings are recomputed on every startup (no persistence yet)
- Chunk size is fixed at ~300 words
- Retrieval is limited to top 3 chunks