A fully functional multi-source RAG chatbot prototype built on Django. The system ingests structured and unstructured data, generates semantic embeddings via SentenceTransformers, and persists them in a local ChromaDB instance. User queries are grounded in retrieved context and passed to a configurable LLM provider (Gemini or OpenRouter) for generation. Strict Role-Based Access Control (RBAC) prevents cross-source data leakage, enforced identically at both the application and retrieval layers.
- Multi-Source Ingestion (`apps.sources`): Automated parsing, chunking, and metadata extraction of PDF documents (e.g., Makita manuals) and YouTube video transcripts into `SourceChunk` models.
- Semantic Hybrid Retrieval (`apps.chat`): Converts user queries into dense vectors using local SentenceTransformers for similarity search against indexed sources.
- Strict Security Enforcement: Multi-layer RBAC guarantees users only query and retrieve context from data sources they explicitly own or have role-based permission to view.
- Hallucination Control: Prompt engineering forces the LLM to answer strictly from retrieved chunks, returning exact source metadata such as `page_number` and `citation_label`.
- Automated Data Pipelines: Custom Django management commands handle bulk database seeding, embedding builds, and ingestion automation.
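The retrieval and RBAC layers above can be sketched in miniature. The following is an illustrative, self-contained toy (no real embedding model or ChromaDB); `Chunk`, `retrieve`, `allowed_source_ids`, and the two-dimensional vectors are hypothetical stand-ins for the project's actual models and store:

```python
from dataclasses import dataclass
from math import sqrt

@dataclass
class Chunk:
    text: str
    source_id: int       # which data source the chunk belongs to
    vector: list[float]  # dense embedding (toy values here)

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(x * x for x in b))
    return dot / norm if norm else 0.0

def retrieve(query_vec, chunks, allowed_source_ids, top_k=2):
    # RBAC filter FIRST: chunks outside the user's scope never enter scoring,
    # mirroring the metadata filter applied at the retrieval layer.
    visible = [c for c in chunks if c.source_id in allowed_source_ids]
    ranked = sorted(visible, key=lambda c: cosine(query_vec, c.vector), reverse=True)
    return ranked[:top_k]

chunks = [
    Chunk("Makita DC tool specs", 1, [0.9, 0.1]),
    Chunk("Logistic regression notes", 2, [0.1, 0.9]),
    Chunk("Private HR document", 3, [0.9, 0.2]),
]
hits = retrieve([1.0, 0.0], chunks, allowed_source_ids={1, 2})
print([c.source_id for c in hits])  # → [1, 2]; source 3 is never retrievable
```

Filtering before similarity scoring (rather than post-filtering the top-K) is what makes the isolation guarantee hold at the retrieval layer and not just in the UI.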
- Web Framework: Django (Python)
- Databases: MySQL (relational storage for users, conversations, and sources) & ChromaDB (local vector storage).
- AI & Embeddings: SentenceTransformers (local embedding models) & OpenRouter API (external LLM provider inference).
- Static Serving: WhiteNoise (self-hosted static files for production).
- Ingestion Ecosystem: Python PDF parsing libraries and `youtube-transcript-api` for video ingestion.
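The ingestion side reduces to splitting source text into overlapping windows and attaching the metadata that later resurfaces in citations. A minimal sketch, assuming hypothetical names (`chunk_text` and its `size`/`overlap` defaults are illustrative, not the project's actual implementation):

```python
def chunk_text(text: str, page_number: int, citation_label: str,
               size: int = 200, overlap: int = 50) -> list[dict]:
    """Split text into overlapping windows, attaching the metadata
    that the chat layer later echoes back in citations."""
    chunks = []
    step = size - overlap
    for start in range(0, max(len(text) - overlap, 1), step):
        chunks.append({
            "text": text[start:start + size],
            "page_number": page_number,
            "citation_label": citation_label,
        })
    return chunks

pieces = chunk_text("A" * 500, page_number=7, citation_label="Makita Catalogue p.7")
print(len(pieces))  # → 3 overlapping chunks covering all 500 characters
```

The overlap keeps sentences that straddle a window boundary retrievable from at least one chunk, at the cost of some index redundancy.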
```bash
# 1. Set up virtual environment and dependencies
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

# 2. Configure environment
cp .env.example .env
# Edit .env to configure MySQL credentials and API_KEY.

# 3. Apply migrations
python manage.py migrate

# 4. Ingest sources and build embeddings
python manage.py run_pipeline  # Automated: Reset → Seed → Ingest → Embed

# 5. Start server
python manage.py runserver
```

The project is configured with WhiteNoise to serve static files (such as the Admin CSS) directly via Django. In production environments:
- Set `DEBUG=False` in `.env`.
- Configure `CSRF_TRUSTED_ORIGINS` with your domain (e.g., `https://yourdomain.com`).
- Run `python manage.py collectstatic --noinput` to prepare assets.
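The production items above map to a few lines of `settings.py`. A hedged excerpt, assuming a standard WhiteNoise setup (the project's actual file may differ):

```python
# settings.py — production-relevant excerpt (illustrative)
import os

DEBUG = os.getenv("DEBUG", "False") == "True"

# Required when DEBUG=False so HTTPS form posts from your domain pass CSRF checks.
CSRF_TRUSTED_ORIGINS = ["https://yourdomain.com"]

MIDDLEWARE = [
    "django.middleware.security.SecurityMiddleware",
    # WhiteNoise must sit directly after SecurityMiddleware.
    "whitenoise.middleware.WhiteNoiseMiddleware",
    # ... remaining middleware ...
]

# Target directory for `collectstatic --noinput`.
STATIC_ROOT = os.path.join(os.path.dirname(__file__), "staticfiles")

# Compressed, cache-busting storage (Django < 4.2 setting name;
# on Django 4.2+ this moves into the STORAGES dict).
STATICFILES_STORAGE = "whitenoise.storage.CompressedManifestStaticFilesStorage"
```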
- "What is logistic regression?"
- "Explain the difference between linear and logistic regression."
- "What does the accessory catalogue say about Makita DC tools?"
- Unit Testing (`apps.chat.tests`): Validates system resilience against edge cases such as greeting bypasses, empty retrieval fallbacks, OpenRouter API runtime limits, and invalid inputs.
- Security & RBAC Auditing: Integration tests assert that user queries cannot retrieve vector chunks unassociated with their authorized scopes.
- Groundedness Verification: Mocked tests verify deterministic extraction of metadata fields (`page_number`, `citation_label`) to ensure the RAG pipeline accurately maps LLM citations back to their origin.
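The greeting-bypass and groundedness checks above follow a simple pattern. A minimal sketch in plain `unittest` (the `is_greeting` and `format_citation` helpers are hypothetical stand-ins for the project's actual functions in `apps.chat`):

```python
import unittest

def is_greeting(text: str) -> bool:
    # Greeting bypass: short salutations skip retrieval and the LLM entirely.
    return text.strip().lower() in {"hi", "hello", "hey"}

def format_citation(chunk: dict) -> str:
    # Deterministic mapping from retrieved-chunk metadata to a citation string.
    return f"{chunk['citation_label']} (p. {chunk['page_number']})"

class GroundednessTest(unittest.TestCase):
    def test_citation_maps_back_to_source_metadata(self):
        chunk = {"citation_label": "Makita Catalogue", "page_number": 12}
        self.assertEqual(format_citation(chunk), "Makita Catalogue (p. 12)")

    def test_greeting_bypass(self):
        self.assertTrue(is_greeting("  Hello  "))
        self.assertFalse(is_greeting("What is logistic regression?"))

if __name__ == "__main__":
    unittest.main()
```

Because citation formatting is a pure function of chunk metadata, it can be tested deterministically without mocking the LLM at all; only the end-to-end path needs mocks.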
For detailed technical specifications, refer to the following files in the docs/ directory:
- System Architecture: High-level components and data flow diagrams.
- RAG Pipeline: Detailed explanation of retrieval, indexing, and generation.
- Access Control (RBAC): Security implementation and data isolation.
- Data Ingestion: Deep dive into PDF and YouTube preprocessing.
- Deployment Guide: Production setup, requirements, and security checklist.
- Trade-offs & Limitations: Technical assumptions and known system constraints.
- Hybrid Retrieval: We trade off some semantic flexibility for 100% accuracy in price filtering using structured SQL intercepts.
- Local Embeddings: CPU-bound ingestion is a bottleneck for very large datasets (Trade-off: Privacy/Cost over Speed).
- Context Windows: Retrieval is limited to the top-K chunks to maintain LLM performance.
See Trade-offs & Limitations for a full breakdown.
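The "structured SQL intercept" trade-off can be illustrated as a router that sends price-constrained questions down an exact SQL path instead of the vector index. A hedged sketch; the regex, `route_query`, and the table/column names are illustrative, not the project's actual code:

```python
import re

# Matches phrasings like "under $200" or "less than 150".
PRICE_PATTERN = re.compile(r"(under|below|less than)\s*\$?(\d+)", re.IGNORECASE)

def route_query(question: str) -> dict:
    """Route price-constrained questions to an exact SQL filter instead of
    the vector index: semantic flexibility traded for exact numeric filtering."""
    match = PRICE_PATTERN.search(question)
    if match:
        limit = int(match.group(2))
        # Illustrative SQL; the real project would parameterize via the ORM.
        return {"path": "sql",
                "query": "SELECT * FROM products WHERE price < %s",
                "params": [limit]}
    return {"path": "vector", "query": question, "params": []}

print(route_query("Which Makita tools are under $200?")["path"])  # → sql
print(route_query("Explain logistic regression")["path"])         # → vector
```

The cost named above is real: a paraphrase the regex misses ("cheaper than two hundred dollars") silently falls through to the semantic path, which is why the intercept is framed as a trade-off rather than a complete solution.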