sarfraspc/multi_source_chatbot


Multi-Source Chatbot

1. Overview

A fully functional multi-source RAG chatbot prototype built on Django. The system ingests structured and unstructured data, generates semantic embeddings with SentenceTransformers, and persists them in a local ChromaDB instance. User queries are grounded in retrieved context and passed to a configurable LLM provider (Gemini or OpenRouter) for generation. Strict Role-Based Access Control (RBAC) prevents cross-source data leakage and is enforced identically at both the application and retrieval layers.
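The retrieval-layer half of that RBAC enforcement can be sketched as a metadata filter on the vector query. This is a hedged illustration only: the field names, the user shape, and the helper names are assumptions, not the actual application code.

```python
# Illustrative sketch of retrieval-layer RBAC (assumed names, not the
# real apps code): the vector search is scoped with a metadata filter
# so a query can only match chunks from sources the user may see.

def allowed_source_ids(user):
    """Union of sources the user owns and sources granted by role."""
    return set(user["owned_sources"]) | set(user["role_sources"])

def rbac_where_filter(user):
    """Build a ChromaDB-style `where` clause restricting the search scope."""
    return {"source_id": {"$in": sorted(allowed_source_ids(user))}}

# Example: a user who owns one source and has role access to another
user = {"owned_sources": ["makita-manual"], "role_sources": ["yt-lecture-01"]}
```

Passing such a filter to the vector store at query time means an unauthorized chunk can never even enter the candidate set, regardless of what the application layer does afterwards.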

2. Features

  • Multi-Source Ingestion (apps.sources): Automated parsing, chunking, and metadata extraction of PDF documents (e.g., Makita manuals) and YouTube video transcripts into SourceChunk models.
  • Semantic Hybrid Retrieval (apps.chat): Converts user queries into dense vectors using local SentenceTransformers for similarity search against indexed sources.
  • Strict Security Enforcement: Multi-layer RBAC guarantees users only query and retrieve context from data sources they explicitly own or have role-based permission to view.
  • Hallucination Control: Prompt engineering forces the LLM to answer strictly from retrieved chunks, returning exact source metadata such as page_number and citation_label.
  • Automated Data Pipelines: Custom Django management commands handle bulk database seeding, embedding builds, and ingestion automation.
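The ingestion step above can be shown in miniature. The sketch below is a hedged illustration of overlapping chunking with attached citation metadata; the function names, window sizes, and label format are assumptions, not the actual apps.sources implementation.

```python
# Hypothetical chunking sketch (not the real apps.sources code):
# split raw text into overlapping character windows, then attach the
# metadata later surfaced as citations (page_number, citation_label).

def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping character windows."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break
    return chunks

def build_source_chunks(text, source_id, page_number):
    """Wrap raw chunks with the metadata returned alongside answers."""
    return [
        {
            "source_id": source_id,
            "page_number": page_number,
            "citation_label": f"{source_id}:p{page_number}:c{i}",
            "text": piece,
        }
        for i, piece in enumerate(chunk_text(text))
    ]
```

The overlap keeps a sentence that straddles a window boundary fully visible in at least one chunk, at the cost of some duplicated storage.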

3. Tech Stack

  • Web Framework: Django (Python)
  • Databases: MySQL (relational storage for users, conversations, and sources) & ChromaDB (local vector storage).
  • AI & Embeddings: SentenceTransformers (local embedding models) & Gemini / OpenRouter APIs (external LLM inference).
  • Static Serving: WhiteNoise (self-hosted static files for production).
  • Ingestion Ecosystem: Python PDF parsing libraries and youtube-transcript-api for video ingestion.
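Under the hood, the similarity search in this stack reduces to comparing dense vectors. A minimal cosine-similarity sketch, shown standalone for illustration (in practice ChromaDB computes the distance internally):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors:
    dot(a, b) / (|a| * |b|), in [-1, 1] for non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Identical directions score 1.0, orthogonal vectors score 0.0; chunks are ranked by this score against the embedded query.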

4. Setup Instructions

# 1. Setup virtual environment and dependencies
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

# 2. Configure Environment
cp .env.example .env
# Edit .env to configure MySQL credentials and API_KEY.

# 3. Apply Migrations
python manage.py migrate

# 4. Ingest Sources and Build Semantics
python manage.py run_pipeline  # Automated: Reset → Seed → Ingest → Embed

# 5. Start Server
python manage.py runserver
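For orientation, a populated .env might look roughly like the following. Only DEBUG, the MySQL credentials, API_KEY, and CSRF_TRUSTED_ORIGINS are named by this README; the exact MySQL variable names below are assumptions, so treat .env.example as the authoritative list of keys.

```
# Hypothetical .env sketch -- see .env.example for the real keys
DEBUG=True
MYSQL_HOST=localhost
MYSQL_USER=chatbot
MYSQL_PASSWORD=change-me
MYSQL_DATABASE=multi_source_chatbot
API_KEY=your-gemini-or-openrouter-key
CSRF_TRUSTED_ORIGINS=https://yourdomain.com
```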

Production Notes

The project is configured with WhiteNoise to serve static files (like the Admin CSS) directly via Django. In production environments:

  1. Set DEBUG=False in .env.
  2. Configure CSRF_TRUSTED_ORIGINS with your domain (e.g., https://yourdomain.com).
  3. Run python manage.py collectstatic --noinput to prepare assets.

5. Example Queries

  • "What is logistic regression?"
  • "Explain the difference between linear and logistic regression."
  • "What does the accessory catalogue say about Makita DC tools?"

6. Testing & Validation

  • Unit Testing (apps.chat.tests): Validates system resilience against edge cases such as greeting bypasses, empty retrieval fallbacks, OpenRouter API runtime limits, and invalid inputs.
  • Security & RBAC Auditing: Integration tests assert that user queries cannot retrieve vector chunks unassociated with their authorized scopes.
  • Groundedness Verification: Mocks test deterministic extraction of metadata fields (page_number, citation_label) to ensure RAG pipelines accurately map LLM citations back to their origin.
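A groundedness check of the kind described above might look roughly like this. The `extract_citations` helper and the retriever interface are hypothetical stand-ins for the real apps.chat code; the point is the pattern of mocking retrieval and asserting on metadata.

```python
import unittest
from unittest.mock import MagicMock

def extract_citations(chunks):
    """Keep only the metadata fields the LLM may cite (illustrative helper,
    not the actual apps.chat implementation)."""
    return [{"page_number": c["page_number"],
             "citation_label": c["citation_label"]} for c in chunks]

class GroundednessTest(unittest.TestCase):
    def test_citations_map_back_to_retrieved_chunks(self):
        # Mock the retriever so the test is deterministic and offline.
        retriever = MagicMock()
        retriever.query.return_value = [{
            "text": "The DC18RC charger supports ...",
            "page_number": 12,
            "citation_label": "makita-manual:p12",
        }]
        citations = extract_citations(retriever.query("charger specs"))
        self.assertEqual(citations, [{"page_number": 12,
                                      "citation_label": "makita-manual:p12"}])
```

Because the retriever is mocked, the assertion fails only if the citation-mapping logic itself drops or mangles metadata, which is exactly what the check is for.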

7. Documentation Map

For detailed technical specifications, refer to the following files in the docs/ directory:

8. Limitations & Trade-offs

  • Hybrid Retrieval: We trade some semantic flexibility for exact price filtering by intercepting price queries with structured SQL instead of vector search.
  • Local Embeddings: CPU-bound ingestion is a bottleneck for very large datasets (Trade-off: Privacy/Cost over Speed).
  • Context Windows: Retrieval is capped at the top-K chunks to keep prompts within the LLM's context window and maintain response quality.

See Trade-offs & Limitations for a full breakdown.
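The top-K cap can be sketched as follows. This is an illustrative standalone version of what the vector store does internally; the `"embedding"` key and chunk shape are assumptions.

```python
import heapq
import math

def top_k_chunks(query_vec, indexed_chunks, k=4):
    """Keep only the k chunks most similar to the query embedding,
    bounding how much retrieved context reaches the LLM prompt."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0
    return heapq.nlargest(k, indexed_chunks,
                          key=lambda c: cos(query_vec, c["embedding"]))
```

Raising k widens recall but inflates the prompt; lowering it keeps generation fast at the risk of missing a relevant chunk.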
