sarfraspc/multi_source_chatbot


Multi-Source Chatbot

1. Overview

A fully functional multi-source RAG chatbot prototype built on Django. The system ingests structured and unstructured data, generates semantic embeddings with SentenceTransformers, and persists them in a local ChromaDB instance. User queries are grounded in retrieved context and passed to a configurable LLM provider (Gemini or OpenRouter) for generation. Strict Role-Based Access Control (RBAC) prevents cross-source data leakage and is enforced identically at both the application and retrieval layers.
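The retrieval-layer half of that RBAC enforcement can be sketched as a metadata filter on the vector query. This is a hedged illustration only: the field names, the user shape, and the helper names are assumptions, not the actual application code.

```python
# Illustrative sketch of retrieval-layer RBAC (assumed names, not the
# real apps code): the vector search is scoped with a metadata filter
# so a query can only match chunks from sources the user may see.

def allowed_source_ids(user):
    """Union of sources the user owns and sources granted by role."""
    return set(user["owned_sources"]) | set(user["role_sources"])

def rbac_where_filter(user):
    """Build a ChromaDB-style `where` clause restricting the search scope."""
    return {"source_id": {"$in": sorted(allowed_source_ids(user))}}

# Example: a user who owns one source and has role access to another
user = {"owned_sources": ["makita-manual"], "role_sources": ["yt-lecture-01"]}
```

Passing such a filter to the vector store at query time means an unauthorized chunk can never even enter the candidate set, regardless of what the application layer does afterwards.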

2. Features

  • Multi-Source Ingestion (apps.sources): Automated parsing, chunking, and metadata extraction of PDF documents (e.g., Makita manuals) and YouTube video transcripts into SourceChunk models.
  • Semantic Hybrid Retrieval (apps.chat): Converts user queries into dense vectors using local SentenceTransformers for similarity search against indexed sources.
  • Strict Security Enforcement: Multi-layer RBAC guarantees users only query and retrieve context from data sources they explicitly own or have role-based permission to view.
  • Hallucination Control: Prompt engineering forces the LLM to answer strictly from retrieved chunks, returning exact source metadata such as page_number and citation_label.
  • Automated Data Pipelines: Custom Django management commands handle bulk database seeding, embedding builds, and ingestion automation.
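The ingestion step above can be shown in miniature. The sketch below is a hedged illustration of overlapping chunking with attached citation metadata; the function names, window sizes, and label format are assumptions, not the actual apps.sources implementation.

```python
# Hypothetical chunking sketch (not the real apps.sources code):
# split raw text into overlapping character windows, then attach the
# metadata later surfaced as citations (page_number, citation_label).

def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping character windows."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece.strip():
            chunks.append(piece)
        if start + chunk_size >= len(text):
            break
    return chunks

def build_source_chunks(text, source_id, page_number):
    """Wrap raw chunks with the metadata returned alongside answers."""
    return [
        {
            "source_id": source_id,
            "page_number": page_number,
            "citation_label": f"{source_id}:p{page_number}:c{i}",
            "text": piece,
        }
        for i, piece in enumerate(chunk_text(text))
    ]
```

The overlap keeps a sentence that straddles a window boundary fully visible in at least one chunk, at the cost of some duplicated storage.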

3. Tech Stack

  • Web Framework: Django (Python)
  • Databases: MySQL (relational storage for users, conversations, and sources) & ChromaDB (local vector storage).
  • AI & Embeddings: SentenceTransformers (local embedding models) & Gemini / OpenRouter APIs (external LLM inference).
  • Static Serving: WhiteNoise (self-hosted static files for production).
  • Ingestion Ecosystem: Python PDF parsing libraries and youtube-transcript-api for video ingestion.
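Under the hood, the similarity search in this stack reduces to comparing dense vectors. A minimal cosine-similarity sketch, shown standalone for illustration (in practice ChromaDB computes the distance internally):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors:
    dot(a, b) / (|a| * |b|), in [-1, 1] for non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Identical directions score 1.0, orthogonal vectors score 0.0; chunks are ranked by this score against the embedded query.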

4. Setup Instructions

# 1. Setup virtual environment and dependencies
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate
pip install -r requirements.txt

# 2. Configure Environment
cp .env.example .env
# Edit .env to configure MySQL credentials and API_KEY.

# 3. Apply Migrations
python manage.py migrate

# 4. Ingest Sources and Build Semantics
python manage.py run_pipeline  # Automated: Reset → Seed → Ingest → Embed

# 5. Start Server
python manage.py runserver
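For orientation, a populated .env might look roughly like the following. Only DEBUG, the MySQL credentials, API_KEY, and CSRF_TRUSTED_ORIGINS are named by this README; the exact MySQL variable names below are assumptions, so treat .env.example as the authoritative list of keys.

```
# Hypothetical .env sketch -- see .env.example for the real keys
DEBUG=True
MYSQL_HOST=localhost
MYSQL_USER=chatbot
MYSQL_PASSWORD=change-me
MYSQL_DATABASE=multi_source_chatbot
API_KEY=your-gemini-or-openrouter-key
CSRF_TRUSTED_ORIGINS=https://yourdomain.com
```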

Production Notes

The project is configured with WhiteNoise to serve static files (like the Admin CSS) directly via Django. In production environments:

  1. Set DEBUG=False in .env.
  2. Configure CSRF_TRUSTED_ORIGINS with your domain (e.g., https://yourdomain.com).
  3. Run python manage.py collectstatic --noinput to prepare assets.

5. Example Queries

  • "What is logistic regression?"
  • "Explain the difference between linear and logistic regression."
  • "What does the accessory catalogue say about Makita DC tools?"

6. Testing & Validation

  • Unit Testing (apps.chat.tests): Validates system resilience against edge cases such as greeting bypasses, empty retrieval fallbacks, OpenRouter API runtime limits, and invalid inputs.
  • Security & RBAC Auditing: Integration tests assert that user queries cannot retrieve vector chunks unassociated with their authorized scopes.
  • Groundedness Verification: Mocks test deterministic extraction of metadata fields (page_number, citation_label) to ensure RAG pipelines accurately map LLM citations back to their origin.
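A groundedness check of the kind described above might look roughly like this. The `extract_citations` helper and the retriever interface are hypothetical stand-ins for the real apps.chat code; the point is the pattern of mocking retrieval and asserting on metadata.

```python
import unittest
from unittest.mock import MagicMock

def extract_citations(chunks):
    """Keep only the metadata fields the LLM may cite (illustrative helper,
    not the actual apps.chat implementation)."""
    return [{"page_number": c["page_number"],
             "citation_label": c["citation_label"]} for c in chunks]

class GroundednessTest(unittest.TestCase):
    def test_citations_map_back_to_retrieved_chunks(self):
        # Mock the retriever so the test is deterministic and offline.
        retriever = MagicMock()
        retriever.query.return_value = [{
            "text": "The DC18RC charger supports ...",
            "page_number": 12,
            "citation_label": "makita-manual:p12",
        }]
        citations = extract_citations(retriever.query("charger specs"))
        self.assertEqual(citations, [{"page_number": 12,
                                      "citation_label": "makita-manual:p12"}])
```

Because the retriever is mocked, the assertion fails only if the citation-mapping logic itself drops or mangles metadata, which is exactly what the check is for.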

7. Documentation Map

For detailed technical specifications, refer to the following files in the docs/ directory:

8. Limitations & Trade-offs

  • Hybrid Retrieval: We trade some semantic flexibility for exact price filtering by intercepting price queries with structured SQL instead of vector search.
  • Local Embeddings: CPU-bound ingestion is a bottleneck for very large datasets (Trade-off: Privacy/Cost over Speed).
  • Context Windows: Retrieval is capped at the top-K chunks to keep prompts within the LLM's context window and maintain response quality.

See Trade-offs & Limitations for a full breakdown.
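The top-K cap can be sketched as follows. This is an illustrative standalone version of what the vector store does internally; the `"embedding"` key and chunk shape are assumptions.

```python
import heapq
import math

def top_k_chunks(query_vec, indexed_chunks, k=4):
    """Keep only the k chunks most similar to the query embedding,
    bounding how much retrieved context reaches the LLM prompt."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0
    return heapq.nlargest(k, indexed_chunks,
                          key=lambda c: cos(query_vec, c["embedding"]))
```

Raising k widens recall but inflates the prompt; lowering it keeps generation fast at the risk of missing a relevant chunk.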
