An on-premise, deterministic, and auditable system that converts natural language queries to SQL using semantic schema retrieval and local LLMs.
| Feature | Description |
|---|---|
| Multi-Step Retrieval | Tables β Columns β Joins in 3 sequential steps |
| Confidence Scoring | 0-100% confidence for every retrieved element |
| Query Logging | All queries logged with similarity search |
| Schema Versioning | Full audit trail with checksums |
| SQL Validation | Pre-execution validation against retrieved schema |
| On-Premises | No cloud dependencies - runs fully local |
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β SCHEMA INGESTION (One-time) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ ββββββββββββ β
β β Database βββββΆβ Schema βββββΆβ Document βββββΆβ Embedder β β
β β (SQLite) β β Extractor β β Builder β β β β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ ββββββ¬ββββββ β
β β β
β βΌ β
β ββββββββββββ β
β ββββββββββββββββββββββββββββββββ β ChromaDB β β
β β ββββββββββββ β
β βΌ β
β ββββββββββββ β
β β Version β Creates version with checksum β
β β Manager β for audit trail β
β ββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β QUERY TIME (Multi-Step) β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β β
β ββββββββββββββββ β
β β User β β
β β Question β β
β ββββββββ¬ββββββββ β
β β β
β βΌ β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β MULTI-STEP RETRIEVAL β β
β β ββββββββββββββ ββββββββββββββββββ ββββββββββββββββββββββββββββ β β
β β β STEP 1 ββββΆβ STEP 2 ββββΆβ STEP 3 β β β
β β β Tables β β Columns for β β Joins between β β β
β β β (top-K) β β those tables β β retrieved tables β β β
β β ββββββββββββββ ββββββββββββββββββ ββββββββββββββββββββββββββββ β β
β β β β
β β Each result includes CONFIDENCE SCORE (0-100%) β β
β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β
β β β
β βΌ β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β
β β Prompt βββββΆβ Ollama LLM βββββΆβ SQL β β
β β Builder β β (SQLCoder) β β Validator β β
β ββββββββββββββββ ββββββββββββββββ ββββββββ¬ββββββββ β
β β β
β ββββββββΌββββββββ β
β β Execute β β
β β + Log βββββ Query ID β
β ββββββββββββββββ β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Unlike single-pass retrieval, this system uses a 3-step sequential approach:
βββββββββββββββββββ
β Step 1 β Embed query β Find top-K relevant TABLES
β Find Tables β Filter by confidence threshold (default: 30%)
ββββββββββ¬βββββββββ
β
βΌ
βββββββββββββββββββ
β Step 2 β For EACH retrieved table:
β Get Columns β β Retrieve ALL its columns
ββββββββββ¬βββββββββ β Inherit confidence from parent table
β
βΌ
βββββββββββββββββββ
β Step 3 β Find JOINS between retrieved tables only
β Find Joins β β Filter out joins to non-retrieved tables
βββββββββββββββββββ
Why This Matters:
- Ensures columns always belong to retrieved tables
- Prevents orphaned join conditions
- Makes retrieval deterministic and auditable
Every retrieved element has a confidence score (0-100%):
Retrieved Schema:
Tables:
- employees [conf: 92%] β HIGH: very relevant
- departments [conf: 71%] β MEDIUM: possibly relevant
- orders [conf: 45%] β LOW: marginally relevant
Overall Confidence: 75%
How Confidence Works:
- Vector distance converted to 0-1 scale
- Tables weighted 50%, columns 30%, joins 20%
- Overall confidence shown for every query
Confidence Thresholds:
| Score | Level | Meaning |
|---|---|---|
| β₯70% | HIGH | Very confident in retrieval |
| 50-70% | MEDIUM | Reasonable confidence |
| 30-50% | LOW | Low confidence, verify results |
| <30% | VERY LOW | Element excluded from retrieval |
Step 1: Vector Distance β Confidence
ChromaDB returns a cosine distance (0 = identical, 2 = opposite). We convert to confidence:
confidence = 1 - (distance / 2)| Distance | Calculation | Confidence |
|---|---|---|
| 0.0 | 1 - (0/2) | 100% (identical match) |
| 0.5 | 1 - (0.5/2) | 75% |
| 1.0 | 1 - (1/2) | 50% |
| 1.5 | 1 - (1.5/2) | 25% |
| 2.0 | 1 - (2/2) | 0% (opposite meaning) |
Step 2: Per-Element Confidence
| Element | Confidence Source |
|---|---|
| Tables | Direct from vector similarity |
| Columns | 50% column similarity + 50% parent table confidence |
| Joins | Direct from vector similarity |
Column confidence is boosted by its table's confidence:
column_conf = (column_similarity * 0.5) + (table_confidence * 0.5)Step 3: Overall Confidence
Weighted average across all elements:
overall = (table_conf * 0.5) + (column_conf * 0.3) + (join_conf * 0.2)| Component | Weight | Reason |
|---|---|---|
| Tables | 50% | Most critical - wrong table = wrong query |
| Columns | 30% | Important but derived from tables |
| Joins | 20% | Supporting structure |
Example Calculation:
Query: "Show employees with high salary"
Retrieval Results:
employees: distance=0.66 β conf = 1-(0.66/2) = 67%
departments: distance=0.78 β conf = 1-(0.78/2) = 61%
orders: distance=0.84 β conf = 1-(0.84/2) = 58%
Table avg: (67 + 61 + 58) / 3 = 62%
Column avg: ~55% (inherited from tables)
Join avg: ~60%
Overall = (62% Γ 0.5) + (55% Γ 0.3) + (60% Γ 0.2)
= 31% + 16.5% + 12%
= 59.5% β 60%
All queries are logged for analytics and improvement:
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β Query Log Entry β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ€
β Query ID: abc123-... β
β Timestamp: 2024-01-15T10:30:00 β
β Question: "Show employees with salary > 80000" β
β Tables Used: employees, departments β
β SQL: SELECT ... FROM employees ... β
β Success: β β
β Confidence: 75% β
β Exec Time: 234ms β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Features:
- Similar Query Search: Find past queries similar to current one
- Success Rate Tracking: Monitor query success over time
- Feedback Loop: Learn from past successful queries
# Show query statistics
python main.py --stats
# Find similar past queries (interactive mode)
/similar Show all customersTrack schema changes over time:
Schema Version History
βββββββββββ¬ββββββββββββββββββββββ¬βββββββββ¬βββββββββββββββββββ
β Version β Timestamp β Tables β Checksum β
βββββββββββΌββββββββββββββββββββββΌβββββββββΌβββββββββββββββββββ€
β v1 β 2024-01-10 09:00:00 β 5 β a3f2b8c1e4d7... β
β v2 β 2024-01-15 14:30:00 β 6 β b7e9c3a2f1d5... β
β v3 β 2024-02-01 11:00:00 β 6 β c4d8f2e7a9b3... β
βββββββββββ΄ββββββββββββββββββββββ΄βββββββββ΄βββββββββββββββββββ
Features:
- Checksum: Detects schema changes automatically
- Version Comparison: See what changed between versions
- Audit Trail: Know when schema was last indexed
# Show version history
python main.py --version-history
# Or in interactive mode
/historyHandOfTheGod/
βββ main.py # CLI entry point
βββ config.py # Configuration settings
βββ requirements.txt # Python dependencies
βββ sample_database.db # Sample SQLite database
βββ schema_vectors/ # ChromaDB vector storage
β βββ schema_versions.json # Version history file
β
βββ nl2sql/ # Core package
βββ schema/ # Schema extraction
β βββ models.py # TableDocument, ColumnDocument, etc.
β βββ extractor.py # Database introspection
β
βββ vectorstore/ # Vector storage & retrieval
β βββ embedder.py # Local embeddings
β βββ store.py # ChromaDB integration
β βββ retriever.py # Multi-step retrieval + confidence
β βββ query_logger.py # Query logging & analytics
β βββ versioning.py # Schema version tracking
β
βββ generator/ # SQL generation
β βββ prompt_builder.py # Schema-constrained prompts
β βββ llm.py # Ollama LLM interface
β
βββ validator/ # SQL validation
β βββ sql_validator.py # Pre-execution verification
β
βββ orchestrator.py # End-to-end pipeline
pip install -r requirements.txt# macOS
brew install ollama
# Start Ollama service
ollama serve
# Pull the SQLCoder model
ollama pull sqlcoderpython main.py --index-schemaOutput:
β
Indexed 40 schema documents
Tables: 6
Columns: 30
Relationships: 4
Version: v1 (checksum: a3f2b8c1...)
# Interactive mode
python main.py
# Single query with debug
python main.py -q "Show employees earning above 80000" --debug| Command | Description |
|---|---|
python main.py --index-schema |
Index schema with versioning |
python main.py |
Interactive mode |
python main.py -q "question" |
Single query mode |
python main.py --debug |
Show retrieval details |
python main.py --stats |
Show query statistics |
python main.py --version-history |
Show schema versions |
python main.py --no-execute |
Generate SQL only |
Interactive Commands:
/schema- Show schema version info/stats- Show query statistics/similar <q>- Find similar past queries/history- Show version history/debug- Toggle debug mode/quit- Exit
Edit config.py:
# Retrieval tuning
RETRIEVAL_TOP_K_TABLES = 5 # Max tables in Step 1
MIN_TABLE_CONFIDENCE = 0.3 # Minimum confidence threshold
# Validation rules
ALLOW_SELECT_STAR = False # Block SELECT *
REQUIRE_EXPLICIT_JOINS = True # Require explicit JOINs| Feature | Protection |
|---|---|
| Schema Validation | Tables/columns must be in retrieved schema |
| Confidence Threshold | Low-confidence elements excluded |
| No SELECT * | Forces explicit column listing |
| Query Logging | Full audit trail |
| Version Tracking | Schema change detection |
For Mac Air M2 8GB RAM:
- SQLCoder model works well (quantized)
- Embedding model is lightweight (~80MB)
- First query may be slow while models load
If memory issues occur:
ollama pull phi3:mini
# Update MODEL_NAME in config.py