Skip to content

Mrugendra7911/OnPremNLtoSQL

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

1 Commit
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

NL-to-SQL: Vector-Based Natural Language to SQL Converter

An on-premise, deterministic, and auditable system that converts natural language queries to SQL using semantic schema retrieval and local LLMs.

✨ Key Features

Feature Description
Multi-Step Retrieval Tables β†’ Columns β†’ Joins in 3 sequential steps
Confidence Scoring 0-100% confidence for every retrieved element
Query Logging All queries logged with similarity search
Schema Versioning Full audit trail with checksums
SQL Validation Pre-execution validation against retrieved schema
On-Premises No cloud dependencies - runs fully local

πŸ—οΈ System Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                           SCHEMA INGESTION (One-time)                       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                             β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚
β”‚   β”‚   Database   │───▢│    Schema    │───▢│   Document   │───▢│ Embedder β”‚ β”‚
β”‚   β”‚   (SQLite)   β”‚    β”‚   Extractor  β”‚    β”‚   Builder    β”‚    β”‚          β”‚ β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”˜ β”‚
β”‚                                                                     β”‚       β”‚
β”‚                                                                     β–Ό       β”‚
β”‚                                                               β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚                              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”‚ ChromaDB β”‚  β”‚
β”‚                              β”‚                                β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚                              β–Ό                                              β”‚
β”‚                        β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                         β”‚
β”‚                        β”‚ Version  β”‚  Creates version with checksum          β”‚
β”‚                        β”‚ Manager  β”‚  for audit trail                        β”‚
β”‚                        β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                                         β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                         QUERY TIME (Multi-Step)                             β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚                                                                             β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                                                          β”‚
β”‚   β”‚    User      β”‚                                                          β”‚
β”‚   β”‚   Question   β”‚                                                          β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜                                                          β”‚
β”‚          β”‚                                                                  β”‚
β”‚          β–Ό                                                                  β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚   β”‚                    MULTI-STEP RETRIEVAL                              β”‚  β”‚
β”‚   β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚  β”‚
β”‚   β”‚  β”‚  STEP 1    │──▢│    STEP 2      │──▢│       STEP 3             β”‚  β”‚  β”‚
β”‚   β”‚  β”‚  Tables    β”‚   β”‚ Columns for    β”‚   β”‚ Joins between            β”‚  β”‚  β”‚
β”‚   β”‚  β”‚  (top-K)   β”‚   β”‚ those tables   β”‚   β”‚ retrieved tables         β”‚  β”‚  β”‚
β”‚   β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚  β”‚
β”‚   β”‚                                                                      β”‚  β”‚
β”‚   β”‚  Each result includes CONFIDENCE SCORE (0-100%)                      β”‚  β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”‚          β”‚                                                                  β”‚
β”‚          β–Ό                                                                  β”‚
β”‚   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”                  β”‚
β”‚   β”‚   Prompt     │───▢│  Ollama LLM  │───▢│     SQL      β”‚                  β”‚
β”‚   β”‚   Builder    β”‚    β”‚  (SQLCoder)  β”‚    β”‚  Validator   β”‚                  β”‚
β”‚   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜    β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜                  β”‚
β”‚                                                  β”‚                          β”‚
β”‚                                           β”Œβ”€β”€β”€β”€β”€β”€β–Όβ”€β”€β”€β”€β”€β”€β”€β”                  β”‚
β”‚                                           β”‚   Execute    β”‚                  β”‚
β”‚                                           β”‚   + Log      │◀─── Query ID     β”‚
β”‚                                           β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ”„ Multi-Step Retrieval

Unlike single-pass retrieval, this system uses a 3-step sequential approach:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚     Step 1      β”‚  Embed query β†’ Find top-K relevant TABLES
β”‚  Find Tables    β”‚  Filter by confidence threshold (default: 30%)
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚     Step 2      β”‚  For EACH retrieved table:
β”‚  Get Columns    β”‚  β†’ Retrieve ALL its columns
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β†’ Inherit confidence from parent table
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚     Step 3      β”‚  Find JOINS between retrieved tables only
β”‚  Find Joins     β”‚  β†’ Filter out joins to non-retrieved tables
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Why This Matters:

  • Ensures columns always belong to retrieved tables
  • Prevents orphaned join conditions
  • Makes retrieval deterministic and auditable

πŸ“Š Confidence Scoring

Every retrieved element has a confidence score (0-100%):

Retrieved Schema:
  Tables:
    - employees [conf: 92%]     ← HIGH: very relevant
    - departments [conf: 71%]   ← MEDIUM: possibly relevant
    - orders [conf: 45%]        ← LOW: marginally relevant

  Overall Confidence: 75%

How Confidence Works:

  • Vector distance converted to 0-1 scale
  • Tables weighted 50%, columns 30%, joins 20%
  • Overall confidence shown for every query

Confidence Thresholds:

Score Level Meaning
β‰₯70% HIGH Very confident in retrieval
50-70% MEDIUM Reasonable confidence
30-50% LOW Low confidence, verify results
<30% VERY LOW Element excluded from retrieval

Confidence Calculation (Technical Details)

Step 1: Vector Distance β†’ Confidence

ChromaDB returns a cosine distance (0 = identical, 2 = opposite). We convert to confidence:

confidence = 1 - (distance / 2)
Distance Calculation Confidence
0.0 1 - (0/2) 100% (identical match)
0.5 1 - (0.5/2) 75%
1.0 1 - (1/2) 50%
1.5 1 - (1.5/2) 25%
2.0 1 - (2/2) 0% (opposite meaning)

Step 2: Per-Element Confidence

Element Confidence Source
Tables Direct from vector similarity
Columns 50% column similarity + 50% parent table confidence
Joins Direct from vector similarity

Column confidence is boosted by its table's confidence:

column_conf = (column_similarity * 0.5) + (table_confidence * 0.5)

Step 3: Overall Confidence

Weighted average across all elements:

overall = (table_conf * 0.5) + (column_conf * 0.3) + (join_conf * 0.2)
Component Weight Reason
Tables 50% Most critical - wrong table = wrong query
Columns 30% Important but derived from tables
Joins 20% Supporting structure

Example Calculation:

Query: "Show employees with high salary"

Retrieval Results:
  employees:   distance=0.66 β†’ conf = 1-(0.66/2) = 67%
  departments: distance=0.78 β†’ conf = 1-(0.78/2) = 61%
  orders:      distance=0.84 β†’ conf = 1-(0.84/2) = 58%

Table avg: (67 + 61 + 58) / 3 = 62%
Column avg: ~55% (inherited from tables)
Join avg: ~60%

Overall = (62% Γ— 0.5) + (55% Γ— 0.3) + (60% Γ— 0.2)
        = 31% + 16.5% + 12%
        = 59.5% β‰ˆ 60%

πŸ“ Query Logging

All queries are logged for analytics and improvement:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    Query Log Entry                      β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Query ID:      abc123-...                             β”‚
β”‚  Timestamp:     2024-01-15T10:30:00                    β”‚
β”‚  Question:      "Show employees with salary > 80000"   β”‚
β”‚  Tables Used:   employees, departments                  β”‚
β”‚  SQL:           SELECT ... FROM employees ...          β”‚
β”‚  Success:       βœ“                                      β”‚
β”‚  Confidence:    75%                                    β”‚
β”‚  Exec Time:     234ms                                  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Features:

  • Similar Query Search: Find past queries similar to current one
  • Success Rate Tracking: Monitor query success over time
  • Feedback Loop: Learn from past successful queries
# Show query statistics
python main.py --stats

# Find similar past queries (interactive mode)
/similar Show all customers

πŸ“‹ Schema Versioning

Track schema changes over time:

Schema Version History
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Version β”‚ Timestamp           β”‚ Tables β”‚ Checksum         β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ v1      β”‚ 2024-01-10 09:00:00 β”‚ 5      β”‚ a3f2b8c1e4d7...  β”‚
β”‚ v2      β”‚ 2024-01-15 14:30:00 β”‚ 6      β”‚ b7e9c3a2f1d5...  β”‚
β”‚ v3      β”‚ 2024-02-01 11:00:00 β”‚ 6      β”‚ c4d8f2e7a9b3...  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Features:

  • Checksum: Detects schema changes automatically
  • Version Comparison: See what changed between versions
  • Audit Trail: Know when schema was last indexed
# Show version history
python main.py --version-history

# Or in interactive mode
/history

πŸ“ Project Structure

HandOfTheGod/
β”œβ”€β”€ main.py                    # CLI entry point
β”œβ”€β”€ config.py                  # Configuration settings
β”œβ”€β”€ requirements.txt           # Python dependencies
β”œβ”€β”€ sample_database.db         # Sample SQLite database
β”œβ”€β”€ schema_vectors/            # ChromaDB vector storage
β”‚   └── schema_versions.json   # Version history file
β”‚
└── nl2sql/                    # Core package
    β”œβ”€β”€ schema/                # Schema extraction
    β”‚   β”œβ”€β”€ models.py          # TableDocument, ColumnDocument, etc.
    β”‚   └── extractor.py       # Database introspection
    β”‚
    β”œβ”€β”€ vectorstore/           # Vector storage & retrieval
    β”‚   β”œβ”€β”€ embedder.py        # Local embeddings
    β”‚   β”œβ”€β”€ store.py           # ChromaDB integration
    β”‚   β”œβ”€β”€ retriever.py       # Multi-step retrieval + confidence
    β”‚   β”œβ”€β”€ query_logger.py    # Query logging & analytics
    β”‚   └── versioning.py      # Schema version tracking
    β”‚
    β”œβ”€β”€ generator/             # SQL generation
    β”‚   β”œβ”€β”€ prompt_builder.py  # Schema-constrained prompts
    β”‚   └── llm.py             # Ollama LLM interface
    β”‚
    β”œβ”€β”€ validator/             # SQL validation
    β”‚   └── sql_validator.py   # Pre-execution verification
    β”‚
    └── orchestrator.py        # End-to-end pipeline

πŸš€ Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Install & Start Ollama

# macOS
brew install ollama

# Start Ollama service
ollama serve

# Pull the SQLCoder model
ollama pull sqlcoder

3. Index Your Database Schema

python main.py --index-schema

Output:

βœ… Indexed 40 schema documents
   Tables: 6
   Columns: 30
   Relationships: 4
   Version: v1 (checksum: a3f2b8c1...)

4. Query

# Interactive mode
python main.py

# Single query with debug
python main.py -q "Show employees earning above 80000" --debug

πŸ“‹ CLI Commands

Command Description
python main.py --index-schema Index schema with versioning
python main.py Interactive mode
python main.py -q "question" Single query mode
python main.py --debug Show retrieval details
python main.py --stats Show query statistics
python main.py --version-history Show schema versions
python main.py --no-execute Generate SQL only

Interactive Commands:

  • /schema - Show schema version info
  • /stats - Show query statistics
  • /similar <q> - Find similar past queries
  • /history - Show version history
  • /debug - Toggle debug mode
  • /quit - Exit

πŸ”§ Configuration

Edit config.py:

# Retrieval tuning
RETRIEVAL_TOP_K_TABLES = 5        # Max tables in Step 1
MIN_TABLE_CONFIDENCE = 0.3        # Minimum confidence threshold

# Validation rules
ALLOW_SELECT_STAR = False         # Block SELECT *
REQUIRE_EXPLICIT_JOINS = True     # Require explicit JOINs

πŸ›‘οΈ Safety Features

Feature Protection
Schema Validation Tables/columns must be in retrieved schema
Confidence Threshold Low-confidence elements excluded
No SELECT * Forces explicit column listing
Query Logging Full audit trail
Version Tracking Schema change detection

⚑ Performance Notes

For Mac Air M2 8GB RAM:

  • SQLCoder model works well (quantized)
  • Embedding model is lightweight (~80MB)
  • First query may be slow while models load

If memory issues occur:

ollama pull phi3:mini
# Update MODEL_NAME in config.py

About

This is a security-first NL-to-SQL solution designed for enterprises with small to medium-sized databases. It enables natural language querying of data while ensuring that all processing remains fully on-premises. No data is sent to external cloud environments, and no sensitive information is exposed to external LLM services.

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages