NL-to-SQL: Vector-Based Natural Language to SQL Converter

An on-premise, deterministic, and auditable system that converts natural language queries to SQL using semantic schema retrieval and local LLMs.

✨ Key Features

Feature	Description
Multi-Step Retrieval	Tables → Columns → Joins in 3 sequential steps
Confidence Scoring	0-100% confidence for every retrieved element
Query Logging	All queries logged with similarity search
Schema Versioning	Full audit trail with checksums
SQL Validation	Pre-execution validation against retrieved schema
On-Premises	No cloud dependencies - runs fully local

🏗️ System Architecture

┌─────────────────────────────────────────────────────────────────────────────┐
│                           SCHEMA INGESTION (One-time)                       │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   ┌──────────────┐    ┌──────────────┐    ┌──────────────┐    ┌──────────┐ │
│   │   Database   │───▶│    Schema    │───▶│   Document   │───▶│ Embedder │ │
│   │   (SQLite)   │    │   Extractor  │    │   Builder    │    │          │ │
│   └──────────────┘    └──────────────┘    └──────────────┘    └────┬─────┘ │
│                                                                     │       │
│                                                                     ▼       │
│                                                               ┌──────────┐  │
│                              ┌─────────────────────────────── │ ChromaDB │  │
│                              │                                └──────────┘  │
│                              ▼                                              │
│                        ┌──────────┐                                         │
│                        │ Version  │  Creates version with checksum          │
│                        │ Manager  │  for audit trail                        │
│                        └──────────┘                                         │
└─────────────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────────────┐
│                         QUERY TIME (Multi-Step)                             │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   ┌──────────────┐                                                          │
│   │    User      │                                                          │
│   │   Question   │                                                          │
│   └──────┬───────┘                                                          │
│          │                                                                  │
│          ▼                                                                  │
│   ┌──────────────────────────────────────────────────────────────────────┐  │
│   │                    MULTI-STEP RETRIEVAL                              │  │
│   │  ┌────────────┐   ┌────────────────┐   ┌──────────────────────────┐  │  │
│   │  │  STEP 1    │──▶│    STEP 2      │──▶│       STEP 3             │  │  │
│   │  │  Tables    │   │ Columns for    │   │ Joins between            │  │  │
│   │  │  (top-K)   │   │ those tables   │   │ retrieved tables         │  │  │
│   │  └────────────┘   └────────────────┘   └──────────────────────────┘  │  │
│   │                                                                      │  │
│   │  Each result includes CONFIDENCE SCORE (0-100%)                      │  │
│   └──────────────────────────────────────────────────────────────────────┘  │
│          │                                                                  │
│          ▼                                                                  │
│   ┌──────────────┐    ┌──────────────┐    ┌──────────────┐                  │
│   │   Prompt     │───▶│  Ollama LLM  │───▶│     SQL      │                  │
│   │   Builder    │    │  (SQLCoder)  │    │  Validator   │                  │
│   └──────────────┘    └──────────────┘    └──────┬───────┘                  │
│                                                  │                          │
│                                           ┌──────▼───────┐                  │
│                                           │   Execute    │                  │
│                                           │   + Log      │◀─── Query ID     │
│                                           └──────────────┘                  │
└─────────────────────────────────────────────────────────────────────────────┘

🔄 Multi-Step Retrieval

Unlike single-pass retrieval, this system uses a 3-step sequential approach:

┌─────────────────┐
│     Step 1      │  Embed query → Find top-K relevant TABLES
│  Find Tables    │  Filter by confidence threshold (default: 30%)
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│     Step 2      │  For EACH retrieved table:
│  Get Columns    │  → Retrieve ALL its columns
└────────┬────────┘  → Inherit confidence from parent table
         │
         ▼
┌─────────────────┐
│     Step 3      │  Find JOINS between retrieved tables only
│  Find Joins     │  → Filter out joins to non-retrieved tables
└─────────────────┘

Why This Matters:

Ensures columns always belong to retrieved tables
Prevents orphaned join conditions
Makes retrieval deterministic and auditable

📊 Confidence Scoring

Every retrieved element has a confidence score (0-100%):

Retrieved Schema:
  Tables:
    - employees [conf: 92%]     ← HIGH: very relevant
    - departments [conf: 71%]   ← MEDIUM: possibly relevant
    - orders [conf: 45%]        ← LOW: marginally relevant

  Overall Confidence: 75%

How Confidence Works:

Vector distance converted to 0-1 scale
Tables weighted 50%, columns 30%, joins 20%
Overall confidence shown for every query

Confidence Thresholds:

Score	Level	Meaning
≥70%	HIGH	Very confident in retrieval
50-70%	MEDIUM	Reasonable confidence
30-50%	LOW	Low confidence, verify results
<30%	VERY LOW	Element excluded from retrieval

Confidence Calculation (Technical Details)

Step 1: Vector Distance → Confidence

ChromaDB returns a cosine distance (0 = identical, 2 = opposite). We convert to confidence:

confidence = 1 - (distance / 2)

Distance	Calculation	Confidence
0.0	1 - (0/2)	100% (identical match)
0.5	1 - (0.5/2)	75%
1.0	1 - (1/2)	50%
1.5	1 - (1.5/2)	25%
2.0	1 - (2/2)	0% (opposite meaning)

Step 2: Per-Element Confidence

Element	Confidence Source
Tables	Direct from vector similarity
Columns	50% column similarity + 50% parent table confidence
Joins	Direct from vector similarity

Column confidence is boosted by its table's confidence:

column_conf = (column_similarity * 0.5) + (table_confidence * 0.5)

Step 3: Overall Confidence

Weighted average across all elements:

overall = (table_conf * 0.5) + (column_conf * 0.3) + (join_conf * 0.2)

Component	Weight	Reason
Tables	50%	Most critical - wrong table = wrong query
Columns	30%	Important but derived from tables
Joins	20%	Supporting structure

Example Calculation:

Query: "Show employees with high salary"

Retrieval Results:
  employees:   distance=0.66 → conf = 1-(0.66/2) = 67%
  departments: distance=0.78 → conf = 1-(0.78/2) = 61%
  orders:      distance=0.84 → conf = 1-(0.84/2) = 58%

Table avg: (67 + 61 + 58) / 3 = 62%
Column avg: ~55% (inherited from tables)
Join avg: ~60%

Overall = (62% × 0.5) + (55% × 0.3) + (60% × 0.2)
        = 31% + 16.5% + 12%
        = 59.5% ≈ 60%

📝 Query Logging

All queries are logged for analytics and improvement:

┌────────────────────────────────────────────────────────┐
│                    Query Log Entry                      │
├────────────────────────────────────────────────────────┤
│  Query ID:      abc123-...                             │
│  Timestamp:     2024-01-15T10:30:00                    │
│  Question:      "Show employees with salary > 80000"   │
│  Tables Used:   employees, departments                  │
│  SQL:           SELECT ... FROM employees ...          │
│  Success:       ✓                                      │
│  Confidence:    75%                                    │
│  Exec Time:     234ms                                  │
└────────────────────────────────────────────────────────┘

Features:

Similar Query Search: Find past queries similar to current one
Success Rate Tracking: Monitor query success over time
Feedback Loop: Learn from past successful queries

# Show query statistics
python main.py --stats

# Find similar past queries (interactive mode)
/similar Show all customers

📋 Schema Versioning

Track schema changes over time:

Schema Version History
┌─────────┬─────────────────────┬────────┬──────────────────┐
│ Version │ Timestamp           │ Tables │ Checksum         │
├─────────┼─────────────────────┼────────┼──────────────────┤
│ v1      │ 2024-01-10 09:00:00 │ 5      │ a3f2b8c1e4d7...  │
│ v2      │ 2024-01-15 14:30:00 │ 6      │ b7e9c3a2f1d5...  │
│ v3      │ 2024-02-01 11:00:00 │ 6      │ c4d8f2e7a9b3...  │
└─────────┴─────────────────────┴────────┴──────────────────┘

Features:

Checksum: Detects schema changes automatically
Version Comparison: See what changed between versions
Audit Trail: Know when schema was last indexed

# Show version history
python main.py --version-history

# Or in interactive mode
/history

📁 Project Structure

HandOfTheGod/
├── main.py                    # CLI entry point
├── config.py                  # Configuration settings
├── requirements.txt           # Python dependencies
├── sample_database.db         # Sample SQLite database
├── schema_vectors/            # ChromaDB vector storage
│   └── schema_versions.json   # Version history file
│
└── nl2sql/                    # Core package
    ├── schema/                # Schema extraction
    │   ├── models.py          # TableDocument, ColumnDocument, etc.
    │   └── extractor.py       # Database introspection
    │
    ├── vectorstore/           # Vector storage & retrieval
    │   ├── embedder.py        # Local embeddings
    │   ├── store.py           # ChromaDB integration
    │   ├── retriever.py       # Multi-step retrieval + confidence
    │   ├── query_logger.py    # Query logging & analytics
    │   └── versioning.py      # Schema version tracking
    │
    ├── generator/             # SQL generation
    │   ├── prompt_builder.py  # Schema-constrained prompts
    │   └── llm.py             # Ollama LLM interface
    │
    ├── validator/             # SQL validation
    │   └── sql_validator.py   # Pre-execution verification
    │
    └── orchestrator.py        # End-to-end pipeline

🚀 Quick Start

1. Install Dependencies

pip install -r requirements.txt

2. Install & Start Ollama

# macOS
brew install ollama

# Start Ollama service
ollama serve

# Pull the SQLCoder model
ollama pull sqlcoder

3. Index Your Database Schema

python main.py --index-schema

Output:

✅ Indexed 40 schema documents
   Tables: 6
   Columns: 30
   Relationships: 4
   Version: v1 (checksum: a3f2b8c1...)

4. Query

# Interactive mode
python main.py

# Single query with debug
python main.py -q "Show employees earning above 80000" --debug

📋 CLI Commands

Command	Description
`python main.py --index-schema`	Index schema with versioning
`python main.py`	Interactive mode
`python main.py -q "question"`	Single query mode
`python main.py --debug`	Show retrieval details
`python main.py --stats`	Show query statistics
`python main.py --version-history`	Show schema versions
`python main.py --no-execute`	Generate SQL only

Interactive Commands:

/schema - Show schema version info
/stats - Show query statistics
/similar <q> - Find similar past queries
/history - Show version history
/debug - Toggle debug mode
/quit - Exit

🔧 Configuration

Edit config.py:

# Retrieval tuning
RETRIEVAL_TOP_K_TABLES = 5        # Max tables in Step 1
MIN_TABLE_CONFIDENCE = 0.3        # Minimum confidence threshold

# Validation rules
ALLOW_SELECT_STAR = False         # Block SELECT *
REQUIRE_EXPLICIT_JOINS = True     # Require explicit JOINs

🛡️ Safety Features

Feature	Protection
Schema Validation	Tables/columns must be in retrieved schema
Confidence Threshold	Low-confidence elements excluded
No SELECT *	Forces explicit column listing
Query Logging	Full audit trail
Version Tracking	Schema change detection

⚡ Performance Notes

For Mac Air M2 8GB RAM:

SQLCoder model works well (quantized)
Embedding model is lightweight (~80MB)
First query may be slow while models load

If memory issues occur:

ollama pull phi3:mini
# Update MODEL_NAME in config.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

NL-to-SQL: Vector-Based Natural Language to SQL Converter

✨ Key Features

🏗️ System Architecture

🔄 Multi-Step Retrieval

📊 Confidence Scoring

Confidence Calculation (Technical Details)

📝 Query Logging

📋 Schema Versioning

📁 Project Structure

🚀 Quick Start

1. Install Dependencies

2. Install & Start Ollama

3. Index Your Database Schema

4. Query

📋 CLI Commands

🔧 Configuration

🛡️ Safety Features

⚡ Performance Notes

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 1 Commit
nl2sql		nl2sql
schema_vectors		schema_vectors
.gitignore		.gitignore
README.md		README.md
config.py		config.py
main.py		main.py
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

NL-to-SQL: Vector-Based Natural Language to SQL Converter

✨ Key Features

🏗️ System Architecture

🔄 Multi-Step Retrieval

📊 Confidence Scoring

Confidence Calculation (Technical Details)

📝 Query Logging

📋 Schema Versioning

📁 Project Structure

🚀 Quick Start

1. Install Dependencies

2. Install & Start Ollama

3. Index Your Database Schema

4. Query

📋 CLI Commands

🔧 Configuration

🛡️ Safety Features

⚡ Performance Notes

About

Topics

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages