Skip to content

rohyeeet/QueryGPT

Repository files navigation

JattBot — Varaha Data Assistant

Python Flask Gemini PostgreSQL Google Cloud

image image

An internal data assistant for Varaha's agroforestry and carbon platform. Ask questions in plain English — JattBot generates the correct SQL, runs it against the production database, and gives you a plain-English interpretation with the real numbers.

image

Table of Contents

  1. How it works (quick)
  2. End-to-end architecture
  3. Tools & tech stack
  4. How the bot was trained
  5. Domain knowledge embedded
  6. Quick start
  7. API reference
  8. Project structure
  9. Extending the bot

How it works

  1. You type a question in the chat UI ("How many kyaris onboarded in Andhra Pradesh this cohort?")
  2. The Flask backend assembles a system prompt containing the live database schema + all business rules
  3. Gemini 2.5 Flash (thinking mode on) generates correct SQL wrapped in <SQL>...</SQL> tags
  4. The backend validates (SELECT only), executes against the read-only production DB
  5. Real results are fed back to Gemini for an accurate plain-English interpretation
  6. You see a 5-row preview → confirm → expand all rows → export CSV or JSON
image

End-to-end architecture

┌─────────────────────────────────────────────────────────────────────┐
│  BROWSER  (index.html — single file, no build step)                 │
│                                                                     │
│  ┌──────────────────┐   POST /api/chat    ┌────────────────────┐   │
│  │  Chat interface  │ ─────────────────→  │                    │   │
│  │  SQL display     │                     │  Flask backend     │   │
│  │  Results table   │ ←──────────────── ─ │  (app.py :5001)    │   │
│  │  CSV / JSON exp. │   {message, sql,    │                    │   │
│  └──────────────────┘    results, error}  └────────┬───────────┘   │
└──────────────────────────────────────────────────── │ ─────────────┘
                                                       │
                    ┌──────────────────────────────────┤
                    │                                  │
                    ▼                                  ▼
     ┌──────────────────────────┐      ┌───────────────────────────┐
     │  Google Cloud Vertex AI  │      │  PostgreSQL (read-only)   │
     │                          │      │                           │
     │  Project: QueryGPT-Final │      │  Host: backenddb-read     │
     │  Model: gemini-2.5-flash │      │       .varaha.com         │
     │  Auth: ADC (gcloud)      │      │  User: readonly           │
     │  thinking_budget: 2048   │      │  ~442 tables              │
     └──────────────────────────┘      └───────────────────────────┘

Request lifecycle

User message
     │
     ▼
[1] Build context
     │  schema_cache['schema_text']  ← live DB schema (auto-discovered, 3h refresh)
     │  get_system_prompt()          ← domain rules, business logic, join paths
     │  example_queries.txt          ← reference SQL patterns
     │  chat history (last 30 msgs)  ← multi-turn memory
     │
     ▼
[2] First Gemini call  →  generates SQL in <SQL>...</SQL> tags + explanation
     │
     ▼
[3] Validate SQL
     │  Blocks: INSERT / UPDATE / DELETE / DROP / ALTER / TRUNCATE / GRANT
     │  Requires: query must start with SELECT or WITH
     │
     ▼
[4] Execute on DB  →  rows, elapsed_seconds
     │  On SQL error: auto-retry (feeds error back to Gemini, one retry)
     │  On timeout/connect error: returns vpn_required flag to UI
     │
     ▼
[5] Second Gemini call  →  interprets *actual* result rows
     │  "Using ONLY these real numbers, write a 2-3 sentence interpretation"
     │  Prevents hallucinated numbers in the explanation
     │
     ▼
[6] Return to UI
     {message, sql, results[], row_count, execution_time, sql_error, session_id}

Schema discovery

On startup the backend runs discover_schema() in a background thread. It queries information_schema for all tables, columns, primary keys, foreign keys, row counts, and unique indexes. It also runs ~20 sample queries against key lookup tables to pull the actual enum values (e.g., all state.name values, all ag_farm_kyari.kyari_type values). This gives the AI exact filter values instead of guessing.

The full text is written to backend/discovered_schema.txt and cached in memory. A background worker refreshes the full schema every 3 hours and refreshes sampled values every 1 hour.

If the database is unreachable (e.g., no VPN), the server falls back to the last cached discovered_schema.txt, and if that doesn't exist, to the static backend/schema_context.txt snapshot — so the AI still works offline, just with a potentially stale schema.


Tools & tech stack

Layer Tool Purpose
AI model Google Gemini 2.5 Flash NL → SQL generation and result interpretation
AI thinking thinking_budget=2048 Gemini reasons through join paths before writing SQL
AI SDK google-genai (Python) Official Google Gen AI client library
AI auth Google Cloud Vertex AI + ADC Billed to org's GCP project (QueryGPT-Final); no API key stored
Backend Flask 3.0 (Python) HTTP server, session management, schema cache, SQL execution
Database PostgreSQL (psycopg2) Read-only production replica via VPN
Frontend Vanilla HTML/CSS/JS Single-file UI — no framework, no build step
Session memory In-memory dict Last 30 messages per session for multi-turn conversation
Schema cache In-memory + disk discovered_schema.txt / .json persisted across restarts

Why Gemini 2.5 Flash

  • Thinking mode (thinking_budget) is unique to the Gemini family — it lets the model reason step-by-step through complex multi-table joins before committing to SQL, dramatically reducing hallucinated column names
  • Temperature 0.1 keeps SQL deterministic and precise
  • 8192 output token limit handles large schema contexts and verbose SQL
  • Safety settings all OFF — internal tool, no content filtering needed

Why Vertex AI instead of AI Studio

The bot was originally on an AI Studio API key (AQ.* prefix keys, free tier). It was switched to Google Cloud Vertex AI because:

  • Org's GCP billing covers it — no free-tier rate limits or expiry
  • ADC authentication (Application Default Credentials via gcloud auth application-default login) — no API key stored in any file
  • The same google-genai SDK works for both; only the client initialisation differs
# AI Studio (old)
gclient = genai.Client(api_key=GEMINI_API_KEY)

# Vertex AI (current)
gclient = genai.Client(vertexai=True, project=GOOGLE_CLOUD_PROJECT, location=GOOGLE_CLOUD_LOCATION)

How the bot was trained

JattBot has no traditional ML training. Its accuracy comes entirely from what is loaded into the system prompt at runtime. This is the standard pattern for domain-specific LLM tools — the model's general SQL and reasoning ability is fixed; we steer it with context.

There are three sources of context loaded on every request:

1. System prompt — get_system_prompt() in app.py

This is the primary intelligence layer. It is rebuilt on every chat request and injected as the Gemini system instruction. It contains:

  • Strict SQL rules — no LIMIT on outer queries, read-only validation, is_active filter requirements, test data exclusion patterns, farmer_media.is_archive = false requirement, double-quoting "UserManager"
  • Domain glossary — what a kyari is, what a farm is, what RS means, what AWD/Biochar are, what verification_status values mean
  • Location hierarchycountry → state → district → block → farmer, and how to match each level with ILIKE
  • Canonical join paths — hardcoded chains like ag_farm_kyari → ag_farm → farmer → block → district → state → country so the AI never guesses join conditions
  • Eligibility logic — how ACCEPTED / REJECTED / PENDING / MISSING statuses are derived for farmers, farms, and kyaris from farmer_media verification states using MAX(CASE WHEN ref_sub_type = '...' THEN verification_status END) pivot patterns
  • ARR Andhra Pradesh onboarding metrics — complete dictionary of all ~80 output columns from the main ARR dashboard query, cohort formula, scope filters, TAT/SLA rules, ArrFPIC document types, soil sampling and LSC table structures
  • TAT/SLA calculation — the gate-adjusted TAT clock logic: GREATEST(kyari_verified_on, fm.created_on), 3-day SLA definition, cohort grouping
  • Plantation kyari funnel — all 11 progressive stages from kyari acceptance through geo-tagging approval
  • Species data model — difference between plantation (saplings) and retrospective (ag_farm_specie) tree counts
  • survey_detail exclusion rule — when and how to join survey_detail to exclude Bhutan baseline-only farmers (only on direct farmer count queries)
  • After-results instructions — always interpret with real numbers, flag surprises, suggest one follow-up

This section is the main lever for accuracy. When the bot gives wrong answers, the fix is almost always in get_system_prompt().

2. Example queries — backend/example_queries.txt

A library of ~18 worked SQL examples covering common request patterns. These are appended verbatim to the schema context before every call. The AI uses them as reference patterns for new queries.

Examples cover:

  • Country/state/district farmer and kyari counts
  • Partner-wise breakdowns (joining through accounts.usersaccounts.organizations)
  • Documentation status pivots (land record, ID card, profile picture)
  • Plantation kyari funnel (species, saplings, spacing, mid-plantation, geo-tagging)
  • Retrospective kyari tree counts
  • Plantation model breakdowns
  • RS verification status
  • ARR AP onboarding summary by 6-month cohort
  • Farmer TAT by cohort (gate-adjusted)
  • Farm TAT by cohort (gate-adjusted)

Adding a new example here is the fastest way to teach the bot a new query pattern.

3. Live schema — discover_schema() in app.py

The full database schema auto-discovered from information_schema. The AI gets:

  • All 442+ table names with column names, data types, nullable flags
  • Primary keys and foreign keys (where visible to the read-only user)
  • Approximate row counts per table (from pg_class.reltuples)
  • Unique indexes
  • Sampled actual values from ~20 key lookup columns — e.g., the exact set of state.name values, ag_farm_kyari.kyari_type values, farmer_media.ref_sub_type values, partner/org names — so WHERE clause filters are never guessed

Core tables (farmer, ag_farm, ag_farm_kyari, farmer_media, etc.) get full column-level detail. Other tables get just the column name list. Empty non-core tables are omitted. This keeps the schema context under ~25k tokens.


Domain knowledge embedded

The system prompt encodes Varaha's full data model and business rules for the following programs:

ARR (Afforestation/Reforestation/Revegetation) — Andhra Pradesh

The primary program. Onboarding is tracked through a 6-month cohort system (region_split), with a full pipeline of stages from farmer registration to geo-tagged plantation proof.

Farmer eligibility — derived from farmer_media:

  • IdentityImage (ID card) — ref_type='Farmer'
  • ProfilePictureref_type='Farmer'
  • ArrFPIC (FPIC English consent) — ref_type='Farmer' (ARR-specific, distinct from AgroFarmFPIC used in other programs)
  • ArrFPICLocal (FPIC local language) — ref_type='Farmer'
  • eKYC, BAV — tracked separately

Farm eligibility — requires all four: land record, land declaration (if family-owned), accepted kyari, and accepted boundary:

  • AgroFarmLandRecordref_type='AgroFarm' (required for all farms)
  • AgroFarmLandLordDeclarationref_type='AgroFarm' (family-owned only)
  • ag_farm.boundary_status — KML polygon validation

TAT / SLA — 3-day target. Clock starts at the later of kyari acceptance or doc upload (gate-adjusted). Cohorts are 6-month periods based on TAT start date.

Plantation pipeline — 11 progressive stages from kyari acceptance through species distribution, pit digging, spacing validation, mid-plantation proof, and geo-tagging.

Monitoringmonitoring_farm table tracks plantation completion status and model adherence per kyari.

Soil samplingsoil_sample_session → soil_sample_hole → soil_sample_hole_mapping → soil_sample chain.

LSC (Livelihood/Land Stakeholder Consultation)lsc table, filtered by project_id LIKE 'ARR-%'.

AWD (Alternate Wetting & Drying)

Rice paddy water conservation project. Tables: awd_farmer, awd_farm, awd_tube.

Biochar

Charcoal soil amendment project. Tables: bio_char_site, bio_char_kiln, production_site.

Multi-country

Farmers, farms, and kyaris span India, Nepal, Bangladesh, Bhutan, and Kenya. Location hierarchy: country → state → district → block → farmer.


Quick start

Prerequisites

  • Python 3.9
  • VPN access to backenddb-read.varaha.com
  • Google Cloud account with gcloud CLI installed and authenticated

1. Clone and install

git clone https://github.com/rohyeeet/QueryGPT.git
cd QueryGPT
pip3 install -r requirements.txt

2. Configure credentials

cp env_template.txt .env

Edit .env:

# Google Cloud Vertex AI (recommended — billed to org GCP project)
GOOGLE_CLOUD_PROJECT=gen-lang-client-0780883925
GOOGLE_CLOUD_LOCATION=us-central1

# Database
DB_HOST=backenddb-read.varaha.com
DB_NAME=production
DB_READ_USER_NAME=readonly
DB_READ_PASSWORD=readonly$32412
DB_PORT=5432

# Model
GEMINI_MODEL=gemini-2.5-flash

Authenticate with Google Cloud (one-time per machine):

gcloud auth application-default login

3. Run

bash start_server.sh
# or directly:
/usr/bin/python3 backend/app.py

Open index.html in your browser. Connect to VPN before asking questions that need live data.

Example questions

How many kyaris onboarded in Andhra Pradesh?
Show ARR onboarding numbers by 6-month cohort
What's the farmer doc TAT for the Jan 25 cohort?
Farms with missing land records in AP
Top 5 partners by active farmers
Documentation status by partner
Plantation kyaris created in the last 30 days
RS verification breakdown for AP kyaris
How many kyaris are at each plantation stage?
Show monitoring completion rates by cohort

API reference

Endpoint Method Body Description
/api/chat POST {message, session_id, image?, image_type?} Main chat — NL → SQL → results → interpretation
/api/execute-query POST {sql} Run raw SQL directly (no AI)
/api/schema GET Schema status, table list with row counts
/api/schema/refresh POST Trigger immediate full schema re-discovery
/api/schema/sanity GET Run 10-point sanity checks on schema completeness
/api/sessions GET List active chat sessions
/api/sessions/:id DELETE Clear a session's history
/api/check-connection GET Test DB connectivity
/api/health GET Health check with AI auth mode, model, schema status

/api/chat response shape

{
  "message": "Plain-English interpretation of results",
  "sql": "SELECT ...",
  "results": [{"col": "val", ...}, ...],
  "row_count": 42,
  "execution_time": 1.23,
  "sql_error": null,
  "session_id": "abc123"
}

sql_error values: null (success), "vpn_required" (DB unreachable), or a PostgreSQL error string (triggers auto-retry).


Project structure

QueryGPT/
├── backend/
│   ├── app.py                    # Flask server — all logic lives here
│   │                             #   get_system_prompt()  ← AI brain
│   │                             #   discover_schema()    ← live DB introspection
│   │                             #   call_gemini()        ← Vertex AI call
│   │                             #   /api/chat            ← main route
│   ├── example_queries.txt       # Reference SQL patterns injected into AI context
│   ├── schema_context.txt        # Static fallback schema snapshot (283KB)
│   ├── discovered_schema.txt     # Live auto-discovered schema (regenerated every 3h)
│   └── discovered_schema.json    # Structured version of above
├── index.html                    # Single-file chat UI (HTML + CSS + JS, no build)
├── requirements.txt
├── env_template.txt              # Config template with comments
├── start_server.sh               # Launch script (auto-detects correct Python)
└── README.md

Key functions in app.py

Function What it does
get_system_prompt() Assembles the full system instruction: SQL rules + domain knowledge + live schema
discover_schema() Queries information_schema + runs sample queries; builds schema text for the AI
_schema_refresh_worker() Background thread: full refresh every 3h, values refresh every 1h
_load_cached_schema() Loads from disk on startup if DB is unreachable
call_gemini(history) Calls Vertex AI with full conversation history and system prompt
validate_sql(sql) Blocks write operations; enforces SELECT/WITH only
extract_sql(text) Pulls SQL from <SQL>...</SQL> tags in model response
run_sql(sql) Executes query, serialises datetimes, returns (rows, elapsed)

Extending the bot

Adding a new query pattern

Add an example to backend/example_queries.txt in the format:

=== Short description ===
REQUEST: "How a user would phrase this"
SQL:
SELECT ...

The example is appended to every system prompt call automatically.

Teaching a new business rule

Edit get_system_prompt() in backend/app.py. Add a new section with the ═══ separator style. Document:

  • What the rule is
  • The exact column names and values involved
  • Example SQL that demonstrates correct usage

Adding a new lookup table's sampled values

In discover_schema()SAMPLE_QUERIES dict, add:

'my_new_key': "SELECT DISTINCT my_column FROM my_table WHERE my_column IS NOT NULL ORDER BY my_column"

Then register it in the display section below that dict with a human-readable label.

Switching the AI model

Set GEMINI_MODEL=gemini-2.5-pro in .env for harder queries. The -pro variant has stronger reasoning at the cost of latency. Both support thinking_budget.


Security

  • Read-only enforced at two levels: the DB user has no write permissions; the server also blocks all write SQL keywords (INSERT, UPDATE, DELETE, DROP, CREATE, ALTER, TRUNCATE, GRANT, REVOKE)
  • No credentials in code: DB password and GCP project via .env (not committed); GCP auth via ADC, no API key stored
  • Session isolation: each session ID has independent conversation history; no cross-user leakage
  • Internal only: no authentication on the HTTP endpoints — intended for VPN-only internal access

Built for Varaha's internal data team.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors