An internal data assistant for Varaha's agroforestry and carbon platform. Ask questions in plain English — JattBot generates the correct SQL, runs it against the production database, and gives you a plain-English interpretation with the real numbers.
- How it works (quick)
- End-to-end architecture
- Tools & tech stack
- How the bot was trained
- Domain knowledge embedded
- Quick start
- API reference
- Project structure
- Extending the bot
- You type a question in the chat UI ("How many kyaris onboarded in Andhra Pradesh this cohort?")
- The Flask backend assembles a system prompt containing the live database schema + all business rules
- Gemini 2.5 Flash (thinking mode on) generates correct SQL wrapped in
<SQL>...</SQL>tags - The backend validates (SELECT only), executes against the read-only production DB
- Real results are fed back to Gemini for an accurate plain-English interpretation
- You see a 5-row preview → confirm → expand all rows → export CSV or JSON
┌─────────────────────────────────────────────────────────────────────┐
│ BROWSER (index.html — single file, no build step) │
│ │
│ ┌──────────────────┐ POST /api/chat ┌────────────────────┐ │
│ │ Chat interface │ ─────────────────→ │ │ │
│ │ SQL display │ │ Flask backend │ │
│ │ Results table │ ←──────────────── ─ │ (app.py :5001) │ │
│ │ CSV / JSON exp. │ {message, sql, │ │ │
│ └──────────────────┘ results, error} └────────┬───────────┘ │
└──────────────────────────────────────────────────── │ ─────────────┘
│
┌──────────────────────────────────┤
│ │
▼ ▼
┌──────────────────────────┐ ┌───────────────────────────┐
│ Google Cloud Vertex AI │ │ PostgreSQL (read-only) │
│ │ │ │
│ Project: QueryGPT-Final │ │ Host: backenddb-read │
│ Model: gemini-2.5-flash │ │ .varaha.com │
│ Auth: ADC (gcloud) │ │ User: readonly │
│ thinking_budget: 2048 │ │ ~442 tables │
└──────────────────────────┘ └───────────────────────────┘
User message
│
▼
[1] Build context
│ schema_cache['schema_text'] ← live DB schema (auto-discovered, 3h refresh)
│ get_system_prompt() ← domain rules, business logic, join paths
│ example_queries.txt ← reference SQL patterns
│ chat history (last 30 msgs) ← multi-turn memory
│
▼
[2] First Gemini call → generates SQL in <SQL>...</SQL> tags + explanation
│
▼
[3] Validate SQL
│ Blocks: INSERT / UPDATE / DELETE / DROP / ALTER / TRUNCATE / GRANT
│ Requires: query must start with SELECT or WITH
│
▼
[4] Execute on DB → rows, elapsed_seconds
│ On SQL error: auto-retry (feeds error back to Gemini, one retry)
│ On timeout/connect error: returns vpn_required flag to UI
│
▼
[5] Second Gemini call → interprets *actual* result rows
│ "Using ONLY these real numbers, write a 2-3 sentence interpretation"
│ Prevents hallucinated numbers in the explanation
│
▼
[6] Return to UI
{message, sql, results[], row_count, execution_time, sql_error, session_id}
On startup the backend runs discover_schema() in a background thread. It queries information_schema for all tables, columns, primary keys, foreign keys, row counts, and unique indexes. It also runs ~20 sample queries against key lookup tables to pull the actual enum values (e.g., all state.name values, all ag_farm_kyari.kyari_type values). This gives the AI exact filter values instead of guessing.
The full text is written to backend/discovered_schema.txt and cached in memory. A background worker refreshes the full schema every 3 hours and refreshes sampled values every 1 hour.
If the database is unreachable (e.g., no VPN), the server falls back to the last cached discovered_schema.txt, and if that doesn't exist, to the static backend/schema_context.txt snapshot — so the AI still works offline, just with a potentially stale schema.
| Layer | Tool | Purpose |
|---|---|---|
| AI model | Google Gemini 2.5 Flash | NL → SQL generation and result interpretation |
| AI thinking | thinking_budget=2048 |
Gemini reasons through join paths before writing SQL |
| AI SDK | google-genai (Python) |
Official Google Gen AI client library |
| AI auth | Google Cloud Vertex AI + ADC | Billed to org's GCP project (QueryGPT-Final); no API key stored |
| Backend | Flask 3.0 (Python) | HTTP server, session management, schema cache, SQL execution |
| Database | PostgreSQL (psycopg2) | Read-only production replica via VPN |
| Frontend | Vanilla HTML/CSS/JS | Single-file UI — no framework, no build step |
| Session memory | In-memory dict | Last 30 messages per session for multi-turn conversation |
| Schema cache | In-memory + disk | discovered_schema.txt / .json persisted across restarts |
- Thinking mode (
thinking_budget) is unique to the Gemini family — it lets the model reason step-by-step through complex multi-table joins before committing to SQL, dramatically reducing hallucinated column names - Temperature 0.1 keeps SQL deterministic and precise
- 8192 output token limit handles large schema contexts and verbose SQL
- Safety settings all OFF — internal tool, no content filtering needed
The bot was originally on an AI Studio API key (AQ.* prefix keys, free tier). It was switched to Google Cloud Vertex AI because:
- Org's GCP billing covers it — no free-tier rate limits or expiry
- ADC authentication (Application Default Credentials via
gcloud auth application-default login) — no API key stored in any file - The same
google-genaiSDK works for both; only the client initialisation differs
# AI Studio (old)
gclient = genai.Client(api_key=GEMINI_API_KEY)
# Vertex AI (current)
gclient = genai.Client(vertexai=True, project=GOOGLE_CLOUD_PROJECT, location=GOOGLE_CLOUD_LOCATION)JattBot has no traditional ML training. Its accuracy comes entirely from what is loaded into the system prompt at runtime. This is the standard pattern for domain-specific LLM tools — the model's general SQL and reasoning ability is fixed; we steer it with context.
There are three sources of context loaded on every request:
This is the primary intelligence layer. It is rebuilt on every chat request and injected as the Gemini system instruction. It contains:
- Strict SQL rules — no LIMIT on outer queries, read-only validation,
is_activefilter requirements, test data exclusion patterns,farmer_media.is_archive = falserequirement, double-quoting"UserManager" - Domain glossary — what a kyari is, what a farm is, what RS means, what AWD/Biochar are, what verification_status values mean
- Location hierarchy —
country → state → district → block → farmer, and how to match each level withILIKE - Canonical join paths — hardcoded chains like
ag_farm_kyari → ag_farm → farmer → block → district → state → countryso the AI never guesses join conditions - Eligibility logic — how
ACCEPTED / REJECTED / PENDING / MISSINGstatuses are derived for farmers, farms, and kyaris fromfarmer_mediaverification states usingMAX(CASE WHEN ref_sub_type = '...' THEN verification_status END)pivot patterns - ARR Andhra Pradesh onboarding metrics — complete dictionary of all ~80 output columns from the main ARR dashboard query, cohort formula, scope filters, TAT/SLA rules, ArrFPIC document types, soil sampling and LSC table structures
- TAT/SLA calculation — the gate-adjusted TAT clock logic:
GREATEST(kyari_verified_on, fm.created_on), 3-day SLA definition, cohort grouping - Plantation kyari funnel — all 11 progressive stages from kyari acceptance through geo-tagging approval
- Species data model — difference between plantation (saplings) and retrospective (ag_farm_specie) tree counts
- survey_detail exclusion rule — when and how to join
survey_detailto exclude Bhutan baseline-only farmers (only on direct farmer count queries) - After-results instructions — always interpret with real numbers, flag surprises, suggest one follow-up
This section is the main lever for accuracy. When the bot gives wrong answers, the fix is almost always in get_system_prompt().
A library of ~18 worked SQL examples covering common request patterns. These are appended verbatim to the schema context before every call. The AI uses them as reference patterns for new queries.
Examples cover:
- Country/state/district farmer and kyari counts
- Partner-wise breakdowns (joining through
accounts.users→accounts.organizations) - Documentation status pivots (land record, ID card, profile picture)
- Plantation kyari funnel (species, saplings, spacing, mid-plantation, geo-tagging)
- Retrospective kyari tree counts
- Plantation model breakdowns
- RS verification status
- ARR AP onboarding summary by 6-month cohort
- Farmer TAT by cohort (gate-adjusted)
- Farm TAT by cohort (gate-adjusted)
Adding a new example here is the fastest way to teach the bot a new query pattern.
The full database schema auto-discovered from information_schema. The AI gets:
- All 442+ table names with column names, data types, nullable flags
- Primary keys and foreign keys (where visible to the read-only user)
- Approximate row counts per table (from
pg_class.reltuples) - Unique indexes
- Sampled actual values from ~20 key lookup columns — e.g., the exact set of
state.namevalues,ag_farm_kyari.kyari_typevalues,farmer_media.ref_sub_typevalues, partner/org names — so WHERE clause filters are never guessed
Core tables (farmer, ag_farm, ag_farm_kyari, farmer_media, etc.) get full column-level detail. Other tables get just the column name list. Empty non-core tables are omitted. This keeps the schema context under ~25k tokens.
The system prompt encodes Varaha's full data model and business rules for the following programs:
The primary program. Onboarding is tracked through a 6-month cohort system (region_split), with a full pipeline of stages from farmer registration to geo-tagged plantation proof.
Farmer eligibility — derived from farmer_media:
IdentityImage(ID card) —ref_type='Farmer'ProfilePicture—ref_type='Farmer'ArrFPIC(FPIC English consent) —ref_type='Farmer'(ARR-specific, distinct from AgroFarmFPIC used in other programs)ArrFPICLocal(FPIC local language) —ref_type='Farmer'eKYC,BAV— tracked separately
Farm eligibility — requires all four: land record, land declaration (if family-owned), accepted kyari, and accepted boundary:
AgroFarmLandRecord—ref_type='AgroFarm'(required for all farms)AgroFarmLandLordDeclaration—ref_type='AgroFarm'(family-owned only)ag_farm.boundary_status— KML polygon validation
TAT / SLA — 3-day target. Clock starts at the later of kyari acceptance or doc upload (gate-adjusted). Cohorts are 6-month periods based on TAT start date.
Plantation pipeline — 11 progressive stages from kyari acceptance through species distribution, pit digging, spacing validation, mid-plantation proof, and geo-tagging.
Monitoring — monitoring_farm table tracks plantation completion status and model adherence per kyari.
Soil sampling — soil_sample_session → soil_sample_hole → soil_sample_hole_mapping → soil_sample chain.
LSC (Livelihood/Land Stakeholder Consultation) — lsc table, filtered by project_id LIKE 'ARR-%'.
Rice paddy water conservation project. Tables: awd_farmer, awd_farm, awd_tube.
Charcoal soil amendment project. Tables: bio_char_site, bio_char_kiln, production_site.
Farmers, farms, and kyaris span India, Nepal, Bangladesh, Bhutan, and Kenya. Location hierarchy: country → state → district → block → farmer.
- Python 3.9
- VPN access to
backenddb-read.varaha.com - Google Cloud account with
gcloudCLI installed and authenticated
git clone https://github.com/rohyeeet/QueryGPT.git
cd QueryGPT
pip3 install -r requirements.txtcp env_template.txt .envEdit .env:
# Google Cloud Vertex AI (recommended — billed to org GCP project)
GOOGLE_CLOUD_PROJECT=gen-lang-client-0780883925
GOOGLE_CLOUD_LOCATION=us-central1
# Database
DB_HOST=backenddb-read.varaha.com
DB_NAME=production
DB_READ_USER_NAME=readonly
DB_READ_PASSWORD=readonly$32412
DB_PORT=5432
# Model
GEMINI_MODEL=gemini-2.5-flashAuthenticate with Google Cloud (one-time per machine):
gcloud auth application-default loginbash start_server.sh
# or directly:
/usr/bin/python3 backend/app.pyOpen index.html in your browser. Connect to VPN before asking questions that need live data.
How many kyaris onboarded in Andhra Pradesh?
Show ARR onboarding numbers by 6-month cohort
What's the farmer doc TAT for the Jan 25 cohort?
Farms with missing land records in AP
Top 5 partners by active farmers
Documentation status by partner
Plantation kyaris created in the last 30 days
RS verification breakdown for AP kyaris
How many kyaris are at each plantation stage?
Show monitoring completion rates by cohort
| Endpoint | Method | Body | Description |
|---|---|---|---|
/api/chat |
POST | {message, session_id, image?, image_type?} |
Main chat — NL → SQL → results → interpretation |
/api/execute-query |
POST | {sql} |
Run raw SQL directly (no AI) |
/api/schema |
GET | — | Schema status, table list with row counts |
/api/schema/refresh |
POST | — | Trigger immediate full schema re-discovery |
/api/schema/sanity |
GET | — | Run 10-point sanity checks on schema completeness |
/api/sessions |
GET | — | List active chat sessions |
/api/sessions/:id |
DELETE | — | Clear a session's history |
/api/check-connection |
GET | — | Test DB connectivity |
/api/health |
GET | — | Health check with AI auth mode, model, schema status |
{
"message": "Plain-English interpretation of results",
"sql": "SELECT ...",
"results": [{"col": "val", ...}, ...],
"row_count": 42,
"execution_time": 1.23,
"sql_error": null,
"session_id": "abc123"
}sql_error values: null (success), "vpn_required" (DB unreachable), or a PostgreSQL error string (triggers auto-retry).
QueryGPT/
├── backend/
│ ├── app.py # Flask server — all logic lives here
│ │ # get_system_prompt() ← AI brain
│ │ # discover_schema() ← live DB introspection
│ │ # call_gemini() ← Vertex AI call
│ │ # /api/chat ← main route
│ ├── example_queries.txt # Reference SQL patterns injected into AI context
│ ├── schema_context.txt # Static fallback schema snapshot (283KB)
│ ├── discovered_schema.txt # Live auto-discovered schema (regenerated every 3h)
│ └── discovered_schema.json # Structured version of above
├── index.html # Single-file chat UI (HTML + CSS + JS, no build)
├── requirements.txt
├── env_template.txt # Config template with comments
├── start_server.sh # Launch script (auto-detects correct Python)
└── README.md
| Function | What it does |
|---|---|
get_system_prompt() |
Assembles the full system instruction: SQL rules + domain knowledge + live schema |
discover_schema() |
Queries information_schema + runs sample queries; builds schema text for the AI |
_schema_refresh_worker() |
Background thread: full refresh every 3h, values refresh every 1h |
_load_cached_schema() |
Loads from disk on startup if DB is unreachable |
call_gemini(history) |
Calls Vertex AI with full conversation history and system prompt |
validate_sql(sql) |
Blocks write operations; enforces SELECT/WITH only |
extract_sql(text) |
Pulls SQL from <SQL>...</SQL> tags in model response |
run_sql(sql) |
Executes query, serialises datetimes, returns (rows, elapsed) |
Add an example to backend/example_queries.txt in the format:
=== Short description ===
REQUEST: "How a user would phrase this"
SQL:
SELECT ...
The example is appended to every system prompt call automatically.
Edit get_system_prompt() in backend/app.py. Add a new section with the ═══ separator style. Document:
- What the rule is
- The exact column names and values involved
- Example SQL that demonstrates correct usage
In discover_schema() → SAMPLE_QUERIES dict, add:
'my_new_key': "SELECT DISTINCT my_column FROM my_table WHERE my_column IS NOT NULL ORDER BY my_column"Then register it in the display section below that dict with a human-readable label.
Set GEMINI_MODEL=gemini-2.5-pro in .env for harder queries. The -pro variant has stronger reasoning at the cost of latency. Both support thinking_budget.
- Read-only enforced at two levels: the DB user has no write permissions; the server also blocks all write SQL keywords (
INSERT,UPDATE,DELETE,DROP,CREATE,ALTER,TRUNCATE,GRANT,REVOKE) - No credentials in code: DB password and GCP project via
.env(not committed); GCP auth via ADC, no API key stored - Session isolation: each session ID has independent conversation history; no cross-user leakage
- Internal only: no authentication on the HTTP endpoints — intended for VPN-only internal access
Built for Varaha's internal data team.