JattBot — Varaha Data Assistant

An internal data assistant for Varaha's agroforestry and carbon platform. Ask questions in plain English — JattBot generates the correct SQL, runs it against the production database, and gives you a plain-English interpretation with the real numbers.

How it works

You type a question in the chat UI ("How many kyaris onboarded in Andhra Pradesh this cohort?")
The Flask backend assembles a system prompt containing the live database schema + all business rules
Gemini 2.5 Flash (thinking mode on) generates correct SQL wrapped in <SQL>...</SQL> tags
The backend validates (SELECT only), executes against the read-only production DB
Real results are fed back to Gemini for an accurate plain-English interpretation
You see a 5-row preview → confirm → expand all rows → export CSV or JSON

End-to-end architecture

┌─────────────────────────────────────────────────────────────────────┐
│  BROWSER  (index.html — single file, no build step)                 │
│                                                                     │
│  ┌──────────────────┐   POST /api/chat    ┌────────────────────┐   │
│  │  Chat interface  │ ─────────────────→  │                    │   │
│  │  SQL display     │                     │  Flask backend     │   │
│  │  Results table   │ ←──────────────── ─ │  (app.py :5001)    │   │
│  │  CSV / JSON exp. │   {message, sql,    │                    │   │
│  └──────────────────┘    results, error}  └────────┬───────────┘   │
└──────────────────────────────────────────────────── │ ─────────────┘
                                                       │
                    ┌──────────────────────────────────┤
                    │                                  │
                    ▼                                  ▼
     ┌──────────────────────────┐      ┌───────────────────────────┐
     │  Google Cloud Vertex AI  │      │  PostgreSQL (read-only)   │
     │                          │      │                           │
     │  Project: QueryGPT-Final │      │  Host: backenddb-read     │
     │  Model: gemini-2.5-flash │      │       .varaha.com         │
     │  Auth: ADC (gcloud)      │      │  User: readonly           │
     │  thinking_budget: 2048   │      │  ~442 tables              │
     └──────────────────────────┘      └───────────────────────────┘

Request lifecycle

User message
     │
     ▼
[1] Build context
     │  schema_cache['schema_text']  ← live DB schema (auto-discovered, 3h refresh)
     │  get_system_prompt()          ← domain rules, business logic, join paths
     │  example_queries.txt          ← reference SQL patterns
     │  chat history (last 30 msgs)  ← multi-turn memory
     │
     ▼
[2] First Gemini call  →  generates SQL in <SQL>...</SQL> tags + explanation
     │
     ▼
[3] Validate SQL
     │  Blocks: INSERT / UPDATE / DELETE / DROP / ALTER / TRUNCATE / GRANT
     │  Requires: query must start with SELECT or WITH
     │
     ▼
[4] Execute on DB  →  rows, elapsed_seconds
     │  On SQL error: auto-retry (feeds error back to Gemini, one retry)
     │  On timeout/connect error: returns vpn_required flag to UI
     │
     ▼
[5] Second Gemini call  →  interprets *actual* result rows
     │  "Using ONLY these real numbers, write a 2-3 sentence interpretation"
     │  Prevents hallucinated numbers in the explanation
     │
     ▼
[6] Return to UI
     {message, sql, results[], row_count, execution_time, sql_error, session_id}

Schema discovery

On startup the backend runs discover_schema() in a background thread. It queries information_schema for all tables, columns, primary keys, foreign keys, row counts, and unique indexes. It also runs ~20 sample queries against key lookup tables to pull the actual enum values (e.g., all state.name values, all ag_farm_kyari.kyari_type values). This gives the AI exact filter values instead of guessing.

The full text is written to backend/discovered_schema.txt and cached in memory. A background worker refreshes the full schema every 3 hours and refreshes sampled values every 1 hour.

If the database is unreachable (e.g., no VPN), the server falls back to the last cached discovered_schema.txt, and if that doesn't exist, to the static backend/schema_context.txt snapshot — so the AI still works offline, just with a potentially stale schema.

Tools & tech stack

Layer	Tool	Purpose
AI model	Google Gemini 2.5 Flash	NL → SQL generation and result interpretation
AI thinking	`thinking_budget=2048`	Gemini reasons through join paths before writing SQL
AI SDK	`google-genai` (Python)	Official Google Gen AI client library
AI auth	Google Cloud Vertex AI + ADC	Billed to org's GCP project (QueryGPT-Final); no API key stored
Backend	Flask 3.0 (Python)	HTTP server, session management, schema cache, SQL execution
Database	PostgreSQL (psycopg2)	Read-only production replica via VPN
Frontend	Vanilla HTML/CSS/JS	Single-file UI — no framework, no build step
Session memory	In-memory dict	Last 30 messages per session for multi-turn conversation
Schema cache	In-memory + disk	`discovered_schema.txt` / `.json` persisted across restarts

Why Gemini 2.5 Flash

Thinking mode (thinking_budget) is unique to the Gemini family — it lets the model reason step-by-step through complex multi-table joins before committing to SQL, dramatically reducing hallucinated column names
Temperature 0.1 keeps SQL deterministic and precise
8192 output token limit handles large schema contexts and verbose SQL
Safety settings all OFF — internal tool, no content filtering needed

Why Vertex AI instead of AI Studio

The bot was originally on an AI Studio API key (AQ.* prefix keys, free tier). It was switched to Google Cloud Vertex AI because:

Org's GCP billing covers it — no free-tier rate limits or expiry
ADC authentication (Application Default Credentials via gcloud auth application-default login) — no API key stored in any file
The same google-genai SDK works for both; only the client initialisation differs

# AI Studio (old)
gclient = genai.Client(api_key=GEMINI_API_KEY)

# Vertex AI (current)
gclient = genai.Client(vertexai=True, project=GOOGLE_CLOUD_PROJECT, location=GOOGLE_CLOUD_LOCATION)

How the bot was trained

JattBot has no traditional ML training. Its accuracy comes entirely from what is loaded into the system prompt at runtime. This is the standard pattern for domain-specific LLM tools — the model's general SQL and reasoning ability is fixed; we steer it with context.

There are three sources of context loaded on every request:

1. System prompt — `get_system_prompt()` in `app.py`

This is the primary intelligence layer. It is rebuilt on every chat request and injected as the Gemini system instruction. It contains:

Strict SQL rules — no LIMIT on outer queries, read-only validation, is_active filter requirements, test data exclusion patterns, farmer_media.is_archive = false requirement, double-quoting "UserManager"
Domain glossary — what a kyari is, what a farm is, what RS means, what AWD/Biochar are, what verification_status values mean
Location hierarchy — country → state → district → block → farmer, and how to match each level with ILIKE
Canonical join paths — hardcoded chains like ag_farm_kyari → ag_farm → farmer → block → district → state → country so the AI never guesses join conditions
Eligibility logic — how ACCEPTED / REJECTED / PENDING / MISSING statuses are derived for farmers, farms, and kyaris from farmer_media verification states using MAX(CASE WHEN ref_sub_type = '...' THEN verification_status END) pivot patterns
ARR Andhra Pradesh onboarding metrics — complete dictionary of all ~80 output columns from the main ARR dashboard query, cohort formula, scope filters, TAT/SLA rules, ArrFPIC document types, soil sampling and LSC table structures
TAT/SLA calculation — the gate-adjusted TAT clock logic: GREATEST(kyari_verified_on, fm.created_on), 3-day SLA definition, cohort grouping
Plantation kyari funnel — all 11 progressive stages from kyari acceptance through geo-tagging approval
Species data model — difference between plantation (saplings) and retrospective (ag_farm_specie) tree counts
survey_detail exclusion rule — when and how to join survey_detail to exclude Bhutan baseline-only farmers (only on direct farmer count queries)
After-results instructions — always interpret with real numbers, flag surprises, suggest one follow-up

This section is the main lever for accuracy. When the bot gives wrong answers, the fix is almost always in get_system_prompt().

2. Example queries — `backend/example_queries.txt`

A library of ~18 worked SQL examples covering common request patterns. These are appended verbatim to the schema context before every call. The AI uses them as reference patterns for new queries.

Examples cover:

Country/state/district farmer and kyari counts
Partner-wise breakdowns (joining through accounts.users → accounts.organizations)
Documentation status pivots (land record, ID card, profile picture)
Plantation kyari funnel (species, saplings, spacing, mid-plantation, geo-tagging)
Retrospective kyari tree counts
Plantation model breakdowns
RS verification status
ARR AP onboarding summary by 6-month cohort
Farmer TAT by cohort (gate-adjusted)
Farm TAT by cohort (gate-adjusted)

Adding a new example here is the fastest way to teach the bot a new query pattern.

3. Live schema — `discover_schema()` in `app.py`

The full database schema auto-discovered from information_schema. The AI gets:

All 442+ table names with column names, data types, nullable flags
Primary keys and foreign keys (where visible to the read-only user)
Approximate row counts per table (from pg_class.reltuples)
Unique indexes
Sampled actual values from ~20 key lookup columns — e.g., the exact set of state.name values, ag_farm_kyari.kyari_type values, farmer_media.ref_sub_type values, partner/org names — so WHERE clause filters are never guessed

Core tables (farmer, ag_farm, ag_farm_kyari, farmer_media, etc.) get full column-level detail. Other tables get just the column name list. Empty non-core tables are omitted. This keeps the schema context under ~25k tokens.

Domain knowledge embedded

The system prompt encodes Varaha's full data model and business rules for the following programs:

ARR (Afforestation/Reforestation/Revegetation) — Andhra Pradesh

The primary program. Onboarding is tracked through a 6-month cohort system (region_split), with a full pipeline of stages from farmer registration to geo-tagged plantation proof.

Farmer eligibility — derived from farmer_media:

IdentityImage (ID card) — ref_type='Farmer'
ProfilePicture — ref_type='Farmer'
ArrFPIC (FPIC English consent) — ref_type='Farmer' (ARR-specific, distinct from AgroFarmFPIC used in other programs)
ArrFPICLocal (FPIC local language) — ref_type='Farmer'
eKYC, BAV — tracked separately

Farm eligibility — requires all four: land record, land declaration (if family-owned), accepted kyari, and accepted boundary:

AgroFarmLandRecord — ref_type='AgroFarm' (required for all farms)
AgroFarmLandLordDeclaration — ref_type='AgroFarm' (family-owned only)
ag_farm.boundary_status — KML polygon validation

TAT / SLA — 3-day target. Clock starts at the later of kyari acceptance or doc upload (gate-adjusted). Cohorts are 6-month periods based on TAT start date.

Plantation pipeline — 11 progressive stages from kyari acceptance through species distribution, pit digging, spacing validation, mid-plantation proof, and geo-tagging.

Monitoring — monitoring_farm table tracks plantation completion status and model adherence per kyari.

Soil sampling — soil_sample_session → soil_sample_hole → soil_sample_hole_mapping → soil_sample chain.

LSC (Livelihood/Land Stakeholder Consultation) — lsc table, filtered by project_id LIKE 'ARR-%'.

AWD (Alternate Wetting & Drying)

Rice paddy water conservation project. Tables: awd_farmer, awd_farm, awd_tube.

Biochar

Charcoal soil amendment project. Tables: bio_char_site, bio_char_kiln, production_site.

Multi-country

Farmers, farms, and kyaris span India, Nepal, Bangladesh, Bhutan, and Kenya. Location hierarchy: country → state → district → block → farmer.

Quick start

Prerequisites

Python 3.9
VPN access to backenddb-read.varaha.com
Google Cloud account with gcloud CLI installed and authenticated

1. Clone and install

git clone https://github.com/rohyeeet/QueryGPT.git
cd QueryGPT
pip3 install -r requirements.txt

2. Configure credentials

cp env_template.txt .env

Edit .env:

# Google Cloud Vertex AI (recommended — billed to org GCP project)
GOOGLE_CLOUD_PROJECT=gen-lang-client-0780883925
GOOGLE_CLOUD_LOCATION=us-central1

# Database
DB_HOST=backenddb-read.varaha.com
DB_NAME=production
DB_READ_USER_NAME=readonly
DB_READ_PASSWORD=readonly$32412
DB_PORT=5432

# Model
GEMINI_MODEL=gemini-2.5-flash

Authenticate with Google Cloud (one-time per machine):

gcloud auth application-default login

3. Run

bash start_server.sh
# or directly:
/usr/bin/python3 backend/app.py

Open index.html in your browser. Connect to VPN before asking questions that need live data.

Example questions

How many kyaris onboarded in Andhra Pradesh?
Show ARR onboarding numbers by 6-month cohort
What's the farmer doc TAT for the Jan 25 cohort?
Farms with missing land records in AP
Top 5 partners by active farmers
Documentation status by partner
Plantation kyaris created in the last 30 days
RS verification breakdown for AP kyaris
How many kyaris are at each plantation stage?
Show monitoring completion rates by cohort

API reference

Endpoint	Method	Body	Description
`/api/chat`	POST	`{message, session_id, image?, image_type?}`	Main chat — NL → SQL → results → interpretation
`/api/execute-query`	POST	`{sql}`	Run raw SQL directly (no AI)
`/api/schema`	GET	—	Schema status, table list with row counts
`/api/schema/refresh`	POST	—	Trigger immediate full schema re-discovery
`/api/schema/sanity`	GET	—	Run 10-point sanity checks on schema completeness
`/api/sessions`	GET	—	List active chat sessions
`/api/sessions/:id`	DELETE	—	Clear a session's history
`/api/check-connection`	GET	—	Test DB connectivity
`/api/health`	GET	—	Health check with AI auth mode, model, schema status

`/api/chat` response shape

{
  "message": "Plain-English interpretation of results",
  "sql": "SELECT ...",
  "results": [{"col": "val", ...}, ...],
  "row_count": 42,
  "execution_time": 1.23,
  "sql_error": null,
  "session_id": "abc123"
}

sql_error values: null (success), "vpn_required" (DB unreachable), or a PostgreSQL error string (triggers auto-retry).

Project structure

QueryGPT/
├── backend/
│   ├── app.py                    # Flask server — all logic lives here
│   │                             #   get_system_prompt()  ← AI brain
│   │                             #   discover_schema()    ← live DB introspection
│   │                             #   call_gemini()        ← Vertex AI call
│   │                             #   /api/chat            ← main route
│   ├── example_queries.txt       # Reference SQL patterns injected into AI context
│   ├── schema_context.txt        # Static fallback schema snapshot (283KB)
│   ├── discovered_schema.txt     # Live auto-discovered schema (regenerated every 3h)
│   └── discovered_schema.json    # Structured version of above
├── index.html                    # Single-file chat UI (HTML + CSS + JS, no build)
├── requirements.txt
├── env_template.txt              # Config template with comments
├── start_server.sh               # Launch script (auto-detects correct Python)
└── README.md

Key functions in `app.py`

Function	What it does
`get_system_prompt()`	Assembles the full system instruction: SQL rules + domain knowledge + live schema
`discover_schema()`	Queries `information_schema` + runs sample queries; builds schema text for the AI
`_schema_refresh_worker()`	Background thread: full refresh every 3h, values refresh every 1h
`_load_cached_schema()`	Loads from disk on startup if DB is unreachable
`call_gemini(history)`	Calls Vertex AI with full conversation history and system prompt
`validate_sql(sql)`	Blocks write operations; enforces SELECT/WITH only
`extract_sql(text)`	Pulls SQL from `<SQL>...</SQL>` tags in model response
`run_sql(sql)`	Executes query, serialises datetimes, returns (rows, elapsed)

Extending the bot

Adding a new query pattern

Add an example to backend/example_queries.txt in the format:

=== Short description ===
REQUEST: "How a user would phrase this"
SQL:
SELECT ...

The example is appended to every system prompt call automatically.

Teaching a new business rule

Edit get_system_prompt() in backend/app.py. Add a new section with the ═══ separator style. Document:

What the rule is
The exact column names and values involved
Example SQL that demonstrates correct usage

Adding a new lookup table's sampled values

In discover_schema() → SAMPLE_QUERIES dict, add:

'my_new_key': "SELECT DISTINCT my_column FROM my_table WHERE my_column IS NOT NULL ORDER BY my_column"

Then register it in the display section below that dict with a human-readable label.

Switching the AI model

Set GEMINI_MODEL=gemini-2.5-pro in .env for harder queries. The -pro variant has stronger reasoning at the cost of latency. Both support thinking_budget.

Security

Read-only enforced at two levels: the DB user has no write permissions; the server also blocks all write SQL keywords (INSERT, UPDATE, DELETE, DROP, CREATE, ALTER, TRUNCATE, GRANT, REVOKE)
No credentials in code: DB password and GCP project via .env (not committed); GCP auth via ADC, no API key stored
Session isolation: each session ID has independent conversation history; no cross-user leakage
Internal only: no authentication on the HTTP endpoints — intended for VPN-only internal access

Built for Varaha's internal data team.

Name		Name	Last commit message	Last commit date
Latest commit History 20 Commits
backend		backend
.gitignore		.gitignore
README.md		README.md
SETUP_GUIDE.md		SETUP_GUIDE.md
START_HERE.md		START_HERE.md
env_template.txt		env_template.txt
index.html		index.html
requirements.txt		requirements.txt
setup_github.md		setup_github.md
start_server.sh		start_server.sh

Folders and files

Latest commit

History

Repository files navigation

JattBot — Varaha Data Assistant

Table of Contents

How it works

End-to-end architecture

Request lifecycle

Schema discovery

Tools & tech stack

Why Gemini 2.5 Flash

Why Vertex AI instead of AI Studio

How the bot was trained

1. System prompt — get_system_prompt() in app.py

2. Example queries — backend/example_queries.txt

3. Live schema — discover_schema() in app.py

Domain knowledge embedded

ARR (Afforestation/Reforestation/Revegetation) — Andhra Pradesh

AWD (Alternate Wetting & Drying)

Biochar

Multi-country

Quick start

Prerequisites

1. Clone and install

2. Configure credentials

3. Run

Example questions

API reference

/api/chat response shape

Project structure

Key functions in app.py

Extending the bot

Adding a new query pattern

Teaching a new business rule

Adding a new lookup table's sampled values

Switching the AI model

Security

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

1. System prompt — `get_system_prompt()` in `app.py`

2. Example queries — `backend/example_queries.txt`

3. Live schema — `discover_schema()` in `app.py`

`/api/chat` response shape

Key functions in `app.py`

Packages