perf: add ANN vector index if/when flat-scan query latency becomes a bottleneck (parked)

## Summary

Record the option of adding an **ANN (approximate nearest neighbor) vector index** to the LanceDB tables, so that *if/when* flat-scan query latency becomes a bottleneck on a large corpus, the plan and its cost are already written down. **Not something to do today**: query latency is fine, and an ANN index would add build and maintenance cost to `init`/`increment`, which is the actual pain point.

## Current state

The three LanceDB tables (`javacodeindex_java_code`, `sqlschemaindex_sql_schema`, `yamlconfigindex_yaml_config`) have:

- a **full-text search** index (`create_fts_index("text")`, created lazily on first `find`), and
- **no ANN index** — vector similarity (`search`) does an **exhaustive flat scan** over all rows.

On the corpora tested so far, the flat scan is fast enough that nobody has hit a query-latency wall.
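For readers less familiar with the term, an exhaustive flat scan amounts to scoring every row against the query and keeping the best `k`. The sketch below is pure Python for illustration only, not LanceDB's implementation; the cosine metric and the `(row_id, vector)` row shape are assumptions made for the example.

```python
import heapq
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def flat_scan_top_k(query, rows, k):
    """Exact top-k: score every row, keep the k best.

    `rows` is an iterable of (row_id, vector) pairs (an assumed shape).
    Work grows linearly with the number of rows, which is why this only
    becomes a problem on large corpora.
    """
    scored = ((cosine_similarity(query, vec), row_id) for row_id, vec in rows)
    return [row_id for _, row_id in heapq.nlargest(k, scored)]
```

The result is exact, and there is nothing to build or maintain at indexing time; the only cost is per-query work proportional to table size.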

## Why this issue exists (and why it's parked)

Profiling `init` on a medium Java corpus (Shopizer: ~1.2k files, ~3.9k chunks) on 2026-06-21 showed:

| phase | share of init |
|---|---|
| LadybugDB graph writes | ~81% |
| cocoindex vectors | ~17% |
| optimize | ~1% |

The project's pain point is unambiguously **`init`/`increment` time**, not query time. An ANN index is paid for at *indexing* time (the initial build plus ongoing maintenance during each `increment`), so it would add work on top of the vector phase, which is already the second-biggest share of init. That is the wrong direction right now, so this is a deliberate "not yet, but here's the lever" record.

## The option, when we need it

LanceDB supports creating an ANN index on a vector column via `tbl.create_index(...)`, e.g. `IVF_FLAT`, `IVF_PQ`, or an HNSW-based variant (`IVF_HNSW_SQ` / `IVF_HNSW_PQ`). That turns the exhaustive flat scan into an approximate search that examines only part of the table per query, which is a large speedup on big tables, at the cost of some recall and index-build/maintenance time.
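A minimal sketch of what that call could look like, under stated assumptions: the vector column name (`"embedding"`), the cosine metric, the row-count cutoff, and the parameter heuristics (roughly `sqrt(rows)` partitions, one PQ sub-vector per 16 dimensions) are all illustrative choices, not values taken from this project. The helper names are hypothetical; `count_rows` and `create_index` are the LanceDB table methods.

```python
import math

# Hypothetical cutoff: below this, the flat scan stays the only search path.
MIN_ROWS_FOR_ANN = 100_000


def ann_index_params(num_rows, dim):
    """Pick starting IVF_PQ parameters (heuristics, to be tuned on real data)."""
    num_partitions = max(1, round(math.sqrt(num_rows)))
    # PQ needs the dimension to be divisible by the number of sub-vectors.
    num_sub_vectors = dim // 16 if dim >= 16 and dim % 16 == 0 else 1
    return num_partitions, num_sub_vectors


def maybe_create_ann_index(tbl, dim, vector_column="embedding",
                           min_rows=MIN_ROWS_FOR_ANN):
    """Create an IVF_PQ index on `tbl` if it is large enough; return True if built."""
    num_rows = tbl.count_rows()
    if num_rows < min_rows:
        return False  # small table: exact flat scan is cheaper overall
    num_partitions, num_sub_vectors = ann_index_params(num_rows, dim)
    tbl.create_index(
        metric="cosine",
        index_type="IVF_PQ",
        num_partitions=num_partitions,
        num_sub_vectors=num_sub_vectors,
        vector_column_name=vector_column,
        replace=True,
    )
    return True
```

With a cutoff like this, a corpus the size of the profiled Shopizer run (~3.9k chunks) would never pay the index-build cost, which keeps the ANN path additive rather than a replacement for the flat scan.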

### Constraint worth knowing up front

cocoindex's LanceDB `TableTarget` exposes `declare_row(...)` + `table.optimize()` but **no** `declare_vector_index(...)` in the flow definition, so an ANN index can't be created *through* the cocoindex flow. The natural place is the **serialized post-flow step** (`lance_optimize.py`, which already runs `table.optimize()` under a retry loop): that is the only moment with no concurrent writers, so it is the safe spot to also call `create_index`. `_NUM_TXN_BEFORE_OPTIMIZE` currently disables background optimize (to avoid the lancedb#1504 race); ANN creation would slot into the same serialized window.
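As a rough illustration of where the call would sit, the sketch below runs `optimize()` under a simple retry loop and only then builds the index. It is not the contents of `lance_optimize.py`: the function name, retry count, backoff, and the `create_index` callback are hypothetical, and a real version would likely catch a narrower exception type.

```python
import time


def optimize_then_index(table, create_index, retries=3, backoff_s=0.5,
                        sleep=time.sleep):
    """Run table.optimize() with retries, then build the ANN index once.

    `create_index` is a callable taking the table (hypothetical hook), so the
    index build happens in the same serialized, writer-free window.
    """
    for attempt in range(1, retries + 1):
        try:
            table.optimize()
            break
        except Exception:  # sketch only: narrow this to the real conflict error
            if attempt == retries:
                raise
            sleep(backoff_s * attempt)  # linear backoff between attempts
    # Reached only if optimize succeeded, so the index sees compacted data.
    create_index(table)
```

Passing the index build in as a callback keeps the ANN step optional: the existing optimize path is unchanged when no index is wanted.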

## Trigger condition (when to revisit)

Revisit when **flat-scan query latency** on a real target corpus crosses an unacceptable threshold for top-k `search` (e.g. perceptible multi-hundred-ms p95 on a large repo). Until then, exhaustive search is simpler, exact, and imposes zero indexing cost.
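One way to make the trigger measurable is to time a batch of representative `search` calls and compare the p95 against a budget. The sketch below assumes a `run_query` callable that wraps the real search; the 300 ms default is a placeholder for "multi-hundred-ms", not an agreed threshold, and all function names are hypothetical.

```python
import time


def percentile(samples, pct):
    """Nearest-rank percentile; `pct` is an integer from 1 to 100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100  # integer ceil(pct * n / 100)
    return ordered[max(rank, 1) - 1]


def measure_latencies_ms(run_query, queries):
    """Wall-clock latency in milliseconds of run_query(q) for each query."""
    latencies = []
    for q in queries:
        start = time.perf_counter()
        run_query(q)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def should_revisit_ann(latencies_ms, p95_budget_ms=300.0):
    """True when p95 flat-scan latency exceeds the (placeholder) budget."""
    return percentile(latencies_ms, 95) > p95_budget_ms
```

Running this against the largest real target corpus, rather than a test fixture, is what makes the result meaningful for the decision.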

## Out of scope

- Doing this now. Query latency is fine; `init`/`increment` time is the priority (tracked in the init/increment perf proposal).
- Replacing flat scan as the default — ANN would be additive, not a swap.

