perf: add ANN vector index if/when flat-scan query latency becomes a bottleneck (parked) #337

Description

@HumanBean17

Summary

Proactively document the option to add an ANN (approximate nearest neighbor) vector index to the LanceDB tables, so that if/when flat-scan query latency becomes a bottleneck on a large corpus, the plan and its cost are already on record. Not something to do today — query latency is fine, and adding ANN would tax init/increment, which is the actual pain point.

Current state

The three LanceDB tables (javacodeindex_java_code, sqlschemaindex_sql_schema, yamlconfigindex_yaml_config) have:

  • a full-text search index (create_fts_index("text"), created lazily on first find), and
  • no ANN index — vector similarity (search) does an exhaustive flat scan over all rows.

On the corpora tested so far that flat scan is fast enough that nobody has hit a query-latency wall.
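For reference, an exhaustive flat scan amounts to scoring every row against the query and keeping the best k. The sketch below is a pure-Python illustration of that idea, not LanceDB's implementation; the cosine metric and the `(row_id, vector)` row shape are assumptions made for the example, since the issue does not state which metric the tables use.

```python
import heapq
import math
from typing import Hashable, Iterable, Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Return 1 - cosine similarity; zero-length vectors count as maximally far."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def flat_scan_top_k(
    query: Sequence[float],
    rows: Iterable[tuple[Hashable, Sequence[float]]],
    k: int,
) -> list[Hashable]:
    """Score every row against the query and return the ids of the k nearest.

    Cost grows linearly with the number of rows, which is exactly the
    property that eventually motivates an ANN index.
    """
    scored = ((cosine_distance(query, vec), row_id) for row_id, vec in rows)
    return [row_id for _, row_id in heapq.nsmallest(k, scored)]
```

The result is exact because no row is skipped; an ANN index trades that exactness for visiting fewer rows.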

Why this issue exists (and why it's parked)

Profiling init on a medium Java corpus (Shopizer: ~1.2k files, ~3.9k chunks) on 2026-06-21 showed:

phase                     share of init
LadybugDB graph writes    ~81%
cocoindex vectors         ~17%
optimize                  ~1%

The project's pain point is unambiguously init/increment time, not query time. An ANN index is paid for at indexing time (build + ongoing maintenance during each increment), and would make the second-biggest init phase bigger — the wrong direction right now. So this is a deliberate "not yet, but here's the lever" record.

The option, when we need it

LanceDB supports creating an ANN index on a vector column via tbl.create_index(...), e.g. IVF_FLAT, IVF_PQ, or the HNSW variants (IVF_HNSW_SQ / IVF_HNSW_PQ). That turns the exhaustive flat scan into an approximate search that inspects only a fraction of the rows, which is much faster on large tables, at the cost of some recall and index-build/maintenance time.
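A minimal sketch of what that call could look like, assuming the synchronous LanceDB Python table API (`count_rows()` and `create_index(...)` with `metric`, `vector_column_name`, `index_type`, and `replace` keyword arguments). The helper name `build_ann_index`, the column name `"vector"`, the cosine metric, and the `min_rows` cutoff are all hypothetical and would need checking against the real tables and the installed LanceDB version.

```python
def build_ann_index(
    tbl,
    *,
    vector_column: str = "vector",   # assumption: actual column name not stated in this issue
    index_type: str = "IVF_PQ",      # one of the index types LanceDB offers
    metric: str = "cosine",          # assumption: must match the metric used at query time
    min_rows: int = 10_000,          # hypothetical cutoff; small tables stay on flat scan
) -> bool:
    """Create (or replace) an ANN index on `tbl` if the table is large enough.

    `tbl` is expected to behave like a LanceDB table: it must provide
    count_rows() and create_index(...). Returns True if an index was built.
    """
    if tbl.count_rows() < min_rows:
        # Below the cutoff the exhaustive scan is cheap and exact, so skip.
        return False
    tbl.create_index(
        metric=metric,
        vector_column_name=vector_column,
        index_type=index_type,
        replace=True,  # rebuild over any existing index on this column
    )
    return True
```

Gating on row count keeps the behavior additive: small corpora never pay the build cost and keep exact results.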

Constraint worth knowing up front

cocoindex's LanceDB TableTarget exposes declare_row(...) + table.optimize() but no declare_vector_index(...) in the flow definition. So an ANN index can't be created through the cocoindex flow. The natural place is the serialized post-flow step (lance_optimize.py, which already runs table.optimize() under a retry loop) — that's the only moment with no concurrent writers, so it's the safe spot to also call create_index. _NUM_TXN_BEFORE_OPTIMIZE currently disables background optimize (lancedb#1504 race); ANN creation would slot into the same serialized window.
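The shape of that serialized step could look roughly like the sketch below. It is not the contents of lance_optimize.py, which are not shown in this issue; the function name `optimize_then_index`, the retry count, the linear backoff, and the broad `except Exception` are placeholders. The only calls assumed on the table are `optimize()` plus whatever the optional `build_index` callable does.

```python
import time
from typing import Callable, Optional


def optimize_then_index(
    tbl,
    build_index: Optional[Callable[[object], object]] = None,
    *,
    attempts: int = 3,          # hypothetical retry budget
    backoff_s: float = 0.5,     # hypothetical base delay between attempts
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Run table.optimize() and then an optional index build, one after the other.

    Both steps run in the same attempt so the index is only built on a
    freshly optimized table. Returns the attempt number that succeeded and
    re-raises the last error once the retry budget is exhausted.
    """
    for attempt in range(1, attempts + 1):
        try:
            tbl.optimize()
            if build_index is not None:
                build_index(tbl)
            return attempt
        except Exception:  # placeholder: narrow to the real conflict error type
            if attempt == attempts:
                raise
            sleep(backoff_s * attempt)  # linear backoff before retrying
    raise ValueError("attempts must be at least 1")
```

Passing `build_index=None` reproduces an optimize-only step, so the ANN build stays an opt-in addition to the existing window rather than a change to it.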

Trigger condition (when to revisit)

Revisit when flat-scan query latency on a real target corpus crosses an unacceptable threshold for top-k search (e.g. perceptible multi-hundred-ms p95 on a large repo). Until then, exhaustive search is simpler, exact, and imposes zero indexing cost.
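One way to make that trigger checkable is to time a batch of representative queries and compare the p95 against a budget. The sketch below uses only the standard library; the function names are hypothetical, `run_query` stands in for whatever performs the top-k search, and the 300 ms default is a placeholder for "multi-hundred-ms", not a number taken from this issue.

```python
import time
from typing import Callable, Iterable, Sequence


def p95(samples: Sequence[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def measure_latencies_ms(
    run_query: Callable[[object], object],
    queries: Iterable[object],
) -> list[float]:
    """Time each query with a monotonic clock and return latencies in milliseconds."""
    latencies = []
    for query in queries:
        start = time.perf_counter()
        run_query(query)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


def should_revisit_ann(latencies_ms: Sequence[float], threshold_ms: float = 300.0) -> bool:
    """True when p95 latency exceeds the budget (threshold_ms is a placeholder)."""
    return p95(latencies_ms) > threshold_ms
```

Run it against a real target corpus with realistic queries; a synthetic or tiny corpus says little about where the flat scan stops being acceptable.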

Out of scope

  • Doing this now. Query latency is fine; init/increment time is the priority (tracked in the init/increment perf proposal).
  • Replacing flat scan as the default — ANN would be additive, not a swap.
