Summary
Proactively document the option to add an ANN (approximate nearest neighbor) vector index to the LanceDB tables, so that if/when flat-scan query latency becomes a bottleneck on a large corpus, the plan and its cost are already on record. Not something to do today — query latency is fine, and adding ANN would tax init/increment, which is the actual pain point.
Current state
The three LanceDB tables (javacodeindex_java_code, sqlschemaindex_sql_schema, yamlconfigindex_yaml_config) have:
- a full-text search index (create_fts_index("text"), created lazily on first find), and
- no ANN index: vector similarity (search) does an exhaustive flat scan over all rows.
On the corpora tested so far that flat scan is fast enough that nobody has hit a query-latency wall.
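For readers less familiar with the term, the following is a minimal pure-Python sketch of what an exhaustive flat scan computes: every stored vector is scored against the query and the k closest are kept, so per-query cost grows linearly with the row count. It is an illustration only, not LanceDB's implementation (which is vectorized), and the cosine metric and the `(row_id, vector)` row shape are assumptions made for the example.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine distance (1 - cosine similarity); zero vectors count as maximally far."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def flat_scan_top_k(
    query: Sequence[float],
    rows: Sequence[Tuple[str, Sequence[float]]],
    k: int,
) -> List[Tuple[float, str]]:
    """Score every row against the query and return the k closest as (distance, row_id)."""
    scored = ((cosine_distance(query, vec), row_id) for row_id, vec in rows)
    return heapq.nsmallest(k, scored)
```

The result is exact, since no row is skipped. That exactness is what an ANN index trades away in exchange for touching fewer rows.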
Why this issue exists (and why it's parked)
Profiling init on a medium Java corpus (Shopizer: ~1.2k files, ~3.9k chunks) on 2026-06-21 showed:
| phase | share of init |
| --- | --- |
| LadybugDB graph writes | ~81% |
| cocoindex vectors | ~17% |
| optimize | ~1% |
The project's pain point is unambiguously init/increment time, not query time. An ANN index is paid for at indexing time (build + ongoing maintenance during each increment), and would make the second-biggest init phase bigger — the wrong direction right now. So this is a deliberate "not yet, but here's the lever" record.
The option, when we need it
LanceDB supports creating an ANN index on a vector column via tbl.create_index(...), e.g. IVF_FLAT, IVF_PQ, or an HNSW-based variant (IVF_HNSW_SQ / IVF_HNSW_PQ). That replaces the exhaustive flat scan with an approximate search that examines only a fraction of the rows per query, at the cost of some recall and of index-build/maintenance time.
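A minimal sketch of what that call could look like, guarded so that small tables keep the exact flat scan. The table is duck-typed (anything with `count_rows()` and `create_index(...)`, as the LanceDB Python table has). The helper name, the `min_rows` threshold, the `"embedding"` column name, and the cosine metric are all assumptions for illustration; the real column name and metric would need to match what the cocoindex flow writes, and the keyword names should be checked against the installed LanceDB version.

```python
from typing import Any


def create_ann_index_if_large(
    tbl: Any,
    *,
    min_rows: int = 100_000,            # hypothetical threshold, not a measured value
    vector_column: str = "embedding",   # assumed column name
    index_type: str = "IVF_PQ",
    metric: str = "cosine",             # assumed distance metric
) -> bool:
    """Build an ANN index only when the table is large enough to benefit.

    Returns True if an index build was requested, False if the table
    was left on exhaustive flat scan.
    """
    if tbl.count_rows() < min_rows:
        return False
    tbl.create_index(
        metric=metric,
        vector_column_name=vector_column,
        index_type=index_type,
        replace=True,  # rebuild rather than fail if an index already exists
    )
    return True
```

Keeping the row-count guard in the helper means the three tables could share one code path while only the large ones pay the build cost.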
Constraint worth knowing up front
cocoindex's LanceDB TableTarget exposes declare_row(...) + table.optimize() but no declare_vector_index(...) in the flow definition. So an ANN index can't be created through the cocoindex flow. The natural place is the serialized post-flow step (lance_optimize.py, which already runs table.optimize() under a retry loop) — that's the only moment with no concurrent writers, so it's the safe spot to also call create_index. _NUM_TXN_BEFORE_OPTIMIZE currently disables background optimize (lancedb#1504 race); ANN creation would slot into the same serialized window.
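To make the "same serialized window" idea concrete, here is a hedged sketch of a post-flow step that runs `optimize()` and then, optionally, `create_index(...)`, each under a simple retry loop. It does not reproduce the actual contents of lance_optimize.py, which are not shown in this issue. The function name, the `build_ann` flag, the attempt and backoff values, and the `"embedding"` column name are hypothetical, and real code would likely catch a narrower exception type than `Exception`.

```python
import time
from typing import Any, Callable, List, Tuple


def run_serialized_post_flow(
    tbl: Any,
    *,
    build_ann: bool = False,
    vector_column: str = "embedding",   # assumed column name
    attempts: int = 3,                  # hypothetical retry budget
    backoff_s: float = 0.5,             # hypothetical base backoff
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Run optimize, then optionally an ANN index build, one step at a time.

    Intended to be called only after the flow has finished, when no
    other writer is touching the table.
    """
    steps: List[Tuple[str, Callable[[], Any]]] = [("optimize", tbl.optimize)]
    if build_ann:
        steps.append((
            "create_index",
            lambda: tbl.create_index(vector_column_name=vector_column, replace=True),
        ))

    for _name, step in steps:
        for attempt in range(1, attempts + 1):
            try:
                step()
                break  # step succeeded, move on to the next one
            except Exception:
                if attempt == attempts:
                    raise  # retry budget exhausted, surface the error
                sleep(backoff_s * attempt)  # linear backoff before retrying
```

Running the index build after optimize, rather than before, means the index is built over the compacted table instead of over fragments that are about to be rewritten.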
Trigger condition (when to revisit)
Revisit when flat-scan query latency on a real target corpus crosses an unacceptable threshold for top-k search (e.g. perceptible multi-hundred-ms p95 on a large repo). Until then, exhaustive search is simpler, exact, and imposes zero indexing cost.
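One way to turn that trigger into something checkable is to time a representative batch of queries and compare the p95 against a threshold. The sketch below uses only the standard library. The function names are hypothetical, `run_query` stands in for whatever wraps the table's vector search, and the 300 ms default is an illustrative placeholder for "multi-hundred-ms", not an agreed number.

```python
import time
from typing import Any, Callable, List, Sequence


def measure_query_latency_ms(
    run_query: Callable[[Any], Any],
    queries: Sequence[Any],
    clock: Callable[[], float] = time.perf_counter,
) -> List[float]:
    """Time each query once and return wall-clock latencies in milliseconds."""
    samples: List[float] = []
    for query in queries:
        start = clock()
        run_query(query)
        samples.append((clock() - start) * 1000.0)
    return samples


def p95(samples_ms: Sequence[float]) -> float:
    """95th percentile by the nearest-rank method."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n)
    return ordered[rank - 1]


def should_revisit_ann(samples_ms: Sequence[float], threshold_ms: float = 300.0) -> bool:
    """True when the measured p95 exceeds the (placeholder) latency threshold."""
    return p95(samples_ms) > threshold_ms
```

Measuring on a real target corpus with realistic queries matters more than the exact percentile method; a synthetic benchmark could easily trip or miss the threshold for the wrong reasons.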
Out of scope
- Doing this now. Query latency is fine; init/increment time is the priority (tracked in the init/increment perf proposal).
- Replacing flat scan as the default — ANN would be additive, not a swap.