fix(rule): keep index for k-NN returning metadata, fall back when vector is projected by anoop-narang · Pull Request #25 · hotdata-dev/datafusion-vector-search-ext

anoop-narang · 2026-06-02T09:48:35Z

Problem

A k-NN query (ORDER BY <distance>(vec, lit) LIMIT k) that projects the indexed vector column crashed with a post-optimizer schema-mismatch when an index was present:

SELECT id, embedding FROM t ORDER BY l2_distance(embedding, ARRAY[...]) LIMIT 3;  -- 500
SELECT *            FROM t ORDER BY l2_distance(embedding, ARRAY[...]) LIMIT 3;   -- 500

Internal error: Assertion failed: compatible: Failed due to a difference in schemas:
  original: [id, embedding]   new: [_key, id]
Check optimizer-specific invariants after optimizer rule: usearch_rule

The rewrite builds its output from the index node's columns (addressing key + non-vector sidecar columns + _distance). The indexed vector is never stored in the fetch path, so the rewritten plan's schema differs from the original and DataFusion's invariant check aborts. Affects parquet- and ducklake-backed tables identically (shared logical rule). Fixes hotdata-dev/runtimedb#508.

Why the naive fix regresses the common case

At the Sort node, SELECT id (vector only used to compute the inline distance) and SELECT id, embedding (vector in output) are indistinguishable — both are Sort → TableScan[id, embedding], because the scan must supply embedding for the distance either way. Declining whenever the vector appears in the Sort's schema therefore also kills the index for the common "return the nearest rows' ids" query (verified: it regresses SELECT id ... ORDER BY l2_distance(...) from index → full scan).
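The indistinguishability can be seen in a toy model of the plan shapes. This is a minimal sketch, not DataFusion's `LogicalPlan`: the `Plan` enum, `knn_plan`, and `sort_of` are hypothetical names invented for illustration, and the builder always wraps the Sort in a Projection for simplicity, which a real plan may not do when the output equals the scan's columns.

```rust
// Toy plan model (hypothetical, NOT DataFusion's LogicalPlan).
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    TableScan { columns: Vec<String> },
    Sort { input: Box<Plan> },
    Projection { columns: Vec<String>, input: Box<Plan> },
}

impl Plan {
    /// Columns this node outputs; Sort passes its input schema through.
    fn schema(&self) -> Vec<String> {
        match self {
            Plan::TableScan { columns } => columns.clone(),
            Plan::Sort { input } => input.schema(),
            Plan::Projection { columns, .. } => columns.clone(),
        }
    }
}

fn cols(names: &[&str]) -> Vec<String> {
    names.iter().map(|s| s.to_string()).collect()
}

/// Hypothetical builder: the scan always supplies `embedding`, because the
/// distance expression in ORDER BY needs it regardless of the SELECT list.
fn knn_plan(output: &[&str]) -> Plan {
    let sort = Plan::Sort {
        input: Box::new(Plan::TableScan { columns: cols(&["id", "embedding"]) }),
    };
    Plan::Projection { columns: cols(output), input: Box::new(sort) }
}

/// Returns the node beneath an outer projection (the Sort in this model).
fn sort_of(plan: &Plan) -> &Plan {
    match plan {
        Plan::Projection { input, .. } => &**input,
        other => other,
    }
}

fn main() {
    let ids_only = knn_plan(&["id"]);
    let with_vector = knn_plan(&["id", "embedding"]);
    // Same Sort subtree: a rule anchored only on Sort cannot tell them apart.
    assert_eq!(sort_of(&ids_only), sort_of(&with_vector));
    // The outer projection is where the two queries actually differ.
    assert_ne!(ids_only.schema(), with_vector.schema());
}
```

In this model the only place the two queries differ is the outer projection, which is why a check on the Sort's schema alone cannot separate them.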

Fix — output-aware

The rule now also anchors on a Projection sitting directly over a passthrough k-NN Sort, and drives the rewrite from that outer projection's columns (the query's real output):

- vector not in output (SELECT id ... ORDER BY l2_distance(emb, ...)) → all output columns are producible from the node → index still used.
- vector in output (SELECT *, SELECT id, embedding) → the rewrite can't produce it → the rule declines and the query falls back to exact brute-force search (correct, like the existing DESC / metric-mismatch / stacked-filter fallbacks) instead of crashing.

The Sort-anchored path keeps a producibility guard for the no-projection (SELECT *) shape.
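The producibility decision described above can be sketched as a set-membership check. This is an illustration under stated assumptions, not the code in rule.rs: `Decision`, `producible_columns`, and `decide` are hypothetical names, columns are compared by plain name, and the mapping of an aliased distance expression onto `_distance` is not modeled.

```rust
use std::collections::HashSet;

#[derive(Debug, PartialEq)]
enum Decision {
    UseIndex,
    FallBackToExactScan,
}

/// Columns the index node can emit, following the PR text: the addressing
/// key, the non-vector sidecar columns, and `_distance`. The indexed vector
/// itself is deliberately absent.
fn producible_columns(sidecar: &[&str]) -> HashSet<String> {
    let mut set: HashSet<String> = sidecar.iter().map(|c| c.to_string()).collect();
    set.insert("_key".to_string());
    set.insert("_distance".to_string());
    set
}

/// Rewrite only if every column of the query's real output is producible;
/// otherwise decline so the query runs as an exact scan.
fn decide(output: &[&str], producible: &HashSet<String>) -> Decision {
    if output.iter().all(|c| producible.contains(*c)) {
        Decision::UseIndex
    } else {
        Decision::FallBackToExactScan
    }
}

fn main() {
    let producible = producible_columns(&["id"]);
    // SELECT id ... : metadata only, stays on the index.
    assert_eq!(decide(&["id"], &producible), Decision::UseIndex);
    // SELECT id, embedding ... : the vector is not producible, so decline.
    assert_eq!(
        decide(&["id", "embedding"], &producible),
        Decision::FallBackToExactScan
    );
}
```

Declining is the safe default here: an unrewritten plan keeps its original schema, so the optimizer's invariant check cannot fail.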

Validation

Verified end-to-end against real runtimedb (parquet), before vs after:

| Query | Before | After |
| --- | --- | --- |
| `SELECT id … ORDER BY l2_distance` | index ✅ | index ✅ |
| `SELECT id, l2_distance(...) AS d … ORDER BY d` | index ✅ | index ✅ |
| `SELECT id, embedding …` | 500 | exact scan, correct rows ✅ |
| `SELECT *` | 500 | exact scan, correct rows ✅ |

Tests

tests/vector_col_projection.rs models production (lookup schema excludes the vector column — the existing tests/optimizer_rule.rs provider included it, which masked the bug):

- SELECT * / SELECT id, embedding → no rewrite, no crash;
- SELECT id (inline distance) → still rewrites (regression guard);
- aliased distance → still rewrites.

Full suite green: lib 7, execution 29, optimizer_rule 23, sqlite_provider 11, new 4. cargo fmt --check and cargo clippy --features sqlite-provider --all-targets -- -D warnings clean.

Alternative considered (not taken)

Teach USearchExec to reconstruct the vector column via index.get(key) so vector-returning queries also stay on the index. Rejected to keep a single source of truth for returned vectors: the index would become a second source that must byte-match the source storage, which breaks under F16 quantization (and relies on USearch never transforming stored vectors). Recorded as a code comment in rule.rs and the README Limitations table. The index.get round-trip was verified to return exact (un-normalized) vectors today, so the door is open if vector-returning k-NN later proves performance-critical.
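The byte-match concern can be illustrated with a lossy round trip. This is only an analogy: stable Rust's standard library has no `f16` type, so the sketch narrows `f64` to `f32` as a stand-in for F32 → F16 quantization. It does not call USearch, and `round_trip` and `bit_identical` are hypothetical helpers.

```rust
/// Stand-in for a quantizing index: each value is stored at lower precision
/// (f64 -> f32 here, standing in for f32 -> f16) and widened again on read.
fn round_trip(source: &[f64]) -> Vec<f64> {
    source.iter().map(|&x| (x as f32) as f64).collect()
}

/// True only if both slices hold exactly the same bit patterns.
fn bit_identical(a: &[f64], b: &[f64]) -> bool {
    a.len() == b.len() && a.iter().zip(b).all(|(x, y)| x.to_bits() == y.to_bits())
}

fn main() {
    // Values exactly representable at the lower precision survive unchanged.
    let exact = [0.5, 0.25];
    assert!(bit_identical(&exact, &round_trip(&exact)));

    // Typical embedding values are not, so the index copy no longer
    // byte-matches the source storage.
    let lossy = [0.1, 0.2];
    assert!(!bit_identical(&lossy, &round_trip(&lossy)));
}
```

Once the stored precision is lower than the source's, a vector read back from the index is a different value than the one in the table, which is the second-source-of-truth problem the rejected alternative would introduce.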

…vector is projected

A k-NN query that projects the indexed vector column (SELECT *, or SELECT id, embedding) crashed with a post-optimizer schema-mismatch when an index was present: the passthrough branch built its output from the index node's columns (addressing key + non-vector columns), which can't include the vector, so the rewritten plan's schema differed from the original and DataFusion's invariant check aborted the query.

The fix is output-aware. The rule now also anchors on a Projection sitting directly over a passthrough k-NN Sort and drives the rewrite from that outer projection's columns, which are the query's real output:

- vector NOT in output (e.g. SELECT id ... ORDER BY l2_distance(emb, ...), the common "nearest ids" query) -> every output column is producible from the node, so the index is still used.
- vector IN output (SELECT *, SELECT id, embedding) -> the rewrite can't produce it, so the rule declines and the query falls back to exact brute-force search (correct, like the existing DESC / metric-mismatch fallbacks) instead of crashing.

This keeps the metadata-only k-NN path on the index (no regression) while fixing the crash.

A code comment records the rejected alternative (have USearchExec reconstruct the vector via index.get) and why: it would make the index a second source of returned vectors that must byte-match the source, which breaks under F16 quantization.

Regression tests model production (lookup schema excludes the vector column, which the existing tests' provider included, masking the bug). README documents the fallback.

Fixes #508

claude Bot approved these changes Jun 2, 2026


anoop-narang merged commit 2c65279 into main Jun 2, 2026
6 checks passed

anoop-narang deleted the fix/usearch-vector-col-in-select branch June 2, 2026 09:52
