fix(rule): keep index for k-NN returning metadata, fall back when vector is projected #25

Merged: anoop-narang merged 1 commit into main from fix/usearch-vector-col-in-select on Jun 2, 2026

Problem

A k-NN query (ORDER BY <distance>(vec, lit) LIMIT k) that projects the indexed vector column crashed with a post-optimizer schema-mismatch when an index was present:

SELECT id, embedding FROM t ORDER BY l2_distance(embedding, ARRAY[...]) LIMIT 3;  -- 500
SELECT *            FROM t ORDER BY l2_distance(embedding, ARRAY[...]) LIMIT 3;   -- 500
Internal error: Assertion failed: compatible: Failed due to a difference in schemas:
  original: [id, embedding]   new: [_key, id]
Check optimizer-specific invariants after optimizer rule: usearch_rule

The rewrite builds its output from the index node's columns (addressing key + non-vector sidecar columns + _distance). The indexed vector is never stored in the fetch path, so the rewritten plan's schema differs from the original and DataFusion's invariant check aborts the query. The bug affects parquet- and ducklake-backed tables identically, since both go through the same logical rule. Fixes hotdata-dev/runtimedb#508.
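The mismatch can be reproduced in miniature. The sketch below is a simplified model, not the code in rule.rs: `schemas_compatible` and `old_rewrite_output` are hypothetical names, and column lists stand in for real DataFusion schemas.

```rust
/// Hypothetical stand-in for the post-rule invariant: the rewritten plan must
/// expose the same output columns, in the same order, as the original plan.
fn schemas_compatible(original: &[&str], rewritten: &[&str]) -> bool {
    original == rewritten
}

/// Simplified model of the old passthrough output: the addressing key followed
/// by the non-vector sidecar columns. The indexed vector is never available.
fn old_rewrite_output<'a>(key: &'a str, sidecar: &[&'a str]) -> Vec<&'a str> {
    let mut cols = vec![key];
    cols.extend_from_slice(sidecar);
    cols
}

fn main() {
    // SELECT id, embedding ... ORDER BY l2_distance(embedding, ...) LIMIT 3
    let original = ["id", "embedding"];
    let rewritten = old_rewrite_output("_key", &["id"]);

    // The rewritten columns are [_key, id], so the invariant check fails.
    println!("original: {:?}  new: {:?}", original, rewritten);
    println!("compatible: {}", schemas_compatible(&original, &rewritten));
}
```

The model only compares column names; the real invariant check also considers types and nullability.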

Why the naive fix regresses the common case

At the Sort node, SELECT id (vector used only to compute the inline distance) and SELECT id, embedding (vector in output) are indistinguishable: both are Sort → TableScan[id, embedding], because the scan must supply embedding for the distance either way. Declining whenever the vector appears in the Sort's schema would therefore also disable the index for the common "return the nearest rows' ids" query. This was verified: the naive guard regresses SELECT id ... ORDER BY l2_distance(...) from an index lookup to a full scan.
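A small model makes the ambiguity concrete. `sort_input_columns` is a hypothetical helper, and treating the Sort's input as "selected columns plus ORDER BY references" is an assumption that approximates what the planner does here.

```rust
use std::collections::BTreeSet;

/// Hypothetical helper: the scan below the Sort must supply every selected
/// column plus every column referenced by the ORDER BY expression.
fn sort_input_columns(selected: &[&str], order_by_refs: &[&str]) -> BTreeSet<String> {
    selected
        .iter()
        .chain(order_by_refs.iter())
        .map(|c| c.to_string())
        .collect()
}

fn main() {
    // SELECT id ... ORDER BY l2_distance(embedding, ...)
    let ids_only = sort_input_columns(&["id"], &["embedding"]);
    // SELECT id, embedding ... ORDER BY l2_distance(embedding, ...)
    let with_vector = sort_input_columns(&["id", "embedding"], &["embedding"]);

    // Both queries present the same column set at the Sort, so a guard that
    // inspects only the Sort's schema cannot tell them apart.
    println!("same Sort input: {}", ids_only == with_vector);
}
```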

Fix — output-aware

The rule now also anchors on a Projection sitting directly over a passthrough k-NN Sort, and drives the rewrite from that outer projection's columns (the query's real output):

  • vector not in output (SELECT id ... ORDER BY l2_distance(emb, ...)) → all output columns are producible from the node → index still used.
  • vector in output (SELECT *, SELECT id, embedding) → the rewrite can't produce it → the rule declines and the query falls back to exact brute-force search (correct, like the existing DESC / metric-mismatch / stacked-filter fallbacks) instead of crashing.

The Sort-anchored path keeps a producibility guard for the no-projection (SELECT *) shape.
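The producibility decision described above can be sketched as follows. This is an illustrative model only: `Decision` and `decide` are hypothetical names, and the real rule operates on DataFusion logical plan nodes rather than on lists of column names.

```rust
/// Hypothetical outcome of the output-aware check.
#[derive(Debug, PartialEq)]
enum Decision {
    UseIndex,
    FallBackToExactScan,
}

/// Rewrite only when every column in the query's real output can be produced
/// by the index node; otherwise decline so the exact search runs instead.
fn decide(output: &[&str], producible: &[&str]) -> Decision {
    if output.iter().all(|col| producible.contains(col)) {
        Decision::UseIndex
    } else {
        Decision::FallBackToExactScan
    }
}

fn main() {
    // Assumed producible set: addressing key, sidecar columns, distance.
    let producible = ["_key", "id", "_distance"];

    // SELECT id ... -> the vector is not in the output.
    println!("{:?}", decide(&["id"], &producible));
    // SELECT id, embedding ... -> the vector is in the output.
    println!("{:?}", decide(&["id", "embedding"], &producible));
}
```

Driving the check from the outer projection's columns rather than the Sort's schema is what lets the metadata-only query keep the index.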

Validation

Verified end-to-end against real runtimedb (parquet), before vs after:

| Query | Before | After |
| --- | --- | --- |
| SELECT id … ORDER BY l2_distance | index ✅ | index ✅ |
| SELECT id, l2_distance(...) AS d … ORDER BY d | index ✅ | index ✅ |
| SELECT id, embedding … | 500 | exact scan, correct rows ✅ |
| SELECT * | 500 | exact scan, correct rows ✅ |

Tests

tests/vector_col_projection.rs models production, where the lookup schema excludes the vector column. The provider in the existing tests/optimizer_rule.rs included it, which masked the bug. The new tests cover:

  • SELECT * / SELECT id, embedding → no rewrite, no crash;
  • SELECT id (inline distance) → still rewrites (regression guard);
  • aliased distance → still rewrites.

Full suite green: lib 7, execution 29, optimizer_rule 23, sqlite_provider 11, new 4. cargo fmt --check and cargo clippy --features sqlite-provider --all-targets -- -D warnings clean.

Alternative considered (not taken)

Teach USearchExec to reconstruct the vector column via index.get(key) so vector-returning queries also stay on the index. Rejected to keep a single source of truth for returned vectors: the index would become a second source that must byte-match the source storage, which breaks under F16 quantization (and relies on USearch never transforming stored vectors). Recorded as a code comment in rule.rs and the README Limitations table. The index.get round-trip was verified to return exact (un-normalized) vectors today, so the door is open if vector-returning k-NN later proves performance-critical.
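The byte-match concern can be illustrated without the index at all. The sketch below does not call USearch and does not perform a real F16 conversion: `quantize_like_f16` is a hypothetical stand-in that keeps only the top 10 mantissa bits of an f32 (F16 has a 10-bit mantissa), truncating instead of rounding and ignoring F16's narrower exponent range.

```rust
/// Rough stand-in for F16 storage: clear the low 13 of the 23 f32 mantissa
/// bits, leaving 10. This is an assumption-laden approximation, not F16.
fn quantize_like_f16(x: f32) -> f32 {
    f32::from_bits(x.to_bits() & !0x1FFF)
}

/// True when the round-tripped vector is bit-identical to the source vector.
fn byte_matches(source: &[f32], round_tripped: &[f32]) -> bool {
    source.len() == round_tripped.len()
        && source
            .iter()
            .zip(round_tripped)
            .all(|(a, b)| a.to_bits() == b.to_bits())
}

fn main() {
    let source = [0.5_f32, 0.1, 0.3];
    let from_index: Vec<f32> = source.iter().map(|&x| quantize_like_f16(x)).collect();

    // 0.5 survives reduced precision exactly; 0.1 and 0.3 do not, so a vector
    // served from a quantized index would differ from the stored source.
    println!("byte-identical: {}", byte_matches(&source, &from_index));
}
```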

…vector is projected

A k-NN query that projects the indexed vector column (SELECT *, or
SELECT id, embedding) crashed with a post-optimizer schema-mismatch when an
index was present: the passthrough branch built its output from the index
node's columns (addressing key + non-vector columns), which can't include the
vector, so the rewritten plan's schema differed from the original and
DataFusion's invariant check aborted the query.

The fix is output-aware. The rule now also anchors on a Projection sitting
directly over a passthrough k-NN Sort and drives the rewrite from that outer
projection's columns — the query's real output:

  - vector NOT in output (e.g. SELECT id ... ORDER BY l2_distance(emb, ...),
    the common "nearest ids" query) -> every output column is producible from
    the node, so the index is still used.
  - vector IN output (SELECT *, SELECT id, embedding) -> the rewrite can't
    produce it, so the rule declines and the query falls back to exact
    brute-force search (correct, like the existing DESC / metric-mismatch
    fallbacks) instead of crashing.

This keeps the metadata-only k-NN path on the index (no regression) while
fixing the crash. A code comment records the rejected alternative (have
USearchExec reconstruct the vector via index.get) and why: it would make the
index a second source of returned vectors that must byte-match the source,
which breaks under F16 quantization.

Regression tests model production (lookup schema excludes the vector column,
which the existing tests' provider included, masking the bug). README documents
the fallback.

Fixes #508
anoop-narang merged commit 2c65279 into main on Jun 2, 2026
6 checks passed
anoop-narang deleted the fix/usearch-vector-col-in-select branch on June 2, 2026 09:52