fix(rule): keep index for k-NN returning metadata, fall back when vector is projected #25
Merged
…vector is projected
A k-NN query that projects the indexed vector column (SELECT *, or
SELECT id, embedding) crashed with a post-optimizer schema-mismatch when an
index was present: the passthrough branch built its output from the index
node's columns (addressing key + non-vector columns), which can't include the
vector, so the rewritten plan's schema differed from the original and
DataFusion's invariant check aborted the query.
The fix is output-aware. The rule now also anchors on a Projection sitting
directly over a passthrough k-NN Sort and drives the rewrite from that outer
projection's columns — the query's real output:
- vector NOT in output (e.g. SELECT id ... ORDER BY l2_distance(emb, ...),
the common "nearest ids" query) -> every output column is producible from
the node, so the index is still used.
- vector IN output (SELECT *, SELECT id, embedding) -> the rewrite can't
produce it, so the rule declines and the query falls back to exact
brute-force search (correct, like the existing DESC / metric-mismatch
fallbacks) instead of crashing.
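The two cases above amount to a producibility check on the query's real output columns. A minimal sketch of that check, assuming hypothetical names (`Decision`, `decide`) and illustrative column names that are not taken from `rule.rs`:

```rust
use std::collections::HashSet;

/// Outcome of the rule for one k-NN query (hypothetical type, for illustration).
#[derive(Debug, PartialEq)]
enum Decision {
    /// Every output column can be produced by the index node.
    UseIndex,
    /// At least one output column (e.g. the vector) cannot be produced.
    FallBackToExactSearch,
}

/// Decide from the outer projection's columns, not from the Sort's input schema.
/// `index_node_columns` stands for addressing key + non-vector columns + `_distance`.
fn decide(output_columns: &[&str], index_node_columns: &[&str]) -> Decision {
    let producible: HashSet<&str> = index_node_columns.iter().copied().collect();
    if output_columns.iter().all(|c| producible.contains(c)) {
        Decision::UseIndex
    } else {
        Decision::FallBackToExactSearch
    }
}
```

With index node columns `["id", "title", "_distance"]`, an output of `["id"]` yields `UseIndex`, while `["id", "embedding"]` yields `FallBackToExactSearch`.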
This keeps the metadata-only k-NN path on the index (no regression) while
fixing the crash. A code comment records the rejected alternative (have
USearchExec reconstruct the vector via index.get) and why: it would make the
index a second source of returned vectors that must byte-match the source,
which breaks under F16 quantization.
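To illustrate why a byte-match requirement is fragile under reduced precision, here is a small sketch. It does not use USearch or a real IEEE half-precision conversion; `quantize_like_f16` is a hypothetical stand-in that only keeps the top 10 mantissa bits of an `f32` (the mantissa width of f16) and ignores f16's exponent range and rounding.

```rust
/// Stand-in for F16 storage (assumption, not USearch's actual conversion):
/// clear the low 13 of the 23 f32 mantissa bits, leaving 10 bits as in f16.
fn quantize_like_f16(x: f32) -> f32 {
    let mask: u32 = !((1u32 << 13) - 1);
    f32::from_bits(x.to_bits() & mask)
}

/// True only if both vectors are bit-for-bit identical.
fn byte_matches(source: &[f32], from_index: &[f32]) -> bool {
    source.len() == from_index.len()
        && source
            .iter()
            .zip(from_index)
            .all(|(a, b)| a.to_bits() == b.to_bits())
}
```

Values such as `0.5` and `1.0` survive this stand-in unchanged, but `0.1` does not, so a vector read back from reduced-precision storage would generally differ from the source bytes.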
Regression tests model production (lookup schema excludes the vector column,
which the existing tests' provider included, masking the bug). README documents
the fallback.
Fixes #508
Problem
A k-NN query (`ORDER BY <distance>(vec, lit) LIMIT k`) that projects the indexed vector column crashed with a post-optimizer schema-mismatch when an index was present.

The rewrite builds its output from the index node's columns (addressing key + non-vector sidecar columns + `_distance`). The indexed vector is never stored in the fetch path, so the rewritten plan's schema differs from the original and DataFusion's invariant check aborts. Affects parquet- and ducklake-backed tables identically (shared logical rule). Fixes hotdata-dev/runtimedb#508.

Why the naive fix regresses the common case
At the `Sort` node, `SELECT id` (vector only used to compute the inline distance) and `SELECT id, embedding` (vector in output) are indistinguishable: both are `Sort → TableScan[id, embedding]`, because the scan must supply `embedding` for the distance either way. Declining whenever the vector appears in the Sort's schema therefore also kills the index for the common "return the nearest rows' ids" query (verified: it regresses `SELECT id ... ORDER BY l2_distance(...)` from index → full scan).

Fix: output-aware
The rule now also anchors on a `Projection` sitting directly over a passthrough k-NN `Sort`, and drives the rewrite from that outer projection's columns (the query's real output):
- Vector not in output (e.g. `SELECT id ... ORDER BY l2_distance(emb, ...)`) → all output columns are producible from the node → index still used.
- Vector in output (`SELECT *`, `SELECT id, embedding`) → the rewrite can't produce it → the rule declines and the query falls back to exact brute-force search (correct, like the existing DESC / metric-mismatch / stacked-filter fallbacks) instead of crashing.

The `Sort`-anchored path keeps a producibility guard for the no-projection (`SELECT *`) shape.

Validation
Verified end-to-end against real runtimedb (parquet), before vs after, for these queries:
- `SELECT id … ORDER BY l2_distance`
- `SELECT id, l2_distance(...) AS d … ORDER BY d`
- `SELECT id, embedding …`
- `SELECT *`

Tests
`tests/vector_col_projection.rs` models production (lookup schema excludes the vector column; the existing `tests/optimizer_rule.rs` provider included it, which masked the bug):
- `SELECT *` / `SELECT id, embedding` → no rewrite, no crash;
- `SELECT id` (inline distance) → still rewrites (regression guard).

Full suite green: lib 7, execution 29, optimizer_rule 23, sqlite_provider 11, new 4.
`cargo fmt --check` and `cargo clippy --features sqlite-provider --all-targets -- -D warnings` clean.

Alternative considered (not taken)
Teach `USearchExec` to reconstruct the vector column via `index.get(key)` so vector-returning queries also stay on the index. Rejected to keep a single source of truth for returned vectors: the index would become a second source that must byte-match the source storage, which breaks under F16 quantization (and relies on USearch never transforming stored vectors). Recorded as a code comment in `rule.rs` and the README Limitations table. The `index.get` round-trip was verified to return exact (un-normalized) vectors today, so the door is open if vector-returning k-NN later proves performance-critical.
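As a closing illustration of the point made under "Why the naive fix regresses the common case", the toy model below shows two queries sharing one `Sort` subtree and differing only in the outer projection. The `Plan` enum and helper names are hypothetical and far simpler than DataFusion's logical plan; in particular, this sketch always wraps the `Sort` in a projection, which the real optimizer may not do.

```rust
/// Toy logical plan with just the node kinds discussed here (hypothetical).
#[derive(Debug, Clone, PartialEq)]
enum Plan {
    TableScan { columns: Vec<String> },
    Sort { input: Box<Plan> },
    Projection { columns: Vec<String>, input: Box<Plan> },
}

fn cols(names: &[&str]) -> Vec<String> {
    names.iter().map(|s| s.to_string()).collect()
}

/// Plan for `SELECT <output> ... ORDER BY l2_distance(embedding, ...)`.
/// The scan always reads `embedding`, because the distance needs it.
fn knn_plan(output: &[&str]) -> Plan {
    let scan = Plan::TableScan { columns: cols(&["id", "embedding"]) };
    let sort = Plan::Sort { input: Box::new(scan) };
    Plan::Projection { columns: cols(output), input: Box::new(sort) }
}

/// The Sort subtree directly under an outer projection, if the plan has that shape.
fn sort_subtree(plan: &Plan) -> Option<&Plan> {
    match plan {
        Plan::Projection { input, .. } if matches!(**input, Plan::Sort { .. }) => {
            Some(&**input)
        }
        _ => None,
    }
}
```

In this model, `knn_plan(&["id"])` and `knn_plan(&["id", "embedding"])` have equal `sort_subtree` values, so only the projection's column list can tell them apart.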