Selective Context can return 0 results for a user with a small share of a large collection (HNSW post-filter)

With Selective Context (a scoped provider), retrieval can return 0 documents for a user whose indexed chunks are only a small fraction of a large collection; users with small corpora are fine.

**Cause:** `VectorDB._similarity_search` runs roughly `... WHERE id IN (<chunk_ids>) ORDER BY embedding <=> :q LIMIT k`. For a large `IN` list the planner can pick an HNSW index scan, which first collects roughly `hnsw.ef_search` (default 40) globally nearest vectors and only then applies the `id` filter. If none of the user's chunks are among those candidates, the result is empty.
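To illustrate the failure mode, here is a toy simulation in plain Python. It is not pgvector code: the "index" is modeled as a sort that yields only the `ef_search` globally nearest ids, and all names and numbers (`post_filtered_search`, the 100,000 / 50 split, the distance ranges) are made up for the example.

```python
import random


def post_filtered_search(distances, allowed_ids, k, ef_search=40):
    """Toy model of an index scan followed by an id filter.

    distances: dict mapping chunk id -> distance to the query.
    Only the ef_search globally nearest ids are considered; the id
    filter is applied afterwards, as in the plan described above.
    """
    candidates = sorted(distances, key=distances.get)[:ef_search]
    return [i for i in candidates if i in allowed_ids][:k]


def pre_filtered_search(distances, allowed_ids, k):
    """Exact search restricted to the allowed ids (filter first, then rank)."""
    return sorted(allowed_ids, key=distances.get)[:k]


rng = random.Random(0)
# 100,000 chunks from other users, all fairly close to the query.
distances = {i: rng.uniform(0.0, 0.5) for i in range(100_000)}
# 50 chunks owned by our user, all farther away than every other chunk.
user_ids = set(range(100_000, 100_050))
distances.update({i: rng.uniform(0.6, 1.0) for i in user_ids})

post = post_filtered_search(distances, user_ids, k=5)
pre = pre_filtered_search(distances, user_ids, k=5)
print(len(post), len(pre))  # prints "0 5"
```

The post-filtered variant returns nothing because none of the user's chunks make it into the 40 global candidates, while filtering first always returns the user's own nearest chunks.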

**Two ways to address it:**
- *pgvector ≥ 0.8.0 native:* `SET hnsw.iterative_scan = relaxed_order` (off by default). HNSW keeps scanning until enough rows pass the filter or `hnsw.max_scan_tuples` is reached. Simplest where available.
- *App-side (what we used on 5.3.x):* a `MATERIALIZED` CTE that pre-filters by `id` (btree) before computing similarity, so the HNSW plan is never chosen — deterministic and exact.
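A minimal sketch of what the two options could look like from application code. This is not the project's actual implementation: the table name `chunks`, the helper names, and the use of `= ANY(:chunk_ids)` with an array bind instead of an expanded `IN` list are assumptions for illustration. `AS MATERIALIZED` requires PostgreSQL 12 or later.

```python
def build_prefiltered_query(table="chunks"):
    """Return SQL that filters by id in a materialized CTE, then ranks.

    Assumed placeholders: `table` (hypothetical name), :chunk_ids bound
    as an array of ids, :q as the query vector, :k as the row limit.
    The CTE is an optimization fence, so the outer ORDER BY sorts only
    the pre-filtered rows instead of walking the HNSW index.
    """
    return (
        "WITH candidates AS MATERIALIZED (\n"
        f"    SELECT id, embedding FROM {table}\n"
        "    WHERE id = ANY(:chunk_ids)\n"
        ")\n"
        "SELECT id, embedding <=> :q AS distance\n"
        "FROM candidates\n"
        "ORDER BY distance\n"
        "LIMIT :k"
    )


def iterative_scan_statements(max_scan_tuples=None):
    """Return the SET LOCAL statements for the pgvector >= 0.8.0 option.

    SET LOCAL only lasts for the current transaction, so these must run
    in the same transaction as the similarity query.
    """
    statements = ["SET LOCAL hnsw.iterative_scan = relaxed_order"]
    if max_scan_tuples is not None:
        statements.append(
            f"SET LOCAL hnsw.max_scan_tuples = {int(max_scan_tuples)}"
        )
    return statements


print(build_prefiltered_query())
print(iterative_scan_statements(max_scan_tuples=50_000))
```

The pre-filter computes distances for every chunk in the id list, so its cost grows with the size of the user's corpus rather than the whole collection; the iterative scan keeps the index but stays approximate.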

**Note:** the plan choice is cost-dependent — in a re-test the planner used the btree id index and returned correct rows, so the 0-result plan isn't guaranteed; it's a latent risk under certain id-list sizes/stats rather than a constant failure.
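Because the plan varies with statistics, one way to check a given deployment is to run `EXPLAIN` on the similarity query and see which index the plan names. The helper below is a rough sketch that only scans the text-format plan for an index name; the index names `chunks_embedding_hnsw` and `chunks_pkey` and the sample plan lines are hand-written placeholders, not captured output.

```python
import re


def plan_uses_index(explain_lines, index_name):
    """Return True if any EXPLAIN (text format) line names `index_name`.

    This is a plain text match on the index name as a whole word; it
    does not parse the plan tree.
    """
    pattern = re.compile(rf"\b{re.escape(index_name)}\b")
    return any(pattern.search(line) for line in explain_lines)


# Hand-written plan fragments in the usual "Index Scan using <index> on
# <table>" shape, for illustration only (index names are hypothetical).
risky_plan = ["Limit", "  ->  Index Scan using chunks_embedding_hnsw on chunks"]
safe_plan = ["Limit", "  ->  Sort", "        ->  Index Scan using chunks_pkey on chunks"]

print(plan_uses_index(risky_plan, "chunks_embedding_hnsw"))
print(plan_uses_index(safe_plan, "chunks_embedding_hnsw"))
```

If the HNSW index shows up in the plan for a query with an `id` filter, that query is exposed to the post-filter behavior described above.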

*(Disclosure: investigated with AI assistance; verified against the source and on a live deployment.)*


Selective Context can return 0 results for a user with a small share of a large collection (HNSW post-filter) #320
