Selective Context can return 0 results for a user with a small share of a large collection (HNSW post-filter) #320

@bygadd

With Selective Context (a scoped provider), retrieval can return 0 documents for a user whose indexed chunks are only a small fraction of a large collection; users with small corpora are fine.

Cause: VectorDB._similarity_search runs a query of roughly this shape: ... WHERE id IN (<chunk_ids>) ORDER BY embedding <=> :q LIMIT k. For a large IN list the planner can pick an HNSW index scan, which first finds the globally nearest ~hnsw.ef_search vectors (40 by default) and only then post-filters by id. If none of the user's chunks are among those global nearest neighbours, the result is empty.
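
The failure mode can be reproduced without a database. The sketch below is a brute-force toy model, not pgvector's actual HNSW traversal: it assumes the index hands back exactly the ef_search globally nearest rows and that the id filter is applied afterwards. The function names and the toy data are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Cosine distance, the metric behind pgvector's <=> operator."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def post_filter_search(rows, query, allowed_ids, k, ef_search):
    """Toy model of the problematic plan: nearest-first, id filter second."""
    nearest = sorted(rows, key=lambda r: cosine_distance(r[1], query))[:ef_search]
    return [rid for rid, _ in nearest if rid in allowed_ids][:k]


def pre_filter_search(rows, query, allowed_ids, k):
    """Toy model of the safe plan: id filter first, exact ranking second."""
    candidates = [r for r in rows if r[0] in allowed_ids]
    candidates.sort(key=lambda r: cosine_distance(r[1], query))
    return [rid for rid, _ in candidates[:k]]


# 1000 chunks owned by other users sit close to the query vector;
# the 3 chunks owned by our user are all further away.
rows = [(i, (1.0, 0.001 * i)) for i in range(1000)]
rows += [(1000 + j, (0.1 * (j + 1), 1.0)) for j in range(3)]
query = (1.0, 0.0)
user_ids = {1000, 1001, 1002}

empty = post_filter_search(rows, query, user_ids, k=2, ef_search=40)
exact = pre_filter_search(rows, query, user_ids, k=2)
```

Here `empty` is an empty list even though the user has three indexed chunks, while `exact` contains the user's two closest chunks. The smaller the user's share of the collection, the less likely any of their rows fall inside the first ef_search candidates.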

Two ways to address:

  • pgvector ≥ 0.8.0 native: SET hnsw.iterative_scan = relaxed_order (off by default) — HNSW keeps scanning until enough rows pass the filter. Simplest where available.
  • App-side (what we used on 5.3.x): a MATERIALIZED CTE that pre-filters by id (btree) before computing similarity, so the HNSW plan is never chosen — deterministic and exact.
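
Both options can be sketched as small query-building helpers. This is a minimal sketch under stated assumptions, not the project's actual code: the table name `chunks`, the helper names, the `id = ANY(:chunk_ids)` form, and the `CAST(:q AS vector)` binding are all hypothetical, and the `:name` placeholders assume a driver or wrapper that supports named parameters and adapts a Python list to a PostgreSQL array.

```python
ITERATIVE_SCAN_MODES = {"off", "strict_order", "relaxed_order"}


def iterative_scan_statement(mode="relaxed_order"):
    """Option 1 (pgvector >= 0.8.0): session setting to run before the search."""
    if mode not in ITERATIVE_SCAN_MODES:
        raise ValueError(f"unsupported hnsw.iterative_scan mode: {mode!r}")
    return f"SET hnsw.iterative_scan = {mode}"


# Option 2: AS MATERIALIZED (PostgreSQL 12+) forces the id filter to be
# evaluated first, so the distance ordering runs only over the user's rows.
# Table and column names are assumptions for illustration.
PREFILTER_SQL = """
WITH candidates AS MATERIALIZED (
    SELECT id, embedding
    FROM chunks
    WHERE id = ANY(:chunk_ids)
)
SELECT id, embedding <=> CAST(:q AS vector) AS distance
FROM candidates
ORDER BY distance
LIMIT :k
"""


def build_prefilter_query(chunk_ids, query_embedding, k):
    """Return (sql, params) for the pre-filtered, exact similarity search."""
    # pgvector's text input format is '[x1,x2,...]'.
    vector_literal = "[" + ",".join(str(float(x)) for x in query_embedding) + "]"
    params = {"chunk_ids": list(chunk_ids), "q": vector_literal, "k": int(k)}
    return PREFILTER_SQL, params
```

The trade-off is that the CTE variant computes distances exactly over every row in the id list, which costs more as that list grows, whereas the iterative-scan setting keeps the index in play but only exists on pgvector 0.8.0 and later. If the setting should not leak to other queries on a pooled connection, `SET LOCAL` inside a transaction is one way to scope it.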

Note: the plan choice is cost-dependent — in a re-test the planner used the btree id index and returned correct rows, so the 0-result plan isn't guaranteed; it's a latent risk under certain id-list sizes/stats rather than a constant failure.
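
Because the outcome depends on which plan is picked, it can help to check the plan for a given id list rather than rely on the result alone. The helper below is a rough sketch that pulls index names out of the text lines returned by EXPLAIN; the function name is hypothetical, and the index names to look for depend on the deployment.

```python
import re

# Matches the node headers PostgreSQL prints for index-based scans, e.g.
# "Index Scan using <index> on <table>" and "Bitmap Index Scan on <index>".
_INDEX_NODE = re.compile(
    r"Index (?:Only )?Scan(?: Backward)? using (\S+) on"
    r"|Bitmap Index Scan on (\S+)"
)


def indexes_in_plan(plan_lines):
    """Return the index names referenced by an EXPLAIN text plan, in order."""
    names = []
    for line in plan_lines:
        match = _INDEX_NODE.search(line)
        if match:
            names.append(match.group(1) or match.group(2))
    return names
```

If the HNSW index on the embedding column shows up for the filtered query, the post-filter plan is in use for that id list; if only the btree index on id appears, the rows are being pre-filtered.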

(Disclosure: investigated with AI assistance; verified against the source and on a live deployment.)
