Simplify the `DataProvider` contract for graph search by hildebrandmw · Pull Request #1067 · microsoft/DiskANN

hildebrandmw · 2026-05-14T01:43:04Z

Executive Summary: What happens if we start simplifying our trait hierarchy and rely on ExpandBeam to do the heavy lifting?

This is controversial and I have no intention (yet) of turning this into a full PR, but it demonstrates the feasibility of something I've come to realize recently. The whole Accessor -> BuildQueryComputer -> ExpandBeam originated from earlier patterns in the library of building up indexing algorithms from tiny pieces, enabling a user to implement a small set of methods and get the whole indexing algorithm for free.

This breaks down, though, when we consider performance: the more information we provide a backend database about the operation we want to perform, the better the database can optimize. This lead to on_elements_unordered (bulk element retrieval, allowing for example prefetching), distances_unordered (bulk retrieval with a concrete distance computations), and finally ExpandBeam. Of these, the latter has proved by far the most useful, being used by diskann-disk and diskann-garnet to specialize the algorithm to their preferred data access pattern.

I've been exploring how to make filtered search more of a first-class citizen to allow us to batch together predicate queries and potentially fuse predicate evaluation with data access. I was halfway through writing a parallel Accessor/Computer/ExpandBeam/Strategy hierarchy for predicate aware search (and struggling with expressing the more complicated type relationships required), when I realized that all we really needed were a few slightly different flavors of ExpandBeam. Instead of building a complex nest of interrelated traits to express these subtly different ExpandBeam methods, why not just require users to implement the necessary ExpandBeam? The required semantics are not that complicated and it gives the implementor full control of the data access.

This PR is an experiment in removing all but the ElementRef GAT of the Accessor trait and instead making ExpandBeam the most relevant method. There's more simplification that can be done: fusing SearchExt with ExpandBeam, getting rid of the pesky AsNeighbor bound from ExpandBeam, simplifying BuildQueryComputer to maybe not even need the PreprocessedDistanceFunction bound. However, this is already a large reduction in code and has a couple extra benefits:

The ExpandBeam implementations for inmem are actually more efficient than the previous provided implementations because we can more aggressively evaluate the visited predicate (and merge the neighbor buffer in the prefetch id_buffer to save an allocation on each ExpandBeam call. This is not a significant increase in code because we simply move the old on_elements_unordered.
We can drop a bunch of test code from diskann testing the provided implementations that are now no longer needed.
If we merge ExpandBeam with SearchExt, then all the relevant bits for search are grouped in a single place instead of spread out for easier discovery.
We can shed a bunch of complexity in diskann, with potentially more simplification if we continue down this path.
Users need to implement fewer traits, and the implementation of ExpandBeam is perhaps a little more prone to error. This is a distinct negative, but if everything is collected into a single place, it's easier to audit.

Now for some tradeoffs:

We don't have vocabulary to implement the flat_search function. I kind of view this as a bonus: the existing flat search relies on an ID iterator (not easy for most backends) and uses an inefficient loop. The call in diskann-disk would likely be much more efficient if it was just done directly on the PQ dataset.
This substantially breaks the design in diskann-label-filter. Coarsening the granularity of data access methods means the wrapping done by the label provider is less effective. Again, though, this is probably a feature: since real implementations are moving towards ExpandBeam for performance, the label provider may not be a good approach anyways. We certainly had this experience in Remove the Caching Provider #1052.
There is still some friction in test code that is relying on Accessor's get-element. This can be grouped into the larger diskann-providers cleanup.
The provided implementation of Fill for diskann::graph::workingset::Map is no longer applicable. Users have to write their own relatively easy Fill method. I don't think this is as big of a problem as it sounds because I don't think any serious implementation uses the provided implementation: it's too slow.

Why is this relevant now? Aside from the filtered search work, #983 is working towards adding flat search and introduces an even finer granularity of traits in the hierarchy, when my gut tells me we should be moving the other direction towards slightly coarser traits. Also, if we have an opportunity to simplify our codebase, I think we should take it.

Why is this possible now? #859 moved one of the last big get_element related operations behind a bulk operation. Surprisingly, that removed all calls to the raw get_element from insert or inplace delete.

metajack · 2026-05-14T15:42:31Z

get_element is basically unused in garnet, except for implementing VEMB (return an vector) and other debugging-style functions.

This sounds like a very promising direction, and I am very supportive.

Remove `flat_search` from `DiskANNIndex` and the `IdIterator` trait from `diskann`. Since the only caller was from `diskann-disk`, add a new `flat_search` inherent method to `DiskIndexSearcher`. The flat search method is not compatible with the experimental direction in #1067 and with #983 on the horizon, this is safe to move for now.

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

…periment

Paged search has been causing all kinds of issues for our code base and is actively getting in the way of simplifications in #1067 due to interactions with the `PagedSearchState`. The TLDR of the issue is that `PagedSearchState` requires types to be `'static` and introduces the need to "pause" and "resume" search state in a way that is complex to describe in trait bounds. Since our code is already async, we can lean into that and use the usual Rust machinery to embed non-`'static` paged searcher inside an otherwise `'static` future. The recommended way to now interact with paged search is via [channels](https://docs.rs/tokio/latest/tokio/sync/mpsc/fn.channel.html). [Rendered RFC](https://github.com/microsoft/DiskANN/blob/b63d009893f362e07ace518314f7c85733cba212/rfcs/01078-paged-search.md) ## API Migration Guide | Old pattern | New pattern | |---|---| | `index.start_paged_search(s, ctx, q, l).await` | `index.paged_search(s, ctx, q, l).await` | | `index.next_search_results(ctx, &mut state, k, &mut buf).await` | `search.next_page(k).await` | | `SearchState<Id, (S, C)>` | `PagedSearch<'a, DP, S, T>` | | `PagedSearchState<DP, S, C>` | `PagedSearch<'a, DP, S, T>` | | Check return count for exhaustion | Check `page.is_empty()` | If existing code embedded the `SearchState` in some `'static` container, that is no longer viable because of the borrow. Instead, channels can be used for this communication: ```rust // Types are illustrative — adapt names to your crate. type PageResult = ANNResult<Vec<Neighbor<ExternalId>>>; /// Spawn a paged search session. The index is held by Arc so the task is 'static. /// /// Returns a request channel and a result channel. The caller sends the desired /// page size (`k`) and awaits the corresponding result on the other end. fn spawn_paged_session( index: Arc<DiskANNIndex<DP>>, context: Arc<DP::Context>, query: T, l: usize, ) -> (mpsc::Sender<usize>, mpsc::Receiver<PageResult>) { let (req_tx, mut req_rx) = mpsc::channel::<usize>(1); let (res_tx, res_rx) = mpsc::channel::<PageResult>(1); tokio::spawn(async move { // Borrow from the Arc — these references are scoped to the task. let mut search = index.paged_search(strategy, &*context, query, l).await.unwrap(); while let Some(k) = req_rx.recv().await { let page = search.next_page(k).await; if res_tx.send(page).await.is_err() { break; // caller dropped the result receiver } } // Request channel closed -> caller dropped sender -> clean shutdown. }); (req_tx, res_rx) } ``` If code was already explicitly using a `.await` loop with `SearchState`, then minimal changes should be needed. ## For Users of Paged Search via `wrapped_async` Users of paged search via `wrapped_async::DiskANNIndex` that know their inner futures will never suspend can use the new `wrapped_async::DiskANNIndex::paged_search_no_await`. This will use the new API transparently via `wrapped_async::noawait::PagedSearch` and efficiently run paged searches with minimal synchronization overhead. This should **only** be used if the implementation of `Accessor`, `BuildQueryComputer`, `SearchExt`, `DataProvider`, and `ExpandBeam` are known to never yield and always complete with `Poll::Ready`. --------- Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

In preparation for #1067, switch the implementation of `DynIndex::search_element` from using the `Accessor` trait to an inherent method on the underlying provider.

Refactor `DiskIndexSearcher::flat_search` to use the bulk `pq_distances` method instead of one-by-one computation. This does two things: 1. The bulk method is presumably a little more efficient. 2. This moves the implementation away from `Accessor`/`BuildQueryComputer` to help with #1067. The small tricky bit is deriving the maximum batch size from the size of scratch. To that end, I added a private accessor for `PQScratch` to return this bound. --------- Co-authored-by: Mark Hildebrand <mhildebrand@microsoft.com>

This PR introduces a trait interface and a light index to support brute-force search for providers that can be used as/are a flat-index. There is an associated RFC that walks through the interface and associated implementation in `diskann` as a new `flat` module. Rendered RFC [link](https://github.com/microsoft/DiskANN/blob/u/adkrishnan/flat-index/rfcs/00983-flat-search.md). ## Motivation The repo has no first-class surface for brute-force search. This PR adds a small trait hierarchy that gives flat search the same provider-agnostic shape that graph search has, so any backend (in-memory, quantized, disk, remote) can plug in once and reuse a shared algorithm. ## Traits (`flat/strategy.rs`) **`DistancesUnordered<C>`** — the single trait a backend must implement. Fuses iteration and scoring into one method: the implementation drives a full scan, scoring each element with a precomputed query computer `C`, and invokes a callback with `(id, distance)` pairs. Key associated types: - `ElementRef<'a>` -- the reference shape `C` scores against. - `Id` -- the id type yielded to the callback (decoupled from `HasId` so visitors can yield any id shape). - `C : for<'a> PreprocessedDistanceFunction<Self::ElementRef<'a>, f32>` -- the precomputer query computer. ```rust pub trait DistancesUnordered<C>: Send + Sync where C: for<'a> PreprocessedDistanceFunction<Self::ElementRef<'a>, f32>, { type ElementRef<'a>; type Id; type Error: ToRanked + Debug + Send + Sync + 'static; fn distances_unordered<F>( &mut self, computer: &C, f: F, ) -> impl SendFuture<Result<(), Self::Error>> where F: Send + FnMut(Self::Id, f32); } ``` **`SearchStrategy<P, T>`** — factory that creates a `DistancesUnordered` visitor from a provider + context, and builds the per-query computer. Mirrors the graph-side strategy pattern. Two fallible methods: - `create_visitor` — borrows provider + context, returns a `Visitor` - `build_query_computer` — preprocesses the query `T` into a `QueryComputer` ```rust pub trait SearchStrategy<P, T>: Send + Sync where P: DataProvider, { type ElementRef<'a>; type Id; type QueryComputer: for<'a> PreprocessedDistanceFunction<Self::ElementRef<'a>, f32>; type QueryComputerError: StandardError; type Visitor<'a>: for<'b> DistancesUnordered< Self::QueryComputer, ElementRef<'b> = Self::ElementRef<'b>, Id = Self::Id, >; type Error: StandardError; fn create_visitor<'a>(&'a self, provider: &'a P, context: &'a P::Context) -> Result<Self::Visitor<'a>, Self::Error>; fn build_query_computer(&self, query: T) -> Result<Self::QueryComputer, Self::QueryComputerError>; } ``` ## Index (`flat/index.rs`) **`FlatIndex<P>`** — thin `'static` wrapper around a `DataProvider`. Currently we have implemented the naive kNN search algorithm for the flat index. `knn_search` asks the strategy for a visitor, builds the query computer, drives `distances_unordered` through a priority queue, and writes results via `SearchPostProcess`. ## Test infrastructure (`flat/test/`) A self-contained test provider with dimension-validated `Strategy`, transient-error injection, and a `KnnOracleRun` harness that compares `knn_search` results against a brute-force reference with baseline caching for regression detection. ## Future work - **Post-processing** — `knn_search` currently uses the graph-side `SearchPostProcess` trait to write results into the output buffer. #1067 will introduce a flat-specific post-processing step that decouples flat search from the graph module's output machinery. - **Vector ids** - The vector id over which `DistancesUnordered` acts is an associated type, instead of the `InternalId` of the provider. This is due to overly restrictive trait bounds on `VectorId` trait. We plan on relaxing this allowing for more generic id types. --------- Co-authored-by: Alex Razumov (from Dev Box) <alrazu@microsoft.com>

Mark Hildebrand added 2 commits May 13, 2026 17:49

Checkpoint.

a496628

Remove now unneeded code.

b8e382b

Mark Hildebrand added 3 commits May 14, 2026 15:11

Checkpoint.

0a241a2

More simplification.

8aedf0c

Tidy up betafilter.

e2d495a

hildebrandmw mentioned this pull request May 15, 2026

[flat index] Flat Search Interface #983

Merged

hildebrandmw changed the title ~~[DO NOT MERGE] Delete all the things!~~ Simplify the DataProvider contract for graph search May 15, 2026

hildebrandmw mentioned this pull request May 15, 2026

[diskann/disk] Remove flat_search from DiskANNIndex #1076

Merged

Mark Hildebrand added 4 commits May 15, 2026 13:44

Checkpoint before one more experiment.

292c7aa

Overhaul PagedSearch.

70cfe62

Query Computers no longer need to be 'static.

2193025

Clarify use of fn pointer.

9c773eb

hildebrandmw mentioned this pull request May 15, 2026

[RFC/diskann] Overhaul paged search #1078

Merged

Mark Hildebrand and others added 11 commits May 18, 2026 12:23

Paged search 2.0

b63d009

Merge remote-tracking branch 'origin/main' into mhildebr/paged

e8e51b2

Apply suggestions from code review

c357f02

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Fix documentation.

02e2ccc

Commit simplification.

feb2e97

Add specialized noawait paged-search.

7344b00

Merge remote-tracking branch 'origin/main' into mhildebr/paged

8098d07

Checkpoint.

5e40368

Merge remote-tracking branch 'origin/mhildebr/paged' into mhildebr/ex…

db8aebe

…periment

Not sure about this line of thought.

5083dc5

Well - this is something.

56122a5

hildebrandmw mentioned this pull request May 21, 2026

Refactor DiskIndexSearcher::flat_search to use batching #1097

Merged

hildebrandmw mentioned this pull request May 21, 2026

Remove direct accessor call in diskann-garnet #1098

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simplify the `DataProvider` contract for graph search#1067

Simplify the `DataProvider` contract for graph search#1067
hildebrandmw wants to merge 20 commits into
mainfrom
mhildebr/experiment

hildebrandmw commented May 14, 2026 •

edited

Loading

Uh oh!

metajack commented May 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

hildebrandmw commented May 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

metajack commented May 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

hildebrandmw commented May 14, 2026 •

edited

Loading