Significantly speed up bitmap computation by magdalendobson · Pull Request #1099 · microsoft/DiskANN

magdalendobson · 2026-05-21T22:09:59Z

Introduction

Bitmap computation in diskann-label-filter is unacceptably slow. Currently, with a 1 million size slice of yfcc and a 10k query set, computing the query bitmaps takes 43.10 seconds. With just a 100K slice of the caselaw dataset and a 10k query set, computing the bitmaps takes 6.03 seconds. This was making it hard to run experiments on filtered search algorithms for the full sizes of these datasets.

Speeding up the bitmap computation is conceptually simple. Instead of iterating over every base label for every query filter, we compute an inverted index for each label type, which maps the label value to the documents with the same value. Then, at query time, we query the inverted index for the relevant label values, and compose the resulting sets as necessary to find the documents satisfying the entire filter expression. At a high level, that is what this PR does.

Lower level details

The overall workflow of the main function, compute_query_bitmaps, is as follows:

Check whether the query expression contains any ASTExpr::Not clauses. If so, default to the existing slow path. This is because we don't store the document universe for each label, and thus can't compute the complement of an arbitrary bitset.
Otherwise, move to the fast path.
Flatten the base labels so that nested values map to a single string key (e.g. the JSON object {"car": {"color": "red", "make": "Mazda"}} is transformed to {"car.color": "red", "car.make": "Mazda"}), and reorganize them as a hash map of labels to values.
For each label, compute either an inverted index (strings and bools) or a B-tree (ints and floats), depending on its type.
At query time, use either the inverted index or the B-tree to produce a bitset for each CompareOp in the clause, and then compose them with AND and OR as needed to produce the final bitset.
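
The flattening step can be sketched as follows. This is a minimal illustration, not the crate's code: the `Value` enum and the `flatten` function are hypothetical stand-ins for whatever label type and flattening utility the crate actually uses, and only the "." separator is taken from the description above.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for a JSON value (hypothetical; not the crate's label type).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Str(String),
    Object(BTreeMap<String, Value>),
}

/// Recursively flatten nested objects into dotted keys.
fn flatten(prefix: &str, value: &Value, out: &mut BTreeMap<String, Value>) {
    match value {
        Value::Object(fields) => {
            for (key, child) in fields {
                // Join parent and child names with "." (no separator at the root).
                let path = if prefix.is_empty() {
                    key.clone()
                } else {
                    format!("{prefix}.{key}")
                };
                flatten(&path, child, out);
            }
        }
        // Leaves are stored under their full dotted path.
        leaf => {
            out.insert(prefix.to_string(), leaf.clone());
        }
    }
}

fn main() {
    let car = Value::Object(BTreeMap::from([
        ("color".to_string(), Value::Str("red".to_string())),
        ("make".to_string(), Value::Str("Mazda".to_string())),
    ]));
    let doc = Value::Object(BTreeMap::from([("car".to_string(), car)]));

    let mut flat = BTreeMap::new();
    flatten("", &doc, &mut flat);
    for (key, value) in &flat {
        println!("{key} = {value:?}");
    }
}
```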

We also add a utility to diskann-label-filter for computing the specificity of a set of query filters with respect to a base set, outputting some statistics on it, and optionally outputting the individual specificity values to a file for further processing.
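
For readers unfamiliar with the term, here is a rough sketch of what such a utility computes. It assumes that specificity means the fraction of base documents that satisfy a query filter; the PR text does not define it, so treat that definition, and the names `specificity` and `summarize`, as assumptions. A `BTreeSet` stands in for the bitmap.

```rust
use std::collections::BTreeSet;

/// Assumed definition: fraction of base documents matching the filter.
fn specificity(matches: &BTreeSet<usize>, num_base: usize) -> f64 {
    if num_base == 0 {
        return 0.0; // guard against an empty base set
    }
    matches.len() as f64 / num_base as f64
}

/// (min, mean, max) of the values, or None for an empty slice.
fn summarize(values: &[f64]) -> Option<(f64, f64, f64)> {
    if values.is_empty() {
        return None;
    }
    let min = values.iter().copied().fold(f64::INFINITY, f64::min);
    let max = values.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let mean = values.iter().sum::<f64>() / values.len() as f64;
    Some((min, mean, max))
}

fn main() {
    let num_base = 8;
    // One match set per query filter.
    let per_query: Vec<BTreeSet<usize>> = vec![
        BTreeSet::from([0, 1, 2, 3]),
        BTreeSet::from([5]),
        BTreeSet::new(),
    ];
    let specs: Vec<f64> = per_query
        .iter()
        .map(|m| specificity(m, num_base))
        .collect();
    if let Some((min, mean, max)) = summarize(&specs) {
        println!("min={min} mean={mean} max={max}");
    }
}
```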

Inverted Index

The inverted index maps each label value, converted to a string, to a bitset containing the doc ids corresponding to that value.
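
A minimal sketch of this structure and of the AND/OR composition described earlier. It uses `BTreeSet<usize>` as a std-only stand-in for the bitset, and the `Expr` enum, `build_index`, and `eval` are hypothetical names for illustration, not the PR's types.

```rust
use std::collections::{BTreeSet, HashMap};

/// Doc-id set; a std-only stand-in for a bitset.
type DocSet = BTreeSet<usize>;
/// field -> value (as a string) -> ids of docs having that value.
type InvertedIndex = HashMap<String, HashMap<String, DocSet>>;

/// Toy filter AST (hypothetical; equality comparisons only).
enum Expr {
    Eq { field: String, value: String },
    And(Vec<Expr>),
    Or(Vec<Expr>),
    Not(Box<Expr>),
}

fn build_index(docs: &[Vec<(&str, &str)>]) -> InvertedIndex {
    let mut index = InvertedIndex::new();
    for (doc_id, labels) in docs.iter().enumerate() {
        for (field, value) in labels {
            index
                .entry(field.to_string())
                .or_default()
                .entry(value.to_string())
                .or_default()
                .insert(doc_id);
        }
    }
    index
}

fn eval(expr: &Expr, index: &InvertedIndex) -> Result<DocSet, String> {
    match expr {
        // Unknown fields or values simply match nothing.
        Expr::Eq { field, value } => Ok(index
            .get(field)
            .and_then(|values| values.get(value))
            .cloned()
            .unwrap_or_default()),
        Expr::And(children) => {
            let mut acc: Option<DocSet> = None;
            for child in children {
                let set = eval(child, index)?;
                acc = Some(match acc {
                    None => set,
                    Some(prev) => prev.intersection(&set).copied().collect(),
                });
            }
            Ok(acc.unwrap_or_default())
        }
        Expr::Or(children) => {
            let mut acc = DocSet::new();
            for child in children {
                acc.extend(eval(child, index)?);
            }
            Ok(acc)
        }
        // No document universe is stored, so a complement cannot be formed here.
        Expr::Not(_) => Err("NOT is not supported on the fast path".to_string()),
    }
}

fn main() {
    let docs = vec![
        vec![("car.color", "red"), ("car.make", "Mazda")],
        vec![("car.color", "red"), ("car.make", "Honda")],
        vec![("car.color", "blue"), ("car.make", "Mazda")],
    ];
    let index = build_index(&docs);
    let eq = |f: &str, v: &str| Expr::Eq { field: f.to_string(), value: v.to_string() };

    let red_mazda = Expr::And(vec![eq("car.color", "red"), eq("car.make", "Mazda")]);
    println!("{:?}", eval(&red_mazda, &index));
    let not_red = Expr::Not(Box::new(eq("car.color", "red")));
    println!("{:?}", eval(&not_red, &index));
}
```

In this sketch a NOT clause returns an error so that a caller can route the query to a slower evaluator.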

B-Tree

For simplicity, the B-tree implementation converts integers to floats before inserting so that we don't have to deal with two different types of B-tree. The performance of this piece of code isn't sensitive enough that it makes sense to differentiate, but this could be changed in the future.
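
One caveat with this conversion is that not every i64 has an exact f64 representation (integers beyond 2^53 in magnitude may be rounded). A possible guard is sketched below; `to_f64_lossless` is a hypothetical helper, not necessarily the check the PR uses.

```rust
/// Convert an i64 to f64 only when the round trip is exact.
fn to_f64_lossless(v: i64) -> Option<f64> {
    let f = v as f64;
    // Compare in i128: values near i64::MAX round up to 2^63, and a
    // saturating f64 -> i64 cast would hide that rounding.
    if (f as i128) == (v as i128) {
        Some(f)
    } else {
        None
    }
}

fn main() {
    let exact = 1_i64 << 53; // 2^53 is representable
    let inexact = (1_i64 << 53) + 1; // 2^53 + 1 is not
    println!("{:?}", to_f64_lossless(exact));
    println!("{:?}", to_f64_lossless(inexact));
    println!("{:?}", to_f64_lossless(i64::MAX));
}
```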

The B-tree stores the ids for each value as a vector rather than a bitset, because concatenating vectors is much cheaper than extending bitsets, and potentially many vectors would be concatenated during a range query.
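
A small sketch of a range query over such a structure, using `std::collections::BTreeMap`. The `Key` wrapper (a total order over f64 via `f64::total_cmp`, assuming NaN is never inserted) and the functions `build` and `range_query` are illustrative assumptions, not the PR's implementation.

```rust
use std::cmp::Ordering;
use std::collections::BTreeMap;
use std::ops::Bound::{Excluded, Included};

/// f64 key with a total order; assumes NaN is never inserted.
#[derive(Debug, Clone, Copy)]
struct Key(f64);

impl PartialEq for Key {
    fn eq(&self, other: &Self) -> bool {
        self.cmp(other) == Ordering::Equal
    }
}
impl Eq for Key {}
impl PartialOrd for Key {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for Key {
    fn cmp(&self, other: &Self) -> Ordering {
        self.0.total_cmp(&other.0)
    }
}

/// value -> ids of docs with that value, kept as a Vec for cheap concatenation.
type NumericIndex = BTreeMap<Key, Vec<usize>>;

/// Build the index from a slice where position i holds doc i's value.
fn build(values: &[f64]) -> NumericIndex {
    let mut index = NumericIndex::new();
    for (doc_id, &v) in values.iter().enumerate() {
        index.entry(Key(v)).or_default().push(doc_id);
    }
    index
}

/// Doc ids with lo <= value < hi, concatenated across all matching keys.
fn range_query(index: &NumericIndex, lo: f64, hi: f64) -> Vec<usize> {
    // BTreeMap::range panics on inverted bounds; this also rejects NaN bounds.
    if !(lo < hi) {
        return Vec::new();
    }
    index
        .range((Included(Key(lo)), Excluded(Key(hi))))
        .flat_map(|(_, ids)| ids.iter().copied())
        .collect()
}

fn main() {
    // Position = doc id, entry = numeric label value.
    let years = [1999.0, 2005.0, 2005.0, 2012.0, 2020.0];
    let index = build(&years);
    println!("{:?}", range_query(&index, 2000.0, 2013.0)); // [1, 2, 3]
}
```

The concatenated ids can then be turned into a single bitset once per comparison, rather than merging one bitset per key.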

Timings

Returning to the earlier timings: for the 1 million size slice of yfcc and a 10k query set, computing the query bitmaps now takes 0.6 seconds (down from 43.10 seconds). For the 100K slice of the caselaw dataset and a 10k query set, computing the bitmaps now takes 1.728 seconds (down from 6.03 seconds).

Copilot

Pull request overview

This PR targets a major performance improvement in diskann-label-filter by introducing a fast-path for computing per-query bitmaps using precomputed per-field accelerators (inverted-index style maps for equality and a numeric BTree for range queries), while falling back to the existing evaluator when NOT is present. It also adds an example utility for computing “specificity” statistics over query filters.

Changes:

Add utils::compute_bitmap::compute_query_bitmaps implementing an accelerated bitmap computation path (with a NOT-guarded slow fallback).
Export the new bitmap API from diskann-label-filter and add an example (compute_specificities) to compute stats/output.
Minor doc comment updates in flattening utilities and dependency updates for the new module.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 7 comments.

Show a summary per file

File	Description
`diskann-label-filter/src/utils/flatten_utils.rs`	Updates doc examples for configurable flattening (one example is currently inconsistent with behavior).
`diskann-label-filter/src/utils/compute_bitmap.rs`	New accelerated bitmap computation implementation plus unit tests.
`diskann-label-filter/src/lib.rs`	Exposes the new module and re-exports `compute_query_bitmaps`.
`diskann-label-filter/examples/compute_specificities.rs`	New example for computing/saving specificity stats from computed bitmaps.
`diskann-label-filter/Cargo.toml`	Adds dependencies needed by the new bitmap computation module.
`Cargo.lock`	Locks new transitive deps (`bit-set`, `rayon`) for this crate.


codecov-commenter · 2026-05-22T15:34:04Z

Codecov Report

❌ Patch coverage is 84.74149% with 121 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.44%. Comparing base (4f70a82) to head (c205a87).
⚠️ Report is 1 commits behind head on main.

Files with missing lines	Patch %	Lines
diskann-tools/src/bin/compute_specificities.rs	0.00%	72 Missing ⚠️
diskann-tools/src/utils/compute_bitmap.rs	93.56%	46 Missing ⚠️
diskann-tools/src/utils/ground_truth.rs	50.00%	3 Missing ⚠️

❌ Your patch status has failed because the patch coverage (84.74%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1099      +/-   ##
==========================================
- Coverage   89.49%   89.44%   -0.05%     
==========================================
  Files         474      481       +7     
  Lines       89761    91118    +1357     
==========================================
+ Hits        80332    81504    +1172     
- Misses       9429     9614     +185

Flag	Coverage Δ
miri	`89.44% <84.74%> (-0.05%)`	⬇️
unittests	`89.09% <84.74%> (-0.05%)`	⬇️

Files with missing lines	Coverage Δ
diskann-label-filter/src/utils/flatten_utils.rs	`87.65% <ø> (ø)`
diskann-tools/src/utils/ground_truth.rs	`50.00% <50.00%> (-1.50%)`	⬇️
diskann-tools/src/utils/compute_bitmap.rs	`93.56% <93.56%> (ø)`
diskann-tools/src/bin/compute_specificities.rs	`0.00% <0.00%> (ø)`

... and 7 files with indirect coverage changes

hildebrandmw

Performance improvement of bitmap calculation is really needed to unblock serious work on the filtered algorithms, so I have a vested interest in seeing something like this merged ASAP.

That said, there is a lot of work needed to fully productionize it. As examples:

The conversion of i64 to f64 is an imprecise conversion. Similarly, u64 would have to be treated separately from i64, which it currently isn't in AttributeValue.
I'm not convinced the handling of "." as a separator is robust: what happens if a user string contains a "."? For example, currently {"a.b": 1} and {"a": {"b": 1}} get lowered the same, creating ambiguity.
Probably want to use RoaringBitSet instead of BitSet.
This adds anyhow as a low-level library error type, which is not a great fit.
This also unconditionally adds rayon as a dependency of diskann-label-filter and doesn't provide a caller with a clear ability to opt-out.
Copy-paste in eval_query_using_accelerators that could be factored out.
Many of the helper structs are made public that don't necessarily need to be.
This is probably fine for the datasets we are testing on where everything is "nice", but I have concerns about the overall correctness in the presence of corner cases.
Minor, but the PR description says R-tree when the implementation uses a B-tree.

To unblock algorithmic work, though, what if we do the following:

Put this behind an "experimental" feature flag and in an "experimental" module to clearly indicate that things can go wrong.
Gate the new dependencies on this feature to avoid unconditionally adding dependencies (in particular, anyhow and rayon).
Enable this feature in benchmarks to unblock filtered work.

magdalendobson · 2026-05-22T17:39:15Z

I'll address your individual comments separately, but one thing this makes me wonder is whether we should actually move this piece of code to diskann-tools. It's true that, unlike most of the rest of the library, this is geared towards internal users who are computing groundtruth, benchmarking, etc. This would resolve your concern about adding rayon and anyhow to diskann-label-filter, which I agree has drawbacks. To me this would be significantly preferable for the user experience to an experimental feature.

magdalendobson · 2026-05-22T19:10:49Z

> Performance improvement of bitmap calculation is really needed to unblock serious work on the filtered algorithms, so I have a vested interest in seeing something like this merged ASAP.
>
> That said, there is a lot of work needed to fully productionize it. As examples:
>
> The conversion of i64 to f64 is an imprecise conversion. Similarly, u64 would have to be treated separately from i64, which it currently isn't in AttributeValue.

With the move to diskann-tools, is this still something you would like to see handled?

> I'm not convinced the handling of "." as a separator is robust: what happens if a user string contains a "."? For example, currently {"a.b": 1} and {"a": {"b": 1}} get lowered the same, creating ambiguity.

Doesn't our existing code already explicitly treat those two expressions as the same? E.g.

DiskANN/diskann-label-filter/src/parser/query_parser.rs, line 50 in e3139b4:

/// Helper to get a nested value from a label using dot notation (e.g., "specs.cpu")

> Probably want to use RoaringBitSet instead of BitSet.

After the move to diskann-tools, since the existing drivers use usize for vector ids, I don't think it makes sense to move to RoaringBitSet.

> This adds anyhow as a low-level library error type, which is not a great fit.

Resolved by the move to diskann-tools.

> This also unconditionally adds rayon as a dependency of diskann-label-filter and doesn't provide a caller with a clear ability to opt-out.

Resolved by the move to diskann-tools.

> Copy-paste in eval_query_using_accelerators that could be factored out.

Resolved in latest edits.

> Many of the helper structs are made public that don't necessarily need to be.
> This is probably fine for the datasets we are testing on where everything is "nice", but I have concerns about the overall correctness in the presence of corner cases.

In latest edits I made the helper functions and structs private.

> Minor, but the PR description says R-tree when the implementation uses a B-tree.

Resolved.

> To unblock algorithmic work, though, what if we do the following:
>
> Put this behind an "experimental" feature flag and in an "experimental" module to clearly indicate that things can go wrong.
>
> Gate the new dependencies on this feature to avoid unconditionally adding dependencies (in particular, anyhow and rayon).
>
> Enable this feature in benchmarks to unblock filtered work.

hildebrandmw · 2026-05-22T22:06:28Z

Thanks Magdalen, most of this gets resolved by moving it to diskann-tools, which is a better fit.

harsha-simhadri · 2026-05-22T22:17:02Z

+    let args: Vec<String> = env::args().collect();
+    if args.len() != 3 && args.len() != 4 {
+        eprintln!(
+            "Usage: {} <base_label_file> <query_label_file> [specificity_output_file]",


Would prefer input intake with argparse. This is error prone.

harsha-simhadri · 2026-05-22T22:33:16Z

+use std::mem::discriminant;
+use std::ops::Bound::{Excluded, Included, Unbounded};
+
+struct NotNonNan;


Could you provide some explanation for what a NotNonNan is?

harsha-simhadri · 2026-05-22T22:36:53Z

+
+fn check_for_disallowed_operators(query_expr: &ASTExpr) -> bool {
+    match query_expr {
+        ASTExpr::Not(_) => true,


Should the check for disallowed operators be more central than here (such as where the syntax is parsed and validated)?

harsha-simhadri · 2026-05-22T22:41:02Z

+            Ok(acc.unwrap_or_else(BitSet::new))
+        }
+        ASTExpr::Not(_) => Err(anyhow::anyhow!(
+            "NOT operator is not supported when using query accelerators"


harsha-simhadri · 2026-05-22T22:43:41Z

+        }
+    }
+}
+


Could this be designed with "query_accelerator" as a trait with multiple concrete implementations?

harsha-simhadri

Posted some questions inline. Thanks.

Magdalen Manohar and others added 5 commits May 19, 2026 13:01

add specificity utility

0a9980f

refactor example, add compute_bitmap

e99177a

commit to switch

ce447a3

work out kinks in OrderedFloat

e039051

undo change in docstring

8b47aef

magdalendobson marked this pull request as ready for review May 21, 2026 22:13

magdalendobson requested review from a team and Copilot May 21, 2026 22:13

Copilot started reviewing on behalf of magdalendobson May 21, 2026 22:13

Copilot AI reviewed May 21, 2026

Potential fix for pull request finding

962fdf6

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot started work on behalf of magdalendobson May 22, 2026 14:07

Potential fix for pull request finding

07cb5bd

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot started work on behalf of magdalendobson May 22, 2026 14:08

Copilot AI and others added 2 commits May 22, 2026 14:11

fix label-filter accelerator doc id mapping for inverted index

80574ca

Agent-Logs-Url: https://github.com/microsoft/DiskANN/sessions/f16c59eb-89cf-4480-b6fe-afe4be5e7c8e Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>

fix: use document ids for numeric btree accelerator postings

ecc3895

Agent-Logs-Url: https://github.com/microsoft/DiskANN/sessions/727d3d88-3d0b-47bf-a023-9170d72fb87a Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>

Copilot finished work on behalf of magdalendobson May 22, 2026 14:16

Copilot AI and others added 2 commits May 22, 2026 14:17

fix: guard compute_specificities against empty base labels

4fe2935

Agent-Logs-Url: https://github.com/microsoft/DiskANN/sessions/47e2bd0f-cb8b-495f-8274-02a88596b0e6 Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>

Avoid cloning/silencing errors in query accelerator build

2161a1d

Agent-Logs-Url: https://github.com/microsoft/DiskANN/sessions/1bc31d27-7a19-4c4c-9ecc-c10260b944a3 Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>

Copilot finished work on behalf of magdalendobson May 22, 2026 14:20

Copilot finished work on behalf of magdalendobson May 22, 2026 14:23

Magdalen Manohar added 6 commits May 22, 2026 14:24

change format

164f4b9

fmt

cce1a8a

Merge branch 'main' of github.com:microsoft/DiskANN into users/magdal…

8c20dc9

…en/add_filter_utils

Revert "Avoid cloning/silencing errors in query accelerator build"

8f22beb

This reverts commit 2161a1d.

Revert "fmt"

f09f9b8

This reverts commit cce1a8a.

Reapply "Avoid cloning/silencing errors in query accelerator build"

1696ee3

This reverts commit 8f22beb.

Magdalen Manohar added 3 commits May 22, 2026 14:41

small changes

d631aae

Revert "small changes"

38688be

This reverts commit d631aae.

fmt

9dcca30

Copilot started work on behalf of magdalendobson May 22, 2026 14:45

Copilot finished work on behalf of magdalendobson May 22, 2026 14:57

fix clippy, fmt

7553b99

update groundtruth calculation to use fast bitmap computation

9676dac

hildebrandmw requested changes May 22, 2026

Comment thread diskann-tools/src/utils/compute_bitmap.rs

Magdalen Manohar added 7 commits May 22, 2026 18:36

reduce repeated code, reduce instances of pub

17d069e

remove roaring

287d438

remove crate

43aefb3

Revert "remove crate"

50f274b

This reverts commit 43aefb3.

remove crate

1f774bb

move to diskann-tools

07b2d2b

remove from toml file

0713244

add check that i64 -> f64 conversion is lossless

c205a87

hildebrandmw approved these changes May 22, 2026

harsha-simhadri reviewed May 22, 2026
