Significantly speed up bitmap computation #1099
Pull request overview
This PR targets a major performance improvement in diskann-label-filter by introducing a fast-path for computing per-query bitmaps using precomputed per-field accelerators (inverted-index style maps for equality and a numeric BTree for range queries), while falling back to the existing evaluator when NOT is present. It also adds an example utility for computing “specificity” statistics over query filters.
Changes:
- Add `utils::compute_bitmap::compute_query_bitmaps`, implementing an accelerated bitmap computation path (with a `NOT`-guarded slow fallback).
- Export the new bitmap API from `diskann-label-filter` and add an example (`compute_specificities`) to compute stats/output.
- Minor doc comment updates in flattening utilities and dependency updates for the new module.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| `diskann-label-filter/src/utils/flatten_utils.rs` | Updates doc examples for configurable flattening (one example is currently inconsistent with behavior). |
| `diskann-label-filter/src/utils/compute_bitmap.rs` | New accelerated bitmap computation implementation plus unit tests. |
| `diskann-label-filter/src/lib.rs` | Exposes the new module and re-exports `compute_query_bitmaps`. |
| `diskann-label-filter/examples/compute_specificities.rs` | New example for computing/saving specificity stats from computed bitmaps. |
| `diskann-label-filter/Cargo.toml` | Adds dependencies needed by the new bitmap computation module. |
| `Cargo.lock` | Locks new transitive deps (`bit-set`, `rayon`) for this crate. |
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Agent-Logs-Url: https://github.com/microsoft/DiskANN/sessions/f16c59eb-89cf-4480-b6fe-afe4be5e7c8e Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>
Agent-Logs-Url: https://github.com/microsoft/DiskANN/sessions/727d3d88-3d0b-47bf-a023-9170d72fb87a Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>
Agent-Logs-Url: https://github.com/microsoft/DiskANN/sessions/47e2bd0f-cb8b-495f-8274-02a88596b0e6 Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>
Agent-Logs-Url: https://github.com/microsoft/DiskANN/sessions/1bc31d27-7a19-4c4c-9ecc-c10260b944a3 Co-authored-by: magdalendobson <58752279+magdalendobson@users.noreply.github.com>
Codecov Report

❌ Your patch status has failed because the patch coverage (84.74%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Coverage Diff

| | main | #1099 | +/- |
|---|---|---|---|
| Coverage | 89.49% | 89.44% | -0.05% |
| Files | 474 | 481 | +7 |
| Lines | 89761 | 91118 | +1357 |
| Hits | 80332 | 81504 | +1172 |
| Misses | 9429 | 9614 | +185 |
hildebrandmw left a comment
Performance improvement of bitmap calculation is really needed to unblock serious work on the filtered algorithms, so I have a vested interest in seeing something like this merged ASAP.
That said, there is a lot of work needed to fully productionize it. As examples:
- The conversion of `i64` to `f64` is an imprecise conversion. Similarly, `u64` would have to be treated separately from `i64`, which it currently isn't in `AttributeValue`.
- I'm not convinced the handling of "." as a separator is robust: what happens if a user string contains a "."? For example, currently `{"a.b": 1}` and `{"a": {"b": 1}}` get lowered the same, creating ambiguity.
- Probably want to use `RoaringBitSet` instead of `BitSet`.
- This adds `anyhow` as a low-level library error type, which is not a great fit.
- This also unconditionally adds `rayon` as a dependency of `diskann-label-filter` and doesn't provide a caller with a clear ability to opt out.
- Copy-paste in `eval_query_using_accelerators` that could be factored out.
- Many of the helper structs are made public that don't necessarily need to be.

This is probably fine for the datasets we are testing on where everything is "nice", but I have concerns about the overall correctness in the presence of corner cases.

- Minor, but the PR description says R-tree when the implementation uses a B-tree.
To unblock algorithmic work, though, what if we do the following:
- Put this behind an "experimental" feature flag and in an "experimental" module to clearly indicate that things can go wrong.
- Gate the new dependencies on this feature to avoid unconditionally adding dependencies (in particular, `anyhow` and `rayon`).
- Enable this feature in benchmarks to unblock filtered work.
I'll address your individual comments separately, but one thing this makes me wonder is if we should actually move this piece of code to |
With the move to
Doesn't our existing code already explicitly treat those two expressions as the same? E.g.
After the move to
Resolved by the move to
Resolved by the move to
Resolved in latest edits.
In latest edits I made the helper functions and structs private.
Resolved.
Thanks Magdalen, most of this gets resolved by moving it to |
    let args: Vec<String> = env::args().collect();
    if args.len() != 3 && args.len() != 4 {
        eprintln!(
            "Usage: {} <base_label_file> <query_label_file> [specificity_output_file]",
Would prefer input intake with argparse. This is error prone.
    use std::mem::discriminant;
    use std::ops::Bound::{Excluded, Included, Unbounded};

    struct NotNonNan;
Could you provide some explanation for what a `NotNonNan` is?
    fn check_for_disallowed_operators(query_expr: &ASTExpr) -> bool {
        match query_expr {
            ASTExpr::Not(_) => true,
Should the check for disallowed operators be more central than here (such as where the syntax is parsed and validated)?
            Ok(acc.unwrap_or_else(BitSet::new))
        }
        ASTExpr::Not(_) => Err(anyhow::anyhow!(
            "NOT operator is not supported when using query accelerators"
        }
      }
    }
Could this be designed with "query_accelerator" as a trait with multiple concrete implementations?
harsha-simhadri left a comment
posted some questions inline. thanks
Introduction
Bitmap computation in `diskann-label-filter` is unacceptably slow. Currently, with a 1M slice of yfcc and a 10k query set, computing the query bitmaps takes 43.10 seconds. With just a 100K slice of the caselaw dataset and a 10k query set, computing the bitmaps takes 6.03 seconds. This makes it hard to run experiments on filtered search algorithms at the full sizes of these datasets.
Speeding up the bitmap computation is conceptually simple. Instead of iterating over every base label for every query filter, we compute an inverted index for each label type, which maps the label value to the documents with the same value. Then, at query time, we query the inverted index for the relevant label values, and compose the resulting sets as necessary to find the documents satisfying the entire filter expression. At a high level, that is what this PR does.
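As a rough illustration of the idea (not the PR's actual code), the sketch below builds an inverted index from label values to document ids and compares a lookup against the linear scan it replaces. The names `scan_for_label`, `build_inverted_index`, and `lookup`, and the use of `BTreeSet<usize>` as a stand-in for a bitset, are assumptions made for this example.

```rust
use std::collections::{BTreeSet, HashMap};

/// Slow path: scan every document's labels for each query value.
fn scan_for_label(docs: &[Vec<&str>], wanted: &str) -> BTreeSet<usize> {
    docs.iter()
        .enumerate()
        .filter(|(_, labels)| labels.contains(&wanted))
        .map(|(id, _)| id)
        .collect()
}

/// Fast path, step 1: build the index once over the base labels.
fn build_inverted_index(docs: &[Vec<&str>]) -> HashMap<String, BTreeSet<usize>> {
    let mut index: HashMap<String, BTreeSet<usize>> = HashMap::new();
    for (id, labels) in docs.iter().enumerate() {
        for label in labels {
            index.entry(label.to_string()).or_default().insert(id);
        }
    }
    index
}

/// Fast path, step 2: answering a query value is a single map lookup.
fn lookup(index: &HashMap<String, BTreeSet<usize>>, wanted: &str) -> BTreeSet<usize> {
    index.get(wanted).cloned().unwrap_or_default()
}
```

The scan costs one pass over the base set per query, while the index is built once and then shared by every query in the set.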
Lower level details
The overall workflow of the main function, `compute_query_bitmaps`, is as follows:
- Check whether the query contains any `ASTExpr::Not` clauses. If so, default to the existing slow path. This is because we don't store the document universe for each label, and thus can't compute the complement of an arbitrary bitset.
- Otherwise, compute a bitset for each `CompareOp` in the clause, and then compose them with AND and OR as needed to produce the final bitset.

We also add a utility to `diskann-label-filter` for computing the specificity of a set of query filters with respect to a base set, outputting some statistics on it, and optionally outputting the individual specificity values to a file for further processing.

Inverted Index
The inverted index maps each label value, converted to a string, to a bitset containing the doc ids corresponding to that value.
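The workflow described under "Lower level details" can be sketched on top of such an index. This is a simplified model under stated assumptions, not the crate's implementation: the `Expr` enum stands in for `ASTExpr` and models only equality leaves, `DocSet` is a `BTreeSet<u32>` rather than a real bitset, and `contains_not` and `eval_fast` are hypothetical names.

```rust
use std::collections::{BTreeSet, HashMap};

/// Toy filter expression (hypothetical stand-in for the crate's AST).
enum Expr {
    Eq(String, String), // field == value, with the value already stringified
    And(Vec<Expr>),
    Or(Vec<Expr>),
    Not(Box<Expr>),
}

type DocSet = BTreeSet<u32>;
/// (field, stringified value) -> ids of documents carrying that value.
type Index = HashMap<(String, String), DocSet>;

/// True if the expression contains a NOT anywhere in its tree.
fn contains_not(expr: &Expr) -> bool {
    match expr {
        Expr::Not(_) => true,
        Expr::Eq(..) => false,
        Expr::And(children) | Expr::Or(children) => children.iter().any(contains_not),
    }
}

/// Evaluate using the index; `None` signals "use the slow path instead".
fn eval_fast(expr: &Expr, index: &Index) -> Option<DocSet> {
    match expr {
        // Without the document universe we cannot take a complement.
        Expr::Not(_) => None,
        Expr::Eq(field, value) => Some(
            index
                .get(&(field.clone(), value.clone()))
                .cloned()
                .unwrap_or_default(),
        ),
        Expr::And(children) => {
            let mut acc: Option<DocSet> = None;
            for child in children {
                let set = eval_fast(child, index)?;
                acc = Some(match acc {
                    None => set,
                    Some(prev) => prev.intersection(&set).copied().collect(),
                });
            }
            // An empty AND yields an empty set in this sketch.
            Some(acc.unwrap_or_default())
        }
        Expr::Or(children) => {
            let mut acc = DocSet::new();
            for child in children {
                acc.extend(eval_fast(child, index)?);
            }
            Some(acc)
        }
    }
}
```

A caller would check `contains_not` once per query (or treat a `None` result the same way) and hand those queries to the existing evaluator.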
B-Tree
For simplicity, the B-tree implementation converts integers to floats before inserting so that we don't have to deal with two different types of B-tree. The performance of this piece of code isn't sensitive enough that it makes sense to differentiate, but this could be changed in the future.
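One consequence of the integer-to-float conversion is worth keeping in mind: `f64` has a 53-bit significand, so `i64` values with magnitude above 2^53 may not survive the conversion exactly. A minimal check, where the helper name `survives_f64_round_trip` is made up for this example:

```rust
/// Returns true if `v` is exactly representable as an f64.
fn survives_f64_round_trip(v: i64) -> bool {
    // Compare in i128 so the comparison itself cannot overflow or saturate.
    (v as f64) as i128 == v as i128
}
```

For instance, `(1 << 53) + 1` converts to the same `f64` as `1 << 53`, so two distinct integer labels in that range would land on the same B-tree key.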
The B-tree maps each value to a vector of ids rather than a bitset, because concatenating vectors is much cheaper than extending bitsets, and a range query may concatenate many vectors.
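A range lookup over such a structure might look like the following sketch. It uses the standard library's `BTreeMap`; the `OrdF64` wrapper, `NumericIndex` alias, and function names are assumptions for illustration and are not taken from the PR.

```rust
use std::cmp::Ordering;
use std::collections::BTreeMap;
use std::ops::Bound;

/// f64 wrapper with a total order so it can serve as a BTreeMap key.
/// Note: `total_cmp` orders -0.0 before 0.0 and gives NaN a fixed position.
#[derive(Clone, Copy, Debug)]
struct OrdF64(f64);

impl PartialEq for OrdF64 {
    fn eq(&self, other: &Self) -> bool {
        self.0.total_cmp(&other.0) == Ordering::Equal
    }
}
impl Eq for OrdF64 {}
impl PartialOrd for OrdF64 {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for OrdF64 {
    fn cmp(&self, other: &Self) -> Ordering {
        self.0.total_cmp(&other.0)
    }
}

/// Numeric value -> ids of documents holding that value.
type NumericIndex = BTreeMap<OrdF64, Vec<u32>>;

fn insert_value(index: &mut NumericIndex, value: f64, doc_id: u32) {
    index.entry(OrdF64(value)).or_default().push(doc_id);
}

/// Collect ids for all keys within the bounds by concatenating their vectors.
fn range_query(index: &NumericIndex, lo: Bound<f64>, hi: Bound<f64>) -> Vec<u32> {
    let wrap = |b: Bound<f64>| match b {
        Bound::Included(v) => Bound::Included(OrdF64(v)),
        Bound::Excluded(v) => Bound::Excluded(OrdF64(v)),
        Bound::Unbounded => Bound::Unbounded,
    };
    let mut ids = Vec::new();
    for (_, docs) in index.range((wrap(lo), wrap(hi))) {
        ids.extend_from_slice(docs);
    }
    ids
}
```

A filter such as `x >= 2.5` maps to `range_query(&index, Bound::Included(2.5), Bound::Unbounded)`, and the concatenated ids can be turned into a bitset once at the end.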
Timings
Returning to the earlier timings: for the 1M slice of yfcc and a 10k query set, computing the query bitmaps now takes 0.6 seconds, down from 43.10 seconds (roughly a 70x speedup). For the 100K slice of the caselaw dataset and a 10k query set, computing the bitmaps now takes 1.728 seconds, down from 6.03 seconds (roughly 3.5x).