Specialize filter for list-like arrays (List/LargeList/FixedSizeList/Map, …) #10236
Jeadie wants to merge 9 commits
…t/Map/…) `FilterPredicate::filter` previously fell back to the generic `MutableArrayData` path for `List`/`LargeList`/`FixedSizeList`/`Map`. This adds specialized kernels that map each retained run of parent rows to a contiguous range of child elements and reuse the already-vectorized per-type child filter kernels, instead of the generic byte-copy fallback. Child handling is selectivity-aware (work is proportional to retained runs and elements, not the full child length) and streams ranges without an intermediate `Vec`: byte children go straight to `FilterBytes`, nested lists recurse, and others use a `Slices` predicate. A child-type allowlist keeps types that can't beat the fallback (dense `Union`, `RunEndEncoded`) on `MutableArrayData`, and a cheap selectivity guard routes dense `Map` filters to the fallback too. Adds benchmarks for the affected types in `arrow/benches/filter_kernels.rs`.
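The run-to-range mapping described above can be sketched in plain Rust. This is a minimal illustration under stated assumptions, not the PR's code: the function names `retained_runs`, `list_child_ranges` and `fixed_size_child_ranges` are hypothetical, plain slices stand in for Arrow buffers, nulls are ignored, and the ranges are collected into a `Vec` only for readability (the PR streams them instead).

```rust
use std::ops::Range;

/// Collapse a boolean mask into half-open runs `start..end` of retained rows.
/// (Hypothetical helper; the real predicate already carries its slices.)
fn retained_runs(mask: &[bool]) -> Vec<Range<usize>> {
    let mut runs = Vec::new();
    let mut i = 0;
    while i < mask.len() {
        if mask[i] {
            let start = i;
            while i < mask.len() && mask[i] {
                i += 1;
            }
            runs.push(start..i);
        } else {
            i += 1;
        }
    }
    runs
}

/// Variable-size list layout: row `i` owns child elements
/// `offsets[i]..offsets[i + 1]`, so a run of parent rows maps to exactly
/// one contiguous child range.
fn list_child_ranges(offsets: &[i32], runs: &[Range<usize>]) -> Vec<Range<usize>> {
    runs.iter()
        .map(|r| {
            let start = offsets[r.start] as usize;
            let end = offsets[r.end] as usize;
            start..end
        })
        .collect()
}

/// Fixed-size list layout: every row owns exactly `size` child elements,
/// so the child range is the parent run scaled by `size`.
fn fixed_size_child_ranges(size: usize, runs: &[Range<usize>]) -> Vec<Range<usize>> {
    runs.iter().map(|r| r.start * size..r.end * size).collect()
}
```

The work in both mapping functions is proportional to the number of retained runs, not to the child length, which is the selectivity-aware property the commit message refers to.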
run benchmark filter_kernel

🤖 Arrow criterion benchmark running (GKE). Comparing jeadie/filter-list-specialization (da21a5a) to 7616e10 (merge-base).

Benchmark for this request failed.

run benchmark filter_kernels

🤖 Arrow criterion benchmark running (GKE). Comparing jeadie/filter-list-specialization (da21a5a) to 7616e10 (merge-base).

🤖 Arrow criterion benchmark completed (GKE).

Can I assume |
From filter_kernels, most expected improvements are net-new benchmarks ( Regressions
Improvements
Re: -14% regression on
| Runner | Run 1 (head/base) | Run 2 (head/base) |
|---|---|---|
| ubuntu-24.04-arm (aarch64) | 1.00 (14.0±0.04 → 13.9±0.04 µs) | 1.02 (14.1±0.04 → 14.5±0.05 µs) |
| ubuntu-latest (x86-64) | 1.01 (14.0±0.09 → 14.1±0.21 µs) | 1.00 (14.3±0.08 → 14.3±0.09 µs) |
…ement) The specialized list/byte-child filter rebuilt offsets with a per-element, loop-carried accumulation (cur_offset += len; push). On x86-64 this lost to the generic fallback's bulk copy at high selectivity (~1.5x slower on 'filter list <byte> high selectivity (kept 1023/1024)'). Within a retained run the source offsets are contiguous and monotonic, so the run's new offsets are the source slice shifted by a constant. Emitting them as a map over the contiguous source slice removes the loop-carried dependency and lets the compiler vectorize. Applied to both the child-byte offset rebuild (FilterBytes::extend_offsets_slices) and the parent list offset rebuild (filter_list_offsets_and_child).
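To make the difference between the two offset rebuilds concrete, here is a small sketch assuming `i32` offsets, non-empty runs given as `(start, end)` row pairs, and plain slices rather than Arrow buffers. The function names are hypothetical and do not correspond to the PR's `FilterBytes::extend_offsets_slices` or `filter_list_offsets_and_child`; whether the second form actually vectorizes depends on the compiler and target.

```rust
/// Per-element rebuild: every pushed offset depends on the previous one
/// through `cur`, which is the loop-carried dependency.
fn rebuild_offsets_scalar(src: &[i32], runs: &[(usize, usize)]) -> Vec<i32> {
    let mut out = vec![0i32];
    let mut cur = 0i32;
    for &(start, end) in runs {
        for w in src[start..=end].windows(2) {
            cur += w[1] - w[0];
            out.push(cur);
        }
    }
    out
}

/// Per-run rebuild: inside a retained run the source offsets are contiguous
/// and monotonic, so the new offsets are the source slice plus a constant.
fn rebuild_offsets_shifted(src: &[i32], runs: &[(usize, usize)]) -> Vec<i32> {
    let mut out = vec![0i32];
    let mut cur = 0i32;
    for &(start, end) in runs {
        // Constant shift that moves this run so it begins at `cur`.
        let shift = cur - src[start];
        // A map over a contiguous slice: no dependency between iterations.
        out.extend(src[start + 1..end + 1].iter().map(|&o| o + shift));
        cur = src[end] + shift;
    }
    out
}
```

Both functions produce the same offsets; only the shape of the inner loop differs. Because each run's shift is `cur - src[start]` and `cur` never decreases, the output stays monotonic as long as the source offsets are.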
After vectorizing the offset rebuild, monotonicity of the byte offset buffers no longer comes from incrementally growing `cur_offset` in the slice path; it derives from source-offset monotonicity (source offsets shifted by a non-decreasing base). Update the two `new_unchecked` SAFETY comments to match.
…atch
Non-byte list children went through filter_list_child's fallback branch: collect
all child ranges into a Vec<(usize,usize)>, wrap in a FilterPredicate{Slices}, and
re-dispatch through filter_array. At ~50% selectivity the retained runs are short,
so that per-range collection + dispatch overhead dominated the (tiny) copy work and
lost to the generic MutableArrayData fallback on x86 (while winning on ARM).
Add filter_list_primitive, a streaming primitive child filter mirroring
filter_list_bytes: copy native values per run straight into the output buffer, no
Vec of ranges, no re-dispatch. Route primitive children to it in filter_list_child.
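The streaming copy described in this commit can be illustrated as follows. This is only a sketch of the idea: `copy_child_runs` is a hypothetical name (not the PR's `filter_list_primitive`), it handles values only, ignores null buffers, and uses a `Vec<T>` where the real kernel writes into an Arrow buffer.

```rust
/// Copy the child values of each retained parent run straight into the
/// output, one bounds-checked slice copy per run, without first collecting
/// the child ranges into an intermediate `Vec`.
fn copy_child_runs<T: Copy>(
    offsets: &[i32],          // parent list offsets, len = rows + 1
    values: &[T],             // flattened primitive child values
    runs: &[(usize, usize)],  // retained parent rows as (start, end) pairs
) -> Vec<T> {
    // Size the output once so the copies never reallocate.
    let total: usize = runs
        .iter()
        .map(|&(s, e)| (offsets[e] - offsets[s]) as usize)
        .sum();
    let mut out = Vec::with_capacity(total);
    for &(start, end) in runs {
        let child_start = offsets[start] as usize;
        let child_end = offsets[end] as usize;
        out.extend_from_slice(&values[child_start..child_end]);
    }
    out
}
```

With short runs (around 50% selectivity) the per-run cost is a slice bounds check plus a small copy, which is the overhead profile the commit message contrasts with collecting ranges and re-dispatching through `filter_array`.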
The offset-rebuild vectorization removed its only caller (the per-element extend_offsets_slices loop); drop the method to satisfy -D dead_code.
The value copy is per-run (bounds-checked once per range, negligible) and now matches the sibling byte path (extend_slices), removing an unnecessary unsafe.
…onstruction Match the pre-existing filter_bytes comment: call out UTF-8 validity preservation and the offsets/nulls length invariant, not just byte-for-byte copying.
Ran some benchmarks on GitHub Linux runners. Changes since the initial PR are to improve performance, especially on some variants that regressed. For the new benchmarks, numbers should be better (note the comment above about not reproducing the 14% regression). Updated results (see run). Improvement multiples are given for filter selectivities of kept 1/2, high selectivity 1023/1024, and low selectivity 1/1024.
Which issue does this PR close?
Rationale for this change
This PR improves the performance of `FilterPredicate::filter` for array-based data types, specifically `List<T>`, `FixedSizeList<T>` and `Map<T>`.
This optimisation is based on one idea: translate retained (i.e. filter = true) parent-row runs into child element ranges (trivially contiguous due to how list/fixed/map layouts work), then hand those ranges to the already-fast child kernels rather than copying element by element.
`filter` is one of the most-executed kernels in Apache DataFusion, and these list/nested types now have a fast path. Several common DataFusion uses are especially impacted:
- `FixedSizeList` as embeddings or vectors
- `array_agg`, `Unnest`
- `GROUP BY` operations

What changes are included in this PR?
- `FilterPredicate::filter` for `DataType::FixedSizeList`, `DataType::Map`, `DataType::List` and `DataType::LargeList` (the latter two are only specialised for certain/most child types).
- `arrow/benches/filter_kernels.rs`.

Changes Explained
Before
Prior to this PR, `List<T>` used the `MutableArrayData` fallback.
Example
`MutableArrayData` walks the full child buffer, copying by range for each retained row.
After
Non-specialised child types
`List<T>` is not specialised for some child types `T` (and similarly for the other array types mentioned). This PR specialises only if the child type `T` has a fast, vectorized kernel that is driven solely by the predicate's `Slices` (it never reads `filter` directly). Everything else uses the well-tuned, correct `MutableArrayData` fallback:
- Dense `Union`
- `RunEndEncoded`: `filter_run_end_array` reads `predicate.filter` directly. The specialization streams ranges via a `Slices` predicate whose `filter` is intentionally empty.
Every other list child is specialized: primitives, boolean, null, `Utf8`/`LargeUtf8`/`Binary`/`LargeBinary`, `Utf8View`/`BinaryView`, `FixedSizeBinary`, `FixedSizeList`, `Dictionary`, `Struct`, sparse `Union`, `ListView`/`LargeListView`, and nested `List`/`LargeList`.

Are these changes tested?
Tests in `arrow-select/src/filter.rs`.

Benchmark results
`size = 65536`; results are given as before → after (speedup), where before is the `MutableArrayData` fallback.
Result tables: `List<T>` by child type; `List<List>`; direct kernels (new): `filter FixedSizeList`, `filter Map`; value-length sweep: `List<Utf8>` @ kept ½.

Regressions / caveats
All sub-1.0 results occur only at the dense `kept 1023/1024` end (rare for selective predicates), plus the nested-list ½ tie:
- `filter FixedSizeList`: 0.96× (memcpy-bound; the fallback is already tight there).
- `List<List>` @ ½: 0.99× (offset-dominated; ties the fallback, then wins 1.09×/2.27× at the other selectivities).
No remaining regression exceeds ~4%. Every `kept 1/1024` (highly selective) case is a 2.2–4.8× win.

Are there any user-facing changes?
N/A.