Specialize `filter` for list-like arrays (List/LargeList/FixedSizeList/Map, …) by Jeadie · Pull Request #10236 · apache/arrow-rs

Jeadie · 2026-06-29T04:48:09Z

Which issue does this PR close?

No issue yet. Happy to open one if changes are wanted.

Rationale for this change

This PR improves the performance of FilterPredicate::filter for list-like data types, specifically List<T>, LargeList<T>, FixedSizeList<T> and Map.

This optimisation is based on one idea: translate retained (i.e. filter = true) runs of parent rows into child element ranges (trivially contiguous because of how the list/fixed-size-list/map layouts work), then hand those ranges to the already-fast child kernels rather than copying element by element.
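As a rough illustration of the first half of that idea, the sketch below collapses a boolean filter into half-open runs of retained rows. It uses only the standard library; `retained_runs` is a hypothetical helper written for this example and is not the PR's code or an arrow-rs API.

```rust
/// Collapse a boolean filter into half-open `(start, end)` runs of kept rows.
/// Hypothetical helper for illustration only.
fn retained_runs(filter: &[bool]) -> Vec<(usize, usize)> {
    let mut runs = Vec::new();
    let mut start: Option<usize> = None;
    for (i, &keep) in filter.iter().enumerate() {
        match (keep, start) {
            // A run begins at the first `true` after a `false` (or at row 0).
            (true, None) => start = Some(i),
            // A run ends at the first `false` after one or more `true`s.
            (false, Some(s)) => {
                runs.push((s, i));
                start = None;
            }
            // Either still inside a run or still outside one: nothing to do.
            _ => {}
        }
    }
    // Close a run that extends to the last row.
    if let Some(s) = start {
        runs.push((s, filter.len()));
    }
    runs
}

fn main() {
    let filter = [true, false, true, true, false];
    println!("{:?}", retained_runs(&filter));
}
```

Adjacent kept rows collapse into a single run, which is what later lets one child copy cover several parent rows.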

filter is one of the most-executed kernels in Apache DataFusion, and these list/nested types now have a fast path. Several common DataFusion uses are especially affected:

FixedSizeList as embeddings or vectors
JSON or nested data (array_agg, Unnest)
GROUP BY operations
Hash/sort-merge joins that filter probe/build columns of list types

What changes are included in this PR?

Specialisation within FilterPredicate::filter for DataType::FixedSizeList, DataType::Map, DataType::List and DataType::LargeList (the latter two are specialised for most, but not all, child types).
Associated benchmarks in arrow/benches/filter_kernels.rs.

Changes Explained

Before

Prior to this PR, List<T> used the MutableArrayData fallback.

Example

filter:       [ T    F    T    T    F  ]
parent rows:  [row0|row1|row2|row3|row4]

child values: [a b c|d e|f g h|i j k|l m n o]

MutableArrayData walks the full child buffer, copying by range for each retained row.

After

filter:   [ T    F    T    T    F  ]
            row0 row1 row2 row3 row4

offsets:  [ 0    3    5    8   11   15 ]

predicate_row_ranges → [(0,1), (2,4)]   ← runs of kept rows

child_ranges:
  run (0,1): offsets[0]..offsets[1] = [0,  3)  ← 1 parent row
  run (2,4): offsets[2]..offsets[4] = [5, 11)  ← 2 parent rows; rows 2+3 merged into ONE range

child values: [a b c|d e|f g h|i j k|l m n o]
                ╰─────╯   ╰───────────╯
                [0, 3)       [5, 11)

Rebuild new offsets from retained row lengths:
  row0: 3-0=3  → new_offsets: [0, 3]
  row2: 8-5=3  → new_offsets: [0, 3, 6]
  row3: 11-8=3 → new_offsets: [0, 3, 6, 9]

output: List([ [a,b,c], [f,g,h], [i,j,k] ])
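The worked example above can be reproduced with a small standard-library sketch. It assumes plain `usize` offsets and a `Vec<char>` child buffer instead of Arrow buffers, and `filter_list` is a hypothetical name for this example; the PR itself hands the child ranges to the per-type child kernels rather than copying slices directly.

```rust
/// Filter a list array given as `offsets` + flat child `values`, using
/// half-open runs of retained parent rows. Returns (new_offsets, new_values).
/// Illustrative only: real Arrow offsets are i32/i64 and children are arrays.
fn filter_list<T: Clone>(
    offsets: &[usize],
    values: &[T],
    runs: &[(usize, usize)],
) -> (Vec<usize>, Vec<T>) {
    let mut new_offsets = vec![0usize];
    let mut new_values = Vec::new();
    for &(start, end) in runs {
        // One contiguous child copy per run, however many rows it spans.
        new_values.extend_from_slice(&values[offsets[start]..offsets[end]]);
        // Offsets are rebuilt row by row from the retained row lengths.
        for row in start..end {
            let len = offsets[row + 1] - offsets[row];
            let last = *new_offsets.last().unwrap();
            new_offsets.push(last + len);
        }
    }
    (new_offsets, new_values)
}

fn main() {
    let values: Vec<char> = "abcdefghijklmno".chars().collect();
    let offsets = [0, 3, 5, 8, 11, 15];
    let runs = [(0, 1), (2, 4)]; // rows 0, 2 and 3 are kept
    let (new_offsets, new_values) = filter_list(&offsets, &values, &runs);
    println!("{:?}", new_offsets);
    println!("{}", new_values.iter().collect::<String>());
}
```

Rows 2 and 3 are copied with a single slice copy of the range [5, 11), while their offsets are still emitted one row at a time.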

Non-specialised child types

List<T> is not specialised for every child type T (the same holds for the other list-like types mentioned above). This PR specialises a child type only if it has a fast, vectorized filter kernel that is driven solely by the predicate's Slices and never reads filter directly. Everything else uses the well-tuned, correct MutableArrayData fallback.

Child type	Why it stays on fallback
dense `Union`	Consecutive rows carry different type-ids and non-contiguous child offsets, so there are no contiguous ranges to copy.
`RunEndEncoded`	Its kernel (`filter_run_end_array`) reads `predicate.filter` directly. The specialization streams ranges via a `Slices` predicate whose `filter` is intentionally empty.
unmeasured exotics	Other child types that have not been benchmarked stay on the fallback by default.

Every other list child is specialized: primitives, boolean, null, Utf8/LargeUtf8/Binary/LargeBinary, Utf8View/BinaryView, FixedSizeBinary, FixedSizeList, Dictionary, Struct, sparse Union, ListView/LargeListView, and nested List/LargeList.
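The routing decision described above amounts to an allowlist check on the child type. The sketch below shows the shape of such a check under loud assumptions: `ChildType` is a hypothetical stand-in enum covering only a few of the types named here (it is not arrow's `DataType`), and `uses_specialized_path` is an invented name.

```rust
/// Hypothetical stand-in for the child data type; NOT the arrow-rs `DataType`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ChildType {
    Int32,
    Utf8,
    Struct,
    SparseUnion,
    DenseUnion,
    RunEndEncoded,
    /// Stands for the "unmeasured exotics" bucket.
    Other,
}

/// Allowlist: only explicitly listed child types take the specialised path.
/// Anything not listed, including types added later, falls back by default.
fn uses_specialized_path(child: ChildType) -> bool {
    matches!(
        child,
        ChildType::Int32 | ChildType::Utf8 | ChildType::Struct | ChildType::SparseUnion
    )
}

fn main() {
    let children = [
        ChildType::Int32,
        ChildType::Utf8,
        ChildType::Struct,
        ChildType::SparseUnion,
        ChildType::DenseUnion,
        ChildType::RunEndEncoded,
        ChildType::Other,
    ];
    for child in children {
        let path = if uses_specialized_path(child) { "specialised" } else { "fallback" };
        println!("{:?}: {}", child, path);
    }
}
```

An allowlist (rather than a denylist) means an unlisted type can never be routed to a kernel that was not measured for it.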

Are these changes tested?

Tests in arrow-select/src/filter.rs.

Benchmark results

>> cargo bench -p arrow --bench filter_kernels \
  --features test_utils \
  -- --baseline after \
  "filter (list|fixedsizelist|map)"

size = 65536
Cells: before → after (speedup), where
- before = MutableArrayData fallback
- after = specialized.
- ⚠ marks a regression (sub-1.0).

`List<T>` by child type

Child	kept ½	kept 1023/1024	kept 1/1024
Int32	433→373 µs (1.16×)	128→93 µs (1.38×)	2.92→1.12 µs (2.60×)
Utf8	627→588 µs (1.07×)	455→471 µs (0.97× ⚠)	3.57→1.44 µs (2.48×)
LargeUtf8	679→593 µs (1.14×)	644→474 µs (1.36×)	3.58→1.44 µs (2.49×)
Binary	623→581 µs (1.07×)	451→469 µs (0.96× ⚠)	3.49→1.44 µs (2.43×)
LargeBinary	656→590 µs (1.11×)	628→470 µs (1.34×)	3.58→1.44 µs (2.48×)
Utf8View	598→315 µs (1.90×)	566→183 µs (3.10×)	3.16→0.94 µs (3.35×)
FixedSizeBinary	464→348 µs (1.33×)	213→118 µs (1.80×)	2.87→1.03 µs (2.79×)
FixedSizeList	540→446 µs (1.21×)	216→134 µs (1.61×)	3.61→1.42 µs (2.54×)
Dictionary	482→370 µs (1.30×)	126→94 µs (1.34×)	3.36→1.37 µs (2.45×)
Struct	519→371 µs (1.40×)	128→93 µs (1.37×)	3.69→1.12 µs (3.28×)
Map	1107→998 µs (1.11×)	780→648 µs (1.20×)	6.32→2.89 µs (2.19×)
Union (sparse)	832→663 µs (1.26×)	324→168 µs (1.93×)	5.41→2.30 µs (2.35×)
ListView	1702→548 µs (3.11×)	2687→128 µs (20.9×)	5.85→1.41 µs (4.13×)
nested `List<List>`	612→620 µs (0.99× ⚠)	320→295 µs (1.09×)	3.93→1.73 µs (2.27×)

Direct kernels (new)

Kernel	kept ½	kept 1023/1024	kept 1/1024
`filter FixedSizeList`	309→197 µs (1.57×)	20.8→21.6 µs (0.96× ⚠)	3.27→0.69 µs (4.76×)
`filter Map`	791→648 µs (1.22×)	286→291 µs (0.98× ⚠)	5.01→1.88 µs (2.66×)

Value-length sweep — `List<Utf8>` @ kept ½

Value length	before → after	speedup
8 B	669→591 µs	1.13×
64 B	1241→1074 µs	1.16×
256 B	4796→3167 µs	1.51×
10× rows (short)	6598→6446 µs	1.02×

Regressions / caveats

All sub-1.0 results occur only at the dense kept 1023/1024 end (rare for selective predicates), plus the nested-list ½ tie:

dense: Utf8 0.97×, Binary 0.96×, filter FixedSizeList 0.96× (memcpy-bound; the fallback is already tight there), and filter Map 0.98×.
nested List<List> @ ½: 0.99× (offset-dominated; ties the fallback, then wins 1.09×/2.27× at the other selectivities).

No remaining regression exceeds ~4%. Every kept 1/1024 (highly selective) case is a 2.2–4.8× win.
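For readers checking the tables, the speedup in each cell is simply before/after, and the size of a regression is the relative increase in time. The sketch below recomputes a few cells from the figures quoted above; the function names are made up for this example.

```rust
/// Speedup as used in the tables: values above 1.0 mean the new path is faster.
fn speedup(before_us: f64, after_us: f64) -> f64 {
    before_us / after_us
}

/// Relative time increase in percent; positive values indicate a regression.
fn slowdown_percent(before_us: f64, after_us: f64) -> f64 {
    (after_us / before_us - 1.0) * 100.0
}

fn main() {
    // (label, before µs, after µs), copied from the tables above.
    let cells = [
        ("List<Int32> kept 1/2", 433.0, 373.0),
        ("List<Utf8> kept 1023/1024", 455.0, 471.0),
        ("List<Binary> kept 1023/1024", 451.0, 469.0),
        ("filter FixedSizeList kept 1023/1024", 20.8, 21.6),
    ];
    for (label, before, after) in cells {
        let s = speedup(before, after);
        let flag = if s < 1.0 { " (regression)" } else { "" };
        println!(
            "{}: {:.2}x, time change {:+.1}%{}",
            label,
            s,
            slowdown_percent(before, after),
            flag
        );
    }
}
```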

Are there any user-facing changes?

N/A.

…t/Map/…)

`FilterPredicate::filter` previously fell back to the generic `MutableArrayData` path for `List`/`LargeList`/`FixedSizeList`/`Map`. This adds specialized kernels that map each retained run of parent rows to a contiguous range of child elements and reuse the already-vectorized per-type child filter kernels, instead of the generic byte-copy fallback.

Child handling is selectivity-aware (work is proportional to retained runs and elements, not the full child length) and streams ranges without an intermediate `Vec`: byte children go straight to `FilterBytes`, nested lists recurse, and others use a `Slices` predicate.

A child-type allowlist keeps types that can't beat the fallback (dense `Union`, `RunEndEncoded`) on `MutableArrayData`, and a cheap selectivity guard routes dense `Map` filters to the fallback too.

Adds benchmarks for the affected types in `arrow/benches/filter_kernels.rs`.
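The "streams ranges without an intermediate `Vec`" point can be pictured with iterator adaptors. This is a minimal standard-library sketch, not the PR's implementation: it assumes `usize` offsets, and `list_child_ranges` and `fixed_size_child_ranges` are hypothetical names. In the PR the ranges feed `FilterBytes` or a `Slices` predicate, which are not modelled here.

```rust
/// Lazily map runs of retained parent rows to child ranges for a
/// variable-size list (offsets-based layout). Nothing is collected.
fn list_child_ranges<'a>(
    offsets: &'a [usize],
    runs: impl Iterator<Item = (usize, usize)> + 'a,
) -> impl Iterator<Item = (usize, usize)> + 'a {
    runs.map(move |(start, end)| (offsets[start], offsets[end]))
}

/// Same idea for a fixed-size list: every row owns exactly `size` children,
/// so the child range is just the row range scaled by `size`.
fn fixed_size_child_ranges(
    size: usize,
    runs: impl Iterator<Item = (usize, usize)>,
) -> impl Iterator<Item = (usize, usize)> {
    runs.map(move |(start, end)| (start * size, end * size))
}

fn main() {
    let offsets = [0usize, 3, 5, 8, 11, 15];
    let runs = [(0usize, 1usize), (2, 4)];

    // The consumer pulls ranges one at a time; work scales with the number
    // of retained runs, not with the length of the child buffer.
    let kept: usize = list_child_ranges(&offsets, runs.iter().copied())
        .map(|(start, end)| end - start)
        .sum();
    println!("retained child elements: {}", kept);

    for (start, end) in fixed_size_child_ranges(4, runs.iter().copied()) {
        println!("fixed-size child range: [{}, {})", start, end);
    }
}
```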

alamb · 2026-07-02T10:57:23Z

run benchmark filter_kernel

adriangbot · 2026-07-02T11:00:22Z

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4864945476-805-lk6b6 6.12.85+ #1 SMP Mon May 11 08:17:35 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)

Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing jeadie/filter-list-specialization (da21a5a) to 7616e10 (merge-base) diff
BENCH_NAME=filter_kernel
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench filter_kernel
BENCH_FILTER=
Results will be posted here when complete


adriangbot · 2026-07-02T11:00:24Z

Benchmark for this request failed.

Last 20 lines of output:


  Downloaded async-stream v0.3.6
  Downloaded cobs v0.3.0
  Downloaded ciborium-ll v0.2.2
  Downloaded ciborium-io v0.2.2
  Downloaded alloca v0.4.0
  Downloaded fnv v1.0.7
  Downloaded darling_macro v0.23.0
  Downloaded const-random v0.1.18
  Downloaded clap_lex v1.1.0
  Downloaded cfg-if v1.0.4
  Downloaded derive_arbitrary v1.4.2
  Downloaded cfg_aliases v0.2.1
  Downloaded crc v3.4.0
  Downloaded form_urlencoded v1.2.2
  Downloaded crypto-common v0.1.7
  Downloaded crunchy v0.2.4
    Blocking waiting for file lock on package cache
error: no bench target named `filter_kernel` in default-run packages

help: a target with a similar name exists: `filter_kernels`


Jefffrey · 2026-07-02T12:44:47Z

run benchmark filter_kernels

adriangbot · 2026-07-02T12:48:14Z

🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4865815942-808-4kzlp 6.12.85+ #1 SMP Mon May 11 08:17:35 UTC 2026 aarch64 GNU/Linux


Comparing jeadie/filter-list-specialization (da21a5a) to 7616e10 (merge-base) diff
BENCH_NAME=filter_kernels
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench filter_kernels
BENCH_FILTER=
Results will be posted here when complete


adriangbot · 2026-07-02T13:21:01Z

🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)


Details

group                                                                         jeadie_filter-list-specialization      main
-----                                                                         ---------------------------------      ----
filter context decimal128 (kept 1/2)                                          1.00     20.5±0.09µs        ? ?/sec    1.01     20.7±0.18µs        ? ?/sec
filter context decimal128 high selectivity (kept 1023/1024)                   1.02     19.2±0.18µs        ? ?/sec    1.00     18.8±0.20µs        ? ?/sec
filter context decimal128 low selectivity (kept 1/1024)                       1.00    146.1±0.74ns        ? ?/sec    1.01    147.2±0.61ns        ? ?/sec
filter context f32 (kept 1/2)                                                 1.00     83.2±5.49µs        ? ?/sec    1.02     84.9±6.38µs        ? ?/sec
filter context f32 high selectivity (kept 1023/1024)                          1.00      5.5±0.01µs        ? ?/sec    1.00      5.5±0.01µs        ? ?/sec
filter context f32 low selectivity (kept 1/1024)                              1.00   320.9±11.00ns        ? ?/sec    1.04   333.8±14.63ns        ? ?/sec
filter context fsb with value length 20 (kept 1/2)                            1.00     71.6±5.34µs        ? ?/sec    1.18    84.3±29.81µs        ? ?/sec
filter context fsb with value length 20 high selectivity (kept 1023/1024)     1.00     71.6±5.44µs        ? ?/sec    1.18    84.4±29.80µs        ? ?/sec
filter context fsb with value length 20 low selectivity (kept 1/1024)         1.00     71.5±5.32µs        ? ?/sec    1.18    84.3±29.87µs        ? ?/sec
filter context fsb with value length 5 (kept 1/2)                             1.00     71.7±5.41µs        ? ?/sec    1.18    84.4±29.81µs        ? ?/sec
filter context fsb with value length 5 high selectivity (kept 1023/1024)      1.00     71.6±5.48µs        ? ?/sec    1.18    84.4±29.80µs        ? ?/sec
filter context fsb with value length 5 low selectivity (kept 1/1024)          1.00     71.6±5.43µs        ? ?/sec    1.18    84.3±29.88µs        ? ?/sec
filter context fsb with value length 50 (kept 1/2)                            1.00     71.6±5.34µs        ? ?/sec    1.18    84.3±29.82µs        ? ?/sec
filter context fsb with value length 50 high selectivity (kept 1023/1024)     1.00     71.6±5.45µs        ? ?/sec    1.18    84.4±29.79µs        ? ?/sec
filter context fsb with value length 50 low selectivity (kept 1/1024)         1.00     71.5±5.34µs        ? ?/sec    1.18    84.3±29.88µs        ? ?/sec
filter context i32 (kept 1/2)                                                 1.14     14.3±0.09µs        ? ?/sec    1.00     12.5±0.01µs        ? ?/sec
filter context i32 high selectivity (kept 1023/1024)                          1.00      3.7±0.00µs        ? ?/sec    1.00      3.7±0.01µs        ? ?/sec
filter context i32 low selectivity (kept 1/1024)                              1.00    140.0±1.14ns        ? ?/sec    1.06    148.7±1.66ns        ? ?/sec
filter context i32 w NULLs (kept 1/2)                                         1.00     84.0±5.54µs        ? ?/sec    1.02     85.8±6.49µs        ? ?/sec
filter context i32 w NULLs high selectivity (kept 1023/1024)                  1.00      5.5±0.01µs        ? ?/sec    1.00      5.5±0.01µs        ? ?/sec
filter context i32 w NULLs low selectivity (kept 1/1024)                      1.00   328.4±11.29ns        ? ?/sec    1.01   330.2±13.40ns        ? ?/sec
filter context mixed string view (kept 1/2)                                   1.00     92.0±5.31µs        ? ?/sec    1.00     92.0±5.53µs        ? ?/sec
filter context mixed string view high selectivity (kept 1023/1024)            1.00     21.0±0.19µs        ? ?/sec    1.01     21.3±0.17µs        ? ?/sec
filter context mixed string view low selectivity (kept 1/1024)                1.01   417.1±15.17ns        ? ?/sec    1.00   413.2±15.71ns        ? ?/sec
filter context short string view (kept 1/2)                                   1.00     91.4±5.47µs        ? ?/sec    1.00     91.5±5.48µs        ? ?/sec
filter context short string view high selectivity (kept 1023/1024)            1.06     21.2±0.11µs        ? ?/sec    1.00     20.0±0.28µs        ? ?/sec
filter context short string view low selectivity (kept 1/1024)                1.02    356.2±9.77ns        ? ?/sec    1.00   350.6±11.41ns        ? ?/sec
filter context string (kept 1/2)                                              1.00    423.8±6.51µs        ? ?/sec    1.00    421.9±5.58µs        ? ?/sec
filter context string dictionary (kept 1/2)                                   1.00     12.7±0.02µs        ? ?/sec    1.00     12.7±0.02µs        ? ?/sec
filter context string dictionary high selectivity (kept 1023/1024)            1.02      4.2±0.00µs        ? ?/sec    1.00      4.1±0.00µs        ? ?/sec
filter context string dictionary low selectivity (kept 1/1024)                1.00    509.5±1.93ns        ? ?/sec    1.01    513.9±3.43ns        ? ?/sec
filter context string dictionary w NULLs (kept 1/2)                           1.00     85.5±5.43µs        ? ?/sec    1.03     87.8±6.44µs        ? ?/sec
filter context string dictionary w NULLs high selectivity (kept 1023/1024)    1.02      6.0±0.01µs        ? ?/sec    1.00      5.9±0.01µs        ? ?/sec
filter context string dictionary w NULLs low selectivity (kept 1/1024)        1.00   712.8±10.26ns        ? ?/sec    1.00   710.8±14.35ns        ? ?/sec
filter context string high selectivity (kept 1023/1024)                       1.00    317.9±5.65µs        ? ?/sec    1.01    320.3±2.13µs        ? ?/sec
filter context string low selectivity (kept 1/1024)                           1.00   707.8±12.44ns        ? ?/sec    1.07   757.4±10.00ns        ? ?/sec
filter context u8 (kept 1/2)                                                  1.00     12.1±0.02µs        ? ?/sec    1.00     12.1±0.01µs        ? ?/sec
filter context u8 high selectivity (kept 1023/1024)                           1.01   1066.2±3.78ns        ? ?/sec    1.00   1052.1±2.98ns        ? ?/sec
filter context u8 low selectivity (kept 1/1024)                               1.01    129.2±0.95ns        ? ?/sec    1.00    127.8±0.72ns        ? ?/sec
filter context u8 w NULLs (kept 1/2)                                          1.02     85.5±6.41µs        ? ?/sec    1.00     83.4±5.22µs        ? ?/sec
filter context u8 w NULLs high selectivity (kept 1023/1024)                   1.03      2.9±0.01µs        ? ?/sec    1.00      2.8±0.01µs        ? ?/sec
filter context u8 w NULLs low selectivity (kept 1/1024)                       1.00   313.0±13.17ns        ? ?/sec    1.00   312.1±10.03ns        ? ?/sec
filter decimal128 (kept 1/2)                                                  1.01     35.4±0.08µs        ? ?/sec    1.00     35.1±0.08µs        ? ?/sec
filter decimal128 high selectivity (kept 1023/1024)                           1.00     19.1±0.08µs        ? ?/sec    1.01     19.2±0.13µs        ? ?/sec
filter decimal128 low selectivity (kept 1/1024)                               1.00   1548.9±1.24ns        ? ?/sec    1.00   1541.7±1.55ns        ? ?/sec
filter f32 (kept 1/2)                                                         1.00    104.6±0.45µs        ? ?/sec    1.03    108.2±0.44µs        ? ?/sec
filter fixedsizelist (kept 1/2)                                               1.00    200.8±3.33µs        ? ?/sec  
filter fixedsizelist high selectivity (kept 1023/1024)                        1.00     19.8±0.12µs        ? ?/sec  
filter fixedsizelist low selectivity (kept 1/1024)                            1.00    760.5±4.78ns        ? ?/sec  
filter fsb with value length 20 (kept 1/2)                                    1.00     79.9±0.09µs        ? ?/sec    1.00     79.9±0.12µs        ? ?/sec
filter fsb with value length 20 high selectivity (kept 1023/1024)             1.00     24.3±0.47µs        ? ?/sec    1.07     26.0±1.38µs        ? ?/sec
filter fsb with value length 20 low selectivity (kept 1/1024)                 1.02   1664.1±9.02ns        ? ?/sec    1.00   1623.6±1.98ns        ? ?/sec
filter fsb with value length 5 (kept 1/2)                                     1.00     79.6±0.04µs        ? ?/sec    1.00     79.4±0.07µs        ? ?/sec
filter fsb with value length 5 high selectivity (kept 1023/1024)              1.00      5.9±0.05µs        ? ?/sec    1.10      6.5±1.14µs        ? ?/sec
filter fsb with value length 5 low selectivity (kept 1/1024)                  1.04   1619.4±3.11ns        ? ?/sec    1.00   1560.3±7.96ns        ? ?/sec
filter fsb with value length 50 (kept 1/2)                                    1.00    120.6±0.51µs        ? ?/sec    1.00    120.1±0.84µs        ? ?/sec
filter fsb with value length 50 high selectivity (kept 1023/1024)             1.00     87.6±6.60µs        ? ?/sec    1.01     88.6±5.44µs        ? ?/sec
filter fsb with value length 50 low selectivity (kept 1/1024)                 1.01   1633.3±1.22ns        ? ?/sec    1.00   1611.7±2.15ns        ? ?/sec
filter i32 (kept 1/2)                                                         1.00     29.4±0.03µs        ? ?/sec    1.00     29.3±0.03µs        ? ?/sec
filter i32 high selectivity (kept 1023/1024)                                  1.00      4.8±0.06µs        ? ?/sec    1.00      4.9±0.06µs        ? ?/sec
filter i32 low selectivity (kept 1/1024)                                      1.00   1496.7±0.90ns        ? ?/sec    1.00   1496.5±3.54ns        ? ?/sec
filter list binary (kept 1/2)                                                 1.00    722.1±9.73µs        ? ?/sec  
filter list binary high selectivity (kept 1023/1024)                          1.00    791.9±0.72µs        ? ?/sec  
filter list binary low selectivity (kept 1/1024)                              1.00  1984.5±14.12ns        ? ?/sec  
filter list dict (kept 1/2)                                                   1.00   317.3±10.22µs        ? ?/sec  
filter list dict high selectivity (kept 1023/1024)                            1.00     96.3±0.27µs        ? ?/sec  
filter list dict low selectivity (kept 1/1024)                                1.00  1543.2±10.04ns        ? ?/sec  
filter list fixedsizebinary (kept 1/2)                                        1.00   376.5±10.27µs        ? ?/sec  
filter list fixedsizebinary high selectivity (kept 1023/1024)                 1.00    128.5±0.56µs        ? ?/sec  
filter list fixedsizebinary low selectivity (kept 1/1024)                     1.00   1200.3±9.36ns        ? ?/sec  
filter list fixedsizelist (kept 1/2)                                          1.00    502.3±9.33µs        ? ?/sec  
filter list fixedsizelist high selectivity (kept 1023/1024)                   1.00    138.4±1.42µs        ? ?/sec  
filter list fixedsizelist low selectivity (kept 1/1024)                       1.00   1704.8±6.68ns        ? ?/sec  
filter list i32 (kept 1/2)                                                    1.00   315.9±10.03µs        ? ?/sec  
filter list i32 high selectivity (kept 1023/1024)                             1.00     94.9±0.25µs        ? ?/sec  
filter list i32 low selectivity (kept 1/1024)                                 1.00  1058.3±10.56ns        ? ?/sec  
filter list largebinary (kept 1/2)                                            1.00    734.4±9.89µs        ? ?/sec  
filter list largebinary high selectivity (kept 1023/1024)                     1.00    800.2±4.87µs        ? ?/sec  
filter list largebinary low selectivity (kept 1/1024)                         1.00      2.0±0.03µs        ? ?/sec  
filter list largeutf8 (kept 1/2)                                              1.00    732.2±9.07µs        ? ?/sec  
filter list largeutf8 high selectivity (kept 1023/1024)                       1.00    805.6±5.18µs        ? ?/sec  
filter list largeutf8 low selectivity (kept 1/1024)                           1.00      2.0±0.04µs        ? ?/sec  
filter list listview (kept 1/2)                                               1.00   381.0±10.84µs        ? ?/sec  
filter list listview high selectivity (kept 1023/1024)                        1.00    134.5±4.34µs        ? ?/sec  
filter list listview low selectivity (kept 1/1024)                            1.00  1298.2±13.15ns        ? ?/sec  
filter list map (kept 1/2)                                                    1.00  1154.9±11.60µs        ? ?/sec  
filter list map high selectivity (kept 1023/1024)                             1.00   1478.6±3.38µs        ? ?/sec  
filter list map low selectivity (kept 1/1024)                                 1.00      3.5±0.01µs        ? ?/sec  
filter list nested (kept 1/2)                                                 1.00   653.8±10.15µs        ? ?/sec  
filter list nested high selectivity (kept 1023/1024)                          1.00    346.5±1.89µs        ? ?/sec  
filter list nested low selectivity (kept 1/1024)                              1.00  1905.9±12.79ns        ? ?/sec  
filter list struct (kept 1/2)                                                 1.00   318.3±10.80µs        ? ?/sec  
filter list struct high selectivity (kept 1023/1024)                          1.00     97.2±5.04µs        ? ?/sec  
filter list struct low selectivity (kept 1/1024)                              1.00  1239.0±11.76ns        ? ?/sec  
filter list union (kept 1/2)                                                  1.00    475.3±9.87µs        ? ?/sec  
filter list union high selectivity (kept 1023/1024)                           1.00    185.7±0.61µs        ? ?/sec  
filter list union low selectivity (kept 1/1024)                               1.00      2.3±0.02µs        ? ?/sec  
filter list utf8 (kept 1/2)                                                   1.00    722.8±8.53µs        ? ?/sec  
filter list utf8 10xrows (kept 1/2)                                           1.00      7.9±0.14ms        ? ?/sec  
filter list utf8 high selectivity (kept 1023/1024)                            1.00    791.1±0.90µs        ? ?/sec  
filter list utf8 len256 (kept 1/2)                                            1.00     14.5±0.15ms        ? ?/sec  
filter list utf8 len64 (kept 1/2)                                             1.00  1285.5±21.41µs        ? ?/sec  
filter list utf8 len8 (kept 1/2)                                              1.00    726.6±9.37µs        ? ?/sec  
filter list utf8 low selectivity (kept 1/1024)                                1.00   1900.9±8.24ns        ? ?/sec  
filter list utf8view (kept 1/2)                                               1.00   451.1±12.24µs        ? ?/sec  
filter list utf8view high selectivity (kept 1023/1024)                        1.00    215.1±1.54µs        ? ?/sec  
filter list utf8view low selectivity (kept 1/1024)                            1.00  1092.2±13.03ns        ? ?/sec  
filter map (kept 1/2)                                                         1.00    597.9±9.36µs        ? ?/sec  
filter map high selectivity (kept 1023/1024)                                  1.00    607.5±0.54µs        ? ?/sec  
filter map low selectivity (kept 1/1024)                                      1.00      2.0±0.01µs        ? ?/sec  
filter optimize (kept 1/2)                                                    1.06     29.5±0.07µs        ? ?/sec    1.00     27.7±0.11µs        ? ?/sec
filter optimize high selectivity (kept 1023/1024)                             1.04   1383.7±3.30ns        ? ?/sec    1.00   1336.3±0.68ns        ? ?/sec
filter optimize low selectivity (kept 1/1024)                                 1.00   1321.0±0.73ns        ? ?/sec    1.00   1316.0±0.73ns        ? ?/sec
filter run array (kept 1/2)                                                   1.00    280.5±2.78µs        ? ?/sec    1.05    295.6±1.75µs        ? ?/sec
filter run array high selectivity (kept 1023/1024)                            1.00    280.7±5.38µs        ? ?/sec    1.02    287.0±4.72µs        ? ?/sec
filter run array low selectivity (kept 1/1024)                                1.00    230.8±1.07µs        ? ?/sec    1.02    236.3±1.05µs        ? ?/sec
filter single record batch                                                    1.00     29.2±0.07µs        ? ?/sec    1.01     29.4±0.07µs        ? ?/sec
filter u8 (kept 1/2)                                                          1.00     29.5±0.15µs        ? ?/sec    1.04     30.6±0.02µs        ? ?/sec
filter u8 high selectivity (kept 1023/1024)                                   1.01      2.2±0.04µs        ? ?/sec    1.00      2.2±0.04µs        ? ?/sec
filter u8 low selectivity (kept 1/1024)                                       1.01  1478.3±32.66ns        ? ?/sec    1.00   1457.6±2.21ns        ? ?/sec
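
The ratio columns in the table above appear to be each timing divided by the faster of the two timings in the row, so the faster side reads 1.00. A minimal sketch of that arithmetic follows. The function name `relative_to_fastest` is hypothetical and not part of the benchmark runner, and both inputs must already be in the same unit.

```rust
/// Hypothetical helper: express two timings (same unit) as ratios to the
/// faster of the pair, so the faster side is exactly 1.0.
fn relative_to_fastest(a: f64, b: f64) -> (f64, f64) {
    let fastest = a.min(b);
    (a / fastest, b / fastest)
}

/// Format a ratio with two decimals, as it is shown in the table.
fn show(ratio: f64) -> String {
    format!("{:.2}", ratio)
}
```

For the `filter optimize (kept 1/2)` row, `relative_to_fastest(29.5, 27.7)` gives roughly 1.065 and 1.0, which `show` renders as `1.06` and `1.00`.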

Resource Usage

base (merge-base)

Metric	Value
Wall time	695.2s
Peak memory	29.0 MiB
Avg memory	17.0 MiB
CPU user	691.3s
CPU sys	0.1s
Peak spill	0 B

branch

Metric	Value
Wall time	1210.3s
Peak memory	207.5 MiB
Avg memory	46.4 MiB
CPU user	1201.6s
CPU sys	6.9s
Peak spill	0 B

Jeadie · 2026-07-02T22:02:23Z

Can I assume the failing `run benchmark filter_kernel` was just a typo for the successful `run benchmark filter_kernels`?

Jeadie · 2026-07-02T22:26:25Z

In `filter_kernels`, most of the expected improvements are in net-new benchmarks (`filter list *`). The tables below cover only the benchmarks that already existed and can be compared.

Regressions

Benchmark	Relative
filter context i32 (kept 1/2)	+14%
filter context short string view high selectivity (kept 1023/1024)	+6%
filter optimize (kept 1/2)	+6%
filter optimize high selectivity (kept 1023/1024)	+4%
filter fsb with value length 5 low selectivity (kept 1/1024)	+4%
filter context u8 w NULLs high selectivity (kept 1023/1024)	+3%
filter context u8 w NULLs (kept 1/2)	+2%
filter context string dictionary high selectivity (kept 1023/1024)	+2%
filter context string dictionary w NULLs high selectivity (kept 1023/1024)	+2%
filter fsb with value length 20 low selectivity (kept 1/1024)	+2%
filter decimal128 (kept 1/2)	+1%
filter fsb with value length 50 low selectivity (kept 1/1024)	+1%
filter u8 high selectivity (kept 1023/1024)	+1%
filter u8 low selectivity (kept 1/1024)	+1%

Improvements

Benchmark	Improvement
filter context fsb with value length 20 (kept 1/2)	18%
filter context fsb with value length 20 high selectivity	18%
filter context fsb with value length 20 low selectivity	18%
filter context fsb with value length 5 (kept 1/2)	18%
filter context fsb with value length 5 high selectivity	18%
filter context fsb with value length 5 low selectivity	18%
filter context fsb with value length 50 (kept 1/2)	18%
filter context fsb with value length 50 high selectivity	18%
filter context fsb with value length 50 low selectivity	18%
filter f32 (kept 1/2)	3%
filter run array (kept 1/2)	5%
filter run array high selectivity	2%
filter run array low selectivity	2%
filter u8 (kept 1/2)	4%

Jeadie · 2026-07-03T00:49:28Z

Re: -14% regression on `filter context i32 (kept 1/2)`

I am unsure about the cause of this regression, for two reasons:

The i32 path is unchanged.
I could not reproduce it on similar Linux CI runners (see below).

Runner	Run 1 (head/base)	Run 2 (head/base)
`ubuntu-24.04-arm` (aarch64)	1.00 — 14.0±0.04 → 13.9±0.04µs	1.02 — 14.1±0.04 → 14.5±0.05µs
`ubuntu-latest` (x86-64)	1.01 — 14.0±0.09 → 14.1±0.21µs	1.00 — 14.3±0.08 → 14.3±0.09µs

…ement) The specialized list/byte-child filter rebuilt offsets with a per-element, loop-carried accumulation (cur_offset += len; push). On x86-64 this lost to the generic fallback's bulk copy at high selectivity (~1.5x slower on 'filter list <byte> high selectivity (kept 1023/1024)'). Within a retained run the source offsets are contiguous and monotonic, so the run's new offsets are the source slice shifted by a constant. Emitting them as a map over the contiguous source slice removes the loop-carried dependency and lets the compiler vectorize. Applied to both the child-byte offset rebuild (FilterBytes::extend_offsets_slices) and the parent list offset rebuild (filter_list_offsets_and_child).
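
The offset rebuild described in this commit message can be sketched as follows. This is an illustration only, not the PR's code: the function names, the plain `i32` slices, and the half-open `(start, end)` row runs are assumptions, and null handling is left out. The first function carries `cur` from one iteration to the next. The second shifts each run's source offsets by a per-run constant, so no iteration depends on the previous one.

```rust
/// Loop-carried version: every output offset depends on the previous one.
/// `src` is the source offsets buffer (rows + 1 entries); `runs` holds
/// half-open ranges of retained parent rows. Names are hypothetical.
fn rebuild_offsets_scalar(src: &[i32], runs: &[(usize, usize)]) -> Vec<i32> {
    let mut out = vec![0i32];
    let mut cur = 0i32;
    for &(start, end) in runs {
        for row in start..end {
            cur += src[row + 1] - src[row]; // length of this row
            out.push(cur);
        }
    }
    out
}

/// Shifted-slice version: within a run the new offsets are the source
/// offsets plus a constant, so the inner work is a plain map over a slice.
fn rebuild_offsets(src: &[i32], runs: &[(usize, usize)]) -> Vec<i32> {
    let kept_rows: usize = runs.iter().map(|&(s, e)| e - s).sum();
    let mut out = Vec::with_capacity(kept_rows + 1);
    out.push(0i32);
    let mut base = 0i32; // output offset at which the next run starts
    for &(start, end) in runs {
        let shift = base - src[start]; // constant for the whole run
        out.extend(src[start + 1..=end].iter().map(|&o| o + shift));
        base = src[end] + shift;
    }
    out
}
```

With the offsets `[0, 3, 5, 8, 11, 15]` and the runs `[(0, 1), (2, 4)]` from the PR description, both functions return `[0, 3, 6, 9]`. The result stays monotonic because each run is a monotonic source slice shifted onto a base that never decreases.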

After vectorizing the offset rebuild, monotonicity of the byte offset buffers no longer comes from incrementally growing `cur_offset` in the slice path; it derives from source-offset monotonicity (source offsets shifted by a non-decreasing base). Update the two `new_unchecked` SAFETY comments to match.

…atch Non-byte list children went through filter_list_child's fallback branch: collect all child ranges into a Vec<(usize,usize)>, wrap in a FilterPredicate{Slices}, and re-dispatch through filter_array. At ~50% selectivity the retained runs are short, so that per-range collection + dispatch overhead dominated the (tiny) copy work and lost to the generic MutableArrayData fallback on x86 (while winning on ARM). Add filter_list_primitive, a streaming primitive child filter mirroring filter_list_bytes: copy native values per run straight into the output buffer, no Vec of ranges, no re-dispatch. Route primitive children to it in filter_list_child.
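
A minimal sketch of the streaming idea, under simplifying assumptions: plain slices stand in for Arrow buffers, nulls are ignored, and `filter_list_values_sketch` is a hypothetical name rather than the PR's `filter_list_primitive`. Each retained run of parent rows maps to one contiguous child range, which is copied straight into the output without first collecting a `Vec` of ranges.

```rust
/// Hypothetical sketch: filter a list-like column given its offsets, its
/// flat child values, and half-open runs of retained parent rows.
/// Returns the rebuilt offsets and the retained child values.
fn filter_list_values_sketch<T: Copy>(
    offsets: &[i32],
    values: &[T],
    runs: &[(usize, usize)],
) -> (Vec<i32>, Vec<T>) {
    let mut out_offsets = vec![0i32];
    let mut out_values = Vec::new();
    for &(start, end) in runs {
        // Child range covered by the whole run of parent rows.
        let lo = offsets[start] as usize;
        let hi = offsets[end] as usize;
        // One bounds-checked slice copy per run, not per element.
        out_values.extend_from_slice(&values[lo..hi]);
        // New offsets are the run's source offsets shifted by a constant.
        let shift = *out_offsets.last().expect("starts with 0") - offsets[start];
        out_offsets.extend(offsets[start + 1..=end].iter().map(|&o| o + shift));
    }
    (out_offsets, out_values)
}
```

Using the example from the PR description (offsets `[0, 3, 5, 8, 11, 15]`, child values `a` through `o`, runs `[(0, 1), (2, 4)]`), the sketch copies the child ranges `[0, 3)` and `[5, 11)` and yields the values `abcfghijk` with offsets `[0, 3, 6, 9]`.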

Jeadie · 2026-07-03T09:48:24Z

Ran some benchmarks on GitHub Linux runners. The changes since the initial PR are aimed at improving performance, especially on some variants that regressed. For the new benchmarks, the numbers should be better (note the comment above about not reproducing the 14% regression).

Updated results (see run). Each cell lists the improvement multiple at three filter selectivities, in this order: kept 1/2, high selectivity (kept 1023/1024), low selectivity (kept 1/1024).

Arrow type	ARM64	x86-64
list i32	1.42× · 2.48× · 3.36×	1.47× · 1.94× · 3.79×
list utf8	1.19× · 5.83× · 2.71×	1.23× · 4.38× · 2.72×
list binary	1.19× · 5.84× · 2.55×	1.21× · 4.29× · 2.89×
list largeutf8	1.24× · 4.72× · 2.68×	1.26× · 3.41× · 3.18×
list largebinary	1.24× · 4.75× · 2.60×	1.06× · 3.68× · 2.91×
list utf8view	1.42× · 3.33× · 3.46×	1.28× · 2.89× · 2.91×
list dict	1.67× · 2.93× · 3.39×	1.40× · 1.92× · 3.16×
list struct	1.59× · 2.92× · 3.32×	1.13× · 1.74× · 2.96×
list union	1.58× · 2.66× · 2.53×	1.18× · 2.45× · 2.40×
list nested	1.06× · 2.76× · 2.53×	1.16× · 1.86× · 2.69×
list fixedsizebinary	1.21× · 2.73× · 2.94×	1.00× · 2.15× · 2.57×
list listview	4.98× · 29.18× · 5.27×	3.53× · 29.85× · 4.75×
list map	1.42× · 1.22× · 2.58×	1.01× slower · 1.29× · 2.22×
list fixedsizelist	1.28× · 2.38× · 2.77×	1.01× slower · 2.03× · 2.58×
fixedsizelist (top)	1.63× · 1.06× · 4.93×	1.62× · 1.08× slower · 4.87×
map (top)	1.54× · 1.00× · 3.13×	1.25× · 1.12× · 3.15×

github-actions bot added the `arrow` (Changes to the arrow crate) label Jun 29, 2026

Merge branch 'main' into jeadie/filter-list-specialization

da21a5a

Jeadie and others added 7 commits July 3, 2026 13:42

Merge branch 'main' into jeadie/filter-list-specialization

1c6c1f0

fix(filter): remove now-dead FilterBytes::get_value_range

0ef5385

The offset-rebuild vectorization removed its only caller (the per-element extend_offsets_slices loop); drop the method to satisfy -D dead_code.

refactor(filter): use safe slice indexing in filter_list_primitive

e25c0d6

The value copy is per-run (bounds-checked once per range, negligible) and now matches the sibling byte path (extend_slices), removing an unnecessary unsafe.

docs(filter): expand SAFETY comment on filter_list_bytes byte-array c…

6cc97b6

…onstruction Match the pre-existing filter_bytes comment: call out UTF-8 validity preservation and the offsets/nulls length invariant, not just byte-for-byte copying.

Jeadie wants to merge 9 commits into apache:main from Jeadie:jeadie/filter-list-specialization