Optimize parquet row filter auto strategy with adaptive fallback by hhhizzz · Pull Request #9956 · apache/arrow-rs

hhhizzz · 2026-05-10T10:36:35Z

Which issue does this PR close?

Closes [Parquet] Adaptively to pick between RowSelection and Mask filter representation #8846
Part of [EPIC] Faster performance for parquet predicate evaluation for non selective filters #7456
Addresses Filter pushdown selectivity threshold #9591
Related to Parquet decoding/pushdown performance improvements #9589
Follow-up to [Parquet]Performance Degradation with RowFilter on Unsorted Columns due to Fragmented ReadPlan #8565

Rationale for this change

RowFilter can be much slower than a full scan when predicate pushdown produces a highly fragmented RowSelection. In that shape, the reader spends substantial time repeatedly skipping and decoding tiny row runs. #8565 showed an extreme case where row-filter pushdown was around 10x slower than scanning and filtering afterwards.

This PR makes RowSelectionPolicy::Auto more cost-aware. Instead of treating predicate pushdown as always beneficial once planned, Auto now chooses among:

selector-backed pushdown when it can skip useful work;
mask-backed execution when fragmented selections are better represented as a dense bitmap;
adaptive post-filter execution when pushdown is unlikely to save enough decoding work.

This is not intended to disable predicate evaluation. It changes where the predicate is evaluated when the observed row-selection shape suggests that pushdown overhead is likely to dominate.

This PR also fixes a correctness issue in the explicit Mask path. With sparse page-loaded ranges, a mask-backed read plan could previously try to consume selected rows outside the pages that were actually loaded, causing decoding failures. Loaded row ranges are now represented explicitly, and explicit Mask can safely execute over sparse page-loaded data.

What changes are included in this PR?

Auto strategy and cost model

Adds structured row-selection shape analysis for Auto.
Adds CostModelObservation / decision reasons so the reader can explain why it chose pushdown or post-filter execution.
Adds an adaptive post-filter cost model for row groups:
- observe early pushdown output shape;
- keep pushdown for sparse / low-selectivity cases where it still saves output decoding;
- switch later row groups to post-filter execution for high-selectivity or fragmented moderate/high-selectivity cases.
Starts directly with post-filter execution for selected cheap cases where predicate columns are already projected and pushdown cannot avoid decoding them.

Mask / selector planning

Splits row-selection strategy resolution into a dedicated planning layer.
Keeps explicit Mask and Selectors behavior intact.
Makes Auto conservative around sparse page-loaded ranges.
Adds sparse loaded-range tracking so explicit Mask no longer assumes all page data is dense.
Replaces loaded-range intersection with a linear two-pointer merge.

Post-filter execution

Adds a post-filter reader path that decodes the required output and predicate columns once, evaluates the RowFilter, and then projects back to the caller-requested output columns.
Handles nested projection conservatively by requiring whole-root batch projection support.
Avoids reusing sparse predicate chunks when rebuilding a base/full read.
Avoids evaluating the same row-group predicate twice when caller-provided row selection is present.
Disables adaptive post-filter execution for try_next_reader handoff paths.

Benchmarks and tests

Extends arrow_reader_row_filter benchmarks with strategy-sensitive cases.
Adds focused coverage for:
- sparse loaded page ranges;
- explicit Mask correctness;
- Auto strategy decisions;
- adaptive post-filter decisions;
- caller-provided row selections;
- predicate cache behavior;
- async reader snapshots.

Are these changes tested?

Yes.

Unit / integration validation:

cargo fmt -p parquet -- --check
git diff --check
cargo test -p parquet --lib arrow::push_decoder
cargo test -p parquet --lib arrow::arrow_reader::read_plan
cargo test -p parquet --lib arrow::arrow_reader::selection
cargo test -p parquet --lib
cargo test -p parquet --test arrow_reader --all-features
cargo bench -p parquet --bench arrow_reader_row_filter --features arrow,async --no-run
cargo clippy -p parquet --all-targets --all-features -- -D warnings
cargo +nightly doc --document-private-items --no-deps --workspace --all-features

Benchmark evidence:

Lower current/main is better.

`arrow_reader_row_filter`

Summary across the arrow_reader_row_filter benchmark cases:

group	current/main	delta
all	0.9685x	-3.15%
async	0.9353x	-6.47%
sync	1.0028x	+0.28%

The improvement is concentrated in async row-filter cases where fragmented pushdown previously paid extra planning/selection overhead. Sync cases are broadly neutral.

Most notable async improvements:

mode	filter	projection	main	current	current/main	delta
async	`utf8View <> ''`	all columns	8.652 ms	6.365 ms	0.7357x	-26.43%
async	`int64 > 90`	all columns	7.958 ms	5.924 ms	0.7445x	-25.55%
async	`int64 > 90`	exclude filter column	7.603 ms	5.893 ms	0.7751x	-22.49%
async	`utf8View <> ''`	exclude filter column	7.610 ms	6.003 ms	0.7889x	-21.11%

The remaining cases are mostly within noise or small regressions. The largest sync regressions in this run were around +1.5%, while the aggregate sync result was +0.28%.

TPC-DS with predicate pushdown enabled （SF10 on a AMD64 machine)

Against main, with predicate pushdown enabled, aggregate current speedup was:

suite	aggregate speedup
TPC-DS	+3.271%

Largest median improvements:

query	main median	current median	current/main	time change	speedup
q9	613.776 ms	294.409 ms	0.4797x	-52.03%	+108.48%
q59	190.115 ms	133.384 ms	0.7016x	-29.84%	+42.53%
q70	207.927 ms	150.346 ms	0.7231x	-27.69%	+38.30%
q65	342.422 ms	249.613 ms	0.7290x	-27.10%	+37.18%
q26	128.938 ms	107.054 ms	0.8303x	-16.97%	+20.44%

q9 was the largest improvement and was stable: current was faster in 10/10 rounds.

Largest median regressions:

query	main median	current median	current/main	time change	slow rounds
q2	68.829 ms	76.223 ms	1.1074x	+10.74%	9/10
q92	45.358 ms	47.347 ms	1.0439x	+4.39%	5/10
q68	177.821 ms	185.364 ms	1.0424x	+4.24%	8/10
q77	87.460 ms	91.013 ms	1.0406x	+4.06%	8/10
q19	115.416 ms	119.864 ms	1.0385x	+3.85%	7/10

Overall, the microbenchmark shows the intended improvement in async fragmented-row-filter cases, while sync behavior remains approximately neutral. The TPC-DS run shows a positive aggregate result with several large stable wins, but also identifies q2 as the main remaining regression to investigate.

Are there any user-facing changes?

No intended breaking API changes.

RowSelectionPolicy::Auto may choose different internal execution strategies than before. Explicit Mask and Selectors policies remain available for callers that want fixed behavior.

The Mask execution path is now more robust for sparse page-loaded ranges, which makes future use of Mask safer in page-index / fragmented-selection cases.

alamb · 2026-05-14T20:12:48Z

👋

hhhizzz · 2026-05-15T02:41:16Z

👋

👋🏻 I find the result is still not stable next day I publish it, I can repro some regression in some rare case, still working on it.

alamb · 2026-05-15T14:45:42Z

Yeah, we are at the point where the code is already pretty fast, so additional optimizations get harder and harder

…ack-pr' into codex/parquet-reader-auto-fallback-pr

…auto-fallback-pr

hhhizzz · 2026-05-22T14:30:46Z

A bit of context on how this PR evolved.

The initial motivation was #8565: predicate pushdown is not always cheaper than scanning when the produced RowSelection becomes highly fragmented. After my last PR(#8733), I think there would be more improvement. My first attempt was to make Auto prefer Mask more often for fragmented selections, because a dense bitmap can avoid some of the tiny select/skip run overhead.

That turned out to be incomplete. Page pruning means the loaded pages may be sparse, and the previous mask path could assume rows were available even when their pages had not been loaded. So the work split into two parts:

make explicit Mask safe by tracking loaded row ranges;
keep Auto free to choose Selectors, Mask, or post-filter execution based on the shape of the actual read.

I also tried several purely static rules based on selectivity, projection shape, and data type. Some helped, but the results were fragile: rules that improved one fragmented case could regress sparse output reads or cacheable predicate cases. In particular, string / variable-width cases were easy to overfit. That is why the final design moved toward a small adaptive cost model instead of a larger pile of static heuristics.

The current implementation is roughly:

use row-selection shape analysis to decide between Mask and Selectors;
represent sparse loaded ranges explicitly so Mask does not fail on page-pruned data;
observe early row-group pushdown behavior;
switch later row groups to post-filter execution only when pushdown appears unlikely to pay for itself;
keep explicit Mask / Selectors policies as caller intent and only apply this adaptive behavior to Auto.

I also added focused benchmark cases after seeing that the original benchmark suite did not clearly expose the cost-model-sensitive cases. The goal was to cover both the original fragmented-selection cliff and the cases where an overly aggressive rule could regress.

So the PR is larger than a single heuristic change because the important part was separating the concepts:

what rows are selected;
which pages are actually loaded;
how that selection should be represented;
whether pushdown or post-filter execution is cheaper for Auto.

That separation is what makes the Mask correctness fix and the adaptive Auto behavior fit together.

…auto-fallback-pr

…t cost

…auto-fallback-pr

alamb · 2026-05-27T19:37:45Z

Thank you @hhhizzz -- I will try and review this later today or tomorrw

alamb · 2026-05-27T19:38:14Z

(I am not likely going to be able to review 8k lines in detail, however, so I will probably look at the high level first)

hhhizzz · 2026-05-28T03:06:45Z

(I am not likely going to be able to review 8k lines in detail, however, so I will probably look at the high level first)

Thanks for taking a look at this PR! I completely understand that an 8k-line diff is daunting to review in detail.

To help make the review process easier, I wanted to clarify only about 3,450 lines are production code, while the remaining 4,800+ lines are benchmarks and extensive unit/integration tests.

If you still feel this is too large to review as a single PR, I would be more than happy to split this into smaller , incremental PRs.😄 Here is how we can cleanly divide the work:

PR 1 (Infrastructure & Metrics): Expose ArrowReaderMetrics + refactor/extract strategy.rs and its isolated test file selection/tests.rs (No functional changes to the reader).
PR 2 (Post-Decode Filter State): Introduce post_filter.rs with its internal unit tests (Laying the groundwork for the fallback path).
PR 3 (Selection Policy & Cost Model): Add cost_model.rs, selection_policy.rs, integrate them into the push decoder state machine, and add the main benchmarks.

etseidl · 2026-05-28T15:37:42Z

If you still feel this is too large to review as a single PR, I would be more than happy to split this into smaller , incremental PRs.😄 Here is how we can cleanly divide the work:

Please do...that would help me, and would provide a baseline to compare the improvements against.

hhhizzz · 2026-06-01T00:53:57Z

Closed the PR as a reference, I'll split it to smaller PRs.

sdf-jkl · 2026-06-13T23:55:57Z

@hhhizzz I just thought about coming back to #9118 #9415 and saw your work here.

I think it looks like a better framework for tackling the issues than my PRs:

I'll take a closer look at #10135 and try to help with the next PRs coming! 🚀

alamb · 2026-06-22T14:32:22Z

One thing @adriangb and @zhuqi-lucas and I have noticed in DataFusion is that getting heuristics to work well is very challenging -- for example cutoff values often vary from architecture to architecture (e.g. is 32 contiguous 1s good, or should it be 64?)

One thing we have been exploring is a more dynamic approach -- aka to switch the predicate evaluation strategy at certain times when the decoder naturally has to re-create some state, such as between row groups, like in this PR:

feat(parquet): add ParquetPushDecoder::peek_next_row_group() #10158

It seems as if you have taken a similar approach in this PR

Adds an adaptive post-filter cost model for row groups

(caveat I have not had a chance to read this one carefully, and for that I apologize)

I think we had been planning to put more of the adaptivity at a higher level (DataFusion specifically) as it has more information about things like statistics, and cross file predicate selectivity.

I wonder if you have thought about where these auto adaptive decisions would best be made.

I do think the APIs you have outlined allow for both automatic and manually overriding (e.g. DataFusion could override the decisions made automatically) which is interesting

adriangb · 2026-06-22T14:37:40Z

I think we had been planning to put more of the adaptivity at a higher level (DataFusion specifically) as it has more information about things like statistics, and cross file predicate selectivity.

The pattern that seems to work well is:

Seed with cheap heuristics
Adapt once runtime information is available
Have multiple levels of adaptivity

That last case specifically applies here:

Datafusion can decide what even is a row filter or not
arrow-rs can optimize how it applies those row filters (e.g. when it pays the price of flattening / rolling over a mask, etc.) at a more fine-grained level

hhhizzz · 2026-06-22T14:56:29Z

I think the best model is multi-level adaptivity.

DataFusion has more high-level context, such as query semantics, file / row-group statistics, projected columns across the whole scan, and cross-file predicate selectivity. That makes it a good place to decide whether a RowFilter should be used at all, or to override the default behavior when it has stronger information.

However, the part this PR is trying to handle is lower-level. The actual shape of the final RowSelection is only known after the Parquet reader has decoded the predicate columns for a specific row group and evaluated the RowFilter. For example, whether the result is clustered, fragmented, mostly selected, or sparse is not available to DataFusion at logical / physical planning time without duplicating Parquet reader internals.

So I see the responsibilities as:

DataFusion decides the higher-level scan / filtering strategy.
arrow-rs decides how to execute a row filter once it sees the actual row-group-level selection shape and loaded page ranges.

The explicit policy APIs are important because they still allow DataFusion to override the automatic choice. But I think the default Auto behavior belongs in arrow-rs, because the information it uses is produced inside the Parquet reader during execution.

alamb

I am trying to grok this PR and figure out how to split it up and make progress.

This PR makes RowSelectionPolicy::Auto more cost-aware. Instead of treating predicate pushdown as always beneficial once planned, Auto now chooses among:

selector-backed pushdown when it can skip useful work;
mask-backed execution when fragmented selections are better represented as a dense bitmap;
adaptive post-filter execution when pushdown is unlikely to save enough decoding work.

As we discussed earlier, I agree that the multiple levels of adaptive filtering is likely what is needed for a really state of the art parquet decoding pipeline.

However, I am still struggling to figure out the most maintainable split for the functionality -- for example, adding "adaptive post filter execution" in the reader itself that sounds a lot like what @adriangb and @zhuqi-lucas are planning to implement downstram in DataFusion

I would say in general we seem to be moving more towards the higher level (DataFusion) potentially changing strategies at each Row Group boundary (and migating work across cores, etc), . There are also considerations like using predicate pushdown changes the IO pattern in the reader, which the higher level may want more control over.... 🤔

As you pointed out recently, there are some types of adaptivity that only the parquet reader itself will be able to do (for example, changing from mask --> RowSelection representation.

What do you think about first starting with just "row selection adapativity" feature and then contemplating Row Group selection ?

alamb · 2026-06-23T20:42:34Z

@hhhizzz I just thought about coming back to #9118 #9415 and saw your work here.

I think it looks like a better framework for tackling the issues than my PRs:

[Parquet] Support skipping pages with mask based evaluation #9118

Add statistics-based thresholds for deferring RowFilter selection in ReadPlanBuilder #9659

I'll take a closer look at #10135 and try to help with the next PRs coming! 🚀

Uh oh!

Conversation

hhhizzz commented May 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Auto strategy and cost model

Mask / selector planning

Post-filter execution

Benchmarks and tests

Are these changes tested?

arrow_reader_row_filter

TPC-DS with predicate pushdown enabled （SF10 on a AMD64 machine)

Are there any user-facing changes?

Uh oh!

alamb commented May 14, 2026

Uh oh!

hhhizzz commented May 15, 2026

Uh oh!

alamb commented May 15, 2026

Uh oh!

hhhizzz commented May 22, 2026

Uh oh!

alamb commented May 27, 2026

Uh oh!

alamb commented May 27, 2026

Uh oh!

hhhizzz commented May 28, 2026

Uh oh!

etseidl commented May 28, 2026

Uh oh!

hhhizzz commented Jun 1, 2026

Uh oh!

sdf-jkl commented Jun 13, 2026

Uh oh!

alamb commented Jun 22, 2026

Uh oh!

adriangb commented Jun 22, 2026

Uh oh!

hhhizzz commented Jun 22, 2026

Uh oh!

alamb left a comment

Choose a reason for hiding this comment

Uh oh!

alamb commented Jun 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

hhhizzz commented May 10, 2026 •

edited

Loading

`arrow_reader_row_filter`