bench(parquet): add ListArray benchmarks for runtime and peak memory #9846

Merged: alamb merged 2 commits into apache:main from HippoBaro:unpadded_child_mode_bench on May 7, 2026

@HippoBaro HippoBaro commented Apr 28, 2026

Which issue does this PR close?

- Contributes to #9731
- Dependency of #9848

Rationale for this change

See #9848

Existing benchmarks have some gaps in the types of columns they exercise. Additionally, I would like to improve the memory efficiency of the read/decode path in terms of RSS requirements, especially for sparse inputs, and we currently have no infrastructure to measure that.

What changes are included in this PR?

Extend the existing `arrow_reader` runtime benchmarks with `Int32` and `FixedBinary32` list columns alongside the existing `StringList`, with parameterized null density (0%, 50%, 90%, 99%). The prior benchmarks only covered string lists, which didn't surface costs specific to fixed-width and primitive element types.

Add a new `arrow_reader_peak_memory` benchmark that measures peak heap usage during `ListArrayReader::consume_batch` using a thread-local tracking allocator. It captures how RSS-efficient we are when materializing a column into its final Arrow in-memory representation.
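
For readers unfamiliar with the technique, here is a minimal sketch of a thread-local tracking allocator, assuming only the standard library. It is not the code from this PR: `TrackingAllocator`, `record_alloc`, `record_dealloc`, and `measure_peak` are hypothetical names, and the real benchmark may count things differently.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::cell::Cell;

thread_local! {
    // Bytes currently live on this thread, and the high-water mark.
    // Const-initialized `Cell`s need no heap allocation and no destructor.
    static LIVE: Cell<usize> = const { Cell::new(0) };
    static PEAK: Cell<usize> = const { Cell::new(0) };
}

/// Hypothetical wrapper: forwards to the system allocator and keeps
/// per-thread counters. The default `realloc` and `alloc_zeroed`
/// implementations route through `alloc`/`dealloc`, so they are counted too.
struct TrackingAllocator;

unsafe impl GlobalAlloc for TrackingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            record_alloc(layout.size());
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        record_dealloc(layout.size());
    }
}

fn record_alloc(size: usize) {
    // `try_with` avoids a panic if the thread-local is already torn down.
    let _ = LIVE.try_with(|live| {
        let now = live.get() + size;
        live.set(now);
        let _ = PEAK.try_with(|peak| peak.set(peak.get().max(now)));
    });
}

fn record_dealloc(size: usize) {
    // Saturating: memory may be freed on a thread that did not allocate it.
    let _ = LIVE.try_with(|live| live.set(live.get().saturating_sub(size)));
}

/// Runs `f` and returns its result together with the peak number of bytes
/// live on this thread above the level observed at entry.
fn measure_peak<T>(f: impl FnOnce() -> T) -> (T, usize) {
    let baseline = LIVE.with(|live| live.get());
    PEAK.with(|peak| peak.set(baseline));
    let out = f();
    let peak = PEAK.with(|peak| peak.get());
    (out, peak.saturating_sub(baseline))
}

#[global_allocator]
static GLOBAL: TrackingAllocator = TrackingAllocator;
```

Wrapping the call under test, for example `measure_peak(|| reader.consume_batch())`, would then report the high-water mark for that call only. Keeping the counters thread-local means allocations made by other threads do not pollute the measurement.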

Are these changes tested?

All tests passing.

Are there any user-facing changes?

None.

github-actions bot added the `parquet` label (Changes to the parquet crate) on Apr 28, 2026
@HippoBaro

This PR adds benchmarks with a custom `Measurement`. In this case, we measure RSS and cumulative memory usage (target `arrow_reader_peak_memory`).

I am not sure if the benchmark bot will like those new units:

```
arrow_array_reader/ListArray_peak_memory/Int32List/no NULLs
                        time:   [836.51 KiB 836.51 KiB 836.51 KiB]
arrow_array_reader/ListArray_peak_memory/Int32List/half NULLs
                        time:   [482.01 KiB 482.01 KiB 482.01 KiB]
arrow_array_reader/ListArray_peak_memory/Int32List/90pct NULLs
                        time:   [271.95 KiB 271.95 KiB 271.95 KiB]
arrow_array_reader/ListArray_peak_memory/Int32List/99pct NULLs
                        time:   [217.96 KiB 217.96 KiB 217.96 KiB]
```
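
To show where units like `KiB` come from, here is a rough sketch of the scaling step a byte-based measurement needs. It mirrors the shape of Criterion's `ValueFormatter::scale_values` (a typical value, a mutable slice of samples, and a unit label returned) but is a standalone function; `scale_bytes` is a hypothetical name and not the code from this PR.

```rust
/// Scales raw byte counts in place to a binary unit chosen from a typical
/// value, and returns the label of that unit.
fn scale_bytes(typical: f64, values: &mut [f64]) -> &'static str {
    const KIB: f64 = 1024.0;
    let (divisor, unit) = if typical < KIB {
        (1.0, "B")
    } else if typical < KIB * KIB {
        (KIB, "KiB")
    } else if typical < KIB * KIB * KIB {
        (KIB * KIB, "MiB")
    } else {
        (KIB * KIB * KIB, "GiB")
    };
    // Every sample is divided by the same divisor so they stay comparable.
    for value in values.iter_mut() {
        *value /= divisor;
    }
    unit
}
```

Because an allocation count is deterministic, all three values in each reported interval are identical, which is why the lower bound, estimate, and upper bound match in the output above.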

```
let schema = build_test_schema();

let mut group = c.benchmark_group("arrow_array_reader/FIXED_LEN_BYTE_ARRAY/Float16Array");
let mandatory_f16_leaf_desc = schema.column(17);
```

One quick ask: could you move the new schema elements to the bottom so we don't have to renumber everything? 🙏 😅

Will do! 🙇

etseidl commented May 6, 2026

Not your problem, but I find it funny that the peak memory test prints out:

```
arrow_array_reader/ListArray_allocated_bytes/Int32List/no NULLs
                        time:   [10.458 MiB 10.458 MiB 10.458 MiB]
arrow_array_reader/ListArray_allocated_bytes/Int32List/half NULLs
                        time:   [5.9797 MiB 5.9797 MiB 5.9797 MiB]
```

I looked at the Criterion source and "time:" is sadly hard coded :(

@etseidl etseidl left a comment

Looks legit to me. Thanks @HippoBaro.

Extend the existing `arrow_reader` runtime benchmarks with `Int32` and
`FixedBinary32` list columns alongside the existing `StringList`, with
parameterized null density (0%, 50%, 90%, 99%). The prior benchmarks
only covered string lists, which didn't surface costs specific to
fixed-width and primitive element types.

Add a new `arrow_reader_peak_memory` benchmark that measures peak heap
usage during `ListArrayReader::consume_batch` using a thread-local
tracking allocator. It captures how RSS-efficient we are when
materializing a column into its final Arrow in-memory representation.

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
@HippoBaro force-pushed the unpadded_child_mode_bench branch from 45bfbf2 to 989c468 on May 7, 2026 02:08
@HippoBaro

Thanks @etseidl! Sorry for the spurious diff. It's rebased and fixed.

> I looked at the Criterion source and "time:" is sadly hard coded :(

I also tried to change that 😅 It’s also quite sad that Criterion won’t let me measure multiple things per kernel. In essence, these tests are no different from the CPU-equivalent ones; they just measure a different thing 🤷

alamb commented May 7, 2026

The MSRV failure is unrelated to this PR. For more details:

alamb commented May 7, 2026

Merged up to get a clean CI run

@alamb alamb merged commit 7abb225 into apache:main May 7, 2026
17 checks passed
Rich-T-kid pushed a commit to Rich-T-kid/arrow-rs that referenced this pull request Jun 2, 2026
apache#9846)

# Which issue does this PR close?

- Contributes to apache#9731
- Dependency of apache#9848

# Rationale for this change

See apache#9848

Existing benchmarks have some gaps in the types of columns they
exercise. Additionally, I would like to improve the memory efficiency of
the read/decode path in terms of RSS requirements, especially for sparse
inputs and we currently do not have any infrastructure to measure that.

# What changes are included in this PR?

Extend the existing `arrow_reader` runtime benchmarks with `Int32` and
`FixedBinary32` list columns alongside the existing `StringList`, with
parameterized null density (0%, 50%, 90%, 99%). The prior benchmarks
only covered string lists, which didn't surface costs specific to
fixed-width and primitive element types.

Add a new `arrow_reader_peak_memory` benchmark that measures peak heap
usage during `ListArrayReader::consume_batch` using a thread-local
tracking allocator. It captures how RSS-efficient we are when
materializing a column into its final Arrow in-memory representation.

# Are these changes tested?

All tests passing.

# Are there any user-facing changes?

None.

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
alamb pushed a commit that referenced this pull request Jul 2, 2026
# Which issue does this PR close?

- Contributes to #9731
- Depends on #9847
- Depends on #9846

# Rationale for this change

Parquet list decoding currently materializes padding for null or empty
parent lists and then copies the child array to filter that padding back
out. This is expensive for nested list columns, especially sparse lists
and fixed-width children, where memory can scale with decoded levels
instead of actual emitted child values.
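
As a rough illustration of that scaling, the sketch below counts child slots for an optional list of optional elements under the standard Parquet list encoding. `child_slot_counts` is a hypothetical helper, and the "padded" count assumes one child slot per decoded level, which is a simplification of the behaviour described above rather than the reader's actual code.

```rust
/// Counts child slots for an optional list of optional elements, where
/// definition levels mean:
///   0 = null list, 1 = empty list, 2 = null element, 3 = present element.
/// Returns `(padded, compact)`: one slot per decoded level versus one slot
/// per element that actually belongs to the child array.
fn child_slot_counts(def_levels: &[i16]) -> (usize, usize) {
    // Levels below this describe the parent list only (null or empty).
    const FIRST_ELEMENT_LEVEL: i16 = 2;
    let padded = def_levels.len();
    let compact = def_levels
        .iter()
        .filter(|&&level| level >= FIRST_ELEMENT_LEVEL)
        .count();
    (padded, compact)
}
```

For the rows `[null, null, [7, 8], [], null, [null]]` the levels are `[0, 0, 3, 3, 1, 0, 2]`, giving 7 padded slots against 3 compact ones. With 32-byte fixed-width children that is 224 bytes versus 96, and the gap widens as the share of null lists grows.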

# What changes are included in this PR?

This PR makes list child readers emit compact child arrays directly by
pushing selective null padding into the leaf `RecordReader`. It also
builds definition-level validity bitmaps word-at-a-time, sizes child
buffers after levels are decoded, and adds list runtime and peak-memory
benchmarks across element types and null densities.
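
A minimal sketch of the word-at-a-time idea, assuming a leaf whose values are valid exactly when the definition level equals the maximum: 64 validity bits are accumulated in a register and stored once, rather than appended bit by bit. `validity_words` is a hypothetical standalone function and does not use `BooleanBufferBuilder`.

```rust
/// Packs a validity bitmap from definition levels, least significant bit
/// first (the bit order Arrow uses), writing one `u64` per 64 levels.
fn validity_words(def_levels: &[i16], max_def_level: i16) -> Vec<u64> {
    let mut words = Vec::with_capacity((def_levels.len() + 63) / 64);
    for chunk in def_levels.chunks(64) {
        let mut word = 0u64;
        for (bit, &level) in chunk.iter().enumerate() {
            // Branch-free: the comparison becomes 0 or 1 and is shifted in.
            word |= ((level == max_def_level) as u64) << bit;
        }
        // A trailing partial chunk leaves its unused high bits at zero.
        words.push(word);
    }
    words
}
```

For levels `[2, 0, 2, 2, 1]` with a maximum of 2, bits 0, 2, and 3 are set, so the single output word is `0b01101`.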

# Are these changes tested?

The new logic is covered extensively by the existing list reader tests,
which now exercise the production `PrimitiveArrayReader` path with
in-memory Parquet pages (see #9846). `BooleanBufferBuilder::append_word`
also has targeted unit coverage.

Benchmark results:
```
  Name                                           Before       After        Delta
  ListArray/StringList/no NULLs                 7.3395 ms    5.8001 ms    (-21.0%)
  ListArray/StringList/half NULLs               4.1255 ms    3.4274 ms    (-16.9%)
  ListArray/Int32List/90pct NULLs               1.0366 ms    975.63 us     (-5.9%)
  ListArray/Fixed32List/no NULLs                4.9298 ms    3.3137 ms    (-32.8%)
  ListArray/Fixed32List/half NULLs              2.8998 ms    2.7258 ms     (-6.0%)
  ListArray/Fixed32List/90pct NULLs             1.0556 ms    995.82 us     (-5.7%)
  ListArray/Fixed32List/99pct NULLs             508.65 us    467.16 us     (-8.2%)

  Name                                           Before       After        Delta
  ListArray_peak_memory/Int32List/no NULLs      836.51 KiB   574.79 KiB   (-31.3%)
  ListArray_peak_memory/Int32List/half NULLs    482.01 KiB   336.29 KiB   (-30.2%)
  ListArray_peak_memory/Int32List/90pct NULLs   271.95 KiB   175.32 KiB   (-35.5%)
  ListArray_peak_memory/Int32List/99pct NULLs   217.96 KiB   120.04 KiB   (-44.9%)
  ListArray_peak_memory/DoubleList/no NULLs     1.2399 MiB   715.31 KiB   (-43.7%)
  ListArray_peak_memory/DoubleList/half NULLs   753.66 KiB   400.39 KiB   (-46.9%)
  ListArray_peak_memory/DoubleList/90pct NULLs  380.62 KiB   190.89 KiB   (-49.8%)
  ListArray_peak_memory/DoubleList/99pct NULLs  315.21 KiB   121.61 KiB   (-61.4%)
  ListArray_peak_memory/Fixed32List/no NULLs    3.8031 MiB   1.5760 MiB   (-58.6%)
  ListArray_peak_memory/Fixed32List/half NULLs  2.1710 MiB   849.94 KiB   (-61.8%)
  ListArray_peak_memory/Fixed32List/90pct NULLs 1.0017 MiB   277.35 KiB   (-73.0%)
  ListArray_peak_memory/Fixed32List/99pct NULLs 898.69 KiB   130.93 KiB   (-85.4%)
  ListArray_peak_memory/StringList/no NULLs     3.7925 MiB   2.4715 MiB   (-34.8%)
  ListArray_peak_memory/StringList/half NULLs   1.2541 MiB   772.94 KiB   (-39.8%)
  ListArray_peak_memory/StringList/90pct NULLs  296.63 KiB   188.96 KiB   (-36.3%)
  ListArray_peak_memory/StringList/99pct NULLs  226.75 KiB   120.37 KiB   (-46.9%)

  Name                                               Before      After       Delta
  ListArray_allocated_bytes/Int32List/no NULLs      10.458 MiB  6.8018 MiB  (-35.0%)
  ListArray_allocated_bytes/Int32List/half NULLs    5.9797 MiB  4.0127 MiB  (-32.9%)
  ListArray_allocated_bytes/Int32List/90pct NULLs   2.9985 MiB  1.8210 MiB  (-39.3%)
  ListArray_allocated_bytes/Int32List/99pct NULLs   2.5579 MiB  1.3733 MiB  (-46.3%)
  ListArray_allocated_bytes/DoubleList/no NULLs     16.083 MiB  8.6546 MiB  (-46.2%)
  ListArray_allocated_bytes/DoubleList/half NULLs   8.9134 MiB  4.8497 MiB  (-45.6%)
  ListArray_allocated_bytes/DoubleList/90pct NULLs  4.3656 MiB  2.0179 MiB  (-53.8%)
  ListArray_allocated_bytes/DoubleList/99pct NULLs  3.7482 MiB  1.3903 MiB  (-62.9%)
  ListArray_allocated_bytes/Fixed32List/no NULLs    49.441 MiB  19.505 MiB  (-60.5%)
  ListArray_allocated_bytes/Fixed32List/half NULLs  26.846 MiB  10.459 MiB  (-61.0%)
  ListArray_allocated_bytes/Fixed32List/90pct NULLs 12.483 MiB  3.1127 MiB  (-75.1%)
  ListArray_allocated_bytes/Fixed32List/99pct NULLs 10.895 MiB  1.4980 MiB  (-86.3%)
  ListArray_allocated_bytes/StringList/no NULLs     47.519 MiB  21.743 MiB  (-54.2%)
  ListArray_allocated_bytes/StringList/half NULLs   19.097 MiB  10.478 MiB  (-45.1%)
  ListArray_allocated_bytes/StringList/90pct NULLs  3.4203 MiB  2.1165 MiB  (-38.1%)
  ListArray_allocated_bytes/StringList/99pct NULLs  2.6424 MiB  1.3777 MiB  (-47.9%)
```
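
The Delta column appears to be the relative change `(after - before) / before`. The sketch below reproduces that arithmetic under this assumption; `delta_percent` and `format_delta` are hypothetical helpers, and both inputs must first be converted to the same unit (for example 1.0366 ms becomes 1036.6 us).

```rust
/// Relative change from `before` to `after`, in percent.
/// Negative values mean the "after" measurement is smaller.
fn delta_percent(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}

/// Formats the change the way the tables above show it, e.g. "(-21.0%)".
fn format_delta(before: f64, after: f64) -> String {
    format!("({:+.1}%)", delta_percent(before, after))
}
```

For instance, `format_delta(7.3395, 5.8001)` yields `(-21.0%)`, matching the `ListArray/StringList/no NULLs` row.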



# Are there any user-facing changes?

None.

---------

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>