feat(parquet): selective null padding for list child readers #9848
Conversation
54b90c7 to
0879144
Compare
|
Although the code is ready for review, this is a draft PR because it depends on #9847 and #9848. The contents of both dependencies are included here as well so reviewers can better understand the direction of these changes, which may be hard to appreciate in isolation. I'll rebase and undraft once those have been merged 🙇 Feedback is appreciated! |
…er` in tests (#9847) # Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. --> - Contributes to #9731 - Dependency of #9848 # Rationale for this change <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> `InMemoryArrayReader` couples a column's in-memory representation (an Arrow array) with its storage representation (def/rep levels) and assumes a 1:1 mapping between array elements and levels. This holds when list readers consume fully-padded child arrays — one element per level, nulls included. Upcoming work (see #9848) pushes null filtering from `ListArrayReader` down into the child reader at the storage level, breaking that 1:1 assumption: the child returns fewer array elements than levels, and the mapping between them depends on the filtering logic itself. Keeping the mock would mean reimplementing that logic: testing filtered output against a second, hand-rolled filter. # What changes are included in this PR? <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> Replace `InMemoryArrayReader` with real `PrimitiveArrayReader` instances backed by in-memory Parquet pages. Tests now accept raw non-null values and levels (matching what Parquet actually stores) and exercise the production `RecordReader` path. # Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> This is a net positive in test coverage. The existing tests now exercise real readers instead of in-memory and already-dense representations. # Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. If there are any breaking changes to public APIs, please call them out. --> None. Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
#9846) # Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. --> - Contributes to #9731 - Dependency of #9848 # Rationale for this change See #9848 Existing benchmarks have some gaps in the types of columns they exercise. Additionally, I would like to improve the memory efficiency of the read/decode path in terms of RSS requirements, especially for sparse inputs and we currently do not have any infrastructure to measure that. # What changes are included in this PR? Extend the existing `arrow_reader` runtime benchmarks with `Int32` and `FixedBinary32` list columns alongside the existing `StringList`, with parameterized null density (0%, 50%, 90%, 99%). The prior benchmarks only covered string lists, which didn't surface costs specific to fixed-width and primitive element types. Add a new `arrow_reader_peak_memory` benchmark that measures peak heap usage during `ListArrayReader::consume_batch` using a thread-local tracking allocator. It captures how RSS-efficient we are when materializing a column into its final Arrow in-memory representation. # Are these changes tested? All tests passing. # Are there any user-facing changes? None. Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com> Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
|
I plan to review this once I find time. I had my own ideas about optimizing list decoding, and the core algorithm using |
|
Maybe we can rebase this PR now to make its review easier |
0879144 to
809f71f
Compare
|
@jhorstmann @alamb Thank you both for offering to review! Now that both the test and benchmark PRs are merged, I have rebased the branch. It’s ready for your 👀. The branch now has three commits. The first implements the main change by pushing selective null padding down into the leaf decoder path, avoiding dense child materialization where possible. The next two are follow-ups that optimize supporting paths that are now performance-critical: definition-level bitmap construction and child buffer sizing. |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
@alamb The benchmark runner never posted its results. Maybe let's give it another try? |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
Sorry @alamb, For CPU benchmarks, the results are roughly as expected ( |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
wow -- checking this one out |
alamb
left a comment
There was a problem hiding this comment.
Thank you @HippoBaro -- thhe basic idea of this PR looks really nice to me
Ileft some high level comments -- let me know what you think
|
Marking as draft as I think this PR is no longer waiting on feedback and I am trying to make it easier to find PRs in need of review. Please mark it as ready for review when it is ready for another look |
This comment has been minimized.
This comment has been minimized.
|
run benchmark arrow_writer env: |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
|
@alamb I’m not sure why the benchmarks are failing there. Is there a way to see the full output? |
alamb
left a comment
There was a problem hiding this comment.
I think this one looks ready to go from my perspective. Thank you so much @HippoBaro and @jhorstmann
I hav gone over it again, and had codex do the same. We found some additional test coverage that might help but I'll add that in a follow on PR.
Once I have confirmed with perfomance tests, I'll merge this one in
I think it has to do with how I was trying to apply a filter. I'll run the full suite intsead |
This comment was marked as outdated.
This comment was marked as outdated.
This comment has been minimized.
This comment has been minimized.
|
Actually the problem looks like I was running the writer tests (not the reader tests 🤦 ) |
|
run benchmark arrow_reader env:
BENCH_FILTER: ListArray |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing unpadded_child_mode (6598016) to 9b190c9 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing unpadded_child_mode (6598016) to 9b190c9 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing unpadded_child_mode (6598016) to 9b190c9 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
I also have a small follow on PR with some additonal tests |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
These benchmarks are quite unreliable. This code doesn't touch any writer side code and yet |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
Performance looks good to me! I agree that the performance benchmarks are not as consistent as I would like 😢 group main unpadded_child_mode
----- ---- -------------------
arrow_array_reader/ListArray/Fixed32List/90pct NULLs 1.14 1052.3±5.11µs ? ?/sec 1.00 920.5±4.44µs ? ?/sec
arrow_array_reader/ListArray/Fixed32List/99pct NULLs 1.09 490.6±2.36µs ? ?/sec 1.00 451.5±17.68µs ? ?/sec
arrow_array_reader/ListArray/Fixed32List/half NULLs 1.19 2.7±0.01ms ? ?/sec 1.00 2.3±0.03ms ? ?/sec
arrow_array_reader/ListArray/Fixed32List/no NULLs 1.40 4.3±0.02ms ? ?/sec 1.00 3.1±0.06ms ? ?/sec
arrow_array_reader/ListArray/Int32List/90pct NULLs 1.08 988.8±2.49µs ? ?/sec 1.00 912.1±3.96µs ? ?/sec
arrow_array_reader/ListArray/Int32List/99pct NULLs 1.00 444.7±1.17µs ? ?/sec 1.01 451.3±17.78µs ? ?/sec
arrow_array_reader/ListArray/Int32List/half NULLs 1.09 2.2±0.01ms ? ?/sec 1.00 2.0±0.01ms ? ?/sec
arrow_array_reader/ListArray/Int32List/no NULLs 1.07 3.0±0.02ms ? ?/sec 1.00 2.8±0.01ms ? ?/sec
arrow_array_reader/ListArray/StringList/90pct NULLs 1.16 1131.8±2.59µs ? ?/sec 1.00 973.7±3.96µs ? ?/sec
arrow_array_reader/ListArray/StringList/99pct NULLs 1.00 456.8±1.84µs ? ?/sec 1.00 455.3±18.04µs ? ?/sec
arrow_array_reader/ListArray/StringList/half NULLs 1.29 4.0±0.01ms ? ?/sec 1.00 3.1±0.04ms ? ?/sec
arrow_array_reader/ListArray/StringList/no NULLs 1.43 7.1±0.03ms ? ?/sec 1.00 5.0±0.07ms ? ?/sec
ar |
|
Thanks again @HippoBaro |
Which issue does this PR close?
InMemoryArrayReaderwithPrimitiveArrayReaderin tests #9847ListArraybenchmarks for runtime and peak memory #9846Rationale for this change
Parquet list decoding currently materializes padding for null or empty parent lists and then copies the child array to filter that padding back out. This is expensive for nested list columns, especially sparse lists and fixed-width children, where memory can scale with decoded levels instead of actual emitted child values.
What changes are included in this PR?
This PR makes list child readers emit compact child arrays directly by pushing selective null padding into the leaf
RecordReader. It also builds definition-level validity bitmaps word-at-a-time, sizes child buffers after levels are decoded, and adds list runtime and peak-memory benchmarks across element types and null densities.Are these changes tested?
Extensive test coverage for the new logic through existing list reader tests, which now exercise the production
PrimitiveArrayReaderpath with in-memory Parquet pages (see #9846), andBooleanBufferBuilder::append_wordhas targeted unit coverage.Benchmarks results:
Are there any user-facing changes?
None.