Skip to content

test(parquet): replace InMemoryArrayReader with PrimitiveArrayReader in tests#9847

Merged
alamb merged 1 commit into
apache:mainfrom
HippoBaro:unpadded_child_mode_test
May 5, 2026
Merged

test(parquet): replace InMemoryArrayReader with PrimitiveArrayReader in tests#9847
alamb merged 1 commit into
apache:mainfrom
HippoBaro:unpadded_child_mode_test

Conversation

@HippoBaro

@HippoBaro HippoBaro commented Apr 28, 2026

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Rationale for this change

InMemoryArrayReader couples a column's in-memory representation (an Arrow array) with its storage representation (def/rep levels) and assumes a 1:1 mapping between array elements and levels. This holds when list readers consume fully-padded child arrays — one element per level, nulls included.

Upcoming work (see #9848) pushes null filtering from ListArrayReader down into the child reader at the storage level, breaking that 1:1 assumption: the child returns fewer array elements than levels, and the mapping between them depends on the filtering logic itself. Keeping the mock would mean reimplementing that logic: testing filtered output against a second, hand-rolled filter.

What changes are included in this PR?

Replace InMemoryArrayReader with real PrimitiveArrayReader instances backed by in-memory Parquet pages. Tests now accept raw non-null values and levels (matching what Parquet actually stores) and exercise the production RecordReader path.

Are these changes tested?

This is a net positive in test coverage. The existing tests now exercise real readers instead of in-memory and already-dense representations.

Are there any user-facing changes?

None.

`InMemoryArrayReader` coupled a column's in-memory representation (an
Arrow array) with its storage representation (def/rep levels) by
assuming a 1:1 mapping between array elements and levels. This held
when list readers consumed fully-padded child arrays — one element per
level, nulls included.

Upcoming work pushes null filtering from `ListArrayReader` down into the
child reader at the storage level, breaking that 1:1 assumption: the
child returns fewer array elements than levels, and the mapping between
them depends on the filtering logic itself. Keeping the mock would mean
reimplementing that logic: testing filtered output against a second,
hand-rolled filter.

Replace `InMemoryArrayReader` with real `PrimitiveArrayReader` instances
backed by in-memory Parquet pages. Tests now accept raw non-null values
and levels (matching what Parquet actually stores) and exercise the
production `RecordReader` path.

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
@github-actions github-actions Bot added the parquet Changes to the parquet crate label Apr 28, 2026
@alamb alamb changed the title test(parquet): replace InMemoryArrayReader with PrimitiveArrayReader test(parquet): replace InMemoryArrayReader with PrimitiveArrayReader in tests May 1, 2026

@alamb alamb left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @HippoBaro -- looks good to me. I had some API / readability suggestions but I don't think they are required. We can do them as a follow on PR (or never)

/// `values` must contain only the non-null values (entries where
/// `def_levels[i] == max_def_level`), in order, as Parquet encodes them.
pub fn make_int32_page_reader(
values: &[i32],

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I double checked that all callsites actually used DataType::Int32 which makes sense to me

Some(vec![0, 1, 2, 3, 1]),
Some(vec![0, 1, 1, 1, 1]),
);
let array_reader_1 = make_int32_page_reader(&[4], &[0, 1, 2, 3, 1], &[0, 1, 1, 1, 1], 3, 1);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As an API suggestion (maybe for another PR, not required):

I would personally find this much easier to understand if the fields were named so I didn't have to refer to make_int32_page_reader to figure out what&[0, 1, 2, 3, 1] and &[0, 1, 1, 1, 1] meant.

Perhaps we could do this with an builder in the test. Something like this perhaps:

let array_reader_1 = TestArrayReaderBuilder::new()
 .with_non_null_i32_values(&[4]) // <-- this is especially non obvous if you are used to normal Arrow arrays
 .with_def_levels(&[0, 1, 2, 3, 1])
 .with_rep_levels(&[0, 1, 1, 1, 1])
 .with_max_def_level(3)
 .with_max_rep_level(1)
 .build()

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh that would be lovely

@alamb alamb May 5, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/// `RecordReader` code path (including selective padding when a parent
/// `ListArrayReader` calls `enable_selective_padding`).
///
/// `values` must contain only the non-null values (entries where

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@alamb alamb merged commit 21f739c into apache:main May 5, 2026
20 checks passed
@alamb

alamb commented May 5, 2026

Copy link
Copy Markdown
Contributor

Thanks again @HippoBaro

Rich-T-kid pushed a commit to Rich-T-kid/arrow-rs that referenced this pull request Jun 2, 2026
…er` in tests (apache#9847)

# Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.
-->

- Contributes to apache#9731
- Dependency of apache#9848

# Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

`InMemoryArrayReader` couples a column's in-memory representation (an
Arrow array) with its storage representation (def/rep levels) and
assumes a 1:1 mapping between array elements and levels. This holds when
list readers consume fully-padded child arrays — one element per level,
nulls included.

Upcoming work (see apache#9848) pushes null filtering from `ListArrayReader`
down into the child reader at the storage level, breaking that 1:1
assumption: the child returns fewer array elements than levels, and the
mapping between them depends on the filtering logic itself. Keeping the
mock would mean reimplementing that logic: testing filtered output
against a second, hand-rolled filter.

# What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

Replace `InMemoryArrayReader` with real `PrimitiveArrayReader` instances
backed by in-memory Parquet pages. Tests now accept raw non-null values
and levels (matching what Parquet actually stores) and exercise the
production `RecordReader` path.

# Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

This is a net positive in test coverage. The existing tests now exercise
real readers instead of in-memory and already-dense representations.

# Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.
-->

None.

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
alamb pushed a commit that referenced this pull request Jul 2, 2026
# Which issue does this PR close?

<!--
We generally require a GitHub issue to be filed for all bug fixes and
enhancements and this helps us generate change logs for our releases.
You can link an issue to this PR using the GitHub syntax.
-->

- Contributes to #9731
- Depend on #9847
- Depend on #9846

# Rationale for this change

<!--
Why are you proposing this change? If this is already explained clearly
in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand
your changes and offer better suggestions for fixes.
-->

Parquet list decoding currently materializes padding for null or empty
parent lists and then copies the child array to filter that padding back
out. This is expensive for nested list columns, especially sparse lists
and fixed-width children, where memory can scale with decoded levels
instead of actual emitted child values.

# What changes are included in this PR?

<!--
There is no need to duplicate the description in the issue here but it
is sometimes worth providing a summary of the individual changes in this
PR.
-->

This PR makes list child readers emit compact child arrays directly by
pushing selective null padding into the leaf `RecordReader`. It also
builds definition-level validity bitmaps word-at-a-time, sizes child
buffers after levels are decoded, and adds list runtime and peak-memory
benchmarks across element types and null densities.

# Are these changes tested?

<!--
We typically require tests for all PRs in order to:
1. Prevent the code from being accidentally broken by subsequent changes
2. Serve as another way to document the expected behavior of the code

If tests are not included in your PR, please explain why (for example,
are they covered by existing tests)?
-->

Extensive test coverage for the new logic through existing list reader
tests, which now exercise the production `PrimitiveArrayReader` path
with in-memory Parquet pages (see #9846), and
`BooleanBufferBuilder::append_word` has targeted unit coverage.

Benchmarks results:
```
  Name                                           Before       After        Delta
  ListArray/StringList/no NULLs                 7.3395 ms    5.8001 ms    (-21.0%)
  ListArray/StringList/half NULLs               4.1255 ms    3.4274 ms    (-16.9%)
  ListArray/Int32List/90pct NULLs               1.0366 ms    975.63 us     (-5.9%)
  ListArray/Fixed32List/no NULLs                4.9298 ms    3.3137 ms    (-32.8%)
  ListArray/Fixed32List/half NULLs              2.8998 ms    2.7258 ms     (-6.0%)
  ListArray/Fixed32List/90pct NULLs             1.0556 ms    995.82 us     (-5.7%)
  ListArray/Fixed32List/99pct NULLs             508.65 us    467.16 us     (-8.2%)

  Name                                           Before       After        Delta
  ListArray_peak_memory/Int32List/no NULLs      836.51 KiB   574.79 KiB   (-31.3%)
  ListArray_peak_memory/Int32List/half NULLs    482.01 KiB   336.29 KiB   (-30.2%)
  ListArray_peak_memory/Int32List/90pct NULLs   271.95 KiB   175.32 KiB   (-35.5%)
  ListArray_peak_memory/Int32List/99pct NULLs   217.96 KiB   120.04 KiB   (-44.9%)
  ListArray_peak_memory/DoubleList/no NULLs     1.2399 MiB   715.31 KiB   (-43.7%)
  ListArray_peak_memory/DoubleList/half NULLs   753.66 KiB   400.39 KiB   (-46.9%)
  ListArray_peak_memory/DoubleList/90pct NULLs  380.62 KiB   190.89 KiB   (-49.8%)
  ListArray_peak_memory/DoubleList/99pct NULLs  315.21 KiB   121.61 KiB   (-61.4%)
  ListArray_peak_memory/Fixed32List/no NULLs    3.8031 MiB   1.5760 MiB   (-58.6%)
  ListArray_peak_memory/Fixed32List/half NULLs  2.1710 MiB   849.94 KiB   (-61.8%)
  ListArray_peak_memory/Fixed32List/90pct NULLs 1.0017 MiB   277.35 KiB   (-73.0%)
  ListArray_peak_memory/Fixed32List/99pct NULLs 898.69 KiB   130.93 KiB   (-85.4%)
  ListArray_peak_memory/StringList/no NULLs     3.7925 MiB   2.4715 MiB   (-34.8%)
  ListArray_peak_memory/StringList/half NULLs   1.2541 MiB   772.94 KiB   (-39.8%)
  ListArray_peak_memory/StringList/90pct NULLs  296.63 KiB   188.96 KiB   (-36.3%)
  ListArray_peak_memory/StringList/99pct NULLs  226.75 KiB   120.37 KiB   (-46.9%)

  Name                                               Before      After       Delta
  ListArray_allocated_bytes/Int32List/no NULLs      10.458 MiB  6.8018 MiB  (-35.0%)
  ListArray_allocated_bytes/Int32List/half NULLs    5.9797 MiB  4.0127 MiB  (-32.9%)
  ListArray_allocated_bytes/Int32List/90pct NULLs   2.9985 MiB  1.8210 MiB  (-39.3%)
  ListArray_allocated_bytes/Int32List/99pct NULLs   2.5579 MiB  1.3733 MiB  (-46.3%)
  ListArray_allocated_bytes/DoubleList/no NULLs     16.083 MiB  8.6546 MiB  (-46.2%)
  ListArray_allocated_bytes/DoubleList/half NULLs   8.9134 MiB  4.8497 MiB  (-45.6%)
  ListArray_allocated_bytes/DoubleList/90pct NULLs  4.3656 MiB  2.0179 MiB  (-53.8%)
  ListArray_allocated_bytes/DoubleList/99pct NULLs  3.7482 MiB  1.3903 MiB  (-62.9%)
  ListArray_allocated_bytes/Fixed32List/no NULLs    49.441 MiB  19.505 MiB  (-60.5%)
  ListArray_allocated_bytes/Fixed32List/half NULLs  26.846 MiB  10.459 MiB  (-61.0%)
  ListArray_allocated_bytes/Fixed32List/90pct NULLs 12.483 MiB  3.1127 MiB  (-75.1%)
  ListArray_allocated_bytes/Fixed32List/99pct NULLs 10.895 MiB  1.4980 MiB  (-86.3%)
  ListArray_allocated_bytes/StringList/no NULLs     47.519 MiB  21.743 MiB  (-54.2%)
  ListArray_allocated_bytes/StringList/half NULLs   19.097 MiB  10.478 MiB  (-45.1%)
  ListArray_allocated_bytes/StringList/90pct NULLs  3.4203 MiB  2.1165 MiB  (-38.1%)
  ListArray_allocated_bytes/StringList/99pct NULLs  2.6424 MiB  1.3777 MiB  (-47.9%)
```



# Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.

If there are any breaking changes to public APIs, please call them out.
-->

None.

---------

Signed-off-by: Hippolyte Barraud <hippolyte.barraud@datadoghq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants