Add validated row decode benchmark #10259

Merged
Jefffrey merged 4 commits into apache:main from alamb:codex/bench-validated-row-decode
Jul 2, 2026

@alamb alamb commented Jul 1, 2026

Which issue does this PR close?

Rationale for this change

The existing row format benchmark measures convert_rows using rows created directly by RowConverter::convert_columns, which skips UTF-8 validation during decode.

What changes are included in this PR?

This adds a convert_rows_validated benchmark that decodes rows parsed through RowParser, exercising the UTF-8 validation path. It also derives Clone for Rows so the benchmark setup can reuse the prepared rows when converting to binary.

Are these changes tested?

CI.

I also ran it locally with:

cargo bench --bench row_format -- convert_rows_validated

Are there any user-facing changes?

No breaking changes. Rows now implements Clone.

@github-actions github-actions Bot added the arrow Changes to the arrow crate label Jul 1, 2026
@alamb alamb force-pushed the codex/bench-validated-row-decode branch from 345e692 to d972932 Compare July 1, 2026 22:13
Comment thread arrow/benches/row_format.rs Outdated
.collect();
c.bench_function(&format!("convert_rows_validated {name}"), |b| {
b.iter(|| {
hint::black_box(

@alamb commented:

I profiled these benchmarks locally and the profile looks reasonable to me (it is validating the parsed rows)

Image

@alamb alamb marked this pull request as ready for review July 1, 2026 22:17

@Jefffrey Jefffrey left a comment

this is a good catch; didn't realize we needed this roundabout way of enabling validation

one concern i have here is this will add this new bench for non-string types creating more benchmark noise (and i think it ignores nulls too so the null config benches also get extra noise because of this)

probably want to config this so it only runs for string types (and probably just for the 0.0 null density case) 🤔

Comment thread arrow/benches/row_format.rs Outdated
// back into Arrow arrays by RowConverter::convert_rows.
let parsed_rows: Vec<_> = binary_rows
.iter()
.flatten()

@Jefffrey commented:

does this mean nulls are ignored in this bench?

@alamb replied:

Good call -- I fixed that by not converting into binary first

alamb commented Jul 2, 2026

> this is a good catch; didn't realize we needed this roundabout way of enabling validation

Yeah -- it took me a while too -- I think the idea is that Rows are assumed to be valid when produced by a row converter. It is only when they may have come from somewhere else (e.g. a spill file) that they need to be validated.
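
To make the trusted vs. validated distinction concrete, here is a minimal std-only sketch of the idea. It is a toy model, not the arrow implementation: `ToyRows`, `from_strs`, `parse`, and `decode` are hypothetical names invented for illustration, and the real `Rows` / `RowParser` internals may differ.

```rust
/// Toy stand-in for a row buffer. This is NOT the arrow `Rows` type.
struct ToyRows {
    /// Each row holds the raw bytes of one string value.
    rows: Vec<Vec<u8>>,
    /// True when the bytes came from an untrusted source and must be checked.
    validate_utf8: bool,
}

impl ToyRows {
    /// Trusted path: rows built from `&str` values are valid UTF-8 by construction.
    fn from_strs(values: &[&str]) -> Self {
        Self {
            rows: values.iter().map(|v| v.as_bytes().to_vec()).collect(),
            validate_utf8: false,
        }
    }

    /// Untrusted path: raw bytes (e.g. read back from a spill file) are flagged
    /// so that decoding validates them.
    fn parse(raw: Vec<Vec<u8>>) -> Self {
        Self { rows: raw, validate_utf8: true }
    }

    /// Decode every row to a `String`, validating only when the flag is set.
    fn decode(&self) -> Result<Vec<String>, std::str::Utf8Error> {
        self.rows
            .iter()
            .map(|bytes| {
                if self.validate_utf8 {
                    // Validated path: this extra check is the cost a
                    // validated-decode benchmark is meant to capture.
                    std::str::from_utf8(bytes).map(str::to_owned)
                } else {
                    // SAFETY: `validate_utf8` is false only for rows created by
                    // `from_strs`, whose bytes were copied from valid `&str` values.
                    Ok(unsafe { std::str::from_utf8_unchecked(bytes) }.to_owned())
                }
            })
            .collect()
    }
}

fn main() {
    let trusted = ToyRows::from_strs(&["arrow", "row"]);
    assert_eq!(trusted.decode().unwrap(), vec!["arrow", "row"]);

    // The same bytes routed through `parse` decode to the same values,
    // but pay for validation on the way.
    let parsed = ToyRows::parse(trusted.rows.clone());
    assert_eq!(parsed.decode().unwrap(), vec!["arrow", "row"]);

    // Corrupt bytes are rejected only because the parsed path validates.
    let corrupt = ToyRows::parse(vec![vec![0xff, 0xfe]]);
    assert!(corrupt.decode().is_err());
}
```

In this model a benchmark that only decodes `from_strs` rows never touches `std::str::from_utf8`, which is the kind of gap `convert_rows_validated` is meant to cover.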

> one concern i have here is this will add this new bench for non-string types creating more benchmark noise (and i think it ignores nulls too so the null config benches also get extra noise because of this)

@Jefffrey Jefffrey merged commit c7dc6b8 into apache:main Jul 2, 2026
22 of 27 checks passed
Jefffrey commented Jul 2, 2026

thanks @alamb
