Validate short view strings in separate buffer in arrow-row#10250
Conversation
|
run benchmark row_format |
| } | ||
|
|
||
| fn decode_binary_view_inner( | ||
| fn decode_binary_view_inner<const VALIDATE_UTF8: bool>( |
There was a problem hiding this comment.
decided to take this opportunity to hoist the validate_utf8 option into a generic since it was used in the hot loop
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing row-short-utf8-validation (23069fd) to fbe75a3 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
| } | ||
| let mut values = MutableBuffer::new(values_capacity); | ||
| let mut view_utf8_validation_buffer = if VALIDATE_UTF8 { | ||
| Vec::with_capacity(inline_capacity) |
There was a problem hiding this comment.
so this is a new allocation, but given the benchmark it seems to be more than paid for with the additional performance
| } | ||
|
|
||
| #[test] | ||
| fn test_values_buffer_smaller_when_utf8_validation_disabled() { |
There was a problem hiding this comment.
maybe it is worth updating this test to verify that when utf8 validation is enabled, the values only contain data for the long strings (not the short strings(
There was a problem hiding this comment.
good point, repurposed the test
# Which issue does this PR close? - related to #10250 from @Jefffrey # Rationale for this change The existing row format benchmark measures `convert_rows` using rows created directly by `RowConverter::convert_columns`, which skips UTF-8 validation during decode. # What changes are included in this PR? This adds a `convert_rows_validated` benchmark that decodes rows parsed through `RowParser`, exercising the UTF-8 validation path. It also derives `Clone` for `Rows` so the benchmark setup can reuse the prepared rows when converting to binary. # Are these changes tested? CI. I also ran it locally like ```shell cargo bench --bench row_format -- convert_rows_validated ``` # Are there any user-facing changes? No breaking changes. `Rows` now implements `Clone`.
|
run benchmark row_format |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing row-short-utf8-validation (6d67b70) to c7dc6b8 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
run benchmark row_format |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing row-short-utf8-validation (6d67b70) to c7dc6b8 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
i can reproduce the regressions locally Detailsconvert_rows 4096 string view(10, 0)
time: [29.906 µs 29.955 µs 30.007 µs]
change: [−2.7118% −2.2274% −1.7427%] (p = 0.00 < 0.05)
Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
12 (12.00%) high mild
convert_rows_parsed 4096 string view(10, 0)
time: [40.673 µs 40.705 µs 40.740 µs]
change: [+23.796% +24.170% +24.558%] (p = 0.00 < 0.05)
Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
1 (1.00%) low severe
2 (2.00%) low mild
2 (2.00%) high mild
2 (2.00%) high severe
convert_rows 4096 string view(30, 0)
time: [38.753 µs 38.961 µs 39.200 µs]
change: [+1.8488% +2.1492% +2.4907%] (p = 0.00 < 0.05)
Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
3 (3.00%) high mild
3 (3.00%) high severe
convert_rows_parsed 4096 string view(30, 0)
time: [60.500 µs 60.579 µs 60.672 µs]
change: [+46.025% +46.395% +46.758%] (p = 0.00 < 0.05)
Performance has regressed.
Found 18 outliers among 100 measurements (18.00%)
9 (9.00%) high mild
9 (9.00%) high severe
convert_rows 4096 string view(100, 0)
time: [48.704 µs 48.803 µs 48.905 µs]
change: [+2.3709% +2.6566% +2.9531%] (p = 0.00 < 0.05)
Performance has regressed.
Found 9 outliers among 100 measurements (9.00%)
2 (2.00%) low mild
6 (6.00%) high mild
1 (1.00%) high severe
convert_rows_parsed 4096 string view(100, 0)
time: [75.288 µs 75.331 µs 75.380 µs]
change: [+36.667% +36.813% +36.944%] (p = 0.00 < 0.05)
Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
1 (1.00%) low mild
2 (2.00%) high mild
3 (3.00%) high severe
convert_rows 4096 string view(100, 0.5)
time: [42.943 µs 43.025 µs 43.099 µs]
change: [+2.5279% +2.7698% +3.0016%] (p = 0.00 < 0.05)
Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
2 (2.00%) low mild
2 (2.00%) high mild
convert_rows_parsed 4096 string view(100, 0.5)
time: [57.405 µs 57.465 µs 57.534 µs]
change: [+24.690% +25.082% +25.470%] (p = 0.00 < 0.05)
Performance has regressed.
Found 18 outliers among 100 measurements (18.00%)
2 (2.00%) low severe
1 (1.00%) low mild
12 (12.00%) high mild
3 (3.00%) high severe
convert_rows 4096 string view(1..100, 0)
time: [51.030 µs 51.123 µs 51.216 µs]
change: [+1.7675% +2.0033% +2.2338%] (p = 0.00 < 0.05)
Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
1 (1.00%) low mild
3 (3.00%) high mild
convert_rows_parsed 4096 string view(1..100, 0)
time: [75.340 µs 75.442 µs 75.553 µs]
change: [+38.508% +38.931% +39.346%] (p = 0.00 < 0.05)
Performance has regressed.
convert_rows 4096 string view(1..100, 0.5)
time: [37.833 µs 37.877 µs 37.921 µs]
change: [−0.3127% −0.0730% +0.1526%] (p = 0.56 > 0.05)
No change in performance detected.
Found 4 outliers among 100 measurements (4.00%)
1 (1.00%) low mild
3 (3.00%) high mild
convert_rows_parsed 4096 string view(1..100, 0.5)
time: [50.394 µs 50.441 µs 50.489 µs]
change: [+25.642% +25.928% +26.215%] (p = 0.00 < 0.05)
Performance has regressed.
Found 6 outliers among 100 measurements (6.00%)
1 (1.00%) low severe
4 (4.00%) low mild
1 (1.00%) high severemight need to think more on this |
| // truncate inline string in values buffer if validate_utf8 is false | ||
| if !validate_utf8 && decoded_len <= inline_str_max_len { | ||
| if VALIDATE_UTF8 { | ||
| view_utf8_validation_buffer.extend_from_slice(val); |
There was a problem hiding this comment.
oh wait i think i only need to append this val in the len check below 🤦
There was a problem hiding this comment.
fixed; that was an embarrassing mistake, thankfully the new benchmarks helped catch it 🙏
|
run benchmark row_format |
| // truncate inline string in values buffer if validate_utf8 is false | ||
| if !validate_utf8 && decoded_len <= inline_str_max_len { | ||
| if VALIDATE_UTF8 { | ||
| view_utf8_validation_buffer.extend_from_slice(val); |
There was a problem hiding this comment.
fixed; that was an embarrassing mistake, thankfully the new benchmarks helped catch it 🙏
| let mut views = BufferBuilder::<u128>::new(len); | ||
| for row in rows { | ||
| let mut views = vec![0_u128; len]; | ||
| for (i, row) in rows.iter_mut().enumerate() { |
There was a problem hiding this comment.
For good measure i also removed bufferbuilder in favour of a vec 🚀
|
run benchmark row_format |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing row-short-utf8-validation (f520cc0) to d969025 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing row-short-utf8-validation (f520cc0) to d969025 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
the slowdown in
in that i could try optimize it for the case where if we have more short strings than long strings (based on some ratio of |
|
I think now that validate_utf8 is a somewhat rare code path in the Rows code as it happens after externalizing the Rows to bytes and then parsing them back (I understand this after messing with #10259) Since most systems are likely going to be parsing row data back from trusted sources (itself, for example) maybe we could just offer an This will probably make things faster even than this optimized utf8 check 🤔 |
|
|
thanks @alamb |
Which issue does this PR close?
Rationale for this change
Reduce output string view values buffer size when decoding from row format & validating utf8
What changes are included in this PR?
Introduce separate vec to append inline short strings to for one shot utf8 validation; previously we mixed short + long strings in the values buffer for ease of utf8 validation, but this means the output view array has more memory usage than strictly required.
Are these changes tested?
Existing tests.
Are there any user-facing changes?
No.