
Perf: Introduce zero copy path when tonic returns an aligned buffer #10273

Open
Rich-T-kid wants to merge 2 commits into apache:main from Rich-T-kid:rich-T-kid/avoid-copies-if-aligned


@Rich-T-kid Rich-T-kid commented Jul 2, 2026

Contributor

Which issue does this PR close?

Rationale for this change

See #10206 (comment).

If the buffer Tonic returns happens to be 64-bit aligned (as the Arrow spec requires), we can wrap it directly via Buffer::from(Bytes), a zero-copy path that just increments a reference count. If not, Buffer::from(&[u8]) copies into a fresh aligned allocation. The check is cheap (a single pointer modulo), and can save a potentially very large buffer copy.
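
A minimal sketch of the check described above, using only the standard library so it runs without the arrow or bytes crates. The names `is_aligned`, `wrap_body`, `first_aligned_offset`, and `ALIGN` are hypothetical and not taken from this PR. `Cow::Borrowed` stands in for the zero-copy `Buffer::from(Bytes)` path (which shares ownership rather than borrowing), and `Cow::Owned` stands in for the copying `Buffer::from(&[u8])` path.

```rust
use std::borrow::Cow;

/// Alignment, in bytes, that this sketch checks for (64 bits = 8 bytes).
const ALIGN: usize = 8;

/// Hypothetical helper: true when `data` starts at an address that is a
/// multiple of `align`. This is the single pointer-modulo check.
fn is_aligned(data: &[u8], align: usize) -> bool {
    (data.as_ptr() as usize) % align == 0
}

/// Hypothetical stand-in for the two paths: reuse the bytes when they are
/// already aligned, otherwise copy them. A real implementation would copy
/// into an allocation with guaranteed alignment; `Vec<u8>` only marks the
/// copy path here and makes no alignment promise.
fn wrap_body(data: &[u8]) -> Cow<'_, [u8]> {
    if is_aligned(data, ALIGN) {
        Cow::Borrowed(data)
    } else {
        Cow::Owned(data.to_vec())
    }
}

/// Offset of the first `ALIGN`-aligned address inside `buf`, used only to
/// build aligned and misaligned inputs for experimenting with the check.
fn first_aligned_offset(buf: &[u8]) -> usize {
    let addr = buf.as_ptr() as usize;
    (ALIGN - addr % ALIGN) % ALIGN
}
```

Slicing a backing array at `first_aligned_offset` gives an input that takes the borrowed path, and slicing one byte later gives an input that takes the copy path, which is one way to exercise both branches in a unit test.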

What changes are included in this PR?

This PR adds a pointer alignment check for the data body. If the buffer starts at a 64-bit-aligned address, no copy happens; otherwise the bytes are copied as before.

Are these changes tested?

Existing tests cover this. If the buffer isn't aligned, we fall back to copying it.

Are there any user-facing changes?

No.

@github-actions github-actions Bot added the arrow (Changes to the arrow crate) and arrow-flight (Changes to the arrow-flight crate) labels Jul 2, 2026
Rich-T-kid commented Jul 5, 2026

Contributor Author

@Jefffrey can you take a look at this? Thank you

Jefffrey commented Jul 5, 2026

Contributor

run benchmark flight

@adriangbot


🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4885401671-847-pn47q 6.12.85+ #1 SMP Mon May 11 08:17:35 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing rich-T-kid/avoid-copies-if-aligned (1af08d7) to d969025 (merge-base) diff
BENCH_NAME=flight
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench flight
BENCH_FILTER=
Results will be posted here when complete



@adriangbot


🤖 Arrow criterion benchmark completed (GKE) | trigger

Details

group                                     main                                    rich-T-kid_avoid-copies-if-aligned
-----                                     ----                                    ----------------------------------
decode/fixed/65536x4                      2.17    253.1±2.33µs    30.9 GB/sec     1.00    116.6±3.25µs    67.0 GB/sec
decode/fixed/65536x8                      2.61  718.1±145.54µs    21.8 GB/sec     1.00   275.0±21.62µs    56.8 GB/sec
decode/fixed/8192x4                       1.23     27.7±0.29µs    35.3 GB/sec     1.00     22.5±0.15µs    43.4 GB/sec
decode/fixed/8192x8                       1.96     65.5±0.98µs    29.9 GB/sec     1.00     33.5±0.15µs    58.4 GB/sec
decode/nested/65536x4                     1.00      2.9±0.64ms     6.6 GB/sec     1.05      3.1±0.67ms     6.3 GB/sec
decode/nested/65536x8                     1.00     11.7±1.48ms     3.3 GB/sec     1.03     12.1±1.37ms     3.2 GB/sec
decode/nested/8192x4                      1.17   359.9±86.49µs     6.8 GB/sec     1.00   308.6±83.10µs     7.9 GB/sec
decode/nested/8192x8                      1.02  728.3±163.07µs     6.7 GB/sec     1.00  711.4±166.24µs     6.9 GB/sec
decode/variable/65536x4                   1.09      5.4±0.72ms     6.6 GB/sec     1.00      4.9±0.60ms     7.1 GB/sec
decode/variable/65536x8                   2.03     21.1±1.55ms     3.3 GB/sec     1.00     10.4±1.44ms     6.8 GB/sec
decode/variable/8192x4                    1.11  616.8±102.72µs     7.1 GB/sec     1.00   557.0±97.44µs     7.9 GB/sec
decode/variable/8192x8                    1.15  1220.3±173.77µs     7.2 GB/sec    1.00  1057.0±172.42µs     8.3 GB/sec
decode_stream/dict/65536x4x4              1.05  772.4±149.91µs     5.1 GB/sec     1.00  733.2±135.87µs     5.4 GB/sec
decode_stream/dict/65536x8x4              1.06  1606.2±225.41µs     4.9 GB/sec    1.00  1518.7±270.30µs     5.2 GB/sec
decode_stream/dict/8192x4x4               1.07     99.9±0.82µs     5.1 GB/sec     1.00     93.2±0.54µs     5.5 GB/sec
decode_stream/dict/8192x8x4               1.13    205.8±1.83µs     5.0 GB/sec     1.00    181.6±0.78µs     5.6 GB/sec
decode_stream/fixed/65536x4x4             1.03   258.4±16.81µs    30.2 GB/sec     1.00    251.6±1.62µs    31.1 GB/sec
decode_stream/fixed/65536x8x4             1.32    542.8±5.29µs    28.8 GB/sec     1.00    411.0±2.72µs    38.0 GB/sec
decode_stream/fixed/8192x4x4              1.03     28.7±0.15µs    34.0 GB/sec     1.00     28.0±0.32µs    34.9 GB/sec
decode_stream/fixed/8192x8x4              1.38     61.7±1.25µs    31.7 GB/sec     1.00     44.7±0.39µs    43.8 GB/sec
decode_stream/nested/65536x4x4            1.11      3.0±0.66ms     6.5 GB/sec     1.00      2.7±0.71ms     7.2 GB/sec
decode_stream/nested/65536x8x4            1.09      6.1±1.35ms     6.4 GB/sec     1.00      5.6±1.35ms     7.0 GB/sec
decode_stream/nested/8192x4x4             1.13   350.9±83.54µs     7.0 GB/sec     1.00   309.3±83.47µs     7.9 GB/sec
decode_stream/nested/8192x8x4             1.17  726.3±162.77µs     6.7 GB/sec     1.00  620.2±165.53µs     7.9 GB/sec
decode_stream/variable/65536x4x4          1.02      5.2±0.70ms     6.7 GB/sec     1.00      5.1±0.66ms     6.9 GB/sec
decode_stream/variable/65536x8x4          1.48     19.7±1.40ms     3.6 GB/sec     1.00     13.3±1.40ms     5.3 GB/sec
decode_stream/variable/8192x4x4           1.07   579.2±91.45µs     7.6 GB/sec     1.00   543.7±89.97µs     8.1 GB/sec
decode_stream/variable/8192x8x4           1.06  1245.7±155.17µs     7.1 GB/sec    1.00  1174.4±194.35µs     7.5 GB/sec
do_put_dictionary/dict/hydrate/65536x4    1.00  1260.3±76.03µs   798.0 MB/sec     1.02  1285.2±67.43µs   782.5 MB/sec
do_put_dictionary/dict/hydrate/65536x8    1.00      2.5±0.05ms   795.3 MB/sec     1.00      2.5±0.02ms   798.6 MB/sec
do_put_dictionary/dict/hydrate/8192x4     1.00    173.5±2.80µs   753.2 MB/sec     1.00    173.7±0.53µs   752.0 MB/sec
do_put_dictionary/dict/hydrate/8192x8     1.04   352.4±10.01µs   741.5 MB/sec     1.00    338.0±4.39µs   773.0 MB/sec
do_put_dictionary/dict/resend/65536x4     1.00    240.6±0.49µs     4.1 GB/sec     1.02    245.8±3.39µs     4.0 GB/sec
do_put_dictionary/dict/resend/65536x8     1.00   463.5±18.82µs     4.2 GB/sec     1.06   493.1±16.89µs     4.0 GB/sec
do_put_dictionary/dict/resend/8192x4      1.01     41.3±0.27µs     3.1 GB/sec     1.00     40.9±0.31µs     3.1 GB/sec
do_put_dictionary/dict/resend/8192x8      1.02     69.8±0.38µs     3.7 GB/sec     1.00     68.6±0.44µs     3.7 GB/sec
encode/fixed/65536x4                      1.00     49.6±0.15µs    39.4 GB/sec     1.01     50.3±0.33µs    38.9 GB/sec
encode/fixed/65536x8                      1.06  1111.0±39.45µs     3.5 GB/sec     1.00   1044.8±2.29µs     3.7 GB/sec
encode/fixed/8192x4                       1.00      8.3±0.02µs    29.4 GB/sec     1.01      8.4±0.03µs    29.3 GB/sec
encode/fixed/8192x8                       1.01     16.4±0.04µs    29.8 GB/sec     1.00     16.3±0.05µs    30.0 GB/sec
encode/nested/65536x4                     1.00  1453.2±27.08µs     3.4 GB/sec     1.13  1641.3±147.52µs     3.0 GB/sec
encode/nested/65536x8                     1.02      2.9±0.10ms     3.3 GB/sec     1.00      2.9±0.01ms     3.4 GB/sec
encode/nested/8192x4                      1.02     20.4±0.07µs    29.9 GB/sec     1.00     19.9±0.02µs    30.7 GB/sec
encode/nested/8192x8                      1.00     46.3±0.10µs    26.4 GB/sec     1.03     47.6±0.24µs    25.7 GB/sec
encode/variable/65536x4                   1.08      2.5±0.24ms     3.5 GB/sec     1.00      2.3±0.01ms     3.8 GB/sec
encode/variable/65536x8                   1.03      5.3±0.50ms     3.3 GB/sec     1.00      5.2±0.22ms     3.4 GB/sec
encode/variable/8192x4                    1.01     25.3±0.07µs    43.4 GB/sec     1.00     25.1±0.06µs    43.7 GB/sec
encode/variable/8192x8                    1.07     83.1±1.92µs    26.5 GB/sec     1.00     77.3±0.22µs    28.4 GB/sec
roundtrip/fixed/65536x4                   1.01  1320.9±89.70µs  1514.4 MB/sec     1.00  1307.9±86.39µs  1529.4 MB/sec
roundtrip/fixed/65536x8                   1.02      2.3±0.14ms  1728.2 MB/sec     1.00      2.3±0.13ms  1761.1 MB/sec
roundtrip/fixed/8192x4                    1.00    202.3±2.32µs  1237.8 MB/sec     1.00    202.2±2.82µs  1238.3 MB/sec
roundtrip/fixed/8192x8                    1.00    344.4±3.83µs  1453.8 MB/sec     1.01    346.9±5.32µs  1443.4 MB/sec
roundtrip/nested/65536x4                  1.04      4.2±0.25ms  1179.0 MB/sec     1.00      4.1±0.14ms  1225.1 MB/sec
roundtrip/nested/65536x8                  1.04      8.7±0.36ms  1150.0 MB/sec     1.00      8.3±0.28ms  1198.0 MB/sec
roundtrip/nested/8192x4                   1.00   471.5±22.41µs  1327.2 MB/sec     1.01   475.6±20.85µs  1316.0 MB/sec
roundtrip/nested/8192x8                   1.02   933.6±40.53µs  1340.6 MB/sec     1.00   915.8±38.81µs  1366.6 MB/sec
roundtrip/variable/65536x4                1.00      7.6±0.31ms  1176.6 MB/sec     1.03      7.8±0.60ms  1147.5 MB/sec
roundtrip/variable/65536x8                1.01     14.2±0.56ms  1266.6 MB/sec     1.00     14.1±0.46ms  1279.9 MB/sec
roundtrip/variable/8192x4                 1.00   736.9±42.74µs  1527.6 MB/sec     1.00   733.4±40.19µs  1534.8 MB/sec
roundtrip/variable/8192x8                 1.04  1279.0±59.10µs  1760.3 MB/sec     1.00  1228.5±22.02µs  1832.7 MB/sec
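
For reading the table: the leading number in each column appears to be that side's mean time divided by the faster side's mean time, so 1.00 marks the faster side. The sketch below reproduces that arithmetic for a couple of rows; `relative_time` is a hypothetical helper written for illustration, not part of the benchmark tooling.

```rust
/// Hypothetical helper: time of one side relative to the faster side,
/// rounded to two decimals like the ratio columns in the table.
fn relative_time(mean: f64, fastest: f64) -> f64 {
    (mean / fastest * 100.0).round() / 100.0
}

/// Ratio for decode/fixed/65536x4: main 253.1µs vs branch 116.6µs.
fn decode_fixed_65536x4_ratio() -> f64 {
    relative_time(253.1, 116.6)
}

/// Ratio for decode/fixed/8192x8: main 65.5µs vs branch 33.5µs.
fn decode_fixed_8192x8_ratio() -> f64 {
    relative_time(65.5, 33.5)
}
```

With the means from the table, these evaluate to 2.17 and 1.96, matching the ratios printed for main on those two rows.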

Resource Usage

base (merge-base)

Metric         Value
------         -----
Wall time      635.1s
Peak memory    172.4 MiB
Avg memory     71.5 MiB
CPU user       592.0s
CPU sys        94.4s
Peak spill     0 B

branch

Metric         Value
------         -----
Wall time      605.1s
Peak memory    151.9 MiB
Avg memory     77.0 MiB
CPU user       574.4s
CPU sys        83.0s
Peak spill     0 B

