perf(gzip): index BytesView directly in crc32_update by mizchi · Pull Request #388 · moonbitlang/async

mizchi · 2026-05-23T10:09:11Z

Summary

`crc32_update` iterates the input via `for byte in chunk`, where `chunk : BytesView`. That desugars through `BytesView::iter` + `Iter::next`, which allocates an iterator closure and pays a virtual dispatch per byte. In a `gzip_roundtrip` callgrind profile, that single loop accounts for ~16% of total instructions (`BytesView::iter` 9.80% + `Iter::next` 6.07%), even though the loop body is only a couple of arithmetic ops and a table lookup.

Switching to a direct index over the backing `Bytes` removes both the closure allocation and the iterator dispatch. Indexing `Bytes` with `[i]` is an intrinsic and is inlined.
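The indexed form described above could look roughly like the following. This is a minimal sketch, not the code from the patch: the table construction, the function signature, and the use of `data()`, `start_offset()`, and `length()` as the `BytesView` accessors are assumptions made for illustration.

```moonbit
///|
/// Lookup table for CRC-32 (reflected polynomial 0xEDB88320).
/// Assumed here for illustration; the real module may build its table differently.
let crc32_table : FixedArray[UInt] = FixedArray::makei(256, fn(n) {
  let mut c = n.reinterpret_as_uint()
  for k = 0; k < 8; k = k + 1 {
    c = if (c & 1U) != 0U { (c >> 1) ^ 0xEDB88320U } else { c >> 1 }
  }
  c
})

///|
/// Hypothetical indexed version of the update loop.
/// Takes and returns the raw CRC register; initial and final inversion are left to the caller.
fn crc32_update(crc : UInt, chunk : BytesView) -> UInt {
  // Read the backing storage and bounds once, outside the loop.
  let data = chunk.data()
  let start = chunk.start_offset()
  let len = chunk.length()
  let mut c = crc
  for i = 0; i < len; i = i + 1 {
    // Index the backing Bytes directly instead of going through an iterator.
    let byte = data[start + i].to_int().reinterpret_as_uint()
    c = crc32_table[((c ^ byte) & 0xFFU).reinterpret_as_int()] ^ (c >> 8)
  }
  c
}
```

With the usual CRC-32 framing (start from `0xFFFFFFFF`, invert at the end), `crc32_update(0xFFFFFFFFU, b"123456789"[:]) ^ 0xFFFFFFFFU` should give the standard check value `0xCBF43926`.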

Benchmark

Scenario: bench-async/cmd/gzip_roundtrip/main.mbt — 3 iter × 1000 chunks × 1024-byte payload, encode → pipe → decode. Linux x86_64, native build, 3-run median wall time:

| | baseline | patched | delta |
|---|---|---|---|
| native | 178 ms | 162 ms | -9.0% |

Callgrind self-time delta:

| symbol | before | after | Δ |
|---|---|---|---|
| `BytesView::iter` (closure) | 9.80% | - | gone |
| `Iter::next` | 6.07% | - | gone |
| `crc32_update` | 3.55% | 4.57% | +1.02% (body now visible directly) |

Net: about -14.85 percentage points on this hot path (-9.80 - 6.07 + 1.02), which is consistent in direction with the -9% wall-time drop.

Test results

moonbitlang/async/internal/gzip_internal  31 / 31 pass
moonbitlang/async/gzip                     7 /  7 pass
moonbitlang/async                         87 / 87 pass
moonbitlang/async/aqueue                  35 / 35 pass
moonbitlang/async/semaphore                6 /  6 pass
moonbitlang/async/cond_var                 5 /  5 pass
moonbitlang/async/internal/coroutine       1 /  1 pass
moonbitlang/async/internal/event_loop      7 /  7 pass
moonbitlang/async/internal/time            3 /  3 pass

Socket / TLS / HTTP / fs / process / websocket tests require a network interface this sandbox lacks; this patch does not touch any code on that path.

crc32_update iterates the input via 'for byte in chunk' where chunk is a BytesView. That desugars through BytesView::iter + Iter::next, which allocates an iterator closure and pays a virtual dispatch per byte. A gzip_roundtrip callgrind profile attributes ~16% of total instructions to that single loop (BytesView::iter 9.80% + Iter::next 6.07%), even though the loop body is a couple of arithmetic ops and a table lookup.

Pull out the backing Bytes + start offset + length once and index the raw Bytes directly. Bytes[i] is intrinsic and inlined.

gzip_roundtrip bench (native, 3-run median):
baseline: 178 ms
patched : 162 ms (-9.0%)

Guest0x0 · 2026-05-29T02:14:20Z

For builtin, array-like data structures, `for .. in` has a clearer shape and is hence easier to optimize. For `Array`, `FixedArray`, etc., the compiler can already specialize `for .. in` into a loop that is faster than a hand-written one, because the bounds check is omitted. So writing the loop as `for .. in` would be the better approach in the long term, IMO. If `for .. in` over `BytesView` has a performance problem, I would flag that as a missing compiler optimization.
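For contrast, the `for .. in` shape argued for here looks like this on a `BytesView`. This is a minimal sketch; `byte_sum` is a hypothetical helper that does not appear in this repository, and whether the loop compiles to a specialized form without bounds checks depends on the compiler version.

```moonbit
///|
/// Hypothetical helper: adds up the bytes of a view using the for .. in form.
fn byte_sum(chunk : BytesView) -> UInt {
  let mut sum = 0U
  // No explicit index or offset: the loop shape itself tells the compiler
  // that every element of the view is visited exactly once, in order.
  for byte in chunk {
    sum = sum + byte.to_int().reinterpret_as_uint()
  }
  sum
}
```

The loop carries no index arithmetic of its own, so any bounds-check elimination is left to the compiler rather than encoded by hand.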

Guest0x0 · 2026-05-29T08:45:40Z

It seems that for .. in (_ : BytesView) can now be optimized on nightly. Could you try your experiment again with nightly?

mizchi wants to merge 1 commit into moonbitlang:main from mizchi:pr-gzip-crc32-index-loop
