perf(gzip): index BytesView directly in crc32_update #388

Open: mizchi wants to merge 1 commit into moonbitlang:main from mizchi:pr-gzip-crc32-index-loop

@mizchi (Contributor) commented May 23, 2026

Summary

`crc32_update` iterates over its input with `for byte in chunk`, where `chunk : BytesView`. That loop desugars to `BytesView::iter` + `Iter::next`, which allocates an iterator closure and pays a virtual dispatch per byte. In a callgrind profile of `gzip_roundtrip`, that single loop accounts for ~16% of total instructions (`BytesView::iter` 9.80% + `Iter::next` 6.07%), even though the loop body is only a couple of arithmetic ops and a table lookup.

Indexing the backing `Bytes` directly, with the start offset and length read once up front, removes both the closure allocation and the per-byte iterator dispatch. `Bytes[i]` is an intrinsic and is inlined.
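To make the two loop shapes concrete, here is a minimal sketch in Rust. It is an analogy, not the MoonBit code from this patch: the function names are hypothetical, a boxed `dyn Iterator` stands in for the closure-based iterator, and the table is the standard reflected CRC-32 table (polynomial `0xEDB88320`) that gzip uses. Whether the real `crc32_update` takes the already-inverted state, as assumed here, is not confirmed by this PR.

```rust
/// Build the 256-entry lookup table for the reflected CRC-32 used by gzip
/// (polynomial 0xEDB88320).
fn make_table() -> [u32; 256] {
    let mut table = [0u32; 256];
    for i in 0..256u32 {
        let mut c = i;
        for _ in 0..8 {
            c = if c & 1 != 0 { 0xEDB8_8320 ^ (c >> 1) } else { c >> 1 };
        }
        table[i as usize] = c;
    }
    table
}

/// Iterator-shaped loop (hypothetical name). The bytes arrive through a boxed
/// trait object, so the iterator is heap-allocated and every `next` call is a
/// dynamic dispatch. This stands in for the closure-based iterator.
fn crc32_update_iter(table: &[u32; 256], crc: u32, chunk: &[u8]) -> u32 {
    let bytes: Box<dyn Iterator<Item = u8> + '_> = Box::new(chunk.iter().copied());
    let mut crc = crc;
    for byte in bytes {
        crc = table[((crc ^ byte as u32) & 0xFF) as usize] ^ (crc >> 8);
    }
    crc
}

/// Index-shaped loop (hypothetical name). The view is described by a backing
/// buffer plus a start offset and a length, and the buffer is indexed directly.
fn crc32_update_indexed(
    table: &[u32; 256],
    crc: u32,
    backing: &[u8],
    start: usize,
    len: usize,
) -> u32 {
    let mut crc = crc;
    for i in start..start + len {
        crc = table[((crc ^ backing[i] as u32) & 0xFF) as usize] ^ (crc >> 8);
    }
    crc
}
```

Both functions take and return the raw running state, so a caller starts from `0xFFFF_FFFF` and XORs the final state with `0xFFFF_FFFF` to obtain the finished checksum. The loop bodies are identical; only the way each byte is fetched differs, which is the part this patch changes.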

Benchmark

Scenario: `bench-async/cmd/gzip_roundtrip/main.mbt`, 3 iterations × 1000 chunks × 1024-byte payload, encode → pipe → decode. Linux x86_64, native build, median wall time over 3 runs:

| | baseline | patched | delta |
|---|---|---|---|
| native | 178 ms | 162 ms | -9.0% |
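For reference, the delta column can be reproduced from the two medians with a small helper. This is a sketch in Rust with hypothetical function names; the individual run times are not reported in this PR, so `median3` is only shown to illustrate how a 3-run median would be taken.

```rust
/// Relative change from baseline to patched, in percent (negative means faster).
fn percent_delta(baseline_ms: f64, patched_ms: f64) -> f64 {
    (patched_ms - baseline_ms) / baseline_ms * 100.0
}

/// Median of three samples: the value that is neither the minimum nor the maximum.
fn median3(a: f64, b: f64, c: f64) -> f64 {
    a.max(b).min(a.min(b).max(c))
}
```

Calling `percent_delta(178.0, 162.0)` gives roughly -8.99, which rounds to the -9.0% reported above.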

Callgrind self-time delta:

| symbol | before | after | Δ |
|---|---|---|---|
| `BytesView::iter` (closure) | 9.80% | gone | |
| `Iter::next` | 6.07% | gone | |
| `crc32_update` | 3.55% | 4.57% | +1.02% (body now visible directly) |

Net: -9.85% on this hot path, which lines up with the -9% wall-time drop.

Test results

moonbitlang/async/internal/gzip_internal  31 / 31 pass
moonbitlang/async/gzip                     7 /  7 pass
moonbitlang/async                         87 / 87 pass
moonbitlang/async/aqueue                  35 / 35 pass
moonbitlang/async/semaphore                6 /  6 pass
moonbitlang/async/cond_var                 5 /  5 pass
moonbitlang/async/internal/coroutine       1 /  1 pass
moonbitlang/async/internal/event_loop      7 /  7 pass
moonbitlang/async/internal/time            3 /  3 pass

Socket / TLS / HTTP / fs / process / websocket tests require a network interface this sandbox lacks; this patch does not touch any code on that path.

crc32_update iterates the input via 'for byte in chunk' where chunk
is a BytesView. That desugars through BytesView::iter + Iter::next,
which allocates an iterator closure and pays a virtual dispatch per
byte. A gzip_roundtrip callgrind profile attributes ~16% of total
instructions to that single loop (BytesView::iter 9.80% +
Iter::next 6.07%), even though the loop body is a couple of
arithmetic ops and a table lookup.

Pull out the backing Bytes + start offset + length once and index
the raw Bytes directly. Bytes[i] is intrinsic and inlined.

gzip_roundtrip bench (native, 3-run median):
  baseline: 178 ms
  patched : 162 ms  (-9.0%)

@Guest0x0 (Collaborator)

For builtin, array-like data structures, `for .. in` has a clearer shape and is hence easier to optimize. For `Array`, `FixedArray`, etc., the compiler can already specialize `for .. in` into a loop that is faster than a hand-written one (because the bounds check is omitted). So writing the loop as `for .. in` would be the better approach in the long term, IMO. If `for .. in` over `BytesView` has a performance problem, I would flag that as a missing compiler optimization.

@Guest0x0 (Collaborator)

It seems that `for .. in (_ : BytesView)` can now be optimized on nightly. Could you try your experiment again with nightly?
