perf(compile): native-int modulo for `counter % literal` (#928) by nickna · Pull Request #963 · nickna/SharpTS

nickna · 2026-06-27T00:22:26Z

What & why (#928)

Native-int modulo fast path for the compiled IL backend: counter % integerLiteral is emitted as a native int64 rem + conv.r8 instead of an FP fmod on two doubles.

This came out of re-investigating #928. Split-isolation benchmarks overturned the issue's premise: the int32 gap is not caused by element representation.

- SharpTS runs int and float typed arrays at the same speed (the per-element ToInt32/conv.r8 conversions are negligible), and SharpTS's Float64 stencil read is already at parity with Node (Node's int32 speed comes from its Smi, i.e. small-integer, representation of integer data).
- The dominant per-iteration cost in numeric write kernels is the double `fmod` for `i % k`: it nearly doubles any kernel containing it (store64 16.4 ms → mod64 32.6 ms @1m), while a double multiply adds ~0.

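`i` is already an int64 loop-counter slot and `k` is an integer literal, so emitting `i % k` as an int64 `rem` is sound within the same exact-integer gate the int-counter optimization already accepts: C# `long %` and JS `%` both truncate toward zero and take the sign of the dividend, so they produce the same value for every |i| ≤ 2^53. (The one representational difference is the sign of a zero result: JS gives -0 for a negative exact multiple such as `-7 % 7`, while `conv.r8` of an int64 0 gives +0; the two compare equal under `===`.)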
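A quick way to sanity-check the "truncated, sign-of-dividend" argument is a differential in TypeScript itself. BigInt `%` also truncates toward zero and keeps the sign of the dividend, so it can stand in for an int64 `rem`. This is an illustrative sketch, not part of the PR's test suite; `int64Rem` and `checkParity` are hypothetical names. It compares with `===`, which treats -0 and +0 as equal, so it checks values rather than the sign of a zero result.

```typescript
// Hypothetical model of the fast path: int64 `rem`, then `conv.r8` back to a double.
// BigInt `%` truncates toward zero and keeps the dividend's sign, like C# `long %`.
function int64Rem(i: number, k: number): number {
  return Number(BigInt(i) % BigInt(k));
}

// Compare the modelled integer path against JS double `%` for every (i, k) pair.
// Returns the first mismatching pair, or null if all pairs agree.
function checkParity(dividends: number[], divisors: number[]): [number, number] | null {
  for (const i of dividends) {
    for (const k of divisors) {
      // `===` treats -0 and +0 as equal, so only the numeric value is compared.
      if (int64Rem(i, k) !== i % k) {
        return [i, k];
      }
    }
  }
  return null;
}

const LIMIT = 2 ** 53; // every integer with |i| <= 2^53 is exact as a double
const dividends: number[] = [];
for (let i = -50; i <= 50; i++) dividends.push(i);
dividends.push(LIMIT, LIMIT - 1, -LIMIT, -(LIMIT - 1));
const divisors = [1, 2, 3, 7, 10, -1, -3, -7, 1_000_003];

console.log(checkParity(dividends, divisors)); // null
```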

Performance (A/B, only the compile-time gate differs; min ms @1m)

Workload	off (min ms)	on (min ms)	speedup
typed-arrays (Float64)	4.65	2.55	1.82× (now ~2.2× faster than Node at 5.64 ms; ~1.2× behind Bun)
int32-kernel	5.27	3.77	1.40×
bare `s += i*3-(i%7)`	29.3	14.8	1.99×
kernels with no `%` (store/mul/accumulate)	—	unchanged	1.0×
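For reference, the "bare" row times a loop of roughly the following shape. This is a hedged reconstruction from the expression in the table (the benchmark source is not shown here, and `bareKernel` is a hypothetical name). Only `i % 7` matches the counter-%-literal gate; `i * 3` and the subtraction stay double, which fits the measurement that the multiply is free and the gain comes from the modulo.

```typescript
// Hedged reconstruction of the "bare" workload; n = 1_000_000 corresponds to "@1m".
function bareKernel(n: number): number {
  let s = 0;
  for (let i = 0; i < n; i++) {
    // `i % 7`: loop counter % non-zero integer literal, the shape the fast path targets.
    // `i * 3` and the subtraction remain ordinary double arithmetic.
    s += i * 3 - (i % 7);
  }
  return s;
}

console.log(bareKernel(7));  // 42
console.log(bareKernel(14)); // 231
```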

Correctness & soundness

4-way byte-identical (compiled-on / compiled-off / interpreter / Node) across an edge battery: negative dividend, negative divisor, i±k dividends, %1, descending counters, modulo embedded in a mixed-double kernel.
New SharpTS.Tests/SharedTests/ModuloParityTests.cs — 8 cases × both modes (16 tests), every expected value ground-truthed against Node; includes the divisor-0 → NaN fallback (declined by the fast path, no DivideByZeroException) and the non-counter-dividend fallback.
Full suite 14192/0 green with the opt default-on. ILVerify-clean (covered by ILVerificationTests).
Test262 (interp + compiled): the default subset is already red on main. The compiled host hits a fatal CLR error (0x80131506) that aborts the run before the compiled differ executes, and the interpreted baseline is stale and flaky (timeouts). Both symptoms reproduce identically with the opt disabled (SHARPTS_INT_MOD=0): neither run reaches the [Compiled] differ, and the interpreted soft-change count even varies between two runs of the byte-identical interpreter (42 vs 19). That shows harness nondeterminism, not a regression from this change, which is compiled-only and cannot affect the interpreter. The subset also omits the multiplicative/modulus operator folders, and Test262's modulus tests use constant operands that this loop-counter fast path never matches. So Test262 gives no signal here either way; the gate for this change is the dual-mode xUnit suite, the parity tests, and the 4-way differential above. The pre-existing crash is unrelated infra debt worth a separate issue.
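The divisor-0 fallback described above is why the fast path has to decline rather than emit `rem` unconditionally: JS `i % 0` is `NaN`, whereas an integer remainder by zero traps (DivideByZeroException in .NET; the BigInt analogue below throws a RangeError). A minimal sketch of the decline-then-fall-back shape follows. `tryIntegerModulo` and `modulo` are hypothetical names, not SharpTS's ILEmitter API, and the real decision is made at compile time from the literal; the run-time check here only makes the logic visible.

```typescript
// Hypothetical model of "decline, then fall back to the double path".
const EXACT_LIMIT = 2 ** 53; // the |dividend| <= 2^53 range the description relies on

function tryIntegerModulo(dividend: number, divisor: number): number | undefined {
  // Divisor 0 would trap in an integer remainder; a non-integer divisor is not an int literal.
  if (divisor === 0 || !Number.isInteger(divisor)) return undefined;
  // Outside the exact-integer range the int64 model no longer applies.
  if (!Number.isInteger(dividend) || Math.abs(dividend) > EXACT_LIMIT) return undefined;
  // BigInt `%` stands in for int64 `rem` (truncated, sign of dividend).
  return Number(BigInt(dividend) % BigInt(divisor));
}

function modulo(dividend: number, divisor: number): number {
  const fast = tryIntegerModulo(dividend, divisor);
  return fast !== undefined ? fast : dividend % divisor; // declined: ordinary double `%`
}

console.log(modulo(10, 3));  // 1 (integer path)
console.log(modulo(-10, 3)); // -1 (sign of the dividend)
console.log(modulo(10, 0));  // NaN (declined, double path)

// Why declining matters: an unguarded integer remainder by zero traps instead of giving NaN.
try {
  console.log(BigInt(10) % BigInt(0));
} catch (e) {
  console.log(e instanceof RangeError); // true
}
```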

Scope / design

Only `%` goes native. Multiply stays double (measured as free); division must stay double too, because JS `/` is true division, not integer division.
The fast path fires only when the left operand is a recognized integer-counter expression (`i`, `i±k`, `k+i`) and the right operand is a non-zero integer literal; everything else falls through to the existing double path.
Gated by SHARPTS_INT_MOD (default-on kill-switch, mirroring SHARPTS_INT_LOOP_COUNTER).
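The gate described above amounts to a small syntactic check on the `%` node. The sketch below models it over a made-up mini-AST. The `Expr` type and all function names are hypothetical, not SharpTS's ForLoopAnalyzer or ILEmitter API, and the set of known int64 counters is assumed to come from the existing int-counter analysis.

```typescript
// Hypothetical mini-AST for illustration; SharpTS's real expression types differ.
type Expr =
  | { kind: "ident"; name: string }
  | { kind: "num"; value: number }
  | { kind: "binary"; op: "+" | "-" | "*" | "/" | "%"; left: Expr; right: Expr };

// True for an identifier the (assumed) int-counter analysis proved to be an int64 counter.
function isCounter(e: Expr, counters: Set<string>): boolean {
  return e.kind === "ident" && counters.has(e.name);
}

function isIntLiteral(e: Expr): boolean {
  return e.kind === "num" && Number.isInteger(e.value);
}

// Accepted left operands: i, i + k, i - k, k + i (i a counter, k an integer literal).
function isCounterExpr(e: Expr, counters: Set<string>): boolean {
  if (isCounter(e, counters)) return true;
  if (e.kind !== "binary") return false;
  if ((e.op === "+" || e.op === "-") && isCounter(e.left, counters) && isIntLiteral(e.right)) {
    return true;
  }
  return e.op === "+" && isIntLiteral(e.left) && isCounter(e.right, counters);
}

// Fires only for counterExpr % nonZeroIntegerLiteral; everything else keeps the double path.
function qualifiesForIntModulo(e: Expr, counters: Set<string>): boolean {
  if (e.kind !== "binary" || e.op !== "%") return false;
  const right = e.right;
  return (
    isCounterExpr(e.left, counters) &&
    right.kind === "num" &&
    Number.isInteger(right.value) &&
    right.value !== 0
  );
}

// Usage
const i: Expr = { kind: "ident", name: "i" };
const lit = (value: number): Expr => ({ kind: "num", value });
const mod = (left: Expr, right: Expr): Expr => ({ kind: "binary", op: "%", left, right });
const counters = new Set(["i"]);

console.log(qualifiesForIntModulo(mod(i, lit(7)), counters)); // true
console.log(qualifiesForIntModulo(mod(i, lit(0)), counters)); // false: divisor 0 keeps the double path
console.log(qualifiesForIntModulo(mod({ kind: "ident", name: "x" }, lit(7)), counters)); // false: not a counter
```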

Interop

Safe by construction: the change operates on counter values, never on the byte[] typed-array backing, so ArrayBuffer/DataView/Atomics and the $Runtime/$Array host types are untouched. The emitted IL is plain CIL opcodes (ldloc/ldc.i8/rem/conv.r8), so standalone DLLs gain no SharpTS.dll reference.


Emit `i % k` (loop-counter dividend, integer-literal divisor) as a native int64 `rem` + `conv.r8` instead of an FP `fmod` on two doubles. Split-isolation benchmarks for #928 showed the `fmod` is the dominant per-iteration cost in numeric write kernels (it ~doubles them); double multiply is already free and the int-vs-float element representation is not the bottleneck.

Sound within the existing int-counter gate: C# `long %` and JS `%` are both truncated (sign of dividend), so the result is numerically identical to the double computation for every |dividend| <= 2^53. Only `%` goes native; `/` stays double (JS `/` is true division). Divisor 0 falls through to the double path (NaN), so no DivideByZeroException.

Perf @1m (min ms): typed-arrays Float64 4.65->2.55 (1.82x, now beats Node ~2.2x); int32-kernel 5.27->3.77 (1.40x); non-modulo kernels unchanged.

- ForLoopAnalyzer.IntegerModuloEnabled (kill-switch SHARPTS_INT_MOD=0)
- ILEmitter.Operators: TryEmitIntegerCounterModulo + TryEmitIntegerCounterValueI8
- ModuloParityTests: 8 cases x both modes, Node-ground-truthed

Interop-safe: operates on counter values, not the byte[] backing; pure CIL opcodes, ILVerify-clean, no SharpTS.dll reference in standalone output. Full xUnit suite 14192/0 green with the opt default-on.

nickna · 2026-06-27T01:03:51Z

The pre-existing Test262 compiled-host crash referenced above (Test262 gives no usable signal for this change) is now tracked as #964 — it reproduces with this PR's optimization disabled and with a freshly-built worker, so it is unrelated to this change.

nickna · 2026-06-27T01:19:46Z

Update — clean compiled Test262 signal now available (via the #964 workaround of running the CompiledBaseline filter in isolation, since the unfiltered run crashes the host for unrelated reasons):

[Compiled] baseline drift: 0 regressions, 8 new passes, 0 new/removed entries  (16m44s, 11384 tests)

Zero compiled-mode regressions from this change. The 8 new passes (Array.join/toString/with, Number.toFixed, String.split) and the single soft change (String.indexOf bigint) are all unrelated to modulo — pre-existing main drift. This supersedes the "Test262 gives no signal" note in the PR description: compiled conformance is confirmed neutral, as expected for a pure representation optimization.

This was referenced Jun 27, 2026

Typed-array perf: native-int arithmetic path (close int32 ~3.7x Node) + double[]/Span+SIMD stencil floor #928 (closed)

Test262: full-subset dotnet test crashes the test host (Internal CLR error 0x80131506) before the compiled differ runs #964 (closed)

nickna merged commit 9a50249 into main Jun 27, 2026
2 checks passed

nickna mentioned this pull request Jun 27, 2026

fix(test262): stop in-process collectible-ALC churn crashing the testhost (#964) #965 (merged)

nickna merged 1 commit into main from wrk/issue-928-int-modulo
