Add a full f64 "double window" to the Stack VM by wtholliday · Pull Request #19 · audulus/lyte

wtholliday · 2026-06-02T00:05:39Z

What & why

f64 was second-class in the stack VM: values rode the integer TOS window as bit patterns, paying a GPR↔FP crossing on every op, with zero fusion. The biquad_f64 benchmark exposed this — it ran ~2.9× slower than the f32 biquad (≈45 generic int-window ops per sample vs a handful of fused float-window superinstructions). This was a documented "deferred" decision (docs/FP_CODEGEN_PLAN.md §6.2, Option B).

This PR gives f64 its own parallel double register window (d0..d3 + a dfsp spill pointer) — the exact analogue of the existing f32 float window.
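To make the window idea concrete, here is a minimal Rust model of a four-slot top-of-stack window backed by a spill area. It is only a sketch: `DWindow` and its methods are hypothetical names, the real handlers are C and keep d0..d3 in FP argument registers, and the `Vec` merely stands in for the memory that dfsp would point into.

```rust
/// Hypothetical model of a four-slot f64 top-of-stack window.
/// `regs[0]` plays the role of d0 (the top of stack); `spill` stands in
/// for the memory the dfsp spill pointer would address.
pub struct DWindow {
    regs: [f64; 4],
    live: usize,     // how many of d0..d3 currently hold live values
    spill: Vec<f64>, // values pushed out of the bottom of the window
}

impl DWindow {
    pub fn new() -> Self {
        DWindow { regs: [0.0; 4], live: 0, spill: Vec::new() }
    }

    /// Total number of values on the modeled stack.
    pub fn depth(&self) -> usize {
        self.live + self.spill.len()
    }

    /// Number of values currently living in the spill area.
    pub fn spilled(&self) -> usize {
        self.spill.len()
    }

    pub fn push(&mut self, v: f64) {
        if self.live == 4 {
            // Window full: the deepest register (d3) spills to memory.
            self.spill.push(self.regs[3]);
        } else {
            self.live += 1;
        }
        // Shift d0->d1, d1->d2, d2->d3, then place the new value in d0.
        self.regs.copy_within(0..3, 1);
        self.regs[0] = v;
    }

    pub fn pop(&mut self) -> Option<f64> {
        if self.live == 0 {
            return None;
        }
        let top = self.regs[0];
        // Shift d1->d0, d2->d1, d3->d2.
        self.regs.copy_within(1..4, 0);
        match self.spill.pop() {
            Some(v) => self.regs[3] = v, // refill d3 from the spill area
            None => self.live -= 1,
        }
        Some(top)
    }

    /// Binary multiply on the two topmost values; false if fewer than two.
    pub fn mul(&mut self) -> bool {
        if self.depth() < 2 {
            return false;
        }
        let b = self.pop().unwrap();
        let a = self.pop().unwrap();
        self.push(a * b);
        true
    }
}
```

In this model only expressions deeper than four values touch the spill area, which is the property that lets short chains stay entirely in registers.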

What changed

- ABI plumbing: thread dfsp + d0..d3 through the preserve_none handler signature; the two FP windows together use all 8 FP arg registers (v0..v7 / xmm0..xmm7), verified to stay register-resident on this target.
- Scalar `*D` op family: arithmetic, all six comparisons, conversions, memory load/store, the 24 math intrinsics, and print, with C handlers, bridge wiring, and stack-depth deltas. Every Type::Float64 codegen arm now targets the d-window, including call/return bridging (DToBitsD / BitsToDD) and window-aware DropD.
- Fused superinstructions (the payoff): mirror the f32 set with `get_get_dmul_sum*` (the whole biquad FMA chain → one op), `get_set*D` move chains, and get_f64const_dgt_jiz. A F32ConstF + F32ToF64D → F64ConstD const-fold exposes `<lit> as f64` constants to the compare-branch fusion.
- Validation: double_stack_delta + an assert_d_window_balanced corpus sweep mirror the existing f-window correctness checks.
- Docs updated (Stack_VM.md, FP_CODEGEN_PLAN.md §6.2).
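The const-fold mentioned above can be pictured as a two-op peephole. The following is a sketch under assumptions: the op names F32ConstF, F32ToF64D and F64ConstD come from this PR, but the `Op` enum shape, its payloads, the `Other` placeholder variant, and the function `fold_f32_const_widen` are hypothetical and not the compiler's actual representation.

```rust
/// Hypothetical, simplified op representation for illustration only.
#[derive(Debug, Clone, PartialEq)]
pub enum Op {
    F32ConstF(f32),      // push an f32 literal onto the f-window
    F32ToF64D,           // widen f-window top into the d-window
    F64ConstD(f64),      // push an f64 constant onto the d-window
    Other(&'static str), // placeholder for any op the fold ignores
}

/// Rewrite every adjacent `F32ConstF(c), F32ToF64D` pair into a single
/// `F64ConstD`. Widening f32 to f64 is exact, so the folded constant is
/// the same value the runtime conversion would have produced.
pub fn fold_f32_const_widen(ops: &[Op]) -> Vec<Op> {
    let mut out = Vec::with_capacity(ops.len());
    let mut i = 0;
    while i < ops.len() {
        match (&ops[i], ops.get(i + 1)) {
            (Op::F32ConstF(c), Some(Op::F32ToF64D)) => {
                out.push(Op::F64ConstD(f64::from(*c)));
                i += 2; // consumed both ops of the pair
            }
            (op, _) => {
                out.push(op.clone());
                i += 1;
            }
        }
    }
    out
}
```

After a fold like this, a later fusion pass sees a plain f64 constant next to the comparison, which is presumably what lets the compare-branch superinstruction match.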

Results (`benchmark/run.sh`, 3-run avg, Stack VM)

| | before | after |
|---|---|---|
| biquad f64 | ~0.363s | 0.157s |
| biquad f32 (reference) | 0.128s | 0.128s (unchanged) |

f64 went from ~2.9× slower than f32 to ~1.2× — a 2.3× speedup. The residual gap is f64's 2× state bandwidth, not dispatch/crossing overhead. No regression on the f32/int benchmarks.

Also fixes a latent correctness bug: f64 >, >=, and != previously fell through to integer bit-pattern comparisons (wrong for negatives / NaN / -0.0).
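The failure mode is easy to reproduce outside the VM. The sketch below assumes the old fall-through compared the raw 64-bit patterns as signed integers (the exact integer ops are not stated here), and the helper names are hypothetical.

```rust
/// Compare two f64 values by their raw bit patterns reinterpreted as
/// signed 64-bit integers. This is an assumed stand-in for what an
/// integer-window comparison would do to f64 bit patterns.
pub fn bits_gt(a: f64, b: f64) -> bool {
    (a.to_bits() as i64) > (b.to_bits() as i64)
}

pub fn bits_ge(a: f64, b: f64) -> bool {
    (a.to_bits() as i64) >= (b.to_bits() as i64)
}

pub fn bits_ne(a: f64, b: f64) -> bool {
    a.to_bits() != b.to_bits()
}
```

Three cases where the integer view disagrees with IEEE 754 semantics: `-1.0 > -2.0` is true but `bits_gt(-1.0, -2.0)` is false, because negative floats grow in magnitude as their bit pattern grows; `NaN != NaN` is true but `bits_ne` on two identical NaN patterns is false; and `-0.0 >= 0.0` is true while `bits_ge(-0.0, 0.0)` is false, since the sign bit makes -0.0 the most negative integer.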

Testing

- cargo test --workspace green: golden tests across all four backends (jit/vm/asm/stack).
- New tests/cases/f64_window.lyte covers arithmetic, all comparisons, FMA fusion, f64-across-calls (arg/return bridging), deep expression chains (window spill), conversions, math, and struct store/load.
- biquad_f64 output identical on jit/vm/stack.
- Both window-balance assertions (f-window and d-window) pass over the full test corpus.
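A window-balance check of this kind can be sketched as a walk over an op sequence that tracks d-window depth. Everything here is illustrative: the real double_stack_delta and assert_d_window_balanced signatures are not shown in this PR text, `MulD` is a made-up stand-in for an arithmetic op, and the pop/push counts are assumptions.

```rust
/// Hypothetical subset of d-window ops, for illustration only.
#[derive(Debug, Clone, Copy)]
pub enum DOp {
    F64ConstD, // pushes one f64
    MulD,      // made-up binary op: pops two, pushes one
    DropD,     // pops one
    DToBitsD,  // moves the top f64 out to the integer window
    BitsToDD,  // moves an integer-window bit pattern into the d-window
}

/// (values popped from, values pushed to) the d-window. Assumed effects.
pub fn d_effect(op: DOp) -> (u32, u32) {
    match op {
        DOp::F64ConstD | DOp::BitsToDD => (0, 1),
        DOp::MulD => (2, 1),
        DOp::DropD | DOp::DToBitsD => (1, 0),
    }
}

/// Check that no op reads below the bottom of the d-window and that the
/// sequence leaves the window at the depth it started with.
pub fn check_d_balanced(ops: &[DOp]) -> Result<(), String> {
    let mut depth: u32 = 0;
    for (i, op) in ops.iter().enumerate() {
        let (pops, pushes) = d_effect(*op);
        if depth < pops {
            return Err(format!("op {} ({:?}) underflows the d-window", i, op));
        }
        depth = depth - pops + pushes;
    }
    if depth != 0 {
        return Err(format!("{} value(s) left in the d-window", depth));
    }
    Ok(())
}
```

A check like this catches codegen arms that push to one window and pop from the other, which is the class of mistake a parallel window makes possible.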

Reviewer notes

- The stack backend is not covered by the differential fuzzer (it only runs jit/vm/asm/llvm), so d-window correctness here rests on golden tests + the window-balance assertions. Adding a stack arm to the fuzzer is a worthwhile follow-up.
- src/stack_vm.rs (the Rust reference VM) is intentionally untouched: it's unreachable from any real path (the stack backend uses the C interpreter) and its own unit tests don't use window ops.

🤖 Generated with Claude Code

f64 was second-class in the stack VM: values rode the integer TOS window as bit patterns, paying a GPR<->FP crossing on every op, with zero fusion. The biquad_f64 benchmark exposed this: it ran ~2.9x slower than the f32 biquad (45 generic int-window ops per sample vs a handful of fused float-window superinstructions).

Give f64 its own parallel `double` register window (d0..d3 + dfsp spill pointer), the exact analogue of the f32 float window:

- Thread dfsp + d0..d3 through the preserve_none handler signature; the two FP windows use all 8 FP arg registers (v0..v7 / xmm0..xmm7).
- A full `*D`-suffix StackOp family (arithmetic, all six comparisons, conversions, memory load/store, math intrinsics, print) with C handlers, bridge wiring, and stack-depth deltas. Every Type::Float64 codegen arm now targets the d-window, including call/return bridging (DToBitsD / BitsToDD) and window-aware DropD.
- Mirrored fused superinstructions: `get_get_dmul_sum*` (the whole biquad FMA chain collapses to one op), `get_set*D` move chains, and get_f64const_dgt_jiz. A F32ConstF + F32ToF64D -> F64ConstD const-fold exposes `<lit> as f64` constants to the compare-branch fusion.
- double_stack_delta + an assert_d_window_balanced corpus sweep mirror the f-window correctness checks.

Result (benchmark/run.sh, Stack VM): biquad f64 ~0.36s -> ~0.16s, now ~1.2x the f32 biquad (the residual gap is f64's 2x state bandwidth). No regression on the f32/int benchmarks.

Fixes a latent correctness bug too: f64 `>` / `>=` / `!=` previously fell through to integer bit-pattern comparisons.

Verified: cargo test --workspace green (golden tests across all four backends), new tests/cases/f64_window.lyte covers arithmetic, comparisons, FMA fusion, f64-across-calls, deep chains, conversions and math; biquad_f64 output identical on jit/vm/stack; both window-balance assertions pass over the full corpus.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The differential fuzz target compared jit/vm/asm/llvm but never the stack backend (the Clang-built C interpreter), so the stack VM, and its f32 single window and f64 double window in particular, had no differential coverage.

- Add a `stack` arm to run_backend(), gated on has_stack_interp, that compiles via compile_stack() and runs through stack_interp_bridge::run (the same path cli uses for --backend stack), and compare it against the VM in the fuzz body.
- Add fuzz/build.rs (mirroring cli/build.rs) plus the cc build-dep so has_stack_interp is actually defined for the fuzz crate.
- Fix capture_stdout: the stack interpreter prints via C stdio (printf), which buffers independently of Rust's stdout. Flush all C streams (fflush(NULL)) before restoring fd 1, or the captured output is lost and the stack backend appears to print nothing.
- Extend the program generator to emit f32 and f64 computations (with optional a*b+c helpers exercising float arg/return-window bridging), printing results as i32 so output stays comparable across backends despite Rust-vs-C float formatting differences. Bare float literals are f32, so f32 uses bare literals and only f64 gets an `as f64` cast.

Ran cargo fuzz run differential for several minutes across i32/f32/f64 programs with no divergences.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
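The reason for printing results as i32 can be illustrated in plain Rust. This sketch assumes the C side would print floats with a `printf("%f")`-style conversion, which is emulated here with a fixed six-digit precision; the function names are hypothetical, and lyte's own float-to-int cast may differ in detail from Rust's `as i32`.

```rust
/// How Rust's default formatter renders an f64 (shortest round-trip form).
pub fn rust_style(x: f64) -> String {
    format!("{}", x)
}

/// Emulation of C's printf("%f"), which defaults to six fractional digits.
/// An assumption about how a C interpreter might print floats.
pub fn c_style_f(x: f64) -> String {
    format!("{:.6}", x)
}

/// Truncate to an integer before printing, so that both sides emit the
/// same digits regardless of which float formatter they use.
pub fn comparable(x: f64) -> String {
    (x as i32).to_string()
}
```

For `0.1 + 0.2`, `rust_style` yields `0.30000000000000004` while `c_style_f` yields `0.300000`, so a byte-for-byte output comparison would report a divergence that is not a real bug. `comparable(2.75 * 4.0 + 1.0)` yields `12` under either formatter.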

wtholliday and others added 2 commits June 1, 2026 17:04

wtholliday merged commit 0a254f1 into main Jun 2, 2026
8 checks passed

wtholliday merged 2 commits into main from f64-double-window
