Add a full f64 "double window" to the Stack VM #19

Merged
wtholliday merged 2 commits into main from f64-double-window on Jun 2, 2026

@wtholliday

Collaborator

What & why

f64 was second-class in the stack VM: values rode the integer TOS window as bit patterns, paying a GPR↔FP crossing on every op, with zero fusion. The biquad_f64 benchmark exposed this — it ran ~2.9× slower than the f32 biquad (≈45 generic int-window ops per sample vs a handful of fused float-window superinstructions). This was a documented "deferred" decision (docs/FP_CODEGEN_PLAN.md §6.2, Option B).

This PR gives f64 its own parallel double register window (d0..d3 + a dfsp spill pointer) — the exact analogue of the existing f32 float window.
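To make the "register window plus spill pointer" idea concrete, here is a minimal toy model in Rust. It is only a sketch under assumptions: the struct, its field names, and the shift-on-push behaviour are hypothetical illustrations, not the VM's actual C handlers, which keep d0..d3 in FP registers rather than in an array.

```rust
/// Toy model of a 4-slot top-of-stack window backed by a spill area.
/// Everything here is hypothetical and only illustrates the concept.
struct DoubleWindow {
    regs: [f64; 4],  // regs[0] plays the role of d0 (top of stack)
    live: usize,     // how many window slots currently hold live values
    spill: Vec<f64>, // memory that a spill pointer such as dfsp would address
}

impl DoubleWindow {
    fn new() -> Self {
        DoubleWindow { regs: [0.0; 4], live: 0, spill: Vec::new() }
    }

    fn push(&mut self, v: f64) {
        if self.live == 4 {
            // Window is full: the deepest slot goes out to memory.
            self.spill.push(self.regs[3]);
        } else {
            self.live += 1;
        }
        // Shift every value one slot deeper, then place the new top.
        self.regs.copy_within(0..3, 1);
        self.regs[0] = v;
    }

    fn pop(&mut self) -> Option<f64> {
        if self.live == 0 {
            return None;
        }
        let top = self.regs[0];
        // Shift every remaining value one slot toward the top.
        self.regs.copy_within(1..4, 0);
        match self.spill.pop() {
            Some(refill) => self.regs[3] = refill, // window stays full
            None => self.live -= 1,
        }
        Some(top)
    }

    /// A binary op consumes two window values and produces one.
    fn dmul(&mut self) -> Option<()> {
        let b = self.pop()?;
        let a = self.pop()?;
        self.push(a * b);
        Some(())
    }
}

fn main() {
    let mut w = DoubleWindow::new();
    for i in 1..=6 {
        w.push(i as f64);
    }
    println!("values spilled to memory: {}", w.spill.len());
    w.dmul();
    println!("top after dmul: {:?}", w.pop());
}
```

In this model only expressions deeper than four values touch memory, which is the property that lets shallow f64 expressions stay in registers.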

What changed

  • ABI plumbing — thread dfsp + d0..d3 through the preserve_none handler signature; the two FP windows together use all 8 FP arg registers (v0..v7 / xmm0..xmm7), verified to stay register-resident on this target.
  • Scalar *D op family — arithmetic, all six comparisons, conversions, memory load/store, the 24 math intrinsics, and print, with C handlers, bridge wiring, and stack-depth deltas. Every Type::Float64 codegen arm now targets the d-window, including call/return bridging (DToBitsD / BitsToDD) and window-aware DropD.
  • Fused superinstructions (the payoff) — mirrored the f32 set: get_get_dmul_sum* (the whole biquad FMA chain → one op), get_set*D move chains, and get_f64const_dgt_jiz. A F32ConstF + F32ToF64D → F64ConstD const-fold exposes <lit> as f64 constants to the compare-branch fusion.
  • Validation: double_stack_delta + an assert_d_window_balanced corpus sweep mirror the existing f-window correctness checks.
  • Docs updated (Stack_VM.md, FP_CODEGEN_PLAN.md §6.2).
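The const-fold mentioned in the list above can be pictured as a small peephole pass. The following is a sketch, not the PR's code: the op names F32ConstF, F32ToF64D, and F64ConstD come from the description, but the enum layout, the payload types, and the `fold_f64_consts` function are assumptions made for illustration.

```rust
// Hypothetical op representation; only the three FP op names are from the PR.
#[derive(Debug, Clone, PartialEq)]
enum Op {
    F32ConstF(f32),
    F32ToF64D,
    F64ConstD(f64),
    Other(&'static str), // stand-in for every other op
}

/// Rewrite each `F32ConstF(c), F32ToF64D` pair into a single `F64ConstD`.
/// Widening f32 to f64 is exact, so the fold does not change the value.
fn fold_f64_consts(ops: &[Op]) -> Vec<Op> {
    let mut out = Vec::with_capacity(ops.len());
    let mut i = 0;
    while i < ops.len() {
        match (&ops[i], ops.get(i + 1)) {
            (Op::F32ConstF(c), Some(Op::F32ToF64D)) => {
                out.push(Op::F64ConstD(f64::from(*c)));
                i += 2; // consumed both ops of the pair
            }
            (op, _) => {
                out.push(op.clone());
                i += 1;
            }
        }
    }
    out
}

fn main() {
    let ops = vec![Op::F32ConstF(0.5), Op::F32ToF64D, Other_dgt()];
    println!("{:?}", fold_f64_consts(&ops));
}

// Small helper so the example reads like a compare following the constant.
#[allow(non_snake_case)]
fn Other_dgt() -> Op {
    Op::Other("dgt")
}
```

Once the constant appears as one op, a later fusion pass can match a fixed pattern such as "get, f64 constant, compare, branch" without having to look through the conversion.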

Results (benchmark/run.sh, 3-run avg, Stack VM)

| | before | after |
|---|---|---|
| biquad f64 | ~0.363s | 0.157s |
| biquad f32 (reference) | 0.128s | 0.128s (unchanged) |

f64 went from ~2.9× slower than f32 to ~1.2× — a 2.3× speedup. The residual gap is f64's 2× state bandwidth, not dispatch/crossing overhead. No regression on the f32/int benchmarks.

Also fixes a latent correctness bug: f64 >, >=, and != previously fell through to integer bit-pattern comparisons (wrong for negatives / NaN / -0.0).
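A short Rust illustration of why integer comparisons on f64 bit patterns go wrong in exactly those three cases. The helper names are hypothetical, and the sketch assumes the fall-through used signed 64-bit comparisons; the PR does not say which integer comparison was involved.

```rust
// Hypothetical helpers modelling "compare the raw bits as integers".
fn bits_gt(a: f64, b: f64) -> bool {
    (a.to_bits() as i64) > (b.to_bits() as i64)
}

fn bits_ge(a: f64, b: f64) -> bool {
    (a.to_bits() as i64) >= (b.to_bits() as i64)
}

fn bits_ne(a: f64, b: f64) -> bool {
    a.to_bits() != b.to_bits()
}

// IEEE 754 comparisons as the language defines them.
fn ieee_gt(a: f64, b: f64) -> bool {
    a > b
}

fn ieee_ge(a: f64, b: f64) -> bool {
    a >= b
}

fn ieee_ne(a: f64, b: f64) -> bool {
    a != b
}

fn main() {
    // Negatives: f64 is sign-magnitude, so bit order is reversed below zero.
    println!("-1 > -2   ieee={} bits={}", ieee_gt(-1.0, -2.0), bits_gt(-1.0, -2.0));
    // NaN: every ordered comparison is false, but NaN's bits are a large integer.
    println!("NaN >= 1  ieee={} bits={}", ieee_ge(f64::NAN, 1.0), bits_ge(f64::NAN, 1.0));
    // NaN is unequal to itself, yet its bits equal themselves.
    println!("NaN != NaN ieee={} bits={}", ieee_ne(f64::NAN, f64::NAN), bits_ne(f64::NAN, f64::NAN));
    // -0.0 equals 0.0 numerically, but the sign bit differs.
    println!("-0 != 0   ieee={} bits={}", ieee_ne(-0.0, 0.0), bits_ne(-0.0, 0.0));
}
```

Each line prints two disagreeing booleans, one per category named above.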

Testing

  • cargo test --workspace green — golden tests across all four backends (jit/vm/asm/stack).
  • New tests/cases/f64_window.lyte covers arithmetic, all comparisons, FMA fusion, f64-across-calls (arg/return bridging), deep expression chains (window spill), conversions, math, and struct store/load.
  • biquad_f64 output identical on jit/vm/stack.
  • Both window-balance assertions (f-window and d-window) pass over the full test corpus.

Reviewer notes

  • The stack backend is not covered by the differential fuzzer (it only runs jit/vm/asm/llvm), so d-window correctness here rests on golden tests + the window-balance assertions. Adding a stack arm to the fuzzer is a worthwhile follow-up.
  • src/stack_vm.rs (the Rust reference VM) is intentionally untouched: it's unreachable from any real path (the stack backend uses the C interpreter) and its own unit tests don't use window ops.

🤖 Generated with Claude Code

wtholliday and others added 2 commits June 1, 2026 17:04
f64 was second-class in the stack VM: values rode the integer TOS window
as bit patterns, paying a GPR<->FP crossing on every op, with zero fusion.
The biquad_f64 benchmark exposed this — it ran ~2.9x slower than the f32
biquad (45 generic int-window ops per sample vs a handful of fused
float-window superinstructions).

Give f64 its own parallel `double` register window (d0..d3 + dfsp spill
pointer), the exact analogue of the f32 float window:

- Thread dfsp + d0..d3 through the preserve_none handler signature; the
  two FP windows use all 8 FP arg registers (v0..v7 / xmm0..xmm7).
- A full `*D`-suffix StackOp family (arithmetic, all six comparisons,
  conversions, memory load/store, math intrinsics, print) with C handlers,
  bridge wiring, and stack-depth deltas. Every Type::Float64 codegen arm
  now targets the d-window, including call/return bridging (DToBitsD /
  BitsToDD) and window-aware DropD.
- Mirrored fused superinstructions: get_get_dmul_sum* (the whole biquad
  FMA chain collapses to one op), get_set*D move chains, and
  get_f64const_dgt_jiz. A F32ConstF + F32ToF64D -> F64ConstD const-fold
  exposes `<lit> as f64` constants to the compare-branch fusion.
- double_stack_delta + an assert_d_window_balanced corpus sweep mirror the
  f-window correctness checks.
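One way to read the delta-based balance check in the last bullet is sketched below. The names double_stack_delta, F64ConstD, DropD, DToBitsD, and BitsToDD appear in this commit message, but the function bodies, the per-op deltas, the DAddD and IntOp variants, and the `check_d_window_balanced` helper are assumptions made for illustration; the real signatures are not shown here.

```rust
// Hypothetical subset of ops. DAddD and IntOp are invented for the example.
#[derive(Clone, Copy)]
enum Op {
    F64ConstD, // assumed: pushes one value onto the d-window
    BitsToDD,  // assumed: moves a value from the int window into the d-window
    DAddD,     // assumed: pops two d-window values, pushes one
    DropD,     // assumed: pops one d-window value
    DToBitsD,  // assumed: moves a value from the d-window to the int window
    IntOp,     // any op that leaves the d-window untouched
}

/// Net change in d-window depth caused by one op (assumed values).
fn double_stack_delta(op: Op) -> i32 {
    match op {
        Op::F64ConstD | Op::BitsToDD => 1,
        Op::DAddD | Op::DropD | Op::DToBitsD => -1,
        Op::IntOp => 0,
    }
}

/// A straight-line op sequence is balanced if the depth never goes
/// negative and returns to zero at the end.
fn check_d_window_balanced(ops: &[Op]) -> Result<(), String> {
    let mut depth = 0i32;
    for (pc, op) in ops.iter().enumerate() {
        depth += double_stack_delta(*op);
        if depth < 0 {
            return Err(format!("d-window underflow at op {pc}"));
        }
    }
    if depth != 0 {
        return Err(format!("d-window left with depth {depth}"));
    }
    Ok(())
}

fn main() {
    let ok = [Op::F64ConstD, Op::F64ConstD, Op::DAddD, Op::IntOp, Op::DropD];
    println!("{:?}", check_d_window_balanced(&ok));
    let leaky = [Op::F64ConstD, Op::BitsToDD, Op::DAddD];
    println!("{:?}", check_d_window_balanced(&leaky));
}
```

A sweep over a test corpus would then run such a check on every compiled function, so a codegen arm that pushes to the wrong window shows up as an imbalance rather than as a silent miscompile.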

Result (benchmark/run.sh, Stack VM): biquad f64 ~0.36s -> ~0.16s, now ~1.2x
the f32 biquad (the residual gap is f64's 2x state bandwidth). No
regression on the f32/int benchmarks. Fixes a latent correctness bug too:
f64 `>` / `>=` / `!=` previously fell through to integer bit-pattern
comparisons.

Verified: cargo test --workspace green (golden tests across all four
backends), new tests/cases/f64_window.lyte covers arithmetic, comparisons,
FMA fusion, f64-across-calls, deep chains, conversions and math; biquad_f64
output identical on jit/vm/stack; both window-balance assertions pass over
the full corpus.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The differential fuzz target compared jit/vm/asm/llvm but never the
stack backend (the Clang-built C interpreter), so the stack VM — and
its f32 single window and f64 double window in particular — had no
differential coverage.

- Add a `stack` arm to run_backend(), gated on has_stack_interp, that
  compiles via compile_stack() and runs through stack_interp_bridge::run
  (the same path cli uses for --backend stack), and compare it against
  the VM in the fuzz body.
- Add fuzz/build.rs (mirroring cli/build.rs) plus the cc build-dep so
  has_stack_interp is actually defined for the fuzz crate.
- Fix capture_stdout: the stack interpreter prints via C stdio (printf),
  which buffers independently of Rust's stdout. Flush all C streams
  (fflush(NULL)) before restoring fd 1, or the captured output is lost
  and the stack backend appears to print nothing.
- Extend the program generator to emit f32 and f64 computations (with
  optional a*b+c helpers exercising float arg/return-window bridging),
  printing results as i32 so output stays comparable across backends
  despite Rust-vs-C float formatting differences. Bare float literals
  are f32, so f32 uses bare literals and only f64 gets an `as f64` cast.
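The comparison step described in the first bullet can be sketched as follows. This is a hypothetical illustration, not the fuzz target's code: `find_divergence` and its signature are invented, and `None` is used here to model a backend that is unavailable in the current build (for example when has_stack_interp is not set).

```rust
/// Return the name of the first backend whose captured output differs
/// from the reference output. Backends with `None` are skipped because
/// they were not built, which mirrors a cfg-gated arm.
fn find_divergence<'a>(
    reference: &str,
    others: &[(&'a str, Option<String>)],
) -> Option<&'a str> {
    others.iter().find_map(|(name, output)| match output {
        Some(text) if text.as_str() != reference => Some(*name),
        _ => None,
    })
}

fn main() {
    // Outputs are printed as integers so they compare byte for byte.
    let vm_output = "42\n";
    let others = [
        ("jit", Some("42\n".to_string())),
        ("stack", Some(String::new())), // e.g. output lost to an unflushed buffer
    ];
    match find_divergence(vm_output, &others) {
        Some(name) => println!("divergence in backend: {name}"),
        None => println!("all backends agree"),
    }
}
```

Comparing exact strings is what makes the stdout-flush fix matter: an unflushed C buffer yields an empty capture, which a check like this would report as a divergence even though the backend computed the right value.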

Ran cargo fuzz run differential for several minutes across i32/f32/f64
programs with no divergences.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@wtholliday wtholliday merged commit 0a254f1 into main Jun 2, 2026
8 checks passed