feat(math/fft): phased Bailey four-step FFT + prover LDE wiring (opt-in) by diegokingston · Pull Request #633 · yetanotherco/lambda_vm

diegokingston · 2026-05-28T19:58:10Z

Adds a 2-phase Bailey four-step FFT (bowers_phased_fft) as a sibling to the current bowers_fft_opt_fused_parallel. The phased path reuses the existing Bowers butterfly kernels for the inner FFTs; what changes is the outer loop, which splits log_n into two halves of size ~⌈log_n/2⌉ each so the inner FFT working set fits L1.

Wiring

Gated behind math/phased-fft feature (off by default).

When phased-fft is enabled AND the FFT input is ≥ 2²¹ AND is a power of two, dispatch_fft_natural (new helper) routes through the phased path. Otherwise it falls back to
bowers_fft_opt_fused_parallel + in_place_bit_reverse_permute. Both paths produce natural-order output, so the three call sites in polynomial.rs (coset_lde_full_into, coset_lde_full_expand, evaluate_fft_cpu_raw) collapse dispatch_fft(...) + in_place_bit_reverse_permute(...) into a single
dispatch_fft_natural(...) invocation.

Toggle for A/B benching:
cargo bench -p lambda-vm-prover ... # baseline
cargo bench -p lambda-vm-prover --features math/phased-fft # phased

Phased algorithm

Standard Bailey four-step. With N = M·K, view the input as M × K row-major:

Transpose to K × M (tiled, out-of-place).
FFT_M on each of K rows (Bowers fused kernel).
Multiply pointwise by ω_N^(s·d) (rows independent → parallel).
Transpose to M × K.
FFT_K on each of M rows.
Final transpose to put the result in natural linear order.

The phase twiddle ω_N^(s·d) is computed on the fly (one ω_s = ω_N^s per row, stepped element by element) — avoids the N-element inter-phase twiddle table, which at log_n = 26 would cost ~1 GB.

API

pub struct PhasedFftContext<F>
impl<F: IsFFTField> PhasedFftContext<F> {
    pub fn new(log_n: usize) -> Result<Self, FFTError>;
}

pub fn tiled_transpose<T: Clone>(input, output, rows, cols);

pub fn bowers_phased_fft<F, E>(input)
    -> Result<(), FFTError>;
pub fn bowers_phased_fft_with_context<F, E>(input, ctx)
    -> Result<(), FFTError>;
pub fn bowers_phased_fft_with_buf<F, E>(input, ctx, buf)
    -> Result<(), FFTError>;
pub fn bowers_phased_fft_multicol<F, E>(columns, ctx)
    -> Result<(), FFTError>;

bowers_phased_fft_with_buf accepts a caller-owned scratch buffer that's reused across invocations (avoids the per-call Vec::with_ capacity(N)). bowers_phased_fft_multicol runs many independent columns in parallel via for_each_init, amortising the scratch buffer across rayon workers.

Tests (8 new, all green)

phased_equivalence_log4
phased_equivalence_log5_odd_split
phased_equivalence_log_n_4_through_12 (× 4 seeds each)
phased_equivalence_log_n_16
phased_equivalence_log_n_18_odd_split
phased_multicol_matches_per_column_reference (6 columns, log_n = 14)
tiled_transpose_2x3
tiled_transpose_roundtrip (rows = 17, cols = 23)

Reference for equivalence tests is
bowers_fft_opt_fused + in_place_bit_reverse_permute → natural order, which is byte-identical to the new bowers_phased_fft output.

Benchmark

New benches/phased_fft_benchmark.rs covers:

Single-column FFT, log_n ∈ {20, 22, 24}.
Multi-column LDE shape, (cols × log_n) ∈ {(74, 21), (49, 21), (24, 22), (12, 24)} — these match lambda_vm's actual table widths (CPU 74-col @ 2²⁰, MEMW 49-col, etc.).

Apple Silicon results (M-series, baseline reference):

log_n	bowers (ms)	phased (ms)	ratio
20	10.4	11.8	1.13× slow
22	49.7	61.7	1.24× slow
24	217.1	250.1	1.15× slow

Multi-column (fft_multi_col):

cols × log_n	bowers_par (ms)	phased_multicol (ms)	ratio
74 × 2²¹	673	880	1.31× slow
49 × 2²¹	488	576	1.18× slow
24 × 2²²	589	622	1.06× slow
12 × 2²⁴	1342	1354	1.01× par

The phased path converges to parity as log_n grows but does NOT win on Apple Silicon for log_n ≤ 24 — the unified-memory cache subsystem hides the cache-cliff that this pattern is designed to attack. The expected payoff regime is x86 server hardware with sharper L3 boundaries; this PR is the working bed for that benchmark.

Out of scope (follow-ups)

PhasedFftContext caching / thread-local to amortise inner-twiddle builds across calls.
3-phase split for log_n ≥ 23 (smaller inner FFTs, more transposes).
Interleaved multi-column data layout (Plonky3 AdjacentColMatrix style) — same outer loop but butterflies operate on C-wide vectors, the real architectural win for SIMD-rich uarches.
Phased inverse FFT (current PR only routes the forward path).

Adds a 2-phase Bailey four-step FFT (`bowers_phased_fft`) as a sibling to the current `bowers_fft_opt_fused_parallel`. The phased path reuses the existing Bowers butterfly kernels for the inner FFTs; what changes is the outer loop, which splits log_n into two halves of size ~⌈log_n/2⌉ each so the inner FFT working set fits L1. # Wiring Gated behind `math/phased-fft` feature (off by default). When `phased-fft` is enabled AND the FFT input is ≥ 2²¹ AND is a power of two, `dispatch_fft_natural` (new helper) routes through the phased path. Otherwise it falls back to `bowers_fft_opt_fused_parallel + in_place_bit_reverse_permute`. Both paths produce natural-order output, so the three call sites in `polynomial.rs` (`coset_lde_full_into`, `coset_lde_full_expand`, `evaluate_fft_cpu_raw`) collapse `dispatch_fft(...) + in_place_bit_reverse_permute(...)` into a single `dispatch_fft_natural(...)` invocation. Toggle for A/B benching: cargo bench -p lambda-vm-prover ... # baseline cargo bench -p lambda-vm-prover --features math/phased-fft # phased # Phased algorithm Standard Bailey four-step. With N = M·K, view the input as M × K row-major: 1. Transpose to K × M (tiled, out-of-place). 2. FFT_M on each of K rows (Bowers fused kernel). 3. Multiply pointwise by ω_N^(s·d) (rows independent → parallel). 4. Transpose to M × K. 5. FFT_K on each of M rows. 6. Final transpose to put the result in natural linear order. The phase twiddle ω_N^(s·d) is computed on the fly (one ω_s = ω_N^s per row, stepped element by element) — avoids the N-element inter-phase twiddle table, which at log_n = 26 would cost ~1 GB. # API ``` pub struct PhasedFftContext<F> impl<F: IsFFTField> PhasedFftContext<F> { pub fn new(log_n: usize) -> Result<Self, FFTError>; } pub fn tiled_transpose<T: Clone>(input, output, rows, cols); pub fn bowers_phased_fft<F, E>(input) -> Result<(), FFTError>; pub fn bowers_phased_fft_with_context<F, E>(input, ctx) -> Result<(), FFTError>; pub fn bowers_phased_fft_with_buf<F, E>(input, ctx, buf) -> Result<(), FFTError>; pub fn bowers_phased_fft_multicol<F, E>(columns, ctx) -> Result<(), FFTError>; ``` `bowers_phased_fft_with_buf` accepts a caller-owned scratch buffer that's reused across invocations (avoids the per-call `Vec::with_ capacity(N)`). `bowers_phased_fft_multicol` runs many independent columns in parallel via `for_each_init`, amortising the scratch buffer across rayon workers. # Tests (8 new, all green) - phased_equivalence_log4 - phased_equivalence_log5_odd_split - phased_equivalence_log_n_4_through_12 (× 4 seeds each) - phased_equivalence_log_n_16 - phased_equivalence_log_n_18_odd_split - phased_multicol_matches_per_column_reference (6 columns, log_n = 14) - tiled_transpose_2x3 - tiled_transpose_roundtrip (rows = 17, cols = 23) Reference for equivalence tests is `bowers_fft_opt_fused + in_place_bit_reverse_permute` → natural order, which is byte-identical to the new `bowers_phased_fft` output. # Benchmark New `benches/phased_fft_benchmark.rs` covers: - Single-column FFT, log_n ∈ {20, 22, 24}. - Multi-column LDE shape, (cols × log_n) ∈ {(74, 21), (49, 21), (24, 22), (12, 24)} — these match lambda_vm's actual table widths (CPU 74-col @ 2²⁰, MEMW 49-col, etc.). Apple Silicon results (M-series, baseline reference): | log_n | bowers (ms) | phased (ms) | ratio | |-------|------------:|------------:|-------------| | 20 | 10.4 | 11.8 | 1.13× slow | | 22 | 49.7 | 61.7 | 1.24× slow | | 24 | 217.1 | 250.1 | 1.15× slow | Multi-column (`fft_multi_col`): | cols × log_n | bowers_par (ms) | phased_multicol (ms) | ratio | |--------------|----------------:|---------------------:|-------------| | 74 × 2²¹ | 673 | 880 | 1.31× slow | | 49 × 2²¹ | 488 | 576 | 1.18× slow | | 24 × 2²² | 589 | 622 | 1.06× slow | | 12 × 2²⁴ | 1342 | 1354 | 1.01× par | The phased path converges to parity as log_n grows but does NOT win on Apple Silicon for log_n ≤ 24 — the unified-memory cache subsystem hides the cache-cliff that this pattern is designed to attack. The expected payoff regime is x86 server hardware with sharper L3 boundaries; this PR is the working bed for that benchmark. # Out of scope (follow-ups) - PhasedFftContext caching / thread-local to amortise inner-twiddle builds across calls. - 3-phase split for log_n ≥ 23 (smaller inner FFTs, more transposes). - Interleaved multi-column data layout (Plonky3 AdjacentColMatrix style) — same outer loop but butterflies operate on C-wide vectors, the real architectural win for SIMD-rich uarches. - Phased inverse FFT (current PR only routes the forward path).

diegokingston · 2026-05-28T19:58:22Z

/bench 5

github-actions · 2026-05-28T20:00:52Z

Codex Code Review

Findings

Low - potential DoS/panic: crypto/math/src/fft/phased_fft.rs:91
PhasedFftContext::new(log_n) computes 1usize << log_n before validating log_n, and allocates twiddles_m/twiddles_k before checking that an N-th root exists. Invalid large orders can panic on oversized shifts, or allocate huge sub-twiddle tables before eventually returning an error. Validate log_n <= F::TWO_ADICITY and log_n < usize::BITS before shifting or allocating, and fetch omega_n before building sub-twiddles.

Tests

I attempted to run cargo test -p math --features phased-fft, but rustup failed because /home/runner/.rustup is read-only in this environment.

claude · 2026-05-28T20:01:07Z

Code Review: phased Bailey four-step FFT

Overview: Adds a cache-blocked Bailey four-step NTT gated behind phased-fft feature flag. Clean prototype structure with good tests and honest benchmark results. The PR itself notes performance doesn't win on Apple Silicon yet — the feature is a research bed for x86 server testing. No security issues found.

Medium — Silent fallback masks feature-flag failures

In dispatch_fft_natural (polynomial.rs):

if let Ok(ctx) = PhasedFftContext::<F>::new(log_n) {
    ...
    return bowers_phased_fft_with_buf(...);
}
// silently falls through to Bowers path

If PhasedFftContext::new fails (field missing the required root of unity order), the code silently falls back to the baseline. A benchmark run with --features math/phased-fft can be quietly measuring the Bowers path the entire time with no indication. At minimum, log a warning or return the error.

Low — `chunks_exact_mut` vs `par_chunks_mut` inconsistency in `apply_phase_twiddles`

Serial path uses chunks_exact_mut (silently drops a remainder); parallel path uses par_chunks_mut (includes a partial last chunk). With the debug_assert_eq! they're equivalent in debug builds, but in release mode the two paths behave differently when called with mismatched dimensions. Both paths should use the same semantic (chunks_exact_mut / par_chunks_exact_mut, or both chunks_mut / par_chunks_mut).

Trivial — Wrong test comment

phased_equivalence_log_n_18_odd_split says "even split, no reuse" but log_n=18 → log_m=9, log_k=9 is the case where twiddles_k = None and twiddles_m is reused — the opposite of what the comment says.

Trivial — Missing newline at EOF in Cargo.toml

harness = false on the last line is missing a trailing newline.

Notes (acknowledged in PR, listing for completeness)

PhasedFftContext is rebuilt per dispatch_fft_natural call — the context build is O(M+K) twiddle allocations. The in-code comment acknowledges a thread-local cache as a follow-up. Fine for the prototype.
All PhasedFftContext fields are pub. Fine for a prototype, but worth making private before promoting beyond research/.
The pre-computed twiddles argument to dispatch_fft_natural is unused when the phased path fires. Minor waste that resolves itself once context caching lands.

claude · 2026-05-28T20:01:16Z

+            // (inner LayerTwiddles); negligible relative to the FFT itself.
+            // A cached / thread-local context would be a follow-up
+            // optimisation if we want to land this for real.
+            if let Ok(ctx) = PhasedFftContext::<F>::new(log_n) {


Medium: Silent fallback means a benchmark with --features math/phased-fft can silently run the Bowers path if context construction fails (e.g. the field doesn't support a root of unity of the required order). Consider surfacing this as an error or at least a debug!/eprintln! warning so callers know which path actually executed.

Suggested change

if let Ok(ctx) = PhasedFftContext::<F>::new(log_n) {

match PhasedFftContext::<F>::new(log_n) {

Ok(ctx) => {

let mut scratch: Vec<FieldElement<E>> = Vec::with_capacity(n);

return bowers_phased_fft_with_buf(buffer, &ctx, &mut scratch);

}

Err(e) => {

debug_assert!(false, "phased-fft: context build failed ({e:?}), falling back to Bowers");

}

}

claude · 2026-05-28T20:01:18Z

+    #[cfg(feature = "parallel")]
+    let row_iter = buf.par_chunks_mut(cols).enumerate();
+    #[cfg(not(feature = "parallel"))]
+    let row_iter = buf.chunks_exact_mut(cols).enumerate();


Low: The serial and parallel paths have different overflow behaviour if called with mismatched dimensions. chunks_exact_mut silently drops a remainder while par_chunks_mut would include a short last chunk. Covered by the debug_assert_eq! in debug builds, but divergent in release. Prefer matching semantics — both chunks_exact_mut/par_chunks_exact_mut, or both chunks_mut/par_chunks_mut.

claude · 2026-05-28T20:01:24Z

+    fn phased_equivalence_log_n_18_odd_split() {
+        // log_n = 18 → log_m = 9, log_k = 9 (even split, no reuse).
+        let input = random_input(18, 7);
+        let expected = reference_fft(&input);


Trivial: Comment says "even split, no reuse" but log_n=18 → log_m=9, log_k=9 is precisely the case where twiddles_k = None and twiddles_m is reused for both phases. Should read something like "even split, twiddles_m reused for both phases".

gabrielbosio · 2026-05-28T21:14:16Z

/bench

github-actions · 2026-05-28T21:21:06Z

Benchmark — fib_iterative_8M (median of 5)

_{Table parallelism: auto (cores / 3)}

Metric	main	PR	Δ
Peak heap	51665 MB	51860 MB	+195 MB (+0.4%) ⚪
Prove time	25.030s	25.550s	+0.520s (+2.1%) ⚪

✅ No significant change.

✅ Low variance (time: 4.3%, heap: 0.5%)

_{Commit: 9042f08 · Baseline: built from main · Runner: self-hosted bench}

…g prover wiring - phased_fft.rs: parallelise the tiled transpose (tiled_transpose_par), unify the Bailey worker behind phased_fft_core(parallel) and drop the redundant final clone via a parallel copy. Single-column phased is now ~2.5x faster than bowers on a large isolated FFT (parallel transpose was the missing piece; bowers' early layers parallelise poorly). - Add bowers_phased_fft_seq_with_buf (sequential inner) for callers that already parallelise at a coarser grain. - polynomial.rs: dispatch_fft_natural now routes the LDE forward FFT through the SEQUENTIAL-inner phased path, so it composes with the prover's per-column par_iter without nesting rayon. The Bailey cache-blocking (lower DRAM traffic) is what pays at memory-bound sizes. - tables/mod.rs: LAMBDA_MAXROWS_SCALE_LOG2 env knob (default 0, clamped 0..=4) to A/B larger chunks/tables on the bench server without recompiling. - Feature passthrough: cli -> prover -> math/phased-fft. All opt-in (phased-fft feature off by default). Equivalence tests pass.

diegokingston · 2026-05-29T00:32:15Z

/bench 5

bench_prove.sh builds the feature branch and base branch with the same --features string; base (e.g. main) may lack features that only exist on the PR (like phased-fft). PR_EXTRA_FEATURES is appended to the feature/ current build only, never to base, so the comparison build doesn't break. Usage: PR_EXTRA_FEATURES=phased-fft LAMBDA_MAXROWS_SCALE_LOG2=3 \ ./scripts/bench_prove.sh <elf> 3

…branch DEFAULT_MAXROWS_SCALE_LOG2 = 3 so the automated main-vs-PR bench harness exercises the larger tables without injecting LAMBDA_MAXROWS_SCALE_LOG2 (which the harness does not propagate to the prove subprocess). CPU/MEMW 2^19->2^22, MUL/etc 2^20->2^23. Chunks pad to next_pow2(chunk_len), not to max_rows, and full power-of-two chunks aren't padded — so for large traces (e.g. fib_8M ~5% last-chunk rounding) the cached-LDE peak heap stays ~constant while chunk count drops 8x, cutting per-instance FRI/grinding/OOD overhead and pushing LDE sizes into the memory-bound regime where the phased FFT wins. Override with LAMBDA_MAXROWS_SCALE_LOG2=0 (production sizing) / =1 / =2 to sweep. Verified: scale=3 + phased proof verifies OK.

diegokingston · 2026-05-29T01:05:46Z

/bench 5

scale=3 (×8) regressed fib_8M prove time +41.4% / heap +13.5%: bigger chunks collapse the table-level parallelism the prover depends on (CPU 16->2 instances), and per-table last-chunk padding rounds up more. The per-instance overhead savings don't compensate. scale=2 (×4): CPU/MEMW 2^19->2^21, MUL/etc 2^20->2^22. CPU LDE 2^22 sits solidly in the memory-bound regime where the phased FFT wins standalone, while keeping ~4x more table instances than ×8 for parallelism. Override via LAMBDA_MAXROWS_SCALE_LOG2.

diegokingston · 2026-05-29T01:15:07Z

/bench 5

The GitHub bench workflow builds the PR with `--features jemalloc-stats` only, so the phased-fft-gated path never compiled in — the scale=2/3 regressions were bigger-tables + BOWERS, never bigger-tables + phased. Remove the cfg gates so dispatch_fft_natural always routes large LDE forward FFTs (n >= 2^21) through the sequential-inner phased path. main lacks this module entirely, so the bench baseline stays clean bowers and the comparison is (big tables + phased) vs (production + bowers) — the hypothesis we actually wanted to test. The phased-fft feature is kept as a no-op for compatibility. Verified: default-feature build proves + verifies.

diegokingston · 2026-05-29T01:27:19Z

/bench 5

Bigger tables regress fib_8M monotonically (×4 +16%, ×8 +41% prove time): fewer/bigger chunks collapse the table-level parallelism the prover depends on, and the last partial chunk pads up more. The phased FFT does not recover it even at memory-bound LDE sizes (×4 + phased +21.6%, no better than ×4 + bowers) — the per-column par_iter already saturates cores, so Bailey's extra transpose passes cost more than its locality saves. Phased FFT stays unconditionally wired (it fires for the 2^20 tables at LDE 2^21, ~neutral) and remains a 2.5x-faster standalone kernel for any future free-core FFT path. Chunk sizing back to production.

diegokingston · 2026-05-29T01:36:48Z

/bench 5

claude Bot reviewed May 28, 2026

View reviewed changes

diegokingston added 2 commits May 28, 2026 21:32

-            if let Ok(ctx) = PhasedFftContext::<F>::new(log_n) {
+            match PhasedFftContext::<F>::new(log_n) {
+                Ok(ctx) => {
+                    let mut scratch: Vec<FieldElement<E>> = Vec::with_capacity(n);
+                    return bowers_phased_fft_with_buf(buffer, &ctx, &mut scratch);
+                }
+                Err(e) => {
+                    debug_assert!(false, "phased-fft: context build failed ({e:?}), falling back to Bowers");
+                }
+            }

Conversation

diegokingston commented May 28, 2026

Wiring

Phased algorithm

API

Tests (8 new, all green)

Benchmark

Out of scope (follow-ups)

Uh oh!

diegokingston commented May 28, 2026

Uh oh!

github-actions Bot commented May 28, 2026

Codex Code Review

Uh oh!

claude Bot commented May 28, 2026

Code Review: phased Bailey four-step FFT

Medium — Silent fallback masks feature-flag failures

Low — chunks_exact_mut vs par_chunks_mut inconsistency in apply_phase_twiddles

Trivial — Wrong test comment

Trivial — Missing newline at EOF in Cargo.toml

Notes (acknowledged in PR, listing for completeness)

Uh oh!

claude Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

claude Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

gabrielbosio commented May 28, 2026

Uh oh!

github-actions Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Benchmark — fib_iterative_8M (median of 5)

Uh oh!

diegokingston commented May 29, 2026

Uh oh!

diegokingston commented May 29, 2026

Uh oh!

diegokingston commented May 29, 2026

Uh oh!

diegokingston commented May 29, 2026

Uh oh!

diegokingston commented May 29, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Low — `chunks_exact_mut` vs `par_chunks_mut` inconsistency in `apply_phase_twiddles`

github-actions Bot commented May 28, 2026 •

edited

Loading