feat(math/fft): phased Bailey four-step FFT + prover LDE wiring (opt-in)#633
feat(math/fft): phased Bailey four-step FFT + prover LDE wiring (opt-in)#633diegokingston wants to merge 7 commits into
Conversation
Adds a 2-phase Bailey four-step FFT (`bowers_phased_fft`) as a sibling
to the current `bowers_fft_opt_fused_parallel`. The phased path reuses
the existing Bowers butterfly kernels for the inner FFTs; what changes
is the outer loop, which splits log_n into two halves of size
~⌈log_n/2⌉ each so the inner FFT working set fits L1.
# Wiring
Gated behind `math/phased-fft` feature (off by default).
When `phased-fft` is enabled AND the FFT input is ≥ 2²¹ AND is a
power of two, `dispatch_fft_natural` (new helper) routes through the
phased path. Otherwise it falls back to
`bowers_fft_opt_fused_parallel + in_place_bit_reverse_permute`. Both
paths produce natural-order output, so the three call sites in
`polynomial.rs` (`coset_lde_full_into`, `coset_lde_full_expand`,
`evaluate_fft_cpu_raw`) collapse `dispatch_fft(...) +
in_place_bit_reverse_permute(...)` into a single
`dispatch_fft_natural(...)` invocation.
Toggle for A/B benching:
cargo bench -p lambda-vm-prover ... # baseline
cargo bench -p lambda-vm-prover --features math/phased-fft # phased
# Phased algorithm
Standard Bailey four-step. With N = M·K, view the input as M × K
row-major:
1. Transpose to K × M (tiled, out-of-place).
2. FFT_M on each of K rows (Bowers fused kernel).
3. Multiply pointwise by ω_N^(s·d) (rows independent → parallel).
4. Transpose to M × K.
5. FFT_K on each of M rows.
6. Final transpose to put the result in natural linear order.
The phase twiddle ω_N^(s·d) is computed on the fly (one ω_s = ω_N^s
per row, stepped element by element) — avoids the N-element
inter-phase twiddle table, which at log_n = 26 would cost ~1 GB.
# API
```
pub struct PhasedFftContext<F>
impl<F: IsFFTField> PhasedFftContext<F> {
pub fn new(log_n: usize) -> Result<Self, FFTError>;
}
pub fn tiled_transpose<T: Clone>(input, output, rows, cols);
pub fn bowers_phased_fft<F, E>(input)
-> Result<(), FFTError>;
pub fn bowers_phased_fft_with_context<F, E>(input, ctx)
-> Result<(), FFTError>;
pub fn bowers_phased_fft_with_buf<F, E>(input, ctx, buf)
-> Result<(), FFTError>;
pub fn bowers_phased_fft_multicol<F, E>(columns, ctx)
-> Result<(), FFTError>;
```
`bowers_phased_fft_with_buf` accepts a caller-owned scratch buffer
that's reused across invocations (avoids the per-call `Vec::with_
capacity(N)`). `bowers_phased_fft_multicol` runs many independent
columns in parallel via `for_each_init`, amortising the scratch
buffer across rayon workers.
# Tests (8 new, all green)
- phased_equivalence_log4
- phased_equivalence_log5_odd_split
- phased_equivalence_log_n_4_through_12 (× 4 seeds each)
- phased_equivalence_log_n_16
- phased_equivalence_log_n_18_odd_split
- phased_multicol_matches_per_column_reference (6 columns, log_n = 14)
- tiled_transpose_2x3
- tiled_transpose_roundtrip (rows = 17, cols = 23)
Reference for equivalence tests is
`bowers_fft_opt_fused + in_place_bit_reverse_permute` → natural order,
which is byte-identical to the new `bowers_phased_fft` output.
# Benchmark
New `benches/phased_fft_benchmark.rs` covers:
- Single-column FFT, log_n ∈ {20, 22, 24}.
- Multi-column LDE shape, (cols × log_n) ∈ {(74, 21), (49, 21),
(24, 22), (12, 24)} — these match lambda_vm's actual table widths
(CPU 74-col @ 2²⁰, MEMW 49-col, etc.).
Apple Silicon results (M-series, baseline reference):
| log_n | bowers (ms) | phased (ms) | ratio |
|-------|------------:|------------:|-------------|
| 20 | 10.4 | 11.8 | 1.13× slow |
| 22 | 49.7 | 61.7 | 1.24× slow |
| 24 | 217.1 | 250.1 | 1.15× slow |
Multi-column (`fft_multi_col`):
| cols × log_n | bowers_par (ms) | phased_multicol (ms) | ratio |
|--------------|----------------:|---------------------:|-------------|
| 74 × 2²¹ | 673 | 880 | 1.31× slow |
| 49 × 2²¹ | 488 | 576 | 1.18× slow |
| 24 × 2²² | 589 | 622 | 1.06× slow |
| 12 × 2²⁴ | 1342 | 1354 | 1.01× par |
The phased path converges to parity as log_n grows but does NOT win on
Apple Silicon for log_n ≤ 24 — the unified-memory cache subsystem
hides the cache-cliff that this pattern is designed to attack. The
expected payoff regime is x86 server hardware with sharper L3
boundaries; this PR is the working bed for that benchmark.
# Out of scope (follow-ups)
- PhasedFftContext caching / thread-local to amortise inner-twiddle
builds across calls.
- 3-phase split for log_n ≥ 23 (smaller inner FFTs, more transposes).
- Interleaved multi-column data layout (Plonky3 AdjacentColMatrix
style) — same outer loop but butterflies operate on C-wide
vectors, the real architectural win for SIMD-rich uarches.
- Phased inverse FFT (current PR only routes the forward path).
|
/bench 5 |
Codex Code ReviewFindings
Tests I attempted to run |
Code Review: phased Bailey four-step FFTOverview: Adds a cache-blocked Bailey four-step NTT gated behind Medium — Silent fallback masks feature-flag failuresIn if let Ok(ctx) = PhasedFftContext::<F>::new(log_n) {
...
return bowers_phased_fft_with_buf(...);
}
// silently falls through to Bowers pathIf Low —
|
| // (inner LayerTwiddles); negligible relative to the FFT itself. | ||
| // A cached / thread-local context would be a follow-up | ||
| // optimisation if we want to land this for real. | ||
| if let Ok(ctx) = PhasedFftContext::<F>::new(log_n) { |
There was a problem hiding this comment.
Medium: Silent fallback means a benchmark with --features math/phased-fft can silently run the Bowers path if context construction fails (e.g. the field doesn't support a root of unity of the required order). Consider surfacing this as an error or at least a debug!/eprintln! warning so callers know which path actually executed.
| if let Ok(ctx) = PhasedFftContext::<F>::new(log_n) { | |
| match PhasedFftContext::<F>::new(log_n) { | |
| Ok(ctx) => { | |
| let mut scratch: Vec<FieldElement<E>> = Vec::with_capacity(n); | |
| return bowers_phased_fft_with_buf(buffer, &ctx, &mut scratch); | |
| } | |
| Err(e) => { | |
| debug_assert!(false, "phased-fft: context build failed ({e:?}), falling back to Bowers"); | |
| } | |
| } |
| #[cfg(feature = "parallel")] | ||
| let row_iter = buf.par_chunks_mut(cols).enumerate(); | ||
| #[cfg(not(feature = "parallel"))] | ||
| let row_iter = buf.chunks_exact_mut(cols).enumerate(); |
There was a problem hiding this comment.
Low: The serial and parallel paths have different overflow behaviour if called with mismatched dimensions. chunks_exact_mut silently drops a remainder while par_chunks_mut would include a short last chunk. Covered by the debug_assert_eq! in debug builds, but divergent in release. Prefer matching semantics — both chunks_exact_mut/par_chunks_exact_mut, or both chunks_mut/par_chunks_mut.
| fn phased_equivalence_log_n_18_odd_split() { | ||
| // log_n = 18 → log_m = 9, log_k = 9 (even split, no reuse). | ||
| let input = random_input(18, 7); | ||
| let expected = reference_fft(&input); |
There was a problem hiding this comment.
Trivial: Comment says "even split, no reuse" but log_n=18 → log_m=9, log_k=9 is precisely the case where twiddles_k = None and twiddles_m is reused for both phases. Should read something like "even split, twiddles_m reused for both phases".
|
/bench |
Benchmark — fib_iterative_8M (median of 5)Table parallelism: auto (cores / 3)
Commit: 9042f08 · Baseline: built from main · Runner: self-hosted bench |
…g prover wiring - phased_fft.rs: parallelise the tiled transpose (tiled_transpose_par), unify the Bailey worker behind phased_fft_core(parallel) and drop the redundant final clone via a parallel copy. Single-column phased is now ~2.5x faster than bowers on a large isolated FFT (parallel transpose was the missing piece; bowers' early layers parallelise poorly). - Add bowers_phased_fft_seq_with_buf (sequential inner) for callers that already parallelise at a coarser grain. - polynomial.rs: dispatch_fft_natural now routes the LDE forward FFT through the SEQUENTIAL-inner phased path, so it composes with the prover's per-column par_iter without nesting rayon. The Bailey cache-blocking (lower DRAM traffic) is what pays at memory-bound sizes. - tables/mod.rs: LAMBDA_MAXROWS_SCALE_LOG2 env knob (default 0, clamped 0..=4) to A/B larger chunks/tables on the bench server without recompiling. - Feature passthrough: cli -> prover -> math/phased-fft. All opt-in (phased-fft feature off by default). Equivalence tests pass.
|
/bench 5 |
bench_prove.sh builds the feature branch and base branch with the same
--features string; base (e.g. main) may lack features that only exist on
the PR (like phased-fft). PR_EXTRA_FEATURES is appended to the feature/
current build only, never to base, so the comparison build doesn't break.
Usage: PR_EXTRA_FEATURES=phased-fft LAMBDA_MAXROWS_SCALE_LOG2=3 \
./scripts/bench_prove.sh <elf> 3
…branch DEFAULT_MAXROWS_SCALE_LOG2 = 3 so the automated main-vs-PR bench harness exercises the larger tables without injecting LAMBDA_MAXROWS_SCALE_LOG2 (which the harness does not propagate to the prove subprocess). CPU/MEMW 2^19->2^22, MUL/etc 2^20->2^23. Chunks pad to next_pow2(chunk_len), not to max_rows, and full power-of-two chunks aren't padded — so for large traces (e.g. fib_8M ~5% last-chunk rounding) the cached-LDE peak heap stays ~constant while chunk count drops 8x, cutting per-instance FRI/grinding/OOD overhead and pushing LDE sizes into the memory-bound regime where the phased FFT wins. Override with LAMBDA_MAXROWS_SCALE_LOG2=0 (production sizing) / =1 / =2 to sweep. Verified: scale=3 + phased proof verifies OK.
|
/bench 5 |
scale=3 (×8) regressed fib_8M prove time +41.4% / heap +13.5%: bigger chunks collapse the table-level parallelism the prover depends on (CPU 16->2 instances), and per-table last-chunk padding rounds up more. The per-instance overhead savings don't compensate. scale=2 (×4): CPU/MEMW 2^19->2^21, MUL/etc 2^20->2^22. CPU LDE 2^22 sits solidly in the memory-bound regime where the phased FFT wins standalone, while keeping ~4x more table instances than ×8 for parallelism. Override via LAMBDA_MAXROWS_SCALE_LOG2.
|
/bench 5 |
The GitHub bench workflow builds the PR with `--features jemalloc-stats` only, so the phased-fft-gated path never compiled in — the scale=2/3 regressions were bigger-tables + BOWERS, never bigger-tables + phased. Remove the cfg gates so dispatch_fft_natural always routes large LDE forward FFTs (n >= 2^21) through the sequential-inner phased path. main lacks this module entirely, so the bench baseline stays clean bowers and the comparison is (big tables + phased) vs (production + bowers) — the hypothesis we actually wanted to test. The phased-fft feature is kept as a no-op for compatibility. Verified: default-feature build proves + verifies.
|
/bench 5 |
Bigger tables regress fib_8M monotonically (×4 +16%, ×8 +41% prove time): fewer/bigger chunks collapse the table-level parallelism the prover depends on, and the last partial chunk pads up more. The phased FFT does not recover it even at memory-bound LDE sizes (×4 + phased +21.6%, no better than ×4 + bowers) — the per-column par_iter already saturates cores, so Bailey's extra transpose passes cost more than its locality saves. Phased FFT stays unconditionally wired (it fires for the 2^20 tables at LDE 2^21, ~neutral) and remains a 2.5x-faster standalone kernel for any future free-core FFT path. Chunk sizing back to production.
|
/bench 5 |
Adds a 2-phase Bailey four-step FFT (
bowers_phased_fft) as a sibling to the currentbowers_fft_opt_fused_parallel. The phased path reuses the existing Bowers butterfly kernels for the inner FFTs; what changes is the outer loop, which splits log_n into two halves of size ~⌈log_n/2⌉ each so the inner FFT working set fits L1.Wiring
Gated behind
math/phased-fftfeature (off by default).When
phased-fftis enabled AND the FFT input is ≥ 2²¹ AND is a power of two,dispatch_fft_natural(new helper) routes through the phased path. Otherwise it falls back tobowers_fft_opt_fused_parallel + in_place_bit_reverse_permute. Both paths produce natural-order output, so the three call sites inpolynomial.rs(coset_lde_full_into,coset_lde_full_expand,evaluate_fft_cpu_raw) collapsedispatch_fft(...) + in_place_bit_reverse_permute(...)into a singledispatch_fft_natural(...)invocation.Toggle for A/B benching:
cargo bench -p lambda-vm-prover ... # baseline
cargo bench -p lambda-vm-prover --features math/phased-fft # phased
Phased algorithm
Standard Bailey four-step. With N = M·K, view the input as M × K row-major:
The phase twiddle ω_N^(s·d) is computed on the fly (one ω_s = ω_N^s per row, stepped element by element) — avoids the N-element inter-phase twiddle table, which at log_n = 26 would cost ~1 GB.
API
bowers_phased_fft_with_bufaccepts a caller-owned scratch buffer that's reused across invocations (avoids the per-callVec::with_ capacity(N)).bowers_phased_fft_multicolruns many independent columns in parallel viafor_each_init, amortising the scratch buffer across rayon workers.Tests (8 new, all green)
Reference for equivalence tests is
bowers_fft_opt_fused + in_place_bit_reverse_permute→ natural order, which is byte-identical to the newbowers_phased_fftoutput.Benchmark
New
benches/phased_fft_benchmark.rscovers:Apple Silicon results (M-series, baseline reference):
Multi-column (
fft_multi_col):The phased path converges to parity as log_n grows but does NOT win on Apple Silicon for log_n ≤ 24 — the unified-memory cache subsystem hides the cache-cliff that this pattern is designed to attack. The expected payoff regime is x86 server hardware with sharper L3 boundaries; this PR is the working bed for that benchmark.
Out of scope (follow-ups)