Skip to content

feat(math/fft): phased Bailey four-step FFT + prover LDE wiring (opt-in)#633

Draft
diegokingston wants to merge 7 commits into
mainfrom
research/phased-fft-prover
Draft

feat(math/fft): phased Bailey four-step FFT + prover LDE wiring (opt-in)#633
diegokingston wants to merge 7 commits into
mainfrom
research/phased-fft-prover

Conversation

@diegokingston
Copy link
Copy Markdown
Collaborator

Adds a 2-phase Bailey four-step FFT (bowers_phased_fft) as a sibling to the current bowers_fft_opt_fused_parallel. The phased path reuses the existing Bowers butterfly kernels for the inner FFTs; what changes is the outer loop, which splits log_n into two halves of size ~⌈log_n/2⌉ each so the inner FFT working set fits L1.

Wiring

Gated behind math/phased-fft feature (off by default).

When phased-fft is enabled AND the FFT input is ≥ 2²¹ AND is a power of two, dispatch_fft_natural (new helper) routes through the phased path. Otherwise it falls back to
bowers_fft_opt_fused_parallel + in_place_bit_reverse_permute. Both paths produce natural-order output, so the three call sites in polynomial.rs (coset_lde_full_into, coset_lde_full_expand, evaluate_fft_cpu_raw) collapse dispatch_fft(...) + in_place_bit_reverse_permute(...) into a single
dispatch_fft_natural(...) invocation.

Toggle for A/B benching:
cargo bench -p lambda-vm-prover ... # baseline
cargo bench -p lambda-vm-prover --features math/phased-fft # phased

Phased algorithm

Standard Bailey four-step. With N = M·K, view the input as M × K row-major:

  1. Transpose to K × M (tiled, out-of-place).
  2. FFT_M on each of K rows (Bowers fused kernel).
  3. Multiply pointwise by ω_N^(s·d) (rows independent → parallel).
  4. Transpose to M × K.
  5. FFT_K on each of M rows.
  6. Final transpose to put the result in natural linear order.

The phase twiddle ω_N^(s·d) is computed on the fly (one ω_s = ω_N^s per row, stepped element by element) — avoids the N-element inter-phase twiddle table, which at log_n = 26 would cost ~1 GB.

API

pub struct PhasedFftContext<F>
impl<F: IsFFTField> PhasedFftContext<F> {
    pub fn new(log_n: usize) -> Result<Self, FFTError>;
}

pub fn tiled_transpose<T: Clone>(input, output, rows, cols);

pub fn bowers_phased_fft<F, E>(input)
    -> Result<(), FFTError>;
pub fn bowers_phased_fft_with_context<F, E>(input, ctx)
    -> Result<(), FFTError>;
pub fn bowers_phased_fft_with_buf<F, E>(input, ctx, buf)
    -> Result<(), FFTError>;
pub fn bowers_phased_fft_multicol<F, E>(columns, ctx)
    -> Result<(), FFTError>;

bowers_phased_fft_with_buf accepts a caller-owned scratch buffer that's reused across invocations (avoids the per-call Vec::with_ capacity(N)). bowers_phased_fft_multicol runs many independent columns in parallel via for_each_init, amortising the scratch buffer across rayon workers.

Tests (8 new, all green)

  • phased_equivalence_log4
  • phased_equivalence_log5_odd_split
  • phased_equivalence_log_n_4_through_12 (× 4 seeds each)
  • phased_equivalence_log_n_16
  • phased_equivalence_log_n_18_odd_split
  • phased_multicol_matches_per_column_reference (6 columns, log_n = 14)
  • tiled_transpose_2x3
  • tiled_transpose_roundtrip (rows = 17, cols = 23)

Reference for equivalence tests is
bowers_fft_opt_fused + in_place_bit_reverse_permute → natural order, which is byte-identical to the new bowers_phased_fft output.

Benchmark

New benches/phased_fft_benchmark.rs covers:

  • Single-column FFT, log_n ∈ {20, 22, 24}.
  • Multi-column LDE shape, (cols × log_n) ∈ {(74, 21), (49, 21), (24, 22), (12, 24)} — these match lambda_vm's actual table widths (CPU 74-col @ 2²⁰, MEMW 49-col, etc.).

Apple Silicon results (M-series, baseline reference):

log_n bowers (ms) phased (ms) ratio
20 10.4 11.8 1.13× slow
22 49.7 61.7 1.24× slow
24 217.1 250.1 1.15× slow

Multi-column (fft_multi_col):

cols × log_n bowers_par (ms) phased_multicol (ms) ratio
74 × 2²¹ 673 880 1.31× slow
49 × 2²¹ 488 576 1.18× slow
24 × 2²² 589 622 1.06× slow
12 × 2²⁴ 1342 1354 1.01× par

The phased path converges to parity as log_n grows but does NOT win on Apple Silicon for log_n ≤ 24 — the unified-memory cache subsystem hides the cache-cliff that this pattern is designed to attack. The expected payoff regime is x86 server hardware with sharper L3 boundaries; this PR is the working bed for that benchmark.

Out of scope (follow-ups)

  • PhasedFftContext caching / thread-local to amortise inner-twiddle builds across calls.
  • 3-phase split for log_n ≥ 23 (smaller inner FFTs, more transposes).
  • Interleaved multi-column data layout (Plonky3 AdjacentColMatrix style) — same outer loop but butterflies operate on C-wide vectors, the real architectural win for SIMD-rich uarches.
  • Phased inverse FFT (current PR only routes the forward path).

Adds a 2-phase Bailey four-step FFT (`bowers_phased_fft`) as a sibling
to the current `bowers_fft_opt_fused_parallel`. The phased path reuses
the existing Bowers butterfly kernels for the inner FFTs; what changes
is the outer loop, which splits log_n into two halves of size
~⌈log_n/2⌉ each so the inner FFT working set fits L1.

# Wiring

Gated behind `math/phased-fft` feature (off by default).

When `phased-fft` is enabled AND the FFT input is ≥ 2²¹ AND is a
power of two, `dispatch_fft_natural` (new helper) routes through the
phased path. Otherwise it falls back to
`bowers_fft_opt_fused_parallel + in_place_bit_reverse_permute`. Both
paths produce natural-order output, so the three call sites in
`polynomial.rs` (`coset_lde_full_into`, `coset_lde_full_expand`,
`evaluate_fft_cpu_raw`) collapse `dispatch_fft(...) +
in_place_bit_reverse_permute(...)` into a single
`dispatch_fft_natural(...)` invocation.

Toggle for A/B benching:
  cargo bench -p lambda-vm-prover ...                       # baseline
  cargo bench -p lambda-vm-prover --features math/phased-fft # phased

# Phased algorithm

Standard Bailey four-step. With N = M·K, view the input as M × K
row-major:
  1. Transpose to K × M (tiled, out-of-place).
  2. FFT_M on each of K rows (Bowers fused kernel).
  3. Multiply pointwise by ω_N^(s·d) (rows independent → parallel).
  4. Transpose to M × K.
  5. FFT_K on each of M rows.
  6. Final transpose to put the result in natural linear order.

The phase twiddle ω_N^(s·d) is computed on the fly (one ω_s = ω_N^s
per row, stepped element by element) — avoids the N-element
inter-phase twiddle table, which at log_n = 26 would cost ~1 GB.

# API

```
pub struct PhasedFftContext<F>
impl<F: IsFFTField> PhasedFftContext<F> {
    pub fn new(log_n: usize) -> Result<Self, FFTError>;
}

pub fn tiled_transpose<T: Clone>(input, output, rows, cols);

pub fn bowers_phased_fft<F, E>(input)
    -> Result<(), FFTError>;
pub fn bowers_phased_fft_with_context<F, E>(input, ctx)
    -> Result<(), FFTError>;
pub fn bowers_phased_fft_with_buf<F, E>(input, ctx, buf)
    -> Result<(), FFTError>;
pub fn bowers_phased_fft_multicol<F, E>(columns, ctx)
    -> Result<(), FFTError>;
```

`bowers_phased_fft_with_buf` accepts a caller-owned scratch buffer
that's reused across invocations (avoids the per-call `Vec::with_
capacity(N)`). `bowers_phased_fft_multicol` runs many independent
columns in parallel via `for_each_init`, amortising the scratch
buffer across rayon workers.

# Tests (8 new, all green)

  - phased_equivalence_log4
  - phased_equivalence_log5_odd_split
  - phased_equivalence_log_n_4_through_12 (× 4 seeds each)
  - phased_equivalence_log_n_16
  - phased_equivalence_log_n_18_odd_split
  - phased_multicol_matches_per_column_reference (6 columns, log_n = 14)
  - tiled_transpose_2x3
  - tiled_transpose_roundtrip (rows = 17, cols = 23)

Reference for equivalence tests is
`bowers_fft_opt_fused + in_place_bit_reverse_permute` → natural order,
which is byte-identical to the new `bowers_phased_fft` output.

# Benchmark

New `benches/phased_fft_benchmark.rs` covers:
  - Single-column FFT, log_n ∈ {20, 22, 24}.
  - Multi-column LDE shape, (cols × log_n) ∈ {(74, 21), (49, 21),
    (24, 22), (12, 24)} — these match lambda_vm's actual table widths
    (CPU 74-col @ 2²⁰, MEMW 49-col, etc.).

Apple Silicon results (M-series, baseline reference):

| log_n | bowers (ms) | phased (ms) | ratio       |
|-------|------------:|------------:|-------------|
|    20 |        10.4 |        11.8 | 1.13× slow  |
|    22 |        49.7 |        61.7 | 1.24× slow  |
|    24 |       217.1 |       250.1 | 1.15× slow  |

Multi-column (`fft_multi_col`):

| cols × log_n | bowers_par (ms) | phased_multicol (ms) | ratio       |
|--------------|----------------:|---------------------:|-------------|
|     74 × 2²¹ |             673 |                  880 | 1.31× slow  |
|     49 × 2²¹ |             488 |                  576 | 1.18× slow  |
|     24 × 2²² |             589 |                  622 | 1.06× slow  |
|     12 × 2²⁴ |            1342 |                 1354 | 1.01× par   |

The phased path converges to parity as log_n grows but does NOT win on
Apple Silicon for log_n ≤ 24 — the unified-memory cache subsystem
hides the cache-cliff that this pattern is designed to attack. The
expected payoff regime is x86 server hardware with sharper L3
boundaries; this PR is the working bed for that benchmark.

# Out of scope (follow-ups)

  - PhasedFftContext caching / thread-local to amortise inner-twiddle
    builds across calls.
  - 3-phase split for log_n ≥ 23 (smaller inner FFTs, more transposes).
  - Interleaved multi-column data layout (Plonky3 AdjacentColMatrix
    style) — same outer loop but butterflies operate on C-wide
    vectors, the real architectural win for SIMD-rich uarches.
  - Phased inverse FFT (current PR only routes the forward path).
@diegokingston
Copy link
Copy Markdown
Collaborator Author

/bench 5

@github-actions
Copy link
Copy Markdown

Codex Code Review

Findings

  • Low - potential DoS/panic: crypto/math/src/fft/phased_fft.rs:91
    PhasedFftContext::new(log_n) computes 1usize << log_n before validating log_n, and allocates twiddles_m/twiddles_k before checking that an N-th root exists. Invalid large orders can panic on oversized shifts, or allocate huge sub-twiddle tables before eventually returning an error. Validate log_n <= F::TWO_ADICITY and log_n < usize::BITS before shifting or allocating, and fetch omega_n before building sub-twiddles.

Tests

I attempted to run cargo test -p math --features phased-fft, but rustup failed because /home/runner/.rustup is read-only in this environment.

@claude
Copy link
Copy Markdown

claude Bot commented May 28, 2026

Code Review: phased Bailey four-step FFT

Overview: Adds a cache-blocked Bailey four-step NTT gated behind phased-fft feature flag. Clean prototype structure with good tests and honest benchmark results. The PR itself notes performance doesn't win on Apple Silicon yet — the feature is a research bed for x86 server testing. No security issues found.


Medium — Silent fallback masks feature-flag failures

In dispatch_fft_natural (polynomial.rs):

if let Ok(ctx) = PhasedFftContext::<F>::new(log_n) {
    ...
    return bowers_phased_fft_with_buf(...);
}
// silently falls through to Bowers path

If PhasedFftContext::new fails (field missing the required root of unity order), the code silently falls back to the baseline. A benchmark run with --features math/phased-fft can be quietly measuring the Bowers path the entire time with no indication. At minimum, log a warning or return the error.


Low — chunks_exact_mut vs par_chunks_mut inconsistency in apply_phase_twiddles

Serial path uses chunks_exact_mut (silently drops a remainder); parallel path uses par_chunks_mut (includes a partial last chunk). With the debug_assert_eq! they're equivalent in debug builds, but in release mode the two paths behave differently when called with mismatched dimensions. Both paths should use the same semantic (chunks_exact_mut / par_chunks_exact_mut, or both chunks_mut / par_chunks_mut).


Trivial — Wrong test comment

phased_equivalence_log_n_18_odd_split says "even split, no reuse" but log_n=18 → log_m=9, log_k=9 is the case where twiddles_k = None and twiddles_m is reused — the opposite of what the comment says.


Trivial — Missing newline at EOF in Cargo.toml

harness = false on the last line is missing a trailing newline.


Notes (acknowledged in PR, listing for completeness)

  • PhasedFftContext is rebuilt per dispatch_fft_natural call — the context build is O(M+K) twiddle allocations. The in-code comment acknowledges a thread-local cache as a follow-up. Fine for the prototype.
  • All PhasedFftContext fields are pub. Fine for a prototype, but worth making private before promoting beyond research/.
  • The pre-computed twiddles argument to dispatch_fft_natural is unused when the phased path fires. Minor waste that resolves itself once context caching lands.

// (inner LayerTwiddles); negligible relative to the FFT itself.
// A cached / thread-local context would be a follow-up
// optimisation if we want to land this for real.
if let Ok(ctx) = PhasedFftContext::<F>::new(log_n) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Medium: Silent fallback means a benchmark with --features math/phased-fft can silently run the Bowers path if context construction fails (e.g. the field doesn't support a root of unity of the required order). Consider surfacing this as an error or at least a debug!/eprintln! warning so callers know which path actually executed.

Suggested change
if let Ok(ctx) = PhasedFftContext::<F>::new(log_n) {
match PhasedFftContext::<F>::new(log_n) {
Ok(ctx) => {
let mut scratch: Vec<FieldElement<E>> = Vec::with_capacity(n);
return bowers_phased_fft_with_buf(buffer, &ctx, &mut scratch);
}
Err(e) => {
debug_assert!(false, "phased-fft: context build failed ({e:?}), falling back to Bowers");
}
}

Comment thread crypto/math/src/fft/phased_fft.rs Outdated
Comment on lines +315 to +318
#[cfg(feature = "parallel")]
let row_iter = buf.par_chunks_mut(cols).enumerate();
#[cfg(not(feature = "parallel"))]
let row_iter = buf.chunks_exact_mut(cols).enumerate();
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Low: The serial and parallel paths have different overflow behaviour if called with mismatched dimensions. chunks_exact_mut silently drops a remainder while par_chunks_mut would include a short last chunk. Covered by the debug_assert_eq! in debug builds, but divergent in release. Prefer matching semantics — both chunks_exact_mut/par_chunks_exact_mut, or both chunks_mut/par_chunks_mut.

fn phased_equivalence_log_n_18_odd_split() {
// log_n = 18 → log_m = 9, log_k = 9 (even split, no reuse).
let input = random_input(18, 7);
let expected = reference_fft(&input);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trivial: Comment says "even split, no reuse" but log_n=18 → log_m=9, log_k=9 is precisely the case where twiddles_k = None and twiddles_m is reused for both phases. Should read something like "even split, twiddles_m reused for both phases".

@gabrielbosio
Copy link
Copy Markdown
Collaborator

/bench

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 28, 2026

Benchmark — fib_iterative_8M (median of 5)

Table parallelism: auto (cores / 3)

Metric main PR Δ
Peak heap 51665 MB 51860 MB +195 MB (+0.4%) ⚪
Prove time 25.030s 25.550s +0.520s (+2.1%) ⚪

✅ No significant change.

✅ Low variance (time: 4.3%, heap: 0.5%)

Commit: 9042f08 · Baseline: built from main · Runner: self-hosted bench

…g prover wiring

- phased_fft.rs: parallelise the tiled transpose (tiled_transpose_par),
  unify the Bailey worker behind phased_fft_core(parallel) and drop the
  redundant final clone via a parallel copy. Single-column phased is now
  ~2.5x faster than bowers on a large isolated FFT (parallel transpose was
  the missing piece; bowers' early layers parallelise poorly).
- Add bowers_phased_fft_seq_with_buf (sequential inner) for callers that
  already parallelise at a coarser grain.
- polynomial.rs: dispatch_fft_natural now routes the LDE forward FFT through
  the SEQUENTIAL-inner phased path, so it composes with the prover's
  per-column par_iter without nesting rayon. The Bailey cache-blocking
  (lower DRAM traffic) is what pays at memory-bound sizes.
- tables/mod.rs: LAMBDA_MAXROWS_SCALE_LOG2 env knob (default 0, clamped 0..=4)
  to A/B larger chunks/tables on the bench server without recompiling.
- Feature passthrough: cli -> prover -> math/phased-fft.

All opt-in (phased-fft feature off by default). Equivalence tests pass.
@diegokingston
Copy link
Copy Markdown
Collaborator Author

/bench 5

bench_prove.sh builds the feature branch and base branch with the same
--features string; base (e.g. main) may lack features that only exist on
the PR (like phased-fft). PR_EXTRA_FEATURES is appended to the feature/
current build only, never to base, so the comparison build doesn't break.

Usage: PR_EXTRA_FEATURES=phased-fft LAMBDA_MAXROWS_SCALE_LOG2=3 \
         ./scripts/bench_prove.sh <elf> 3
…branch

DEFAULT_MAXROWS_SCALE_LOG2 = 3 so the automated main-vs-PR bench harness
exercises the larger tables without injecting LAMBDA_MAXROWS_SCALE_LOG2
(which the harness does not propagate to the prove subprocess). CPU/MEMW
2^19->2^22, MUL/etc 2^20->2^23.

Chunks pad to next_pow2(chunk_len), not to max_rows, and full power-of-two
chunks aren't padded — so for large traces (e.g. fib_8M ~5% last-chunk
rounding) the cached-LDE peak heap stays ~constant while chunk count drops
8x, cutting per-instance FRI/grinding/OOD overhead and pushing LDE sizes
into the memory-bound regime where the phased FFT wins. Override with
LAMBDA_MAXROWS_SCALE_LOG2=0 (production sizing) / =1 / =2 to sweep.

Verified: scale=3 + phased proof verifies OK.
@diegokingston
Copy link
Copy Markdown
Collaborator Author

/bench 5

scale=3 (×8) regressed fib_8M prove time +41.4% / heap +13.5%: bigger
chunks collapse the table-level parallelism the prover depends on (CPU
16->2 instances), and per-table last-chunk padding rounds up more. The
per-instance overhead savings don't compensate.

scale=2 (×4): CPU/MEMW 2^19->2^21, MUL/etc 2^20->2^22. CPU LDE 2^22 sits
solidly in the memory-bound regime where the phased FFT wins standalone,
while keeping ~4x more table instances than ×8 for parallelism. Override
via LAMBDA_MAXROWS_SCALE_LOG2.
@diegokingston
Copy link
Copy Markdown
Collaborator Author

/bench 5

The GitHub bench workflow builds the PR with `--features jemalloc-stats`
only, so the phased-fft-gated path never compiled in — the scale=2/3
regressions were bigger-tables + BOWERS, never bigger-tables + phased.

Remove the cfg gates so dispatch_fft_natural always routes large LDE
forward FFTs (n >= 2^21) through the sequential-inner phased path. main
lacks this module entirely, so the bench baseline stays clean bowers and
the comparison is (big tables + phased) vs (production + bowers) — the
hypothesis we actually wanted to test. The phased-fft feature is kept as
a no-op for compatibility.

Verified: default-feature build proves + verifies.
@diegokingston
Copy link
Copy Markdown
Collaborator Author

/bench 5

Bigger tables regress fib_8M monotonically (×4 +16%, ×8 +41% prove time):
fewer/bigger chunks collapse the table-level parallelism the prover
depends on, and the last partial chunk pads up more. The phased FFT does
not recover it even at memory-bound LDE sizes (×4 + phased +21.6%, no
better than ×4 + bowers) — the per-column par_iter already saturates
cores, so Bailey's extra transpose passes cost more than its locality
saves.

Phased FFT stays unconditionally wired (it fires for the 2^20 tables at
LDE 2^21, ~neutral) and remains a 2.5x-faster standalone kernel for any
future free-core FFT path. Chunk sizing back to production.
@diegokingston
Copy link
Copy Markdown
Collaborator Author

/bench 5

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants