feat(cuda): Round 2 composition parts + Round 3 OOD barycentric dispatch by ColoCarletti · Pull Request #637 · yetanotherco/lambda_vm

ColoCarletti · 2026-06-01T12:56:15Z

Summary

Builds on PR-2 (#582). Adds the GPU dispatch for Round 2 composition-poly LDE + Merkle commit (the number_of_parts > 2 branch, exercised today by the branch and shift tables since they have degree-3 transition constraints) and Round 3 OOD trace barycentric (reads PR-2's gpu_main / gpu_aux device handles, no host-side LDE traversal).

The cuda feature stays opt-in. CPU is the default and untouched. With cuda on, every new dispatch site falls through to the existing rayon CPU path when the type isn't Goldilocks/ext3, the size is below threshold, the GPU R1 handle is absent for this table, or the math-cuda call returns Err.
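The fall-through rule above can be sketched as an `Option`-returning GPU attempt followed by a CPU fallback. This is a minimal illustration, not the code in this PR: `Goldilocks`, `OtherField`, `gpu_sum`, `try_sum_gpu`, `sum_cpu` and `sum_dispatch` are hypothetical stand-ins, and the way the environment variable is parsed is an assumption (only its name and the 2^14 default come from the description).

```rust
use std::any::TypeId;
use std::env;

// Hypothetical marker types standing in for the real field types.
#[allow(dead_code)]
struct Goldilocks;
#[allow(dead_code)]
struct OtherField;

/// Threshold with a 2^14 default and an env override (parsing is an assumption).
fn bary_threshold() -> usize {
    env::var("LAMBDA_VM_GPU_BARY_THRESHOLD")
        .ok()
        .and_then(|v| v.parse::<usize>().ok())
        .unwrap_or(1 << 14)
}

/// Stand-in for a math-cuda call: fails on empty input, otherwise sums.
fn gpu_sum(values: &[u64]) -> Result<u64, String> {
    if values.is_empty() {
        return Err("empty input".to_string());
    }
    Ok(values.iter().fold(0u64, |acc, &v| acc.wrapping_add(v)))
}

/// Returns None on any of the four fall-through conditions.
fn try_sum_gpu<F: 'static>(values: &[u64], handle_present: bool, threshold: usize) -> Option<u64> {
    if TypeId::of::<F>() != TypeId::of::<Goldilocks>() {
        return None; // type is not the supported field
    }
    if values.len() < threshold {
        return None; // size below threshold
    }
    if !handle_present {
        return None; // no device handle for this table
    }
    gpu_sum(values).ok() // an Err from the GPU call also falls through
}

/// Existing CPU path, always available.
fn sum_cpu(values: &[u64]) -> u64 {
    values.iter().fold(0u64, |acc, &v| acc.wrapping_add(v))
}

fn sum_dispatch<F: 'static>(values: &[u64], handle_present: bool, threshold: usize) -> u64 {
    try_sum_gpu::<F>(values, handle_present, threshold).unwrap_or_else(|| sum_cpu(values))
}

fn main() {
    let values = [1u64, 2, 3];
    let threshold = bary_threshold();
    println!("threshold = {threshold}");
    println!("result = {}", sum_dispatch::<Goldilocks>(&values, true, threshold));
}
```

Because every failure mode collapses to `None`, the caller needs a single `unwrap_or_else` and the CPU result is identical whichever branch ran.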

R4 DEEP, R4 FRI and the _keep variant of R2 (which retains a gpu_composition_parts handle for R4 DEEP) are deferred to PR-4.

What's in

crypto/math-cuda/kernels/barycentric.cu (new, ~190 LoC). Four kernels: barycentric_{base,ext3}_batched and barycentric_{base,ext3}_batched_strided. The
strided variants read an LDE buffer at a row stride (used by R3 to pick the trace-size coset out of the device-resident LDE without materialising a slab).
crypto/math-cuda/src/barycentric.rs (new, ~215 LoC). Four host wrappers: barycentric_{base,ext3} for host data and barycentric_{base,ext3}_on_device
for &GpuLdeBase / &GpuLdeExt3 handles.
crypto/math-cuda/src/device.rs (+14). BARY_PTX const, four CudaFunction fields for the new kernels.
crypto/math-cuda/build.rs (+1). compile_ptx("barycentric.cu", ...).
crypto/math-cuda/src/lib.rs (+1). pub mod barycentric.
crypto/math-cuda/tests/ (new, ~530 LoC across 3 files). barycentric.rs and barycentric_strided.rs cover the four kernels against a CPU reference
summing the unscaled barycentric over base / ext3 columns with optional stride. comp_poly_tree.rs exercises the fused
evaluate_poly_coset_batch_ext3_into_with_merkle_tree end-to-end against the CPU commit_composition_polynomial for sizes from (log_n=2, blowup=2) up to
(log_n=14, blowup=2).
crypto/stark/src/gpu_lde.rs (+~390 LoC). New dispatches:
- try_evaluate_parts_on_lde_gpu (R2, non-_keep ext3 LDE for the parts > 2 branch).
- try_build_comp_poly_tree_gpu (R2 row-pair Keccak leaves + inner tree from host evals).
- try_barycentric_base_on_handle + try_barycentric_ext3_on_handle (R3 OOD reading PR-2's device handles).
- Host helpers ood_ext3_scalar and apply_ext3_scalar for the per-column scalar application.
- Two new atomic counters (gpu_parts_lde_calls, gpu_bary_calls) and a separate LAMBDA_VM_GPU_BARY_THRESHOLD env override (default 2^14). Both new
  counters reset via reset_all_gpu_call_counters().
crypto/stark/src/prover.rs (~70 LoC of changes). R2 dispatch in round_2_compute_composition_polynomial: pre-compute composition_poly_parts once, then
GPU-or-CPU for the LDE step, then GPU-or-CPU for the comp-poly Merkle commit. Round2 struct is unchanged.
crypto/stark/src/trace.rs (~80 LoC of changes). R3 dispatch in get_trace_evaluations_from_lde: per eval-point, try try_barycentric_base_on_handle for
main and try_barycentric_ext3_on_handle for aux; on None, run the existing rayon CPU loop. inv_denoms stays on CPU (documented stream-contention regression).
Added + 'static bound on the function's type params to support TypeId dispatch in the new GPU branches.
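As an illustration of what a CPU reference for the unscaled, strided barycentric sum could look like, here is a minimal sketch over the Goldilocks base field. It assumes the sum has the form `sum_i f(x_i) * x_i / (z - x_i)` over a coset `x_i = g * w^i`, that the host-side scalar is `(z^n - g^n) * n_inv * g_n_inv`, and that the LDE column is stored in natural order. All function names are hypothetical and the real kernels and tests may differ on each of these points.

```rust
// Goldilocks prime p = 2^64 - 2^32 + 1, held in u128 so products do not overflow.
const P: u128 = 0xFFFF_FFFF_0000_0001;

fn add(a: u64, b: u64) -> u64 { ((a as u128 + b as u128) % P) as u64 }
fn sub(a: u64, b: u64) -> u64 { ((a as u128 + P - b as u128) % P) as u64 }
fn mul(a: u64, b: u64) -> u64 { ((a as u128 * b as u128) % P) as u64 }

/// Square-and-multiply exponentiation.
fn pow(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1u64;
    while exp > 0 {
        if exp & 1 == 1 { acc = mul(acc, base); }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}

/// Inverse via Fermat's little theorem (input must be non-zero).
fn inv(a: u64) -> u64 { pow(a, (P - 2) as u64) }

/// Unscaled sum: reads every `stride`-th row of the LDE column.
fn unscaled_barycentric(lde_col: &[u64], points: &[u64], inv_denoms: &[u64], stride: usize) -> u64 {
    points.iter().zip(inv_denoms).enumerate().fold(0, |acc, (i, (&x, &d))| {
        add(acc, mul(lde_col[i * stride], mul(x, d)))
    })
}

/// Horner evaluation, used only as the ground truth.
fn eval_poly(coeffs: &[u64], x: u64) -> u64 {
    coeffs.iter().rev().fold(0, |acc, &c| add(mul(acc, x), c))
}

/// Returns (barycentric result, direct evaluation) for a degree-3 polynomial.
fn demo() -> (u64, u64) {
    let coeffs = [3u64, 5, 7, 11];
    let (g, z) = (7u64, 12345u64); // coset offset and out-of-domain point
    let w8 = 1u64 << 24;           // 2^24 has order 8 in Goldilocks
    // LDE of size 8 (blowup 2) in natural order.
    let lde_col: Vec<u64> = (0..8u64).map(|j| eval_poly(&coeffs, mul(g, pow(w8, j)))).collect();
    // Trace-size coset: every second LDE row.
    let (n, stride) = (4u64, 2usize);
    let points: Vec<u64> = (0..n).map(|i| mul(g, pow(w8, 2 * i))).collect();
    let inv_denoms: Vec<u64> = points.iter().map(|&x| inv(sub(z, x))).collect();
    let sum = unscaled_barycentric(&lde_col, &points, &inv_denoms, stride);
    // Host-side scalar: vanishing * n_inv * g_n_inv.
    let g_n = pow(g, n);
    let scalar = mul(mul(sub(pow(z, n), g_n), inv(n)), inv(g_n));
    (mul(scalar, sum), eval_poly(&coeffs, z))
}

fn main() {
    let (bary, direct) = demo();
    assert_eq!(bary, direct);
    println!("barycentric and direct evaluation agree: {bary}");
}
```

The stride lets the same buffer serve both the full LDE and the trace-size coset, which is why no separate slab has to be materialised.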

Known limitations carried over from PR-2

Parallel cuda tests still deadlock under default rayon (pinned-staging mutex contention). Workaround: --test-threads=1. Math-cuda-side fix, out of scope here.
LAMBDA_VM_GPU_LDE_THRESHOLD=0 forces small-domain tables through math-cuda kernels that panic at log_n < 1. Pre-existing regression, present on PR-2 baseline, not introduced by this PR.
Peak VRAM still scales with num_AIRs * per_table_LDE because R1 handles are retained across all rounds. PR-3 does not add new device-resident handles, so the ceiling is unchanged from PR-2. A follow-up PR will introduce a VRAM budget that gracefully falls back to non-_keep when retention would OOM the GPU.

Continuation of

Builds on PR-2 (#582). Base branch is feat/cuda-pr2-r1-gpu-commits. PR-4 (R4 DEEP + FRI + batch invert + R2 _keep) stacks on top.

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>


gabrielbosio

It would be nice to move the following test helpers to a new file and make the math-cuda tests use them to avoid code duplication:

type Fp / type Fp3 aliases
one rand_fp / rand_fp3 (the random generators, currently random_fp/rand_fp/rand_ext3)
ext3_to_u64s / u64s_to_ext3 (the interleaved packing)
the canonicalization family (canon, canon_fp3/canon3/canon_triplet, canon_triplet_raw)
reverse_index

This can be addressed in another PR though.
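A shared helper file along these lines might look like the sketch below. The names follow the list above, but the bodies are assumptions: ext3 elements are modelled as raw `[u64; 3]` triplets rather than the real `Fp3` type, `canon` is assumed to reduce a raw limb to its canonical Goldilocks representative, and `reverse_index` is assumed to be a bit reversal.

```rust
/// Goldilocks prime 2^64 - 2^32 + 1.
pub const GOLDILOCKS_P: u64 = 0xFFFF_FFFF_0000_0001;

/// Raw ext3 element: three base-field limbs (stand-in for the real Fp3 type).
pub type Ext3Raw = [u64; 3];

/// Canonical representative of a raw limb. One subtraction is enough
/// because 2 * p exceeds u64::MAX.
pub fn canon(x: u64) -> u64 {
    if x >= GOLDILOCKS_P { x - GOLDILOCKS_P } else { x }
}

/// Canonicalise every limb of an ext3 triplet.
pub fn canon_triplet(t: Ext3Raw) -> Ext3Raw {
    [canon(t[0]), canon(t[1]), canon(t[2])]
}

/// Interleaved packing: [a0, a1, a2, b0, b1, b2, ...].
pub fn ext3_to_u64s(values: &[Ext3Raw]) -> Vec<u64> {
    values.iter().flat_map(|t| t.iter().copied()).collect()
}

/// Inverse of `ext3_to_u64s`; panics if the length is not a multiple of 3.
pub fn u64s_to_ext3(flat: &[u64]) -> Vec<Ext3Raw> {
    assert!(flat.len() % 3 == 0, "length must be a multiple of 3");
    flat.chunks_exact(3).map(|c| [c[0], c[1], c[2]]).collect()
}

/// Bit-reverse `i` within `log_n` bits (assumes i < 2^log_n).
pub fn reverse_index(i: usize, log_n: u32) -> usize {
    if log_n == 0 { 0 } else { i.reverse_bits() >> (usize::BITS - log_n) }
}

fn main() {
    let values = vec![[1u64, 2, 3], [4, 5, 6]];
    let flat = ext3_to_u64s(&values);
    assert_eq!(u64s_to_ext3(&flat), values);
    println!("packed: {flat:?}, reverse_index(1, 3) = {}", reverse_index(1, 3));
}
```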


gabrielbosio

It would be good to add a make command that runs clippy with the cuda feature:

clippy-cuda:
    cargo clippy -p stark --features cuda --all-targets -- -D warnings -A clippy::op_ref

gabrielbosio · 2026-06-01T20:56:19Z

+}
+
+/// Trace-size threshold for the R3 OOD barycentric GPU path. Below this the
+/// rayon CPU path already completes in well under a millisecond and PCIe


rayon CPU path already completes in well under a millisecond

Where was the 2^14 threshold actually measured?

gabrielbosio · 2026-06-01T20:56:42Z

+/// The GPU kernels compute only the unscaled barycentric sum per column;
+/// applying this scalar on the host is one ext3 multiply per column, cheap


Suggested change

-/// The GPU kernels compute only the unscaled barycentric sum per column;
-/// applying this scalar on the host is one ext3 multiply per column, cheap
+/// The GPU kernels compute only the unscaled barycentric sum per column.
+/// Applying this scalar on the host is one ext3 multiply per column, cheap

gabrielbosio · 2026-06-01T21:04:08Z

+    // Hard assert (not debug_assert) because the unsafe `transmute_copy`
+    // below is UB if `E != Ext3`. Cost is one TypeId comparison per call.


Suggested change

-    // Hard assert (not debug_assert) because the unsafe `transmute_copy`
-    // below is UB if `E != Ext3`. Cost is one TypeId comparison per call.
+    // Avoids the `E != Ext3` path reaching the unsafe `transmute_copy` below
+    // that is UB in that case. Cost is one TypeId comparison per call.
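The pattern under discussion, a runtime `TypeId` check guarding a `transmute_copy`, can be sketched as follows. `Ext3` here is a hypothetical stand-in struct and both function names are made up for the example. The sketch only shows why the check has to be a hard assert (or an early return) rather than a `debug_assert`, which is compiled out in release builds.

```rust
use std::any::TypeId;
use std::mem;

/// Hypothetical stand-in for the real ext3 field element.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Ext3([u64; 3]);

/// Checked variant: returns None instead of reaching the unsafe block.
fn try_cast_from_ext3<E: 'static>(value: &Ext3) -> Option<E> {
    if TypeId::of::<E>() != TypeId::of::<Ext3>() {
        return None;
    }
    // SAFETY: the TypeId check above proves E is exactly Ext3, so source and
    // destination have the same size, alignment and validity invariants.
    Some(unsafe { mem::transmute_copy::<Ext3, E>(value) })
}

/// Asserting variant: panics in every build profile if E is not Ext3.
fn cast_from_ext3<E: 'static>(value: &Ext3) -> E {
    assert_eq!(
        TypeId::of::<E>(),
        TypeId::of::<Ext3>(),
        "cast_from_ext3 called with E != Ext3"
    );
    // SAFETY: guarded by the hard assert above.
    unsafe { mem::transmute_copy::<Ext3, E>(value) }
}

fn main() {
    let v = Ext3([1, 2, 3]);
    let same: Ext3 = cast_from_ext3::<Ext3>(&v);
    assert_eq!(same, v);
    assert!(try_cast_from_ext3::<u64>(&v).is_none());
    println!("cast ok: {same:?}");
}
```

The cost in either variant is one `TypeId` comparison per call, which matches the cost noted in the code comment.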

gabrielbosio · 2026-06-01T21:04:31Z

+        let final_ext3 = &s * &scalar_e;
+        // SAFETY: TypeId-checked at the caller. E == Ext3, identical layout.
+        let final_e: FieldElement<E> = unsafe {
+            core::mem::transmute_copy::<FieldElement<Ext3>, FieldElement<E>>(&final_ext3)


Import `core::mem::transmute_copy` at the top of the file instead of using the full path here.

gabrielbosio · 2026-06-01T21:05:02Z

+    offset: &FieldElement<F>,
+) -> Option<Vec<Vec<FieldElement<E>>>>
+where
+    F: math::field::traits::IsFFTField + IsField + IsSubFieldOf<E> + 'static,


Import `math::field::traits::IsFFTField` at the top of the file instead of using the full path here.

gabrielbosio · 2026-06-01T21:05:13Z

+    let mut weights_u64 = Vec::with_capacity(domain_size);
+    let mut w = FieldElement::<F>::one();
+    for _ in 0..domain_size {
+        // SAFETY: F == Goldilocks per TypeId check; FieldElement<Gl> is


Suggested change

-        // SAFETY: F == Goldilocks per TypeId check; FieldElement<Gl> is
+        // SAFETY: F == Goldilocks per TypeId check. FieldElement<Gl> is

gabrielbosio · 2026-06-01T21:05:57Z

+pub(crate) fn try_build_comp_poly_tree_gpu<E, B>(
+    lde_parts: &[Vec<FieldElement<E>>],
+) -> Option<crypto::merkle_tree::merkle::MerkleTree<B>>
+where
+    E: IsField + 'static,
+    B: crypto::merkle_tree::traits::IsMerkleTreeBackend<Node = [u8; 32]>,


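Import `crypto::merkle_tree::merkle::MerkleTree` and `crypto::merkle_tree::traits::IsMerkleTreeBackend` at the top of the file instead of using the full paths here.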

gabrielbosio · 2026-06-01T21:06:08Z

+        return None;
+    }
+
+    // SAFETY: E == Ext3 per TypeId check; FieldElement<Ext3> backing is


Suggested change

-    // SAFETY: E == Ext3 per TypeId check; FieldElement<Ext3> backing is
+    // SAFETY: E == Ext3 per TypeId check. FieldElement<Ext3> backing is

gabrielbosio · 2026-06-01T21:41:40Z

+    let tight_total_nodes = num_leaves
+        .checked_mul(2)
+        .and_then(|v| v.checked_sub(1))
+        .expect("tight_total_nodes: usize arithmetic overflow");


tight_total_nodes = num_leaves * 2 - 1 = (lde_size / 2) * 2 - 1 = lde_size - 1.
(lde_size / 2) * 2 = lde_size because lde_size is even, which is enforced in lambda_vm/crypto/stark/src/gpu_lde.rs (lines 837 to 839 in 0ffc661):

    if !lde_size.is_power_of_two() || lde_size < 2 {
        return None;
    }

Suggested change

-        .expect("tight_total_nodes: usize arithmetic overflow");
+    // lde_size is a power of two >= 2 (hence even), so 2*num_leaves == lde_size and
+    // tight_total_nodes = lde_size - 1 >= 1. No overflow or underflow possible.
+    let tight_total_nodes = lde_size - 1;
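The equivalence claimed in this comment can be checked with a small standalone sketch. Both function names are hypothetical; the guard is copied from the snippet quoted above and `num_leaves = lde_size / 2` follows the reviewer's derivation.

```rust
/// Original form: checked arithmetic on num_leaves = lde_size / 2.
fn tight_total_nodes_checked(lde_size: usize) -> Option<usize> {
    if !lde_size.is_power_of_two() || lde_size < 2 {
        return None;
    }
    let num_leaves = lde_size / 2;
    num_leaves.checked_mul(2).and_then(|v| v.checked_sub(1))
}

/// Suggested form: lde_size is even and >= 2, so the result is lde_size - 1.
fn tight_total_nodes_simple(lde_size: usize) -> Option<usize> {
    if !lde_size.is_power_of_two() || lde_size < 2 {
        return None;
    }
    Some(lde_size - 1)
}

fn main() {
    // Every power of two representable in usize, from 2 upwards.
    for log in 1..usize::BITS {
        let lde_size = 1usize << log;
        assert_eq!(
            tight_total_nodes_checked(lde_size),
            tight_total_nodes_simple(lde_size)
        );
    }
    // Inputs rejected by the guard behave the same in both forms.
    for bad in [0usize, 1, 3, 6] {
        assert_eq!(tight_total_nodes_checked(bad), None);
        assert_eq!(tight_total_nodes_simple(bad), None);
    }
    println!("both forms agree on all valid sizes");
}
```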

gabrielbosio · 2026-06-01T21:42:50Z

+    // 2*num_leaves which is a power of 2. `from_precomputed_nodes` only
+    // returns `None` when that invariant fails or `nodes` is empty.
+    Some(
+        crypto::merkle_tree::merkle::MerkleTree::<B>::from_precomputed_nodes(nodes)


Import `crypto::merkle_tree::merkle::MerkleTree` at the top of the file instead of using the full path here.

gabrielbosio · 2026-06-01T21:49:04Z

+/// threshold, or the math-cuda call returns `Err`.
+#[allow(clippy::too_many_arguments)]
+pub(crate) fn try_barycentric_base_on_handle<F, E>(
+    lde_trace: &crate::trace::LDETraceTable<F, E>,


Import `crate::trace::LDETraceTable` at the top of the file instead of using the full path here.

gabrielbosio · 2026-06-01T21:55:36Z

+/// R3 GPU dispatch: batched strided barycentric OOD evaluation over the main
+/// (base-field) LDE columns kept on device from R1 GPU dispatch. Reads
+/// `lde_trace.gpu_main()` directly, no H2D for the column data. Returns
+/// the OOD evaluations as `Vec<FieldElement<E>>` of length `num_main_cols`
+/// (already scaled by `vanishing * n_inv * g_n_inv`), or `None` if the GPU
+/// handle is absent, types don't match, the trace-size domain is below
+/// threshold, or the math-cuda call returns `Err`.


Suggested change

-/// R3 GPU dispatch: batched strided barycentric OOD evaluation over the main
-/// (base-field) LDE columns kept on device from R1 GPU dispatch. Reads
-/// `lde_trace.gpu_main()` directly, no H2D for the column data. Returns
-/// the OOD evaluations as `Vec<FieldElement<E>>` of length `num_main_cols`
-/// (already scaled by `vanishing * n_inv * g_n_inv`), or `None` if the GPU
-/// handle is absent, types don't match, the trace-size domain is below
-/// threshold, or the math-cuda call returns `Err`.
+/// R3 GPU dispatch: batched strided barycentric OOD evaluation over the main
+/// (base-field) LDE columns kept on device from R1. Operates on the
+/// device-resident LDE in place; only the coset points and inv_denoms are
+/// copied to the device, not the columns. Returns the OOD evaluations
+/// as `Vec<FieldElement<E>>` of length `num_main_cols` (already scaled by
+/// `vanishing * n_inv * g_n_inv`), or `None` if the GPU handle is absent,
+/// types don't match, the trace-size domain is below threshold, or the
+/// math-cuda call returns `Err`.

Waiting to resolve previous comments

ColoCarletti and others added 30 commits May 6, 2026 15:12

d1a0abf add first cuda files
79634ff fmt
ac6fbb5 fix clippy
2ceb3b0 gpu 2nd part
affceb1 feat(cuda): Round 1 GPU LDE+commit dispatch + device-resident handles
01172f2 merge main
c4627e1 Merge branch 'main' into feat/cuda-pr2-r1-gpu-commits
01aa5e4 comments fix
cfc5c19 Merge branch 'main' into feat/cuda-pr2-r1-gpu-commits
ea5696f Update crypto/stark/src/gpu_lde.rs
a8cf265 Update crypto/stark/src/gpu_lde.rs
fb8d31f Update crypto/stark/src/gpu_lde.rs
a79f2b5 Update crypto/stark/src/gpu_lde.rs
761a2c0 Update crypto/stark/src/gpu_lde.rs
e066e9d address reviews
7d3d0f0 fix review comments
cf80771 Merge remote-tracking branch 'origin/main' into feat/cuda-pr2-r1-gpu-… (…commits)
71aba0d address doc comment suggestions
83d91b8 Merge branch 'main' into feat/cuda-pr2-r1-gpu-commits
34cae4b fix
f076bf4 Merge branch 'main' into feat/cuda-pr2-r1-gpu-commits
a2cde0f Pass replay transcript to bus-balance call in verify_vm_minimal
46c305b Update crypto/math-cuda/src/device.rs
aca3dca Merge branch 'main' into feat/cuda-pr2-r1-gpu-commits
63d7c00 Update crypto/math-cuda/src/device.rs
eb16c02 Update crypto/math-cuda/src/device.rs
66925b1 Update crypto/math-cuda/src/device.rs
4e6daf3 Update crypto/math-cuda/src/lde.rs
4cd27d9 Update crypto/math-cuda/src/lde.rs
5fe390f Update crypto/math-cuda/src/lde.rs

Each "Update crypto/…" commit carries the trailer Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>.