
feat(cuda): Round 2 composition parts + Round 3 OOD barycentric dispatch #637

Open
ColoCarletti wants to merge 45 commits into main from feat/cuda-pr3

@ColoCarletti
Collaborator

Summary

Builds on PR-2 (#582). Adds GPU dispatch for the Round 2 composition-poly LDE + Merkle commit (the number_of_parts > 2 branch, exercised today by the branch and shift tables, which have degree-3 transition constraints) and for the Round 3 OOD trace barycentric evaluation (which reads PR-2's gpu_main / gpu_aux device handles, with no host-side LDE traversal).

The cuda feature stays opt-in. CPU is the default and untouched. With cuda on, every new dispatch site falls through to the existing rayon CPU path when the type isn't Goldilocks/ext3, the size is below threshold, the GPU R1 handle is absent for this table, or the math-cuda call returns Err.
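
The fall-through contract can be sketched as below. This is a minimal illustration and not the code in gpu_lde.rs: try_sum_gpu, sum_cpu, gpu_kernel and GPU_CALLS are hypothetical stand-ins, the "GPU" call is a stub that always returns Err, and the device-handle check is left out. Only the environment variable name and the 2^14 default are taken from this PR.

```rust
use std::any::TypeId;
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical call counter, similar in spirit to gpu_bary_calls.
static GPU_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Reads the size threshold from the environment, defaulting to 2^14.
fn bary_threshold() -> usize {
    std::env::var("LAMBDA_VM_GPU_BARY_THRESHOLD")
        .ok()
        .and_then(|s| s.parse().ok())
        .unwrap_or(1 << 14)
}

/// Stub standing in for a math-cuda call. In this sketch it always fails.
fn gpu_kernel(_data: &[u64]) -> Result<u64, String> {
    Err("no device in this sketch".to_string())
}

/// Returns None whenever the GPU path does not apply, so the caller can
/// fall back to the CPU path.
fn try_sum_gpu<T: 'static>(data: &[T], threshold: usize) -> Option<u64> {
    // Type gate: only the supported element type takes the GPU path.
    if TypeId::of::<T>() != TypeId::of::<u64>() {
        return None;
    }
    // Size gate: small inputs stay on the CPU.
    if data.len() < threshold {
        return None;
    }
    // SAFETY: T == u64 per the TypeId check above, so the layouts match.
    let raw: &[u64] =
        unsafe { std::slice::from_raw_parts(data.as_ptr().cast::<u64>(), data.len()) };
    // Error gate: an Err from the kernel becomes None.
    let out = gpu_kernel(raw).ok()?;
    GPU_CALLS.fetch_add(1, Ordering::Relaxed);
    Some(out)
}

/// CPU reference path (stands in for the existing rayon loop).
fn sum_cpu<T: Copy + Into<u64>>(data: &[T]) -> u64 {
    data.iter().fold(0u64, |acc, &x| acc.wrapping_add(x.into()))
}

/// Dispatch site: GPU if possible, otherwise CPU.
fn sum<T: Copy + Into<u64> + 'static>(data: &[T], threshold: usize) -> u64 {
    try_sum_gpu(data, threshold).unwrap_or_else(|| sum_cpu(data))
}

fn main() {
    let data: Vec<u64> = (1..=4).collect();
    println!("sum = {}", sum(&data, bary_threshold()));
    println!("gpu calls = {}", GPU_CALLS.load(Ordering::Relaxed));
}
```

Returning Option from every try_* function keeps the fallback decision in one place at the call site, which is why the CPU path can stay untouched.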

R4 DEEP, R4 FRI and the _keep variant of R2 (which retains a gpu_composition_parts handle for R4 DEEP) are deferred to PR-4.

What's in

  • crypto/math-cuda/kernels/barycentric.cu (new, ~190 LoC). Four kernels: barycentric_{base,ext3}_batched and barycentric_{base,ext3}_batched_strided. The
    strided variants read an LDE buffer at a row stride (used by R3 to pick the trace-size coset out of the device-resident LDE without materialising a slab).
  • crypto/math-cuda/src/barycentric.rs (new, ~215 LoC). Four host wrappers: barycentric_{base,ext3} for host data and barycentric_{base,ext3}_on_device
    for &GpuLdeBase / &GpuLdeExt3 handles.
  • crypto/math-cuda/src/device.rs (+14). BARY_PTX const, four CudaFunction fields for the new kernels.
  • crypto/math-cuda/build.rs (+1). compile_ptx("barycentric.cu", ...).
  • crypto/math-cuda/src/lib.rs (+1). pub mod barycentric.
  • crypto/math-cuda/tests/ (new, ~530 LoC across 3 files). barycentric.rs and barycentric_strided.rs cover the four kernels against a CPU reference
    summing the unscaled barycentric over base / ext3 columns with optional stride. comp_poly_tree.rs exercises the fused
    evaluate_poly_coset_batch_ext3_into_with_merkle_tree end-to-end against the CPU commit_composition_polynomial for sizes from (log_n=2, blowup=2) up to
    (log_n=14, blowup=2).
  • crypto/stark/src/gpu_lde.rs (+~390 LoC). New dispatches:
    • try_evaluate_parts_on_lde_gpu (R2, non-_keep ext3 LDE for the parts > 2 branch).
    • try_build_comp_poly_tree_gpu (R2 row-pair Keccak leaves + inner tree from host evals).
    • try_barycentric_base_on_handle + try_barycentric_ext3_on_handle (R3 OOD reading PR-2's device handles).
    • Host helpers ood_ext3_scalar and apply_ext3_scalar for the per-column scalar application.
    • Two new atomic counters (gpu_parts_lde_calls, gpu_bary_calls) and a separate LAMBDA_VM_GPU_BARY_THRESHOLD env override (default 2^14). Both new
      counters reset via reset_all_gpu_call_counters().
  • crypto/stark/src/prover.rs (~70 LoC of changes). R2 dispatch in round_2_compute_composition_polynomial: pre-compute composition_poly_parts once, then
    GPU-or-CPU for the LDE step, then GPU-or-CPU for the comp-poly Merkle commit. Round2 struct is unchanged.
  • crypto/stark/src/trace.rs (~80 LoC of changes). R3 dispatch in get_trace_evaluations_from_lde: per eval-point, try try_barycentric_base_on_handle for
    main and try_barycentric_ext3_on_handle for aux; on None, run the existing rayon CPU loop. inv_denoms stays on CPU (documented stream-contention regression).
    Added + 'static bound on the function's type params to support TypeId dispatch in the new GPU branches.
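
For readers unfamiliar with the split between the kernel and the host scalar, here is a rough CPU sketch of a strided, unscaled barycentric sum followed by the `vanishing * n_inv * g_n_inv` scaling. It is an illustration under assumptions, not the reference used in the tests: all function names are hypothetical, the evaluation point z lives in the base field instead of ext3, and the LDE column is assumed to be in natural (not bit-reversed) order.

```rust
/// Goldilocks prime, 2^64 - 2^32 + 1.
const P: u128 = 0xFFFF_FFFF_0000_0001;

fn add(a: u64, b: u64) -> u64 { ((a as u128 + b as u128) % P) as u64 }
fn sub(a: u64, b: u64) -> u64 { ((a as u128 + P - b as u128) % P) as u64 }
fn mul(a: u64, b: u64) -> u64 { ((a as u128 * b as u128) % P) as u64 }

fn pow(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1u64;
    while exp > 0 {
        if exp & 1 == 1 { acc = mul(acc, base); }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}

/// Inverse via Fermat's little theorem.
fn inv(a: u64) -> u64 { pow(a, (P - 2) as u64) }

/// Unscaled sum over the trace-size coset: sum_i col[i * stride] * x_i / (z - x_i).
/// The stride picks every `stride`-th row out of the larger LDE column.
fn unscaled_barycentric_strided(
    lde_col: &[u64],
    stride: usize,
    points: &[u64],
    inv_denoms: &[u64],
) -> u64 {
    points.iter().zip(inv_denoms).enumerate().fold(0, |acc, (i, (&x, &d))| {
        add(acc, mul(mul(lde_col[i * stride], x), d))
    })
}

/// Host-side scalar: (z^n - g^n) * n^{-1} * g^{-n}.
fn ood_scalar(z: u64, g: u64, n: u64) -> u64 {
    let g_n = pow(g, n);
    mul(mul(sub(pow(z, n), g_n), inv(n)), inv(g_n))
}

/// Evaluates the column's polynomial at z from its values on the LDE coset.
fn ood_eval(lde_col: &[u64], stride: usize, g: u64, z: u64) -> u64 {
    let n = lde_col.len() / stride;
    // 7 is a quadratic non-residue mod P, so this power has order exactly n
    // when n is a power of two.
    let omega = pow(7, ((P - 1) / n as u128) as u64);
    let mut points = Vec::with_capacity(n);
    let mut x = g;
    for _ in 0..n {
        points.push(x);
        x = mul(x, omega);
    }
    let inv_denoms: Vec<u64> = points.iter().map(|&x| inv(sub(z, x))).collect();
    let sum = unscaled_barycentric_strided(lde_col, stride, &points, &inv_denoms);
    mul(ood_scalar(z, g, n as u64), sum)
}

/// Direct evaluation, used only to cross-check the result.
fn horner(coeffs: &[u64], x: u64) -> u64 {
    coeffs.iter().rev().fold(0, |acc, &c| add(mul(acc, x), c))
}

fn main() {
    let coeffs = [3, 1, 4, 1, 5, 9, 2, 6];
    let (n, blowup, g, z) = (8usize, 2usize, 7u64, 123_456_789u64);
    let omega_lde = pow(7, ((P - 1) / (n * blowup) as u128) as u64);
    let lde: Vec<u64> = (0..n * blowup)
        .map(|j| horner(&coeffs, mul(g, pow(omega_lde, j as u64))))
        .collect();
    assert_eq!(ood_eval(&lde, blowup, g, z), horner(&coeffs, z));
    println!("barycentric evaluation matches direct evaluation");
}
```

Because the scalar depends only on z, g and n, it is the same for every column, so applying it on the host costs one multiply per column.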

Known limitations carried over from PR-2

  • Parallel cuda tests still deadlock under default rayon (pinned-staging mutex contention). Workaround: --test-threads=1. Math-cuda-side fix, out of scope here.
  • LAMBDA_VM_GPU_LDE_THRESHOLD=0 forces small-domain tables through math-cuda kernels that panic at log_n < 1. Pre-existing regression, present on PR-2 baseline, not introduced by this PR.
  • Peak VRAM still scales with num_AIRs * per_table_LDE because R1 handles are retained across all rounds. PR-3 does not add new device-resident handles, so the ceiling is unchanged from PR-2. A follow-up PR will introduce a VRAM budget that gracefully falls back to non-_keep when retention would OOM the GPU.

Continuation of

Builds on PR-2 (#582). Base branch is feat/cuda-pr2-r1-gpu-commits. PR-4 (R4 DEEP + FRI + batch invert + R2 _keep) stacks on top.

ColoCarletti and others added 30 commits May 6, 2026 15:12
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Comment thread crypto/math-cuda/src/barycentric.rs Outdated
Comment thread crypto/math-cuda/tests/barycentric.rs Outdated
Comment thread crypto/math-cuda/tests/barycentric.rs Outdated
Comment thread crypto/math-cuda/tests/barycentric_strided.rs Outdated
Comment thread crypto/math-cuda/tests/barycentric_strided.rs Outdated
Comment thread crypto/math-cuda/tests/barycentric_strided.rs Outdated
Comment thread crypto/math-cuda/tests/barycentric_strided.rs Outdated
Comment thread crypto/math-cuda/tests/barycentric_strided.rs Outdated
Comment thread crypto/math-cuda/tests/barycentric_strided.rs Outdated
Collaborator

@gabrielbosio gabrielbosio left a comment

It would be nice to move the following test helpers to a new file and make the math-cuda tests use them to avoid code duplication:

  • type Fp / type Fp3 aliases
  • one rand_fp / rand_fp3 (the random generators, currently random_fp/rand_fp/rand_ext3)
  • ext3_to_u64s / u64s_to_ext3 (the interleaved packing)
  • the canonicalization family (canon, canon_fp3/canon3/canon_triplet, canon_triplet_raw)
  • reverse_index

This can be addressed in another PR though.
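
A possible shape for such a shared helper file, as a rough sketch only. The types here are plain-integer stand-ins (the real tests use the field element types), the signatures are guesses rather than the existing helpers, and rand_fp uses a small deterministic SplitMix64 generator purely to keep the example dependency-free.

```rust
// Hypothetical shared test helpers, e.g. in a common module under tests/.
type Fp = u64;      // stand-in for the base-field element type
type Fp3 = [Fp; 3]; // stand-in for the ext3 element type

/// Goldilocks prime, 2^64 - 2^32 + 1.
const P: u64 = 0xFFFF_FFFF_0000_0001;

/// Canonical representative in [0, P).
fn canon(x: Fp) -> Fp {
    if x >= P { x - P } else { x }
}

/// Component-wise canonicalization of an ext3 triplet.
fn canon_fp3(x: Fp3) -> Fp3 {
    x.map(canon)
}

/// Deterministic pseudo-random element (SplitMix64 step, then reduce).
fn rand_fp(state: &mut u64) -> Fp {
    *state = state.wrapping_add(0x9E37_79B9_7F4A_7C15);
    let mut z = *state;
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    canon(z ^ (z >> 31))
}

fn rand_fp3(state: &mut u64) -> Fp3 {
    [rand_fp(state), rand_fp(state), rand_fp(state)]
}

/// Interleaved packing: [a0, a1, a2, b0, b1, b2, ...].
fn ext3_to_u64s(xs: &[Fp3]) -> Vec<u64> {
    xs.iter().flatten().copied().collect()
}

/// Inverse of `ext3_to_u64s`; trailing values that do not fill a triplet are dropped.
fn u64s_to_ext3(raw: &[u64]) -> Vec<Fp3> {
    raw.chunks_exact(3).map(|c| [c[0], c[1], c[2]]).collect()
}

/// Bit-reversal of `i` within a power-of-two `size`.
fn reverse_index(i: usize, size: usize) -> usize {
    debug_assert!(size.is_power_of_two());
    if size == 1 {
        0
    } else {
        i.reverse_bits() >> (usize::BITS - size.trailing_zeros())
    }
}

fn main() {
    let mut seed = 42u64;
    let xs = vec![rand_fp3(&mut seed), rand_fp3(&mut seed)];
    assert_eq!(u64s_to_ext3(&ext3_to_u64s(&xs)), xs);
    assert_eq!(canon_fp3([P, 1, 2]), [0, 1, 2]);
    assert_eq!(reverse_index(1, 8), 4);
    println!("helpers ok");
}
```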

ColoCarletti and others added 4 commits June 1, 2026 17:24
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Collaborator

@gabrielbosio gabrielbosio left a comment

It would be good to add a make command that runs clippy with the cuda feature:

clippy-cuda:
    cargo clippy -p stark --features cuda --all-targets -- -D warnings -A clippy::op_ref

}

/// Trace-size threshold for the R3 OOD barycentric GPU path. Below this the
/// rayon CPU path already completes in well under a millisecond and PCIe
Collaborator

rayon CPU path already completes in well under a millisecond

Where was 2^14 actually measured?

Comment on lines +658 to +659
/// The GPU kernels compute only the unscaled barycentric sum per column;
/// applying this scalar on the host is one ext3 multiply per column, cheap
Collaborator

Suggested change
- /// The GPU kernels compute only the unscaled barycentric sum per column;
- /// applying this scalar on the host is one ext3 multiply per column, cheap
+ /// The GPU kernels compute only the unscaled barycentric sum per column.
+ /// Applying this scalar on the host is one ext3 multiply per column, cheap

Comment on lines +693 to +694
// Hard assert (not debug_assert) because the unsafe `transmute_copy`
// below is UB if `E != Ext3`. Cost is one TypeId comparison per call.
Collaborator

Suggested change
- // Hard assert (not debug_assert) because the unsafe `transmute_copy`
- // below is UB if `E != Ext3`. Cost is one TypeId comparison per call.
+ // Prevents the `E != Ext3` case from reaching the unsafe `transmute_copy`
+ // below, which would be UB. Cost is one TypeId comparison per call.
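
For context, the guard pattern under discussion can be sketched in isolation as follows. This is a simplified illustration, not the PR's code: Ext3 is a hypothetical stand-in struct, cast_from_ext3 is an invented name, and the sketch returns None on a type mismatch where the PR uses a hard assert.

```rust
use std::any::TypeId;
use std::mem;

/// Hypothetical stand-in for the ext3 field element type.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Ext3([u64; 3]);

/// Reinterprets an `Ext3` as the generic `E`, but only after proving at
/// runtime that `E` is exactly `Ext3`.
fn cast_from_ext3<E: Copy + 'static>(value: &Ext3) -> Option<E> {
    // The check must run in release builds too: without it, the
    // `transmute_copy` below would be UB for any `E` other than `Ext3`.
    if TypeId::of::<E>() != TypeId::of::<Ext3>() {
        return None;
    }
    // SAFETY: E == Ext3 per the TypeId check above, so size and layout match.
    Some(unsafe { mem::transmute_copy::<Ext3, E>(value) })
}

fn main() {
    let x = Ext3([1, 2, 3]);
    assert_eq!(cast_from_ext3::<Ext3>(&x), Some(x));
    assert!(cast_from_ext3::<u64>(&x).is_none());
    println!("guarded cast ok");
}
```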

let final_ext3 = &s * &scalar_e;
// SAFETY: TypeId-checked at the caller. E == Ext3, identical layout.
let final_e: FieldElement<E> = unsafe {
core::mem::transmute_copy::<FieldElement<Ext3>, FieldElement<E>>(&final_ext3)
Collaborator

Add the import statement at the top

offset: &FieldElement<F>,
) -> Option<Vec<Vec<FieldElement<E>>>>
where
F: math::field::traits::IsFFTField + IsField + IsSubFieldOf<E> + 'static,
Collaborator

Add the import statement at the top

let mut weights_u64 = Vec::with_capacity(domain_size);
let mut w = FieldElement::<F>::one();
for _ in 0..domain_size {
// SAFETY: F == Goldilocks per TypeId check; FieldElement<Gl> is
Collaborator

Suggested change
- // SAFETY: F == Goldilocks per TypeId check; FieldElement<Gl> is
+ // SAFETY: F == Goldilocks per TypeId check. FieldElement<Gl> is

Comment on lines +826 to +831
pub(crate) fn try_build_comp_poly_tree_gpu<E, B>(
lde_parts: &[Vec<FieldElement<E>>],
) -> Option<crypto::merkle_tree::merkle::MerkleTree<B>>
where
E: IsField + 'static,
B: crypto::merkle_tree::traits::IsMerkleTreeBackend<Node = [u8; 32]>,
Collaborator

Add the import statement at the top

return None;
}

// SAFETY: E == Ext3 per TypeId check; FieldElement<Ext3> backing is
Collaborator

Suggested change
- // SAFETY: E == Ext3 per TypeId check; FieldElement<Ext3> backing is
+ // SAFETY: E == Ext3 per TypeId check. FieldElement<Ext3> backing is

let tight_total_nodes = num_leaves
.checked_mul(2)
.and_then(|v| v.checked_sub(1))
.expect("tight_total_nodes: usize arithmetic overflow");
Collaborator

@gabrielbosio gabrielbosio Jun 1, 2026

tight_total_nodes = num_leaves * 2 - 1 = (lde_size / 2) * 2 - 1 = lde_size - 1
(lde_size / 2) * 2 = lde_size because lde_size is even, which is enforced here:

if !lde_size.is_power_of_two() || lde_size < 2 {
return None;
}

Suggested change
- .expect("tight_total_nodes: usize arithmetic overflow");
+ // lde_size is an even power of two >= 2, so 2*num_leaves == lde_size and
+ // tight_total_nodes = lde_size - 1 >= 1. No overflow or underflow possible.
+ let tight_total_nodes = lde_size - 1;
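
A quick way to sanity-check the equivalence is to compare both forms over every power of two that fits in usize. The two function names below are made up for this check, and num_leaves = lde_size / 2 is taken from the derivation above rather than from the surrounding code.

```rust
/// Checked form, mirroring the current arithmetic.
fn tight_total_nodes_checked(lde_size: usize) -> Option<usize> {
    if !lde_size.is_power_of_two() || lde_size < 2 {
        return None;
    }
    let num_leaves = lde_size / 2;
    num_leaves.checked_mul(2).and_then(|v| v.checked_sub(1))
}

/// Simplified form from the suggestion.
fn tight_total_nodes_simple(lde_size: usize) -> Option<usize> {
    if !lde_size.is_power_of_two() || lde_size < 2 {
        return None;
    }
    Some(lde_size - 1)
}

fn main() {
    // Every power of two from 2 up to the largest one representable in usize.
    for log in 1..usize::BITS {
        let n = 1usize << log;
        assert_eq!(tight_total_nodes_checked(n), tight_total_nodes_simple(n));
        assert!(tight_total_nodes_checked(n).is_some());
    }
    // Sizes rejected by the guard are rejected by both forms.
    for n in [0usize, 1, 3, 6] {
        assert_eq!(tight_total_nodes_checked(n), None);
        assert_eq!(tight_total_nodes_simple(n), None);
    }
    println!("both forms agree");
}
```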

// 2*num_leaves which is a power of 2. `from_precomputed_nodes` only
// returns `None` when that invariant fails or `nodes` is empty.
Some(
crypto::merkle_tree::merkle::MerkleTree::<B>::from_precomputed_nodes(nodes)
Collaborator

Add the import statement at the top

/// threshold, or the math-cuda call returns `Err`.
#[allow(clippy::too_many_arguments)]
pub(crate) fn try_barycentric_base_on_handle<F, E>(
lde_trace: &crate::trace::LDETraceTable<F, E>,
Collaborator

Add the import statement at the top

Comment on lines +897 to +903
/// R3 GPU dispatch: batched strided barycentric OOD evaluation over the main
/// (base-field) LDE columns kept on device from R1 GPU dispatch. Reads
/// `lde_trace.gpu_main()` directly, no H2D for the column data. Returns
/// the OOD evaluations as `Vec<FieldElement<E>>` of length `num_main_cols`
/// (already scaled by `vanishing * n_inv * g_n_inv`), or `None` if the GPU
/// handle is absent, types don't match, the trace-size domain is below
/// threshold, or the math-cuda call returns `Err`.
Collaborator

Suggested change
- /// R3 GPU dispatch: batched strided barycentric OOD evaluation over the main
- /// (base-field) LDE columns kept on device from R1 GPU dispatch. Reads
- /// `lde_trace.gpu_main()` directly, no H2D for the column data. Returns
- /// the OOD evaluations as `Vec<FieldElement<E>>` of length `num_main_cols`
- /// (already scaled by `vanishing * n_inv * g_n_inv`), or `None` if the GPU
- /// handle is absent, types don't match, the trace-size domain is below
- /// threshold, or the math-cuda call returns `Err`.
+ /// R3 GPU dispatch: batched strided barycentric OOD evaluation over the main
+ /// (base-field) LDE columns kept on device from R1. Operates on the
+ /// device-resident LDE in place; only the coset points and inv_denoms are
+ /// copied to the device, not the columns. Returns the OOD evaluations
+ /// as `Vec<FieldElement<E>>` of length `num_main_cols` (already scaled by
+ /// `vanishing * n_inv * g_n_inv`), or `None` if the GPU handle is absent,
+ /// types don't match, the trace-size domain is below threshold, or the
+ /// math-cuda call returns `Err`.

gabrielbosio previously approved these changes Jun 2, 2026
@gabrielbosio gabrielbosio dismissed their stale review June 2, 2026 15:22

Waiting to resolve previous comments


3 participants