Skip to content

feat(cuda): Round 1 GPU LDE+commit dispatch + device-resident handles#582

Merged
MauroToscano merged 39 commits into
mainfrom
feat/cuda-pr2-r1-gpu-commits
Jun 1, 2026
Merged

feat(cuda): Round 1 GPU LDE+commit dispatch + device-resident handles#582
MauroToscano merged 39 commits into
mainfrom
feat/cuda-pr2-r1-gpu-commits

Conversation

@ColoCarletti
Copy link
Copy Markdown
Collaborator

Summary

Wires crypto/stark into the math-cuda crate (PR-1a + PR-1b) for the first time: Round 1 main + aux trace LDE + Merkle commits run on GPU when the cuda feature is
enabled and the table is above threshold. The committed LDE buffer is kept on device as a GpuLdeBase / GpuLdeExt3 handle, attached to LDETraceTable so future
work (R3 OOD, R4 DEEP, R4 FRI) can read the LDE without re-paying PCIe transfer.

The cuda feature stays opt-in. The CPU path is the default and untouched. With the feature on, the dispatch falls through to CPU when the type isn't
Goldilocks/ext3, the size is below threshold, or the GPU path returns None.

This is the structural piece — the headline wall-time win arrives in later rounds (R4 DEEP + FRI), but the device-resident handle plumbing has to land first.

What's in

  • New file crypto/stark/src/gpu_lde.rs (~960 LoC) — dispatch surface: try_expand_columns_batched, try_expand_leaf_and_tree_batched_keep (R1 main),
    try_expand_leaf_and_tree_batched_ext3_keep (R1 aux), try_extend_two_halves_gpu (R2 quotient extend, dormant for current sizes but ships now), threshold + atomic
    call counters (LAMBDA_VM_GPU_LDE_THRESHOLD, default 2^19).
  • prover.rs (~250 LoC of changes): commit_main_trace / commit_preprocessed_trace return type refactored from a 6-tuple to a MainTraceCommitResult struct;
    Lde<F, FE> gains #[cfg(feature = "cuda")] gpu_main / gpu_aux fields; dispatch sites in expand_columns_to_lde and the multi_prove aux-build chunk; GPU handles
    thread through multi_prove into Round1Commitments and LDETraceTable.
  • trace.rs (~40 LoC): LDETraceTable gains gpu_main / gpu_aux: Option<...> fields + setters/accessors. R3 GPU branches are NOT added here — they'll land alongside
    the R3 GPU dispatch.
  • prover/tests/bench_single.rs (new) — small #[ignore]'d prove-fib bench so reviewers can verify R1 wall time on the GPU path.
  • crypto/crypto/src/merkle_tree/merkle.rs (+24 LoC) — small helper to construct a BatchedMerkleTree from a flat device-produced node buffer.

Verification

Check Result
cargo build --workspace --release
cargo build -p stark --features cuda --release
cargo build -p lambda-vm-prover --release (no cuda) ✓ — does not pull math-cuda
cargo fmt --check --all
cargo clippy --workspace --all-targets -- -D warnings -A clippy::op_ref
cargo test -p math-cuda --release 35 / 35 parity tests pass
cargo test -p lambda-vm-prover --features stark/cuda --release -- --test-threads=1 339 / 339

Run on RTX 5090, driver 595.58.03, CUDA 13.1.

Known limitation: parallel cuda tests deadlock

cargo test -p lambda-vm-prover --features cuda deadlocks under default rayon parallelism (rayon-on-rayon contention while holding math-cuda's pinned_staging
Mutex). Workaround: run with --test-threads=1. Inherited from the source CUDA work; proper fix lives on the math-cuda side and is out of scope here.

Continuation of

Builds on PR-1a (#576) and PR-1b (#578). Base branch is feat/cuda-pr1b-keccak-merkle, not main.

Comment thread crypto/stark/src/gpu_lde.rs
Comment thread crypto/stark/src/gpu_lde.rs
Comment thread crypto/stark/src/gpu_lde.rs
Comment thread crypto/stark/src/gpu_lde.rs
Comment thread crypto/crypto/src/merkle_tree/merkle.rs Outdated
Comment thread crypto/stark/src/prover.rs
@claude
Copy link
Copy Markdown

claude Bot commented May 8, 2026

Review: feat(cuda) Round 1 GPU LDE+commit dispatch

Good structural work — the CPU fallback is clean, feature-gating is consistent, and the fused LDE+leaf-hash+tree pipeline avoids the extra H2D that a separate step would need. A few issues need attention before merge.

High

1. type_name used to justify unsafe memory casts (see inline on gpu_lde.rs:99)
type_name::<T>() is explicitly documented as not stable across compiler versions and not suitable for type identity. Every unsafe cast in the file (*(e.value() as *const _ as *const u64), from_raw_parts on ext3 slices) is guarded solely by these string comparisons. A sealed-trait approach makes the same dispatch sound at compile time and removes the runtime fragility.

2. debug_assert protecting unsafe set_len (see inline on gpu_lde.rs:130)
debug_assert! is a no-op in release builds. Every col.set_len(lde_size) pattern in the file relies on the capacity precondition being true; if a caller regresses, this is UB (GPU writes past the allocation) with no diagnostic in production. Use a hard assert! or restructure with spare_capacity_mut + MaybeUninit.

Medium

3. GPU kernel failures panic (see inline on gpu_lde.rs:156)
All GPU calls end with .expect("..."). A transient CUDA error (OOM, context loss) kills the process. The functions should return Result so errors surface as ProvingError instead.

4. squared_offset parameter silently ignored (see inline on gpu_lde.rs:204)
try_extend_two_halves_gpu accepts squared_offset but discards it and rebuilds the offset from domain.coset_offset. This silently produces wrong weights if a future caller passes a different squared offset.

5. from_precomputed_nodes accepts unverified GPU output (see inline on merkle.rs:78)
Only the length is validated. A kernel bug or CUDA memory corruption produces a structurally-valid but cryptographically-wrong commitment with no detection. At minimum add a debug_assert-mode spot-check of the root node.

Low

6. AuxResult tuple uses () as phantom fourth field (see inline on prover.rs:1836)
Adds noise and forces #[allow(clippy::let_unit_value)]. Mirroring the main_gpu_handles side-vector pattern is cleaner.

7. Pervasive code duplication across gpu_lde.rs
try_expand_leaf_and_tree_batched, _keep, _ext3, _ext3_keep, try_expand_and_leaf_hash_batched, _ext3 share identical prologue blocks (~50 lines each). A small internal helper that validates preconditions and extracts the raw slices would halve the file size and make future changes safer.

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 8, 2026

Codex Code Review

Findings

  • Medium Security: gpu_lde.rs uses type_name checks plus raw pointer casts to reinterpret Vec<FieldElement<E>> as &mut [u64]. Rust does not guarantee FieldElement<GoldilocksField> has u64 layout because FieldElement is not #[repr(transparent)]; the ext3 paths make the same assumption for [u64; 3]. This is unsafe layout-dependent code and can become UB if the type layout changes. Fix by making the layout contract explicit with repr(transparent) and size/alignment assertions, or by using explicit conversion buffers.

  • Bug: prover.rs constructs Lde { main, aux } inside #[cfg(feature = "debug-checks")], but Lde has extra gpu_main and gpu_aux fields under feature = "cuda". Building with both cuda and debug-checks will fail. Add cfg-gated None fields there.

I could not run cargo check; rustup failed creating temp files under /home/runner/.rustup because that path is read-only in this environment.

@ColoCarletti ColoCarletti marked this pull request as ready for review May 13, 2026 21:07
@ColoCarletti ColoCarletti marked this pull request as draft May 13, 2026 21:08
Comment thread crypto/stark/src/gpu_lde.rs
@github-actions
Copy link
Copy Markdown

Codex Code Review

Findings:

  1. [Medium][Security] Unsafe layout casts are not backed by a Rust layout guarantee
    crypto/stark/src/gpu_lde.rs repeatedly treats FieldElement<GoldilocksField> / FieldElement<Degree3GoldilocksExtensionField> memory as u64 slices based on type_name checks. type_name is not a type/layout contract, and FieldElement is repr(Rust), so these raw casts can become UB if layout changes. This affects both reads and writes, including from_raw_parts_mut output views. Add explicit layout guarantees, e.g. #[repr(transparent)] where valid plus compile-time size/align assertions, or convert through safe field/component APIs.

  2. Bug: cuda + debug-checks feature combo does not construct Lde correctly
    crypto/stark/src/prover.rs builds Lde { main, aux } inside reconstruct_round1, but Lde has extra gpu_main and gpu_aux fields under #[cfg(feature = "cuda")] at crypto/stark/src/prover.rs. Enabling both features should fail to compile. Add #[cfg(feature = "cuda")] gpu_main: None, gpu_aux: None there.

I attempted cargo check -p stark --features 'cuda debug-checks', but this environment’s rustup temp dir is read-only, so I could not confirm via compiler output.

Comment thread crypto/stark/src/gpu_lde.rs
Comment thread crypto/stark/src/gpu_lde.rs Outdated
Comment thread crypto/stark/src/gpu_lde.rs Outdated
Comment thread crypto/crypto/src/merkle_tree/merkle.rs
@claude
Copy link
Copy Markdown

claude Bot commented May 13, 2026

Review summary

Five issues found across the new GPU dispatch layer and the Merkle helper.

High

  • type_name::<T>() is used as the sole proof that a generic type is memory-equivalent to u64 / [u64;3] before a pile of unsafe transmutes. The Rust docs explicitly say this string must not be considered a unique identifier of a type. The fix is a sealed unsafe trait IsU64Repr {} implemented only for the concrete types you control — the type system then enforces the invariant instead of a runtime string compare. All type_name + raw-pointer cast pairs throughout gpu_lde.rs share this issue.

Medium

  • debug_assert!(col.capacity() >= lde_size) immediately before unsafe { col.set_len(lde_size) } — the assert is stripped in --release, which is the only mode the GPU path runs. A capacity violation in release is silent UB. Use assert! or make the invariant structurally impossible to violate.
  • Every math_cuda call uses .expect(). A GPU OOM or driver error crashes the whole process; these should map to ProvingError and let the caller recover.

Low

  • try_extend_two_halves_gpu accepts squared_offset but immediately discards it (let _ = squared_offset), silently falling back to deriving the same value from domain. The parameter should either be removed or used — leaving it makes future callers believe they can influence the computation when they cannot.
  • from_precomputed_nodes validates only array length. A GPU bit-flip or kernel bug produces a structurally invalid MerkleTree with no diagnostic — consider a #[cfg(debug_assertions)] integrity walk to catch GPU correctness regressions in CI.

Comment thread crypto/crypto/src/merkle_tree/merkle.rs Outdated
Base automatically changed from feat/cuda-pr1b-keccak-merkle to main May 18, 2026 22:30
@ColoCarletti ColoCarletti marked this pull request as ready for review May 19, 2026 19:17
@github-actions
Copy link
Copy Markdown

Codex Code Review

Findings

  1. Build break: moved result is used after drop(result)
    prover/src/lib.rs:680 adds a second debug_assert_eq! that reads result.return_values after line 622 drops result. This will fail to compile. Remove the duplicate assertion or move it before drop(result).

  2. Build break with disk-spill: missing mmap_backing field
    crypto/crypto/src/merkle_tree/merkle.rs:147 constructs MerkleTree { root, nodes }, but under feature = "disk-spill" the struct also has mmap_backing. Add #[cfg(feature = "disk-spill")] mmap_backing: None.

  3. Build break with cuda + debug-checks
    crypto/stark/src/prover.rs:835 constructs Lde { main, aux }, but with cuda enabled Lde now also requires gpu_main and gpu_aux. Add None for both behind the same cfg.

  4. Test build break: missing Prover import
    prover/src/tests/prove_elfs_tests.rs:110 uses Prover::multi_prove without importing stark::prover::Prover.

No security vulnerabilities found in the reviewed diff beyond these compile-blocking correctness issues. I attempted cargo check -p lambda-vm-prover, but the sandbox cannot write to /home/runner/.rustup/tmp, so it could not run here.

@claude
Copy link
Copy Markdown

claude Bot commented May 19, 2026

Review: feat/cuda-pr2-r1-gpu-commits

This PR wires in GPU (CUDA) acceleration for Round 1 LDE + Merkle commitments via a new gpu_lde.rs (~960 lines) and refactors prover.rs to carry GPU handles through the commit pipeline. Issues found:


High

H1 — debug_assert! guards unsafe set_len — stripped in release builds
Every unsafe { col.set_len(lde_size) } is gated only by debug_assert!(col.capacity() >= lde_size). In release builds the assertion is compiled out. If a caller ever returns columns with insufficient capacity, set_len will point the Vec's length past its allocation — undefined behaviour (heap corruption). This is a safety invariant; use assert! unconditionally.
Affected lines: gpu_lde.rs:128, 297, 393, 482, 568, 656, 743, 822

H2 — type_name::<F>() used as a memory-safety gate
All unsafe casts and reinterpret-slices are enabled/disabled by comparing core::any::type_name::<F>() strings. The docs explicitly say the string is best-effort with no stability guarantee. The correct mechanism is std::any::TypeId::of::<F>() == TypeId::of::<GoldilocksField>(), which is the stable runtime type identity check. A type alias or a future compiler version could make two distinct types share a name, silently enabling wrong casts.
Affected lines: gpu_lde.rs:84, 97, 101, 190, 193, 273, 276, 364, 367, ...


Medium

M1 — GPU kernel failures panic the prover with no fallback
Every CUDA call terminates with .expect("GPU … failed"). A transient GPU error, OOM, or driver failure will panic the entire prover thread. These entry-point functions already return Option; they should return None (or Result) on kernel failure so the caller can fall back to the CPU path.

M2 — squared_offset parameter is silently discarded
try_extend_two_halves_gpu accepts squared_offset: &FieldElement<F> but immediately does let _ = squared_offset and re-derives the coset weight from domain.coset_offset. If a caller ever passes a domain whose coset_offset differs from the one used to compute squared_offset, the LDE output is silently wrong. Remove the parameter or use it.
Line: gpu_lde.rs:199

M3 — Plain * on total_nodes * 32 can overflow and silently truncate the slice
lde_size is computed with saturating_mul, but total_nodes = 2 * lde_size - 1 and then total_nodes * 32 use plain arithmetic. If lde_size is large (possible on a misconfigured domain), the multiplication wraps in release mode to a small number, the from_raw_parts_mut slice is smaller than the allocation, and the GPU writes past it — silent UB. Use checked_mul and return None on overflow.
Affected lines: gpu_lde.rs:412, 500, 586, 674, 760


Low

L1 — Only GPU_LDE_CALLS has a reset function
reset_gpu_lde_calls() resets GPU_LDE_CALLS but GPU_EXTEND_HALVES_CALLS, GPU_LEAF_HASH_CALLS, and GPU_MERKLE_TREE_CALLS accumulate forever. The bench test runs a warm-up pass then a profiled pass, so those three counters will be doubled relative to GPU_LDE_CALLS.

L2 — from_precomputed_nodes does not validate internal node consistency
The new constructor only checks that the node count forms a valid power-of-two tree. A GPU Merkle kernel bug producing an internally inconsistent tree (root doesn't match children) would be undetected — the prover commits to a corrupt tree and the verifier rejects the proof with no actionable diagnostic.

Comment thread crypto/stark/src/gpu_lde.rs Outdated
Comment thread crypto/stark/src/prover.rs Outdated
Comment thread crypto/stark/src/prover.rs Outdated
Comment thread crypto/stark/src/prover.rs Outdated
Comment thread crypto/stark/src/prover.rs
Comment thread crypto/stark/src/prover.rs Outdated
Comment thread crypto/stark/src/prover.rs Outdated
Comment thread crypto/stark/src/prover.rs
Comment thread crypto/stark/src/prover.rs Outdated
Comment thread crypto/stark/Cargo.toml
Comment thread crypto/stark/src/trace.rs Outdated
Comment thread prover/tests/bench_single.rs
Comment thread crypto/stark/src/trace.rs Outdated
Comment thread prover/src/tests/prove_elfs_tests.rs
Comment thread prover/tests/bench_single.rs Outdated
Comment thread Makefile
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Comment thread prover/src/tests/prove_elfs_tests.rs Outdated
ColoCarletti and others added 3 commits June 1, 2026 11:45
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
@MauroToscano MauroToscano added this pull request to the merge queue Jun 1, 2026
Merged via the queue into main with commit fdc0ca5 Jun 1, 2026
12 checks passed
@MauroToscano MauroToscano deleted the feat/cuda-pr2-r1-gpu-commits branch June 1, 2026 18:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gpu Related to GPU/CUDA development

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants