feat(cuda): Round 2 composition parts + Round 3 OOD barycentric dispatch #637
ColoCarletti wants to merge 45 commits into
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
gabrielbosio left a comment
It would be nice to move the following test helpers to a new file and make the math-cuda tests use them to avoid code duplication:
- type `Fp` / type `Fp3` aliases
- one `rand_fp` / `rand_fp3` (the random generators, currently `random_fp` / `rand_fp` / `rand_ext3`)
- `ext3_to_u64s` / `u64s_to_ext3` (the interleaved packing)
- the canonicalization family (`canon`, `canon_fp3` / `canon3` / `canon_triplet`, `canon_triplet_raw`)
- `reverse_index`
This can be addressed in another PR though.
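A rough sketch of what part of such a shared helper file could look like. The `Fp3` alias here is a hypothetical stand-in (three canonical `u64` limbs) rather than the repository's real ext3 type, and treating `reverse_index` as a plain bit reversal is an assumption; only the helper names and the "interleaved packing" description come from the comment above.

```rust
/// Hypothetical stand-in for an ext3 element: three canonical base-field limbs.
pub type Fp3 = [u64; 3];

/// Interleaved packing: [a0, a1, a2, b0, b1, b2, ...].
pub fn ext3_to_u64s(col: &[Fp3]) -> Vec<u64> {
    col.iter().flat_map(|e| e.iter().copied()).collect()
}

/// Inverse of `ext3_to_u64s`; the packed length must be a multiple of 3.
pub fn u64s_to_ext3(raw: &[u64]) -> Vec<Fp3> {
    assert_eq!(raw.len() % 3, 0, "packed length must be a multiple of 3");
    raw.chunks_exact(3).map(|c| [c[0], c[1], c[2]]).collect()
}

/// Reverses the low log2(size) bits of `i` (assumed semantics).
/// `size` must be a power of two.
pub fn reverse_index(i: usize, size: usize) -> usize {
    assert!(size.is_power_of_two(), "size must be a power of two");
    if size == 1 {
        return 0; // zero index bits: avoid shifting by the full word width
    }
    i.reverse_bits() >> (usize::BITS - size.trailing_zeros())
}
```

Each test file would then pull these in from one module instead of redefining them locally.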
gabrielbosio left a comment
It would be good to add a make command that runs clippy with the cuda feature:
clippy-cuda:
cargo clippy -p stark --features cuda --all-targets -- -D warnings -A clippy::op_ref
    }

    /// Trace-size threshold for the R3 OOD barycentric GPU path. Below this the
    /// rayon CPU path already completes in well under a millisecond and PCIe
"rayon CPU path already completes in well under a millisecond"
Where was the 2^14 threshold actually measured?
    /// The GPU kernels compute only the unscaled barycentric sum per column;
    /// applying this scalar on the host is one ext3 multiply per column, cheap
Suggested change:
    - /// The GPU kernels compute only the unscaled barycentric sum per column;
    - /// applying this scalar on the host is one ext3 multiply per column, cheap
    + /// The GPU kernels compute only the unscaled barycentric sum per column.
    + /// Applying this scalar on the host is one ext3 multiply per column, cheap
    // Hard assert (not debug_assert) because the unsafe `transmute_copy`
    // below is UB if `E != Ext3`. Cost is one TypeId comparison per call.
Suggested change:
    - // Hard assert (not debug_assert) because the unsafe `transmute_copy`
    - // below is UB if `E != Ext3`. Cost is one TypeId comparison per call.
    + // Avoids the `E != Ext3` path reaching the unsafe `transmute_copy` below
    + // that is UB in that case. Cost is one TypeId comparison per call.
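For readers unfamiliar with the pattern this comment guards, here is a minimal standalone sketch of a `TypeId` hard assert in front of `transmute_copy`. `Ext3` and `double_if_ext3` are hypothetical stand-ins, not the PR's types or functions; only the shape of the check is meant to match.

```rust
use core::any::TypeId;
use core::mem::transmute_copy;

/// Hypothetical stand-in for the concrete ext3 element type.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Ext3(pub [u64; 3]);

/// Doubles every limb of `x`. Only valid when `E` is exactly `Ext3`.
pub fn double_if_ext3<E: Copy + 'static>(x: &E) -> E {
    // Hard assert (also active in release builds): the transmutes below
    // would be undefined behaviour for any other `E`.
    assert_eq!(TypeId::of::<E>(), TypeId::of::<Ext3>(), "E must be Ext3");

    // SAFETY: the assert above guarantees E == Ext3, so source and
    // destination are the same type with identical size and layout.
    let concrete: Ext3 = unsafe { transmute_copy::<E, Ext3>(x) };
    let doubled = Ext3(concrete.0.map(|limb| limb.wrapping_mul(2)));

    // SAFETY: same TypeId check as above; converting back to the same type.
    unsafe { transmute_copy::<Ext3, E>(&doubled) }
}
```

With `debug_assert_eq!` instead, a release build would reach the `unsafe` block for a mismatched `E`, which is the case the original comment calls out.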
    let final_ext3 = &s * &scalar_e;
    // SAFETY: TypeId-checked at the caller. E == Ext3, identical layout.
    let final_e: FieldElement<E> = unsafe {
        core::mem::transmute_copy::<FieldElement<Ext3>, FieldElement<E>>(&final_ext3)
Add the import statement at the top
        offset: &FieldElement<F>,
    ) -> Option<Vec<Vec<FieldElement<E>>>>
    where
        F: math::field::traits::IsFFTField + IsField + IsSubFieldOf<E> + 'static,
Add the import statement at the top
    let mut weights_u64 = Vec::with_capacity(domain_size);
    let mut w = FieldElement::<F>::one();
    for _ in 0..domain_size {
        // SAFETY: F == Goldilocks per TypeId check; FieldElement<Gl> is
Suggested change:
    - // SAFETY: F == Goldilocks per TypeId check; FieldElement<Gl> is
    + // SAFETY: F == Goldilocks per TypeId check. FieldElement<Gl> is
    pub(crate) fn try_build_comp_poly_tree_gpu<E, B>(
        lde_parts: &[Vec<FieldElement<E>>],
    ) -> Option<crypto::merkle_tree::merkle::MerkleTree<B>>
    where
        E: IsField + 'static,
        B: crypto::merkle_tree::traits::IsMerkleTreeBackend<Node = [u8; 32]>,
Add the import statement at the top
        return None;
    }

    // SAFETY: E == Ext3 per TypeId check; FieldElement<Ext3> backing is
Suggested change:
    - // SAFETY: E == Ext3 per TypeId check; FieldElement<Ext3> backing is
    + // SAFETY: E == Ext3 per TypeId check. FieldElement<Ext3> backing is
    let tight_total_nodes = num_leaves
        .checked_mul(2)
        .and_then(|v| v.checked_sub(1))
        .expect("tight_total_nodes: usize arithmetic overflow");
tight_total_nodes = num_leaves * 2 - 1 = (lde_size / 2) * 2 - 1 = lde_size - 1.
(lde_size / 2) * 2 = lde_size because lde_size is even (enforced here: lambda_vm/crypto/stark/src/gpu_lde.rs, lines 837 to 839 in 0ffc661).
Suggested change:
    - .expect("tight_total_nodes: usize arithmetic overflow");
    + // lde_size is an even power of two >= 2, so 2*num_leaves == lde_size and
    + // tight_total_nodes = lde_size - 1 >= 1. No overflow or underflow possible.
    + let tight_total_nodes = lde_size - 1;
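The arithmetic in this thread can be checked with a small standalone function. The name `tight_total_nodes` is reused here for a hypothetical free function, and `num_leaves = lde_size / 2` is taken from the reviewer's derivation rather than from the full source.

```rust
/// Node count of a tight binary tree over `lde_size / 2` row-pair leaves.
/// Hypothetical helper mirroring the reviewer's simplification.
pub fn tight_total_nodes(lde_size: usize) -> usize {
    // Precondition assumed from the review comment: a power of two >= 2.
    assert!(lde_size >= 2 && lde_size.is_power_of_two());

    let num_leaves = lde_size / 2;
    // Original form: checked arithmetic on 2 * num_leaves - 1.
    let checked = num_leaves
        .checked_mul(2)
        .and_then(|v| v.checked_sub(1))
        .expect("tight_total_nodes: usize arithmetic overflow");

    // Simplified form: lde_size is even, so (lde_size / 2) * 2 == lde_size.
    let simplified = lde_size - 1;
    assert_eq!(checked, simplified);
    simplified
}
```

Because `2 * num_leaves` never exceeds the `lde_size` value it was derived from, the checked operations cannot fail under that precondition.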
    // 2*num_leaves which is a power of 2. `from_precomputed_nodes` only
    // returns `None` when that invariant fails or `nodes` is empty.
    Some(
        crypto::merkle_tree::merkle::MerkleTree::<B>::from_precomputed_nodes(nodes)
Add the import statement at the top
    /// threshold, or the math-cuda call returns `Err`.
    #[allow(clippy::too_many_arguments)]
    pub(crate) fn try_barycentric_base_on_handle<F, E>(
        lde_trace: &crate::trace::LDETraceTable<F, E>,
Add the import statement at the top
    /// R3 GPU dispatch: batched strided barycentric OOD evaluation over the main
    /// (base-field) LDE columns kept on device from R1 GPU dispatch. Reads
    /// `lde_trace.gpu_main()` directly, no H2D for the column data. Returns
    /// the OOD evaluations as `Vec<FieldElement<E>>` of length `num_main_cols`
    /// (already scaled by `vanishing * n_inv * g_n_inv`), or `None` if the GPU
    /// handle is absent, types don't match, the trace-size domain is below
    /// threshold, or the math-cuda call returns `Err`.
Suggested change:
    - /// R3 GPU dispatch: batched strided barycentric OOD evaluation over the main
    - /// (base-field) LDE columns kept on device from R1 GPU dispatch. Reads
    - /// `lde_trace.gpu_main()` directly, no H2D for the column data. Returns
    - /// the OOD evaluations as `Vec<FieldElement<E>>` of length `num_main_cols`
    - /// (already scaled by `vanishing * n_inv * g_n_inv`), or `None` if the GPU
    - /// handle is absent, types don't match, the trace-size domain is below
    - /// threshold, or the math-cuda call returns `Err`.
    + /// R3 GPU dispatch: batched strided barycentric OOD evaluation over the main
    + /// (base-field) LDE columns kept on device from R1. Operates on the
    + /// device-resident LDE in place; only the coset points and inv_denoms are
    + /// copied to the device, not the columns. Returns the OOD evaluations
    + /// as `Vec<FieldElement<E>>` of length `num_main_cols` (already scaled by
    + /// `vanishing * n_inv * g_n_inv`), or `None` if the GPU handle is absent,
    + /// types don't match, the trace-size domain is below threshold, or the
    + /// math-cuda call returns `Err`.
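To make the "unscaled sum on the GPU, scalar on the host" split in this doc comment concrete, here is a CPU-only sketch over plain `u64` Goldilocks values. It assumes the standard coset barycentric formula f(z) = (z^n - g^n) / (n * g^n) * sum_i f(x_i) * x_i / (z - x_i), a natural-order LDE, and 7 as the multiplicative generator; none of the function names are from the PR, and the actual kernels may order or weight the terms differently.

```rust
// Toy Goldilocks arithmetic on canonical u64 values (all inputs assumed < P).
const P: u64 = 0xFFFF_FFFF_0000_0001; // 2^64 - 2^32 + 1

fn add(a: u64, b: u64) -> u64 { ((a as u128 + b as u128) % P as u128) as u64 }
fn sub(a: u64, b: u64) -> u64 { ((a as u128 + P as u128 - b as u128) % P as u128) as u64 }
fn mul(a: u64, b: u64) -> u64 { ((a as u128 * b as u128) % P as u128) as u64 }

fn pow(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1u64;
    while exp > 0 {
        if exp & 1 == 1 { acc = mul(acc, base); }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}

// Fermat inverse; returns 0 for a == 0.
fn inv(a: u64) -> u64 { pow(a, P - 2) }

/// Natural-order coset {g * w^j} of power-of-two size `len`.
fn coset(len: u64, g: u64) -> Vec<u64> {
    let w = pow(7, (P - 1) / len); // assumes 7 generates the multiplicative group
    (0..len).map(|j| mul(g, pow(w, j))).collect()
}

/// Horner evaluation, lowest-degree coefficient first.
fn eval_poly(coeffs: &[u64], x: u64) -> u64 {
    coeffs.iter().rev().fold(0, |acc, &c| add(mul(acc, x), c))
}

/// Unscaled sum over every `stride`-th LDE row: sum f(x_i) * x_i / (z - x_i).
/// This is the part the doc comment assigns to the GPU.
fn unscaled_sum_strided(lde_col: &[u64], lde_points: &[u64], stride: usize, z: u64) -> u64 {
    lde_col.iter().zip(lde_points).step_by(stride)
        .fold(0, |acc, (&f, &x)| add(acc, mul(mul(f, x), inv(sub(z, x)))))
}

/// Host-side scalar: vanishing * n_inv * g_n_inv = (z^n - g^n) / (n * g^n).
fn ood_scalar(n: u64, g: u64, z: u64) -> u64 {
    let g_n = pow(g, n);
    mul(sub(pow(z, n), g_n), mul(inv(n), inv(g_n)))
}

/// OOD evaluation of one column at `z`, reading only the trace-size coset
/// (every `blowup`-th row) out of the full LDE column.
fn ood_eval(lde_col: &[u64], lde_points: &[u64], n: u64, g: u64, z: u64) -> u64 {
    let stride = lde_col.len() / n as usize;
    mul(ood_scalar(n, g, z), unscaled_sum_strided(lde_col, lde_points, stride, z))
}
```

For a polynomial of degree below `n`, `ood_eval` on an LDE of size `n * blowup` should agree with evaluating the polynomial directly at `z`, provided `z` is not itself a coset point.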
Summary
Builds on PR-2 (#582). Adds the GPU dispatch for Round 2 composition-poly LDE + Merkle commit (the `number_of_parts > 2` branch, exercised today by the branch and shift tables since they have degree-3 transition constraints) and Round 3 OOD trace barycentric (reads PR-2's `gpu_main` / `gpu_aux` device handles, no host-side LDE traversal).
The cuda feature stays opt-in. CPU is the default and untouched. With cuda on, every new dispatch site falls through to the existing rayon CPU path when the type isn't Goldilocks/ext3, the size is below threshold, the GPU R1 handle is absent for this table, or the math-cuda call returns `Err`.
R4 DEEP, R4 FRI and the `_keep` variant of R2 (which retains a `gpu_composition_parts` handle for R4 DEEP) are deferred to PR-4.
What's in
- `crypto/math-cuda/kernels/barycentric.cu` (new, ~190 LoC). Four kernels: `barycentric_{base,ext3}_batched` and `barycentric_{base,ext3}_batched_strided`. The strided variants read an LDE buffer at a row stride (used by R3 to pick the trace-size coset out of the device-resident LDE without materialising a slab).
- `crypto/math-cuda/src/barycentric.rs` (new, ~215 LoC). Four host wrappers: `barycentric_{base,ext3}` for host data and `barycentric_{base,ext3}_on_device` for `&GpuLdeBase` / `&GpuLdeExt3` handles.
- `crypto/math-cuda/src/device.rs` (+14). `BARY_PTX` const, four `CudaFunction` fields for the new kernels.
- `crypto/math-cuda/build.rs` (+1). `compile_ptx("barycentric.cu", ...)`.
- `crypto/math-cuda/src/lib.rs` (+1). `pub mod barycentric`.
- `crypto/math-cuda/tests/` (new, ~530 LoC across 3 files). `barycentric.rs` and `barycentric_strided.rs` cover the four kernels against a CPU reference summing the unscaled barycentric over base / ext3 columns with optional stride. `comp_poly_tree.rs` exercises the fused `evaluate_poly_coset_batch_ext3_into_with_merkle_tree` end-to-end against the CPU `commit_composition_polynomial` for sizes from `(log_n=2, blowup=2)` up to `(log_n=14, blowup=2)`.
- `crypto/stark/src/gpu_lde.rs` (+~390 LoC). New dispatches:
  - `try_evaluate_parts_on_lde_gpu` (R2, non-`_keep` ext3 LDE for the parts > 2 branch).
  - `try_build_comp_poly_tree_gpu` (R2 row-pair Keccak leaves + inner tree from host evals).
  - `try_barycentric_base_on_handle` + `try_barycentric_ext3_on_handle` (R3 OOD reading PR-2's device handles).
  - `ood_ext3_scalar` and `apply_ext3_scalar` for the per-column scalar application.
  - Two new counters (`gpu_parts_lde_calls`, `gpu_bary_calls`) and a separate `LAMBDA_VM_GPU_BARY_THRESHOLD` env override (default `2^14`). Both new counters reset via `reset_all_gpu_call_counters()`.
- `crypto/stark/src/prover.rs` (~70 LoC of changes). R2 dispatch in `round_2_compute_composition_polynomial`: pre-compute `composition_poly_parts` once, then GPU-or-CPU for the LDE step, then GPU-or-CPU for the comp-poly Merkle commit. `Round2` struct is unchanged.
- `crypto/stark/src/trace.rs` (~80 LoC of changes). R3 dispatch in `get_trace_evaluations_from_lde`: per eval-point, try `try_barycentric_base_on_handle` for main and `try_barycentric_ext3_on_handle` for aux; on `None`, run the existing rayon CPU loop. `inv_denoms` stays on CPU (documented stream-contention regression). Added `+ 'static` bound on the function's type params to support `TypeId` dispatch in the new GPU branches.
Known limitations carried over from PR-2
- `--test-threads=1`. Math-cuda-side fix, out of scope here.
- `LAMBDA_VM_GPU_LDE_THRESHOLD=0` forces small-domain tables through math-cuda kernels that panic at `log_n < 1`. Pre-existing regression, present on PR-2 baseline, not introduced by this PR.
- `num_AIRs * per_table_LDE` because R1 handles are retained across all rounds. PR-3 does not add new device-resident handles, so the ceiling is unchanged from PR-2. A follow-up PR will introduce a VRAM budget that gracefully falls back to non-`_keep` when retention would OOM the GPU.
Continuation of
Builds on PR-2 (#582). Base branch is `feat/cuda-pr2-r1-gpu-commits`. PR-4 (R4 DEEP + FRI + batch invert + R2 `_keep`) stacks on top.
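The fall-through behaviour described in the Summary (try the GPU path, run the CPU path on `None`) and the `LAMBDA_VM_GPU_BARY_THRESHOLD` override can be sketched roughly as below. Only the env variable name and the `2^14` default come from this description; the parsing rules, `try_sum_gpu`, `sum_cpu`, and the `gpu_available` flag are illustrative assumptions, and the "GPU" branch just computes on the host.

```rust
use std::env;

/// Default from the PR description: 2^14.
const DEFAULT_BARY_THRESHOLD: usize = 1 << 14;

/// Parses a threshold override. Unset or unparsable values fall back to
/// the default (assumed behaviour, not confirmed by the PR).
pub fn bary_threshold_from(raw: Option<&str>) -> usize {
    raw.and_then(|s| s.trim().parse().ok())
        .unwrap_or(DEFAULT_BARY_THRESHOLD)
}

/// Reads the override from the environment.
pub fn bary_threshold() -> usize {
    bary_threshold_from(env::var("LAMBDA_VM_GPU_BARY_THRESHOLD").ok().as_deref())
}

/// Hypothetical GPU attempt: `None` means "fall through to the CPU path".
pub fn try_sum_gpu(col: &[u64], threshold: usize, gpu_available: bool) -> Option<u64> {
    if !gpu_available || col.len() < threshold {
        return None;
    }
    // A real dispatch would launch a kernel here and map `Err` to `None`.
    Some(col.iter().fold(0u64, |acc, &x| acc.wrapping_add(x)))
}

/// Existing CPU path (stand-in for the rayon loop).
pub fn sum_cpu(col: &[u64]) -> u64 {
    col.iter().fold(0u64, |acc, &x| acc.wrapping_add(x))
}

/// Dispatch site: GPU if it accepts the input, otherwise CPU.
pub fn sum_dispatch(col: &[u64], threshold: usize, gpu_available: bool) -> u64 {
    try_sum_gpu(col, threshold, gpu_available).unwrap_or_else(|| sum_cpu(col))
}
```

Returning `Option` from the GPU attempt keeps every rejection reason (wrong type, small input, missing handle, kernel error) on a single CPU fallback branch at the call site.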