Add cuSolverDx JIT fusion and solver projections#1176
/build
Greptile Summary
This PR upgrades MathDx/libmathdx integration to the latest runtime codegen packages and introduces cuSolverDx-backed JIT fusion for solver operators (Cholesky, inverse, LU, QR, eig), along with lazy solver projections that let multi-output APIs (
Confidence Score: 3/5
The PR introduces a large new JIT fusion infrastructure for solver operators. Several defects identified in earlier review rounds remain open, and a new one (SVD U/VT uninitialized when SVDMode::NONE) was found in this pass.
The SVD U/VT uninitialized-output bug means callers using SVDMode::NONE projections silently receive garbage data. Combined with still-open issues from previous rounds (unconditional extern declarations causing potential nvJitLink failures, missing const methods on SolverQROp/EconQROp, and the block-dim selection concern), the change carries meaningful correctness risk across multiple solver paths.
include/matx/operators/svd.h (SVDMode::NONE U/VT uninitialized)
include/matx/transforms/solver_cusolverdx.h (block dim range selection)
include/matx/operators/qr.h (SolverQROp/EconQROp const methods, unconditional UNGQR extern in R-only path)
include/matx/operators/eig.h (dual-extern emission for values-only JIT)
Important Files Changed
Class Diagram
%%{init: {'theme': 'neutral'}}%%
classDiagram
class SolverProjectionOp {
+shared_ptr~State~ state_owner_
+SolverProjectionStorage storage_
+PreRun(shape, ex)
+PostRun(shape, ex)
}
class SolverProjectionStorage {
-State* state_
-array shape_
-TensorType tensor_
+PreRun(shape, ex)
+PostRun(shape, ex)
}
class LUOp {
+SolverProjectionOp LU
+SolverProjectionOp Piv
}
class SVDOp {
+SolverProjectionOp U
+SolverProjectionOp S
+SolverProjectionOp VT
}
class cuSolverDxHelper {
+GenerateLTOIR()
+GetShmRequired()
+GetBlockDimRange()
+GetSymbolName()
}
SolverProjectionOp --> SolverProjectionStorage
LUOp --> SolverProjectionOp
SVDOp --> SolverProjectionOp
Reviews (33): Last reviewed commit: "Guard JIT launch metadata helpers"
Upgrade MathDx/libmathdx integration to the latest runtime codegen packages, preserve runtime descriptor queries for FFT and BLAS, add cuSolverDx-backed JIT support for solver operators, and introduce lazy solver projections so multi-output APIs like QR, LU, SVD, and eig can participate in single expressions with tests covering the fused and projection paths.
Add cuSolverDx-backed JIT fusion for inverse and Cholesky-style solver operators, relax solver block-dimension ranges so they can intersect with cuBLASDx, and validate fused matmul/inverse execution against reference results. Preserve the runtime libmathdx resource-query path for cold JIT planning while adding persistent launch metadata plus fixed-size JIT cache keys so warmed MathDx launches can skip repeated libmathdx calls and full JIT type-string construction.
Broaden the cuSolverDx-backed Cholesky and inverse tests from single float smoke cases into typed MathDx/JIT suites over the same non-half floating and complex value types used by the existing solver coverage. The new tests exercise runtime resource queries, CUDA backend comparisons, batched execution, unsupported shapes, and the fused matmul-plus-inverse path across float, double, complex float, and complex double.
Extend the MathDx/JIT-gated solver tests to cover LU, QR, QR econ, SVD, and eigen projection APIs across the same CUDA floating-point type sweep used by the existing cuSolverDx tests. The new cases exercise both single-matrix and rank-3 batched shapes and assert that these multi-output solver projections currently report no JIT support and fail cleanly under CUDAJITExecutor, matching the branch's current implementation where only chol and inv have cuSolverDx codegen paths.
Tighten the cuSolverDx JIT helper and solver projection lifecycle in response to PR review feedback. cuSolverDx descriptors and code objects now use RAII cleanup, generated solver JIT is capped to ranks with emitted indexing support, inverse shared-memory sizing accounts for both matrix buffers plus pivot/info storage, MathDx device attribute queries are checked, and multi-output solver projection states keep scratch storage alive until the last projection releases it. The chol and inverse JIT tests now cover rank-5 rejection and the inverse shared-memory floor.
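The RAII cleanup mentioned above can be illustrated with a minimal sketch. `FakeDescriptor`, `FakeDescriptorCreate`, and `FakeDescriptorDestroy` are hypothetical stand-ins for a C-style create/destroy pair; they are not the libmathdx API, and the actual helper in this PR may be structured differently.

```cpp
#include <memory>

// Hypothetical stand-in for a C-style descriptor API (not the libmathdx interface).
struct FakeDescriptor {
  int id;
};

// Counts descriptors that have been created but not yet destroyed.
inline int& LiveDescriptors() {
  static int live = 0;
  return live;
}

inline FakeDescriptor* FakeDescriptorCreate(int id) {
  ++LiveDescriptors();
  return new FakeDescriptor{id};
}

inline void FakeDescriptorDestroy(FakeDescriptor* d) {
  --LiveDescriptors();
  delete d;
}

// Deleter that routes unique_ptr cleanup through the C-style destroy call.
struct DescriptorDeleter {
  void operator()(FakeDescriptor* d) const noexcept { FakeDescriptorDestroy(d); }
};

using DescriptorHandle = std::unique_ptr<FakeDescriptor, DescriptorDeleter>;

inline DescriptorHandle MakeDescriptor(int id) {
  return DescriptorHandle(FakeDescriptorCreate(id));
}

// Simulates a codegen step that can bail out early. The handle is destroyed
// on every return path without an explicit destroy call.
inline bool GenerateWithDescriptor(bool simulate_failure) {
  DescriptorHandle handle = MakeDescriptor(1);
  if (simulate_failure) {
    return false;
  }
  return handle->id == 1;
}
```

The point of the pattern is that an early return or exception between create and destroy can no longer leak the descriptor.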
Update the CUDA JIT fusion documentation to reflect the current MathDx backend coverage: cuFFTDx for FFT, cuBLASDx for compatible matmul/GEMM, and partial cuSolverDx support for chol and inv. Clarify that multi-output solver APIs remain non-JIT today, document runtime libmathdx resource queries and launch-compatibility limits, and add the same support boundaries to the relevant solver, BLAS, and build docs.
Extend the cuSolverDx JIT projection work for supported solver outputs, add the remaining solver coverage and documentation updates, and refresh the executor compatibility matrix so it records HostExecutor, cudaExecutor, and CUDAJITExecutor support with explicit notes for MathDx, CPU backend, and host threading limitations. Also add contributor guidance in AGENTS.md so future operator changes keep tests, documentation, and the executor compatibility table in sync.
Clear MatX caches and cached allocations around the existing QREcon solver validation so the new cuSolverDx projection tests do not leave enough cached module or solver workspace pressure to starve the larger non-JIT QR-econ cases later in the same test process. This keeps the expanded JIT coverage and the existing CUDA QR-econ coverage runnable together after rebasing on main.
1a3e0c5 to fcab809
/build
Make cuSolverDx-backed solver JIT projection class names and cache keys include the generated MathDx symbols so mixed-size projections cannot silently reuse stale generated classes. Tighten generated solver code for complex eigenvalue storage, unsupported ranks, sliced direct solver results, and slice JIT type names, then add focused mixed-size fusion regressions for the affected solver APIs and clean up executor compatibility RST markup.
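A small sketch of why the cache key needs the generated symbol. `ClassCacheSketch`, `KeyWithoutSymbol`, `KeyWithSymbol`, and the symbol strings are hypothetical names for illustration only; the real cache key layout in MatX is not shown in this conversation.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical record of a generated class; only the size it was built for.
struct GeneratedClass {
  int matrix_size;
};

class ClassCacheSketch {
 public:
  // Returns the cached class for `key`, generating it on first use.
  const GeneratedClass& GetOrGenerate(const std::string& key, int matrix_size) {
    auto it = cache_.find(key);
    if (it == cache_.end()) {
      it = cache_.emplace(key, GeneratedClass{matrix_size}).first;
    }
    return it->second;
  }

 private:
  std::unordered_map<std::string, GeneratedClass> cache_;
};

// Key that ignores the generated symbol: different sizes collide.
inline std::string KeyWithoutSymbol(const std::string& op) { return op; }

// Key that includes the generated symbol (assumed to differ per size and type).
inline std::string KeyWithSymbol(const std::string& op, const std::string& symbol) {
  return op + "|" + symbol;
}
```

With the symbol-free key, a second request for a different size gets the class generated for the first size, which is the silent stale reuse this commit guards against.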
Fix the cuSolverDx LTOIR cache-store ownership path, add missing PostRun forwarding for multi-output solver mtie wrappers, and gate qr_econ().Q JIT fusion for wide matrices where the in-place shared-memory compaction is unsafe. The LU pivot projection is now encoded consistently with cuSolver and covered by a nontrivial fused-expression regression, and the QR JIT documentation notes the qr_econ Q-shape limitation.
Add a focused cuSolverDx QR projection regression using a dense non-diagonal square matrix so the JIT path exercises Householder QR instead of only diagonal inputs. The test JIT-generates Q and R projections, reconstructs with regular CUDA matmul, and checks the product against the original matrix.
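The reconstruct-and-compare check described above can be sketched on the host. This is not the MatX test: it uses modified Gram-Schmidt instead of Householder QR to stay short, a fixed 3x3 size, and hypothetical names (`Mat3`, `GramSchmidtQR`, `MaxReconstructionError`).

```cpp
#include <array>
#include <cmath>
#include <cstddef>

constexpr std::size_t kDim = 3;
using Mat3 = std::array<std::array<double, kDim>, kDim>;

// Modified Gram-Schmidt QR, used here as a simple substitute for Householder QR.
// On return, q has orthonormal columns and r is upper triangular with a = q * r.
// Assumes a has full column rank, so every column norm is nonzero.
inline void GramSchmidtQR(const Mat3& a, Mat3& q, Mat3& r) {
  q = a;
  r = Mat3{};  // zero-initialize R
  for (std::size_t j = 0; j < kDim; ++j) {
    // Remove the components of column j along the already-orthonormal columns.
    for (std::size_t k = 0; k < j; ++k) {
      double dot = 0.0;
      for (std::size_t i = 0; i < kDim; ++i) dot += q[i][k] * q[i][j];
      r[k][j] = dot;
      for (std::size_t i = 0; i < kDim; ++i) q[i][j] -= dot * q[i][k];
    }
    // Normalize what is left of column j.
    double norm_sq = 0.0;
    for (std::size_t i = 0; i < kDim; ++i) norm_sq += q[i][j] * q[i][j];
    const double norm = std::sqrt(norm_sq);
    r[j][j] = norm;
    for (std::size_t i = 0; i < kDim; ++i) q[i][j] /= norm;
  }
}

// Largest absolute entry of (q * r - a).
inline double MaxReconstructionError(const Mat3& a, const Mat3& q, const Mat3& r) {
  double max_err = 0.0;
  for (std::size_t i = 0; i < kDim; ++i) {
    for (std::size_t j = 0; j < kDim; ++j) {
      double sum = 0.0;
      for (std::size_t k = 0; k < kDim; ++k) sum += q[i][k] * r[k][j];
      max_err = std::fmax(max_err, std::fabs(sum - a[i][j]));
    }
  }
  return max_err;
}
```

A dense input matters for this kind of check because a diagonal matrix leaves Q as the identity (up to sign) and never exercises the off-diagonal updates.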
@greptile review
Qualify generated cuSolverDx solver projection class names with input rank so same-size projections from different-rank operands cannot share an incompatible generated class, and make direct Cholesky and inverse generated Rank() implementations follow the actual operand type. This also preserves 64-bit workspace sizing until launch-time bounds checking, lets full QR fuse R-only rectangular projections without enabling unsafe Q generation, and adds focused LU and QR regression coverage for those cases.
@greptile review
Remove the redundant jobz contribution from values-only eigen projection cache keys so identical values paths share metadata across job modes, and reuse the active cuSolverDx descriptor when HEEV shared-memory floor sizing needs the workspace trait. This keeps the runtime query model intact while avoiding the extra descriptor round-trip Greptile flagged.
@greptile review Please review latest head commit: 2ae80c4
Make the QR, QR-solver, QR-econ, Eigen, and SVD projection states clean up partially allocated scratch buffers if materialization fails before the projection tensors are fully ready. This keeps the solver projection object itself device-safe by retaining raw state pointers there, while leaving ownership in the parent solver operators, and matches the existing LU failure-cleanup behavior without adding non-CUDA-copyable state to expressions.
@greptile review
Split solver projections into a host-facing owner and a CUDA-visible raw storage type so copied projections keep their solver state alive without placing std::shared_ptr in the object copied into kernels. Projection constructors now retain the parent state shared_ptr, base_type strips expressions to SolverProjectionStorage for device execution, and a LU regression verifies a projection copied out of a temporary solver op still runs correctly through the MathDx JIT path.
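The owner/storage split can be sketched as follows. `ProjectionOwner`, `ProjectionStorage`, and `State` are hypothetical simplifications, not the actual `SolverProjectionOp` and `SolverProjectionStorage` definitions; the sketch only shows the ownership idea.

```cpp
#include <memory>
#include <type_traits>
#include <utility>

// Hypothetical solver state shared by all projections of one solver op.
struct State {
  int value;
};

// Device-facing part: a raw pointer only, so the object stays trivially
// copyable and could be passed by value into a kernel.
struct ProjectionStorage {
  State* state_;
  int Read() const { return state_->value; }
};
static_assert(std::is_trivially_copyable<ProjectionStorage>::value,
              "storage must be safe to copy into kernels");

// Host-facing part: keeps the state alive through shared ownership and hands
// out the raw storage view for device execution.
class ProjectionOwner {
 public:
  explicit ProjectionOwner(std::shared_ptr<State> state)
      : state_owner_(std::move(state)), storage_{state_owner_.get()} {}

  const ProjectionStorage& base() const { return storage_; }
  long OwnerCount() const { return state_owner_.use_count(); }

 private:
  std::shared_ptr<State> state_owner_;  // declared first, so initialized first
  ProjectionStorage storage_;
};

// Mimics copying a projection out of a temporary solver op: the local
// shared_ptr goes away, but the returned owner still holds the state.
inline ProjectionOwner MakeProjectionFromTemporary(int value) {
  std::shared_ptr<State> parent_state = std::make_shared<State>(State{value});
  return ProjectionOwner(parent_state);
}
```

Copies of the owner share the state, while the storage view never carries `std::shared_ptr` into device code.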
@greptile review Please review latest head commit: 8b8f98f1d6d3c7d2734b1bcc3bd08b5f36cf7720
Generate component-specific QR, economic QR, and eigen projection classes so the generated device code only declares and references helper symbols whose LTOIR is actually linked for that projection. R-only QR paths now use a GEQRF-only projection body, values-only eig uses the values HEEV helper only, and the class names encode the component to keep mixed-output expressions from reusing an incompatible generated class.
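A sketch of the component-specific selection described above, assuming hypothetical helper-symbol and class-name formats (`geqrf_<suffix>`, `ungqr_<suffix>`, `QrProjection_<component>_<suffix>`). The real generated names are not shown in this conversation.

```cpp
#include <string>
#include <vector>

enum class QrComponent { Q, R };

// Lists the helper symbols a projection would need to declare and link.
// R only needs the factorization; Q additionally needs the UNGQR step.
inline std::vector<std::string> RequiredHelpers(QrComponent component,
                                                const std::string& suffix) {
  std::vector<std::string> helpers;
  helpers.push_back("geqrf_" + suffix);
  if (component == QrComponent::Q) {
    helpers.push_back("ungqr_" + suffix);
  }
  return helpers;
}

// Encodes the component in the class name so Q and R projections of the same
// input never share a generated class.
inline std::string ProjectionClassName(QrComponent component,
                                       const std::string& suffix) {
  const std::string tag = (component == QrComponent::Q) ? "Q" : "R";
  return "QrProjection_" + tag + "_" + suffix;
}
```

Keeping the helper list and the class name derived from the same component value is what prevents an R-only class from referencing an unlinked UNGQR symbol.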
@greptile review Please review latest head commit: 9d6585f
Move QR UNGQR and eigen HEEV helper symbol construction into the component-specific branches that emit the matching JIT projection code. This keeps R-only and values-only projection generation visibly limited to the helper LTOIR that will be linked, while preserving the already-tested runtime behavior.
@greptile review Please review latest head commit: 9d790ce
Clarify that QR solver OUT and TAU projections intentionally reuse the same generated JIT class because the GEQRF projection body dispatches on the component template parameter. This addresses the review concern without changing the already-tested runtime path.
@greptile review Please review latest head commit: 49b4dc0
Remove the QR Q shared-memory compaction path from the cuSolverDx generated projection and replace it with a generated static_assert that the supported Q projection layout matches the input leading dimension. Also document the intentional shared generated class for LU factors and pivots, since the body dispatches on the projection component.
@greptile review Please review latest head commit: 65c21df
Keep solver projection state alive when a projection is composed into a longer-lived expression without putting shared ownership into the device-facing storage object. SolverProjectionStorage now retains state through a host-side registry while remaining raw-pointer based on device, skips eager materialization under CUDAJITExecutor, and includes a compound-expression lifetime regression. This also makes QR solver transform methods const-correct, avoids eager Chol/Inv PreRun work in the JIT path, clarifies cuSolverDx symbol/block metadata, and bumps the launch metadata cache version to drop stale parameters.
@greptile review Please review latest head commit: 84c8b71
/build
Make the full MathDx-enabled validation pass cleanly by aligning the resample example with the real inverse FFT path, skipping the sparse CSR direct-solve example when MatX is built without cuDSS, and making the profiling overhead check tolerate normal CPU-wall-time variance in containerized runs. The r2c operator test now asks the composed operator whether CUDA JIT is supported before exercising that executor, so unsupported real-FFT JIT combinations are skipped instead of throwing during the all-tests pass.
@@ -0,0 +1,1004 @@
////////////////////////////////////////////////////////////////////////////////
GetBlockDimRange returns hardcoded {32, 1024} instead of the queried trait block dim
After calling TryCacheTraits(), the trait-queried block_dim_ is stored but GetBlockDimRange() returns a hardcoded range {32, 1024}. For solver sizes where cuSolverDx requires a specific block dimension, the JIT executor will select the minimum (32) and may launch with the wrong occupancy or violate hardware requirements for that solver variant.
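One possible shape of a fix, purely as an illustration: once the trait has been queried, use it as the lower bound of the reported range so that an executor picking the minimum lands on the queried value. `SolverHelperSketch`, `CacheTraits`, `Intersect`, and `IsEmpty` are hypothetical; keeping 1024 as the upper bound is an assumption made so the range can still intersect with other operators.

```cpp
#include <algorithm>

constexpr int kMinThreads = 32;
constexpr int kMaxThreads = 1024;

struct BlockDimRange {
  int min;
  int max;
};

class SolverHelperSketch {
 public:
  // Stands in for a successful trait query that yields a block dimension.
  void CacheTraits(int queried_block_dim) {
    block_dim_ = queried_block_dim;
    traits_cached_ = true;
  }

  // Falls back to the generic range only when no trait has been queried.
  BlockDimRange GetBlockDimRange() const {
    if (traits_cached_) {
      return BlockDimRange{block_dim_, kMaxThreads};
    }
    return BlockDimRange{kMinThreads, kMaxThreads};
  }

 private:
  int block_dim_ = 0;
  bool traits_cached_ = false;
};

// Intersection of two ranges, as used when fusing operators in one kernel.
inline BlockDimRange Intersect(const BlockDimRange& a, const BlockDimRange& b) {
  return BlockDimRange{std::max(a.min, b.min), std::min(a.max, b.max)};
}

inline bool IsEmpty(const BlockDimRange& r) { return r.min > r.max; }
```

With this shape, a fused expression either launches at or above the solver's queried block dimension or reports an empty intersection instead of silently choosing 32.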
@@ -0,0 +1,477 @@
////////////////////////////////////////////////////////////////////////////////
Static LifetimeRegistry is not thread-safe for concurrent projection construction
The LifetimeRegistry is a static unordered_map<State*, LifetimeEntry> modified by RetainState and ReleaseState without synchronization. When multiple CPU threads construct or destroy SolverProjectionOp instances referring to the same or different State objects concurrently, the map can be corrupted. A std::mutex or atomic ref-count per State is needed.
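A minimal sketch of the mutex-guarded variant suggested above. `LifetimeRegistrySketch`, `Retain`, `Release`, and `Size` are hypothetical names; the real `RetainState`/`ReleaseState` signatures are not shown in this conversation.

```cpp
#include <cstddef>
#include <memory>
#include <mutex>
#include <unordered_map>
#include <utility>

struct State {
  int value;
};

class LifetimeRegistrySketch {
 public:
  // Registers one more user of `state`, keeping it alive while users remain.
  void Retain(const std::shared_ptr<State>& state) {
    std::lock_guard<std::mutex> lock(mutex_);
    Entry& entry = entries_[state.get()];
    if (entry.refs == 0) {
      entry.owner = state;
    }
    ++entry.refs;
  }

  // Drops one user; the last release removes the entry.
  void Release(State* state) {
    std::shared_ptr<State> dying;  // destroyed after the lock is released
    {
      std::lock_guard<std::mutex> lock(mutex_);
      auto it = entries_.find(state);
      if (it == entries_.end()) {
        return;
      }
      if (--it->second.refs == 0) {
        dying = std::move(it->second.owner);
        entries_.erase(it);
      }
    }
  }

  std::size_t Size() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return entries_.size();
  }

 private:
  struct Entry {
    std::shared_ptr<State> owner;
    std::size_t refs = 0;
  };

  mutable std::mutex mutex_;
  std::unordered_map<State*, Entry> entries_;
};
```

Moving the last owner out before erasing means the `State` destructor runs outside the lock, which avoids holding the mutex during potentially slow cleanup.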
Keep the launch-parameter metadata cache helpers behind MATX_EN_JIT so jit_cuda.h can be included in non-JIT builds where nvrtc_helper.h does not declare GetCache. The executor still reports the normal runtime error if CUDAJITExecutor is used without JIT support, while JIT-enabled builds keep the warmed-launch metadata cache path.
SolverProjectionOp<state_type, SVD_U, u_type> U;
SolverProjectionOp<state_type, SVD_S, s_type> S;
SolverProjectionOp<state_type, SVD_VT, vt_type> VT;
U and VT projections expose uninitialized tensors when SVDMode::NONE
SVDState always allocates u_ and vt_ and SVDState::Tensor<SVD_U>()/Tensor<SVD_VT>() return them unconditionally, but svd_impl does not fill those buffers when jobz_ == SVDMode::NONE. Calling .U or .VT on an svd(A, SVDMode::NONE) op runs Materialize, writes nothing to the backing buffers, and exposes uninitialized device memory to the caller — silently producing wrong results.
Adding a MATX_ASSERT or throwing in SVDState::Tensor<Component>() when the requested component was not computed (e.g., jobz_ == NONE and Component != SVD_S) would catch this at the point of access, matching the guard that EigState::SupportsJITProjection<EIG_VECTORS> already applies for the JIT path.
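The suggested guard could look roughly like the sketch below. `SvdModeSketch`, `SvdComponent`, `ComponentComputed`, and `RequireComponent` are hypothetical; only the `NONE` mode is taken from the comment above, and a real fix would presumably use MatX's own assertion or exception machinery rather than `std::invalid_argument`.

```cpp
#include <stdexcept>

// Hypothetical mode enum: FULL stands for any mode that computes U and VT.
enum class SvdModeSketch { FULL, NONE };
enum class SvdComponent { U, S, VT };

// Singular values are always computed; U and VT only when the mode asks for them.
inline bool ComponentComputed(SvdModeSketch mode, SvdComponent component) {
  return component == SvdComponent::S || mode != SvdModeSketch::NONE;
}

// Fails at the point of access instead of exposing an unfilled buffer.
inline void RequireComponent(SvdModeSketch mode, SvdComponent component) {
  if (!ComponentComputed(mode, component)) {
    throw std::invalid_argument(
        "requested SVD component was not computed for this mode");
  }
}
```

Calling a check like this at the start of the tensor accessor turns the silent wrong result into an immediate, diagnosable error.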
/build
Upgrade MathDx/libmathdx integration to the latest runtime codegen packages, preserve runtime descriptor queries for FFT and BLAS, add cuSolverDx-backed JIT support for solver operators, and introduce lazy solver projections so multi-output APIs like QR, LU, SVD, and eig can participate in single expressions with tests covering the fused and projection paths.
Also adds a new interface that allows transforms with multiple return values to be used in a fusion context.