Add cuSolverDx JIT fusion and solver projections#1176

Open
cliffburdick wants to merge 23 commits into main from cburdick/cusolverdx-jit-fusion

Conversation

@cliffburdick
Collaborator

Upgrade the MathDx/libmathdx integration to the latest runtime codegen packages, preserve runtime descriptor queries for FFT and BLAS, and add cuSolverDx-backed JIT support for solver operators. Also introduce lazy solver projections so multi-output APIs such as QR, LU, SVD, and eig can participate in single expressions, with tests covering both the fused and projection paths.

Also added a new interface to allow multi-output return transforms to be used in a fusion context.

@copy-pr-bot

copy-pr-bot Bot commented May 9, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@cliffburdick
Collaborator Author

/build

@greptile-apps
Contributor

greptile-apps Bot commented May 9, 2026

Greptile Summary

This PR upgrades MathDx/libmathdx integration to the latest runtime codegen packages and introduces cuSolverDx-backed JIT fusion for solver operators (Cholesky, inverse, LU, QR, eig), along with lazy solver projections that let multi-output APIs (lu().LU, qr().Q, eig().Values, etc.) participate in single fused expressions. A new SolverProjectionOp/SolverProjectionStorage infrastructure handles state lifetime, JIT class generation, and eager materialization for non-JIT paths across all decomposition operators.

  • New solver_cusolverdx.h / solver_projection.h: cuSolverDxHelper encapsulates plan creation, trait queries, LTOIR generation, and shared-memory layout; SolverProjectionStorage adds reference-counted lifetime management so projections remain valid past the owning Op rvalue.
  • Operator rewrites (lu, qr, svd, eig, chol, inv): Each operator extracts its mutable state into a *State class shared via shared_ptr, adds SolverProjectionOp public members for each output, and wires cuSolverDx JIT capabilities (SUPPORTS_JIT, BLOCK_DIM, DYN_SHM_SIZE, JIT_CACHE_KEY, GENERATE_LTOIR, JIT_CLASS_QUERY) through the capabilities system.

Confidence Score: 3/5

The PR introduces a large new JIT fusion infrastructure for solver operators. Several defects identified in earlier review rounds remain open, and a new one (SVD U/VT uninitialized when SVDMode::NONE) was found in this pass.

The SVD U/VT uninitialized-output bug means callers using SVDMode::NONE projections silently receive garbage data. Combined with still-open issues from previous rounds — unconditional extern declarations causing potential nvJitLink failures, missing const methods on SolverQROp/EconQROp, and the block-dim selection concern — the change carries meaningful correctness risk across multiple solver paths.

include/matx/operators/svd.h (SVDMode::NONE U/VT uninitialized), include/matx/transforms/solver_cusolverdx.h (block dim range selection), include/matx/operators/qr.h (SolverQROp/EconQROp const methods, unconditional UNGQR extern in R-only path), include/matx/operators/eig.h (dual-extern emission for values-only JIT)

Important Files Changed

  • include/matx/operators/solver_projection.h: New file; introduces SolverProjectionStorage (reference-counted, thread-safe lifetime registry) and SolverProjectionOp (wraps shared_ptr owner + storage). Lifetime tracking logic is complex but traces correctly for typical usage patterns.
  • include/matx/transforms/solver_cusolverdx.h: New file; cuSolverDxHelper encapsulates descriptor/code RAII handles, trait caching, GetSymbolName, GenerateLTOIR, and all JIT kernel body string generators. Key concern: GetBlockDimRange returns a hardcoded {32, 1024} and JIT always picks 32.
  • include/matx/operators/lu.h: Refactored to extract LUState with proper Materialize/Release exception safety (cleanup lambda pattern). LU and Piv are public SolverProjectionOp members sharing one JIT class with component dispatch.
  • include/matx/operators/qr.h: Largest diff; adds QRState, SolverQRState, and EconQRState. EconQRState correctly guards the Q JIT path with q_layout_safe (m >= n) to avoid the in-place smem race on wide matrices. Multiple previously flagged issues remain.
  • include/matx/operators/eig.h: EigState correctly uses precision_type for lambda_bytes and conditional extern emission in AddJITProjectionClasses. The EIG_VECTORS accessor still returns an uninitialized vectors tensor when jobz_ == NO_VECTOR.
  • include/matx/operators/svd.h: SVDState::Materialize has proper exception safety. Critical: SVDState::Tensor<SVD_U>() and Tensor<SVD_VT>() return allocated-but-uninitialized tensors when jobz_ == SVDMode::NONE, silently exposing garbage data.
  • include/matx/operators/chol.h: CholOp gains cuSolverDx JIT support. PreRun correctly skips the eager solve in JIT mode; PostRun properly resets prerun_done_ = false and frees ptr.
  • include/matx/operators/inverse.h: InvOp gains cuSolverDx GESV-based JIT support. PreRun correctly skips the eager solve in JIT mode, and the GetGesvInverseFuncStr shared-memory floor calculation correctly accounts for 2*n² elements plus (n+1) ints.

Class Diagram

%%{init: {'theme': 'neutral'}}%%
classDiagram
    class SolverProjectionOp {
        +shared_ptr~State~ state_owner_
        +SolverProjectionStorage storage_
        +PreRun(shape, ex)
        +PostRun(shape, ex)
    }
    class SolverProjectionStorage {
        -State* state_
        -array shape_
        -TensorType tensor_
        +PreRun(shape, ex)
        +PostRun(shape, ex)
    }
    class LUOp {
        +SolverProjectionOp LU
        +SolverProjectionOp Piv
    }
    class SVDOp {
        +SolverProjectionOp U
        +SolverProjectionOp S
        +SolverProjectionOp VT
    }
    class cuSolverDxHelper {
        +GenerateLTOIR()
        +GetShmRequired()
        +GetBlockDimRange()
        +GetSymbolName()
    }
    SolverProjectionOp --> SolverProjectionStorage
    LUOp --> SolverProjectionOp
    SVDOp --> SolverProjectionOp

Reviews (33). Last reviewed commit: "Guard JIT launch metadata helpers".

Comment thread include/matx/transforms/solver_cusolverdx.h Outdated
Comment thread include/matx/transforms/solver_cusolverdx.h Outdated
Comment thread include/matx/transforms/solver_cusolverdx.h Outdated
Comment thread include/matx/operators/solver_projection.h
@coveralls

Coverage Status

Coverage is 94.402% for cburdick/cusolverdx-jit-fusion into main. No base build found for main.

Comment thread include/matx/transforms/solver_cusolverdx.h Outdated
Comment thread include/matx/transforms/solver_cusolverdx.h Outdated
Comment thread include/matx/operators/chol.h
Add cuSolverDx-backed JIT fusion for inverse and Cholesky-style solver operators, relax solver block-dimension ranges so they can intersect with cuBLASDx, and validate fused matmul/inverse execution against reference results. Preserve the runtime libmathdx resource-query path for cold JIT planning while adding persistent launch metadata plus fixed-size JIT cache keys so warmed MathDx launches can skip repeated libmathdx calls and full JIT type-string construction.
Broaden the cuSolverDx-backed Cholesky and inverse tests from single float smoke cases into typed MathDx/JIT suites over the same non-half floating and complex value types used by the existing solver coverage. The new tests exercise runtime resource queries, CUDA backend comparisons, batched execution, unsupported shapes, and the fused matmul-plus-inverse path across float, double, complex float, and complex double.
Extend the MathDx/JIT-gated solver tests to cover LU, QR, QR econ, SVD, and eigen projection APIs across the same CUDA floating-point type sweep used by the existing cuSolverDx tests. The new cases exercise both single-matrix and rank-3 batched shapes and assert that these multi-output solver projections currently report no JIT support and fail cleanly under CUDAJITExecutor, matching the branch's current implementation where only chol and inv have cuSolverDx codegen paths.
Tighten the cuSolverDx JIT helper and solver projection lifecycle in response to PR review feedback. cuSolverDx descriptors and code objects now use RAII cleanup, generated solver JIT is capped to ranks with emitted indexing support, inverse shared-memory sizing accounts for both matrix buffers plus pivot/info storage, MathDx device attribute queries are checked, and multi-output solver projection states keep scratch storage alive until the last projection releases it. The chol and inverse JIT tests now cover rank-5 rejection and the inverse shared-memory floor.
Update the CUDA JIT fusion documentation to reflect the current MathDx backend coverage: cuFFTDx for FFT, cuBLASDx for compatible matmul/GEMM, and partial cuSolverDx support for chol and inv. Clarify that multi-output solver APIs remain non-JIT today, document runtime libmathdx resource queries and launch-compatibility limits, and add the same support boundaries to the relevant solver, BLAS, and build docs.
Extend the cuSolverDx JIT projection work for supported solver outputs, add the remaining solver coverage and documentation updates, and refresh the executor compatibility matrix so it records HostExecutor, cudaExecutor, and CUDAJITExecutor support with explicit notes for MathDx, CPU backend, and host threading limitations. Also add contributor guidance in AGENTS.md so future operator changes keep tests, documentation, and the executor compatibility table in sync.
Comment thread include/matx/transforms/solver_cusolverdx.h Outdated
Clear MatX caches and cached allocations around the existing QREcon solver validation so the new cuSolverDx projection tests do not leave enough cached module or solver workspace pressure to starve the larger non-JIT QR-econ cases later in the same test process. This keeps the expanded JIT coverage and the existing CUDA QR-econ coverage runnable together after rebasing on main.
@cliffburdick force-pushed the cburdick/cusolverdx-jit-fusion branch from 1a3e0c5 to fcab809 on May 13, 2026 at 00:43
Comment thread test/00_solver/LU.cu
@cliffburdick
Collaborator Author

/build

Comment thread include/matx/operators/lu.h
Make cuSolverDx-backed solver JIT projection class names and cache keys include the generated MathDx symbols so mixed-size projections cannot silently reuse stale generated classes. Tighten generated solver code for complex eigenvalue storage, unsupported ranks, sliced direct solver results, and slice JIT type names, then add focused mixed-size fusion regressions for the affected solver APIs and clean up executor compatibility RST markup.
Comment thread include/matx/operators/qr.h
Comment thread include/matx/transforms/solver_cusolverdx.h Outdated
Fix the cuSolverDx LTOIR cache-store ownership path, add missing PostRun forwarding for multi-output solver mtie wrappers, and gate qr_econ().Q JIT fusion for wide matrices where the in-place shared-memory compaction is unsafe. The LU pivot projection is now encoded consistently with cuSolver and covered by a nontrivial fused-expression regression, and the QR JIT documentation notes the qr_econ Q-shape limitation.
Add a focused cuSolverDx QR projection regression using a dense non-diagonal square matrix so the JIT path exercises Householder QR instead of only diagonal inputs. The test JIT-generates Q and R projections, reconstructs with regular CUDA matmul, and checks the product against the original matrix.
@cliffburdick
Collaborator Author

@greptile review

Qualify generated cuSolverDx solver projection class names with input rank so same-size projections from different-rank operands cannot share an incompatible generated class, and make direct Cholesky and inverse generated Rank() implementations follow the actual operand type. This also preserves 64-bit workspace sizing until launch-time bounds checking, lets full QR fuse R-only rectangular projections without enabling unsafe Q generation, and adds focused LU and QR regression coverage for those cases.
@cliffburdick
Collaborator Author

@greptile review

Remove the redundant jobz contribution from values-only eigen projection cache keys so identical values paths share metadata across job modes, and reuse the active cuSolverDx descriptor when HEEV shared-memory floor sizing needs the workspace trait. This keeps the runtime query model intact while avoiding the extra descriptor round-trip Greptile flagged.
@cliffburdick
Collaborator Author

@greptile review

Please review latest head commit: 2ae80c4

Comment thread include/matx/operators/qr.h
Comment thread include/matx/operators/qr.h Outdated
Make the QR, QR-solver, QR-econ, Eigen, and SVD projection states clean up partially allocated scratch buffers if materialization fails before the projection tensors are fully ready. This keeps the solver projection object itself device-safe by retaining raw state pointers there, while leaving ownership in the parent solver operators, and matches the existing LU failure-cleanup behavior without adding non-CUDA-copyable state to expressions.
@cliffburdick
Collaborator Author

@greptile review

Comment thread include/matx/operators/eig.h
Split solver projections into a host-facing owner and a CUDA-visible raw storage type so copied projections keep their solver state alive without placing std::shared_ptr in the object copied into kernels. Projection constructors now retain the parent state shared_ptr, base_type strips expressions to SolverProjectionStorage for device execution, and a LU regression verifies a projection copied out of a temporary solver op still runs correctly through the MathDx JIT path.
@cliffburdick
Collaborator Author

@greptile review

Please review latest head commit: 8b8f98f1d6d3c7d2734b1bcc3bd08b5f36cf7720

Generate component-specific QR, economic QR, and eigen projection classes so the generated device code only declares and references helper symbols whose LTOIR is actually linked for that projection. R-only QR paths now use a GEQRF-only projection body, values-only eig uses the values HEEV helper only, and the class names encode the component to keep mixed-output expressions from reusing an incompatible generated class.
@cliffburdick
Collaborator Author

@greptile review

Please review latest head commit: 9d6585f

Comment thread include/matx/operators/eig.h
Comment thread include/matx/operators/qr.h
Move QR UNGQR and eigen HEEV helper symbol construction into the component-specific branches that emit the matching JIT projection code. This keeps R-only and values-only projection generation visibly limited to the helper LTOIR that will be linked, while preserving the already-tested runtime behavior.
@cliffburdick
Collaborator Author

@greptile review

Please review latest head commit: 9d790ce

Clarify that QR solver OUT and TAU projections intentionally reuse the same generated JIT class because the GEQRF projection body dispatches on the component template parameter. This addresses the review concern without changing the already-tested runtime path.
@cliffburdick
Collaborator Author

@greptile review

Please review latest head commit: 49b4dc0

Comment thread include/matx/operators/solver_projection.h
Comment thread include/matx/transforms/solver_cusolverdx.h
Comment thread include/matx/operators/qr.h
Remove the QR Q shared-memory compaction path from the cuSolverDx generated projection and replace it with a generated static_assert that the supported Q projection layout matches the input leading dimension. Also document the intentional shared generated class for LU factors and pivots, since the body dispatches on the projection component.
@cliffburdick
Collaborator Author

@greptile review

Please review latest head commit: 65c21df

Comment thread include/matx/operators/solver_projection.h
Comment thread include/matx/transforms/solver_cusolverdx.h
Keep solver projection state alive when a projection is composed into a longer-lived expression without putting shared ownership into the device-facing storage object. SolverProjectionStorage now retains state through a host-side registry while remaining raw-pointer based on device, skips eager materialization under CUDAJITExecutor, and includes a compound-expression lifetime regression. This also makes QR solver transform methods const-correct, avoids eager Chol/Inv PreRun work in the JIT path, clarifies cuSolverDx symbol/block metadata, and bumps the launch metadata cache version to drop stale parameters.
@cliffburdick
Collaborator Author

@greptile review

Please review latest head commit: 84c8b71

@cliffburdick
Collaborator Author

/build

Make the full MathDx-enabled validation pass cleanly by aligning the resample example with the real inverse FFT path, skipping the sparse CSR direct-solve example when MatX is built without cuDSS, and making the profiling overhead check tolerate normal CPU-wall-time variance in containerized runs. The r2c operator test now asks the composed operator whether CUDA JIT is supported before exercising that executor, so unsupported real-FFT JIT combinations are skipped instead of throwing during the all-tests pass.
@@ -0,0 +1,1004 @@
////////////////////////////////////////////////////////////////////////////////
Contributor

P1 GetBlockDimRange returns hardcoded {32, 1024} instead of the queried trait block dim

After calling TryCacheTraits(), the trait-queried block_dim_ is stored but GetBlockDimRange() returns a hardcoded range {32, 1024}. For solver sizes where cuSolverDx requires a specific block dimension, the JIT executor will select the minimum (32) and may launch with the wrong occupancy or violate hardware requirements for that solver variant.

@@ -0,0 +1,477 @@
////////////////////////////////////////////////////////////////////////////////
Contributor

P1 Static LifetimeRegistry is not thread-safe for concurrent projection construction

The LifetimeRegistry is a static unordered_map<State*, LifetimeEntry> modified by RetainState and ReleaseState without synchronization. When multiple CPU threads construct or destroy SolverProjectionOp instances referring to the same or different State objects concurrently, the map can be corrupted. A std::mutex or atomic ref-count per State is needed.

Keep the launch-parameter metadata cache helpers behind MATX_EN_JIT so jit_cuda.h can be included in non-JIT builds where nvrtc_helper.h does not declare GetCache. The executor still reports the normal runtime error if CUDAJITExecutor is used without JIT support, while JIT-enabled builds keep the warmed-launch metadata cache path.
Comment on lines +252 to +254
SolverProjectionOp<state_type, SVD_U, u_type> U;
SolverProjectionOp<state_type, SVD_S, s_type> S;
SolverProjectionOp<state_type, SVD_VT, vt_type> VT;
Contributor

P1 U and VT projections expose uninitialized tensors when SVDMode::NONE

SVDState always allocates u_ and vt_ and SVDState::Tensor<SVD_U>()/Tensor<SVD_VT>() return them unconditionally, but svd_impl does not fill those buffers when jobz_ == SVDMode::NONE. Calling .U or .VT on an svd(A, SVDMode::NONE) op runs Materialize, writes nothing to the backing buffers, and exposes uninitialized device memory to the caller — silently producing wrong results.

Adding a MATX_ASSERT or throwing in SVDState::Tensor<Component>() when the requested component was not computed (e.g., jobz_ == NONE and Component != SVD_S) would catch this at the point of access, matching the guard that EigState::SupportsJITProjection<EIG_VECTORS> already applies for the JIT path.

@cliffburdick
Collaborator Author

/build
