refactor(cuda_core): defer device checks in LaunchConfig to launch time #2066
KRRT7 wants to merge 5 commits into
Conversation
Avoid calling Device() twice (once for cluster validation, once for cooperative check). Now called at most once, and zero times for the common simple-launch path where neither cluster nor is_cooperative is set. Co-Authored-By: Claude <noreply@anthropic.com>
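A minimal sketch of the "at most one device lookup" pattern the commit describes, using plain Python with hypothetical names (`get_device`, `supports_cooperative_launch`) — this is illustrative only, not the actual cuda.core API:

```python
# Illustrative sketch: resolve the device lazily so the common simple-launch
# path (no cluster, no cooperative_launch) performs zero device lookups, and
# the checked paths perform at most one.

class LaunchConfig:
    def __init__(self, grid, block, cluster=None, cooperative_launch=False):
        # Pure data class: no driver calls at construction time.
        self.grid = grid
        self.block = block
        self.cluster = cluster
        self.cooperative_launch = cooperative_launch

    def _validate(self, get_device):
        dev = None  # resolved at most once, and only if a check needs it
        if self.cluster is not None:
            dev = get_device()
            if dev.compute_capability < (9, 0):
                raise RuntimeError("thread block clusters require CC >= 9.0")
        if self.cooperative_launch:
            dev = dev or get_device()  # reuse the device already fetched
            if not dev.supports_cooperative_launch:
                raise RuntimeError("device lacks cooperative launch support")
```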
LaunchConfig.__init__ previously called Device() to validate compute capability (for cluster launches) and cooperative_launch support, but at construction time the stream — and therefore the correct device — is not yet known. Move both checks into _launcher.pyx, where the stream is available:

- _check_cluster_launch: queries stream.device.compute_capability and raises if CC < 9.0 (thread block clusters require H100+)
- _check_cooperative_launch: now also guards cooperative_launch support via stream.device before the grid-size check

LaunchConfig.__init__ is now a pure data class with no driver calls. Cluster and cooperative config objects can be constructed without a CUDA context, and errors surface at launch() time with the correct device in scope.

Remove the try/except CUDAError skip guards from cluster-related tests; constructing LaunchConfig(cluster=...) no longer raises on sub-CC-9.0 devices, so those tests run on all hardware.
* Fix tab completion
* Fix tests
* Always install the monkeypatch
* Update release note
* Apply suggestion from @leofang (Co-authored-by: Leo Fang <leof@nvidia.com>)
* Fix test
* Fix tests hanging on Windows

---------

Co-authored-by: Leo Fang <leof@nvidia.com>
Can you provide (or link to) more context about why this change is desirable?
There's a
Makes sense. We could just drop the checks entirely and let the driver handle it (per #685).
A few more things on the hot path worth looking at:
For reference: cuPy (raw.pyx#L118) calls
Summary
Summary

- LaunchConfig.__init__ previously called Device() to validate compute capability (cluster launches) and cooperative_launch support, but the stream — and therefore the correct device — is not known at construction time
- Move both checks into _launcher.pyx where stream.device is available: _check_cluster_launch (CC < 9.0 guard) and an updated _check_cooperative_launch (adds a cooperative_launch support guard before the existing grid-size check)
- LaunchConfig.__init__ is now a pure data class with zero driver calls; cluster and cooperative configs can be constructed without a CUDA context

Test plan
- Removed the try/except CUDAError skip guards from test_launch_config_cluster_grid_conversion — constructing LaunchConfig(cluster=...) no longer raises on sub-CC-9.0 devices, so these tests now run on all hardware
- Verified errors surface at launch() time on CC < 9.0; full suite 3225 passed