perf(cuda_core): cache native LaunchConfig struct and make fields read-only#2070
Draft
KRRT7 wants to merge 6 commits into
Contributor
…d-only

_to_native_launch_config() rebuilt CUlaunchConfig on every launch() call even when the config was unchanged. Since LaunchConfig is already designed as an immutable value type (__hash__, __eq__), cache the result after the first build and return a struct copy on subsequent calls.

Fields are changed from `public` to `readonly` so the cache can never go stale from Python-side mutation. Cython-internal access is unaffected.

Benchmark (T4, 50k iters, noop kernel):
launch() reused config (cache warm): 3.98 us/call
launch() fresh config each call: 6.34 us/call
speedup: 1.6x
Author
meant to open the draft in my fork so that it could run in my CI, apologies, will clean up shortly
7bb484b to 6d5e25d
- Expose _cache_valid as readonly so tests can assert the cdef cache path
- Add test_launch_config_cdef_cache_populated_by_launch: verifies the cdef _to_native_launch_config cache is set after a real launch() call
- Add test_launch_config_native_conversion_stable_cluster: cluster config consistency via the cpdef wrapper
- Rename cpdef-level cache tests to make clear they test the Python wrapper, not the cdef cache
- Add NOTE comment at the cpdef wrapper distinguishing it from the cached cdef method
- Add breaking change entry to 1.0.1 release notes for readonly fields
- Use typed values in test_launch_config_fields_are_readonly instead of None
- Narrow except clause to CUDAError in cooperative skip guard
- Assert cache persists through a second launch in cdef cache test
Motivation
launch() is on the critical path. Every call currently pays the full cost of _to_native_launch_config() (a memset, a vector::resize, and an attribute rebuild) even when the LaunchConfig hasn't changed between calls, which is the normal pattern in a tight dispatch loop.

LaunchConfig is already designed as an immutable value type (__hash__ and __eq__ are defined), so the native CUlaunchConfig struct is a pure function of its fields. We can compute it once and reuse it.
Changes
- _launch_config.pxd: public → readonly on all five fields. Python callers can still read config.grid etc. but can no longer mutate them after construction. Cython-internal code is unaffected (direct C field access is unchanged). Two new private C fields: _cached_drv_cfg and _cache_valid.
- _launch_config.pyx: _to_native_launch_config (the cdef method called from launch()) now short-circuits on _cache_valid. On first call it builds the struct as before, stores it, and sets _cache_valid = True. Subsequent calls return a struct copy in O(1). The attrs pointer in the cached struct is stable because self._attrs is never resized after the cache is set.

No changes to _launcher.pyx, _graph_node.pyx, or any test that reads fields: readonly fields are accessed identically from both Python and Cython.
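The build-once, copy-afterwards pattern described above can be sketched in plain Python. This is only an illustrative analog, not the Cython code in this PR: `LaunchConfigSketch`, `NativeConfig`, `to_native`, and `build_count` are hypothetical names, and a frozen dataclass stands in for the native CUlaunchConfig struct.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class NativeConfig:
    """Hypothetical stand-in for the native launch-config struct."""
    grid: tuple
    block: tuple
    shmem_size: int
    num_attrs: int


class LaunchConfigSketch:
    """Illustrative analog of a config with read-only fields and a cached native form."""

    def __init__(self, grid, block, shmem_size=0, cooperative_launch=False):
        self._grid = tuple(grid)
        self._block = tuple(block)
        self._shmem_size = int(shmem_size)
        self._cooperative_launch = bool(cooperative_launch)
        self._cached_native = None
        self._cache_valid = False
        self.build_count = 0  # instrumentation for this sketch only

    # Properties without setters: readable from Python, assignment raises AttributeError.
    @property
    def grid(self):
        return self._grid

    @property
    def block(self):
        return self._block

    @property
    def shmem_size(self):
        return self._shmem_size

    def to_native(self):
        if not self._cache_valid:
            # Slow path: build the native form once and remember it.
            self.build_count += 1
            self._cached_native = NativeConfig(
                grid=self._grid,
                block=self._block,
                shmem_size=self._shmem_size,
                num_attrs=1 if self._cooperative_launch else 0,
            )
            self._cache_valid = True
        # Fast path: hand back a copy so callers never share the cached object.
        return replace(self._cached_native)


config = LaunchConfigSketch(grid=(4, 1, 1), block=(32, 1, 1))
first = config.to_native()
second = config.to_native()
```

Because the fields cannot be reassigned after construction, nothing in this sketch ever needs to reset `_cache_valid`, which is the same reasoning the PR applies to the readonly fields.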
Correctness
- Because the fields are readonly, the cache cannot go stale through mutation from user code.
- The attrs pointer in the cached CUlaunchConfig points into self._attrs. Since _attrs.resize(0) is skipped on the fast path, the vector is never reallocated after the cache is populated; the pointer is valid for the lifetime of any cuLaunchKernelEx call.
- … both compute the same result simultaneously, which is harmless.
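The pointer-stability argument can be illustrated with ctypes. This is a rough analog under stated assumptions, not the PR's code: `Attr`, `Config`, `attr_storage`, and `struct_copy` are hypothetical names, and a fixed ctypes array stands in for the `self._attrs` vector. The point it shows is that a byte-wise struct copy duplicates the pointer, not the storage, so every copy stays valid only while the owning buffer is neither freed nor reallocated.

```python
import ctypes


class Attr(ctypes.Structure):
    """Hypothetical launch attribute: an id and a value."""
    _fields_ = [("id", ctypes.c_int), ("value", ctypes.c_int)]


class Config(ctypes.Structure):
    """Hypothetical native config holding a pointer into attribute storage."""
    _fields_ = [
        ("attrs", ctypes.POINTER(Attr)),
        ("num_attrs", ctypes.c_uint),
    ]


# Owning storage for the attributes; it must outlive every Config copy.
attr_storage = (Attr * 1)(Attr(1, 42))

# The cached struct points into the owning storage.
cached = Config(ctypes.cast(attr_storage, ctypes.POINTER(Attr)), 1)


def struct_copy(cfg):
    """Return a byte-wise copy of cfg; the attrs pointer is copied, not the data."""
    copy = Config()
    ctypes.memmove(ctypes.addressof(copy), ctypes.addressof(cfg), ctypes.sizeof(Config))
    return copy


copy_a = struct_copy(cached)
copy_b = struct_copy(cached)

# Both copies alias the same storage, so a write through the owner is visible in each.
attr_storage[0].value = 7
```

If the storage were replaced by a new buffer, the cached pointer would dangle. Skipping the resize on the fast path is what rules that out in the PR's design.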
Benchmark
Measured on a T4 (CUDA 12.9), 50,000 iterations, noop kernel:
- launch(), reused config (cache warm): 3.98 us/call
- launch(), fresh config each call (cache cold): 6.34 us/call
- LaunchConfig() construction alone: …

1.6x speedup on launch() when reusing a LaunchConfig across calls, which is the expected pattern for any steady-state dispatch loop. LaunchConfig construction accounts for ~28% of the cold-path cost; the rest is the _to_native_launch_config rebuild that the cache eliminates.

Tests
- test_launch_config_fields_are_readonly: all five fields raise AttributeError on write
- test_launch_config_native_cache_stable: two calls to _to_native_launch_config on the same config return consistent grid/block/shmem/numAttrs values
- test_launch_config_native_cache_cooperative: cached cooperative config retains its attribute (numAttrs == 1)
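
A read-only-fields check along the lines of the first test might look like the sketch below. It runs against a hypothetical stand-in class (`ReadOnlyConfig`), not the real LaunchConfig, and `rejected_writes` is an illustrative helper that does not appear in the PR. It assigns typed replacement values, not None, so a rejected write cannot be attributed to a type check.

```python
class ReadOnlyConfig:
    """Hypothetical stand-in with read-only fields (not the real LaunchConfig)."""

    def __init__(self, grid, block, shmem_size):
        self._values = {"grid": grid, "block": block, "shmem_size": shmem_size}

    # Properties without setters reject assignment with AttributeError.
    grid = property(lambda self: self._values["grid"])
    block = property(lambda self: self._values["block"])
    shmem_size = property(lambda self: self._values["shmem_size"])


def rejected_writes(config, new_values):
    """Return the names of fields whose assignment raised AttributeError."""
    rejected = []
    for name, value in new_values.items():
        try:
            setattr(config, name, value)
        except AttributeError:
            rejected.append(name)
    return rejected


def test_fields_are_readonly():
    config = ReadOnlyConfig(grid=(4, 1, 1), block=(32, 1, 1), shmem_size=0)
    # Correctly typed replacement values for every field.
    new_values = {"grid": (8, 1, 1), "block": (64, 1, 1), "shmem_size": 128}
    # Every write must be rejected, and the original values must survive.
    assert rejected_writes(config, new_values) == list(new_values)
    assert config.grid == (4, 1, 1)
    assert config.block == (32, 1, 1)
    assert config.shmem_size == 0
```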