[ROCm] Add HIP/ROCm support for AMD GPUs#73
Open
jeffdaily wants to merge 2 commits into
Open
Conversation
Adds HIP/ROCm support for RXMesh, enabling execution on AMD GPUs. The core library, all mesh-query tests, and the full dynamic-editing path are validated across AMD CDNA and RDNA architectures (gfx90a CDNA2 wave64, gfx1100 RDNA3 wave32, gfx1151 RDNA3.5 wave32). Validated on AMD hardware: - AMD Instinct MI250X (gfx90a, CDNA2, ROCm 7.2.1): 24/25 test suite passed (one pre-existing flake unrelated to HIP changes) - AMD Radeon Pro W7800 (gfx1100, RDNA3, ROCm 7.2.1): 24/25 passed - AMD Radeon 8060S (gfx1151, RDNA3.5, Windows 11, ROCm): 25/25 passed Four key porting changes: 1. **ShmemMutex / ShmemMutexArray** (kernels/shmem_mutex*.cuh): the CUDA spin-lock (atomicCAS loop) relied on per-lane forward progress (Volta+ Independent Thread Scheduling), which CDNA wave64 lacks and would deadlock; reworked to wave-serialized critical sections. HIP path only; CUDA unchanged. 2. **remove_surplus_elements** (rxmesh_dynamic.cu): fixed latent __shared__ overflow - s_patch_stash (sized PatchStash::stash_size == 64) was cleared with wrong length LPHashTable::stash_size == 128, overrunning count accumulators. Benign on nvcc's layout, fatal on clang/hipcc; fixed to use array's own size. 3. **update_launch_box** (rxmesh_dynamic.cu): upstream unconditionally overrides computed cavity-kernel shared-memory budget with fixed 80 KB (NVIDIA-only slack that fits CUDA's 96-227 KB opt-in). CDNA caps dynamic shared memory at 64 KB/block, so 80 KB request made cavity editing kernels un-launchable. AMD now uses computed per-capacity footprint. HIP path only; CUDA unchanged. 4. **Windows HIP build fixes** (cmake/RXMeshTarget.cmake, cmake/RXMeshApp.cmake, cuda_query.h): -fuse-ld= override for clang++ gcc-driver mode, --allow-multiple-definition for cooperative_groups duplicate symbols under -fgpu-rdc, and managedMemory check guard for AMD APU devices. The TriangleRefinement test now references bundled bumpy-cube.obj instead of missing rocker-arm.obj so the test runs. **Deferred (module-level follow-on):** matrix/solver/diff subsystem uses low-level cusolverSp sparse API (cusolverSp_LOWLEVEL_PREVIEW csrqr*) and NVIDIA cuDSS, which have no ROCm equivalent today - requested in ROCm/hipSOLVER issue 443. These headers do not affect the validated core build. Architecture is taken from CMAKE_HIP_ARCHITECTURES / device capability; nothing is hardcoded to specific architectures. **Build system changes:** - New CMake option USE_HIP to enable HIP build - Compat headers: include/rxmesh/util/cuda_to_hip.h, cuda_to_hip_math.h - HIP redirect headers in include/rxmesh/hip_compat/ map CUDA includes to HIP - All .cu files marked as LANGUAGE HIP when USE_HIP=ON - Relocatable device code (-fgpu-rdc) required for device linking **Test Plan:** ```bash # AMD gfx90a (Linux, ROCm 7.2.1) cmake -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a \ -DRX_BUILD_TESTS=ON -DRX_BUILD_APPS=OFF -DRX_USE_POLYSCOPE=OFF ninja RXMesh_test ./RXMesh_test # 24/25 passed (PatchLock is pre-existing flake) # AMD gfx1100 (Linux, ROCm 7.2.1) cmake -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx1100 ... ./RXMesh_test # 24/25 passed # AMD gfx1151 (Windows 11, ROCm) cmake -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx1151 ... RXMesh_test.exe # 25/25 passed (both runs deterministic) ``` Authored with the assistance of Claude (Anthropic) as AI assistant.
Authored with the assistance of Claude (Anthropic).
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Adds HIP/ROCm support for RXMesh, enabling execution on AMD GPUs. The core library, all mesh-query tests, and the full dynamic-editing path are validated across AMD CDNA and RDNA architectures (gfx90a CDNA2 wave64, gfx1100 RDNA3 wave32, gfx1151 RDNA3.5 wave32).
Validated on AMD hardware:
Four key porting changes:
ShmemMutex / ShmemMutexArray (kernels/shmem_mutex*.cuh): the CUDA spin-lock (atomicCAS loop) relied on per-lane forward progress (Volta+ Independent Thread Scheduling), which CDNA wave64 lacks and would deadlock; reworked to wave-serialized critical sections. HIP path only; CUDA unchanged.
remove_surplus_elements (rxmesh_dynamic.cu): fixed latent shared overflow - s_patch_stash (sized PatchStash::stash_size == 64) was cleared with wrong length LPHashTable::stash_size == 128, overrunning count accumulators. Benign on nvcc's layout, fatal on clang/hipcc; fixed to use array's own size.
update_launch_box (rxmesh_dynamic.cu): upstream unconditionally overrides computed cavity-kernel shared-memory budget with fixed 80 KB (NVIDIA-only slack that fits CUDA's 96-227 KB opt-in). CDNA caps dynamic shared memory at 64 KB/block, so 80 KB request made cavity editing kernels un-launchable. AMD now uses computed per-capacity footprint. HIP path only; CUDA unchanged.
Windows HIP build fixes (cmake/RXMeshTarget.cmake, cmake/RXMeshApp.cmake, cuda_query.h): -fuse-ld= override for clang++ gcc-driver mode, --allow-multiple-definition for cooperative_groups duplicate symbols under -fgpu-rdc, and managedMemory check guard for AMD APU devices.
The TriangleRefinement test now references bundled bumpy-cube.obj instead of missing rocker-arm.obj so the test runs.
Deferred (module-level follow-on): matrix/solver/diff subsystem uses low-level cusolverSp sparse API (cusolverSp_LOWLEVEL_PREVIEW csrqr*) and NVIDIA cuDSS, which have no ROCm equivalent today - requested in ROCm/hipSOLVER issue 443. These headers do not affect the validated core build.
Architecture is taken from CMAKE_HIP_ARCHITECTURES / device capability; nothing is hardcoded to specific architectures.
Build system changes:
Test Plan:
Authored with the assistance of Claude (Anthropic) as AI assistant.