[ROCm] Add HIP/ROCm support for AMD GPUs by jeffdaily · Pull Request #73 · owensgroup/RXMesh

jeffdaily · 2026-06-05T20:33:59Z

Adds HIP/ROCm support for RXMesh, enabling execution on AMD GPUs. The core library, all mesh-query tests, and the full dynamic-editing path are validated across AMD CDNA and RDNA architectures (gfx90a CDNA2 wave64, gfx1100 RDNA3 wave32, gfx1151 RDNA3.5 wave32).

Validated on AMD hardware:

AMD Instinct MI250X (gfx90a, CDNA2, ROCm 7.2.1): 24/25 test suite passed (one pre-existing flake unrelated to HIP changes)
AMD Radeon Pro W7800 (gfx1100, RDNA3, ROCm 7.2.1): 24/25 passed
AMD Radeon PRO V710 (gfx1101, RDNA3, Windows 11, ROCm): 25/25 passed
AMD Radeon RX 9070 XT (gfx1201, RDNA4, Windows 11, ROCm): 25/25 passed
AMD Radeon 8060S (gfx1151, RDNA3.5, Windows 11, ROCm): 25/25 passed

Four key porting changes:

ShmemMutex / ShmemMutexArray (kernels/shmem_mutex*.cuh): the CUDA spin-lock (atomicCAS loop) relied on per-lane forward progress (Volta+ Independent Thread Scheduling), which CDNA wave64 lacks and would deadlock; reworked to wave-serialized critical sections. HIP path only; CUDA unchanged.
remove_surplus_elements (rxmesh_dynamic.cu): fixed latent shared overflow - s_patch_stash (sized PatchStash::stash_size == 64) was cleared with wrong length LPHashTable::stash_size == 128, overrunning count accumulators. Benign on nvcc's layout, fatal on clang/hipcc; fixed to use array's own size.
update_launch_box (rxmesh_dynamic.cu): upstream unconditionally overrides computed cavity-kernel shared-memory budget with fixed 80 KB (NVIDIA-only slack that fits CUDA's 96-227 KB opt-in). CDNA caps dynamic shared memory at 64 KB/block, so 80 KB request made cavity editing kernels un-launchable. AMD now uses computed per-capacity footprint. HIP path only; CUDA unchanged.
Windows HIP build fixes (cmake/RXMeshTarget.cmake, cmake/RXMeshApp.cmake, cuda_query.h): -fuse-ld= override for clang++ gcc-driver mode, --allow-multiple-definition for cooperative_groups duplicate symbols under -fgpu-rdc, and managedMemory check guard for AMD APU devices.

The TriangleRefinement test now references bundled bumpy-cube.obj instead of missing rocker-arm.obj so the test runs.

Deferred (module-level follow-on): matrix/solver/diff subsystem uses low-level cusolverSp sparse API (cusolverSp_LOWLEVEL_PREVIEW csrqr*) and NVIDIA cuDSS, which have no ROCm equivalent today - requested in ROCm/hipSOLVER issue 443. These headers do not affect the validated core build.

Architecture is taken from CMAKE_HIP_ARCHITECTURES / device capability; nothing is hardcoded to specific architectures.

Build system changes:

New CMake option USE_HIP to enable HIP build
Compat headers: include/rxmesh/util/cuda_to_hip.h, cuda_to_hip_math.h
HIP redirect headers in include/rxmesh/hip_compat/ map CUDA includes to HIP
All .cu files marked as LANGUAGE HIP when USE_HIP=ON
Relocatable device code (-fgpu-rdc) required for device linking

Test Plan:

# AMD gfx90a (Linux, ROCm 7.2.1)
cmake -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a \
  -DRX_BUILD_TESTS=ON -DRX_BUILD_APPS=OFF -DRX_USE_POLYSCOPE=OFF
ninja RXMesh_test
./RXMesh_test  # 24/25 passed (PatchLock is pre-existing flake)

# AMD gfx1100 (Linux, ROCm 7.2.1)
cmake -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx1100 ...
./RXMesh_test  # 24/25 passed

# AMD gfx1101, gfx1201, gfx1151 (Windows 11, ROCm)
# Multi-arch fat binary:
cmake -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES="gfx1101;gfx1201" ...
# Or single arch:
cmake -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx1151 ...
RXMesh_test.exe  # 25/25 passed on all three architectures

Authored with the assistance of Claude (Anthropic) as AI assistant.

Adds HIP/ROCm support for RXMesh, enabling execution on AMD GPUs. The core library, all mesh-query tests, and the full dynamic-editing path are validated across AMD CDNA and RDNA architectures (gfx90a CDNA2 wave64, gfx1100 RDNA3 wave32, gfx1151 RDNA3.5 wave32). Validated on AMD hardware: - AMD Instinct MI250X (gfx90a, CDNA2, ROCm 7.2.1): 24/25 test suite passed (one pre-existing flake unrelated to HIP changes) - AMD Radeon Pro W7800 (gfx1100, RDNA3, ROCm 7.2.1): 24/25 passed - AMD Radeon 8060S (gfx1151, RDNA3.5, Windows 11, ROCm): 25/25 passed Four key porting changes: 1. **ShmemMutex / ShmemMutexArray** (kernels/shmem_mutex*.cuh): the CUDA spin-lock (atomicCAS loop) relied on per-lane forward progress (Volta+ Independent Thread Scheduling), which CDNA wave64 lacks and would deadlock; reworked to wave-serialized critical sections. HIP path only; CUDA unchanged. 2. **remove_surplus_elements** (rxmesh_dynamic.cu): fixed latent __shared__ overflow - s_patch_stash (sized PatchStash::stash_size == 64) was cleared with wrong length LPHashTable::stash_size == 128, overrunning count accumulators. Benign on nvcc's layout, fatal on clang/hipcc; fixed to use array's own size. 3. **update_launch_box** (rxmesh_dynamic.cu): upstream unconditionally overrides computed cavity-kernel shared-memory budget with fixed 80 KB (NVIDIA-only slack that fits CUDA's 96-227 KB opt-in). CDNA caps dynamic shared memory at 64 KB/block, so 80 KB request made cavity editing kernels un-launchable. AMD now uses computed per-capacity footprint. HIP path only; CUDA unchanged. 4. **Windows HIP build fixes** (cmake/RXMeshTarget.cmake, cmake/RXMeshApp.cmake, cuda_query.h): -fuse-ld= override for clang++ gcc-driver mode, --allow-multiple-definition for cooperative_groups duplicate symbols under -fgpu-rdc, and managedMemory check guard for AMD APU devices. The TriangleRefinement test now references bundled bumpy-cube.obj instead of missing rocker-arm.obj so the test runs. **Deferred (module-level follow-on):** matrix/solver/diff subsystem uses low-level cusolverSp sparse API (cusolverSp_LOWLEVEL_PREVIEW csrqr*) and NVIDIA cuDSS, which have no ROCm equivalent today - requested in ROCm/hipSOLVER issue 443. These headers do not affect the validated core build. Architecture is taken from CMAKE_HIP_ARCHITECTURES / device capability; nothing is hardcoded to specific architectures. **Build system changes:** - New CMake option USE_HIP to enable HIP build - Compat headers: include/rxmesh/util/cuda_to_hip.h, cuda_to_hip_math.h - HIP redirect headers in include/rxmesh/hip_compat/ map CUDA includes to HIP - All .cu files marked as LANGUAGE HIP when USE_HIP=ON - Relocatable device code (-fgpu-rdc) required for device linking **Test Plan:** ```bash # AMD gfx90a (Linux, ROCm 7.2.1) cmake -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a \ -DRX_BUILD_TESTS=ON -DRX_BUILD_APPS=OFF -DRX_USE_POLYSCOPE=OFF ninja RXMesh_test ./RXMesh_test # 24/25 passed (PatchLock is pre-existing flake) # AMD gfx1100 (Linux, ROCm 7.2.1) cmake -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx1100 ... ./RXMesh_test # 24/25 passed # AMD gfx1151 (Windows 11, ROCm) cmake -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx1151 ... RXMesh_test.exe # 25/25 passed (both runs deterministic) ``` Authored with the assistance of Claude (Anthropic) as AI assistant.

Authored with the assistance of Claude (Anthropic).

jeffdaily added 2 commits June 5, 2026 20:20

[ROCm] Note AMD GPU (ROCm/HIP) support in the README

adfe9a6

Authored with the assistance of Claude (Anthropic).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ROCm] Add HIP/ROCm support for AMD GPUs#73

[ROCm] Add HIP/ROCm support for AMD GPUs#73
jeffdaily wants to merge 2 commits into
owensgroup:mainfrom
jeffdaily:moat-port

jeffdaily commented Jun 5, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

jeffdaily commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

jeffdaily commented Jun 5, 2026 •

edited

Loading