Skip to content

[ROCm] Add HIP/ROCm support for AMD GPUs#73

Open
jeffdaily wants to merge 2 commits into
owensgroup:mainfrom
jeffdaily:moat-port
Open

[ROCm] Add HIP/ROCm support for AMD GPUs#73
jeffdaily wants to merge 2 commits into
owensgroup:mainfrom
jeffdaily:moat-port

Conversation

@jeffdaily

@jeffdaily jeffdaily commented Jun 5, 2026

Copy link
Copy Markdown

Adds HIP/ROCm support for RXMesh, enabling execution on AMD GPUs. The core library, all mesh-query tests, and the full dynamic-editing path are validated across AMD CDNA and RDNA architectures (gfx90a CDNA2 wave64, gfx1100 RDNA3 wave32, gfx1151 RDNA3.5 wave32).

Validated on AMD hardware:

  • AMD Instinct MI250X (gfx90a, CDNA2, ROCm 7.2.1): 24/25 test suite passed (one pre-existing flake unrelated to HIP changes)
  • AMD Radeon Pro W7800 (gfx1100, RDNA3, ROCm 7.2.1): 24/25 passed
  • AMD Radeon PRO V710 (gfx1101, RDNA3, Windows 11, ROCm): 25/25 passed
  • AMD Radeon RX 9070 XT (gfx1201, RDNA4, Windows 11, ROCm): 25/25 passed
  • AMD Radeon 8060S (gfx1151, RDNA3.5, Windows 11, ROCm): 25/25 passed

Four key porting changes:

  1. ShmemMutex / ShmemMutexArray (kernels/shmem_mutex*.cuh): the CUDA spin-lock (atomicCAS loop) relied on per-lane forward progress (Volta+ Independent Thread Scheduling), which CDNA wave64 lacks and would deadlock; reworked to wave-serialized critical sections. HIP path only; CUDA unchanged.

  2. remove_surplus_elements (rxmesh_dynamic.cu): fixed latent shared overflow - s_patch_stash (sized PatchStash::stash_size == 64) was cleared with wrong length LPHashTable::stash_size == 128, overrunning count accumulators. Benign on nvcc's layout, fatal on clang/hipcc; fixed to use array's own size.

  3. update_launch_box (rxmesh_dynamic.cu): upstream unconditionally overrides computed cavity-kernel shared-memory budget with fixed 80 KB (NVIDIA-only slack that fits CUDA's 96-227 KB opt-in). CDNA caps dynamic shared memory at 64 KB/block, so 80 KB request made cavity editing kernels un-launchable. AMD now uses computed per-capacity footprint. HIP path only; CUDA unchanged.

  4. Windows HIP build fixes (cmake/RXMeshTarget.cmake, cmake/RXMeshApp.cmake, cuda_query.h): -fuse-ld= override for clang++ gcc-driver mode, --allow-multiple-definition for cooperative_groups duplicate symbols under -fgpu-rdc, and managedMemory check guard for AMD APU devices.

The TriangleRefinement test now references bundled bumpy-cube.obj instead of missing rocker-arm.obj so the test runs.

Deferred (module-level follow-on): matrix/solver/diff subsystem uses low-level cusolverSp sparse API (cusolverSp_LOWLEVEL_PREVIEW csrqr*) and NVIDIA cuDSS, which have no ROCm equivalent today - requested in ROCm/hipSOLVER issue 443. These headers do not affect the validated core build.

Architecture is taken from CMAKE_HIP_ARCHITECTURES / device capability; nothing is hardcoded to specific architectures.

Build system changes:

  • New CMake option USE_HIP to enable HIP build
  • Compat headers: include/rxmesh/util/cuda_to_hip.h, cuda_to_hip_math.h
  • HIP redirect headers in include/rxmesh/hip_compat/ map CUDA includes to HIP
  • All .cu files marked as LANGUAGE HIP when USE_HIP=ON
  • Relocatable device code (-fgpu-rdc) required for device linking

Test Plan:

# AMD gfx90a (Linux, ROCm 7.2.1)
cmake -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a \
  -DRX_BUILD_TESTS=ON -DRX_BUILD_APPS=OFF -DRX_USE_POLYSCOPE=OFF
ninja RXMesh_test
./RXMesh_test  # 24/25 passed (PatchLock is pre-existing flake)

# AMD gfx1100 (Linux, ROCm 7.2.1)
cmake -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx1100 ...
./RXMesh_test  # 24/25 passed

# AMD gfx1101, gfx1201, gfx1151 (Windows 11, ROCm)
# Multi-arch fat binary:
cmake -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES="gfx1101;gfx1201" ...
# Or single arch:
cmake -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx1151 ...
RXMesh_test.exe  # 25/25 passed on all three architectures

Authored with the assistance of Claude (Anthropic) as AI assistant.

jeffdaily added 2 commits June 5, 2026 20:20
Adds HIP/ROCm support for RXMesh, enabling execution on AMD GPUs. The core
library, all mesh-query tests, and the full dynamic-editing path are validated
across AMD CDNA and RDNA architectures (gfx90a CDNA2 wave64, gfx1100 RDNA3
wave32, gfx1151 RDNA3.5 wave32).

Validated on AMD hardware:
- AMD Instinct MI250X (gfx90a, CDNA2, ROCm 7.2.1): 24/25 test suite passed
  (one pre-existing flake unrelated to HIP changes)
- AMD Radeon Pro W7800 (gfx1100, RDNA3, ROCm 7.2.1): 24/25 passed
- AMD Radeon 8060S (gfx1151, RDNA3.5, Windows 11, ROCm): 25/25 passed

Four key porting changes:

1. **ShmemMutex / ShmemMutexArray** (kernels/shmem_mutex*.cuh): the CUDA
   spin-lock (atomicCAS loop) relied on per-lane forward progress (Volta+
   Independent Thread Scheduling), which CDNA wave64 lacks and would deadlock;
   reworked to wave-serialized critical sections. HIP path only; CUDA unchanged.

2. **remove_surplus_elements** (rxmesh_dynamic.cu): fixed latent __shared__
   overflow - s_patch_stash (sized PatchStash::stash_size == 64) was cleared
   with wrong length LPHashTable::stash_size == 128, overrunning count
   accumulators. Benign on nvcc's layout, fatal on clang/hipcc; fixed to use
   array's own size.

3. **update_launch_box** (rxmesh_dynamic.cu): upstream unconditionally overrides
   computed cavity-kernel shared-memory budget with fixed 80 KB (NVIDIA-only
   slack that fits CUDA's 96-227 KB opt-in). CDNA caps dynamic shared memory at
   64 KB/block, so 80 KB request made cavity editing kernels un-launchable. AMD
   now uses computed per-capacity footprint. HIP path only; CUDA unchanged.

4. **Windows HIP build fixes** (cmake/RXMeshTarget.cmake, cmake/RXMeshApp.cmake,
   cuda_query.h): -fuse-ld= override for clang++ gcc-driver mode,
   --allow-multiple-definition for cooperative_groups duplicate symbols under
   -fgpu-rdc, and managedMemory check guard for AMD APU devices.

The TriangleRefinement test now references bundled bumpy-cube.obj instead of
missing rocker-arm.obj so the test runs.

**Deferred (module-level follow-on):** matrix/solver/diff subsystem uses
low-level cusolverSp sparse API (cusolverSp_LOWLEVEL_PREVIEW csrqr*) and NVIDIA
cuDSS, which have no ROCm equivalent today - requested in ROCm/hipSOLVER
issue 443. These headers do not affect the validated core build.

Architecture is taken from CMAKE_HIP_ARCHITECTURES / device capability; nothing
is hardcoded to specific architectures.

**Build system changes:**
- New CMake option USE_HIP to enable HIP build
- Compat headers: include/rxmesh/util/cuda_to_hip.h, cuda_to_hip_math.h
- HIP redirect headers in include/rxmesh/hip_compat/ map CUDA includes to HIP
- All .cu files marked as LANGUAGE HIP when USE_HIP=ON
- Relocatable device code (-fgpu-rdc) required for device linking

**Test Plan:**
```bash
# AMD gfx90a (Linux, ROCm 7.2.1)
cmake -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx90a \
  -DRX_BUILD_TESTS=ON -DRX_BUILD_APPS=OFF -DRX_USE_POLYSCOPE=OFF
ninja RXMesh_test
./RXMesh_test  # 24/25 passed (PatchLock is pre-existing flake)

# AMD gfx1100 (Linux, ROCm 7.2.1)
cmake -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx1100 ...
./RXMesh_test  # 24/25 passed

# AMD gfx1151 (Windows 11, ROCm)
cmake -DUSE_HIP=ON -DCMAKE_HIP_ARCHITECTURES=gfx1151 ...
RXMesh_test.exe  # 25/25 passed (both runs deterministic)
```

Authored with the assistance of Claude (Anthropic) as AI assistant.
Authored with the assistance of Claude (Anthropic).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant