
[OPUS]: arch-guard fp8/bf8 packed-cvt builtins for RDNA3/3.5 (gfx1151)#3860

Merged: valarLip merged 1 commit into main from carhuang/gfx1151_opus_fp8_guard on Jun 23, 2026



@carlushuang carlushuang commented Jun 22, 2026

Collaborator

fix(opus): arch-guard fp8/bf8 packed-cvt builtins for RDNA3/3.5 (gfx1151)

opus/opus.hpp (fp32_to_fp8 / fp8_to_fp32, non-template) and aiter_opus_plus.h (fp32_to_{fp8,bf8}_scaled_x2) use __builtin_amdgcn_cvt_pk_fp8_f32 / v_cvt_pk_{fp8,bf8}_f32, which require the fp8-conversion-insts target feature. That feature is absent on RDNA3 (gfx1100) and RDNA3.5 (gfx115x). Because the non-template functions are compiled eagerly, clang rejects them with "instruction not supported on this GPU" / "needs target feature fp8-conversion-insts", so any HIP module that includes these headers (module_cache, module_activation, module_fused_qk_rmsnorm_group_quant, …) fails to JIT-build on gfx1151 — even on BF16 paths that never call fp8 conversion at runtime.
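As a generic C++ analogy for the eager-versus-deferred checking described above (not the code in opus.hpp), consider the sketch below. `HasFp8`, `NoFp8`, `convert_lazy`, and `use_supported` are hypothetical names introduced only for illustration. It shows that a template body with a dependent call is checked only when instantiated, whereas the equivalent non-template body would be diagnosed at its definition even if nothing calls it. The diagnostic in this PR comes from the target-feature check on the builtin, which this host-only sketch does not reproduce.

```cpp
// Hypothetical stand-ins for "arch with the feature" and "arch without it".
struct HasFp8 { static int cvt(float) { return 1; } };
struct NoFp8 {};

// Template: `Arch::cvt` is a dependent name, so the call is only checked
// when convert_lazy<Arch> is actually instantiated.
template <typename Arch>
inline int convert_lazy(float x) { return Arch::cvt(x); }

// A non-template equivalent is checked at its definition, so the line below
// would fail to compile even if it were never called:
// inline int convert_eager(float x) { return NoFp8::cvt(x); }

// Only the supported instantiation is requested; convert_lazy<NoFp8> is never
// instantiated, so this translation unit compiles.
inline int use_supported() { return convert_lazy<HasFp8>(1.0f); }
```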

Change

Guard the instruction bodies behind the arches that actually have the feature (gfx942/gfx950/gfx1200/gfx1201/gfx1250) and provide a compile-only fallback on others. No runtime behaviour change on supported arches; on RDNA3/3.5 the fp8 cast is never executed for BF16.
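A minimal sketch of such a guard, assuming the per-target `__gfx<N>__` macros that clang defines for the selected GPU. The names `SKETCH_HAS_FP8_CVT`, `SKETCH_DEVICE`, and `sketch_cvt_pk_fp8` are hypothetical, and the fallback value is an assumption; the actual macro names and fallback behaviour in opus.hpp and aiter_opus_plus.h may differ.

```cpp
#include <cstdint>

// Hypothetical feature macro: true only on the arches the PR lists as having
// the fp8-conversion-insts target feature.
#if defined(__gfx942__) || defined(__gfx950__) || defined(__gfx1200__) || \
    defined(__gfx1201__) || defined(__gfx1250__)
#define SKETCH_HAS_FP8_CVT 1
#else
#define SKETCH_HAS_FP8_CVT 0
#endif

// Hypothetical qualifier macro so the sketch also compiles as plain host C++.
#if defined(__HIP__)
#define SKETCH_DEVICE __device__
#else
#define SKETCH_DEVICE
#endif

// Converts two floats to fp8 and packs them into the low 16 bits of the result.
SKETCH_DEVICE inline std::uint32_t sketch_cvt_pk_fp8(float a, float b)
{
#if SKETCH_HAS_FP8_CVT
    // Arguments: (src0, src1, old, word_sel); word_sel=false writes the low half
    // of `old`, which is 0 here.
    return static_cast<std::uint32_t>(__builtin_amdgcn_cvt_pk_fp8_f32(a, b, 0, false));
#else
    // Compile-only fallback (assumed behaviour): lets the header parse on
    // arches without the instruction. Not expected to run on BF16 paths.
    (void)a;
    (void)b;
    return 0u;
#endif
}
```

With a guard of this shape, a translation unit that merely includes the header never sees the builtin when the target lacks the feature, so the build succeeds while supported arches keep the original instruction path.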

Testing

With this patch, module_cache / module_activation / norm modules JIT-build and run on gfx1151 (Radeon 8060S). aiter.reshape_and_cache validated.

Companion

ROCm/ATOM: gfx1151 Qwen3.x (GDN hybrid) BF16 bring-up, branch carhuang/support_gfx1151_qwen36 (ROCm/ATOM#1314).

…151)

opus.hpp fp32_to_fp8 / fp8_to_fp32 (non-template) and aiter_opus_plus.h
fp32_to_{fp8,bf8}_scaled_x2 use __builtin_amdgcn_cvt_pk_fp8_f32 /
v_cvt_pk_{fp8,bf8}_f32, which require the fp8-conversion-insts target
feature absent on RDNA3 (gfx1100) and RDNA3.5 (gfx115x). clang rejects the
non-template builtins eagerly, so any HIP module that includes these headers
(module_cache, module_activation, module_norm fused-quant, ...) fails to JIT
on gfx1151 even for BF16 paths that never call fp8 conversion.

Guard the instruction bodies behind the arches that have the feature
(gfx942/gfx950/gfx1200/gfx1201/gfx1250) with a compile-only fallback on
others. BF16 code paths never invoke fp8 cast at runtime on these arches.

Enables ROCm/ATOM Qwen3.x (Qwen3.5/3.6) BF16 bring-up on gfx1151 (Strix Halo /
Radeon 8060S). Companion: ROCm/ATOM PR (gfx1151 Qwen3.x bring-up).
@github-actions

Contributor

🏷️ CI Guide

Runs automatically on every PR:

  • ✅ Pre-checks (submodule verification, code formatting)
  • ✅ Aiter op tests (gfx942 + gfx950)
  • ✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| ci:triton-300x | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| ci:sglang | SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy |
| ci:atom | ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B |
| ci:atom_full | ATOM accuracy suite for PR and main models from ATOM models_accuracy.json |
| ci:vllm | vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5 |
| ci:all | All standard extended tests (excludes ci:atom_full) |

Only add ci:atom_full for FlyDSL or Triton upgrades.
Add labels via the sidebar or gh pr edit 3860 --add-label <label>

@carlushuang carlushuang changed the title fix(opus): arch-guard fp8/bf8 packed-cvt builtins for RDNA3/3.5 (gfx1151) [OPUS]: arch-guard fp8/bf8 packed-cvt builtins for RDNA3/3.5 (gfx1151) Jun 22, 2026
@valarLip valarLip merged commit 82c6b27 into main Jun 23, 2026
40 checks passed
@valarLip valarLip deleted the carhuang/gfx1151_opus_fp8_guard branch June 23, 2026 08:31
vgokhale pushed a commit that referenced this pull request Jun 24, 2026
…151) (#3860)
