[OPUS]: arch-guard fp8/bf8 packed-cvt builtins for RDNA3/3.5 (gfx1151) #3860
Merged
…151)
opus.hpp fp32_to_fp8 / fp8_to_fp32 (non-template) and aiter_opus_plus.h
fp32_to_{fp8,bf8}_scaled_x2 use __builtin_amdgcn_cvt_pk_fp8_f32 /
v_cvt_pk_{fp8,bf8}_f32, which require the fp8-conversion-insts target
feature absent on RDNA3 (gfx1100) and RDNA3.5 (gfx115x). clang rejects the
non-template builtins eagerly, so any HIP module that includes these headers
(module_cache, module_activation, module_norm fused-quant, ...) fails to JIT
on gfx1151 even for BF16 paths that never call fp8 conversion.
Guard the instruction bodies behind the arches that have the feature
(gfx942/gfx950/gfx1200/gfx1201/gfx1250) with a compile-only fallback on
others. BF16 code paths never invoke fp8 cast at runtime on these arches.
Enables ROCm/ATOM Qwen3.x (Qwen3.5/3.6) BF16 bring-up on gfx1151 (Strix Halo /
Radeon 8060S). Companion: ROCm/ATOM PR (gfx1151 Qwen3.x bring-up).
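The guard described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the names `SKETCH_HAS_FP8_CVT`, `SKETCH_HOST_DEVICE` and `sketch_fp32x2_to_fp8x2` are hypothetical, and the fallback returning 0 is an assumption about what a "compile-only" stub might look like. The arch list is the one named in the commit message, and the builtin is called with the `(float, float, int old, bool word_sel)` form that clang exposes.

```cpp
#include <cstdint>

// Hypothetical feature macro: 1 only when compiling device code for an arch
// that the commit message lists as having fp8-conversion-insts.
#if defined(__gfx942__) || defined(__gfx950__) || defined(__gfx1200__) || \
    defined(__gfx1201__) || defined(__gfx1250__)
#define SKETCH_HAS_FP8_CVT 1
#else
#define SKETCH_HAS_FP8_CVT 0
#endif

// Lets the same header compile with hipcc and with a plain host compiler.
#if defined(__HIPCC__)
#define SKETCH_HOST_DEVICE __host__ __device__
#else
#define SKETCH_HOST_DEVICE
#endif

// Packs two fp32 values into two fp8 bytes (low 16 bits of the result).
SKETCH_HOST_DEVICE inline std::uint16_t sketch_fp32x2_to_fp8x2(float a, float b)
{
#if SKETCH_HAS_FP8_CVT
    // Hardware path: the builtin is only ever seen by the compiler on
    // arches that provide the instruction.
    int packed = __builtin_amdgcn_cvt_pk_fp8_f32(a, b, /*old=*/0, /*word_sel=*/false);
    return static_cast<std::uint16_t>(packed & 0xffff);
#else
    // Assumed compile-only fallback: keeps the header buildable on arches
    // such as gfx1151. It is not a real conversion and must not be relied
    // on at runtime.
    (void)a;
    (void)b;
    return 0;
#endif
}
```

Because the check happens in the preprocessor, the builtin never reaches the compiler on gfx1100/gfx115x, so BF16-only modules that merely include the header can still be built.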
Contributor
valarLip approved these changes Jun 23, 2026
vgokhale pushed a commit that referenced this pull request Jun 24, 2026
fix(opus): arch-guard fp8/bf8 packed-cvt builtins for RDNA3/3.5 (gfx1151)

opus/opus.hpp (fp32_to_fp8 / fp8_to_fp32, non-template) and aiter_opus_plus.h (fp32_to_{fp8,bf8}_scaled_x2) use __builtin_amdgcn_cvt_pk_fp8_f32 / v_cvt_pk_{fp8,bf8}_f32, which require the fp8-conversion-insts target feature. That feature is absent on RDNA3 (gfx1100) and RDNA3.5 (gfx115x). Because the non-template functions are compiled eagerly, clang rejects them with "instruction not supported on this GPU" / "needs target feature fp8-conversion-insts", so any HIP module that includes these headers (module_cache, module_activation, module_fused_qk_rmsnorm_group_quant, …) fails to JIT-build on gfx1151, even on BF16 paths that never call fp8 conversion at runtime.

Change
Guard the instruction bodies behind the arches that actually have the feature (gfx942/gfx950/gfx1200/gfx1201/gfx1250) and provide a compile-only fallback on others. No runtime behaviour change on supported arches; on RDNA3/3.5 the fp8 cast is never executed for BF16.

Testing
With this patch, module_cache / module_activation / norm modules JIT-build and run on gfx1151 (Radeon 8060S). aiter.reshape_and_cache validated.

Companion
ROCm/ATOM: gfx1151 Qwen3.x (GDN hybrid) BF16 bring-up, branch carhuang/support_gfx1151_qwen36 (ROCm/ATOM#1314).