[OPUS]: arch-guard fp8/bf8 packed-cvt builtins for RDNA3/3.5 (gfx1151) by carlushuang · Pull Request #3860 · ROCm/aiter

carlushuang · 2026-06-22T15:55:28Z

fix(opus): arch-guard fp8/bf8 packed-cvt builtins for RDNA3/3.5 (gfx1151)

opus/opus.hpp (fp32_to_fp8 / fp8_to_fp32, non-template) and aiter_opus_plus.h (fp32_to_{fp8,bf8}_scaled_x2) use __builtin_amdgcn_cvt_pk_fp8_f32 / v_cvt_pk_{fp8,bf8}_f32, which require the fp8-conversion-insts target feature. That feature is absent on RDNA3 (gfx1100) and RDNA3.5 (gfx115x). Because the non-template functions are compiled eagerly, clang rejects them with "instruction not supported on this GPU" / "needs target feature fp8-conversion-insts". As a result, any HIP module that includes these headers (module_cache, module_activation, module_fused_qk_rmsnorm_group_quant, …) fails to JIT-build on gfx1151, even on BF16 paths that never call fp8 conversion at runtime.

Change

Guard the instruction bodies behind the arches that actually have the feature (gfx942/gfx950/gfx1200/gfx1201/gfx1250) and provide a compile-only fallback on others. No runtime behaviour change on supported arches; on RDNA3/3.5 the fp8 cast is never executed for BF16.
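The guard described above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the macro name `SKETCH_HAS_FP8_CVT` and the function name `fp32_to_fp8_x2_sketch` are hypothetical, the fallback returning zero is an assumption (the PR only says the fallback is compile-only), and the sketch assumes clang's per-target `__gfx<N>__` macros are what identify the arch. The arch list itself is the one named in the description.

```cpp
#include <cstdint>

// Hypothetical feature macro. The arch list comes from the PR text; the
// sketch assumes clang defines __gfx<N>__ while compiling device code
// for that target.
#if defined(__gfx942__) || defined(__gfx950__) || defined(__gfx1200__) || \
    defined(__gfx1201__) || defined(__gfx1250__)
#define SKETCH_HAS_FP8_CVT 1
#else
#define SKETCH_HAS_FP8_CVT 0
#endif

// Hypothetical stand-in for a non-template packed-conversion helper.
// On arches with fp8-conversion-insts, it packs two floats into two fp8
// values in the low 16 bits of the result.
inline std::uint32_t fp32_to_fp8_x2_sketch(float a, float b)
{
#if SKETCH_HAS_FP8_CVT
    // Instruction path: only seen by the compiler on arches that have
    // the feature. Arguments: two sources, old value, word select.
    return static_cast<std::uint32_t>(
        __builtin_amdgcn_cvt_pk_fp8_f32(a, b, 0, false));
#else
    // Compile-only fallback (assumed shape): lets the header build on
    // arches without the feature. It is not a real conversion and is
    // not meant to be reached at runtime.
    (void)a;
    (void)b;
    return 0u;
#endif
}
```

Because the builtin sits behind the preprocessor guard, a target without the feature never sees it, so the header compiles even though the function is not a template. On such a target, a call like `fp32_to_fp8_x2_sketch(1.0f, 2.0f)` in this sketch simply returns 0, which is only acceptable because the BF16 paths never invoke the conversion.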

Testing

With this patch, module_cache / module_activation / norm modules JIT-build and run on gfx1151 (Radeon 8060S). aiter.reshape_and_cache validated.

Companion

ROCm/ATOM: gfx1151 Qwen3.x (GDN hybrid) BF16 bring-up, branch carhuang/support_gfx1151_qwen36 (ROCm/ATOM#1314).

…151)

opus.hpp fp32_to_fp8 / fp8_to_fp32 (non-template) and aiter_opus_plus.h fp32_to_{fp8,bf8}_scaled_x2 use __builtin_amdgcn_cvt_pk_fp8_f32 / v_cvt_pk_{fp8,bf8}_f32, which require the fp8-conversion-insts target feature absent on RDNA3 (gfx1100) and RDNA3.5 (gfx115x). clang rejects the non-template builtins eagerly, so any HIP module that includes these headers (module_cache, module_activation, module_norm fused-quant, ...) fails to JIT on gfx1151 even for BF16 paths that never call fp8 conversion.

Guard the instruction bodies behind the arches that have the feature (gfx942/gfx950/gfx1200/gfx1201/gfx1250) with a compile-only fallback on others. BF16 code paths never invoke fp8 cast at runtime on these arches.

Enables ROCm/ATOM Qwen3.x (Qwen3.5/3.6) BF16 bring-up on gfx1151 (Strix Halo / Radeon 8060S).

Companion: ROCm/ATOM PR (gfx1151 Qwen3.x bring-up).

github-actions · 2026-06-22T15:56:13Z

🏷️ CI Guide

Runs automatically on every PR:

✅ Pre-checks (submodule verification, code formatting)
✅ Aiter op tests (gfx942 + gfx950)
✅ Triton tests on MI35X (only when aiter/ops/triton/** or related paths are changed)

Extended tests (opt-in via labels):

| Label | Tests |
| --- | --- |
| `ci:triton-300x` | Run an additional Triton test job on MI300X in PRs; main branch always runs both MI35X and MI300X |
| `ci:sglang` | SGLang integration tests: DeepSeek-R1-MXFP4 accuracy, Qwen 3.5 accuracy |
| `ci:atom` | ATOM benchmark: DeepSeek-R1-0528, GPT-OSS-120B |
| `ci:atom_full` | ATOM accuracy suite for PR and main models from ATOM `models_accuracy.json` |
| `ci:vllm` | vLLM benchmark: GPT-OSS-120B, DeepSeek-R1-0528, Kimi-K2.5 |
| `ci:all` | All standard extended tests (excludes `ci:atom_full`) |

Only add `ci:atom_full` for FlyDSL or Triton upgrades.
Add labels via the sidebar or `gh pr edit 3860 --add-label <label>`.


carlushuang requested a review from a team June 22, 2026 15:55

carlushuang mentioned this pull request Jun 22, 2026

[gfx1151] Qwen3.5/3.6 (GDN hybrid) BF16 on RDNA3.5 via native Triton attention ROCm/ATOM#1314

Open

carlushuang changed the title ~~fix(opus): arch-guard fp8/bf8 packed-cvt builtins for RDNA3/3.5 (gfx1151)~~ [OPUS]: arch-guard fp8/bf8 packed-cvt builtins for RDNA3/3.5 (gfx1151) Jun 22, 2026

valarLip approved these changes Jun 23, 2026


valarLip merged commit 82c6b27 into main Jun 23, 2026
40 checks passed

valarLip deleted the carhuang/gfx1151_opus_fp8_guard branch June 23, 2026 08:31

carlushuang mentioned this pull request Jun 24, 2026

[gfx1151] Online INT8 W8A8 for Qwen3.6 27B / 35B-A3B on RDNA3.5, with working MTP ROCm/ATOM#1337

Open

valarLip merged 1 commit into main from carhuang/gfx1151_opus_fp8_guard
