[gfx1151] Online INT8 W8A8 for Qwen3.6 27B / 35B-A3B on RDNA3.5, with working MTP by carlushuang · Pull Request #1337 · ROCm/ATOM

carlushuang · 2026-06-24T06:45:59Z

[gfx1151] Online INT8 W8A8 for Qwen3.6 27B / 35B-A3B on RDNA3.5 (Strix Halo), with working MTP

Builds on the gfx1151 BF16 enablement (#1314) to add online INT8 W8A8 (Quark-style, no offline quant step) for the Qwen3.6 dense (27B) and MoE (35B-A3B) architectures, plus the fixes needed to make MTP speculative decoding work on the MoE draft. RDNA3.5 WMMA supports int8 natively (no FP8/FP4), so int8 is the right quantization target for this iGPU.

Precision split (chosen for quality): A8W8 (int8 weight + dynamic per-token int8 activation) for all dense GEMMs and the MoE experts; BF16 for the Gated-DeltaNet linear-attention (recurrent, quant-sensitive — int8 there produces garbage), the MoE router gate, lm_head, and embeddings; KV cache BF16. This keeps gsm8k at BF16-equivalent quality while halving weight bytes (the decode bottleneck on a bandwidth-bound iGPU).
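To make the A8W8 scheme concrete, here is a minimal pure-Python sketch of symmetric int8 quantization with one scale per row, followed by an integer-accumulate GEMM that rescales at the end. It is an illustration only: the function names are hypothetical, and the real work is done by the aiter kernels named under Changes, not by code like this.

```python
# Illustrative only: symmetric int8 quantization in plain Python.
# Function names are hypothetical; the real kernels live in aiter.

def quantize_rows_int8(matrix):
    """Quantize each row to int8 with its own scale.

    On activations ([tokens, in_features]) this is per-token quantization;
    on a weight stored as [out_features, in_features] it is per-channel.
    """
    q_rows, scales = [], []
    for row in matrix:
        amax = max(abs(v) for v in row)
        scale = amax / 127.0 if amax > 0 else 1.0  # avoid dividing by zero
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def gemm_a8w8(x, w):
    """Compute y = x @ w.T from int8 operands and floating-point scales."""
    xq, x_scales = quantize_rows_int8(x)  # dynamic: recomputed for every input
    wq, w_scales = quantize_rows_int8(w)  # a real path would do this once at load
    out = []
    for x_row, x_scale in zip(xq, x_scales):
        out_row = []
        for w_row, w_scale in zip(wq, w_scales):
            acc = sum(a * b for a, b in zip(x_row, w_row))  # exact integer dot product
            out_row.append(acc * x_scale * w_scale)         # dequantize the accumulator
        out.append(out_row)
    return out
```

The weights are stored in one byte per element instead of two, which is where the halved weight traffic comes from; the scales add only one float per row.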

What this enables

35B-A3B runs at all: there is no BF16 MoE kernel on gfx1151 (the asm fused-MoE is gfx9-only; ATOM's Triton MoE is MXFP4-weight-only). The int8 path uses aiter's moe_gemm_int8_smoothquant, which is the only int8-W8A8 grouped-GEMM that runs on RDNA3.5.
The 35B-A3B decodes ~6× faster than the BF16 27B baseline (only 3B active params per token, int8 weights, and MTP).

Changes

model_ops/linear.py — per_Token int8 branch routes to aiter Triton gemm_a8w8 on non-gfx9 (CK gemm_a8w8_CK is gfx9-only), wrapped as a torch custom op (torch.ops.aiter.atom_gemm_a8w8_triton) so it is HIP-graph / torch.compile safe. Online-quant allow-list += torch.int8.
model_ops/moe.py — new Int8MoEMethod (int8 w13/w2 + per-channel fp32 scales) and an int8 branch in FusedMoE._online_quant.
model_ops/fused_moe_triton.py — triton_kernel_int8_moe_forward: matmul-ogs routing → per-token int8 quant → moe_gemm_int8_smoothquant (gemm1 with fused gated-SiLU via interleaved w13 columns) → per-token int8 quant → gemm2 with scatter/combine.
models/qwen3_5_mtp.py + model_loader/loader.py — MTP-MoE drafter fix: the draft's fused expert weights (experts.gate_up_proj/down_proj) were silently dropped at load → 0% draft acceptance → MTP was pure overhead. Add the draft's fused-expert mapping (detect_fused_expert_format / get_fused_expert_mapping / load_fused_expert_weights), fix get_expert_mapping to use num_experts, and let the loader resolve load_fused_expert_weights_fn from the model. After the fix: acceptance 0 → 0.83.
entrypoints/openai/tool_parser.py — unique tool-call ids (call_<uuid> instead of a per-response call_0). Non-unique ids made agentic clients (qwen-code) dedupe every tool call after the first → endless tool-call loop. Extends feat(openai): Qwen3 (qwen3_coder/qwen3_xml) tool-call support #1319.
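The tool-call id fix can be illustrated with a short sketch. The description only says the ids become call_<uuid>; the use of uuid4().hex and both helper names below are assumptions for illustration, and dedupe_by_id merely imitates a client that keeps the first tool call it sees for each id.

```python
import uuid


def make_tool_call_id() -> str:
    """Return an id of the form call_<uuid> (uuid4().hex is an assumption)."""
    return f"call_{uuid.uuid4().hex}"


def dedupe_by_id(tool_calls):
    """Imitate a client that keeps only the first tool call per id."""
    seen, kept = set(), []
    for call in tool_calls:
        if call["id"] not in seen:
            seen.add(call["id"])
            kept.append(call)
    return kept


# Two responses that both restart their ids at call_0: the second call is lost.
old_style = [{"id": "call_0", "name": "read_file"},
             {"id": "call_0", "name": "write_file"}]

# Globally unique ids survive the same client-side dedupe.
new_style = [{"id": make_tool_call_id(), "name": "read_file"},
             {"id": make_tool_call_id(), "name": "write_file"}]
```

Here dedupe_by_id(old_style) keeps one call while dedupe_by_id(new_style) keeps both, which matches the loop symptom described for qwen-code.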

Quality (gsm8k, 5-shot-equivalent, chat + thinking, greedy)

35B-A3B INT8 W8A8 = 0.84 — BF16-equivalent (int8 is faithful). MTP is lossless (accepts a draft token only when it matches the target's greedy argmax), so the MTP build has identical quality.
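The lossless claim follows from the acceptance rule, sketched below for greedy decoding. verify_greedy is a hypothetical helper, not ATOM's rejection sampler; it assumes the target model has already produced its greedy choice at every drafted position plus one more.

```python
def verify_greedy(draft_tokens, target_argmax):
    """Return the tokens emitted for one speculative step.

    target_argmax[i] is the target model's greedy token at position i,
    computed with the draft tokens before i as context, so it must hold
    len(draft_tokens) + 1 entries.
    """
    emitted = []
    for i, draft in enumerate(draft_tokens):
        if draft != target_argmax[i]:
            break                      # first mismatch: stop accepting drafts
        emitted.append(draft)
    # The target's own token at the first unverified position is always kept:
    # a correction after a mismatch, or a bonus token if every draft matched.
    emitted.append(target_argmax[len(emitted)])
    return emitted
```

Every emitted token equals the target's greedy choice, so the output sequence is the same as without MTP. If 0.83 is the per-token acceptance rate, a step with one draft token would emit about 1.83 tokens on average, before accounting for the drafter's own cost.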

Performance (gfx1151 / Radeon 8060S, bs=1)

Decode (single-stream, short context):

Model	Config	Decode tok/s
27B dense	INT8 W8A8	6.0
27B dense	INT8 W8A8 + MTP-1	9.4
35B-A3B	INT8 W8A8 + HIP graph	24.8
35B-A3B	INT8 W8A8 + MTP-1 + HIP graph	~35

Long-context (35B-A3B INT8 W8A8 + MTP-1, bs=1):

Context	Prefill TTFT	Prefill tok/s	Decode (output) tok/s	Total tok/s
64K (60,016 tok)	85.4 s	703	23.4	661
128K (119,071 tok)	191.8 s	621	17.3	598

Decode tok/s falls with context (each step reads the growing KV); prefill is compute-bound (one-time prompt-ingestion cost). The hybrid model's KV is cheap (only the interleaved full-attn layers cache KV), so 128K fits easily — at gpu-memory-utilization 0.9 the KV pool holds ~2.1M tokens; the limit is --max-model-len, not memory.
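As a sanity check on the long-context table, the prefill throughput column is consistent with prompt tokens divided by TTFT. That this is how the column was derived is an assumption; the numbers below are copied from the table.

```python
def prefill_tok_per_s(prompt_tokens: int, ttft_s: float) -> float:
    """Prompt tokens ingested per second of time-to-first-token."""
    return prompt_tokens / ttft_s


# (label, prompt tokens, TTFT in seconds) taken from the table above.
LONG_CONTEXT_ROWS = [
    ("64K", 60_016, 85.4),
    ("128K", 119_071, 191.8),
]

for label, tokens, ttft in LONG_CONTEXT_ROWS:
    print(label, round(prefill_tok_per_s(tokens, ttft)))
```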

Serve

ATOM_USE_UNIFIED_ATTN=1 \
python -m atom.entrypoints.openai_server --model Qwen/Qwen3.6-35B-A3B \
  --trust-remote-code -tp 1 --kv_cache_dtype bf16 --block-size 64 \
  --max-model-len 131072 --max-num-seqs 2 --gpu-memory-utilization 0.9 \
  --method mtp --num-speculative-tokens 1 \
  --online_quant_config '{"global_quant_config":"ptpc_i8","exclude_layer":["*linear_attn*","*lm_head*","*shared_head*","*embed_tokens*","*mlp.gate"]}'

(Drop --method mtp ... for 35B if you don't want MTP; for the dense 27B MTP is a ~1.6× lossless win.)
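The exclude_layer patterns in the command above can be read as shell-style globs. The sketch below assumes fnmatch-like matching, which the description does not confirm, and the layer names are hypothetical examples rather than names read from the checkpoint.

```python
import json
from fnmatch import fnmatchcase

# The JSON passed to --online_quant_config in the serve command.
ONLINE_QUANT_CONFIG = json.loads(
    '{"global_quant_config":"ptpc_i8","exclude_layer":["*linear_attn*",'
    '"*lm_head*","*shared_head*","*embed_tokens*","*mlp.gate"]}'
)


def is_excluded(layer_name: str, patterns) -> bool:
    """True if the layer would stay in BF16 under glob-style matching (assumed)."""
    return any(fnmatchcase(layer_name, pattern) for pattern in patterns)


patterns = ONLINE_QUANT_CONFIG["exclude_layer"]

# Hypothetical layer names, for illustration only.
examples = [
    "model.layers.0.linear_attn.in_proj",      # GDN linear attention
    "model.layers.3.mlp.gate",                 # MoE router gate
    "model.layers.3.mlp.experts.0.gate_proj",  # expert projection
    "model.layers.3.self_attn.q_proj",         # dense GEMM
    "lm_head",
]
excluded = {name: is_excluded(name, patterns) for name in examples}
```

Under this reading the last pattern has no trailing wildcard, so it matches the router gate but not an expert's gate_proj, which is what the precision split calls for.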

Dependency

[OPUS]: arch-guard fp8/bf8 packed-cvt builtins for RDNA3/3.5 (gfx1151) aiter#3860 (carhuang/gfx1151_opus_fp8_guard) — arch-guard the gfx9-only fp8/bf8-cvt builtins. Required (shared with [gfx1151] Qwen3.5/3.6 (GDN hybrid) BF16 on RDNA3.5 via native Triton attention #1314). The int8 GEMM/MoE kernels (gemm_a8w8 Triton, moe_gemm_int8_smoothquant, per_token_quant_hip) are already upstream in aiter; no new aiter code is needed for the int8 path.
[gfx1151] Qwen3.5/3.6 (GDN hybrid) BF16 on RDNA3.5 via native Triton attention #1314 (carhuang/support_gfx1151_qwen36) — the gfx1151 BF16 base enablement (arch gate, native Triton attention, GDN block_tables). Prerequisite.
feat(openai): Qwen3 (qwen3_coder/qwen3_xml) tool-call support #1319 (carhuang/qwen3_xml_tool_parser) — qwen3_xml tool-call parsing; the unique-tool-call-id fix here extends it.


* feat(minimax_m3): add MXFP4 native support Introduce the minimal MiniMax-M3 MXFP4 native ATOM path without BF16, MXFP8, EAGLE, or unified-attention support. * fix(minimax_m3): align FP4 Triton paths with BF16 branch Keep the split MXFP4 PR aligned with the BF16 branch for shared Triton kernel paths while removing the extra package marker file. Co-authored-by: Cursor <cursoragent@cursor.com> * feat(minimax_m3): add MXFP4 native support Introduce the minimal MiniMax-M3 MXFP4 native ATOM path without BF16, MXFP8, EAGLE, or unified-attention support. * fix(attn): use Triton for unsupported GQA decode Route the block-size 128 GQA decode shape used by MiniMax-M3 away from generic PA ASM, which has no matching AITER kernel in the validation image. * chore(minimax_m3): trim FP4 split cleanup Remove the extra MiniMax-M3 module docstring note and keep Triton attention selection controlled by the existing environment flag. * docs(minimax_m3): force Triton attention in MXFP4 recipe Document the required ATOM_FORCE_ATTN_TRITON flag for the MXFP4 TP4 launch path. * chore(minimax_m3): fix Black formatting and trim comments Remove the extra blank lines flagged by Black and keep MiniMax-M3 sparse attention comments focused on ATOM's FP4 path. Co-authored-by: Cursor <cursoragent@cursor.com> * chore(minimax_m3): simplify text config normalization Copy MiniMax-M3 text config attributes generically so the FP4 path keeps required root config fields without maintaining a long field allowlist. Co-authored-by: Cursor <cursoragent@cursor.com> * chore(attn): generalize sparse block-size handling Read sparse attention block-size requirements from the HF sparse attention config instead of hard-coding the MiniMax-M3 sparse attention constant in the shared AITER metadata builder. 
Co-authored-by: Cursor <cursoragent@cursor.com> * chore(attn): use generic sparse metadata naming Keep MiniMax-M3 sparse metadata construction local to the sparse attention path while exposing it through generic attention metadata fields in the shared AITER builder. Co-authored-by: Cursor <cursoragent@cursor.com> * chore(attn): generalize indexed sparse marker Use a model-agnostic marker for indexed sparse attention modules so the shared AITER cache binding path no longer checks a MiniMax-M3-specific attribute name. Co-authored-by: Cursor <cursoragent@cursor.com> * chore(attn): generalize sparse cache names Use generic indexed sparse cache and metadata helper names in the shared AITER attention path while keeping the MiniMax-M3 sparse implementation module unchanged. Co-authored-by: Cursor <cursoragent@cursor.com> * refact attention code * keep ATOM_USE_UNIFIED_ATTN path --------- Co-authored-by: xytpai <xytpai@foxmail.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: XiaobingSuper <xiaobingzhangupc@gmail.com>


* Move gpt-oss and kimi2.5 CI from mi355 to mi350 * Move deepseek-v4-flash and qwen3.5 vllm CI from mi355 to mi350 * Add runner user info * Add clean up containers function for runner atom-mi35x-8gpu-oot-acc * CI: tolerate missing Docker config on OOT runners --------- Co-authored-by: Xin Huang <Xin.Huang@amd.com>

… on MoE

RDNA3.5 WMMA supports int8 natively (no FP8/FP4), so int8 is the quantization target for this iGPU. A8W8 (int8 weight + dynamic per-token int8 activation) for all dense GEMMs and MoE experts; BF16 for the GDN linear-attn (recurrent, quant-sensitive), router gate, lm_head, embeddings; KV cache BF16.

- model_ops/linear.py: per_Token int8 branch -> aiter Triton gemm_a8w8 on non-gfx9 (CK gemm_a8w8_CK is gfx9-only), wrapped as a torch custom op so it is HIP-graph / torch.compile safe. Online-quant allow-list += torch.int8.
- model_ops/moe.py + model_ops/fused_moe_triton.py: Int8MoEMethod + int8 branch in FusedMoE._online_quant, and triton_kernel_int8_moe_forward using aiter moe_gemm_int8_smoothquant (gemm1 with fused gated-SiLU via interleaved w13). Enables 35B-A3B, which has no BF16 MoE kernel on gfx1151.
- models/qwen3_5_mtp.py + model_loader/loader.py: fix the MTP-MoE drafter so the draft's fused expert weights load (add detect_fused_expert_format / get_fused_expert_mapping / load_fused_expert_weights; get_expert_mapping uses num_experts; loader resolves load_fused_expert_weights_fn from the model). Draft acceptance 0 -> 0.83; MTP now a net win on the MoE model.
- model_ops/topK.py: keep the shared expert as a separate MLP on non-gfx9 so the routed MoE uses the portable Triton path.


* add m3 mxfp8 support * add mxfp8 recipe * wip Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com> * revert dequant fp8 back to bf16 for linear layers * update m3 recipe * format * remove hard code dtype --------- Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com> Co-authored-by: XiaobingSuper <xiaobingzhangupc@gmail.com> Co-authored-by: Haoyang Li <lihaoyang0109@gmail.com> Co-authored-by: ganyi <ygan@amd.com> Co-authored-by: Guanbao Yu <gyu@amd.com> Co-authored-by: wuhuikx <hattie.wu@amd.com>


* docs: revise M3 fp8/gluon port plan for first-class framework compat Replace the env-gated bolt-on approach with one driven by main's existing attention-framework contracts: fp8 selected by config.kv_cache_dtype, scales returned via KVCacheTensor, binding through build_kv_cache_tensor/ bind_kv_cache, insert via the quantized hook, metadata via make_sparse_* factories, frozen custom-op signature, CUDAGraph-safe scratch, byte accounting. Adds a 9-point contract checklist mapped to each task. Co-Authored-By: Claude <noreply@anthropic.com> * feat(attn): SparseMHAPagedAttentionImpl skeleton + Attention impl_cls override Task 0 of the MiniMax-M3 fp8 KV cache + gluon PA port. Adds the subclass scaffold (SparseMHAPagedAttentionImpl extends PagedAttentionImpl, overriding only rope_cache + dispatch_backend via delegation for now) and an optional impl_cls kwarg on Attention.__init__ so a model can plug in a specialized impl while reusing the backend's metadata builder. Indexer state lives on the impl. Co-Authored-By: Claude <noreply@anthropic.com> * feat(minimax_m3): page-16 constants + fused SHUFFLE KV-insert kernel Task 1 of the M3 fp8/gluon port. Adds ASM_PAGE_SIZE=16 / PAGES_PER_SPARSE_BLOCK=8 and grafts the Triton fused Gemma-RMSNorm + partial-NeoX-RoPE + page-16 SHUFFLE KV-insert kernel (+ host wrapper) from origin/ganyi/shuffle_kv_cache_fp8_eagle. GPU round-trip test validates q_out/index_q_out vs PyTorch ref and K/V/index cache scatter at each token slot. Co-Authored-By: Claude <noreply@anthropic.com> * feat(minimax_m3): page-16 sparse block-table builders + fused topk EMIT_SPARSE_BT Task 2 of the M3 fp8/gluon port. 
Grafts the decode + prefill page-16 sparse block-table builders into sparse_attn.py (each selected logical 128-block expands to 8 contiguous physical 16-pages, partial tail packed last, exact context_lens), and replaces index_topk.py wholesale with the source-branch version that adds the fused EMIT_SPARSE_BT block-table emission and MAX_Q spec-decode causal support (both opt-in via defaulted kwargs, so existing decode callers are unaffected). Tests: x8 expansion + tail-last packing + ctx lengths for the standalone builder; fused EMIT path matches the standalone builder bit-for-bit (num_kv_heads==1). Co-Authored-By: Claude <noreply@anthropic.com> * feat(minimax_m3): gluon PA decode + prefill runners over page-16 SHUFFLE cache Task 3 of the M3 fp8/gluon port. Grafts minimax_m3_sparse_attn_decode_asm, minimax_m3_sparse_attn_prefill_asm, and the shared _run_prefill_fp8_gluon helper from the source branch: index top-k -> page-16 sparse block table -> AITER gluon split-KV paged-attention (run_pa_decode_gluon), with fp8 vs bf16 compute_type and per-token scales selected by the KV cache dtype. Adds `import aiter` (used for aiter.dtypes.fp8). Parity test (gluon vs Triton split-K decode reference) for gqa 8/16; validated further by the existing asm/fp8/prefill oracle tests. Co-Authored-By: Claude <noreply@anthropic.com> * feat(attn): implement SparseMHAPagedAttentionImpl.rope_cache override Task 4 of the M3 fp8/gluon port. The override runs MiniMax-M3's fused qk-norm + partial-NeoX-RoPE + page-16 SHUFFLE KV insert + indexer-key insert via aiter.fused_qknorm_idxrqknorm (consuming the packed qkv), reading the SHUFFLE K/V + scale + index caches off the bound layer. It returns the parent's 7-tuple (query rotated) and stashes the rotated indexer query on self._index_q for dispatch_backend. fp8 vs bf16 selected by kv_cache_dtype; fp8 writes per-token dequant scales into k_scale/v_scale. Adds the _minimax_m3_cos_sin_cache helper. 
Test (bf16 + fp8): override returns the 7-tuple, populates _index_q with correct shape, and mutates the KV/index caches (+ fp8 scales). Co-Authored-By: Claude <noreply@anthropic.com> * feat(attn): implement SparseMHAPagedAttentionImpl.dispatch_backend override Task 5 of the M3 fp8/gluon port. dispatch_backend returns the M3 sparse prefill/decode backend callable (parent contract fn(q,k,v,k_cache,v_cache,k_scale,v_scale,fwd_ctx)). Both paths select per-token top-k index blocks with the fused page-16 sparse block-table emit, then run the gluon split-KV paged-attention over the SHUFFLE cache; fp8 vs bf16 follows the cache dtype inside the runners. Prefill uses the sync-free on-device metadata fallback (query_req_id/abs_pos/qo_indptr=None). Consumes self._index_q from rope_cache and clears it afterward. Note: index_cache is page-128 3D [num_logical, 128, idx_head_dim], indexed by the logical block_table in index-topk (distinct from the page-16 SHUFFLE KV cache). Test (bf16+fp8): dispatch returns the decode callable; running it yields finite [tokens, nh, hd] output and clears _index_q. Co-Authored-By: Claude <noreply@anthropic.com> * first version of refactor Signed-off-by: ganyi <ygan@amd.com> * remove unnecessary files Signed-off-by: ganyi <ygan@amd.com> * runable and can response resonable output Signed-off-by: ganyi <ygan@amd.com> * acc right Signed-off-by: ganyi <ygan@amd.com> * reuse mha's allocation for main cache, view at use time Signed-off-by: ganyi <ygan@amd.com> * remove prepare mtp metadata Signed-off-by: ganyi <ygan@amd.com> * format Signed-off-by: ganyi <ygan@amd.com> * format Signed-off-by: ganyi <ygan@amd.com> * resolve comments Signed-off-by: ganyi <ygan@amd.com> --------- Signed-off-by: ganyi <ygan@amd.com> Co-authored-by: Claude <noreply@anthropic.com>


…bort (#1322) (#1339) During CUDAGraph capture, MiniMax-M3's autotuned _topk_index_partial_kernel discards candidate CompiledKernels. A gen-0 GC firing inside the stream-capture region runs CompiledKernel.__del__ -> hipModuleUnload, which HIP forbids while a stream is capturing (HIP 900), corrupting the capture and aborting the custom_all_reduce IPC handshake (SIGABRT). gc.freeze() did not help because the discarded kernels are created mid-loop. Disable GC for the whole capture window and restore via try/finally.
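The "disable GC for the whole capture window and restore via try/finally" pattern from this commit message can be sketched with the standard gc module. The context-manager name is hypothetical and the capture call is left as a comment, since the actual ATOM call site is not shown here.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_disabled():
    """Suspend automatic garbage collection and restore the prior state."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        # Only re-enable if GC was on before, so nested use stays correct.
        if was_enabled:
            gc.enable()


# Usage sketch: no automatic collection can fire inside this block, so no
# finalizer runs while a stream is being captured.
with gc_disabled():
    pass  # graph capture would go here (call site not shown in the source)
```

Unlike gc.freeze(), which only exempts objects that already exist, this also covers objects created and discarded during the capture loop.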

* feat: RTPLLM plugin GLM5 integration * feat: RTPLLM GLM5 enable cuda graph * fix: RTP glm5 qwen35 cuda graph conflict * fix: RTP crash when long input_len > 16384 * fix:[RTP] making GLM5 run true Sparse MLA * refactor: RTP glm5 code * feat: RTP glm5 optimize sparse decode path * refactor: RTP remove redundant envs * refactor: [RTP] unify GLM5 MLA on sparse path, drop dead dense backend * fix: RTP GLM5 prefil reuse Sparse MLA metadata * fix: RTP GLM5 enable FP8 MLA path * feat: RTP GLM5 conflict issue after rebase * fix: RTP plugin imports conflict after rebase main * refactor: RTP GLM5 tests merge * refactor: cleanup GLM5 RTP sparse MLA backend * refactor: RTP remove redundant labels * refactor: RTP GLM5 remove redundant code * refactor: RTP GLM5 remove mla redundant code * fix: RTP Qwen35 use prewarmed req id buffer for RTP CUDA graphs * fix: RTP remove redundant qwen35 code

* feat(minimax-m3): split index cache projection Route MiniMax-M3 index Q/K through a separate projection and thread it through the attention stack so cached top-k layers can skip indexer work while preserving the non-cache path. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(minimax-m3): keep indexer qk packed Keep MiniMax-M3 index Q/K in the packed QKV projection so index-cache support only skips top-k work and does not require a separate aiter input ABI. Co-authored-by: Cursor <cursoragent@cursor.com> * chore(minimax-m3): drop leftover formatting noise Remove residual formatting-only changes from the packed index-cache refactor so the branch only carries functional sparse-attention updates. Co-authored-by: Cursor <cursoragent@cursor.com> * code format * chore(minimax-m3): remove index cache debug logging Drop temporary hit/miss logging and counters from the MiniMax-M3 top-k cache path now that the packed index-cache flow is settled. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>

On gfx1250 with ATOM_USE_UNIFIED_ATTN, a prefix-cache hit during prefill fell back to the Triton unified_attention path instead of the sink ASM varlen kernel, because _can_attempt_prefill_sink_asm bailed on has_cached and on max_seqlen_q != max_seqlen_k.

The gfx1250 sink varlen ASM kernel (fmha_fwd_with_sink_varlen_asm) actually handles bottom-right causal for sq < sk (chunked-prefill), and cu_seqlens_q/cu_seqlens_k already carry the per-request new-token vs cached+new lengths. Verified on gfx1250 against a bottom-right causal + per-head sink reference (single/multi-batch, GQA, sq=1) within bf16 tolerance, and end-to-end on gpt-oss-120b (full-attention layers take the ASM path on a cache hit; the forced-Triton path never gathers).

Changes:
- _can_attempt_prefill_sink_asm: drop the has_cached and max_seqlen_q == max_seqlen_k gates.
- prefill_attention: gather the cached+new KV into a dense packed tensor here, where the ASM varlen kernel consumes it. Each prefill backend now prepares its own KV: the ASM path gathers; the Triton path reads the paged cache directly via block_table and never gathers.
- rope_cache: no longer gathers, so dispatch_backend sees q/k with matching token counts (sq == sk) and _can_use_prefill_sink_asm's shape check stays valid.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
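For readers unfamiliar with the term, bottom-right causal masking for sq < sk can be written as a small reference function. This is a generic definition of the mask, not the ASM kernel's implementation, and the function name is hypothetical.

```python
def bottom_right_causal_mask(sq: int, sk: int):
    """Return mask[i][j], True where query i may attend to key j.

    The sq queries are treated as the last sq positions of the sk keys,
    so the causal diagonal is aligned to the bottom-right corner.
    """
    if sq > sk:
        raise ValueError("expected sq <= sk")
    offset = sk - sq  # number of cached keys that precede the first new query
    return [[j <= i + offset for j in range(sk)] for i in range(sq)]


# Two new tokens on top of two cached tokens (sq=2, sk=4).
mask = bottom_right_causal_mask(2, 4)
```

With sq == sk the offset is zero and the mask reduces to the usual lower-triangular causal mask; with sq=1 the single query sees every key.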

* [m3 eagle] migrate draft-side EAGLE3 optimizations (Phase 1) Bring the model-agnostic / draft-side MiniMax-M3 EAGLE3 work from wuhuikx/atom-m3-bf16-to-main (2f1c385). These files' pre-eagle base is byte-identical to current main, so they port as-is: - eagle3_llama.py / eagle3_deepseek_mla.py: draft fusions (fused dual-RMSNorm +concat, fused group-RMSNorm aux, AR+RMSNorm fusion), compute_draft_token, replicated-embed option. - fused_aux_rmsnorm.py (new): the fused RMSNorm kernels for the draft. - lm_head_argmax.py (new) + embed_head.py: distributed greedy argmax (all-gather [N,2] per-rank maxima instead of full [N,vocab] logits). - spec_decode/eagle.py: draft loop with distributed-argmax fast path, no-pre-concat aux, and Eagle3 MHA draft KV-cache transfer for PD disaggregation (from #1331). - envs.py: ATOM_EAGLE_REPLICATE_EMBED. - tests/test_lm_head_argmax.py (new, importorskip(aiter) for the no-aiter CI). Target-side enablement (aux-hidden capture in minimax_m3, q>1 spec-verify metadata, prepare_mtp_decode) follows in Phase 2; note eagle.py now references attn_metadata_builder.prepare_mtp_decode which Phase 2 adds. Mocked suite: 437 passed / 38 pre-existing failures / +1 new skip — no regression. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * [m3 eagle] target-side enablement on main's M3 API (Phase 2) Enable MiniMax-M3 EAGLE3 on current main's (Triton-sparse) M3 base, adapting the target side to main's API instead of wuhuikx's asm/gluon infra (absent on main). aiter_attention.py: - Add the generic block-paged MHA Eagle3 draft metadata: _mtp_prepare_decode_ metadata_kernel + prepare_mtp_decode + fuse_mtp_decode_position_update (used by the migrated eagle.py for both Kimi and M3 drafts; not M3-sparse coupled). 
- Replace the two "speculative decode not supported" NotImplementedError sites: route q>1 spec-verify through the sparse PREFILL path (make_sparse_prefill_ metadata; per-query causal via cu_seqlens_q, which is now filled uniformly for q>1). prefix_lens is bound to a new persistent sparse_prefix_lens buffer so the CUDAGraph-captured sparse indexer reads live causal lengths on each replay. minimax_m3.py: Eagle3 aux-hidden-state capture (Dynamo-safe, mirrors deepseek_v2): aux_hidden_state_layers, in-layer residual.clone() after the fused-allreduce norm, model forward returns (hidden, aux) tuple, set/get_eagle3_aux_hidden_state_layers on the ForCausalLM + VL-wrapper delegation. model_runner.py: extend KV transfer regions with the Eagle3 draft pool for PD disaggregation (#1331). scheduler.py: trim emitted spec tokens past the stop position (rejection sampler emits past EOS) so flexible-extract doesn't pick up leaked trailing tokens. recipes/MiniMax-M3.md: full EAGLE3 section (with a note that the ASM-PA/fp8/MXFP8 specifics reflect the fully-optimized variant, not this Triton-sparse base). Drop tests/test_lm_head_argmax.py (per request). Note: the q>1 sparse-verify path is new on main and CUDAGraph-sensitive — needs GPU validation (GSM8K + accept on TP4/TP8; confirm Kimi eagle unaffected). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * update recipe make lint happy Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * remove fp8 attn related command Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * [m3 eagle] recipe: set ATOM_FORCE_ATTN_TRITON=1 in EAGLE launch main's MiniMaxM3Attention (dense layers) does not set force_triton_attn in code and attention_mha has no block-128 guard, so on this base the dense attention is routed to Triton only via ATOM_FORCE_ATTN_TRITON=1 (the MXFP4 base section already sets it). 
The EAGLE section migrated from wuhuikx omitted it (wuhuikx set force_triton_attn=True in code instead), so the spec-verify dense attention (q=num_spec+1) fell into paged_attention_asm and aborted in get_heuristic_kernel (no bf16 block-128 ASM-PA kernel). Add the env to the EAGLE launch and drop the stale MXFP8 model_path line. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * remove ATOM_FORCE_ATTN_TRITON Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * update recipe Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * update recipe Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * update the recipe with the perf Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * refine the comment Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> --------- Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
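The distributed greedy argmax mentioned in the Phase 1 commit message (all-gather per-rank maxima instead of full logits) can be simulated on one process as follows. This is a sketch for a single token under assumed contiguous vocab sharding; the function names and the tie-break rule are mine, not taken from lm_head_argmax.py.

```python
def local_max(shard, vocab_offset):
    """Return (max value, global token id) for one rank's vocab shard."""
    best = max(range(len(shard)), key=shard.__getitem__)  # first max on ties
    return shard[best], vocab_offset + best


def distributed_argmax(shards):
    """Greedy token id from per-rank logit shards given in vocab order."""
    pairs, offset = [], 0
    for shard in shards:
        # Each rank would contribute just this (value, index) pair to the
        # all-gather, rather than its whole slice of the logits.
        pairs.append(local_max(shard, offset))
        offset += len(shard)
    # Highest value wins; ties go to the lowest global index (assumed rule).
    _, index = max(pairs, key=lambda p: (p[0], -p[1]))
    return index
```

The communicated payload shrinks from one value per vocabulary entry to two values per rank for each token, which is the saving the commit message describes.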

* [atom CI/Nightly/Benchmark] Add MiniMax-M3 and Eagle into atom infra Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * remove minimax m2.7 case Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> --------- Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>


…ing the target top-k (#1362) `_should_skip_index_topk` force-skips the DSA indexer top-k for the MTP layer (layer_id >= num_hidden_layers) whenever `index_share_for_mtp_iteration` is set, making the MTP block reuse the *target* model's top-k. But the MTP block ships its OWN indexer weights (indexer.wk / wq_b / weights_proj / k_norm at layer num_hidden_layers in the checkpoint) and is meant to compute its own top-k for the drafted position. Reusing the target's top-k feeds the draft a wrong attention context at all sequence lengths. This is non-standard: neither vLLM upstream (deepseek_mtp.py allocates a dedicated topk_indices_buffer + Indexer for the MTP block) nor the ATOM sglang plugin reuses the target index; both compute the MTP top-k independently. `index_share_for_mtp_iteration` should at most share across multiple MTP draft steps (num_speculative_tokens > 1), never reuse the target model's index. Fix: drop the MTP special-case so the MTP layer computes its own top-k with its own (loaded) indexer weights, matching vLLM upstream and sglang. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

gyohuangxin and others added 13 commits June 22, 2026 23:55

CI: start Docker release at 21:48 Beijing time (#1313)

345d6a5

[ATOM SGL] Add dsv4 ci (#1224)

cc80cd1

* [ATOM SGL]Add dsv4 ci Co-authored-by: Cursor <cursoragent@cursor.com>

CI: use host network for ATOM test container (#1315)

73f168a

Set sink to fp32 for ps decode asm (#1309)

9c751e1

* fix * use func

fix(sglang): skip sparse MLA fast metadata for unsupported heads (#1252)

f05f3ab

[atom-sgl-benchmark] Add model cache mount for MI308 sglang benchmark (…

f126a50

…#1328) * Add model cache mount for MI308 sglang benchmark

[atom-sgl-accuracy] Modify sglang accuracy runner for mi355 (#1329)

9ca76d6

[atom-vllm benchmark] Add host network to start container (#1325)

b45e3c6

mlatest (#1301)

ef44603

* mla * fix * fix --------- Co-authored-by: HaonanWang98 <hwang@amd.com> Co-authored-by: feifei14119 <carlus.huang@amd.com>

Support PD disaggregation on Single node (#1308)

4577dcc

carlushuang mentioned this pull request Jun 24, 2026

[gfx1151] Qwen3.5/3.6 (GDN hybrid) BF16 on RDNA3.5 via native Triton attention #1314

Open

k50112113 and others added 16 commits June 24, 2026 18:24

[Triton] DSV4 replace einsum with Triton BMM (#1270)

c4ae045

* replace einsum with bmm * fix

[fix](qwen): fix qwen3.5 accuracy (#1321)

908cdaf

* [fix](qwen): fix qwen3.5 accuracy * [fix](attn): delete extra code * [fix](attn): add kv cache to mutate args * [fix](qwen): remove quick allreduce in qwen3.5 --------- Co-authored-by: perzhang <perzhang@amd.com>

Add NUMA-aware CPU/memory binding for PD Single Node optimization (#1340

feb5ce5

) * Add NUMA-aware CPU/memory binding * Add glm-5-2-fp8 benchmark dispatch checkbox

[atom-sgl-accuracy] Modify atom-sgl-accuracy workflow to adapt it for…

ab9eb78

… AAC machine (#1346) * Modify atom-sgl-accuracy workflow to adapt it for AAC machine

fix prefill swa write (#1343)

fc4d766

Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>

Make profile stop timeout configurable with ATOM_PROFILER_TIMEOUT (#1332

083551e

)

fix AttributeError: 'AttentionMetaData' object has no attribute 'bloc…

b9cff14

…k_size' (#1348) Co-authored-by: junxiaguo <JunXia.Guo@amd.com>

[atom-sgl-accuracy] Modify model cache mount (#1352)

c451c40

* Modify model cache mount

zufayu requested review from ZhangLirong-amd and removed request for ZhangLirong-amd June 26, 2026 06:12

zufayu removed the request for review from ZhangLirong-amd June 26, 2026 08:24

Merge remote-tracking branch 'origin/main' into gfx1151_int8_qwen36

60f24dd

zufayu requested a review from yhl-amd June 26, 2026 13:58

carlushuang wants to merge 31 commits into carhuang/support_gfx1151_qwen36 from carhuang/gfx1151_int8_qwen36
