[gfx1151] Online INT8 W8A8 for Qwen3.6 27B / 35B-A3B on RDNA3.5, with working MTP #1337

Open
carlushuang wants to merge 31 commits into carhuang/support_gfx1151_qwen36 from carhuang/gfx1151_int8_qwen36

@carlushuang

Collaborator

[gfx1151] Online INT8 W8A8 for Qwen3.6 27B / 35B-A3B on RDNA3.5 (Strix Halo), with working MTP

Builds on the gfx1151 BF16 enablement (#1314) to add online INT8 W8A8 (Quark-style, no offline quant step) for the Qwen3.6 dense (27B) and MoE (35B-A3B) architectures, plus the fixes needed to make MTP speculative decoding work on the MoE draft. RDNA3.5 WMMA supports int8 natively (no FP8/FP4), so int8 is the right quantization target for this iGPU.

Precision split (chosen for quality): A8W8 (int8 weight + dynamic per-token int8 activation) for all dense GEMMs and the MoE experts; BF16 for the Gated-DeltaNet linear-attention (recurrent, quant-sensitive — int8 there produces garbage), the MoE router gate, lm_head, and embeddings; KV cache BF16. This keeps gsm8k at BF16-equivalent quality while halving weight bytes (the decode bottleneck on a bandwidth-bound iGPU).
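
To make the A8W8 arithmetic concrete, here is a minimal pure-Python sketch: weights get one scale per output channel, activations get one scale per token computed at call time, the dot product accumulates in integers, and the two scales are applied afterwards. It assumes symmetric abs-max scaling, which this description does not spell out, and `quantize_rows_int8` / `w8a8_linear` are illustrative names, not ATOM or aiter functions.

```python
def quantize_rows_int8(rows):
    """Symmetric int8 quantization with one scale per row (assumed abs-max).

    Returns (q_rows, scales) such that rows[i][j] ~= q_rows[i][j] * scales[i].
    """
    q_rows, scales = [], []
    for row in rows:
        amax = max(abs(v) for v in row)
        scale = amax / 127.0 if amax > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def w8a8_linear(x, w):
    """Compute y = x @ w.T with int8 operands and a floating-point rescale.

    x: [tokens][in_features] activations, quantized per token on every call.
    w: [out_features][in_features] weights, quantized per output channel
       (a real implementation would do this once at load time).
    """
    xq, x_scale = quantize_rows_int8(x)
    wq, w_scale = quantize_rows_int8(w)
    out = []
    for t, x_row in enumerate(xq):
        out_row = []
        for c, w_row in enumerate(wq):
            acc = sum(a * b for a, b in zip(x_row, w_row))  # integer accumulate
            out_row.append(acc * x_scale[t] * w_scale[c])   # dequantize
        out.append(out_row)
    return out


x = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
w = [[0.2, 0.4, -0.6], [1.0, -1.0, 0.5]]
y_int8 = w8a8_linear(x, w)
y_ref = [[sum(a * b for a, b in zip(xr, wr)) for wr in w] for xr in x]
max_err = max(abs(a - b) for ra, rb in zip(y_int8, y_ref) for a, b in zip(ra, rb))
print(f"max abs error vs float reference: {max_err:.4f}")
```

The per-token scale is what makes the activation side "dynamic": it depends on the input and cannot be precomputed, whereas the per-channel weight scales are fixed once the weights are quantized.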

What this enables

  • 35B-A3B runs at all: there is no BF16 MoE kernel on gfx1151 (the asm fused-MoE is gfx9-only; ATOM's Triton MoE is MXFP4-weight-only). The int8 path uses aiter's moe_gemm_int8_smoothquant, which is the only int8-W8A8 grouped-GEMM that runs on RDNA3.5.
  • ~6× faster than the BF16 27B baseline for the 35B-A3B (3B active params × int8 + MTP).

Changes

  • model_ops/linear.py: the per_Token int8 branch routes to the aiter Triton gemm_a8w8 on non-gfx9 (CK gemm_a8w8_CK is gfx9-only), wrapped as a torch custom op (torch.ops.aiter.atom_gemm_a8w8_triton) so it is HIP-graph / torch.compile safe. Online-quant allow-list += torch.int8.
  • model_ops/moe.py: new Int8MoEMethod (int8 w13/w2 + per-channel fp32 scales) and an int8 branch in FusedMoE._online_quant.
  • model_ops/fused_moe_triton.py: triton_kernel_int8_moe_forward: matmul-ogs routing → per-token int8 quant → moe_gemm_int8_smoothquant (gemm1 with fused gated-SiLU via interleaved w13 columns) → per-token int8 quant → gemm2 with scatter/combine.
  • models/qwen3_5_mtp.py + model_loader/loader.py: MTP-MoE drafter fix. The draft's fused expert weights (experts.gate_up_proj/down_proj) were silently dropped at load, which gave 0% draft acceptance and made MTP pure overhead. Add the draft's fused-expert mapping (detect_fused_expert_format / get_fused_expert_mapping / load_fused_expert_weights), fix get_expert_mapping to use num_experts, and let the loader resolve load_fused_expert_weights_fn from the model. After the fix: acceptance 0 → 0.83.
  • entrypoints/openai/tool_parser.py: unique tool-call ids (call_<uuid> instead of a per-response call_0). Non-unique ids made agentic clients (qwen-code) dedupe every tool call after the first, causing an endless tool-call loop. Extends #1319 (feat(openai): Qwen3 (qwen3_coder/qwen3_xml) tool-call support).
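
The tool-call id problem in the last bullet is easy to reproduce in isolation. The sketch below is not the ATOM parser: `make_tool_call_id` and `dedupe_by_id` are hypothetical helpers, the client behaviour is a simplified stand-in for what qwen-code is described as doing, and using the hex form of the UUID is an assumption about the exact id format.

```python
import uuid


def make_tool_call_id():
    """Return an id of the form call_<uuid> (hex form is an assumption)."""
    return f"call_{uuid.uuid4().hex}"


def dedupe_by_id(tool_calls):
    """Mimic a client that keeps only the first tool call seen for each id."""
    seen, kept = set(), []
    for call in tool_calls:
        if call["id"] not in seen:
            seen.add(call["id"])
            kept.append(call)
    return kept


# Old behaviour: the counter restarts per response, so two responses both
# produce "call_0" and the second call is dropped by the client.
colliding = [
    {"id": "call_0", "name": "read_file"},
    {"id": "call_0", "name": "write_file"},
]
# New behaviour: every call gets a globally unique id.
unique = [{"id": make_tool_call_id(), "name": n} for n in ("read_file", "write_file")]

print(len(dedupe_by_id(colliding)), len(dedupe_by_id(unique)))
```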

Quality (gsm8k, 5-shot-equivalent, chat + thinking, greedy)

  • 35B-A3B INT8 W8A8 = 0.84 — BF16-equivalent (int8 is faithful). MTP is lossless (accepts a draft token only when it matches the target's greedy argmax), so the MTP build has identical quality.
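
The "lossless" claim follows from the acceptance rule itself, which can be sketched in a few lines. This is an illustration of greedy draft verification in general, not the ATOM code path; `verify_greedy` and its argument layout are hypothetical.

```python
def verify_greedy(draft_tokens, target_argmax):
    """Return the tokens emitted by one verify step under greedy decoding.

    draft_tokens:  k tokens proposed by the drafter.
    target_argmax: k + 1 tokens, where entry i is the target's greedy choice
                   given the prefix followed by draft_tokens[:i].
    """
    emitted = []
    for i, draft in enumerate(draft_tokens):
        if draft != target_argmax[i]:
            break  # first mismatch: discard this and all later drafts
        emitted.append(draft)
    # The target's own token is always emitted: either the correction at the
    # first mismatch or the bonus token after every draft was accepted.
    emitted.append(target_argmax[len(emitted)])
    return emitted


accepted = verify_greedy([7], [7, 9])   # draft matches: two tokens this step
rejected = verify_greedy([7], [5, 9])   # draft differs: target token only

# With one draft token accepted with probability p, a step yields 1 + p
# tokens on average. Reading the 0.83 above as p is an assumption.
expected_tokens_per_step = 1 + 0.83
```

Every emitted token is one the target would have produced on its own, so the output sequence is unchanged. The 1 + p figure is an upper bound on speedup that ignores the cost of running the drafter and the wider verify step.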

Performance (gfx1151 / Radeon 8060S, bs=1)

Decode (single-stream, short context):

| Model | Config | Decode tok/s |
|---|---|---|
| 27B dense | INT8 W8A8 | 6.0 |
| 27B dense | INT8 W8A8 + MTP-1 | 9.4 |
| 35B-A3B | INT8 W8A8 + HIP graph | 24.8 |
| 35B-A3B | INT8 W8A8 + MTP-1 + HIP graph | ~35 |

Long-context (35B-A3B INT8 W8A8 + MTP-1, bs=1):

| Context | Prefill TTFT | Prefill tok/s | Decode (output) tok/s | Total tok/s |
|---|---|---|---|---|
| 64K (60,016 tok) | 85.4 s | 703 | 23.4 | 661 |
| 128K (119,071 tok) | 191.8 s | 621 | 17.3 | 598 |

Decode tok/s falls with context (each step reads the growing KV); prefill is compute-bound (one-time prompt-ingestion cost). The hybrid model's KV is cheap (only the interleaved full-attn layers cache KV), so 128K fits easily — at gpu-memory-utilization 0.9 the KV pool holds ~2.1M tokens; the limit is --max-model-len, not memory.
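
As a quick consistency check on the long-context numbers, the prefill rate is just prompt tokens divided by TTFT. The snippet below only re-derives figures already reported above; it adds no new measurements.

```python
runs = {
    "64K": {"prompt_tokens": 60_016, "ttft_s": 85.4},
    "128K": {"prompt_tokens": 119_071, "ttft_s": 191.8},
}

# Prefill throughput = prompt tokens / time to first token.
prefill_tok_s = {name: r["prompt_tokens"] / r["ttft_s"] for name, r in runs.items()}
for name, rate in prefill_tok_s.items():
    print(f"{name}: {rate:.0f} prefill tok/s")

# Relative decode rate at 128K versus 64K, from the reported tok/s.
decode_ratio = 17.3 / 23.4
print(f"decode at 128K runs at {decode_ratio:.0%} of the 64K rate")
```

Both prefill figures round to the reported 703 and 621 tok/s, and doubling the context costs roughly a quarter of the decode rate.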

Serve

ATOM_USE_UNIFIED_ATTN=1 \
python -m atom.entrypoints.openai_server --model Qwen/Qwen3.6-35B-A3B \
  --trust-remote-code -tp 1 --kv_cache_dtype bf16 --block-size 64 \
  --max-model-len 131072 --max-num-seqs 2 --gpu-memory-utilization 0.9 \
  --method mtp --num-speculative-tokens 1 \
  --online_quant_config '{"global_quant_config":"ptpc_i8","exclude_layer":["*linear_attn*","*lm_head*","*shared_head*","*embed_tokens*","*mlp.gate"]}'

(Drop --method mtp ... for 35B if you don't want MTP; for the dense 27B MTP is a ~1.6× lossless win.)
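
To see which layers the exclude_layer patterns in the command above would keep in BF16, here is a small sketch. It assumes the patterns are shell-style globs (the semantics of Python's `fnmatch`), which is not confirmed here, and the layer names are hypothetical examples rather than names read from the checkpoint.

```python
import json
from fnmatch import fnmatchcase

online_quant_config = json.loads(
    '{"global_quant_config":"ptpc_i8","exclude_layer":'
    '["*linear_attn*","*lm_head*","*shared_head*","*embed_tokens*","*mlp.gate"]}'
)


def is_excluded(layer_name, patterns):
    """True if any glob pattern matches the full layer name (assumed semantics)."""
    return any(fnmatchcase(layer_name, p) for p in patterns)


# Hypothetical layer names, for illustration only.
layers = [
    "model.layers.0.linear_attn.in_proj",
    "model.layers.0.mlp.gate",
    "model.layers.0.mlp.experts.gate_up_proj",
    "model.layers.3.self_attn.q_proj",
    "lm_head",
]
precision = {
    name: "bf16" if is_excluded(name, online_quant_config["exclude_layer"]) else "int8"
    for name in layers
}
for name, dtype in precision.items():
    print(f"{dtype:5s} {name}")
```

Under these semantics `*mlp.gate` has no trailing wildcard, so it matches the router gate but not a name such as `...experts.gate_up_proj`, which is what keeps the expert weights on the int8 path.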

Dependency

gyohuangxin and others added 13 commits June 22, 2026 23:55
* [ATOM SGL]Add dsv4 ci

Co-authored-by: Cursor <cursoragent@cursor.com>
* feat(minimax_m3): add MXFP4 native support

Introduce the minimal MiniMax-M3 MXFP4 native ATOM path without BF16, MXFP8, EAGLE, or unified-attention support.

* fix(minimax_m3): align FP4 Triton paths with BF16 branch

Keep the split MXFP4 PR aligned with the BF16 branch for shared Triton kernel paths while removing the extra package marker file.

Co-authored-by: Cursor <cursoragent@cursor.com>

* feat(minimax_m3): add MXFP4 native support

Introduce the minimal MiniMax-M3 MXFP4 native ATOM path without BF16, MXFP8, EAGLE, or unified-attention support.

* fix(attn): use Triton for unsupported GQA decode

Route the block-size 128 GQA decode shape used by MiniMax-M3 away from generic PA ASM, which has no matching AITER kernel in the validation image.

* chore(minimax_m3): trim FP4 split cleanup

Remove the extra MiniMax-M3 module docstring note and keep Triton attention selection controlled by the existing environment flag.

* docs(minimax_m3): force Triton attention in MXFP4 recipe

Document the required ATOM_FORCE_ATTN_TRITON flag for the MXFP4 TP4 launch path.

* chore(minimax_m3): fix Black formatting and trim comments

Remove the extra blank lines flagged by Black and keep MiniMax-M3 sparse attention comments focused on ATOM's FP4 path.

Co-authored-by: Cursor <cursoragent@cursor.com>

* chore(minimax_m3): simplify text config normalization

Copy MiniMax-M3 text config attributes generically so the FP4 path keeps required root config fields without maintaining a long field allowlist.

Co-authored-by: Cursor <cursoragent@cursor.com>

* chore(attn): generalize sparse block-size handling

Read sparse attention block-size requirements from the HF sparse attention config instead of hard-coding the MiniMax-M3 sparse attention constant in the shared AITER metadata builder.

Co-authored-by: Cursor <cursoragent@cursor.com>

* chore(attn): use generic sparse metadata naming

Keep MiniMax-M3 sparse metadata construction local to the sparse attention path while exposing it through generic attention metadata fields in the shared AITER builder.

Co-authored-by: Cursor <cursoragent@cursor.com>

* chore(attn): generalize indexed sparse marker

Use a model-agnostic marker for indexed sparse attention modules so the shared AITER cache binding path no longer checks a MiniMax-M3-specific attribute name.

Co-authored-by: Cursor <cursoragent@cursor.com>

* chore(attn): generalize sparse cache names

Use generic indexed sparse cache and metadata helper names in the shared AITER attention path while keeping the MiniMax-M3 sparse implementation module unchanged.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor attention code

* keep ATOM_USE_UNIFIED_ATTN path

---------

Co-authored-by: xytpai <xytpai@foxmail.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: XiaobingSuper <xiaobingzhangupc@gmail.com>
* mla

* fix

* fix

---------

Co-authored-by: HaonanWang98 <hwang@amd.com>
Co-authored-by: feifei14119 <carlus.huang@amd.com>
* Move gpt-oss and kimi2.5 CI from mi355 to mi350

* Move deepseek-v4-flash and qwen3.5 vllm CI from mi355 to mi350

* Add runner user info

* Add clean up containers function for runner atom-mi35x-8gpu-oot-acc

* CI: tolerate missing Docker config on OOT runners

---------

Co-authored-by: Xin Huang <Xin.Huang@amd.com>
… on MoE

RDNA3.5 WMMA supports int8 natively (no FP8/FP4), so int8 is the quantization
target for this iGPU. A8W8 (int8 weight + dynamic per-token int8 activation) for
all dense GEMMs and MoE experts; BF16 for the GDN linear-attn (recurrent,
quant-sensitive), router gate, lm_head, embeddings; KV cache BF16.

- model_ops/linear.py: per_Token int8 branch -> aiter Triton gemm_a8w8 on
  non-gfx9 (CK gemm_a8w8_CK is gfx9-only), wrapped as a torch custom op so it is
  HIP-graph / torch.compile safe. Online-quant allow-list += torch.int8.
- model_ops/moe.py + model_ops/fused_moe_triton.py: Int8MoEMethod + int8 branch
  in FusedMoE._online_quant, and triton_kernel_int8_moe_forward using aiter
  moe_gemm_int8_smoothquant (gemm1 with fused gated-SiLU via interleaved w13).
  Enables 35B-A3B, which has no BF16 MoE kernel on gfx1151.
- models/qwen3_5_mtp.py + model_loader/loader.py: fix the MTP-MoE drafter so the
  draft's fused expert weights load (add detect_fused_expert_format /
  get_fused_expert_mapping / load_fused_expert_weights; get_expert_mapping uses
  num_experts; loader resolves load_fused_expert_weights_fn from the model).
  Draft acceptance 0 -> 0.83; MTP now a net win on the MoE model.
- model_ops/topK.py: keep the shared expert as a separate MLP on non-gfx9 so the
  routed MoE uses the portable Triton path.
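
The "fused gated-SiLU via interleaved w13" step can be pictured with a toy example. The sketch assumes the interleaved order is gate, up, gate, up along the output columns; the actual layout expected by moe_gemm_int8_smoothquant is not stated here, and the helper names are illustrative.

```python
import math


def silu(v):
    """SiLU activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def interleave(gate_cols, up_cols):
    """Interleave per-channel values as [g0, u0, g1, u1, ...] (assumed order)."""
    out = []
    for g, u in zip(gate_cols, up_cols):
        out.extend([g, u])
    return out


def gated_silu_interleaved(row):
    """Apply silu(gate) * up to adjacent (gate, up) pairs of one gemm1 output row."""
    return [silu(row[i]) * row[i + 1] for i in range(0, len(row), 2)]


gate = [1.0, -2.0, 0.5]
up = [3.0, 0.5, -1.0]
fused = gated_silu_interleaved(interleave(gate, up))
split = [silu(g) * u for g, u in zip(gate, up)]  # reference: separate halves
print(fused)
```

With this layout the gate and up values for a channel sit next to each other, so an epilogue can apply the activation on the tile it already holds instead of reading a second half of the output.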
k50112113 and others added 16 commits June 24, 2026 18:24
* add m3 mxfp8 support

* add mxfp8 recipe

* wip

Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com>

* revert dequant fp8 back to bf16 for linear layers

* update m3 recipe

* format

* remove hard code dtype

---------

Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com>
Co-authored-by: XiaobingSuper <xiaobingzhangupc@gmail.com>
Co-authored-by: Haoyang Li <lihaoyang0109@gmail.com>
Co-authored-by: ganyi <ygan@amd.com>
Co-authored-by: Guanbao Yu <gyu@amd.com>
Co-authored-by: wuhuikx <hattie.wu@amd.com>
* [fix](qwen): fix qwen3.5 accuracy

* [fix](attn): delete extra code

* [fix](attn): add kv cache to mutate args

* [fix](qwen): remove quick allreduce in qwen3.5

---------

Co-authored-by: perzhang <perzhang@amd.com>
)

* Add NUMA-aware CPU/memory binding

* Add glm-5-2-fp8 benchmark dispatch checkbox
… AAC machine (#1346)

* Modify atom-sgl-accuracy workflow to adapt it for AAC machine
Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>
* docs: revise M3 fp8/gluon port plan for first-class framework compat

Replace the env-gated bolt-on approach with one driven by main's existing
attention-framework contracts: fp8 selected by config.kv_cache_dtype,
scales returned via KVCacheTensor, binding through build_kv_cache_tensor/
bind_kv_cache, insert via the quantized hook, metadata via make_sparse_*
factories, frozen custom-op signature, CUDAGraph-safe scratch, byte
accounting. Adds a 9-point contract checklist mapped to each task.

Co-Authored-By: Claude <noreply@anthropic.com>

* feat(attn): SparseMHAPagedAttentionImpl skeleton + Attention impl_cls override

Task 0 of the MiniMax-M3 fp8 KV cache + gluon PA port. Adds the subclass
scaffold (SparseMHAPagedAttentionImpl extends PagedAttentionImpl, overriding
only rope_cache + dispatch_backend via delegation for now) and an optional
impl_cls kwarg on Attention.__init__ so a model can plug in a specialized impl
while reusing the backend's metadata builder. Indexer state lives on the impl.

Co-Authored-By: Claude <noreply@anthropic.com>

* feat(minimax_m3): page-16 constants + fused SHUFFLE KV-insert kernel

Task 1 of the M3 fp8/gluon port. Adds ASM_PAGE_SIZE=16 / PAGES_PER_SPARSE_BLOCK=8
and grafts the Triton fused Gemma-RMSNorm + partial-NeoX-RoPE + page-16 SHUFFLE
KV-insert kernel (+ host wrapper) from origin/ganyi/shuffle_kv_cache_fp8_eagle.
GPU round-trip test validates q_out/index_q_out vs PyTorch ref and K/V/index
cache scatter at each token slot.

Co-Authored-By: Claude <noreply@anthropic.com>

* feat(minimax_m3): page-16 sparse block-table builders + fused topk EMIT_SPARSE_BT

Task 2 of the M3 fp8/gluon port. Grafts the decode + prefill page-16 sparse
block-table builders into sparse_attn.py (each selected logical 128-block expands
to 8 contiguous physical 16-pages, partial tail packed last, exact context_lens),
and replaces index_topk.py wholesale with the source-branch version that adds the
fused EMIT_SPARSE_BT block-table emission and MAX_Q spec-decode causal support
(both opt-in via defaulted kwargs, so existing decode callers are unaffected).

Tests: x8 expansion + tail-last packing + ctx lengths for the standalone builder;
fused EMIT path matches the standalone builder bit-for-bit (num_kv_heads==1).

Co-Authored-By: Claude <noreply@anthropic.com>
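
A rough picture of the page-16 expansion described in Task 2, as a pure-Python sketch. Only ASM_PAGE_SIZE=16 and PAGES_PER_SPARSE_BLOCK=8 come from the commit message; the mapping from a physical 128-block to page ids (`block * 8 + j`) and the function signature are assumptions made for illustration.

```python
ASM_PAGE_SIZE = 16
PAGES_PER_SPARSE_BLOCK = 8
BLOCK_TOKENS = ASM_PAGE_SIZE * PAGES_PER_SPARSE_BLOCK  # 128


def build_page16_table(selected, block_table, seq_len):
    """Expand selected logical 128-blocks into physical 16-token pages.

    selected:    logical block indices chosen by top-k, in any order.
    block_table: logical block index -> physical 128-block id.
    seq_len:     tokens in the sequence; the last block may be partial.
    Returns (page_ids, context_len), with the partial tail block packed last.
    """
    last = (seq_len - 1) // BLOCK_TOKENS
    tail_tokens = seq_len - last * BLOCK_TOKENS  # 1..128
    tail_is_partial = tail_tokens < BLOCK_TOKENS

    pages, context_len = [], 0
    for b in selected:
        if b == last and tail_is_partial:
            continue  # handled after the full blocks
        base = block_table[b] * PAGES_PER_SPARSE_BLOCK  # assumed aliasing
        pages.extend(range(base, base + PAGES_PER_SPARSE_BLOCK))
        context_len += BLOCK_TOKENS
    if tail_is_partial and last in selected:
        base = block_table[last] * PAGES_PER_SPARSE_BLOCK
        n_pages = -(-tail_tokens // ASM_PAGE_SIZE)  # ceil division
        pages.extend(range(base, base + n_pages))
        context_len += tail_tokens
    return pages, context_len


pages, context_len = build_page16_table(selected=[2, 0], block_table=[5, 9, 3], seq_len=300)
print(pages, context_len)
```

Here logical block 2 holds only 44 tokens, so it contributes 3 pages and is placed after the full block even though top-k listed it first.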

* feat(minimax_m3): gluon PA decode + prefill runners over page-16 SHUFFLE cache

Task 3 of the M3 fp8/gluon port. Grafts minimax_m3_sparse_attn_decode_asm,
minimax_m3_sparse_attn_prefill_asm, and the shared _run_prefill_fp8_gluon helper
from the source branch: index top-k -> page-16 sparse block table -> AITER gluon
split-KV paged-attention (run_pa_decode_gluon), with fp8 vs bf16 compute_type and
per-token scales selected by the KV cache dtype. Adds `import aiter` (used for
aiter.dtypes.fp8). Parity test (gluon vs Triton split-K decode reference) for
gqa 8/16; validated further by the existing asm/fp8/prefill oracle tests.

Co-Authored-By: Claude <noreply@anthropic.com>

* feat(attn): implement SparseMHAPagedAttentionImpl.rope_cache override

Task 4 of the M3 fp8/gluon port. The override runs MiniMax-M3's fused
qk-norm + partial-NeoX-RoPE + page-16 SHUFFLE KV insert + indexer-key insert via
aiter.fused_qknorm_idxrqknorm (consuming the packed qkv), reading the SHUFFLE
K/V + scale + index caches off the bound layer. It returns the parent's 7-tuple
(query rotated) and stashes the rotated indexer query on self._index_q for
dispatch_backend. fp8 vs bf16 selected by kv_cache_dtype; fp8 writes per-token
dequant scales into k_scale/v_scale. Adds the _minimax_m3_cos_sin_cache helper.

Test (bf16 + fp8): override returns the 7-tuple, populates _index_q with correct
shape, and mutates the KV/index caches (+ fp8 scales).

Co-Authored-By: Claude <noreply@anthropic.com>

* feat(attn): implement SparseMHAPagedAttentionImpl.dispatch_backend override

Task 5 of the M3 fp8/gluon port. dispatch_backend returns the M3 sparse
prefill/decode backend callable (parent contract
fn(q,k,v,k_cache,v_cache,k_scale,v_scale,fwd_ctx)). Both paths select per-token
top-k index blocks with the fused page-16 sparse block-table emit, then run the
gluon split-KV paged-attention over the SHUFFLE cache; fp8 vs bf16 follows the
cache dtype inside the runners. Prefill uses the sync-free on-device metadata
fallback (query_req_id/abs_pos/qo_indptr=None). Consumes self._index_q from
rope_cache and clears it afterward.

Note: index_cache is page-128 3D [num_logical, 128, idx_head_dim], indexed by the
logical block_table in index-topk (distinct from the page-16 SHUFFLE KV cache).
Test (bf16+fp8): dispatch returns the decode callable; running it yields finite
[tokens, nh, hd] output and clears _index_q.

Co-Authored-By: Claude <noreply@anthropic.com>

* first version of refactor

Signed-off-by: ganyi <ygan@amd.com>

* remove unnecessary files

Signed-off-by: ganyi <ygan@amd.com>

* runnable and can respond with reasonable output

Signed-off-by: ganyi <ygan@amd.com>

* acc right

Signed-off-by: ganyi <ygan@amd.com>

* reuse mha's allocation for main cache,  view at use time

Signed-off-by: ganyi <ygan@amd.com>

* remove prepare mtp metadata

Signed-off-by: ganyi <ygan@amd.com>

* format

Signed-off-by: ganyi <ygan@amd.com>

* format

Signed-off-by: ganyi <ygan@amd.com>

* resolve comments

Signed-off-by: ganyi <ygan@amd.com>

---------

Signed-off-by: ganyi <ygan@amd.com>
Co-authored-by: Claude <noreply@anthropic.com>
…k_size' (#1348)

Co-authored-by: junxiaguo <JunXia.Guo@amd.com>
…bort (#1322) (#1339)

During CUDAGraph capture, MiniMax-M3's autotuned _topk_index_partial_kernel
discards candidate CompiledKernels. A gen-0 GC firing inside the stream-capture
region runs CompiledKernel.__del__ -> hipModuleUnload, which HIP forbids while a
stream is capturing (HIP 900), corrupting the capture and aborting the
custom_all_reduce IPC handshake (SIGABRT). gc.freeze() did not help because the
discarded kernels are created mid-loop. Disable GC for the whole capture window
and restore via try/finally.
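
The fix pattern described above (disable GC for the whole capture window, restore via try/finally) looks roughly like this. `gc_disabled` and `capture_graph` are hypothetical names and `capture_fn` stands in for the real graph-capture call; only the standard-library `gc` functions are real APIs.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_disabled():
    """Turn the cyclic garbage collector off for the duration of the block."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        # Restore the previous state even if capture raises.
        if was_enabled:
            gc.enable()


def capture_graph(capture_fn):
    """Run capture_fn with GC disabled so no collection fires mid-capture."""
    with gc_disabled():
        return capture_fn()


before = gc.isenabled()
inside = capture_graph(gc.isenabled)  # observe GC state inside the window
after = gc.isenabled()
print(before, inside, after)
```

Unlike gc.freeze(), which only exempts objects that already exist, this also covers objects created and discarded inside the window, matching the reasoning in the commit message.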
* feat: RTPLLM plugin GLM5 integration

* feat: RTPLLM GLM5 enable cuda graph

* fix: RTP glm5 qwen35 cuda graph conflict

* fix: RTP crash when long input_len > 16384

* fix:[RTP] making GLM5 run true Sparse MLA

* refactor: RTP glm5 code

* feat: RTP glm5 optimize sparse decode path

* refactor: RTP remove redundant envs

* refactor: [RTP] unify GLM5 MLA on sparse path, drop dead dense backend

* fix: RTP GLM5 prefill reuse Sparse MLA metadata

* fix: RTP GLM5 enable FP8 MLA path

* feat: RTP GLM5 conflict issue after rebase

* fix: RTP plugin imports conflict after rebase main

* refactor: RTP GLM5 tests merge

* refactor: cleanup GLM5 RTP sparse MLA backend

* refactor: RTP remove redundant labels

* refactor: RTP GLM5 remove redundant code

* refactor: RTP GLM5 remove mla redundant code

* fix: RTP Qwen35 use prewarmed req id buffer for RTP CUDA graphs

* fix: RTP remove redundant qwen35 code
* feat(minimax-m3): split index cache projection

Route MiniMax-M3 index Q/K through a separate projection and thread it through the attention stack so cached top-k layers can skip indexer work while preserving the non-cache path.

Co-authored-by: Cursor <cursoragent@cursor.com>

* refactor(minimax-m3): keep indexer qk packed

Keep MiniMax-M3 index Q/K in the packed QKV projection so index-cache support only skips top-k work and does not require a separate aiter input ABI.

Co-authored-by: Cursor <cursoragent@cursor.com>

* chore(minimax-m3): drop leftover formatting noise

Remove residual formatting-only changes from the packed index-cache refactor so the branch only carries functional sparse-attention updates.

Co-authored-by: Cursor <cursoragent@cursor.com>

* code format

* chore(minimax-m3): remove index cache debug logging

Drop temporary hit/miss logging and counters from the MiniMax-M3 top-k cache path now that the packed index-cache flow is settled.

Co-authored-by: Cursor <cursoragent@cursor.com>

---------

Co-authored-by: Cursor <cursoragent@cursor.com>
On gfx1250 with ATOM_USE_UNIFIED_ATTN, a prefix-cache hit during prefill
fell back to the Triton unified_attention path instead of the sink ASM
varlen kernel, because _can_attempt_prefill_sink_asm bailed on has_cached
and on max_seqlen_q != max_seqlen_k.

The gfx1250 sink varlen ASM kernel (fmha_fwd_with_sink_varlen_asm) actually
handles bottom-right causal for sq < sk (chunked-prefill), and cu_seqlens_q/
cu_seqlens_k already carry the per-request new-token vs cached+new lengths.
Verified on gfx1250 against a bottom-right causal + per-head sink reference
(single/multi-batch, GQA, sq=1) within bf16 tolerance, and end-to-end on
gpt-oss-120b (full-attention layers take the ASM path on a cache hit; the
forced-Triton path never gathers).

Changes:
- _can_attempt_prefill_sink_asm: drop the has_cached and
  max_seqlen_q == max_seqlen_k gates.
- prefill_attention: gather the cached+new KV into a dense packed tensor here,
  where the ASM varlen kernel consumes it. Each prefill backend now prepares
  its own KV: the ASM path gathers; the Triton path reads the paged cache
  directly via block_table and never gathers.
- rope_cache: no longer gathers, so dispatch_backend sees q/k with matching
  token counts (sq == sk) and _can_use_prefill_sink_asm's shape check stays
  valid.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
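
For reference, "bottom-right causal" with sq < sk means the query rows are aligned to the end of the key sequence, so each new token sees all cached keys plus the new tokens up to itself. A minimal mask builder, independent of the ASM kernel (`bottom_right_causal_mask` is an illustrative name):

```python
def bottom_right_causal_mask(sq, sk):
    """Return an sq x sk boolean mask; True means query i may attend to key j.

    The causal diagonal is anchored at the bottom-right corner, so query i
    (0-based among the new tokens) sits at absolute position i + (sk - sq).
    """
    offset = sk - sq
    return [[j <= i + offset for j in range(sk)] for i in range(sq)]


# 3 cached tokens + 2 new tokens: sq = 2, sk = 5.
for row in bottom_right_causal_mask(2, 5):
    print("".join("x" if allowed else "." for allowed in row))
```

With sq == sk this reduces to the usual lower-triangular mask, and with sq == 1 the single query row sees every key.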
* [m3 eagle] migrate draft-side EAGLE3 optimizations (Phase 1)

Bring the model-agnostic / draft-side MiniMax-M3 EAGLE3 work from
wuhuikx/atom-m3-bf16-to-main (2f1c385). These files' pre-eagle base is
byte-identical to current main, so they port as-is:

- eagle3_llama.py / eagle3_deepseek_mla.py: draft fusions (fused dual-RMSNorm
  +concat, fused group-RMSNorm aux, AR+RMSNorm fusion), compute_draft_token,
  replicated-embed option.
- fused_aux_rmsnorm.py (new): the fused RMSNorm kernels for the draft.
- lm_head_argmax.py (new) + embed_head.py: distributed greedy argmax (all-gather
  [N,2] per-rank maxima instead of full [N,vocab] logits).
- spec_decode/eagle.py: draft loop with distributed-argmax fast path, no-pre-concat
  aux, and Eagle3 MHA draft KV-cache transfer for PD disaggregation (from #1331).
- envs.py: ATOM_EAGLE_REPLICATE_EMBED.
- tests/test_lm_head_argmax.py (new, importorskip(aiter) for the no-aiter CI).

Target-side enablement (aux-hidden capture in minimax_m3, q>1 spec-verify
metadata, prepare_mtp_decode) follows in Phase 2; note eagle.py now references
attn_metadata_builder.prepare_mtp_decode which Phase 2 adds.

Mocked suite: 437 passed / 38 pre-existing failures / +1 new skip — no regression.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
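
The distributed greedy argmax idea from the list above can be simulated without any GPUs: each rank reduces its vocabulary shard to one (value, index) pair, and only those pairs are exchanged. This sketch handles a single token and assumes the 2 in [N,2] means (max value, global index); the function names and the lowest-index tie-break are illustrative choices.

```python
def local_max(shard, vocab_offset):
    """Return (max logit, global vocab index) for one rank's shard."""
    idx = max(range(len(shard)), key=shard.__getitem__)  # first max wins
    return shard[idx], vocab_offset + idx


def distributed_argmax(shards):
    """Greedy token id from per-rank maxima instead of the full logits row."""
    candidates, offset = [], 0
    for shard in shards:
        # In a real run this pair is what each rank would all-gather.
        candidates.append(local_max(shard, offset))
        offset += len(shard)
    # Highest value wins; ties go to the lowest global index.
    _, best_index = max(candidates, key=lambda c: (c[0], -c[1]))
    return best_index


shards = [[0.1, 2.0], [3.5, -1.0], [3.5, 0.0]]  # vocab of 6 split over 3 ranks
full = [v for shard in shards for v in shard]
token = distributed_argmax(shards)
reference = max(range(len(full)), key=full.__getitem__)
print(token, reference)
```

The payload per token is two numbers per rank rather than a full vocabulary row, which is the saving the commit message refers to.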

* [m3 eagle] target-side enablement on main's M3 API (Phase 2)

Enable MiniMax-M3 EAGLE3 on current main's (Triton-sparse) M3 base, adapting the
target side to main's API instead of wuhuikx's asm/gluon infra (absent on main).

aiter_attention.py:
- Add the generic block-paged MHA Eagle3 draft metadata: _mtp_prepare_decode_
  metadata_kernel + prepare_mtp_decode + fuse_mtp_decode_position_update (used by
  the migrated eagle.py for both Kimi and M3 drafts; not M3-sparse coupled).
- Replace the two "speculative decode not supported" NotImplementedError sites:
  route q>1 spec-verify through the sparse PREFILL path (make_sparse_prefill_
  metadata; per-query causal via cu_seqlens_q, which is now filled uniformly for
  q>1). prefix_lens is bound to a new persistent sparse_prefix_lens buffer so the
  CUDAGraph-captured sparse indexer reads live causal lengths on each replay.

minimax_m3.py: Eagle3 aux-hidden-state capture (Dynamo-safe, mirrors deepseek_v2):
aux_hidden_state_layers, in-layer residual.clone() after the fused-allreduce norm,
model forward returns (hidden, aux) tuple, set/get_eagle3_aux_hidden_state_layers
on the ForCausalLM + VL-wrapper delegation.

model_runner.py: extend KV transfer regions with the Eagle3 draft pool for PD
disaggregation (#1331).

scheduler.py: trim emitted spec tokens past the stop position (rejection sampler
emits past EOS) so flexible-extract doesn't pick up leaked trailing tokens.
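
A minimal sketch of the trimming step, assuming the stop token itself is kept and everything after it is dropped; whether the scheduler keeps or strips the stop token is not stated here, and `trim_after_stop` is a hypothetical helper.

```python
def trim_after_stop(tokens, stop_ids):
    """Cut a batch of accepted spec tokens at the first stop token (inclusive)."""
    for i, tok in enumerate(tokens):
        if tok in stop_ids:
            return tokens[: i + 1]
    return tokens


EOS = 2  # illustrative stop token id
trimmed = trim_after_stop([17, EOS, 99, 42], {EOS})
untouched = trim_after_stop([17, 18, 19], {EOS})
print(trimmed, untouched)
```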

recipes/MiniMax-M3.md: full EAGLE3 section (with a note that the ASM-PA/fp8/MXFP8
specifics reflect the fully-optimized variant, not this Triton-sparse base).

Drop tests/test_lm_head_argmax.py (per request).

Note: the q>1 sparse-verify path is new on main and CUDAGraph-sensitive — needs
GPU validation (GSM8K + accept on TP4/TP8; confirm Kimi eagle unaffected).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* update recipe
make lint happy

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* remove fp8 attn related command

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* [m3 eagle] recipe: set ATOM_FORCE_ATTN_TRITON=1 in EAGLE launch

main's MiniMaxM3Attention (dense layers) does not set force_triton_attn in code
and attention_mha has no block-128 guard, so on this base the dense attention is
routed to Triton only via ATOM_FORCE_ATTN_TRITON=1 (the MXFP4 base section already
sets it). The EAGLE section migrated from wuhuikx omitted it (wuhuikx set
force_triton_attn=True in code instead), so the spec-verify dense attention
(q=num_spec+1) fell into paged_attention_asm and aborted in get_heuristic_kernel
(no bf16 block-128 ASM-PA kernel). Add the env to the EAGLE launch and drop the
stale MXFP8 model_path line.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* remove ATOM_FORCE_ATTN_TRITON

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* update recipe

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* update recipe

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* update the recipe with the perf

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* refine the comment

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

---------

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* [atom CI/Nightly/Benchmark] Add MiniMax-M3 and Eagle
into atom infra

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

* remove minimax m2.7 case

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>

---------

Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
@zufayu zufayu requested review from ZhangLirong-amd and removed request for ZhangLirong-amd June 26, 2026 06:12
…ing the target top-k (#1362)

`_should_skip_index_topk` force-skips the DSA indexer top-k for the MTP
layer (layer_id >= num_hidden_layers) whenever `index_share_for_mtp_iteration`
is set, making the MTP block reuse the *target* model's top-k. But the MTP
block ships its OWN indexer weights (indexer.wk / wq_b / weights_proj /
k_norm at layer num_hidden_layers in the checkpoint) and is meant to compute
its own top-k for the drafted position. Reusing the target's top-k feeds the
draft a wrong attention context at all sequence lengths.

This is non-standard: neither vLLM upstream (deepseek_mtp.py allocates a
dedicated topk_indices_buffer + Indexer for the MTP block) nor the ATOM
sglang plugin reuses the target index; both compute the MTP top-k
independently. `index_share_for_mtp_iteration` should at most share across
multiple MTP draft steps (num_speculative_tokens > 1), never reuse the
target model's index.

Fix: drop the MTP special-case so the MTP layer computes its own top-k with
its own (loaded) indexer weights, matching vLLM upstream and sglang.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@zufayu zufayu removed the request for review from ZhangLirong-amd June 26, 2026 08:24
@zufayu zufayu requested a review from yhl-amd June 26, 2026 13:58