[gfx1151] Online INT8 W8A8 for Qwen3.6 27B / 35B-A3B on RDNA3.5, with working MTP#1337
Open
carlushuang wants to merge 31 commits into
* [ATOM SGL]Add dsv4 ci Co-authored-by: Cursor <cursoragent@cursor.com>
* fix * use func
* feat(minimax_m3): add MXFP4 native support Introduce the minimal MiniMax-M3 MXFP4 native ATOM path without BF16, MXFP8, EAGLE, or unified-attention support. * fix(minimax_m3): align FP4 Triton paths with BF16 branch Keep the split MXFP4 PR aligned with the BF16 branch for shared Triton kernel paths while removing the extra package marker file. Co-authored-by: Cursor <cursoragent@cursor.com> * feat(minimax_m3): add MXFP4 native support Introduce the minimal MiniMax-M3 MXFP4 native ATOM path without BF16, MXFP8, EAGLE, or unified-attention support. * fix(attn): use Triton for unsupported GQA decode Route the block-size 128 GQA decode shape used by MiniMax-M3 away from generic PA ASM, which has no matching AITER kernel in the validation image. * chore(minimax_m3): trim FP4 split cleanup Remove the extra MiniMax-M3 module docstring note and keep Triton attention selection controlled by the existing environment flag. * docs(minimax_m3): force Triton attention in MXFP4 recipe Document the required ATOM_FORCE_ATTN_TRITON flag for the MXFP4 TP4 launch path. * chore(minimax_m3): fix Black formatting and trim comments Remove the extra blank lines flagged by Black and keep MiniMax-M3 sparse attention comments focused on ATOM's FP4 path. Co-authored-by: Cursor <cursoragent@cursor.com> * chore(minimax_m3): simplify text config normalization Copy MiniMax-M3 text config attributes generically so the FP4 path keeps required root config fields without maintaining a long field allowlist. Co-authored-by: Cursor <cursoragent@cursor.com> * chore(attn): generalize sparse block-size handling Read sparse attention block-size requirements from the HF sparse attention config instead of hard-coding the MiniMax-M3 sparse attention constant in the shared AITER metadata builder. 
Co-authored-by: Cursor <cursoragent@cursor.com> * chore(attn): use generic sparse metadata naming Keep MiniMax-M3 sparse metadata construction local to the sparse attention path while exposing it through generic attention metadata fields in the shared AITER builder. Co-authored-by: Cursor <cursoragent@cursor.com> * chore(attn): generalize indexed sparse marker Use a model-agnostic marker for indexed sparse attention modules so the shared AITER cache binding path no longer checks a MiniMax-M3-specific attribute name. Co-authored-by: Cursor <cursoragent@cursor.com> * chore(attn): generalize sparse cache names Use generic indexed sparse cache and metadata helper names in the shared AITER attention path while keeping the MiniMax-M3 sparse implementation module unchanged. Co-authored-by: Cursor <cursoragent@cursor.com> * refact attention code * keep ATOM_USE_UNIFIED_ATTN path --------- Co-authored-by: xytpai <xytpai@foxmail.com> Co-authored-by: Cursor <cursoragent@cursor.com> Co-authored-by: XiaobingSuper <xiaobingzhangupc@gmail.com>
…#1328) * Add model cache mount for MI308 sglang benchmark
* Move gpt-oss and kimi2.5 CI from mi355 to mi350 * Move deepseek-v4-flash and qwen3.5 vllm CI from mi355 to mi350 * Add runner user info * Add clean up containers function for runner atom-mi35x-8gpu-oot-acc * CI: tolerate missing Docker config on OOT runners --------- Co-authored-by: Xin Huang <Xin.Huang@amd.com>
… on MoE RDNA3.5 WMMA supports int8 natively (no FP8/FP4), so int8 is the quantization target for this iGPU. A8W8 (int8 weight + dynamic per-token int8 activation) for all dense GEMMs and MoE experts; BF16 for the GDN linear-attn (recurrent, quant-sensitive), router gate, lm_head, embeddings; KV cache BF16. - model_ops/linear.py: per_Token int8 branch -> aiter Triton gemm_a8w8 on non-gfx9 (CK gemm_a8w8_CK is gfx9-only), wrapped as a torch custom op so it is HIP-graph / torch.compile safe. Online-quant allow-list += torch.int8. - model_ops/moe.py + model_ops/fused_moe_triton.py: Int8MoEMethod + int8 branch in FusedMoE._online_quant, and triton_kernel_int8_moe_forward using aiter moe_gemm_int8_smoothquant (gemm1 with fused gated-SiLU via interleaved w13). Enables 35B-A3B, which has no BF16 MoE kernel on gfx1151. - models/qwen3_5_mtp.py + model_loader/loader.py: fix the MTP-MoE drafter so the draft's fused expert weights load (add detect_fused_expert_format / get_fused_expert_mapping / load_fused_expert_weights; get_expert_mapping uses num_experts; loader resolves load_fused_expert_weights_fn from the model). Draft acceptance 0 -> 0.83; MTP now a net win on the MoE model. - model_ops/topK.py: keep the shared expert as a separate MLP on non-gfx9 so the routed MoE uses the portable Triton path.
* replace einsum with bmm * fix
* add m3 mxfp8 support * add mxfp8 recipe * wip Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com> * revert dequant fp8 back to bf16 for linear layers * update m3 recipe * format * remove hard code dtype --------- Signed-off-by: Haoyang Li <lihaoyang0109@gmail.com> Co-authored-by: XiaobingSuper <xiaobingzhangupc@gmail.com> Co-authored-by: Haoyang Li <lihaoyang0109@gmail.com> Co-authored-by: ganyi <ygan@amd.com> Co-authored-by: Guanbao Yu <gyu@amd.com> Co-authored-by: wuhuikx <hattie.wu@amd.com>
* [fix](qwen): fix qwen3.5 accuracy * [fix](attn): delete extra code * [fix](attn): add kv cache to mutate args * [fix](qwen): remove quick allreduce in qwen3.5 --------- Co-authored-by: perzhang <perzhang@amd.com>
… AAC machine (#1346) * Modify atom-sgl-accuracy workflow to adapt it for AAC machine
Signed-off-by: whx-sjtu <xiaowang990929@gmail.com>
* docs: revise M3 fp8/gluon port plan for first-class framework compat Replace the env-gated bolt-on approach with one driven by main's existing attention-framework contracts: fp8 selected by config.kv_cache_dtype, scales returned via KVCacheTensor, binding through build_kv_cache_tensor/ bind_kv_cache, insert via the quantized hook, metadata via make_sparse_* factories, frozen custom-op signature, CUDAGraph-safe scratch, byte accounting. Adds a 9-point contract checklist mapped to each task. Co-Authored-By: Claude <noreply@anthropic.com> * feat(attn): SparseMHAPagedAttentionImpl skeleton + Attention impl_cls override Task 0 of the MiniMax-M3 fp8 KV cache + gluon PA port. Adds the subclass scaffold (SparseMHAPagedAttentionImpl extends PagedAttentionImpl, overriding only rope_cache + dispatch_backend via delegation for now) and an optional impl_cls kwarg on Attention.__init__ so a model can plug in a specialized impl while reusing the backend's metadata builder. Indexer state lives on the impl. Co-Authored-By: Claude <noreply@anthropic.com> * feat(minimax_m3): page-16 constants + fused SHUFFLE KV-insert kernel Task 1 of the M3 fp8/gluon port. Adds ASM_PAGE_SIZE=16 / PAGES_PER_SPARSE_BLOCK=8 and grafts the Triton fused Gemma-RMSNorm + partial-NeoX-RoPE + page-16 SHUFFLE KV-insert kernel (+ host wrapper) from origin/ganyi/shuffle_kv_cache_fp8_eagle. GPU round-trip test validates q_out/index_q_out vs PyTorch ref and K/V/index cache scatter at each token slot. Co-Authored-By: Claude <noreply@anthropic.com> * feat(minimax_m3): page-16 sparse block-table builders + fused topk EMIT_SPARSE_BT Task 2 of the M3 fp8/gluon port. 
Grafts the decode + prefill page-16 sparse block-table builders into sparse_attn.py (each selected logical 128-block expands to 8 contiguous physical 16-pages, partial tail packed last, exact context_lens), and replaces index_topk.py wholesale with the source-branch version that adds the fused EMIT_SPARSE_BT block-table emission and MAX_Q spec-decode causal support (both opt-in via defaulted kwargs, so existing decode callers are unaffected). Tests: x8 expansion + tail-last packing + ctx lengths for the standalone builder; fused EMIT path matches the standalone builder bit-for-bit (num_kv_heads==1). Co-Authored-By: Claude <noreply@anthropic.com> * feat(minimax_m3): gluon PA decode + prefill runners over page-16 SHUFFLE cache Task 3 of the M3 fp8/gluon port. Grafts minimax_m3_sparse_attn_decode_asm, minimax_m3_sparse_attn_prefill_asm, and the shared _run_prefill_fp8_gluon helper from the source branch: index top-k -> page-16 sparse block table -> AITER gluon split-KV paged-attention (run_pa_decode_gluon), with fp8 vs bf16 compute_type and per-token scales selected by the KV cache dtype. Adds `import aiter` (used for aiter.dtypes.fp8). Parity test (gluon vs Triton split-K decode reference) for gqa 8/16; validated further by the existing asm/fp8/prefill oracle tests. Co-Authored-By: Claude <noreply@anthropic.com> * feat(attn): implement SparseMHAPagedAttentionImpl.rope_cache override Task 4 of the M3 fp8/gluon port. The override runs MiniMax-M3's fused qk-norm + partial-NeoX-RoPE + page-16 SHUFFLE KV insert + indexer-key insert via aiter.fused_qknorm_idxrqknorm (consuming the packed qkv), reading the SHUFFLE K/V + scale + index caches off the bound layer. It returns the parent's 7-tuple (query rotated) and stashes the rotated indexer query on self._index_q for dispatch_backend. fp8 vs bf16 selected by kv_cache_dtype; fp8 writes per-token dequant scales into k_scale/v_scale. Adds the _minimax_m3_cos_sin_cache helper. 
Test (bf16 + fp8): override returns the 7-tuple, populates _index_q with correct shape, and mutates the KV/index caches (+ fp8 scales). Co-Authored-By: Claude <noreply@anthropic.com> * feat(attn): implement SparseMHAPagedAttentionImpl.dispatch_backend override Task 5 of the M3 fp8/gluon port. dispatch_backend returns the M3 sparse prefill/decode backend callable (parent contract fn(q,k,v,k_cache,v_cache,k_scale,v_scale,fwd_ctx)). Both paths select per-token top-k index blocks with the fused page-16 sparse block-table emit, then run the gluon split-KV paged-attention over the SHUFFLE cache; fp8 vs bf16 follows the cache dtype inside the runners. Prefill uses the sync-free on-device metadata fallback (query_req_id/abs_pos/qo_indptr=None). Consumes self._index_q from rope_cache and clears it afterward. Note: index_cache is page-128 3D [num_logical, 128, idx_head_dim], indexed by the logical block_table in index-topk (distinct from the page-16 SHUFFLE KV cache). Test (bf16+fp8): dispatch returns the decode callable; running it yields finite [tokens, nh, hd] output and clears _index_q. Co-Authored-By: Claude <noreply@anthropic.com> * first version of refactor Signed-off-by: ganyi <ygan@amd.com> * remove unnecessary files Signed-off-by: ganyi <ygan@amd.com> * runable and can response resonable output Signed-off-by: ganyi <ygan@amd.com> * acc right Signed-off-by: ganyi <ygan@amd.com> * reuse mha's allocation for main cache, view at use time Signed-off-by: ganyi <ygan@amd.com> * remove prepare mtp metadata Signed-off-by: ganyi <ygan@amd.com> * format Signed-off-by: ganyi <ygan@amd.com> * format Signed-off-by: ganyi <ygan@amd.com> * resolve comments Signed-off-by: ganyi <ygan@amd.com> --------- Signed-off-by: ganyi <ygan@amd.com> Co-authored-by: Claude <noreply@anthropic.com>
…k_size' (#1348) Co-authored-by: junxiaguo <JunXia.Guo@amd.com>
…bort (#1322) (#1339) During CUDAGraph capture, MiniMax-M3's autotuned _topk_index_partial_kernel discards candidate CompiledKernels. A gen-0 GC firing inside the stream-capture region runs CompiledKernel.__del__ -> hipModuleUnload, which HIP forbids while a stream is capturing (HIP 900), corrupting the capture and aborting the custom_all_reduce IPC handshake (SIGABRT). gc.freeze() did not help because the discarded kernels are created mid-loop. Disable GC for the whole capture window and restore via try/finally.
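The fix described in the commit message above (disable GC for the whole capture window and restore it via try/finally) can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `gc_disabled` and `capture_graph` are hypothetical names, and `capture_fn` stands in for whatever runs inside the stream-capture region.

```python
import gc
from contextlib import contextmanager


@contextmanager
def gc_disabled():
    """Turn the cyclic garbage collector off for the duration of the block.

    The previous state is restored in `finally`, so GC comes back even if
    the body raises.
    """
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()


def capture_graph(capture_fn):
    """Run a (hypothetical) graph-capture callable with GC disabled."""
    with gc_disabled():
        return capture_fn()
```

Note that `gc.disable()` only stops the cyclic collector; objects whose reference count drops to zero are still finalized immediately, so this guards only against collector-triggered destructors of the kind the commit message describes.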
* feat: RTPLLM plugin GLM5 integration * feat: RTPLLM GLM5 enable cuda graph * fix: RTP glm5 qwen35 cuda graph conflict * fix: RTP crash when long input_len > 16384 * fix:[RTP] making GLM5 run true Sparse MLA * refactor: RTP glm5 code * feat: RTP glm5 optimize sparse decode path * refactor: RTP remove redundant envs * refactor: [RTP] unify GLM5 MLA on sparse path, drop dead dense backend * fix: RTP GLM5 prefil reuse Sparse MLA metadata * fix: RTP GLM5 enable FP8 MLA path * feat: RTP GLM5 conflict issue after rebase * fix: RTP plugin imports conflict after rebase main * refactor: RTP GLM5 tests merge * refactor: cleanup GLM5 RTP sparse MLA backend * refactor: RTP remove redundant labels * refactor: RTP GLM5 remove redundant code * refactor: RTP GLM5 remove mla redundant code * fix: RTP Qwen35 use prewarmed req id buffer for RTP CUDA graphs * fix: RTP remove redundant qwen35 code
* feat(minimax-m3): split index cache projection Route MiniMax-M3 index Q/K through a separate projection and thread it through the attention stack so cached top-k layers can skip indexer work while preserving the non-cache path. Co-authored-by: Cursor <cursoragent@cursor.com> * refactor(minimax-m3): keep indexer qk packed Keep MiniMax-M3 index Q/K in the packed QKV projection so index-cache support only skips top-k work and does not require a separate aiter input ABI. Co-authored-by: Cursor <cursoragent@cursor.com> * chore(minimax-m3): drop leftover formatting noise Remove residual formatting-only changes from the packed index-cache refactor so the branch only carries functional sparse-attention updates. Co-authored-by: Cursor <cursoragent@cursor.com> * code format * chore(minimax-m3): remove index cache debug logging Drop temporary hit/miss logging and counters from the MiniMax-M3 top-k cache path now that the packed index-cache flow is settled. Co-authored-by: Cursor <cursoragent@cursor.com> --------- Co-authored-by: Cursor <cursoragent@cursor.com>
On gfx1250 with ATOM_USE_UNIFIED_ATTN, a prefix-cache hit during prefill fell back to the Triton unified_attention path instead of the sink ASM varlen kernel, because _can_attempt_prefill_sink_asm bailed on has_cached and on max_seqlen_q != max_seqlen_k. The gfx1250 sink varlen ASM kernel (fmha_fwd_with_sink_varlen_asm) actually handles bottom-right causal for sq < sk (chunked-prefill), and cu_seqlens_q/ cu_seqlens_k already carry the per-request new-token vs cached+new lengths. Verified on gfx1250 against a bottom-right causal + per-head sink reference (single/multi-batch, GQA, sq=1) within bf16 tolerance, and end-to-end on gpt-oss-120b (full-attention layers take the ASM path on a cache hit; the forced-Triton path never gathers). Changes: - _can_attempt_prefill_sink_asm: drop the has_cached and max_seqlen_q == max_seqlen_k gates. - prefill_attention: gather the cached+new KV into a dense packed tensor here, where the ASM varlen kernel consumes it. Each prefill backend now prepares its own KV: the ASM path gathers; the Triton path reads the paged cache directly via block_table and never gathers. - rope_cache: no longer gathers, so dispatch_backend sees q/k with matching token counts (sq == sk) and _can_use_prefill_sink_asm's shape check stays valid. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
* [m3 eagle] migrate draft-side EAGLE3 optimizations (Phase 1) Bring the model-agnostic / draft-side MiniMax-M3 EAGLE3 work from wuhuikx/atom-m3-bf16-to-main (2f1c385). These files' pre-eagle base is byte-identical to current main, so they port as-is: - eagle3_llama.py / eagle3_deepseek_mla.py: draft fusions (fused dual-RMSNorm +concat, fused group-RMSNorm aux, AR+RMSNorm fusion), compute_draft_token, replicated-embed option. - fused_aux_rmsnorm.py (new): the fused RMSNorm kernels for the draft. - lm_head_argmax.py (new) + embed_head.py: distributed greedy argmax (all-gather [N,2] per-rank maxima instead of full [N,vocab] logits). - spec_decode/eagle.py: draft loop with distributed-argmax fast path, no-pre-concat aux, and Eagle3 MHA draft KV-cache transfer for PD disaggregation (from #1331). - envs.py: ATOM_EAGLE_REPLICATE_EMBED. - tests/test_lm_head_argmax.py (new, importorskip(aiter) for the no-aiter CI). Target-side enablement (aux-hidden capture in minimax_m3, q>1 spec-verify metadata, prepare_mtp_decode) follows in Phase 2; note eagle.py now references attn_metadata_builder.prepare_mtp_decode which Phase 2 adds. Mocked suite: 437 passed / 38 pre-existing failures / +1 new skip — no regression. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * [m3 eagle] target-side enablement on main's M3 API (Phase 2) Enable MiniMax-M3 EAGLE3 on current main's (Triton-sparse) M3 base, adapting the target side to main's API instead of wuhuikx's asm/gluon infra (absent on main). aiter_attention.py: - Add the generic block-paged MHA Eagle3 draft metadata: _mtp_prepare_decode_ metadata_kernel + prepare_mtp_decode + fuse_mtp_decode_position_update (used by the migrated eagle.py for both Kimi and M3 drafts; not M3-sparse coupled). 
- Replace the two "speculative decode not supported" NotImplementedError sites: route q>1 spec-verify through the sparse PREFILL path (make_sparse_prefill_ metadata; per-query causal via cu_seqlens_q, which is now filled uniformly for q>1). prefix_lens is bound to a new persistent sparse_prefix_lens buffer so the CUDAGraph-captured sparse indexer reads live causal lengths on each replay. minimax_m3.py: Eagle3 aux-hidden-state capture (Dynamo-safe, mirrors deepseek_v2): aux_hidden_state_layers, in-layer residual.clone() after the fused-allreduce norm, model forward returns (hidden, aux) tuple, set/get_eagle3_aux_hidden_state_layers on the ForCausalLM + VL-wrapper delegation. model_runner.py: extend KV transfer regions with the Eagle3 draft pool for PD disaggregation (#1331). scheduler.py: trim emitted spec tokens past the stop position (rejection sampler emits past EOS) so flexible-extract doesn't pick up leaked trailing tokens. recipes/MiniMax-M3.md: full EAGLE3 section (with a note that the ASM-PA/fp8/MXFP8 specifics reflect the fully-optimized variant, not this Triton-sparse base). Drop tests/test_lm_head_argmax.py (per request). Note: the q>1 sparse-verify path is new on main and CUDAGraph-sensitive — needs GPU validation (GSM8K + accept on TP4/TP8; confirm Kimi eagle unaffected). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * update recipe make lint happy Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * remove fp8 attn related command Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * [m3 eagle] recipe: set ATOM_FORCE_ATTN_TRITON=1 in EAGLE launch main's MiniMaxM3Attention (dense layers) does not set force_triton_attn in code and attention_mha has no block-128 guard, so on this base the dense attention is routed to Triton only via ATOM_FORCE_ATTN_TRITON=1 (the MXFP4 base section already sets it). 
The EAGLE section migrated from wuhuikx omitted it (wuhuikx set force_triton_attn=True in code instead), so the spec-verify dense attention (q=num_spec+1) fell into paged_attention_asm and aborted in get_heuristic_kernel (no bf16 block-128 ASM-PA kernel). Add the env to the EAGLE launch and drop the stale MXFP8 model_path line. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> * remove ATOM_FORCE_ATTN_TRITON Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * update recipe Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * update recipe Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * update the recipe with the perf Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * refine the comment Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> --------- Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
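The distributed greedy argmax mentioned in the Phase 1 commit above (all-gather `[N,2]` per-rank maxima instead of full `[N,vocab]` logits) can be illustrated with a single-process NumPy sketch. It assumes the vocabulary is sharded contiguously across ranks and simulates the all-gather with `np.stack`; the function names are hypothetical and this is not the code in `lm_head_argmax.py`.

```python
import numpy as np


def local_max(logits_shard, vocab_offset):
    """Per-rank step: return [N, 2] holding (max logit, global token id)."""
    idx = logits_shard.argmax(axis=1)
    val = logits_shard[np.arange(logits_shard.shape[0]), idx]
    # The token id is stored as a float next to the value; this is exact
    # for any realistic vocabulary size.
    return np.stack([val, (idx + vocab_offset).astype(val.dtype)], axis=1)


def distributed_argmax(shards):
    """Combine per-rank maxima into the global greedy token per row."""
    offsets = np.cumsum([0] + [s.shape[1] for s in shards[:-1]])
    # np.stack stands in for the all-gather: shape [ranks, N, 2].
    gathered = np.stack([local_max(s, o) for s, o in zip(shards, offsets)])
    winner = gathered[:, :, 0].argmax(axis=0)  # winning rank per row
    rows = np.arange(gathered.shape[1])
    return gathered[winner, rows, 1].astype(np.int64)
```

Each rank contributes two numbers per row rather than its whole logit shard, which is where the communication saving comes from.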
* [atom CI/Nightly/Benchmark] Add MiniMax-M3 and Eagle into atom infra Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> * remove minimax m2.7 case Signed-off-by: zejunchen-zejun <zejun.chen@amd.com> --------- Signed-off-by: zejunchen-zejun <zejun.chen@amd.com>
* Modify model cache mount
…ing the target top-k (#1362) `_should_skip_index_topk` force-skips the DSA indexer top-k for the MTP layer (layer_id >= num_hidden_layers) whenever `index_share_for_mtp_iteration` is set, making the MTP block reuse the *target* model's top-k. But the MTP block ships its OWN indexer weights (indexer.wk / wq_b / weights_proj / k_norm at layer num_hidden_layers in the checkpoint) and is meant to compute its own top-k for the drafted position. Reusing the target's top-k feeds the draft a wrong attention context at all sequence lengths. This is non-standard: neither vLLM upstream (deepseek_mtp.py allocates a dedicated topk_indices_buffer + Indexer for the MTP block) nor the ATOM sglang plugin reuses the target index; both compute the MTP top-k independently. `index_share_for_mtp_iteration` should at most share across multiple MTP draft steps (num_speculative_tokens > 1), never reuse the target model's index. Fix: drop the MTP special-case so the MTP layer computes its own top-k with its own (loaded) indexer weights, matching vLLM upstream and sglang. Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
[gfx1151] Online INT8 W8A8 for Qwen3.6 27B / 35B-A3B on RDNA3.5 (Strix Halo), with working MTP
Builds on the gfx1151 BF16 enablement (#1314) to add online INT8 W8A8 (Quark-style, no offline quant step) for the Qwen3.6 dense (27B) and MoE (35B-A3B) architectures, plus the fixes needed to make MTP speculative decoding work on the MoE draft. RDNA3.5 WMMA supports int8 natively (no FP8/FP4), so int8 is the right quantization target for this iGPU.
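To make the online W8A8 idea concrete, here is a minimal NumPy reference, assuming symmetric quantization with one fp32 scale per output channel for weights (computed once at load time) and one scale per token for activations (computed at run time). The function names are hypothetical, and the rounding and clipping choices are assumptions for illustration; the real path uses the aiter Triton kernels, not this code.

```python
import numpy as np


def quantize_weight_per_channel(w):
    """w: [out, in] float -> (int8 weights, fp32 scale of shape [out, 1])."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-8).astype(np.float32)  # avoid divide-by-zero
    q = np.clip(np.rint(w / scale), -127, 127).astype(np.int8)
    return q, scale


def quantize_act_per_token(x):
    """x: [tokens, in] float -> (int8 activations, fp32 scale [tokens, 1])."""
    scale = np.abs(x).max(axis=1, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-8).astype(np.float32)
    q = np.clip(np.rint(x / scale), -127, 127).astype(np.int8)
    return q, scale


def gemm_a8w8_reference(x, w_q, w_scale):
    """Quantize x per token, multiply in int32, then rescale to float."""
    x_q, x_scale = quantize_act_per_token(x)
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T  # [tokens, out]
    return acc.astype(np.float32) * x_scale * w_scale.T
```

Because the weight scales depend only on the weights, they can be computed when the BF16 checkpoint is loaded, which is what makes an offline quantization step unnecessary in this kind of scheme.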
Precision split (chosen for quality): A8W8 (int8 weight + dynamic per-token int8 activation) for all dense GEMMs and the MoE experts; BF16 for the Gated-DeltaNet linear-attention (recurrent, quant-sensitive; int8 there produces garbage), the MoE router gate, `lm_head`, and embeddings; KV cache BF16. This keeps gsm8k at BF16-equivalent quality while halving weight bytes relative to BF16 (the decode bottleneck on a bandwidth-bound iGPU).
What this enables
`moe_gemm_int8_smoothquant`, which is the only int8-W8A8 grouped-GEMM that runs on RDNA3.5.
Changes
- `model_ops/linear.py`: `per_Token` int8 branch routes to aiter Triton `gemm_a8w8` on non-gfx9 (CK `gemm_a8w8_CK` is gfx9-only), wrapped as a torch custom op (`torch.ops.aiter.atom_gemm_a8w8_triton`) so it is HIP-graph / `torch.compile` safe. Online-quant allow-list += `torch.int8`.
- `model_ops/moe.py`: new `Int8MoEMethod` (int8 `w13`/`w2` + per-channel fp32 scales) and an int8 branch in `FusedMoE._online_quant`.
- `model_ops/fused_moe_triton.py`: `triton_kernel_int8_moe_forward`: matmul-ogs routing → per-token int8 quant → `moe_gemm_int8_smoothquant` (gemm1 with fused gated-SiLU via interleaved `w13` columns) → per-token int8 quant → gemm2 with scatter/combine.
- `models/qwen3_5_mtp.py` + `model_loader/loader.py`: MTP-MoE drafter fix. The draft's fused expert weights (`experts.gate_up_proj`/`down_proj`) were silently dropped at load → 0% draft acceptance → MTP was pure overhead. Add the draft's fused-expert mapping (`detect_fused_expert_format` / `get_fused_expert_mapping` / `load_fused_expert_weights`), fix `get_expert_mapping` to use `num_experts`, and let the loader resolve `load_fused_expert_weights_fn` from the model. After the fix: acceptance 0 → 0.83.
- `entrypoints/openai/tool_parser.py`: unique tool-call ids (`call_<uuid>` instead of a per-response `call_0`). Non-unique ids made agentic clients (qwen-code) dedupe every tool call after the first → endless tool-call loop. Extends "feat(openai): Qwen3 (qwen3_coder/qwen3_xml) tool-call support" #1319.
Quality (gsm8k, 5-shot-equivalent, chat + thinking, greedy)
Performance (gfx1151 / Radeon 8060S, bs=1)
Decode (single-stream, short context):
Long-context (35B-A3B INT8 W8A8 + MTP-1, bs=1):
Decode tok/s falls with context (each step reads the growing KV); prefill is compute-bound (one-time prompt-ingestion cost). The hybrid model's KV is cheap (only the interleaved full-attn layers cache KV), so 128K fits easily: at `gpu-memory-utilization 0.9` the KV pool holds ~2.1M tokens; the limit is `--max-model-len`, not memory.
Serve
(Drop `--method mtp ...` for 35B if you don't want MTP; for the dense 27B, MTP is a ~1.6× lossless win.)
Dependency
- `carhuang/gfx1151_opus_fp8_guard`): arch-guard the gfx9-only fp8/bf8-cvt builtins. Required (shared with "[gfx1151] Qwen3.5/3.6 (GDN hybrid) BF16 on RDNA3.5 via native Triton attention" #1314). The int8 GEMM/MoE kernels (`gemm_a8w8` Triton, `moe_gemm_int8_smoothquant`, `per_token_quant_hip`) are already upstream in aiter; no new aiter code is needed for the int8 path.
- `carhuang/support_gfx1151_qwen36`): the gfx1151 BF16 base enablement (arch gate, native Triton attention, GDN `block_tables`). Prerequisite.
- `carhuang/qwen3_xml_tool_parser`): qwen3_xml tool-call parsing; the unique-tool-call-id fix here extends it.