Enable MiniMax-M3 vLLM plugin path #1342

Draft
lirui927 wants to merge 5 commits into main from lirui/vllm_atom_m3_0624

lirui927 commented Jun 24, 2026

Contributor

Motivation

Enable MiniMax-M3 vLLM plugin path.

Technical Details

  • Add MiniMax-M3 sparse MHA support in the vLLM plugin path.
  • Align sparse MHA KV cache handling with the page-shuffled layout expected by ATOM kernels.
  • Reuse vLLM-provided output buffers in sparse prefill/decode to avoid extra allocation and copy.
  • Handle actual-token slicing for padded vLLM batches and zero padded output tails.
  • Route mixed decode+prefill sparse batches through the prefill path so per-token sparse block tables are built correctly.
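The last three points can be illustrated with a minimal sketch. It is not the code in this PR: `SparseBatch`, `select_sparse_path`, and `write_actual_tokens` are hypothetical names, and plain Python lists stand in for the tensors vLLM would provide. It only shows the intended behavior: any batch containing prefill tokens goes through the prefill path, results are written in place into the caller-provided buffer, and the padded tail is zeroed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SparseBatch:
    """Hypothetical batch summary; field names are assumptions for illustration."""
    num_decode_tokens: int
    num_prefill_tokens: int


def select_sparse_path(batch: SparseBatch) -> str:
    # Mixed decode+prefill batches take the prefill path so that
    # per-token sparse block tables are built for every token.
    return "prefill" if batch.num_prefill_tokens > 0 else "decode"


def write_actual_tokens(
    out_buf: List[float], computed: List[float], num_actual_tokens: int
) -> List[float]:
    # Write results in place into the caller-provided (padded) buffer
    # instead of allocating a new one and copying.
    if num_actual_tokens > len(out_buf) or num_actual_tokens > len(computed):
        raise ValueError("num_actual_tokens exceeds buffer or result length")
    out_buf[:num_actual_tokens] = computed[:num_actual_tokens]
    # Zero the padded tail so stale values never leak downstream.
    for i in range(num_actual_tokens, len(out_buf)):
        out_buf[i] = 0.0
    return out_buf
```

With a buffer padded to 5 slots and 3 actual tokens, only the first 3 entries carry results; the remaining 2 are zero, and the returned object is the same buffer that was passed in.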

Test Plan

  • Start vLLM service and run full GSM8K.

Test Result

atom native

image

vllm-atom

image

Submission Checklist

XiaobingSuper force-pushed the lirui/vllm_atom_m3_0624 branch 3 times, most recently from f9d842b to 35ec283, June 24, 2026 17:52
Comment thread atom/models/minimax_m3.py Outdated
XiaobingSuper force-pushed the lirui/vllm_atom_m3_0624 branch 3 times, most recently from 0644b27 to bc8c975, June 25, 2026 09:03
Comment thread atom/plugin/vllm/attention/layer_mha.py Outdated
lirui927 and others added 2 commits June 25, 2026 13:15
Fix PTPC FP8 MoE loading to preserve offline checkpoint bits and wire MiniMax-M3 sparse MHA metadata/backend support for vLLM serving.

Co-authored-by: Cursor <cursoragent@cursor.com>
Reuse vLLM-provided output buffers in sparse MHA prefill/decode and align the adapter with the page-shuffled KV cache layout used by MiniMax-M3 serving.

Co-authored-by: Cursor <cursoragent@cursor.com>
XiaobingSuper force-pushed the lirui/vllm_atom_m3_0624 branch from f81cf00 to 39ebb75, June 25, 2026 13:21
XiaobingSuper and others added 3 commits June 25, 2026 09:22
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Keep mixed decode/prefill/extend batches phase-local and separate index-cache top-k metadata from main KV-cache sparse block emission to prevent cross-request fp8 accuracy drift.
XiaobingSuper marked this pull request as draft June 26, 2026 02:47
zufayu requested review from ZhangLirong-amd and valarLip and removed request for ZhangLirong-amd June 26, 2026 06:12

4 participants