Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion docs/ContribOperators.md
Original file line number Diff line number Diff line change
Expand Up @@ -4949,7 +4949,7 @@ This version of the operator has been available since version 1 of the 'com.micr
<dt><tt>use_sparse_mixer</tt> : int</dt>
<dd>Whether to use sparse mixer</dd>
<dt><tt>weights_prepacked</tt> : int</dt>
<dd>Only meaningful when quant_type='int'. Tri-state control over whether the int4/int8 fc1/fc2 weight initializers are already laid out in the CUTLASS fpA_intB format expected by the runner. -1 (auto): let the execution provider choose its own backward-compatible default; the CUDA EP treats auto as prepacked. 1: the initializers are already prepacked (e.g. produced offline by pack_weights_for_cuda_mixed_gemm) and are consumed as-is. 0: the initializers are raw, un-prepacked [E, N, K/pack] tensors as produced by quantize_matmul_{4,8}bits; the kernel runs the CUTLASS layout transform itself in PrePack(), matching the behaviour of MatMulNBits and removing the offline pre-pack requirement from exporters. Defaults to -1 (auto) so each execution provider can pick its own backward-compatible default rather than the schema imposing one.</dd>
<dd>Only meaningful when quant_type='int'. Tri-state control over the layout of the int4/int8 fc1/fc2 weight initializers. The concrete prepacked layouts selected by -1 and 1 are determined by the execution provider. 0: the initializers are raw, un-prepacked [E, N, K/pack] tensors as produced by quantize_matmul_{4,8}bits. Defaults to -1.</dd>
</dl>

#### Inputs (6 - 21)
Expand Down
77 changes: 71 additions & 6 deletions docs/contrib_ops/cuda/moe_qmoe.md
Original file line number Diff line number Diff line change
Expand Up @@ -71,6 +71,7 @@ input tokens → router (top-k softmax) → permute by expert
| `expert_weight_bits` (QMoE only) | int | 4 | 4 (INT4/MXFP4) or 8 (INT8/FP8). |
| `block_size` (QMoE only) | int | -1 | Group size for INT4/INT8 group-wise quantization. -1 = per-output-channel. |
| `quant_type` (QMoE only) | string | `"int"` | `"int"`, `"fp4"`, `"fp8"`, `"wfp4afp8"`. See [§3](#3-quantization-modes). |
| `weights_prepacked` (QMoE only) | int | -1 | Tri-state, only meaningful when `quant_type="int"`. The prepacked layouts selected by `-1` and `1` are **EP-determined**. `-1` (default): the INT4/INT8 `fc1`/`fc2` initializers are already prepacked in the EP's default layout (e.g. from `pack_weights_for_cuda_mixed_gemm` for the CUDA EP). `1`: already prepacked in an alternate EP-selected layout. `0`: the initializers are raw `[E, N, K/pack]` tensors (as produced by `quantize_matmul_{4,8}bits`) and the kernel runs the CUTLASS layout transform in `PrePack()`. **Note:** the CUDA EP INT4/INT8 MoE GEMM always runs the Ampere (SM80) kernel — even on SM90 — so it consumes the SM80 `fpA_intB` layout on all architectures; `-1` and `1` are therefore equivalent for the CUDA EP today, and `1` is reserved for a possible future Hopper-specific layout. See [§5.1](#51-weights-input-2--5--8). |

### 2.2 Type Constraints

Expand Down Expand Up @@ -228,10 +229,53 @@ extra subtraction.

### 5.1 Weights (input 2 / 5 / 8)

Not transformed at runtime. INT4/INT8 weights must already be packed offline by
`pack_weights_for_cuda_mixed_gemm` (see [§6](#6-weight-formats)). MXFP4 weights
must be packed by `pack_fp4_weights_for_cuda_moe_gemm`. FP8 weights are stored
as raw e4m3 bytes (no packing).
**INT4/INT8** weight layout is controlled by the `weights_prepacked` attribute
([§2.1](#21-attributes)). The prepacked layouts selected by `-1` and `1` are
determined by the execution provider:

- **`weights_prepacked=-1` (default)** — the `fc1`/`fc2` weights are already in
the EP's default prepacked layout (e.g. packed offline by
`pack_weights_for_cuda_mixed_gemm` for the CUDA EP). They are copied to GPU
and consumed as-is.
- **`weights_prepacked=1`** — the `fc1`/`fc2` weights are already in the EP's
**SM90** (Hopper) prepacked layout (reserved; see the note below).
- **`weights_prepacked=0`** — the `fc1`/`fc2` weights are raw, schema-conformant
`[E, N, K/pack]` tensors as produced by `quantize_matmul_{4,8}bits`. `PrePack`
runs the CUTLASS layout transform itself via `PrePackIntExpertWeights`,
removing the offline pre-pack dependency. This makes integer QMoE symmetric
with `MatMulNBits::PrePack_B`.

> **Single layout on the CUDA EP.** The CUDA EP INT4/INT8 MoE GEMM always
> dispatches to the Ampere (**SM80**) grouped-GEMM kernel — even on SM90 —
> because mixed int-weight + fp16/bf16 activation is not a valid Hopper TMA
> warp-specialized specialisation (`isValidHopperMOESpecialisation` is `false`).
> This matches **TensorRT-LLM**, which likewise routes `W4A16`/`W8A16` MoE to the
> SM80 kernel on Hopper; its Hopper TMA-WS mixed-dtype MoE kernel is reserved for
> `W4A8` (FP8 activation) and `WFP4A16` (FP4 weight). Consequently the CUDA EP
> consumes the **SM80 `fpA_intB` layout on every GPU**, `PrePack` always packs
> for SM80, and `weights_prepacked=-1` and `=1` are equivalent today. `1` is
> accepted and reserved for a possible future Hopper-specific layout (e.g.
> `W4A8`). There is therefore no architecture-match constraint: SM80-format
> weights run correctly on SM90 via the SM80 kernel.

`PrePackIntExpertWeights` loops over the `E` experts and, per expert, applies the
same transpose + row-permutation / column-interleave / bias / pair-interleave
transform as `pack_weights_for_cuda_mixed_gemm` (see [§6.1](#61-int4-group-wise-quant_typeint-expert_weight_bits4)),
always targeting the SM80 layout. SM75+ is required. The source
`[E, N, K/pack]` initializers are released after their shapes are cached
(`fc1_weights_shape_` / `fc2_weights_shape_`), so peak weight memory stays ~1×.
The prepacked GPU buffers (`packed_fc1_weights_` / `packed_fc2_weights_`) are then
preferred by `ComputeInternal`. If prepacking is disabled at the session level
(`session.disable_prepacking`), the buffers stay null and the raw initializer
pointers are read at compute time instead.

> **Note**: `weights_prepacked=0` is the only path that triggers an in-`PrePack`
> layout transform for INT weights. FP4 / FP8 / WFP4AFP8 weight handling is
> unaffected.

MXFP4 weights must be packed by `pack_fp4_weights_for_cuda_moe_gemm`. FP8 weights
are stored as raw e4m3 bytes (no packing).


### 5.2 INT4/INT8 scales + zero-point → bias

Expand Down Expand Up @@ -287,7 +331,12 @@ This section covers the five distinct weight encodings supported by QMoE.
INT4 packing layout within a byte: `[high_nibble | low_nibble] = [elt_1 | elt_0]`.
Each INT4 element is in `[-8, 7]` (signed) before bias, `[0, 15]` after the +8 bias.

#### Preprocessing pipeline (offline, `pack_weights_for_cuda_mixed_gemm`)
#### Preprocessing pipeline (offline `pack_weights_for_cuda_mixed_gemm`, or in-`PrePack` via `PrePackIntExpertWeights`)

This is the layout transform applied either offline by
`pack_weights_for_cuda_mixed_gemm`, or per-expert inside `PrePack` when
`weights_prepacked=0` (see [§5.1](#51-weights-input-2--5--8)).


1. **Input layout**: `[N, K]` per expert (Out × In), 2 elements per byte for INT4.
2. **Transpose & signed conversion**:
Expand Down Expand Up @@ -405,6 +454,17 @@ weights are interchangeable across SMs:
— does not use `pack_weights_for_cuda_mixed_gemm`.
- **FP8**: no packing.

> **QMoE uses Group A on every GPU.** The table above describes the layouts the
> `pack_weights_for_cuda_mixed_gemm` *preprocessor* can emit. The QMoE INT4/INT8
> MoE GEMM, however, always dispatches to the Ampere (SM80) grouped-GEMM kernel —
> even on SM90 — because mixed int-weight + fp16/bf16 activation is not a valid
> Hopper TMA warp-specialized specialisation (the same is true in TensorRT-LLM).
> It therefore consumes the **Group A (SM80) layout on all architectures,
> including Hopper**. For QMoE, always pack INT4/INT8 weights for SM80 (`arch=80`),
> and `PrePackIntExpertWeights` (`weights_prepacked=0`) does exactly that
> regardless of the runtime device SM. Group B (SM90) layout is currently unused
> by QMoE.

---

## 8. SwiGLU Fusion
Expand Down Expand Up @@ -830,7 +890,7 @@ will not change the operator interface.
|-----------|----------|
| [test_moe_cuda.py](onnxruntime/test/python/transformers/test_moe_cuda.py) | Standard MoE on CUDA: FP16/BF16, SiLU/GeLU/SwiGLU, routing, GEMM parity. SwiGLU coverage includes both GPT-OSS (`TestSwigluMoE`: interleaved, alpha=1.702/beta=1.0/limit=7.0) and Standard/Llama-Gemma (`TestStandardSwigluMoE`: concatenated `swiglu_fusion=2`, alpha=1.0/beta=0.0/no limit → `SiLU(Gate)×Value`). |
| [test_moe_cpu.py](onnxruntime/test/python/transformers/test_moe_cpu.py) | Standard MoE on CPU (smoke). |
| [test_qmoe_cuda.py](onnxruntime/test/python/transformers/test_qmoe_cuda.py) | INT4/INT8 QMoE — primary regression signal for the production QMoE path. Exercises `pack_weights_for_cuda_mixed_gemm` and dequant-then-matmul reference. |
| [test_qmoe_cuda.py](onnxruntime/test/python/transformers/test_qmoe_cuda.py) | INT4/INT8 QMoE — primary regression signal for the production QMoE path. Exercises `pack_weights_for_cuda_mixed_gemm` and dequant-then-matmul reference. `TestQMoEIntPrePackSmoke` covers the raw-weight `weights_prepacked=0` in-`PrePack` layout transform (smoke test: asserts finite output, not bit-parity). |
| [test_qmoe_cpu.py](onnxruntime/test/python/transformers/test_qmoe_cpu.py) | INT4/INT8 QMoE on CPU (smoke). |
| [test_qmoe_fp4_cuda.py](onnxruntime/test/python/transformers/test_qmoe_fp4_cuda.py) | MXFP4 QMoE: quantization utilities, packing, FP16/BF16, SiLU/SwiGLU, top-k and expert-count variants. End-to-end runs on SM120; on SM<120 the dequant fallback is exercised. |
| [test_qmoe_fp8_cuda.py](onnxruntime/test/python/transformers/test_qmoe_fp8_cuda.py) | FP8 W8A16 QMoE on SM90+ native path and SM<90 dequant fallback. |
Expand Down Expand Up @@ -954,6 +1014,11 @@ over-aligned by-value parameters.
cannot. See [§14.1](#141-msvc-and-tma-grouped-moe-gemm).
- **WFP4AFP8 native** requires SM100+ hardware; only the dequant fallback path
is validated end-to-end so far.
- **In-`PrePack` INT weight layout transform** (`weights_prepacked=0`) is
currently covered only by a smoke test (`TestQMoEIntPrePackSmoke`), not a
bit-parity check: the existing offline pre-pack harness hardcodes
`force_arch=80` (the same SM80 layout consumed by the CUDA EP on all GPUs),
so a separate parity harness for this path is still pending.
- **Hopper W4A8** (INT4 weight + FP8 activation) is not supported — TRT-LLM gates
its fast path to SM89 only.

Expand Down
Loading
Loading