Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 1 addition & 1 deletion docs/ContribOperators.md
Original file line number Diff line number Diff line change
Expand Up @@ -4949,7 +4949,7 @@ This version of the operator has been available since version 1 of the 'com.micr
<dt><tt>use_sparse_mixer</tt> : int</dt>
<dd>Whether to use sparse mixer</dd>
<dt><tt>weights_prepacked</tt> : int</dt>
<dd>Only meaningful when quant_type='int'. Tri-state control over whether the int4/int8 fc1/fc2 weight initializers are already laid out in the CUTLASS fpA_intB format expected by the runner. -1 (auto): let the execution provider choose its own backward-compatible default; the CUDA EP treats auto as prepacked. 1: the initializers are already prepacked (e.g. produced offline by pack_weights_for_cuda_mixed_gemm) and are consumed as-is. 0: the initializers are raw, un-prepacked [E, N, K/pack] tensors as produced by quantize_matmul_{4,8}bits; the kernel runs the CUTLASS layout transform itself in PrePack(), matching the behaviour of MatMulNBits and removing the offline pre-pack requirement from exporters. Defaults to -1 (auto) so each execution provider can pick its own backward-compatible default rather than the schema imposing one.</dd>
<dd>Only meaningful when quant_type='int'. Tri-state control over the layout of the int4/int8 fc1/fc2 weight initializers. The concrete prepacked layouts selected by -1 and 1 are determined by the execution provider. 0: the initializers are raw, un-prepacked [E, N, K/pack] tensors as produced by quantize_matmul_{4,8}bits. Defaults to -1.</dd>
</dl>

#### Inputs (6 - 21)
Expand Down
77 changes: 71 additions & 6 deletions docs/contrib_ops/cuda/moe_qmoe.md
Original file line number Diff line number Diff line change
Expand Up @@ -71,6 +71,7 @@ input tokens → router (top-k softmax) → permute by expert
| `expert_weight_bits` (QMoE only) | int | 4 | 4 (INT4/MXFP4) or 8 (INT8/FP8). |
| `block_size` (QMoE only) | int | -1 | Group size for INT4/INT8 group-wise quantization. -1 = per-output-channel. |
| `quant_type` (QMoE only) | string | `"int"` | `"int"`, `"fp4"`, `"fp8"`, `"wfp4afp8"`. See [§3](#3-quantization-modes). |
| `weights_prepacked` (QMoE only) | int | -1 | Tri-state, only meaningful when `quant_type="int"`. The prepacked layouts selected by `-1` and `1` are **EP-determined**. `-1` (default): the INT4/INT8 `fc1`/`fc2` initializers are already prepacked in the EP's default layout (e.g. from `pack_weights_for_cuda_mixed_gemm` for the CUDA EP). `1`: already prepacked in an alternate EP-selected layout. `0`: the initializers are raw `[E, N, K/pack]` tensors (as produced by `quantize_matmul_{4,8}bits`) and the kernel runs the CUTLASS layout transform in `PrePack()`. **Note:** the CUDA EP INT4/INT8 MoE GEMM always runs the Ampere (SM80) kernel — even on SM90 — so it consumes the SM80 `fpA_intB` layout on all architectures; `-1` and `1` are therefore equivalent for the CUDA EP today, and `1` is reserved for a possible future Hopper-specific layout. See [§5.1](#51-weights-input-2--5--8). |

### 2.2 Type Constraints

Expand Down Expand Up @@ -228,10 +229,53 @@ extra subtraction.

### 5.1 Weights (input 2 / 5 / 8)

Not transformed at runtime. INT4/INT8 weights must already be packed offline by
`pack_weights_for_cuda_mixed_gemm` (see [§6](#6-weight-formats)). MXFP4 weights
must be packed by `pack_fp4_weights_for_cuda_moe_gemm`. FP8 weights are stored
as raw e4m3 bytes (no packing).
**INT4/INT8** weight layout is controlled by the `weights_prepacked` attribute
([§2.1](#21-attributes)). The prepacked layouts selected by `-1` and `1` are
determined by the execution provider:

- **`weights_prepacked=-1` (default)** — the `fc1`/`fc2` weights are already in
the EP's default prepacked layout (e.g. packed offline by
`pack_weights_for_cuda_mixed_gemm` for the CUDA EP). They are copied to GPU
and consumed as-is.
- **`weights_prepacked=1`** — the `fc1`/`fc2` weights are already in the EP's
**SM90** (Hopper) prepacked layout (reserved; see the note below).
- **`weights_prepacked=0`** — the `fc1`/`fc2` weights are raw, schema-conformant
`[E, N, K/pack]` tensors as produced by `quantize_matmul_{4,8}bits`. `PrePack`
runs the CUTLASS layout transform itself via `PrePackIntExpertWeights`,
removing the offline pre-pack dependency. This makes integer QMoE symmetric
with `MatMulNBits::PrePack_B`.

> **Single layout on the CUDA EP.** The CUDA EP INT4/INT8 MoE GEMM always
> dispatches to the Ampere (**SM80**) grouped-GEMM kernel — even on SM90 —
> because mixed int-weight + fp16/bf16 activation is not a valid Hopper TMA
> warp-specialized specialisation (`isValidHopperMOESpecialisation` is `false`).
> This matches **TensorRT-LLM**, which likewise routes `W4A16`/`W8A16` MoE to the
> SM80 kernel on Hopper; its Hopper TMA-WS mixed-dtype MoE kernel is reserved for
> `W4A8` (FP8 activation) and `WFP4A16` (FP4 weight). Consequently the CUDA EP
> consumes the **SM80 `fpA_intB` layout on every GPU**, `PrePack` always packs
> for SM80, and `weights_prepacked=-1` and `=1` are equivalent today. `1` is
> accepted and reserved for a possible future Hopper-specific layout (e.g.
> `W4A8`). There is therefore no architecture-match constraint: SM80-format
> weights run correctly on SM90 via the SM80 kernel.

`PrePackIntExpertWeights` loops over the `E` experts and, per expert, applies the
same transpose + row-permutation / column-interleave / bias / pair-interleave
transform as `pack_weights_for_cuda_mixed_gemm` (see [§6.1](#61-int4-group-wise-quant_typeint-expert_weight_bits4)),
always targeting the SM80 layout. SM75+ is required. The source
`[E, N, K/pack]` initializers are released after their shapes are cached
(`fc1_weights_shape_` / `fc2_weights_shape_`), so peak weight memory stays ~1×.
The prepacked GPU buffers (`packed_fc1_weights_` / `packed_fc2_weights_`) are then
preferred by `ComputeInternal`. If prepacking is disabled at the session level
(`session.disable_prepacking`), the buffers stay null and the raw initializer
pointers are read at compute time instead.

> **Note**: `weights_prepacked=0` is the only path that triggers an in-`PrePack`
> layout transform for INT weights. FP4 / FP8 / WFP4AFP8 weight handling is
> unaffected.

MXFP4 weights must be packed by `pack_fp4_weights_for_cuda_moe_gemm`. FP8 weights
are stored as raw e4m3 bytes (no packing).


### 5.2 INT4/INT8 scales + zero-point → bias

Expand Down Expand Up @@ -287,7 +331,12 @@ This section covers the five distinct weight encodings supported by QMoE.
INT4 packing layout within a byte: `[high_nibble | low_nibble] = [elt_1 | elt_0]`.
Each INT4 element is in `[-8, 7]` (signed) before bias, `[0, 15]` after the +8 bias.

#### Preprocessing pipeline (offline, `pack_weights_for_cuda_mixed_gemm`)
#### Preprocessing pipeline (offline `pack_weights_for_cuda_mixed_gemm`, or in-`PrePack` via `PrePackIntExpertWeights`)

This is the layout transform applied either offline by
`pack_weights_for_cuda_mixed_gemm`, or per-expert inside `PrePack` when
`weights_prepacked=0` (see [§5.1](#51-weights-input-2--5--8)).


1. **Input layout**: `[N, K]` per expert (Out × In), 2 elements per byte for INT4.
2. **Transpose & signed conversion**:
Expand Down Expand Up @@ -405,6 +454,17 @@ weights are interchangeable across SMs:
— does not use `pack_weights_for_cuda_mixed_gemm`.
- **FP8**: no packing.

> **QMoE uses Group A on every GPU.** The table above describes the layouts the
> `pack_weights_for_cuda_mixed_gemm` *preprocessor* can emit. The QMoE INT4/INT8
> MoE GEMM, however, always dispatches to the Ampere (SM80) grouped-GEMM kernel —
> even on SM90 — because mixed int-weight + fp16/bf16 activation is not a valid
> Hopper TMA warp-specialized specialisation (the same is true in TensorRT-LLM).
> It therefore consumes the **Group A (SM80) layout on all architectures,
> including Hopper**. For QMoE, always pack INT4/INT8 weights for SM80 (`arch=80`),
> and `PrePackIntExpertWeights` (`weights_prepacked=0`) does exactly that
> regardless of the runtime device SM. Group B (SM90) layout is currently unused
> by QMoE.

---

## 8. SwiGLU Fusion
Expand Down Expand Up @@ -830,7 +890,7 @@ will not change the operator interface.
|-----------|----------|
| [test_moe_cuda.py](onnxruntime/test/python/transformers/test_moe_cuda.py) | Standard MoE on CUDA: FP16/BF16, SiLU/GeLU/SwiGLU, routing, GEMM parity. SwiGLU coverage includes both GPT-OSS (`TestSwigluMoE`: interleaved, alpha=1.702/beta=1.0/limit=7.0) and Standard/Llama-Gemma (`TestStandardSwigluMoE`: concatenated `swiglu_fusion=2`, alpha=1.0/beta=0.0/no limit → `SiLU(Gate)×Value`). |
| [test_moe_cpu.py](onnxruntime/test/python/transformers/test_moe_cpu.py) | Standard MoE on CPU (smoke). |
| [test_qmoe_cuda.py](onnxruntime/test/python/transformers/test_qmoe_cuda.py) | INT4/INT8 QMoE — primary regression signal for the production QMoE path. Exercises `pack_weights_for_cuda_mixed_gemm` and dequant-then-matmul reference. |
| [test_qmoe_cuda.py](onnxruntime/test/python/transformers/test_qmoe_cuda.py) | INT4/INT8 QMoE — primary regression signal for the production QMoE path. Exercises `pack_weights_for_cuda_mixed_gemm` and dequant-then-matmul reference. `TestQMoEIntPrePackSmoke` covers the raw-weight `weights_prepacked=0` in-`PrePack` layout transform (smoke test: asserts finite output, not bit-parity). |
| [test_qmoe_cpu.py](onnxruntime/test/python/transformers/test_qmoe_cpu.py) | INT4/INT8 QMoE on CPU (smoke). |
| [test_qmoe_fp4_cuda.py](onnxruntime/test/python/transformers/test_qmoe_fp4_cuda.py) | MXFP4 QMoE: quantization utilities, packing, FP16/BF16, SiLU/SwiGLU, top-k and expert-count variants. End-to-end runs on SM120; on SM<120 the dequant fallback is exercised. |
| [test_qmoe_fp8_cuda.py](onnxruntime/test/python/transformers/test_qmoe_fp8_cuda.py) | FP8 W8A16 QMoE on SM90+ native path and SM<90 dequant fallback. |
Expand Down Expand Up @@ -954,6 +1014,11 @@ over-aligned by-value parameters.
cannot. See [§14.1](#141-msvc-and-tma-grouped-moe-gemm).
- **WFP4AFP8 native** requires SM100+ hardware; only the dequant fallback path
is validated end-to-end so far.
- **In-`PrePack` INT weight layout transform** (`weights_prepacked=0`) is
currently covered only by a smoke test (`TestQMoEIntPrePackSmoke`), not a
bit-parity check: the existing offline pre-pack harness hardcodes
`force_arch=80` (the same SM80 layout consumed by the CUDA EP on all GPUs),
so a separate parity harness for this path is still pending.
- **Hopper W4A8** (INT4 weight + FP8 activation) is not supported — TRT-LLM gates
its fast path to SM89 only.

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -31,6 +31,8 @@ enum class QuantType {
W4_AFP8
};

int get_arch_for_mixed_gemm_weight_preprocess(int arch);

void preprocess_weights_for_mixed_gemm_cuda(cudaStream_t stream,
int arch,
int8_t* preprocessed_quantized_weight,
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -521,6 +521,19 @@ void add_bias_and_interleave_quantized_tensor_inplace_cuda(
}
}

int get_arch_for_mixed_gemm_weight_preprocess(int arch) {
ORT_ENFORCE(arch >= 75, "Unsupported CUDA architecture: ", arch);
if (arch < 80) {
return 75;
}
#ifndef EXCLUDE_SM_90
if (arch >= 90 && arch < 100) {
return 90;
}
#endif
return 80;
}

void preprocess_weights_for_mixed_gemm_cuda(cudaStream_t stream,
int arch,
int8_t* preprocessed_quantized_weight,
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -120,11 +120,11 @@ LayoutDetails getLayoutDetailsForArch(QuantType quant_type) {
}

LayoutDetails getLayoutDetailsForTransform(QuantType quant_type, int arch) {
ORT_ENFORCE(arch >= 75, "Unsupported CUDA architecture: ", arch);
if (arch < 80) {
arch = get_arch_for_mixed_gemm_weight_preprocess(arch);
if (arch == 75) {
return getLayoutDetailsForArch<cutlass::arch::Sm75>(quant_type);
#ifndef EXCLUDE_SM_90
} else if (arch >= 90 && arch < 100) {
} else if (arch == 90) {
return getLayoutDetailsForArch<cutlass::arch::Sm90>(quant_type);
#endif
} else {
Expand Down
58 changes: 32 additions & 26 deletions onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc
Original file line number Diff line number Diff line change
Expand Up @@ -62,18 +62,28 @@ QMoE::QMoE(const OpKernelInfo& op_kernel_info) : CudaKernel(op_kernel_info), MoE
this->quant_type_ = op_kernel_info.GetAttrOrDefault<std::string>("quant_type", "int");
ORT_ENFORCE(quant_type_ == "int" || quant_type_ == "fp4" || quant_type_ == "fp8" || quant_type_ == "wfp4afp8",
"quant_type must be 'int', 'fp4', 'fp8', or 'wfp4afp8', but got '", quant_type_, "'");
// ``weights_prepacked`` is an optional tri-state attribute that defaults to
// -1 (auto) in the schema, so each EP picks its own backward-compatible
// default rather than the schema imposing one:
// -1 (auto, also the schema default): the EP decides. The CUDA EP's
// backward-compatible default is "prepacked" because all pre-existing
// tooling ships CUTLASS-prepacked weights.
// 1: initializers are already prepacked; the compute path reads them as-is.
// 0: initializers are raw [E, N, K/pack]; the PrePack hook lays them out.
// ``weights_prepacked`` is an optional tri-state attribute (default -1) that
// declares the layout of the int4/int8 fc1/fc2 weight initializers. The
// concrete prepacked layouts selected by -1 and 1 are determined by the
// execution provider. The CUDA EP maps the tri-state as:
// -1 (default): already prepacked in the EP's default int weight layout.
// 1: already prepacked in an alternate EP-selected int weight layout.
// 0: raw [E, N, K/pack] initializers; the PrePack hook lays them out.
//
// Important: the CUDA QMoE int4/int8 MoE GEMM always dispatches to the
// Ampere (SM80) grouped-GEMM kernel -- even on SM90 -- because mixed
// int-weight + fp16/bf16 activation is not a valid Hopper TMA warp-specialized
// specialisation (see isValidHopperMOESpecialisation). The kernel therefore
// consumes the SM80/Ampere CUTLASS fpA_intB layout on every GPU. As a result
// the EP default (-1) is the SM80 layout regardless of the runtime device SM,
// and SM80-format weights are valid on SM90 (they run via the SM80 kernel).
// For CUDA today, -1 and 1 are equivalent (both SM80 layout), and 1 is
// reserved for a possible future Hopper-specific layout.
// PrePack (weights_prepacked=0) packs for the SM80 layout accordingly.
const int64_t weights_prepacked_mode =
op_kernel_info.GetAttrOrDefault<int64_t>("weights_prepacked", static_cast<int64_t>(-1));
ORT_ENFORCE(weights_prepacked_mode == -1 || weights_prepacked_mode == 0 || weights_prepacked_mode == 1,
"weights_prepacked must be -1 (auto), 0, or 1, but got ", weights_prepacked_mode);
"weights_prepacked must be -1 (default), 0, or 1, but got ", weights_prepacked_mode);
weights_prepacked_ = (weights_prepacked_mode != 0);
#if !defined(ENABLE_FP4) || !defined(USE_FP4_QMOE)
ORT_ENFORCE(quant_type_ != "fp4", "QMoE quant_type='fp4' requires USE_FP4_QMOE with CUDA 12.8 or newer.");
Expand Down Expand Up @@ -850,7 +860,7 @@ Status QMoE::ComputeInternal(OpKernelContext* context) const {
// PrePack converted the raw int4/int8 weights to the CUTLASS fpA_intB
// layout that the runner consumes and freed the source initializer
// (``is_packed = true``). Gate on ``int_weights_consumed_by_prepack``
// (which already requires ``packed_fc1_weights_ != nullptr``) rather than
// (which already requires both packed weight buffers) rather than
// just ``is_int && !weights_prepacked_``: when prepacking is disabled at
// the session level (``session.disable_prepacking``) PrePack never runs,
// the prepack buffers stay null, and the raw initializer pointers read
Expand Down Expand Up @@ -1146,6 +1156,9 @@ void QMoE::PrePackIntExpertWeights(const Tensor& tensor, cudaStream_t stream, Al
IAllocatorUniquePtr<void>& packed_buf, bool& is_packed) {
ORT_ENFORCE(expert_weight_bits_ == 4 || expert_weight_bits_ == 8,
"PrePackIntExpertWeights: only 4 and 8 bits are supported, got ", expert_weight_bits_);
ORT_ENFORCE(sm_ >= 75,
"PrePackIntExpertWeights: quant_type='int' with weights_prepacked=0 requires SM75+ CUDA hardware, got SM",
sm_);
const auto& shape = tensor.Shape();
ORT_ENFORCE(shape.NumDimensions() == 3,
"PrePackIntExpertWeights: expected 3-D weight tensor [E, N, K/pack], got ndim=",
Expand All @@ -1158,22 +1171,15 @@ void QMoE::PrePackIntExpertWeights(const Tensor& tensor, cudaStream_t stream, Al
const int64_t k_packed = shape[2];
const int64_t k = k_packed * pack_factor;

// Weight packing is architecture-aware (see
// docs/contrib_ops/cuda/moe_qmoe.md §7 "Cross-Architecture Packing
// Compatibility"). SM90 (Hopper) uses its own Permuted-Linear layout that
// skips column interleaving, so it is its own compatibility group. Every
// other supported arch — SM75/80/86/89 and SM100/120 (Blackwell) — shares
// the SM80 fpA_intB layout, so they all pack as SM80. SM70 and older lack
// INT8 LDSM and are unsupported. The compute-side runner selects the same
// layout from this clamped arch, so the two cannot drift.
//
// SM75 is passed through unchanged (rather than clamped to 80) even though it
// shares SM80's layout: the compute-side dispatch (getLayoutDetailsForTransform)
// still has a distinct SM75 branch, so mirroring it here avoids confusing a
// reader into thinking prepack and dispatch disagree.
ORT_ENFORCE(sm_ >= 75,
"QMoE int4/int8 weight prepack requires SM75 or newer, got sm=", sm_);
const int packing_sm = (sm_ == 90 || sm_ == 75) ? sm_ : 80;
// The CUDA QMoE int4/int8 MoE GEMM always dispatches to the Ampere (SM80)
// grouped-GEMM kernel -- even on SM90 -- because mixed int-weight + fp16/bf16
// is not a valid Hopper TMA warp-specialized specialisation. The kernel thus
// consumes the SM80 CUTLASS fpA_intB layout on every GPU, so the weights must
// always be preprocessed for SM80 regardless of the runtime device SM.
// (Using get_arch_for_mixed_gemm_weight_preprocess(sm_) here would emit the
// SM90 layout on Hopper, which the SM80 kernel cannot consume -> wrong output.)
const int packing_sm =
onnxruntime::llm::kernels::weight_only::get_arch_for_mixed_gemm_weight_preprocess(80);
Comment thread
justinchuby marked this conversation as resolved.

// Per-expert sizes.
const size_t per_expert_bytes = static_cast<size_t>(n) * static_cast<size_t>(k) / pack_factor;
Expand Down
Loading
Loading