diff --git a/docs/ContribOperators.md b/docs/ContribOperators.md
index 9d19a95136ad7..38d101786b41a 100644
--- a/docs/ContribOperators.md
+++ b/docs/ContribOperators.md
@@ -4949,7 +4949,7 @@ This version of the operator has been available since version 1 of the 'com.micr
use_sparse_mixer : int
Whether to use sparse mixer
weights_prepacked : int
-Only meaningful when quant_type='int'. Tri-state control over whether the int4/int8 fc1/fc2 weight initializers are already laid out in the CUTLASS fpA_intB format expected by the runner. -1 (auto): let the execution provider choose its own backward-compatible default; the CUDA EP treats auto as prepacked. 1: the initializers are already prepacked (e.g. produced offline by pack_weights_for_cuda_mixed_gemm) and are consumed as-is. 0: the initializers are raw, un-prepacked [E, N, K/pack] tensors as produced by quantize_matmul_{4,8}bits; the kernel runs the CUTLASS layout transform itself in PrePack(), matching the behaviour of MatMulNBits and removing the offline pre-pack requirement from exporters. Defaults to -1 (auto) so each execution provider can pick its own backward-compatible default rather than the schema imposing one.
+Only meaningful when quant_type='int'. Tri-state control over the layout of the int4/int8 fc1/fc2 weight initializers. The concrete prepacked layouts selected by -1 and 1 are determined by the execution provider. 0: the initializers are raw, un-prepacked [E, N, K/pack] tensors as produced by quantize_matmul_{4,8}bits. Defaults to -1.
#### Inputs (6 - 21)
diff --git a/docs/contrib_ops/cuda/moe_qmoe.md b/docs/contrib_ops/cuda/moe_qmoe.md
index 6d53211ff40cb..36b68889ae582 100644
--- a/docs/contrib_ops/cuda/moe_qmoe.md
+++ b/docs/contrib_ops/cuda/moe_qmoe.md
@@ -71,6 +71,7 @@ input tokens → router (top-k softmax) → permute by expert
| `expert_weight_bits` (QMoE only) | int | 4 | 4 (INT4/MXFP4) or 8 (INT8/FP8). |
| `block_size` (QMoE only) | int | -1 | Group size for INT4/INT8 group-wise quantization. -1 = per-output-channel. |
| `quant_type` (QMoE only) | string | `"int"` | `"int"`, `"fp4"`, `"fp8"`, `"wfp4afp8"`. See [§3](#3-quantization-modes). |
+| `weights_prepacked` (QMoE only) | int | -1 | Tri-state, only meaningful when `quant_type="int"`. The prepacked layouts selected by `-1` and `1` are **EP-determined**. `-1` (default): the INT4/INT8 `fc1`/`fc2` initializers are already prepacked in the EP's default layout (e.g. from `pack_weights_for_cuda_mixed_gemm` for the CUDA EP). `1`: already prepacked in an alternate EP-selected layout. `0`: the initializers are raw `[E, N, K/pack]` tensors (as produced by `quantize_matmul_{4,8}bits`) and the kernel runs the CUTLASS layout transform in `PrePack()`. **Note:** the CUDA EP INT4/INT8 MoE GEMM always runs the Ampere (SM80) kernel — even on SM90 — so it consumes the SM80 `fpA_intB` layout on all architectures; `-1` and `1` are therefore equivalent for the CUDA EP today, and `1` is reserved for a possible future Hopper-specific layout. See [§5.1](#51-weights-input-2--5--8). |
### 2.2 Type Constraints
@@ -228,10 +229,53 @@ extra subtraction.
### 5.1 Weights (input 2 / 5 / 8)
-Not transformed at runtime. INT4/INT8 weights must already be packed offline by
-`pack_weights_for_cuda_mixed_gemm` (see [§6](#6-weight-formats)). MXFP4 weights
-must be packed by `pack_fp4_weights_for_cuda_moe_gemm`. FP8 weights are stored
-as raw e4m3 bytes (no packing).
+**INT4/INT8** weight layout is controlled by the `weights_prepacked` attribute
+([§2.1](#21-attributes)). The prepacked layouts selected by `-1` and `1` are
+determined by the execution provider:
+
+- **`weights_prepacked=-1` (default)** — the `fc1`/`fc2` weights are already in
+ the EP's default prepacked layout (e.g. packed offline by
+ `pack_weights_for_cuda_mixed_gemm` for the CUDA EP). They are copied to GPU
+ and consumed as-is.
+- **`weights_prepacked=1`** — the `fc1`/`fc2` weights are already in an alternate
+ EP-selected prepacked layout (on the CUDA EP, same as `-1` today; see the note below).
+- **`weights_prepacked=0`** — the `fc1`/`fc2` weights are raw, schema-conformant
+ `[E, N, K/pack]` tensors as produced by `quantize_matmul_{4,8}bits`. `PrePack`
+ runs the CUTLASS layout transform itself via `PrePackIntExpertWeights`,
+ removing the offline pre-pack dependency. This makes integer QMoE symmetric
+ with `MatMulNBits::PrePack_B`.
+
+> **Single layout on the CUDA EP.** The CUDA EP INT4/INT8 MoE GEMM always
+> dispatches to the Ampere (**SM80**) grouped-GEMM kernel — even on SM90 —
+> because mixed int-weight + fp16/bf16 activation is not a valid Hopper TMA
+> warp-specialized specialisation (`isValidHopperMOESpecialisation` is `false`).
+> This matches **TensorRT-LLM**, which likewise routes `W4A16`/`W8A16` MoE to the
+> SM80 kernel on Hopper; its Hopper TMA-WS mixed-dtype MoE kernel is reserved for
+> `W4A8` (FP8 activation) and `WFP4A16` (FP4 weight). Consequently the CUDA EP
+> consumes the **SM80 `fpA_intB` layout on every GPU**, `PrePack` always packs
+> for SM80, and `weights_prepacked=-1` and `=1` are equivalent today. `1` is
+> accepted and reserved for a possible future Hopper-specific layout (e.g.
+> `W4A8`). There is therefore no architecture-match constraint: SM80-format
+> weights run correctly on SM90 via the SM80 kernel.
+
+`PrePackIntExpertWeights` loops over the `E` experts and, per expert, applies the
+same transpose + row-permutation / column-interleave / bias / pair-interleave
+transform as `pack_weights_for_cuda_mixed_gemm` (see [§6.1](#61-int4-group-wise-quant_typeint-expert_weight_bits4)),
+always targeting the SM80 layout. SM75+ is required. The source
+`[E, N, K/pack]` initializers are released after their shapes are cached
+(`fc1_weights_shape_` / `fc2_weights_shape_`), so peak weight memory stays ~1×.
+The prepacked GPU buffers (`packed_fc1_weights_` / `packed_fc2_weights_`) are then
+preferred by `ComputeInternal`. If prepacking is disabled at the session level
+(`session.disable_prepacking`), the buffers stay null and the raw initializer
+pointers are read at compute time instead.
+
+> **Note**: `weights_prepacked=0` is the only path that triggers an in-`PrePack`
+> layout transform for INT weights. FP4 / FP8 / WFP4AFP8 weight handling is
+> unaffected.
+
+MXFP4 weights must be packed by `pack_fp4_weights_for_cuda_moe_gemm`. FP8 weights
+are stored as raw e4m3 bytes (no packing).
+
### 5.2 INT4/INT8 scales + zero-point → bias
@@ -287,7 +331,12 @@ This section covers the five distinct weight encodings supported by QMoE.
INT4 packing layout within a byte: `[high_nibble | low_nibble] = [elt_1 | elt_0]`.
Each INT4 element is in `[-8, 7]` (signed) before bias, `[0, 15]` after the +8 bias.
-#### Preprocessing pipeline (offline, `pack_weights_for_cuda_mixed_gemm`)
+#### Preprocessing pipeline (offline `pack_weights_for_cuda_mixed_gemm`, or in-`PrePack` via `PrePackIntExpertWeights`)
+
+This is the layout transform applied either offline by
+`pack_weights_for_cuda_mixed_gemm`, or per-expert inside `PrePack` when
+`weights_prepacked=0` (see [§5.1](#51-weights-input-2--5--8)).
+
1. **Input layout**: `[N, K]` per expert (Out × In), 2 elements per byte for INT4.
2. **Transpose & signed conversion**:
@@ -405,6 +454,17 @@ weights are interchangeable across SMs:
— does not use `pack_weights_for_cuda_mixed_gemm`.
- **FP8**: no packing.
+> **QMoE uses Group A on every GPU.** The table above describes the layouts the
+> `pack_weights_for_cuda_mixed_gemm` *preprocessor* can emit. The QMoE INT4/INT8
+> MoE GEMM, however, always dispatches to the Ampere (SM80) grouped-GEMM kernel —
+> even on SM90 — because mixed int-weight + fp16/bf16 activation is not a valid
+> Hopper TMA warp-specialized specialisation (the same is true in TensorRT-LLM).
+> It therefore consumes the **Group A (SM80) layout on all architectures,
+> including Hopper**. For QMoE, always pack INT4/INT8 weights for SM80
+> (`force_arch=80`); `PrePackIntExpertWeights` (`weights_prepacked=0`) does exactly
+> that regardless of the runtime device SM. The Group B (SM90) layout is currently
+> unused by QMoE.
+
---
## 8. SwiGLU Fusion
@@ -830,7 +890,7 @@ will not change the operator interface.
|-----------|----------|
| [test_moe_cuda.py](onnxruntime/test/python/transformers/test_moe_cuda.py) | Standard MoE on CUDA: FP16/BF16, SiLU/GeLU/SwiGLU, routing, GEMM parity. SwiGLU coverage includes both GPT-OSS (`TestSwigluMoE`: interleaved, alpha=1.702/beta=1.0/limit=7.0) and Standard/Llama-Gemma (`TestStandardSwigluMoE`: concatenated `swiglu_fusion=2`, alpha=1.0/beta=0.0/no limit → `SiLU(Gate)×Value`). |
| [test_moe_cpu.py](onnxruntime/test/python/transformers/test_moe_cpu.py) | Standard MoE on CPU (smoke). |
-| [test_qmoe_cuda.py](onnxruntime/test/python/transformers/test_qmoe_cuda.py) | INT4/INT8 QMoE — primary regression signal for the production QMoE path. Exercises `pack_weights_for_cuda_mixed_gemm` and dequant-then-matmul reference. |
+| [test_qmoe_cuda.py](onnxruntime/test/python/transformers/test_qmoe_cuda.py) | INT4/INT8 QMoE — primary regression signal for the production QMoE path. Exercises `pack_weights_for_cuda_mixed_gemm` and dequant-then-matmul reference. `TestQMoEIntPrePackSmoke` covers the raw-weight `weights_prepacked=0` in-`PrePack` layout transform (smoke test: asserts finite output, not bit-parity). |
| [test_qmoe_cpu.py](onnxruntime/test/python/transformers/test_qmoe_cpu.py) | INT4/INT8 QMoE on CPU (smoke). |
| [test_qmoe_fp4_cuda.py](onnxruntime/test/python/transformers/test_qmoe_fp4_cuda.py) | MXFP4 QMoE: quantization utilities, packing, FP16/BF16, SiLU/SwiGLU, top-k and expert-count variants. End-to-end runs on SM120; on SM<120 the dequant fallback is exercised. |
| [test_qmoe_fp8_cuda.py](onnxruntime/test/python/transformers/test_qmoe_fp8_cuda.py) | FP8 W8A16 QMoE on SM90+ native path and SM<90 dequant fallback. |
@@ -954,6 +1014,11 @@ over-aligned by-value parameters.
cannot. See [§14.1](#141-msvc-and-tma-grouped-moe-gemm).
- **WFP4AFP8 native** requires SM100+ hardware; only the dequant fallback path
is validated end-to-end so far.
+- **In-`PrePack` INT weight layout transform** (`weights_prepacked=0`) is
+ currently covered only by a smoke test (`TestQMoEIntPrePackSmoke`), not a
+ bit-parity check: the existing offline pre-pack harness hardcodes
+ `force_arch=80` (the same SM80 layout consumed by the CUDA EP on all GPUs),
+ so a separate parity harness for this path is still pending.
- **Hopper W4A8** (INT4 weight + FP8 activation) is not supported — TRT-LLM gates
its fast path to SM89 only.
diff --git a/onnxruntime/contrib_ops/cuda/llm/fpA_intB_gemm_preprocessors.h b/onnxruntime/contrib_ops/cuda/llm/fpA_intB_gemm_preprocessors.h
index b9e62443145e5..c3b734816cf84 100644
--- a/onnxruntime/contrib_ops/cuda/llm/fpA_intB_gemm_preprocessors.h
+++ b/onnxruntime/contrib_ops/cuda/llm/fpA_intB_gemm_preprocessors.h
@@ -31,6 +31,8 @@ enum class QuantType {
W4_AFP8
};
+int get_arch_for_mixed_gemm_weight_preprocess(int arch);
+
void preprocess_weights_for_mixed_gemm_cuda(cudaStream_t stream,
int arch,
int8_t* preprocessed_quantized_weight,
diff --git a/onnxruntime/contrib_ops/cuda/llm/fpA_intB_gemm_preprocessors_impl.cu b/onnxruntime/contrib_ops/cuda/llm/fpA_intB_gemm_preprocessors_impl.cu
index a006612ddadc9..7e83bdda72eab 100644
--- a/onnxruntime/contrib_ops/cuda/llm/fpA_intB_gemm_preprocessors_impl.cu
+++ b/onnxruntime/contrib_ops/cuda/llm/fpA_intB_gemm_preprocessors_impl.cu
@@ -521,6 +521,19 @@ void add_bias_and_interleave_quantized_tensor_inplace_cuda(
}
}
+int get_arch_for_mixed_gemm_weight_preprocess(int arch) {
+ ORT_ENFORCE(arch >= 75, "Unsupported CUDA architecture: ", arch);
+ if (arch < 80) {
+ return 75;
+ }
+#ifndef EXCLUDE_SM_90
+ if (arch >= 90 && arch < 100) {
+ return 90;
+ }
+#endif
+ return 80;
+}
+
void preprocess_weights_for_mixed_gemm_cuda(cudaStream_t stream,
int arch,
int8_t* preprocessed_quantized_weight,
diff --git a/onnxruntime/contrib_ops/cuda/llm/fpA_intB_gemm_preprocessors_impl.h b/onnxruntime/contrib_ops/cuda/llm/fpA_intB_gemm_preprocessors_impl.h
index a8fb411ed0663..47bbe0c0e10ec 100644
--- a/onnxruntime/contrib_ops/cuda/llm/fpA_intB_gemm_preprocessors_impl.h
+++ b/onnxruntime/contrib_ops/cuda/llm/fpA_intB_gemm_preprocessors_impl.h
@@ -120,11 +120,11 @@ LayoutDetails getLayoutDetailsForArch(QuantType quant_type) {
}
LayoutDetails getLayoutDetailsForTransform(QuantType quant_type, int arch) {
- ORT_ENFORCE(arch >= 75, "Unsupported CUDA architecture: ", arch);
- if (arch < 80) {
+ arch = get_arch_for_mixed_gemm_weight_preprocess(arch);
+ if (arch == 75) {
return getLayoutDetailsForArch(quant_type);
#ifndef EXCLUDE_SM_90
- } else if (arch >= 90 && arch < 100) {
+ } else if (arch == 90) {
return getLayoutDetailsForArch(quant_type);
#endif
} else {
diff --git a/onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc b/onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc
index e1ddcac0cea4f..7d1291e004d78 100644
--- a/onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc
+++ b/onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc
@@ -62,18 +62,28 @@ QMoE::QMoE(const OpKernelInfo& op_kernel_info) : CudaKernel(op_kernel_info), MoE
this->quant_type_ = op_kernel_info.GetAttrOrDefault<std::string>("quant_type", "int");
ORT_ENFORCE(quant_type_ == "int" || quant_type_ == "fp4" || quant_type_ == "fp8" || quant_type_ == "wfp4afp8",
"quant_type must be 'int', 'fp4', 'fp8', or 'wfp4afp8', but got '", quant_type_, "'");
- // ``weights_prepacked`` is an optional tri-state attribute that defaults to
- // -1 (auto) in the schema, so each EP picks its own backward-compatible
- // default rather than the schema imposing one:
- // -1 (auto, also the schema default): the EP decides. The CUDA EP's
- // backward-compatible default is "prepacked" because all pre-existing
- // tooling ships CUTLASS-prepacked weights.
- // 1: initializers are already prepacked; the compute path reads them as-is.
- // 0: initializers are raw [E, N, K/pack]; the PrePack hook lays them out.
+ // ``weights_prepacked`` is an optional tri-state attribute (default -1) that
+ // declares the layout of the int4/int8 fc1/fc2 weight initializers. The
+ // concrete prepacked layouts selected by -1 and 1 are determined by the
+ // execution provider. The CUDA EP maps the tri-state as:
+ // -1 (default): already prepacked in the EP's default int weight layout.
+ // 1: already prepacked in an alternate EP-selected int weight layout.
+ // 0: raw [E, N, K/pack] initializers; the PrePack hook lays them out.
+ //
+ // Important: the CUDA QMoE int4/int8 MoE GEMM always dispatches to the
+ // Ampere (SM80) grouped-GEMM kernel -- even on SM90 -- because mixed
+ // int-weight + fp16/bf16 activation is not a valid Hopper TMA warp-specialized
+ // specialisation (see isValidHopperMOESpecialisation). The kernel therefore
+ // consumes the SM80/Ampere CUTLASS fpA_intB layout on every GPU. As a result
+ // the EP default (-1) is the SM80 layout regardless of the runtime device SM,
+ // and SM80-format weights are valid on SM90 (they run via the SM80 kernel).
+ // For CUDA today, -1 and 1 are equivalent (both SM80 layout), and 1 is
+ // reserved for a possible future Hopper-specific layout.
+ // PrePack (weights_prepacked=0) packs for the SM80 layout accordingly.
const int64_t weights_prepacked_mode =
op_kernel_info.GetAttrOrDefault("weights_prepacked", static_cast<int64_t>(-1));
ORT_ENFORCE(weights_prepacked_mode == -1 || weights_prepacked_mode == 0 || weights_prepacked_mode == 1,
- "weights_prepacked must be -1 (auto), 0, or 1, but got ", weights_prepacked_mode);
+ "weights_prepacked must be -1 (default), 0, or 1, but got ", weights_prepacked_mode);
weights_prepacked_ = (weights_prepacked_mode != 0);
#if !defined(ENABLE_FP4) || !defined(USE_FP4_QMOE)
ORT_ENFORCE(quant_type_ != "fp4", "QMoE quant_type='fp4' requires USE_FP4_QMOE with CUDA 12.8 or newer.");
@@ -850,7 +860,7 @@ Status QMoE::ComputeInternal(OpKernelContext* context) const {
// PrePack converted the raw int4/int8 weights to the CUTLASS fpA_intB
// layout that the runner consumes and freed the source initializer
// (``is_packed = true``). Gate on ``int_weights_consumed_by_prepack``
- // (which already requires ``packed_fc1_weights_ != nullptr``) rather than
+ // (which already requires both packed weight buffers) rather than
// just ``is_int && !weights_prepacked_``: when prepacking is disabled at
// the session level (``session.disable_prepacking``) PrePack never runs,
// the prepack buffers stay null, and the raw initializer pointers read
@@ -1146,6 +1156,9 @@ void QMoE::PrePackIntExpertWeights(const Tensor& tensor, cudaStream_t stream, Al
IAllocatorUniquePtr& packed_buf, bool& is_packed) {
ORT_ENFORCE(expert_weight_bits_ == 4 || expert_weight_bits_ == 8,
"PrePackIntExpertWeights: only 4 and 8 bits are supported, got ", expert_weight_bits_);
+ ORT_ENFORCE(sm_ >= 75,
+ "PrePackIntExpertWeights: quant_type='int' with weights_prepacked=0 requires SM75+ CUDA hardware, got SM",
+ sm_);
const auto& shape = tensor.Shape();
ORT_ENFORCE(shape.NumDimensions() == 3,
"PrePackIntExpertWeights: expected 3-D weight tensor [E, N, K/pack], got ndim=",
@@ -1158,22 +1171,15 @@ void QMoE::PrePackIntExpertWeights(const Tensor& tensor, cudaStream_t stream, Al
const int64_t k_packed = shape[2];
const int64_t k = k_packed * pack_factor;
- // Weight packing is architecture-aware (see
- // docs/contrib_ops/cuda/moe_qmoe.md §7 "Cross-Architecture Packing
- // Compatibility"). SM90 (Hopper) uses its own Permuted-Linear layout that
- // skips column interleaving, so it is its own compatibility group. Every
- // other supported arch — SM75/80/86/89 and SM100/120 (Blackwell) — shares
- // the SM80 fpA_intB layout, so they all pack as SM80. SM70 and older lack
- // INT8 LDSM and are unsupported. The compute-side runner selects the same
- // layout from this clamped arch, so the two cannot drift.
- //
- // SM75 is passed through unchanged (rather than clamped to 80) even though it
- // shares SM80's layout: the compute-side dispatch (getLayoutDetailsForTransform)
- // still has a distinct SM75 branch, so mirroring it here avoids confusing a
- // reader into thinking prepack and dispatch disagree.
- ORT_ENFORCE(sm_ >= 75,
- "QMoE int4/int8 weight prepack requires SM75 or newer, got sm=", sm_);
- const int packing_sm = (sm_ == 90 || sm_ == 75) ? sm_ : 80;
+ // The CUDA QMoE int4/int8 MoE GEMM always dispatches to the Ampere (SM80)
+ // grouped-GEMM kernel -- even on SM90 -- because mixed int-weight + fp16/bf16
+ // is not a valid Hopper TMA warp-specialized specialisation. The kernel thus
+ // consumes the SM80 CUTLASS fpA_intB layout on every GPU, so the weights must
+ // always be preprocessed for SM80 regardless of the runtime device SM.
+ // (Using get_arch_for_mixed_gemm_weight_preprocess(sm_) here would emit the
+ // SM90 layout on Hopper, which the SM80 kernel cannot consume -> wrong output.)
+ const int packing_sm =
+ onnxruntime::llm::kernels::weight_only::get_arch_for_mixed_gemm_weight_preprocess(80);
// Per-expert sizes.
const size_t per_expert_bytes = static_cast<size_t>(n) * static_cast<size_t>(k) / pack_factor;
diff --git a/onnxruntime/contrib_ops/cuda/moe/moe_quantization.h b/onnxruntime/contrib_ops/cuda/moe/moe_quantization.h
index 5722ac41cc470..2bbadc205b5d8 100644
--- a/onnxruntime/contrib_ops/cuda/moe/moe_quantization.h
+++ b/onnxruntime/contrib_ops/cuda/moe/moe_quantization.h
@@ -46,16 +46,23 @@ class QMoE final : public CudaKernel, public MoEBase {
IAllocatorUniquePtr& packed_buf, bool& is_packed);
int64_t expert_weight_bits_;
bool is_fp16_;
- // When true (the schema default), the int4/int8 fc1/fc2 weight
- // initializers are already in the CUTLASS fpA_intB layout — produced
- // offline e.g. via ``pack_weights_for_cuda_mixed_gemm`` — and the
- // compute path reads them as-is. When false, the raw schema-conformant
- // ``[E, N, K/pack]`` layout (as produced by
- // ``quantize_matmul_{4,8}bits``) is rewritten inside the PrePack hook
- // via ``PrePackIntExpertWeights``, removing the offline prepack
- // dependency. Only meaningful when ``quant_type_ == "int"``. Derived from
- // the optional tri-state ``weights_prepacked`` attribute: -1/auto (or
- // absent) maps to true on the CUDA EP, 1 maps to true, 0 maps to false.
+ // When true, the int4/int8 fc1/fc2 weight initializers are already in a
+ // CUTLASS fpA_intB layout — produced offline e.g. via
+ // ``pack_weights_for_cuda_mixed_gemm`` — and the compute path reads them
+ // as-is. When false, the raw schema-conformant ``[E, N, K/pack]`` layout
+ // (as produced by ``quantize_matmul_{4,8}bits``) is rewritten inside the
+ // PrePack hook via ``PrePackIntExpertWeights``, removing the offline
+ // prepack dependency. Only meaningful when ``quant_type_ == "int"``.
+ // Derived from the optional tri-state ``weights_prepacked`` attribute:
+ // -1 (default) and 1 both map to true; 0 maps to false. The concrete
+ // prepacked layouts selected by -1 and 1 are determined by the execution
+ // provider. For the CUDA EP the int4/int8 MoE GEMM always dispatches to the
+ // Ampere (SM80) grouped-GEMM kernel -- even on SM90 -- because mixed
+ // int-weight + fp16/bf16 activation is not a valid Hopper TMA warp-specialized
+ // specialisation (matches TensorRT-LLM, which also routes W4A16/W8A16 MoE to
+ // the SM80 kernel on Hopper). The kernel therefore consumes the SM80 fpA_intB
+ // layout on every GPU, so -1 and 1 are currently equivalent for the CUDA EP;
+ // 1 is reserved for a possible future Hopper-specific layout (e.g. W4A8).
bool weights_prepacked_ = true;
// Cached source weight shapes captured at PrePack time. When the
// PrePack hook consumed and released the original int4/int8 weight
diff --git a/onnxruntime/core/graph/contrib_ops/contrib_defs.cc b/onnxruntime/core/graph/contrib_ops/contrib_defs.cc
index 1054fd94ef423..f3f2f521ecab2 100644
--- a/onnxruntime/core/graph/contrib_ops/contrib_defs.cc
+++ b/onnxruntime/core/graph/contrib_ops/contrib_defs.cc
@@ -1520,18 +1520,10 @@ ONNX_MS_OPERATOR_SET_SCHEMA(
AttributeProto::STRING,
std::string("int"))
.Attr("weights_prepacked",
- "Only meaningful when quant_type='int'. Tri-state control over whether the "
- "int4/int8 fc1/fc2 weight initializers are already laid out in the CUTLASS "
- "fpA_intB format expected by the runner. -1 (auto): let the execution provider "
- "choose its own backward-compatible default; the CUDA EP treats auto as "
- "prepacked. 1: the initializers are already prepacked (e.g. produced offline by "
- "pack_weights_for_cuda_mixed_gemm) and are consumed as-is. 0: the initializers "
- "are raw, un-prepacked [E, N, K/pack] tensors as produced by "
- "quantize_matmul_{4,8}bits; the kernel runs the CUTLASS layout transform itself "
- "in PrePack(), matching the behaviour of MatMulNBits and removing the offline "
- "pre-pack requirement from exporters. Defaults to -1 (auto) so each execution "
- "provider can pick its own backward-compatible default rather than the schema "
- "imposing one.",
+ "Only meaningful when quant_type='int'. Tri-state control over the layout of the "
+ "int4/int8 fc1/fc2 weight initializers. The concrete prepacked layouts selected by "
+ "-1 and 1 are determined by the execution provider. 0: the initializers are raw, "
+ "un-prepacked [E, N, K/pack] tensors as produced by quantize_matmul_{4,8}bits. Defaults to -1.",
AttributeProto::INT,
static_cast<int64_t>(-1))
.Input(0,
diff --git a/onnxruntime/python/onnxruntime_pybind_quant.cc b/onnxruntime/python/onnxruntime_pybind_quant.cc
index 5b1d590a06234..7220153b4fa17 100644
--- a/onnxruntime/python/onnxruntime_pybind_quant.cc
+++ b/onnxruntime/python/onnxruntime_pybind_quant.cc
@@ -16,7 +16,6 @@
#endif
#include
#include
-#include
#include
namespace pybind11 {
@@ -252,17 +251,8 @@ py::array_t PackWeightsForMixedGemm(
cudaDeviceProp device_prop;
ThrowIfCudaError(cudaGetDeviceProperties(&device_prop, device_id), "cudaGetDeviceProperties");
sm = device_prop.major * 10 + device_prop.minor;
- } else {
- // Validate force_arch against the SM versions for which preprocess_weights_for_mixed_gemm_cuda
- // has tile/permutation tables. Unknown SMs would silently produce incorrect weight layouts.
- static const std::set<int> kSupportedSm = {75, 80, 90};
- if (kSupportedSm.find(sm) == kSupportedSm.end()) {
- std::ostringstream oss;
- oss << "force_arch=" << sm << " is not a supported SM version. "
- << "Pass -1 for auto-detect, or one of: 75, 80, 90 (arch > 90 will fallback to 80).";
- throw std::invalid_argument(oss.str());
- }
}
+ sm = ::onnxruntime::llm::kernels::weight_only::get_arch_for_mixed_gemm_weight_preprocess(sm);
auto permutation_map_buffer = make_cuda_ptr(32 * sizeof(int32_t));
diff --git a/onnxruntime/test/python/transformers/test_moe_cuda.py b/onnxruntime/test/python/transformers/test_moe_cuda.py
index c5fc826a5a6ed..9677542270a53 100644
--- a/onnxruntime/test/python/transformers/test_moe_cuda.py
+++ b/onnxruntime/test/python/transformers/test_moe_cuda.py
@@ -152,7 +152,10 @@ def quant_dequant(weights, is_4_bit_quantization: bool = True):
q_weight_reshaped = q_weight.reshape(n, -1)
# Pack weights for CUDA mixed-gemm kernel (FpA_IntB format), and qMoE kernel uses the same format.
- processed_q_weight = _quantize.pack_weights_for_cuda_mixed_gemm(q_weight_reshaped, n, k, 4)
+ # Pin arch=80: the QMoE grouped MoE GEMM always runs the Ampere (SM80) kernel -- even on SM90 --
+ # so it consumes the SM80 (column-interleaved) layout on every GPU. Auto-detect (force_arch=-1)
+ # would emit the non-interleaved SM90 layout on Hopper and produce wrong results.
+ processed_q_weight = _quantize.pack_weights_for_cuda_mixed_gemm(q_weight_reshaped, n, k, 4, 80)
# So we need to DEQUANTIZE back to get `result`.
# scale is [n, block_per_k]
@@ -232,8 +235,11 @@ def quant_dequant(weights, is_4_bit_quantization: bool = True):
)
q_weight_reshaped = q_weight.reshape(n, -1)
- # Pack weights for CUDA mixed-gemm kernel (FpA_IntB format)
- processed_q_weight = _quantize.pack_weights_for_cuda_mixed_gemm(q_weight_reshaped, n, k, 8)
+ # Pack weights for CUDA mixed-gemm kernel (FpA_IntB format).
+ # Pin arch=80: the QMoE grouped MoE GEMM always runs the Ampere (SM80) kernel -- even on SM90 --
+ # so it consumes the SM80 (column-interleaved) layout on every GPU. Auto-detect (force_arch=-1)
+ # would emit the non-interleaved SM90 layout on Hopper and produce wrong results.
+ processed_q_weight = _quantize.pack_weights_for_cuda_mixed_gemm(q_weight_reshaped, n, k, 8, 80)
# Dequantize for reference
# (q - 128) * scale if using 128 offset? or (q) * scale if symmetric around 0?
@@ -1084,8 +1090,8 @@ def parity_check(self):
ort_dtype_quant_bits_tolerance_map = {
"FP32:0": (5e-3, 1e-3),
"FP16:0": (0.3, 0.05),
- "FP16:4": (3.0, 1e-2),
- "FP16:8": (2.0, 1e-2),
+ "FP16:4": (0.5, 1e-2),
+ "FP16:8": (0.5, 1e-2),
"BF16:0": (1.0, 1e-2),
"BF16:4": (30.0, 1e-1),
"BF16:8": (20.0, 1e-1),
diff --git a/onnxruntime/test/python/transformers/test_qmoe_cuda.py b/onnxruntime/test/python/transformers/test_qmoe_cuda.py
index 993716a4c80b0..c56383d2851d3 100644
--- a/onnxruntime/test/python/transformers/test_qmoe_cuda.py
+++ b/onnxruntime/test/python/transformers/test_qmoe_cuda.py
@@ -137,43 +137,6 @@ def print_diff_statistics(diff_tensor: torch.Tensor, prefix: str = ""):
)
-def preprocess_weights_for_mixed_gemm(
- tensor: torch.Tensor, quant_bits: int, sm: int = -1, do_weight_interleave: bool = True
-) -> torch.Tensor:
- if len(tensor.shape) == 2:
- tensor = tensor.unsqueeze(0)
-
- # Input tensor shape is [Experts, n, k_packed]. k_packed is k/2 for 4-bit, k for 8-bit.
- num_experts = tensor.shape[0]
- n = tensor.shape[1]
- k_packed = tensor.shape[2]
- k = k_packed * 2 if quant_bits == 4 else k_packed
-
- packed_list = []
-
- if _pybind and hasattr(_pybind, "pack_weights_for_cuda_mixed_gemm") and torch.cuda.is_available():
- for i in range(num_experts):
- if tensor[i].dtype == torch.bfloat16:
- weight = tensor[i].to(torch.float32).cpu().numpy()
- else:
- weight = tensor[i].cpu().numpy()
- packed = _pybind.pack_weights_for_cuda_mixed_gemm(weight, n, k, quant_bits, sm)
- # pack_weights_for_cuda_mixed_gemm returns int8 array of shape [packed_size]
- # We need to reshape it to (k, n/2) for 4-bit, (k, n) for 8-bit.
- output_rows = k
- output_cols = n // 2 if quant_bits == 4 else n
- packed_tensor = torch.from_numpy(packed).to(tensor.device)
- packed_tensor = packed_tensor.view(torch.uint8).view(output_rows, output_cols)
- packed_list.append(packed_tensor)
-
- return torch.stack(packed_list)
- else:
- # This shall not happen unless older version of onnxruntime is used.
- raise ImportError(
- "onnxruntime._pybind_state.pack_weights_for_cuda_mixed_gemm not found. Cannot preprocess weights."
- )
-
-
def quant_dequant_blockwise(weights, block_size, is_4_bit_quantization: bool = True, asymmetric: bool = False):
# DEBUG
# print(f"DEBUG: quant_dequant input shape={weights.shape}, 4bit={is_4_bit_quantization}, asym={asymmetric}")
@@ -2110,7 +2073,7 @@ class TestQMoEIntPrePackSmoke(unittest.TestCase):
hardware (the other ``test_swiglu_qmoe_parity_*`` cases in this file
fail on H200 / H100 with max-diff > 1.0 on plain main, by
inspection — pre-existing). A real parity check can be added once
- that harness honours the runtime SM.
+ that harness honors the runtime SM.
"""
def _run_one(self, *, hidden_size, inter_size, num_experts, top_k, swiglu_fusion, batch_size):