diff --git a/docs/ContribOperators.md b/docs/ContribOperators.md index 9d19a95136ad7..38d101786b41a 100644 --- a/docs/ContribOperators.md +++ b/docs/ContribOperators.md @@ -4949,7 +4949,7 @@ This version of the operator has been available since version 1 of the 'com.micr
use_sparse_mixer : int
Whether to use sparse mixer
weights_prepacked : int
-
Only meaningful when quant_type='int'. Tri-state control over whether the int4/int8 fc1/fc2 weight initializers are already laid out in the CUTLASS fpA_intB format expected by the runner. -1 (auto): let the execution provider choose its own backward-compatible default; the CUDA EP treats auto as prepacked. 1: the initializers are already prepacked (e.g. produced offline by pack_weights_for_cuda_mixed_gemm) and are consumed as-is. 0: the initializers are raw, un-prepacked [E, N, K/pack] tensors as produced by quantize_matmul_{4,8}bits; the kernel runs the CUTLASS layout transform itself in PrePack(), matching the behaviour of MatMulNBits and removing the offline pre-pack requirement from exporters. Defaults to -1 (auto) so each execution provider can pick its own backward-compatible default rather than the schema imposing one.
+
Only meaningful when quant_type='int'. Tri-state control over the layout of the int4/int8 fc1/fc2 weight initializers. The concrete prepacked layouts selected by -1 and 1 are determined by the execution provider. 0: the initializers are raw, un-prepacked [E, N, K/pack] tensors as produced by quantize_matmul_{4,8}bits. Defaults to -1.
#### Inputs (6 - 21) diff --git a/docs/contrib_ops/cuda/moe_qmoe.md b/docs/contrib_ops/cuda/moe_qmoe.md index 6d53211ff40cb..36b68889ae582 100644 --- a/docs/contrib_ops/cuda/moe_qmoe.md +++ b/docs/contrib_ops/cuda/moe_qmoe.md @@ -71,6 +71,7 @@ input tokens → router (top-k softmax) → permute by expert | `expert_weight_bits` (QMoE only) | int | 4 | 4 (INT4/MXFP4) or 8 (INT8/FP8). | | `block_size` (QMoE only) | int | -1 | Group size for INT4/INT8 group-wise quantization. -1 = per-output-channel. | | `quant_type` (QMoE only) | string | `"int"` | `"int"`, `"fp4"`, `"fp8"`, `"wfp4afp8"`. See [§3](#3-quantization-modes). | +| `weights_prepacked` (QMoE only) | int | -1 | Tri-state, only meaningful when `quant_type="int"`. The prepacked layouts selected by `-1` and `1` are **EP-determined**. `-1` (default): the INT4/INT8 `fc1`/`fc2` initializers are already prepacked in the EP's default layout (e.g. from `pack_weights_for_cuda_mixed_gemm` for the CUDA EP). `1`: already prepacked in an alternate EP-selected layout. `0`: the initializers are raw `[E, N, K/pack]` tensors (as produced by `quantize_matmul_{4,8}bits`) and the kernel runs the CUTLASS layout transform in `PrePack()`. **Note:** the CUDA EP INT4/INT8 MoE GEMM always runs the Ampere (SM80) kernel — even on SM90 — so it consumes the SM80 `fpA_intB` layout on all architectures; `-1` and `1` are therefore equivalent for the CUDA EP today, and `1` is reserved for a possible future Hopper-specific layout. See [§5.1](#51-weights-input-2--5--8). | ### 2.2 Type Constraints @@ -228,10 +229,53 @@ extra subtraction. ### 5.1 Weights (input 2 / 5 / 8) -Not transformed at runtime. INT4/INT8 weights must already be packed offline by -`pack_weights_for_cuda_mixed_gemm` (see [§6](#6-weight-formats)). MXFP4 weights -must be packed by `pack_fp4_weights_for_cuda_moe_gemm`. FP8 weights are stored -as raw e4m3 bytes (no packing). +**INT4/INT8** weight layout is controlled by the `weights_prepacked` attribute +([§2.1](#21-attributes)). 
The prepacked layouts selected by `-1` and `1` are +determined by the execution provider: + +- **`weights_prepacked=-1` (default)** — the `fc1`/`fc2` weights are already in + the EP's default prepacked layout (e.g. packed offline by + `pack_weights_for_cuda_mixed_gemm` for the CUDA EP). They are copied to GPU + and consumed as-is. +- **`weights_prepacked=1`** — the `fc1`/`fc2` weights are already prepacked in an + alternate EP-selected layout. On the CUDA EP this is currently the same SM80 + layout as `-1`; the value is reserved for a possible future Hopper-specific + layout (see the note below). +- **`weights_prepacked=0`** — the `fc1`/`fc2` weights are raw, schema-conformant + `[E, N, K/pack]` tensors as produced by `quantize_matmul_{4,8}bits`. `PrePack` + runs the CUTLASS layout transform itself via `PrePackIntExpertWeights`, + removing the offline pre-pack dependency. This makes integer QMoE symmetric + with `MatMulNBits::PrePack_B`. + +> **Single layout on the CUDA EP.** The CUDA EP INT4/INT8 MoE GEMM always +> dispatches to the Ampere (**SM80**) grouped-GEMM kernel — even on SM90 — +> because mixed int-weight + fp16/bf16 activation is not a valid Hopper TMA +> warp-specialized specialisation (`isValidHopperMOESpecialisation` is `false`). +> This matches **TensorRT-LLM**, which likewise routes `W4A16`/`W8A16` MoE to the +> SM80 kernel on Hopper; its Hopper TMA-WS mixed-dtype MoE kernel is reserved for +> `W4A8` (FP8 activation) and `WFP4A16` (FP4 weight). Consequently the CUDA EP +> consumes the **SM80 `fpA_intB` layout on every GPU**, `PrePack` always packs +> for SM80, and `weights_prepacked=-1` and `=1` are equivalent today. `1` is +> accepted and reserved for a possible future Hopper-specific layout (e.g. +> `W4A8`). There is therefore no architecture-match constraint: SM80-format +> weights run correctly on SM90 via the SM80 kernel. 
+ +`PrePackIntExpertWeights` loops over the `E` experts and, per expert, applies the +same transpose + row-permutation / column-interleave / bias / pair-interleave +transform as `pack_weights_for_cuda_mixed_gemm` (see [§6.1](#61-int4-group-wise-quant_typeint-expert_weight_bits4)), +always targeting the SM80 layout. SM75+ is required. The source +`[E, N, K/pack]` initializers are released after their shapes are cached +(`fc1_weights_shape_` / `fc2_weights_shape_`), so peak weight memory stays ~1×. +The prepacked GPU buffers (`packed_fc1_weights_` / `packed_fc2_weights_`) are then +preferred by `ComputeInternal`. If prepacking is disabled at the session level +(`session.disable_prepacking`), the buffers stay null and the raw initializer +pointers are read at compute time instead. + +> **Note**: `weights_prepacked=0` is the only path that triggers an in-`PrePack` +> layout transform for INT weights. FP4 / FP8 / WFP4AFP8 weight handling is +> unaffected. + +MXFP4 weights must be packed by `pack_fp4_weights_for_cuda_moe_gemm`. FP8 weights +are stored as raw e4m3 bytes (no packing). + ### 5.2 INT4/INT8 scales + zero-point → bias @@ -287,7 +331,12 @@ This section covers the five distinct weight encodings supported by QMoE. INT4 packing layout within a byte: `[high_nibble | low_nibble] = [elt_1 | elt_0]`. Each INT4 element is in `[-8, 7]` (signed) before bias, `[0, 15]` after the +8 bias. -#### Preprocessing pipeline (offline, `pack_weights_for_cuda_mixed_gemm`) +#### Preprocessing pipeline (offline `pack_weights_for_cuda_mixed_gemm`, or in-`PrePack` via `PrePackIntExpertWeights`) + +This is the layout transform applied either offline by +`pack_weights_for_cuda_mixed_gemm`, or per-expert inside `PrePack` when +`weights_prepacked=0` (see [§5.1](#51-weights-input-2--5--8)). + 1. **Input layout**: `[N, K]` per expert (Out × In), 2 elements per byte for INT4. 2. 
**Transpose & signed conversion**: @@ -405,6 +454,17 @@ weights are interchangeable across SMs: — does not use `pack_weights_for_cuda_mixed_gemm`. - **FP8**: no packing. +> **QMoE uses Group A on every GPU.** The table above describes the layouts the +> `pack_weights_for_cuda_mixed_gemm` *preprocessor* can emit. The QMoE INT4/INT8 +> MoE GEMM, however, always dispatches to the Ampere (SM80) grouped-GEMM kernel — +> even on SM90 — because mixed int-weight + fp16/bf16 activation is not a valid +> Hopper TMA warp-specialized specialisation (the same is true in TensorRT-LLM). +> It therefore consumes the **Group A (SM80) layout on all architectures, +> including Hopper**. For QMoE, always pack INT4/INT8 weights for SM80 (`arch=80`), +> and `PrePackIntExpertWeights` (`weights_prepacked=0`) does exactly that +> regardless of the runtime device SM. Group B (SM90) layout is currently unused +> by QMoE. + --- ## 8. SwiGLU Fusion @@ -830,7 +890,7 @@ will not change the operator interface. |-----------|----------| | [test_moe_cuda.py](onnxruntime/test/python/transformers/test_moe_cuda.py) | Standard MoE on CUDA: FP16/BF16, SiLU/GeLU/SwiGLU, routing, GEMM parity. SwiGLU coverage includes both GPT-OSS (`TestSwigluMoE`: interleaved, alpha=1.702/beta=1.0/limit=7.0) and Standard/Llama-Gemma (`TestStandardSwigluMoE`: concatenated `swiglu_fusion=2`, alpha=1.0/beta=0.0/no limit → `SiLU(Gate)×Value`). | | [test_moe_cpu.py](onnxruntime/test/python/transformers/test_moe_cpu.py) | Standard MoE on CPU (smoke). | -| [test_qmoe_cuda.py](onnxruntime/test/python/transformers/test_qmoe_cuda.py) | INT4/INT8 QMoE — primary regression signal for the production QMoE path. Exercises `pack_weights_for_cuda_mixed_gemm` and dequant-then-matmul reference. | +| [test_qmoe_cuda.py](onnxruntime/test/python/transformers/test_qmoe_cuda.py) | INT4/INT8 QMoE — primary regression signal for the production QMoE path. Exercises `pack_weights_for_cuda_mixed_gemm` and dequant-then-matmul reference. 
`TestQMoEIntPrePackSmoke` covers the raw-weight `weights_prepacked=0` in-`PrePack` layout transform (smoke test: asserts finite output, not bit-parity). | | [test_qmoe_cpu.py](onnxruntime/test/python/transformers/test_qmoe_cpu.py) | INT4/INT8 QMoE on CPU (smoke). | | [test_qmoe_fp4_cuda.py](onnxruntime/test/python/transformers/test_qmoe_fp4_cuda.py) | MXFP4 QMoE: quantization utilities, packing, FP16/BF16, SiLU/SwiGLU, top-k and expert-count variants. End-to-end runs on SM120; on SM<120 the dequant fallback is exercised. | | [test_qmoe_fp8_cuda.py](onnxruntime/test/python/transformers/test_qmoe_fp8_cuda.py) | FP8 W8A16 QMoE on SM90+ native path and SM<90 dequant fallback. | @@ -954,6 +1014,11 @@ over-aligned by-value parameters. cannot. See [§14.1](#141-msvc-and-tma-grouped-moe-gemm). - **WFP4AFP8 native** requires SM100+ hardware; only the dequant fallback path is validated end-to-end so far. +- **In-`PrePack` INT weight layout transform** (`weights_prepacked=0`) is + currently covered only by a smoke test (`TestQMoEIntPrePackSmoke`), not a + bit-parity check: the existing offline pre-pack harness hardcodes + `force_arch=80` (the same SM80 layout consumed by the CUDA EP on all GPUs), + so a separate parity harness for this path is still pending. - **Hopper W4A8** (INT4 weight + FP8 activation) is not supported — TRT-LLM gates its fast path to SM89 only. 
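Reviewer sketch (not part of the patch): the `weights_prepacked` semantics documented above, together with the arch clamp this change introduces as `get_arch_for_mixed_gemm_weight_preprocess`, can be modelled in a few lines of Python. This is only an illustration under stated assumptions: the Python function names and the `sm90_enabled` flag (standing in for the `EXCLUDE_SM_90` build macro) are hypothetical and do not exist in ONNX Runtime.

```python
def arch_for_weight_preprocess(arch: int, sm90_enabled: bool = True) -> int:
    """Python model of the C++ arch clamp; `sm90_enabled` mimics `#ifndef EXCLUDE_SM_90`."""
    if arch < 75:
        raise ValueError(f"Unsupported CUDA architecture: {arch}")
    if arch < 80:
        return 75  # Turing keeps its own branch
    if sm90_enabled and 90 <= arch < 100:
        return 90  # Hopper layout, when SM90 support is compiled in
    return 80  # Ampere/Ada and SM100+ share the SM80 layout


def qmoe_int_packing_arch(runtime_sm: int) -> int:
    """Layout the CUDA QMoE int4/int8 PrePack targets: always SM80, per the diff."""
    if runtime_sm < 75:
        raise ValueError(f"weights_prepacked=0 requires SM75+, got SM{runtime_sm}")
    # The runtime SM is deliberately ignored: the grouped GEMM runs the SM80 kernel.
    return arch_for_weight_preprocess(80)


def weights_already_prepacked(mode: int = -1) -> bool:
    """Tri-state attribute -> CUDA EP flag: -1 and 1 mean prepacked, 0 means raw."""
    if mode not in (-1, 0, 1):
        raise ValueError(f"weights_prepacked must be -1 (default), 0, or 1, but got {mode}")
    return mode != 0
```

The point of the sketch is the asymmetry on Hopper: the generic preprocessor clamp maps SM90 to the SM90 layout, while the QMoE path pins SM80 regardless of the device, which is why the tests in this change pass `80` explicitly instead of relying on auto-detection.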
diff --git a/onnxruntime/contrib_ops/cuda/llm/fpA_intB_gemm_preprocessors.h b/onnxruntime/contrib_ops/cuda/llm/fpA_intB_gemm_preprocessors.h index b9e62443145e5..c3b734816cf84 100644 --- a/onnxruntime/contrib_ops/cuda/llm/fpA_intB_gemm_preprocessors.h +++ b/onnxruntime/contrib_ops/cuda/llm/fpA_intB_gemm_preprocessors.h @@ -31,6 +31,8 @@ enum class QuantType { W4_AFP8 }; +int get_arch_for_mixed_gemm_weight_preprocess(int arch); + void preprocess_weights_for_mixed_gemm_cuda(cudaStream_t stream, int arch, int8_t* preprocessed_quantized_weight, diff --git a/onnxruntime/contrib_ops/cuda/llm/fpA_intB_gemm_preprocessors_impl.cu b/onnxruntime/contrib_ops/cuda/llm/fpA_intB_gemm_preprocessors_impl.cu index a006612ddadc9..7e83bdda72eab 100644 --- a/onnxruntime/contrib_ops/cuda/llm/fpA_intB_gemm_preprocessors_impl.cu +++ b/onnxruntime/contrib_ops/cuda/llm/fpA_intB_gemm_preprocessors_impl.cu @@ -521,6 +521,19 @@ void add_bias_and_interleave_quantized_tensor_inplace_cuda( } } +int get_arch_for_mixed_gemm_weight_preprocess(int arch) { + ORT_ENFORCE(arch >= 75, "Unsupported CUDA architecture: ", arch); + if (arch < 80) { + return 75; + } +#ifndef EXCLUDE_SM_90 + if (arch >= 90 && arch < 100) { + return 90; + } +#endif + return 80; +} + void preprocess_weights_for_mixed_gemm_cuda(cudaStream_t stream, int arch, int8_t* preprocessed_quantized_weight, diff --git a/onnxruntime/contrib_ops/cuda/llm/fpA_intB_gemm_preprocessors_impl.h b/onnxruntime/contrib_ops/cuda/llm/fpA_intB_gemm_preprocessors_impl.h index a8fb411ed0663..47bbe0c0e10ec 100644 --- a/onnxruntime/contrib_ops/cuda/llm/fpA_intB_gemm_preprocessors_impl.h +++ b/onnxruntime/contrib_ops/cuda/llm/fpA_intB_gemm_preprocessors_impl.h @@ -120,11 +120,11 @@ LayoutDetails getLayoutDetailsForArch(QuantType quant_type) { } LayoutDetails getLayoutDetailsForTransform(QuantType quant_type, int arch) { - ORT_ENFORCE(arch >= 75, "Unsupported CUDA architecture: ", arch); - if (arch < 80) { + arch = 
get_arch_for_mixed_gemm_weight_preprocess(arch); + if (arch == 75) { return getLayoutDetailsForArch(quant_type); #ifndef EXCLUDE_SM_90 - } else if (arch >= 90 && arch < 100) { + } else if (arch == 90) { return getLayoutDetailsForArch(quant_type); #endif } else { diff --git a/onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc b/onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc index e1ddcac0cea4f..7d1291e004d78 100644 --- a/onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc +++ b/onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc @@ -62,18 +62,28 @@ QMoE::QMoE(const OpKernelInfo& op_kernel_info) : CudaKernel(op_kernel_info), MoE this->quant_type_ = op_kernel_info.GetAttrOrDefault("quant_type", "int"); ORT_ENFORCE(quant_type_ == "int" || quant_type_ == "fp4" || quant_type_ == "fp8" || quant_type_ == "wfp4afp8", "quant_type must be 'int', 'fp4', 'fp8', or 'wfp4afp8', but got '", quant_type_, "'"); - // ``weights_prepacked`` is an optional tri-state attribute that defaults to - // -1 (auto) in the schema, so each EP picks its own backward-compatible - // default rather than the schema imposing one: - // -1 (auto, also the schema default): the EP decides. The CUDA EP's - // backward-compatible default is "prepacked" because all pre-existing - // tooling ships CUTLASS-prepacked weights. - // 1: initializers are already prepacked; the compute path reads them as-is. - // 0: initializers are raw [E, N, K/pack]; the PrePack hook lays them out. + // ``weights_prepacked`` is an optional tri-state attribute (default -1) that + // declares the layout of the int4/int8 fc1/fc2 weight initializers. The + // concrete prepacked layouts selected by -1 and 1 are determined by the + // execution provider. The CUDA EP maps the tri-state as: + // -1 (default): already prepacked in the EP's default int weight layout. + // 1: already prepacked in an alternate EP-selected int weight layout. + // 0: raw [E, N, K/pack] initializers; the PrePack hook lays them out. 
+ // + // Important: the CUDA QMoE int4/int8 MoE GEMM always dispatches to the + // Ampere (SM80) grouped-GEMM kernel -- even on SM90 -- because mixed + // int-weight + fp16/bf16 activation is not a valid Hopper TMA warp-specialized + // specialisation (see isValidHopperMOESpecialisation). The kernel therefore + // consumes the SM80/Ampere CUTLASS fpA_intB layout on every GPU. As a result + // the EP default (-1) is the SM80 layout regardless of the runtime device SM, + // and SM80-format weights are valid on SM90 (they run via the SM80 kernel). + // For CUDA today, -1 and 1 are equivalent (both SM80 layout), and 1 is + // reserved for a possible future Hopper-specific layout. + // PrePack (weights_prepacked=0) packs for the SM80 layout accordingly. const int64_t weights_prepacked_mode = op_kernel_info.GetAttrOrDefault("weights_prepacked", static_cast(-1)); ORT_ENFORCE(weights_prepacked_mode == -1 || weights_prepacked_mode == 0 || weights_prepacked_mode == 1, - "weights_prepacked must be -1 (auto), 0, or 1, but got ", weights_prepacked_mode); + "weights_prepacked must be -1 (default), 0, or 1, but got ", weights_prepacked_mode); weights_prepacked_ = (weights_prepacked_mode != 0); #if !defined(ENABLE_FP4) || !defined(USE_FP4_QMOE) ORT_ENFORCE(quant_type_ != "fp4", "QMoE quant_type='fp4' requires USE_FP4_QMOE with CUDA 12.8 or newer."); @@ -850,7 +860,7 @@ Status QMoE::ComputeInternal(OpKernelContext* context) const { // PrePack converted the raw int4/int8 weights to the CUTLASS fpA_intB // layout that the runner consumes and freed the source initializer // (``is_packed = true``). 
Gate on ``int_weights_consumed_by_prepack`` - // (which already requires ``packed_fc1_weights_ != nullptr``) rather than + // (which already requires both packed weight buffers) rather than // just ``is_int && !weights_prepacked_``: when prepacking is disabled at // the session level (``session.disable_prepacking``) PrePack never runs, // the prepack buffers stay null, and the raw initializer pointers read @@ -1146,6 +1156,9 @@ void QMoE::PrePackIntExpertWeights(const Tensor& tensor, cudaStream_t stream, Al IAllocatorUniquePtr& packed_buf, bool& is_packed) { ORT_ENFORCE(expert_weight_bits_ == 4 || expert_weight_bits_ == 8, "PrePackIntExpertWeights: only 4 and 8 bits are supported, got ", expert_weight_bits_); + ORT_ENFORCE(sm_ >= 75, + "PrePackIntExpertWeights: quant_type='int' with weights_prepacked=0 requires SM75+ CUDA hardware, got SM", + sm_); const auto& shape = tensor.Shape(); ORT_ENFORCE(shape.NumDimensions() == 3, "PrePackIntExpertWeights: expected 3-D weight tensor [E, N, K/pack], got ndim=", @@ -1158,22 +1171,15 @@ void QMoE::PrePackIntExpertWeights(const Tensor& tensor, cudaStream_t stream, Al const int64_t k_packed = shape[2]; const int64_t k = k_packed * pack_factor; - // Weight packing is architecture-aware (see - // docs/contrib_ops/cuda/moe_qmoe.md §7 "Cross-Architecture Packing - // Compatibility"). SM90 (Hopper) uses its own Permuted-Linear layout that - // skips column interleaving, so it is its own compatibility group. Every - // other supported arch — SM75/80/86/89 and SM100/120 (Blackwell) — shares - // the SM80 fpA_intB layout, so they all pack as SM80. SM70 and older lack - // INT8 LDSM and are unsupported. The compute-side runner selects the same - // layout from this clamped arch, so the two cannot drift. 
- // - // SM75 is passed through unchanged (rather than clamped to 80) even though it - // shares SM80's layout: the compute-side dispatch (getLayoutDetailsForTransform) - // still has a distinct SM75 branch, so mirroring it here avoids confusing a - // reader into thinking prepack and dispatch disagree. - ORT_ENFORCE(sm_ >= 75, - "QMoE int4/int8 weight prepack requires SM75 or newer, got sm=", sm_); - const int packing_sm = (sm_ == 90 || sm_ == 75) ? sm_ : 80; + // The CUDA QMoE int4/int8 MoE GEMM always dispatches to the Ampere (SM80) + // grouped-GEMM kernel -- even on SM90 -- because mixed int-weight + fp16/bf16 + // is not a valid Hopper TMA warp-specialized specialisation. The kernel thus + // consumes the SM80 CUTLASS fpA_intB layout on every GPU, so the weights must + // always be preprocessed for SM80 regardless of the runtime device SM. + // (Using get_arch_for_mixed_gemm_weight_preprocess(sm_) here would emit the + // SM90 layout on Hopper, which the SM80 kernel cannot consume -> wrong output.) + const int packing_sm = + onnxruntime::llm::kernels::weight_only::get_arch_for_mixed_gemm_weight_preprocess(80); // Per-expert sizes. const size_t per_expert_bytes = static_cast(n) * static_cast(k) / pack_factor; diff --git a/onnxruntime/contrib_ops/cuda/moe/moe_quantization.h b/onnxruntime/contrib_ops/cuda/moe/moe_quantization.h index 5722ac41cc470..2bbadc205b5d8 100644 --- a/onnxruntime/contrib_ops/cuda/moe/moe_quantization.h +++ b/onnxruntime/contrib_ops/cuda/moe/moe_quantization.h @@ -46,16 +46,23 @@ class QMoE final : public CudaKernel, public MoEBase { IAllocatorUniquePtr& packed_buf, bool& is_packed); int64_t expert_weight_bits_; bool is_fp16_; - // When true (the schema default), the int4/int8 fc1/fc2 weight - // initializers are already in the CUTLASS fpA_intB layout — produced - // offline e.g. via ``pack_weights_for_cuda_mixed_gemm`` — and the - // compute path reads them as-is. 
When false, the raw schema-conformant - // ``[E, N, K/pack]`` layout (as produced by - // ``quantize_matmul_{4,8}bits``) is rewritten inside the PrePack hook - // via ``PrePackIntExpertWeights``, removing the offline prepack - // dependency. Only meaningful when ``quant_type_ == "int"``. Derived from - // the optional tri-state ``weights_prepacked`` attribute: -1/auto (or - // absent) maps to true on the CUDA EP, 1 maps to true, 0 maps to false. + // When true, the int4/int8 fc1/fc2 weight initializers are already in a + // CUTLASS fpA_intB layout — produced offline e.g. via + // ``pack_weights_for_cuda_mixed_gemm`` — and the compute path reads them + // as-is. When false, the raw schema-conformant ``[E, N, K/pack]`` layout + // (as produced by ``quantize_matmul_{4,8}bits``) is rewritten inside the + // PrePack hook via ``PrePackIntExpertWeights``, removing the offline + // prepack dependency. Only meaningful when ``quant_type_ == "int"``. + // Derived from the optional tri-state ``weights_prepacked`` attribute: + // -1 (default) and 1 both map to true; 0 maps to false. The concrete + // prepacked layouts selected by -1 and 1 are determined by the execution + // provider. For the CUDA EP the int4/int8 MoE GEMM always dispatches to the + // Ampere (SM80) grouped-GEMM kernel -- even on SM90 -- because mixed + // int-weight + fp16/bf16 activation is not a valid Hopper TMA warp-specialized + // specialisation (matches TensorRT-LLM, which also routes W4A16/W8A16 MoE to + // the SM80 kernel on Hopper). The kernel therefore consumes the SM80 fpA_intB + // layout on every GPU, so -1 and 1 are currently equivalent for the CUDA EP; + // 1 is reserved for a possible future Hopper-specific layout (e.g. W4A8). bool weights_prepacked_ = true; // Cached source weight shapes captured at PrePack time. 
When the // PrePack hook consumed and released the original int4/int8 weight diff --git a/onnxruntime/core/graph/contrib_ops/contrib_defs.cc b/onnxruntime/core/graph/contrib_ops/contrib_defs.cc index 1054fd94ef423..f3f2f521ecab2 100644 --- a/onnxruntime/core/graph/contrib_ops/contrib_defs.cc +++ b/onnxruntime/core/graph/contrib_ops/contrib_defs.cc @@ -1520,18 +1520,10 @@ ONNX_MS_OPERATOR_SET_SCHEMA( AttributeProto::STRING, std::string("int")) .Attr("weights_prepacked", - "Only meaningful when quant_type='int'. Tri-state control over whether the " - "int4/int8 fc1/fc2 weight initializers are already laid out in the CUTLASS " - "fpA_intB format expected by the runner. -1 (auto): let the execution provider " - "choose its own backward-compatible default; the CUDA EP treats auto as " - "prepacked. 1: the initializers are already prepacked (e.g. produced offline by " - "pack_weights_for_cuda_mixed_gemm) and are consumed as-is. 0: the initializers " - "are raw, un-prepacked [E, N, K/pack] tensors as produced by " - "quantize_matmul_{4,8}bits; the kernel runs the CUTLASS layout transform itself " - "in PrePack(), matching the behaviour of MatMulNBits and removing the offline " - "pre-pack requirement from exporters. Defaults to -1 (auto) so each execution " - "provider can pick its own backward-compatible default rather than the schema " - "imposing one.", + "Only meaningful when quant_type='int'. Tri-state control over the layout of the " + "int4/int8 fc1/fc2 weight initializers. The concrete prepacked layouts selected by " + "-1 and 1 are determined by the execution provider. 0: the initializers are raw, " + "un-prepacked [E, N, K/pack] tensors as produced by quantize_matmul_{4,8}bits. 
Defaults to -1.", AttributeProto::INT, static_cast(-1)) .Input(0, diff --git a/onnxruntime/python/onnxruntime_pybind_quant.cc b/onnxruntime/python/onnxruntime_pybind_quant.cc index 5b1d590a06234..7220153b4fa17 100644 --- a/onnxruntime/python/onnxruntime_pybind_quant.cc +++ b/onnxruntime/python/onnxruntime_pybind_quant.cc @@ -16,7 +16,6 @@ #endif #include #include -#include #include namespace pybind11 { @@ -252,17 +251,8 @@ py::array_t PackWeightsForMixedGemm( cudaDeviceProp device_prop; ThrowIfCudaError(cudaGetDeviceProperties(&device_prop, device_id), "cudaGetDeviceProperties"); sm = device_prop.major * 10 + device_prop.minor; - } else { - // Validate force_arch against the SM versions for which preprocess_weights_for_mixed_gemm_cuda - // has tile/permutation tables. Unknown SMs would silently produce incorrect weight layouts. - static const std::set kSupportedSm = {75, 80, 90}; - if (kSupportedSm.find(sm) == kSupportedSm.end()) { - std::ostringstream oss; - oss << "force_arch=" << sm << " is not a supported SM version. " - << "Pass -1 for auto-detect, or one of: 75, 80, 90 (arch > 90 will fallback to 80)."; - throw std::invalid_argument(oss.str()); - } } + sm = ::onnxruntime::llm::kernels::weight_only::get_arch_for_mixed_gemm_weight_preprocess(sm); auto permutation_map_buffer = make_cuda_ptr(32 * sizeof(int32_t)); diff --git a/onnxruntime/test/python/transformers/test_moe_cuda.py b/onnxruntime/test/python/transformers/test_moe_cuda.py index c5fc826a5a6ed..9677542270a53 100644 --- a/onnxruntime/test/python/transformers/test_moe_cuda.py +++ b/onnxruntime/test/python/transformers/test_moe_cuda.py @@ -152,7 +152,10 @@ def quant_dequant(weights, is_4_bit_quantization: bool = True): q_weight_reshaped = q_weight.reshape(n, -1) # Pack weights for CUDA mixed-gemm kernel (FpA_IntB format), and qMoE kernel uses the same format. 
- processed_q_weight = _quantize.pack_weights_for_cuda_mixed_gemm(q_weight_reshaped, n, k, 4) + # Pin arch=80: the QMoE grouped MoE GEMM always runs the Ampere (SM80) kernel -- even on SM90 -- + # so it consumes the SM80 (column-interleaved) layout on every GPU. Auto-detect (force_arch=-1) + # would emit the non-interleaved SM90 layout on Hopper and produce wrong results. + processed_q_weight = _quantize.pack_weights_for_cuda_mixed_gemm(q_weight_reshaped, n, k, 4, 80) # So we need to DEQUANTIZE back to get `result`. # scale is [n, block_per_k] @@ -232,8 +235,11 @@ def quant_dequant(weights, is_4_bit_quantization: bool = True): ) q_weight_reshaped = q_weight.reshape(n, -1) - # Pack weights for CUDA mixed-gemm kernel (FpA_IntB format) - processed_q_weight = _quantize.pack_weights_for_cuda_mixed_gemm(q_weight_reshaped, n, k, 8) + # Pack weights for CUDA mixed-gemm kernel (FpA_IntB format). + # Pin arch=80: the QMoE grouped MoE GEMM always runs the Ampere (SM80) kernel -- even on SM90 -- + # so it consumes the SM80 (column-interleaved) layout on every GPU. Auto-detect (force_arch=-1) + # would emit the non-interleaved SM90 layout on Hopper and produce wrong results. + processed_q_weight = _quantize.pack_weights_for_cuda_mixed_gemm(q_weight_reshaped, n, k, 8, 80) # Dequantize for reference # (q - 128) * scale if using 128 offset? or (q) * scale if symmetric around 0? 
@@ -1084,8 +1090,8 @@ def parity_check(self): ort_dtype_quant_bits_tolerance_map = { "FP32:0": (5e-3, 1e-3), "FP16:0": (0.3, 0.05), - "FP16:4": (3.0, 1e-2), - "FP16:8": (2.0, 1e-2), + "FP16:4": (0.5, 1e-2), + "FP16:8": (0.5, 1e-2), "BF16:0": (1.0, 1e-2), "BF16:4": (30.0, 1e-1), "BF16:8": (20.0, 1e-1), diff --git a/onnxruntime/test/python/transformers/test_qmoe_cuda.py b/onnxruntime/test/python/transformers/test_qmoe_cuda.py index 993716a4c80b0..c56383d2851d3 100644 --- a/onnxruntime/test/python/transformers/test_qmoe_cuda.py +++ b/onnxruntime/test/python/transformers/test_qmoe_cuda.py @@ -137,43 +137,6 @@ def print_diff_statistics(diff_tensor: torch.Tensor, prefix: str = ""): ) -def preprocess_weights_for_mixed_gemm( - tensor: torch.Tensor, quant_bits: int, sm: int = -1, do_weight_interleave: bool = True -) -> torch.Tensor: - if len(tensor.shape) == 2: - tensor = tensor.unsqueeze(0) - - # Input tensor shape is [Experts, n, k_packed]. k_packed is k/2 for 4-bit, k for 8-bit. - num_experts = tensor.shape[0] - n = tensor.shape[1] - k_packed = tensor.shape[2] - k = k_packed * 2 if quant_bits == 4 else k_packed - - packed_list = [] - - if _pybind and hasattr(_pybind, "pack_weights_for_cuda_mixed_gemm") and torch.cuda.is_available(): - for i in range(num_experts): - if tensor[i].dtype == torch.bfloat16: - weight = tensor[i].to(torch.float32).cpu().numpy() - else: - weight = tensor[i].cpu().numpy() - packed = _pybind.pack_weights_for_cuda_mixed_gemm(weight, n, k, quant_bits, sm) - # pack_weights_for_cuda_mixed_gemm returns int8 array of shape [packed_size] - # We need to reshape it to (k, n/2) for 4-bit, (k, n) for 8-bit. 
- output_rows = k - output_cols = n // 2 if quant_bits == 4 else n - packed_tensor = torch.from_numpy(packed).to(tensor.device) - packed_tensor = packed_tensor.view(torch.uint8).view(output_rows, output_cols) - packed_list.append(packed_tensor) - - return torch.stack(packed_list) - else: - # This shall not happen unless older version of onnxruntime is used. - raise ImportError( - "onnxruntime._pybind_state.pack_weights_for_cuda_mixed_gemm not found. Cannot preprocess weights." - ) - - def quant_dequant_blockwise(weights, block_size, is_4_bit_quantization: bool = True, asymmetric: bool = False): # DEBUG # print(f"DEBUG: quant_dequant input shape={weights.shape}, 4bit={is_4_bit_quantization}, asym={asymmetric}") @@ -2110,7 +2073,7 @@ class TestQMoEIntPrePackSmoke(unittest.TestCase): hardware (the other ``test_swiglu_qmoe_parity_*`` cases in this file fail on H200 / H100 with max-diff > 1.0 on plain main, by inspection — pre-existing). A real parity check can be added once - that harness honours the runtime SM. + that harness honors the runtime SM. """ def _run_one(self, *, hidden_size, inter_size, num_experts, top_k, swiglu_fusion, batch_size):