Sync with Microsoft ONNX Runtime - 21062026 by ai-fw-intg · Pull Request #1155 · intel/onnxruntime

ai-fw-intg · 2026-06-20T20:33:54Z

Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.

…9067)

## Description

The symmetric INT4/INT8 MoE decode GEMV could emit `NaN`/garbage when a GEMM reduction dimension was not a whole multiple of the 64-element interleaved-weight K tile (e.g. `intermediate_size` such as 544). The interleaved weight layout's CUTLASS K iterator reads K in whole tiles of 64; a partial final tile makes threads read past the valid activation range.

This PR fixes the decode GEMV selection gate to reject such shapes, adds an explicit up-front validation in the QMoE op so the grouped GEMM path fails with a clear error instead of silently producing wrong results, and folds in several QMoE review-feedback cleanups (checked size arithmetic, env-var parsing, and documentation).

## Summary of Changes

### NaN fix and hardening (INT weight-only path)

| File | Change |
|------|--------|
| `onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemv.cu` | `is_moe_gemv_supported` now rejects `k % kTileSizeK != 0` (64), so the decode GEMV is not selected for a partial final K tile and the path falls back to the grouped GEMM. |
| `onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc` | Added an up-front guard for `quant_type == "int"`: `hidden_size` (fc1.K) and `inter_size` (fc2.K) must be multiples of 64 (the interleaved-weight K tile); otherwise return `INVALID_ARGUMENT` with a clear message instead of computing garbage. |

### Review-feedback cleanups

| File | Change |
|------|--------|
| `onnxruntime/contrib_ops/cuda/moe/moe.cc` | Use `SafeInt<size_t>` for scratch byte-count arithmetic (expanded rows × element sizes) feeding a single allocation. |
| `onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc` | Same `SafeInt<size_t>` scratch-size hardening. |
| `onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemv.cu` | Parse `ORT_MOE_GEMV_FP16_ACCUM` via `ParseEnvironmentVariableWithDefault<int>`; include `env_var_utils.h` after `dispatcher.h` (SHARED_PROVIDER guard ordering, documented inline). |
| `onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_kernels.cu` | Parse `ORT_DISABLE_MOE_GEMV` via the same helper; clarify fast-path comments (symmetric INT4/INT8, per-column or block-wise). |
| `onnxruntime/contrib_ops/cuda/llm/cutlass_extensions/gemm/threadblock/dq_mma_base.h` | Comment explaining the worst-case sizing of `kFinegrainedScaleRowsPerStage` for the smallest fine-grained group size. |

### Documentation

| File | Change |
|------|--------|
| `docs/contrib_ops/cuda/moe_qmoe.md` | Document the `swiglu_fusion=0` + SwiGLU backward-compatibility remap (gpt-oss-20b interleaved layout) and the one-time warning. |
| `docs/contrib_ops/cuda/qmoe_gemv_experiments.md` | Note that recorded numbers are point-in-time baselines tied to the listed GPU/driver/CUDA/ORT build. |

## Testing

- `python -m pytest onnxruntime/test/python/transformers/test_qmoe_cuda.py -k "gemv or swiglu or block or PrePack or prepack"`: 84 passed, 6 skipped on H200 (sm90).
- New `TestSwigluQMoE::test_swiglu_qmoe_int_partial_ktile_rejected` builds an `inter_size=544` (= 17×32, partial 64 tile) INT8 SwiGLU QMoE and asserts the run raises `"inter_size to be a multiple of 64"`.
- New `TestSwigluQMoE::test_swiglu_qmoe_fusion0_remap_parity` exercises the `swiglu_fusion=0` → interleaved remap parity.
- `TestQMoEIntPrePackSmoke::test_int4_swiglu_interleaved_small` bumped from `inter_size=32` (a now-rejected partial K tile) to `64`.
- `ORT_ENABLE_FP4_GEMV=1 python -m pytest onnxruntime/test/python/transformers/test_qmoe_fp4_cuda.py`: no failures (the new guard is scoped to `quant_type == "int"`, so FP4/FP8 are unaffected).
- `lintrunner` clean on the changed C++ and Python files.

## Motivation and Context

The interleaved column-major weight layout (`ColumnMajorTileInterleave<64, …>`) requires the GEMM reduction dim K to be a whole multiple of `ThreadblockK` (64 for fp16/bf16 activations). The single-matrix `fpA_intB` GEMM already throws on this, but the grouped MoE GEMM and the decode GEMV had no equivalent guard and silently produced `NaN`/garbage. This PR closes that gap at the QMoE boundary (clear error) and in the GEMV dispatch gate (safe fallback). No supported, 64-aligned shape changes behavior.

## Checklist

- [x] Tests added/updated
- [x] Documentation updated (if applicable)
- [x] No breaking changes (rejected shapes were already producing incorrect output)
- [ ] CI passes
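The K-tile rule described in this QMoE fix can be illustrated with a small Python sketch. It is not the ORT implementation: `is_k_tile_aligned` and `validate_int_qmoe_shapes` are hypothetical names, and only the tile size of 64 and the `hidden_size` / `inter_size` roles come from the description above.

```python
K_TILE = 64  # interleaved-weight K tile size stated in the PR description


def is_k_tile_aligned(k: int, tile: int = K_TILE) -> bool:
    """Return True when the reduction dim K is a whole multiple of the K tile."""
    return k > 0 and k % tile == 0


def validate_int_qmoe_shapes(hidden_size: int, inter_size: int) -> None:
    """Hypothetical up-front guard: fc1.K is hidden_size, fc2.K is inter_size."""
    for name, k in (("hidden_size", hidden_size), ("inter_size", inter_size)):
        if not is_k_tile_aligned(k):
            raise ValueError(f"expected {name} to be a multiple of {K_TILE}, got {k}")


if __name__ == "__main__":
    # 544 = 8 * 64 + 32, so the final K tile would be partial.
    print(is_k_tile_aligned(544))
    # 2880 = 45 * 64, so every K tile is whole.
    print(is_k_tile_aligned(2880))
```

Under this sketch, a shape such as `inter_size=544` is rejected before any kernel is chosen, which mirrors the "clear error instead of garbage" intent of the guard.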

…icrosoft#28985)

### Description

Adds a decode-optimized CUDA path for the `LinearAttention` contrib op (the gated-delta / linear-attention recurrence used by hybrid models such as Qwen3-Next / Qwen3.6). The existing recurrent kernels are tuned for prefill; at decode (`seq_len == 1`) they leave the GPU mostly idle. This PR adds two decode-specific kernels that saturate the GPU and access the recurrent state in a coalesced pattern, **without changing the op's `present_state` layout or numerics**.

### Motivation

`LinearAttentionRecurrentKernelFixedShape` launches one block per `(batch, kv_head)` and keeps the full `d_k x d_v` state in shared memory across the token loop, with block-wide `__syncthreads` at every step. That design amortizes state I/O during prefill, but at decode it:

- launches only `kv_num_heads` blocks (e.g. 32), a fraction of the SMs, and
- gets no amortization from the shared-memory state cache (one token per launch), so the op is latency-bound.

On an H200 profile of Qwen3.6-35B-A3B it was the single most expensive decode kernel after the dense/MoE GEMMs (~0.69 ms/token).

### Key Changes

All in `onnxruntime/contrib_ops/cuda/bert/linear_attention_impl.cu`:

| Addition | Description |
|---|---|
| `warp_reduce_sum` | `__shfl_xor_sync` full-warp sum helper. |
| `LinearAttentionDecodeKernel<T, DK>` | Warp-per-column decode kernel: grid `(kv_num_heads, batch, ceil(d_v/4))`, each warp owns one output column with the state column sharded into registers; reductions via warp shuffles, no shared memory, no block barriers. Handles any `d_v`. |
| `LinearAttentionDecodeColKernel<T, DK>` | Column-per-thread decode kernel (default): one thread owns a full state column in registers. For a fixed row `i`, consecutive threads read consecutive addresses `i*d_v + col`, so state load/store is fully **coalesced with no transpose**; the row-major `[d_k, d_v]` `present_state` layout is unchanged. Used when `d_v % 32 == 0`; otherwise falls back to the warp kernel. |
| Dispatch in `LaunchLinearAttentionKernel` | Routes `seq_len <= 16` and `d_k in {64,128,256}` to the decode kernels; all other shapes fall through to the existing recurrent kernels, so the **prefill path is unchanged**. |

Both kernels cover the full op semantics: `linear` / `gated` / `delta` / `gated_delta` update rules, scalar and per-key-dim decay, per-head and scalar beta, standard GQA and inverse GQA, and `n_k_heads` K-sharing.

### Performance

H200, Qwen3.6-35B-A3B (INT4), single-sequence decode, CUDA graph on. Kernel time measured with Nsight Systems (steady-state, warmup excluded):

| Kernel | Time / token |
|---|---|
| `LinearAttentionRecurrentKernelFixedShape` (existing) | 693 µs |
| `LinearAttentionDecodeKernel` (warp-per-column) | 346 µs (2.0x) |
| `LinearAttentionDecodeColKernel` (column-per-thread) | **202 µs (3.4x)** |

End-to-end decode throughput improved measurably (per the table, the default column-per-thread kernel takes roughly 29% of the prior kernel time, and the warp-per-column kernel about half), with no change to prefill.

### Testing

- All 26 `ContribOpLinearAttentionTest` parity tests pass (the decode kernels are exercised by the single-token, inverse-GQA, KGQA, and Qwen3.5-like cases):
  ```
  ./onnxruntime_provider_test --gtest_filter='*LinearAttention*'
  ```
- No public API or `present_state` layout change; the decode path is opt-in by shape and falls back to the existing kernels for unsupported `d_k` / `d_v`.

### Motivation and Context

Decode-throughput optimization for hybrid linear-attention + MoE models. No breaking changes; numerics and output layout are preserved.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
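The shape-based routing for the `LinearAttention` decode kernels can be restated as a short Python sketch. The function names and returned labels are hypothetical; only the conditions (`seq_len <= 16`, `d_k in {64,128,256}`, `d_v % 32 == 0`) and the `i*d_v + col` addressing come from the description above.

```python
DECODE_MAX_SEQ_LEN = 16
DECODE_D_K = (64, 128, 256)


def select_linear_attention_kernel(seq_len: int, d_k: int, d_v: int) -> str:
    """Pick a kernel label from the shape, following the stated dispatch rules."""
    if seq_len <= DECODE_MAX_SEQ_LEN and d_k in DECODE_D_K:
        # Column-per-thread is the default decode kernel when d_v is a multiple of 32.
        if d_v % 32 == 0:
            return "decode_col"
        # Otherwise the warp-per-column kernel handles any d_v.
        return "decode_warp"
    # Everything else stays on the existing recurrent (prefill-tuned) kernels.
    return "recurrent"


def state_offset(row: int, col: int, d_v: int) -> int:
    """Row-major [d_k, d_v] offset read by the thread that owns column `col`."""
    return row * d_v + col


if __name__ == "__main__":
    print(select_linear_attention_kernel(1, 128, 128))
    # Consecutive threads (columns) touch consecutive addresses for a fixed row.
    print([state_offset(3, col, 128) for col in range(4)])
```

The second print shows why no transpose is needed: for a fixed row, neighbouring columns map to neighbouring offsets.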

…(onnx#8068, microsoft#28904) (microsoft#28958)

## Summary

Aligns the opset-24 ONNX-domain `Attention` kernels (CPU + CUDA) with the ONNX errata onnx/onnx#8068 (tracking RFC onnx/onnx#8054) for the **external static KV-cache** path, keyed by `nonpad_kv_seqlen` (input #6), no `past_key`. **Addresses microsoft#28904.**

## What changed

1. **Bottom-right `is_causal` alignment.** Per batch, `offset[b] = nonpad_kv_seqlen[b] − q_sequence_length`; a query at in-block index `i` attends key `j` iff `j <= i + offset[b]`. Applied on CPU and on the CUDA Flash / Memory-Efficient (MEA) / unfused paths. The MEA causal-alignment selector is now offset-aware (no unconditional top-left when an external cache is present); Flash's native bottom-right + per-batch `seqlens_k` is used where eligible.
2. **Fully-masked-row → 0 guard (Bug-2).** A query row with no allowed key now outputs a **zero** row instead of mean-of-V (the finite-sentinel softmax result). Detected with an exact per-key structural predicate (`isneginf`-equivalent) and zeroed with **select (not multiply)** before `P @ V`, so `0 @ V = 0`. Added on CPU and the CUDA MEA path. The Flash `is_causal` + `seqlens_k` path (`offset >= 0`) cannot produce a fully-masked row and is intentionally left unguarded. Bool-mask conversion was already select-not-multiply on both EPs (Bug-1 satisfied; no change needed).
3. **Reject removal.** Removes the CUDA `NOT_IMPLEMENTED` reject for `is_causal` + `nonpad_kv_seqlen` with `S_q != total_kv` and no `past_key`: the spec now *defines* this result, so the op computes it rather than rejecting.

Full-prefill (`offset = 0`) and `past_key` decode paths remain **bit-identical**. Contrib `MultiHeadAttention` / `GroupQueryAttention` consume the shared FMHA kernels and are **unchanged**; only the ONNX-domain `Attention` dispatch is retargeted.

## Test coverage

- **C++ `AttentionTest` gtests: 73/73 pass**, including new bottom-right-offset, structural-empty causal row → 0 (CPU + CUDA), and fp16 fully-masked-row goldens.
- **Python `test_onnx_attention`: 277/0**, including the updated `test_tensorscatter_attention.py` (stale negative-reject → positive bottom-right acceptance).
- QA final gate: from-scratch Debug build green.

## Preemptive onnx#8068 node-test skips (de-skip TODO)

This branch adds the new onnx/onnx#8068 `Attention` backend node tests to both skip lists so they don't fail before the onnx dependency is bumped:

- `onnxruntime/test/testdata/onnx_backend_test_series_filters.jsonc`
- `onnxruntime/test/onnx/TestCase.cc` (C++ `GetBrokenTests`)

These are a **no-op on the current onnx pin (v1.21.0)**. **TODO (de-skip):** remove **both** skip lists once `cmake/external/onnx` is bumped to a release containing onnx#8068.

## Deferred follow-up

`q_seq > 1` Python bottom-right **parity** coverage requires upgrading the `test_onnx_attention` suite's numpy/torch reference functions from **total-kv-relative** causal (`offset = kv_seq − q_seq`) to **nonpad-relative** bottom-right (`offset = nonpad_kv_seqlen − q_seq`); a naive `is_causal=1` flip on the current refs is a no-op or a false failure against the correct kernel. The `q_seq > 1` / `nonpad < q_seq` behavior (including structural-empty rows) is already locked by the C++ gtest goldens. Tracked as follow-up.

## References

- onnx/onnx#8068: spec + reference errata (bottom-right `is_causal` on the `nonpad_kv_seqlen`/no-`past_key` path + composed `is_causal` + `attn_mask` NaN robustness). Separately pushed, CI green, awaiting SIG review.
- onnx/onnx#8054: RFC on offset-aware causal masking for KV-cache decode / chunked prefill.

---------

Signed-off-by: Ti-Tai Wang <titaiwang@microsoft.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
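The bottom-right `is_causal` rule and the fully-masked-row guard for `Attention` can be worked through on one batch entry in plain Python. This is a reference-style sketch, not the CPU or CUDA kernel: `attend_row` is a hypothetical helper, and only the predicate `j <= i + offset` with `offset = nonpad_kv_seqlen − q_sequence_length` and the "select zero" behaviour come from the description above.

```python
import math


def attend_row(scores, values, i, nonpad_kv_seqlen, q_len):
    """Softmax-weighted sum of `values` for query row i under bottom-right causal masking."""
    offset = nonpad_kv_seqlen - q_len
    allowed = [j <= i + offset for j in range(len(scores))]
    dim = len(values[0])
    if not any(allowed):
        # Structurally empty row: select a zero row instead of averaging V.
        return [0.0] * dim
    # Numerically stable softmax over the allowed keys only.
    peak = max(s for s, ok in zip(scores, allowed) if ok)
    weights = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(weights)
    return [sum(w / total * v[d] for w, v in zip(weights, values)) for d in range(dim)]


if __name__ == "__main__":
    values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    scores = [0.0, 0.0, 0.0]
    # q_len = 3 with only 2 valid keys gives offset = -1, so row 0 sees no key at all.
    for i in range(3):
        print(i, attend_row(scores, values, i, nonpad_kv_seqlen=2, q_len=3))
```

With `nonpad_kv_seqlen < q_len` the first query row has no allowed key, which is exactly the case the zero-row guard covers.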

Bumps [ws](https://github.com/websockets/ws) and [ws](https://github.com/websockets/ws). These dependencies needed to be updated together.

Updates `ws` from 7.5.10 to 7.5.11

Release notes, sourced from [ws's releases](https://github.com/websockets/ws/releases), for 7.5.11:

- Bug fixes: backported 2b2abd45 to the 7.x release line (e14c4586).

Commits:

- [`fd36cd8`](https://github.com/websockets/ws/commit/fd36cd864fcdf62a08273a99e19a7d975401fee8) [dist] 7.5.11
- [`e14c458`](https://github.com/websockets/ws/commit/e14c45861deca0cef60dec0f9109b694abebdf52) [security] Limit retained message parts
- See full diff in [compare view](https://github.com/websockets/ws/compare/7.5.10...7.5.11)

Updates `ws` from 6.2.3 to 6.2.4
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

…osoft#29161)

## Description

When `GroupQueryAttention` runs the CUDA FlashDecode fast-decode path with a sliding/local attention window (`local_window_size > 0`), the split-K planning was sized using the full `total_sequence_length` even though only the last `local_window_size` KV positions can contribute to the output. This caused local-window decode layers to over-split and run an unnecessary split-K combine pass.

This PR clamps the sequence length used for split planning to the local window size, so local-window decode no longer pays for splits/combine work it does not need. This is motivated by models that use local attention windows for GQA (e.g. gpt-oss-style decode with a small sliding window over a large KV cache).

## Summary of Changes

### Kernel dispatch

| File | Change |
|------|--------|
| `onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc` | In the FlashDecode fast-decode path, clamp the sequence length passed to `get_num_splits_and_buffer_sizes` to `local_window_size` when `local_window_size > 0`, so split-K planning reflects only the windowed KV range. |

### Tests

| File | Change |
|------|--------|
| `onnxruntime/test/python/transformers/test_gqa.py` | Add `test_gqa_local_window_large_context_decode` regression test: decode step (q_len=1) with a large past context (4096) and a small local window (128), verifying parity of the narrowed split planning. Skips when Flash Attention is unavailable. |

### Profiling helpers

| File | Change |
|------|--------|
| `onnxruntime/test/python/transformers/profile_gqa.py` | New nsys profiling helper for the GQA decode path, with a `--local-window-size` option and NVTX range markers. |
| `onnxruntime/test/python/transformers/profile_gqa.sh` | New shell wrapper that runs `nsys` profiling per precision mode and parses results with the shared `parse_nsys.py`; checks `nsys`/`nvtx` availability instead of mutating the environment. |

## Testing

- Unit test:
  ```bash
  cd onnxruntime/test/python/transformers
  PIPELINE_MODE=1 python test_gqa.py -k test_gqa_local_window_large_context_decode -v
  ```
- Existing FlashDecode parity coverage:
  ```bash
  PIPELINE_MODE=1 python test_gqa.py -k test_gqa_past_flash_attention -v
  ```
- Profiling (optional, requires an NVIDIA GPU + Nsight Systems):
  ```bash
  cd onnxruntime/test/python/transformers
  ./profile_gqa.sh --fp16 --past-sequence-length 4096 --local-window-size 128
  ```
  Observed on H200 (SM90, fp16, batch=2, num_heads=64, kv_num_heads=8, head_size=64): the split-K combine pass is eliminated for the local-window case and the main decode kernel time drops significantly versus the unclamped (full-context) split planning.
- Backward compatibility: behavior is unchanged when `local_window_size <= 0`; the clamp only applies on the FlashDecode fast-decode path with a positive local window.

## Motivation and Context

Local-window GQA decode layers only attend to the most recent `local_window_size` KV positions, so splitting and combining across the entire KV cache wastes split-K combine work. Clamping the split planning sequence length to the window size keeps the fast path correct while removing the redundant combine pass for windowed decode layers.

## Checklist

- [x] Tests added/updated
- [x] No breaking changes (behavior unchanged when `local_window_size <= 0`)
- [ ] Documentation updated (if applicable)
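The local-window clamp for GQA split planning amounts to one `min`, sketched below in Python. `split_planning_seqlen` and `kv_blocks` are hypothetical helpers, and the 128-position block size is an assumption used only to make the effect visible; the real split heuristic lives in `get_num_splits_and_buffer_sizes` and is not reproduced here.

```python
def split_planning_seqlen(total_sequence_length: int, local_window_size: int) -> int:
    """Sequence length to plan splits over: the window when one is set, else the full KV length."""
    if local_window_size > 0:
        return min(total_sequence_length, local_window_size)
    return total_sequence_length


def kv_blocks(seqlen: int, block: int = 128) -> int:
    """Number of KV blocks to cover `seqlen` (block size is an illustrative assumption)."""
    return (seqlen + block - 1) // block


if __name__ == "__main__":
    total = 4096 + 1  # past context plus the single decode token
    for window in (-1, 128):
        planned = split_planning_seqlen(total, window)
        print(window, planned, kv_blocks(planned))
```

With no window the planner sees the whole cache; with a 128-position window it sees a single block's worth of keys, so there is nothing to split or combine under this assumed block size.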

### Description

This PR removes the TensorRT fused **causal** attention kernels (the `fmha_v2_*_Causal_*` and `fmha_v2_flash_attention_*_Causal_*` cubins) and all of the code paths that selected them from the CUDA `Attention` operator.

These causal fused kernels were disabled by default (since [microsoft#14732](microsoft#14732)) and were only reachable via the opt-in `ORT_ENABLE_FUSED_CAUSAL_ATTENTION` environment variable / `TRT_CAUSAL_ATTENTION` backend bit. They used fp16 accumulation, which can cause accuracy drops, and have been superseded by flash attention, memory-efficient attention, and cuDNN SDPA. Removing them deletes ~1.27M lines of generated cubin source and simplifies the attention dispatch logic.

### Key Changes

- **Removed cubin sources**: Deleted all `causal/fmha_v2_fp16_Causal_*` and `flash_attention/fmha_v2_flash_attention_fp16_Causal_*` generated cubin files (70+ files).
- **Dispatch simplification** ([attention.cc](onnxruntime/contrib_ops/cuda/bert/attention.cc)): Removed the `is_unidirectional_` / causal fused-runner branch in `ComputeInternal`; the fused runner path now only handles the BERT (non-causal) case.
- **Kernel options** ([attention_kernel_options.cc](onnxruntime/contrib_ops/cuda/bert/attention_kernel_options.cc), [attention_kernel_options.h](onnxruntime/contrib_ops/cuda/bert/attention_kernel_options.h)): Removed `use_trt_causal_attention_`, `UseTrtCausalAttention()`, the `TRT_CAUSAL_ATTENTION` debug print, and the `causal` argument of `SetTrtFusedKernel`.
- **QKV format** ([attention_common.h](onnxruntime/contrib_ops/cpu/bert/attention_common.h), [attention_prepare_qkv.cu](onnxruntime/contrib_ops/cuda/bert/attention_prepare_qkv.cu)): Removed the `Q_K_V_BNSH_QKV_BS3NH` format and the fused-causal gemm-buffer-with-bias preparation path.
- **Runner API** ([mha_runner.cu](onnxruntime/contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/mha_runner.cu), [mha_runner.h](onnxruntime/contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/mha_runner.h), [fused_multihead_attention_v2.h](onnxruntime/contrib_ops/cuda/bert/tensorrt_fused_multihead_attention/fused_multihead_attention_v2.h)): Dropped the `causal` parameter from `FusedMHARunnerFP16v2::Create` / `IsSupported` and removed the causal kernel metadata.
- **Env var removed**: `ORT_ENABLE_FUSED_CAUSAL_ATTENTION` (`kEnableFusedCausalAttention`) is no longer recognized.
- **Callers updated**: [multihead_attention.cc](onnxruntime/contrib_ops/cuda/bert/multihead_attention.cc), [packed_attention.cc](onnxruntime/contrib_ops/cuda/bert/packed_attention.cc), [packed_multihead_attention.cc](onnxruntime/contrib_ops/cuda/bert/packed_multihead_attention.cc), [attention_impl.cu](onnxruntime/contrib_ops/cuda/bert/attention_impl.cu) updated to the new no-causal signatures.
- **Python helpers**: Removed stale `ORT_ENABLE_FUSED_CAUSAL_ATTENTION` references from the transformers benchmark helper and stable diffusion benchmark.
- **Tests updated**: [attention_op_test.cc](onnxruntime/test/contrib_ops/attention_op_test.cc) and [attention_kernel_options_test.cc](onnxruntime/test/providers/cuda/test_cases/attention_kernel_options_test.cc) no longer set/assert the causal-fused option.

### Motivation and Context

The fused causal kernels were off by default, carried potential fp16-accumulation accuracy risk, and added a large amount of generated cubin source to the repo. Causal attention is already well covered by flash attention, memory-efficient attention, and cuDNN SDPA, so these kernels can be safely removed to reduce binary size and simplify maintenance.

### Testing

- Build the CUDA EP and run the attention contrib op tests (`ContribOpAttentionTest.*`, including `Causal_EmptyPastState`).
- Run `AttentionKernelOptionsTest.*` to verify the kernel-option parsing no longer references the causal backend.

### Description

This PR fixes a convolution performance regression affecting some OCR models with large-kernel convolutions when the KleidiAI SME IGEMM convolution path is selected. The change has two parts:

1. Updates to the KleidiAI IGEMM LHS packing so that rows are packed in bounded chunks instead of packing the full LHS buffer up front, which reduces memory usage and improves cache locality for large convolutions.
2. A new route selection function, `ArmKleidiAI::SelectConvRoute`, that decides between `Igemm`, `GemmFallback` and `None` based on convolution parameters and a workload-size heuristic.

`CheckCapabilitiesSme` runs `SelectConvRoute` and returns true only if the selected route is `Igemm`. The patch also adds a standard GEMM fallback to the `ConvRoute` possibilities and runs `MlasGemm` if that fallback is selected. If the function selects `None`, the convolution falls back to `MlasSgemmOperation`.

### Motivation and Context

Fixes microsoft#27633.

---------

Signed-off-by: Qxiang Xu <Qixiang.Xu@arm.com>
Signed-off-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Co-authored-by: Jonathan Clohessy <Jonathan.Clohessy@arm.com>
Co-authored-by: Damien Dooley <damien.dooley@arm.com>

### Description

Two tests in the `NhwcTransformerTests` suite, `ConvFloat_UsesNhwcOnlyWithKleidi` and `FusedConvFloat_UsesNhwcOnlyWithKleidi`, assumed that whenever KleidiAI NHWC float-conv support is available, the test graph must be rewritten to `com.microsoft.NhwcFusedConv`.

That assumption is not valid when ONNX Runtime is built with `--enable_arm_neon_nchwc`. In that configuration, the Level 3 NCHWc transformer is registered before the NHWC transformer, so the NCHWc rewrite can be selected instead. The optimiser priority is intentional, so these tests shouldn't require NHWC to be chosen over NCHWc.

This change keeps the existing optimiser ordering and instead updates the assertions in the two tests. If the NHWC path is selected, the tests still validate the expected `NhwcFusedConv` graph shape and verify that the path is only used when KleidiAI NHWC support is available. If another valid layout optimisation is selected first, the tests no longer fail just because the `NhwcFusedConv` op isn't generated.

### Motivation and Context

This change fixes the two unit tests named above, which fail when ONNX Runtime is built and tested with `--enable_arm_neon_nchwc`.

---------

Signed-off-by: Martin Klacer <martin.klacer@arm.com>

…osoft#29166)

### Description

Make fp16 accumulation the default for the CUDA QMoE GEMV fast path when activations are fp16. The previous fp32 accumulation behavior remains available as an opt-in fallback with `ORT_MOE_GEMV_FP32_ACCUM=1`, and bf16 activations continue to use fp32 accumulation.

This is motivated by GPT-OSS-20B decode measurements where fp16 accumulation was close in accuracy to the fp32 path and materially faster.

### Changes

- Invert the QMoE GEMV accumulation environment knob:
  - default fp16 accumulation for fp16 activations
  - `ORT_MOE_GEMV_FP32_ACCUM=1` restores fp32 accumulation
  - bf16 stays on fp32 accumulation
- Document the new runtime knob in the QMoE CUDA docs.
- Add the standalone helper, full-model decode, and MMLU smoke measurements to the QMoE GEMV experiment log.

### Measurements

| Measurement | Default fp16 accumulation | `ORT_MOE_GEMV_FP32_ACCUM=1` |
|---|---:|---:|
| Standalone GPT-OSS QMoE helper latency | 0.0708 ms | 0.0812 ms |
| Helper FC1 SwiGLU GEMV avg | 13.93 us | 21.57 us |
| Helper FC2 GEMV avg | 10.14 us | 12.24 us |
| Full GPT-OSS CUDA-graph decode latency | 2.588930 ms/token | 2.827260 ms/token |
| Full GPT-OSS CUDA-graph decode throughput | 386.259956 tok/s | 353.699315 tok/s |

The full-model A/B shows about +9.2% decode throughput for the default fp16 accumulation path versus the fp32 fallback in this run.

### Accuracy

Prior 1000-sample MMLU smoke runs matched pooled accuracy for both modes:

| Mode | Pooled accuracy |
|---|---:|
| fp32 accumulation | 0.8260 |
| fp16 accumulation | 0.8260 |

### Testing

- `lintrunner -a onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemv.cu`
- `cmake --build /home/tianlei/onnxruntime/build/cu130/Release --target onnxruntime_providers_cuda --parallel $(nproc)`
- `git diff --check -- onnxruntime/contrib_ops/cuda/llm/moe_gemm/moe_gemv.cu docs/contrib_ops/cuda/qmoe_gemv_experiments.md docs/contrib_ops/cuda/moe_qmoe.md`
- Standalone QMoE helper A/B on `gpt_oss_20b_m1_top4_fp16_2880x2880_e32`
- Full GPT-OSS CUDA-graph decode A/B
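The +9.2% throughput figure for fp16 accumulation follows directly from the two decode latencies in the measurements table; a short Python check makes the arithmetic explicit. The helper name `tokens_per_second` is introduced here for illustration only.

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to tokens per second."""
    return 1000.0 / ms_per_token


FP16_MS = 2.588930  # default fp16 accumulation, ms/token (from the table)
FP32_MS = 2.827260  # ORT_MOE_GEMV_FP32_ACCUM=1, ms/token (from the table)

fp16_tps = tokens_per_second(FP16_MS)
fp32_tps = tokens_per_second(FP32_MS)
gain_percent = (fp16_tps / fp32_tps - 1.0) * 100.0

if __name__ == "__main__":
    print(f"fp16: {fp16_tps:.2f} tok/s, fp32: {fp32_tps:.2f} tok/s, gain: {gain_percent:.1f}%")
```

The throughput ratio equals the inverse latency ratio, so the same percentage can be read off either row of the table.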

…icrosoft#29162)

## Description

This PR enables the XQA decode kernel for the CUDA `GroupQueryAttention` (GQA) operator when an attention-sink input (`head_sink`) is present, the common pattern in GPT-OSS style decode models. The sink is treated as a smooth-softmax term, and a `PrePack` step converts a constant `head_sink` initializer to a cached FP32 buffer once at session init to avoid a per-step conversion. XQA now turns on by default for the `head_sink` decode path while preserving the existing `ORT_ENABLE_XQA` opt-in/opt-out semantics for all other non-quantized cases.

## Summary of Changes

### Kernel: XQA dispatch and head_sink handling

| File | Change |
|------|--------|
| `onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc` | Add `PrePack` that caches a constant `head_sink` initializer as FP32 (`xqa_head_sink_`); allow XQA when `head_sink` is present (smooth-softmax via attention sink); default XQA on for the `head_sink` decode path; add `xqa_force_disabled_` so an explicit `ORT_ENABLE_XQA=0` always wins; reserve per-launch FP32 scratch when `head_sink` is dynamic (not prepacked). |
| `onnxruntime/contrib_ops/cuda/bert/group_query_attention.h` | Add `PrePack` declaration and members: `xqa_head_sink_`, `xqa_head_sink_count_`, `xqa_force_disabled_`. |
| `onnxruntime/contrib_ops/cuda/bert/group_query_attention_impl.cu` / `.h` | Add `LaunchConvertHeadSinkToFloat` to convert FP16/BF16 `head_sink` to FP32 for XQA. |
| `onnxruntime/contrib_ops/cuda/bert/attention_data.h` | Add `xqa_head_sink` (FP32 sink pointer) and `xqa_head_sink_needs_conversion` to `GroupQueryAttentionData`. |
| `onnxruntime/contrib_ops/cuda/bert/attention_kernel_options.{cc,h}` | Add `use_xqa` to the debug-info print so `SdpaKernel=XQA` is reported. |

### XQA loaders: attention-sink plumbing

- `onnxruntime/contrib_ops/cuda/bert/xqa/xqa_loader*.{cu,cuh,h}` and `xqa_impl_gen.cuh`: thread the FP32 attention-sink pointer through the FP16/BF16 (and int8/fp8 KV) XQA loader entry points.

### Tests and docs

| File | Change |
|------|--------|
| `onnxruntime/test/python/transformers/test_gqa.py` | Add `TestXQAHeadSinkParity` with runtime and PrePack (`head_sink` as initializer) parity cases; add `has_xqa()` skip guard; `setUp` clears `ORT_ENABLE_XQA` to exercise the real default-on behavior. |
| `onnxruntime/test/python/transformers/gqa_test_helper.py` | Support `head_sink` plumbing for the new tests. |
| `onnxruntime/test/python/transformers/profile_gqa.py` | Minor `head_sink` profiling support. |
| `docs/contrib_ops/gqa.md` | New document describing the GQA operator, inputs/attributes, and XQA selection defaults (quantized on; non-quantized `head_sink` on; otherwise opt-in via `ORT_ENABLE_XQA`; `ORT_ENABLE_XQA=0` disables). |

## Testing

- Run the XQA parity suites on an Ampere+ GPU:
  ```
  cd onnxruntime/test/python/transformers
  python -m pytest test_gqa.py -k "TestXQAHeadSinkParity or TestXQAQuantizedParity" -q
  ```
  All 224 cases pass on H200 (SM90). Tests skip automatically on devices without XQA support.
- Kernel selection was verified via `ORT_ENABLE_ATTENTION_KERNEL_DEBUG_INFO=1`:
  - `head_sink` present, env unset → `SdpaKernel=XQA`.
  - `head_sink` present, `ORT_ENABLE_XQA=0` → falls back to `SdpaKernel=FLASH_ATTENTION` (parity still passes).
  - Non-`head_sink` decode is unchanged (still Flash / cuDNN).
- Backward compatibility: Flash/fallback paths keep the original FP16/BF16 `head_sink`, so XQA is a performance default, not a correctness requirement; `ORT_ENABLE_XQA=1/0` semantics are preserved.

## Motivation and Context

GPT-OSS style decode models use a per-head attention sink. Routing these decode steps through XQA improves decode latency, and prepacking the constant sink to FP32 removes a per-step conversion.

This PR targets `main` but depends on PR microsoft#29161 (FlashDecode split planning for local-window GQA), which should merge first. Until then this PR's diff also includes the microsoft#29161 commit; once microsoft#29161 lands on `main`, GitHub's merge-base will drop it and this PR's diff will contain only the XQA change.

## Checklist

- [x] Tests added/updated
- [x] Documentation updated (`docs/contrib_ops/gqa.md`)
- [x] No breaking changes (existing `ORT_ENABLE_XQA` semantics preserved)
- [x] CI passes
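The XQA selection defaults listed for `docs/contrib_ops/gqa.md` can be read as a small decision function. The Python sketch below is an interpretation, not the operator's code: `xqa_selected` is a hypothetical name, the string comparison against `"0"` / `"1"` is an assumption about how the variable is parsed, and device or shape eligibility checks are left out entirely.

```python
from typing import Optional


def xqa_selected(env_value: Optional[str], quantized_kv: bool, has_head_sink: bool) -> bool:
    """Decide whether XQA would be chosen, given ORT_ENABLE_XQA and the decode case."""
    if env_value == "0":
        # An explicit opt-out always wins.
        return False
    if env_value == "1":
        # An explicit opt-in enables XQA for cases that are off by default.
        return True
    # Env unset: quantized KV and non-quantized head_sink decode default to on.
    return quantized_kv or has_head_sink


if __name__ == "__main__":
    print(xqa_selected(None, quantized_kv=False, has_head_sink=True))
    print(xqa_selected("0", quantized_kv=False, has_head_sink=True))
    print(xqa_selected(None, quantized_kv=False, has_head_sink=False))
```

The three printed cases correspond to the kernel-selection checks reported in the testing notes: default on with `head_sink`, forced off with `ORT_ENABLE_XQA=0`, and unchanged behaviour without `head_sink`.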

microsoft#29040)

### Description

Adds a patch for Dawn's `emdawnwebgpu_headers_gen_add` macro that inserts `cmake -E make_directory` before the `cmake -E copy` step, fixing a race condition where parallel build jobs (`-j32`) copy `webgpu_glfw.h` / `webgpu_enum_class_bitmasks.h` into a directory that hasn't been created yet.

**Files changed:**

- `cmake/patches/dawn/dawn_parallel_build_fix.patch`: new patch targeting `src/emdawnwebgpu/CMakeLists.txt` in Dawn commit `ec7b457`:
  ```diff
  +     COMMAND ${CMAKE_COMMAND} -E make_directory "${EM_BUILD_GEN_DIR}/include/webgpu"
        COMMAND ${CMAKE_COMMAND} -E copy
  ```
- `cmake/external/onnxruntime_external_deps.cmake`: registers the new patch in `ONNXRUNTIME_Dawn_PATCH_COMMAND`

### Motivation and Context

The `wasm_Release / build-wasm` CI job was failing at "Build (simd + threads + WebGPU experimental)" with:

```
Error copying file "…/include/webgpu/webgpu_enum_class_bitmasks.h" to "…/gen/src/emdawnwebgpu/include/webgpu/webgpu_enum_class_bitmasks.h"
```

Dawn's copy custom commands in `emdawnwebgpu_headers_gen_add` have no CMake dependency on the `DawnJSONGenerator` that would normally create `gen/src/emdawnwebgpu/include/webgpu/` first. With high parallelism the copy commands win the race and fail because the destination directory does not yet exist.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>

### Description

Bumps the protobufjs lockfile entry used by onnxruntime-node from 7.6.0 to 7.6.3. The lockfile also updates @protobufjs/eventemitter from 1.1.0 to 1.1.1, matching protobufjs 7.6.3 dependency metadata. No package.json range change is needed because the existing ^7.2.4 range already permits 7.6.3.

### Motivation and Context

GHSA-f38q-mgvj-vph7 / CVE-2026-54269 reports that protobufjs versions <=7.6.2 are affected by schema-derived names that can shadow runtime-significant properties and make affected processing paths unusable. Version 7.6.3 is the patched 7.x release.

PR microsoft#29061 already updated /js/web to protobufjs 7.6.3, but /js/node/package-lock.json still pinned protobufjs 7.6.0. This change brings the Node.js package lockfile in line with the patched version.

Validation:

- `npm ci` in js/node
- `npm run prepare` in js/node
- `npm audit --json` in js/node returned 0 vulnerabilities
- `npm ls protobufjs @protobufjs/eventemitter --depth=0` shows protobufjs@7.6.3
- Loaded ./test/ort-schema/protobuf/onnx.js with protobufjs/minimal

`npm test` was attempted. It prepared the Node.js test data and then stopped because the local tree does not have the native binding js/node/bin/napi-v6/win32/x64/onnxruntime_binding.node; this requires a local native build and is unrelated to the protobufjs lockfile update.
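The claim that `^7.2.4` already permits 7.6.3 follows from the caret rule: a caret range allows any version at or above the base that does not change the leftmost non-zero component. A minimal Python sketch of that check follows; `caret_allows` is a hypothetical helper written for illustration, it handles plain `MAJOR.MINOR.PATCH` strings only, and it ignores prerelease tags.

```python
from typing import Tuple


def parse(version: str) -> Tuple[int, int, int]:
    """Parse 'MAJOR.MINOR.PATCH' into a tuple of ints."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def caret_allows(base: str, candidate: str) -> bool:
    """Return True if candidate satisfies the caret range ^base."""
    b, c = parse(base), parse(candidate)
    if c < b:
        return False
    # The exclusive upper bound bumps the leftmost non-zero component.
    if b[0] > 0:
        upper = (b[0] + 1, 0, 0)
    elif b[1] > 0:
        upper = (0, b[1] + 1, 0)
    else:
        upper = (0, 0, b[2] + 1)
    return c < upper


if __name__ == "__main__":
    print(caret_allows("7.2.4", "7.6.3"))  # the patched release
    print(caret_allows("7.2.4", "8.0.0"))  # next major
```

For `^7.2.4` the bounds are `>=7.2.4` and `<8.0.0`, so both 7.6.0 and 7.6.3 fall inside the range and only the lockfile pin needs to move.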

…t#29140)

### Description

Fixes microsoft#28388.

Large-head Memory Efficient Attention kernels require an explicit `cudaFuncAttributeMaxDynamicSharedMemorySize` opt-in when their dynamic shared-memory requirement exceeds `0xc000`. The default batched kernel and the custom right-padding kernel are distinct device functions, but the previous function-local `once` configured only whichever function pointer was selected on the first invocation. If the other variant ran later in the same process, it launched without the required opt-in and failed with excessive shared-memory usage.

This change:

- keeps separate thread-safe initialization state for the default and right-padding kernel variants;
- applies `cudaFuncSetAttribute` to the exact kernel that will launch;
- checks the CUDA API result with `CUDA_CALL_THROW`; and
- launches the matching kernel directly after its own shared-memory initialization.

### Tests

- Added `TestONNXAttentionGQALargeHeadNonpadMEA`, which runs the default large-head MEA kernel first and then the right-padding/nonpad variant in the same process. This ordering reproduces the original process-global initialization bug and verifies that both distinct kernels opt in independently.
- `python3 -m py_compile onnxruntime/test/python/transformers/test_onnx_attention/test_gqa.py`
- `git diff --check`

The targeted CUDA test requires an ONNX Runtime CUDA build and a compatible GPU, so execution is delegated to CI.

---------

Signed-off-by: Kevin-Li-2025 <2242139@qq.com>
Co-authored-by: Kevin-Li-2025 <2242139@qq.com>
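The initialization bug can be illustrated without a GPU. The sketch below is a pure-Python model, not the CUDA code from the PR: `Kernel`, `launch`, `SharedOnceLauncher`, and `PerKernelLauncher` are hypothetical names, the `opted_in` flag stands in for the effect of `cudaFuncSetAttribute`, and the `0x10000` shared-memory size is an arbitrary value above the `0xc000` threshold.

```python
import threading

SMEM_OPT_IN_THRESHOLD = 0xC000  # threshold quoted in the description


class Kernel:
    def __init__(self, name: str, smem_bytes: int) -> None:
        self.name = name
        self.smem_bytes = smem_bytes
        self.opted_in = False  # models the per-function attribute


def launch(kernel: Kernel) -> None:
    """Fail like an over-budget launch if the opt-in is missing."""
    if kernel.smem_bytes > SMEM_OPT_IN_THRESHOLD and not kernel.opted_in:
        raise RuntimeError(f"{kernel.name}: shared memory opt-in missing")


class SharedOnceLauncher:
    """Buggy model: one 'once' flag shared by every kernel variant."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._done = False

    def run(self, kernel: Kernel) -> None:
        with self._lock:
            if not self._done:
                kernel.opted_in = True  # only the first kernel seen
                self._done = True
        launch(kernel)


class PerKernelLauncher:
    """Fixed model: separate initialization state per kernel variant."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._done = set()

    def run(self, kernel: Kernel) -> None:
        with self._lock:
            if kernel.name not in self._done:
                kernel.opted_in = True  # the exact kernel being launched
                self._done.add(kernel.name)
        launch(kernel)


if __name__ == "__main__":
    buggy = SharedOnceLauncher()
    buggy.run(Kernel("default", 0x10000))
    try:
        buggy.run(Kernel("right_padding", 0x10000))
    except RuntimeError as exc:
        print("shared once:", exc)
    fixed = PerKernelLauncher()
    fixed.run(Kernel("default", 0x10000))
    fixed.run(Kernel("right_padding", 0x10000))
    print("per-kernel once: both launches succeeded")
```

Running the default variant first and the right-padding variant second mirrors the ordering that `TestONNXAttentionGQALargeHeadNonpadMEA` uses to reproduce the bug.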

tianleiwu and others added 14 commits June 18, 2026 15:01

Merge remote-tracking branch 'origin/master' into sync_msft_21062026

9c43178

ai-fw-intg requested review from Jaswanth51, ankitm3k, jatinwadhwa921 and vthaniel June 20, 2026 20:33

ai-fw-intg wants to merge 14 commits into ovep-develop from sync_msft_21062026

8 participants
