intel · ai-fw-intg · Jun 18, 2026 · Jun 18, 2026 · Jun 19, 2026 · Jun 19, 2026
diff --git a/.agents/skills/cuda-attention-kernel-patterns/SKILL.md b/.agents/skills/cuda-attention-kernel-patterns/SKILL.md
diff --git a/.agents/skills/cuda-cutlass-fmha-incremental-rebuild/SKILL.md b/.agents/skills/cuda-cutlass-fmha-incremental-rebuild/SKILL.md
@@ -0,0 +1,111 @@
+---
+name: cuda-cutlass-fmha-incremental-rebuild
+description: >
+  Use when rebuilding ONNX Runtime CUDA after editing CUTLASS fused-MHA headers
+  (onnxruntime/contrib_ops/cuda/bert/cutlass_fmha/*.h such as kernel_forward.h or
+  fmha_launch_template.h), or when a header edit "passed" an incremental build but
+  test behavior did not change. Explains the nvcc depfile gotcha that produces stale
+  Memory-Efficient-Attention (MEA) kernels and binaries, and how to force a correct
+  recompile. Also covers disk-space frugality on shared GPU dev boxes.
+---
+
+# Incremental rebuilds silently use STALE CUTLASS fused-MHA kernels
+
+> The **general** false-green principles (stale binary, wrong-artifact mtime) are summarised
+> in the `ort-test` skill's "False-green taxonomy". This skill is the CUDA/CUTLASS-specific
+> detail.
+
+## The gotcha (verification-integrity bug)
+
+`nvcc`-generated depfiles do **not** track the CUTLASS fused-MHA headers under
+`onnxruntime/contrib_ops/cuda/bert/cutlass_fmha/` (e.g. `kernel_forward.h`,
+`fmha_launch_template.h`). These headers are `#include`d by the `fmha_sm*.cu`
+translation units, but the build system does not record that dependency.
+
+Consequence: after you edit one of those headers, an **incremental** `build.sh`:
+
+- does **not** recompile `fmha_sm*.cu`,
+- reports `[100%] Built target ...` and exits 0,
+- leaves the artifacts that should have been rebuilt (the `fmha_sm*.cu.o` objects and the
+  `libonnxruntime_providers_cuda.so` they link into) **unchanged**, with the same `mtime`
+  as the pre-edit build.
+
+(Do **not** use the gtest test-exe mtime as the stale symptom: in the shared-provider
+build the exe `dlopen`s the `.so` and is **not** relinked, so its mtime stays old even
+after a *correct* rebuild — see "How to confirm" below. The reliable diagnostic signal
+is the `fmha_sm*.cu.o` / `.so` mtime.)
+
+So your "successful" rebuild is running the **old** kernel. Tests that should now
+pass (or fail) reflect the previous code, not your edit. This silently invalidates
+any FAIL→PASS / PASS→FAIL verification.
+
+## The fix — force recompile the .cu units
+
+Before rebuilding after editing any `cutlass_fmha/*.h` header:
+
+```bash
+touch onnxruntime/contrib_ops/cuda/bert/cutlass_fmha/*.cu
+```
+
+Then run the normal build command. This forces the `fmha_sm*.cu` translation units
+(and downstream binaries) to recompile against your header change.
+
+## How to confirm the rebuild was real (don't trust "[100%] Built")
+
+Confirm that the artifact which actually **links** the recompiled `fmha_sm*.cu.o`
+is newer than your header edit.
+
+⚠️ **Do NOT just check the test EXE mtime — it can falsely flag a good build as
+stale.** In the shared-provider build configuration (the default here), the CUDA
+execution provider is a **shared module**: the recompiled `fmha_sm*.cu.o` objects link into
+`libonnxruntime_providers_cuda.so`, and the `onnxruntime_provider_test` executable
+**dlopens** that `.so` — it is **not relinked**. So after a *correct* rebuild the
+test exe `mtime` stays **old** while the `.so` advances. Checking the exe alone
+would wrongly conclude the build was stale.
+
+Check the right artifact for your link mode:
+
+- **Shared-provider build (default):** the `.so` that links the recompiled `.o` —
+  `build/<dir>/<cfg>/libonnxruntime_providers_cuda.so`
+- **Statically-linked provider:** the test exe itself (`onnxruntime_provider_test`)
+
+Safest check — `stat` both the recompiled object and the `.so`, and confirm BOTH
+are newer than the header edit:
+
+```bash
+stat -c '%y %n' onnxruntime/contrib_ops/cuda/bert/cutlass_fmha/kernel_forward.h
+# in your build dir, e.g. build/Debug_quickbuild/Debug/:
+stat -c '%y %n' libonnxruntime_providers_cuda.so
+# and the actual recompiled object (path varies by build dir):
+find . -name 'fmha_sm80.cu.o' -exec stat -c '%y %n' {} +
+```
+
+If the `.so` (and the `fmha_sm*.cu.o`) timestamps are older than (or equal to) the
+header edit, the build was stale — `touch` the `.cu` files and rebuild. The most
+reliable signal of all is behavioral: a test that was failing now passes (a stale
+binary cannot flip its result).
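+
+The mtime comparison described in this section can be wrapped in a small shell helper.
+This is a minimal sketch, not part of ORT: `is_newer_than` is a hypothetical name, and it
+relies only on bash's built-in `-nt` file test.
+
+```bash
+# Hypothetical helper: succeeds only if ARTIFACT exists and its mtime is
+# strictly newer than HEADER's. Missing files count as "not newer".
+is_newer_than() {
+  local artifact="$1" header="$2"
+  [ -e "$artifact" ] && [ -e "$header" ] && [ "$artifact" -nt "$header" ]
+}
+```
+
+Usage, assuming the example build dir named above:
+`is_newer_than build/Debug_quickbuild/Debug/libonnxruntime_providers_cuda.so onnxruntime/contrib_ops/cuda/bert/cutlass_fmha/kernel_forward.h || echo "STALE: touch the .cu files and rebuild"`.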
+
+## Related: pick the right test binary
+
+This is the **CUDA/CUTLASS instance of false-green mode 1** (zero-match / wrong binary) —
+see the `ort-test` skill's "False-green taxonomy" for the general principle. In short:
+attention/MEA/Flash boundary gtests (e.g. `FlashStructuralEmptyRows*`,
+`Attention_Causal_NonPadKVSeqLen_MEA_*`) live in **`onnxruntime_provider_test`**, which CI
+runs; `onnxruntime_test_all` does not contain them and gives a false green. Verify the
+MEA/Flash boundary fix against `onnxruntime_provider_test`.
+
+## Related: disk frugality on shared GPU dev boxes
+
+Full ORT CUDA builds are large (test binaries ~1 GB each; a build dir can reach
+tens of GB). On a shared box, `/home` filling to 100% makes builds fail in
+non-obvious places — e.g. `git submodule sync` reporting `No space left on device`
+or a `config.lock` error, not an obvious "disk full" at the compile step.
+
+Before a big rebuild, check free space and clean only clearly-stale, regenerable
+build directories (old dated experiment dirs). Never delete another agent's active
+build dir or anything ambiguous:
+
+```bash
+df -h /home
+du -sh build/* | sort -h
+```
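+
+The free-space check can also be turned into a guard that runs before a build starts. A
+minimal sketch: `has_free_gb` is a hypothetical helper, and the threshold is an assumption
+you should size to your own build dir (the text above only says "tens of GB").
+
+```bash
+# Hypothetical guard: succeeds only if the filesystem holding PATH has at
+# least MIN_GB gibibytes available. Uses POSIX-format df output (-P) in
+# 1024-byte blocks (-k); column 4 of the data row is "Available".
+has_free_gb() {
+  local path="$1" min_gb="$2" avail_kb
+  avail_kb=$(df -Pk "$path" | awk 'NR==2 {print $4}')
+  [ "$avail_kb" -ge $((min_gb * 1024 * 1024)) ]
+}
+```
+
+Usage, with an assumed 50 GB threshold:
+`has_free_gb /home 50 || echo "low disk: clean stale build dirs before rebuilding"`.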
diff --git a/.agents/skills/ort-build/SKILL.md b/.agents/skills/ort-build/SKILL.md
@@ -66,6 +66,7 @@ You do **not** need `--update` when only modifying existing `.cc`/`.h` files —
 | `--use_cuda` | Enable CUDA EP. Requires `--cuda_home`/`--cudnn_home` or `CUDA_HOME`/`CUDNN_HOME` env vars. On Windows, only `cuda_home`/`CUDA_HOME` is validated. |
 | `--target T` | Build a specific CMake target (requires `--build`; e.g., `onnxruntime_common`, `onnxruntime_test_all`) |
 | `--use_webgpu` | Enable WebGPU EP. To run its tests locally on Linux without a GPU, see the `webgpu-local-testing` skill. |
+| `--cmake_extra_defines onnxruntime_QUICK_BUILD=ON` | Faster CUDA build: instantiates a reduced kernel set. **Side effect:** Flash is compiled for head_dim 128 only, so most attention shapes fall back to **MEA** (changes which attention kernel is compiled/dispatched). Don't use it to characterize Flash-vs-arch behavior. |
 | `--build_dir` | Build output directory |
 
 ## Build output path
@@ -79,6 +80,15 @@ It may be customized with `--build_dir`.
 ## Agent tips
 
 - **Activate a Python virtual environment** before building. See "Python > Virtual environment" in `AGENTS.md`.
+- **Build flags can silently reroute which kernel/code path executes.** A build option can
+  change *which* kernel is compiled, and therefore which code path actually runs — so a CI
+  failure can live in a different code path than your local build exercises. Before
+  hypothesizing a hardware- or algorithm-specific cause (e.g. "this GPU arch miscomputes"),
+  first identify **which kernel actually ran** for the failing configuration (see the
+  `ort-test` skill → "Verify which path/kernel actually executed"). Concrete instance:
+  `onnxruntime_QUICK_BUILD=ON` compiles FlashAttention for head_dim 128 only, so most
+  attention shapes silently dispatch to Memory-Efficient Attention instead of Flash —
+  details in the `cuda-attention-kernel-patterns` skill.
 - **Prefer `python tools/ci_build/build.py` directly** over `build.bat`/`build.sh` when redirecting output. The `.bat` wrapper runs in `cmd.exe`, which breaks PowerShell redirection.
 - **Redirect output to a file** (e.g., `> build_log.txt 2>&1`). Build output is large and will overflow terminal buffers.
 - **Run builds in the background** — a full build can take tens of minutes to over an hour. Poll the log for `"Build complete"` or errors.

diff --git a/.agents/skills/ort-test/SKILL.md b/.agents/skills/ort-test/SKILL.md
@@ -16,6 +16,20 @@ ONNX Runtime uses **Google Test** for C++ and **unittest** (preferred) / **pytes
 | `onnxruntime_test_all` | Core framework, graph, optimizer, session tests |
 | `onnxruntime_provider_test` | Operator/kernel tests (Conv, MatMul, etc.) across execution providers |
 
+### Two `attention_op_test.cc` files — don't confuse them
+
+There are two same-named files testing **different operators**. Both build into
+`onnxruntime_provider_test`:
+
+| Path | Operator | gtest suite |
+|---|---|---|
+| `test/providers/cpu/llm/attention_op_test.cc` | **ONNX-domain** `Attention` (opset 23/24) | `AttentionTest.*` |
+| `test/contrib_ops/attention_op_test.cc` | **contrib** MultiHeadAttention / GroupQueryAttention | `ContribOpAttentionTest.*` |
+
+The MEA negative-offset regression tests (`Attention_Causal_NonPadKVSeqLen_MEA_*`,
+e.g. `..._MEA_NegOffset_ForceFlashDisabled_FP16_CUDA`) live in the **providers/cpu/llm** file —
+the ONNX-domain op.
+
 Use `--gtest_filter` to select specific tests:
 
 ```bash
@@ -76,7 +90,62 @@ Python test naming convention: `test_<method>_<expected_behavior>_[when_<conditi
 ## Agent tips
 
 - **Activate a Python virtual environment** before running tests. See "Python > Virtual environment" in `AGENTS.md`.
+- **Beware false-green results** — a green run does not always prove anything. See the
+  "False-green taxonomy" section below for the five ways a test can pass without testing
+  your change.
 - **Redirect test output to a file** (e.g., `> test_output.txt 2>&1`) — output can be large.
 - For C++ tests, verify the build directory exists and a prior build completed before running.
 - Use `--gtest_filter` to run a targeted subset when the full suite takes too long.
 - **Running WebGPU tests locally on Linux without a GPU** — WebGPU op tests build into `onnxruntime_provider_test` and can run against a software Vulkan adapter (Mesa lavapipe). See the `webgpu-local-testing` skill.
+
+## False-green taxonomy — ways a test can "pass" without proving anything
+
+A green result is not always a real pass. Watch for all five modes:
+
+1. **Zero-match filter.** A `--gtest_filter` that matches no tests still exits 0 (green).
+   Confirm the `[==========] N tests ran` line is non-zero — a zero-match run prints
+   `0 tests from 0 test suites`. Many operator/kernel gtests run only in
+   **`onnxruntime_provider_test`** (CI runs this), NOT `onnxruntime_test_all`; the wrong
+   binary matches nothing and looks green.
+2. **Stale binary from an incremental build.** If the build did not actually recompile your
+   change (e.g. a header not tracked by the compiler's depfile), the "passing" run executes
+   the OLD code. A test that was failing cannot truly flip to passing without a real
+   rebuild — treat an unexpected FAIL→PASS with suspicion and confirm the linked artifact's
+   mtime advanced. CUDA/CUTLASS instance (nvcc depfiles don't track `cutlass_fmha/*.h`): see
+   the `cuda-cutlass-fmha-incremental-rebuild` skill.
+3. **Checking the wrong artifact's freshness.** With a dlopen'd shared provider (e.g.
+   `libonnxruntime_providers_cuda.so`), the test executable is NOT relinked when the provider
+   recompiles — its mtime stays old while the `.so` advances. Verify the artifact that
+   actually links your change, not the test exe. Detail: `cuda-cutlass-fmha-incremental-rebuild`
+   skill.
+4. **A correct fallback path masks the intended path.** A value-only assertion can pass via a
+   *different, correct* code path without ever exercising the one you meant to test (e.g. a
+   test meant for MEA silently handled by the unfused fallback). Assert/verify **which path
+   ran**, not just the output value — see "Verify which path/kernel actually executed" below.
+5. **Arch-portability false-green (verified on only one GPU arch).** A CUDA kernel that
+   launches on a large-dynamic-smem arch (e.g. sm90/H100, ~227KB) can **fail to launch** on a
+   smaller opt-in cap (sm86/89 ~99KB, sm80 ~163KB) with `CUDA failure 1: invalid argument` —
+   and a path with no fallback (e.g. ORT's MEA) turns that into a hard error, not a silent
+   degrade. So a green run on your local GPU can mask a launch failure on CI's arch. Verify
+   arch-portability, or pick a config whose shared-memory footprint fits **every** target arch
+   (e.g. a small `head_size`). Concrete instance: CUTLASS MEA `head_size=512` FP16 exceeds
+   sm86's smem opt-in cap and dies at launch — live bug #28388 (the
+   `cuda-attention-kernel-patterns` skill §1 has the dispatch detail).
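+
+Mode 1 can be checked mechanically by reading the gtest summary line from a captured log.
+A minimal sketch, assuming the test output was redirected to a file and that the summary
+has gtest's usual `[==========] N tests from M test suites ran.` shape; `gtest_ran_count`
+is a hypothetical helper name.
+
+```bash
+# Hypothetical helper: print the number of tests gtest reports as having run,
+# taken from the final "[==========] N test(s) from ... ran." summary line.
+# Prints nothing if the log contains no such line.
+gtest_ran_count() {
+  sed -n 's/^\[==========] \([0-9][0-9]*\) tests\{0,1\} from .* ran\..*/\1/p' "$1" | tail -n 1
+}
+```
+
+Usage: `n=$(gtest_ran_count test_output.txt); [ "${n:-0}" -gt 0 ] || echo "false green: no tests ran"`.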
+
+## Verify which path/kernel actually executed
+
+Value equality alone does not prove the intended code path ran — a correct fallback can
+produce the right answer (false-green mode 4 above). When a test targets a specific
+kernel/path, confirm it actually dispatched there instead of trusting the output:
+
+- Enable verbose logging and check the dispatch log line. ORT attention logs one of these
+  exact strings (`core/providers/cuda/llm/attention.cc`):
+  - `ONNX Attention: using Flash Attention` (:1400)
+  - `ONNX Attention: using Memory Efficient Attention` (:1451)
+  - `Attention: using unified unfused path` (:1482) — note: **no `ONNX ` prefix** and it
+    reads "unified unfused path", not "Unfused".
+- Or force the path via the relevant env var / build config AND add a compile-time guard so
+  the test **SKIPs** (not silently passes) when the target path is unavailable — e.g.
+  `SKIP_IF_MEA_NOT_COMPILED`.
+
+Operator-specific routing/forcing details: `cuda-attention-kernel-patterns` skill §1/§7.
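+
+The log check above can be scripted against a captured verbose log. A minimal sketch:
+`attention_paths_in_log` is a hypothetical helper, it assumes the verbose output is already
+in a file, and it searches only for the three dispatch strings quoted above.
+
+```bash
+# Hypothetical helper: print each distinct attention dispatch string found in
+# LOG, one per line. Empty output means none of the three strings appeared.
+attention_paths_in_log() {
+  grep -oF \
+    -e 'ONNX Attention: using Flash Attention' \
+    -e 'ONNX Attention: using Memory Efficient Attention' \
+    -e 'Attention: using unified unfused path' \
+    "$1" | sort -u
+}
+```
+
+A test aimed at MEA has exercised the intended path only if the output is exactly the
+`ONNX Attention: using Memory Efficient Attention` line; any other line, or no line at all,
+means the value assertion passed through a different path.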
diff --git a/AGENTS.md b/AGENTS.md
@@ -66,6 +66,15 @@ Use `reserve()` not `resize()`. Do not use `absl::` directly — use the ORT typ
 - Prefer `gsl::span<const T>` over `const std::vector<T>&` for input parameters
 - Prefer `std::string_view` by value over `const std::string&`
 - `SafeInt<size_t>` (from `core/common/safeint.h`) for memory size arithmetic
+- **Signed vs unsigned on negative-capable differences.** Any expression of the form `a - b`
+  that can be negative (an offset or remaining-budget computed from counts, e.g.
+  `num_keys - num_queries`) must be stored and compared using a **signed** type
+  (`int32_t`/`int64_t`), and any unsigned operand must be `static_cast` to signed *before*
+  the subtraction/comparison. An unsigned result silently wraps to a huge value (`~4.29e9`
+  for `uint32_t`), which can permanently satisfy or skip a relational guard with **no crash
+  and no warning** — a correct-looking-but-wrong result. Concrete ORT instance + the exact
+  fix sites: CUTLASS FMHA `causal_diagonal_offset`, see the `cuda-attention-kernel-patterns`
+  skill §12.
 - Don't use `else` after `return`
 - Avoid `long` (ambiguous width) — use `int64_t` for dimensions, `size_t` for counts
 - `using namespace` allowed in limited scope but never at global scope in headers

diff --git a/cmake/external/onnxruntime_external_deps.cmake b/cmake/external/onnxruntime_external_deps.cmake
@@ -797,6 +797,19 @@ if (onnxruntime_USE_WEBGPU)
           #
           ${Patch_EXECUTABLE} --binary --ignore-whitespace -p1 < ${PROJECT_SOURCE_DIR}/patches/dawn/dawn_buffer_fix_injection.patch &&
 
+          # The dawn_parallel_build_fix.patch contains the following changes:
+          #
+          # - (private) Fix parallel build race condition in emdawnwebgpu header copy
+          #   The emdawnwebgpu_headers_gen_add macro's add_custom_command uses cmake -E copy
+          #   without ensuring the destination directory exists first. When building with
+          #   parallel jobs (-j32), the copy commands for webgpu_glfw.h and
+          #   webgpu_enum_class_bitmasks.h can run before any DawnJSONGenerator command
+          #   has created gen/src/emdawnwebgpu/include/webgpu/, causing the copy to fail.
+          #   This patch adds cmake -E make_directory before the copy so the directory is
+          #   always present regardless of parallel build ordering.
+          #
+          ${Patch_EXECUTABLE} --binary --ignore-whitespace -p1 < ${PROJECT_SOURCE_DIR}/patches/dawn/dawn_parallel_build_fix.patch &&
+
           # Remove the test folder to speed up potential file scan operations (70k+ files not needed for build).
           # Using <SOURCE_DIR> token ensures the correct absolute path regardless of working directory.
           ${CMAKE_COMMAND} -E rm -rf <SOURCE_DIR>/test)

diff --git a/cmake/patches/dawn/dawn_parallel_build_fix.patch b/cmake/patches/dawn/dawn_parallel_build_fix.patch
@@ -0,0 +1,11 @@
+diff --git a/src/emdawnwebgpu/CMakeLists.txt b/src/emdawnwebgpu/CMakeLists.txt
+--- a/src/emdawnwebgpu/CMakeLists.txt
++++ b/src/emdawnwebgpu/CMakeLists.txt
+@@ -36,6 +36,7 @@ macro(emdawnwebgpu_headers_gen_add filename)
+             "${EM_BUILD_GEN_DIR}/include/webgpu/${filename}"
+         MAIN_DEPENDENCY
+             "${DAWN_INCLUDE_DIR}/webgpu/${filename}"
++        COMMAND ${CMAKE_COMMAND} -E make_directory "${EM_BUILD_GEN_DIR}/include/webgpu"
+         COMMAND ${CMAKE_COMMAND} -E copy
+             "${DAWN_INCLUDE_DIR}/webgpu/${filename}"
+             "${EM_BUILD_GEN_DIR}/include/webgpu/${filename}"
diff --git a/docs/contrib_ops/cuda/moe_qmoe.md b/docs/contrib_ops/cuda/moe_qmoe.md
@@ -536,6 +536,15 @@ The operator supports three fusion modes via the `swiglu_fusion` attribute:
 | 1 (interleaved) | `fc1`, `fc2` | `[Gate_0, Value_0, Gate_1, Value_1, …]` — `[E, 2×inter, hidden]` | GPT-OSS layout. |
 | 2 (block) | `fc1`, `fc2` | `[Gate_0…Gate_N | Value_0…Value_N]` — `[E, 2×inter, hidden]` | Concatenated halves; Llama/Gemma layout. |
 
+> **Backward-compatibility remap (`swiglu_fusion=0` + SwiGLU)**: The gpt-oss-20b model
+> published before June 2025 uses the **interleaved** SwiGLU layout but sets the
+> `swiglu_fusion` attribute to `0`. To keep those models working, when
+> `activation_type="swiglu"` and `swiglu_fusion=0`, the CUDA op treats the FC1 weights as
+> interleaved (i.e. as if `swiglu_fusion=1`); for **QMoE**, which never has a separate
+> `fc3`, this remap is unconditional. A one-time warning is logged. Consequently a SwiGLU
+> model that genuinely intended the non-interleaved split must provide a separate `fc3`
+> (standard MoE) rather than rely on `swiglu_fusion=0`. New exporters should set `swiglu_fusion` explicitly.
+
 > **CPU note**: The CPU MoE/QMoE implementation only supports the **interleaved**
 > SwiGLU layout (`swiglu_fusion=1`). The concatenated layout (`swiglu_fusion=2`)
 > throws `ORT_NOT_IMPLEMENTED` on CPU; use the CUDA EP for concatenated SwiGLU.
@@ -980,6 +989,20 @@ per-column INT4, block-wise INT4/INT8, and interleaved-SwiGLU GEMV kernels.
 | Kernel instantiation | `moe_gemv.cu` adds `__nv_bfloat16` details/instantiations (group sizes 0/32/64/128, INT4/INT8, bias on/off) under `ENABLE_BF16`. | The custom FC1/FC2 GEMV kernels run for BF16; no grouped-GEMM fallback when the FP16 gate would route. |
 | Profiling | GPT-OSS-20B, Qwen3.6-35B-A3B, and Gemma model shapes profiled with `block_size=64` for both dtypes. | BF16 matches FP16 routing and latency within noise (about 1.3x–1.5x faster than grouped GEMM); SwiGLU BF16 parity tests pass. |
 
+#### Accumulation policy
+
+The QMoE GEMV fast path accumulates fp16 activations in fp16 by default. Set
+`ORT_MOE_GEMV_FP32_ACCUM=1` before process start to restore the previous fp32
+accumulation path for fp16 activations. BF16 activations always use fp32
+accumulation because bf16 accumulation is too lossy.
+
+On the GPT-OSS-20B decode-shaped helper case
+`gpt_oss_20b_m1_top4_fp16_2880x2880_e32`, the default fp16-accumulation path was
+0.0708 ms versus 0.0812 ms with `ORT_MOE_GEMV_FP32_ACCUM=1`. In a full GPT-OSS
+CUDA-graph decode run, default fp16 accumulation reached 386.26 tok/s versus
+353.70 tok/s with the fp32 fallback. A 1000-sample MMLU smoke test matched pooled
+accuracy at 0.8260 for both modes.
+
 #### Experiments rejected after profiling
 
 | Experiment | Why it was rejected |