195 changes: 171 additions & 24 deletions .agents/skills/cuda-attention-kernel-patterns/SKILL.md

Large diffs are not rendered by default.

111 changes: 111 additions & 0 deletions .agents/skills/cuda-cutlass-fmha-incremental-rebuild/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,111 @@
---
name: cuda-cutlass-fmha-incremental-rebuild
description: >
Use when rebuilding ONNX Runtime CUDA after editing CUTLASS fused-MHA headers
(onnxruntime/contrib_ops/cuda/bert/cutlass_fmha/*.h such as kernel_forward.h or
fmha_launch_template.h), or when a header edit "passed" an incremental build but
test behavior did not change. Explains the nvcc depfile gotcha that produces stale
Memory-Efficient-Attention (MEA) kernels and binaries, and how to force a correct
recompile. Also covers disk-space frugality on shared GPU dev boxes.
---

# Incremental rebuilds silently use STALE CUTLASS fused-MHA kernels

> The **general** false-green principles (stale binary, wrong-artifact mtime) are summarized
> in the `ort-test` skill's "False-green taxonomy". This skill is the CUDA/CUTLASS-specific
> detail.

## The gotcha (verification-integrity bug)

`nvcc`-generated depfiles do **not** track the CUTLASS fused-MHA headers under
`onnxruntime/contrib_ops/cuda/bert/cutlass_fmha/` (e.g. `kernel_forward.h`,
`fmha_launch_template.h`). These headers are `#include`d by the `fmha_sm*.cu`
translation units, but the build system does not record that dependency.

Consequence: after you edit one of those headers, an **incremental** `build.sh`:

- does **not** recompile `fmha_sm*.cu`,
- reports `[100%] Built target ...` and exits 0,
- leaves the artifacts that should have been rebuilt, namely the `fmha_sm*.cu.o` objects
and the `libonnxruntime_providers_cuda.so` they link into, **unchanged** (same `mtime` as
the pre-edit build).

(Do **not** use the gtest test-exe mtime as the stale symptom: in the shared-provider
build the exe `dlopen`s the `.so` and is **not** relinked, so its mtime stays old even
after a *correct* rebuild — see "How to confirm" below. The reliable diagnostic signal
is the `fmha_sm*.cu.o` / `.so` mtime.)

So your "successful" rebuild is running the **old** kernel. Tests that should now
pass (or fail) reflect the previous code, not your edit. This silently invalidates
any FAIL→PASS / PASS→FAIL verification.

## The fix — force recompile the .cu units

Before rebuilding after editing any `cutlass_fmha/*.h` header:

```bash
touch onnxruntime/contrib_ops/cuda/bert/cutlass_fmha/*.cu
```

Then run the normal build command. This forces the `fmha_sm*.cu` translation units
to recompile against your header change, and the downstream binaries to relink.

## How to confirm the rebuild was real (don't trust "[100%] Built")

Confirm that the artifact which actually **links** the recompiled `fmha_sm*.cu.o`
is newer than your header edit.

⚠️ **Do NOT just check the test EXE mtime — it can falsely flag a good build as
stale.** In the shared-provider build configuration (the default here), the CUDA
execution provider is a **shared module**: the recompiled `fmha_sm*.cu.o` link into
`libonnxruntime_providers_cuda.so`, and the `onnxruntime_provider_test` executable
**dlopens** that `.so` — it is **not relinked**. So after a *correct* rebuild the
test exe `mtime` stays **old** while the `.so` advances. Checking the exe alone
would wrongly conclude the build was stale.

Check the right artifact for your link mode:

- **Shared-provider build (default):** the `.so` that links the recompiled `.o` —
`build/<dir>/<cfg>/libonnxruntime_providers_cuda.so`
- **Statically-linked provider:** the test exe itself (`onnxruntime_provider_test`)

Safest check — `stat` both the recompiled object and the `.so`, and confirm BOTH
are newer than the header edit:

```bash
stat -c '%y %n' onnxruntime/contrib_ops/cuda/bert/cutlass_fmha/kernel_forward.h
# in your build dir, e.g. build/Debug_quickbuild/Debug/:
stat -c '%y %n' libonnxruntime_providers_cuda.so
# and the actual recompiled object (path varies by build dir):
find . -name 'fmha_sm80.cu.o' -exec stat -c '%y %n' {} +
```

If the `.so` (and the `fmha_sm*.cu.o`) timestamps are older than (or equal to) the
header edit, the build was stale — `touch` the `.cu` files and rebuild. The most
reliable signal of all is behavioral: a test that was failing now passes (a stale
binary cannot flip its result).
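
The timestamp comparison above could be scripted roughly as follows. This is a minimal sketch, not part of the ORT tooling: the function name `artifacts_newer_than` is hypothetical, and it relies on the shell's `-nt` file test, which treats an equal timestamp as "not newer" (matching the "older than or equal to" rule above).

```bash
# Hypothetical helper: succeed only if every listed artifact exists and is
# strictly newer than the edited header. Equal mtimes count as stale.
artifacts_newer_than() {
  local header="$1" artifact
  shift
  if [ ! -e "$header" ]; then
    echo "header not found: $header" >&2
    return 2
  fi
  for artifact in "$@"; do
    if [ ! -e "$artifact" ] || [ ! "$artifact" -nt "$header" ]; then
      echo "STALE or missing: $artifact" >&2
      return 1
    fi
  done
  return 0
}
```

A call might pass `kernel_forward.h` as the header, followed by the `.so` and the `fmha_sm80.cu.o` path that `find` reported. A non-zero exit means `touch` the `.cu` files and rebuild.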

## Related: pick the right test binary

This is the **CUDA/CUTLASS instance of false-green mode 1** (zero-match / wrong binary) —
see the `ort-test` skill's "False-green taxonomy" for the general principle. In short:
attention/MEA/Flash boundary gtests (e.g. `FlashStructuralEmptyRows*`,
`Attention_Causal_NonPadKVSeqLen_MEA_*`) live in **`onnxruntime_provider_test`**, which CI
runs; `onnxruntime_test_all` does not contain them and gives a false green. Verify the
MEA/Flash boundary fix against `onnxruntime_provider_test`.

## Related: disk frugality on shared GPU dev boxes

Full ORT CUDA builds are large (test binaries ~1 GB each; a build dir can reach
tens of GB). On a shared box, `/home` filling to 100% makes builds fail in
non-obvious places — e.g. `git submodule sync` reporting `No space left on device`
or a `config.lock` error, not an obvious "disk full" at the compile step.

Before a big rebuild, check free space and clean only clearly-stale, regenerable
build directories (old dated experiment dirs). Never delete another agent's active
build dir or anything ambiguous:

```bash
df -h /home
du -sh build/* | sort -h
```
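
The free-space check can also be turned into a guard before a large rebuild. The sketch below is illustrative only: the function names and the 90% default threshold are assumptions, not values from this repository, and it parses the portable `df -P` output format.

```bash
# Read `df -P <path>` output on stdin and print the integer use percentage
# taken from the Capacity column of the first data row.
df_use_pct() {
  awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Hypothetical guard: succeed only if usage of the filesystem holding PATH
# is below LIMIT percent (default 90, an arbitrary example threshold).
disk_has_headroom() {
  local path="$1" limit="${2:-90}" pct
  pct="$(df -P "$path" | df_use_pct)"
  [ -n "$pct" ] && [ "$pct" -lt "$limit" ]
}
```

For example, `disk_has_headroom /home 90 || echo "clean stale build dirs first"` would warn before the build fails in a non-obvious place.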
10 changes: 10 additions & 0 deletions .agents/skills/ort-build/SKILL.md
@@ -66,6 +66,7 @@ You do **not** need `--update` when only modifying existing `.cc`/`.h` files —
| `--use_cuda` | Enable CUDA EP. Requires `--cuda_home`/`--cudnn_home` or `CUDA_HOME`/`CUDNN_HOME` env vars. On Windows, only `cuda_home`/`CUDA_HOME` is validated. |
| `--target T` | Build a specific CMake target (requires `--build`; e.g., `onnxruntime_common`, `onnxruntime_test_all`) |
| `--use_webgpu` | Enable WebGPU EP. To run its tests locally on Linux without a GPU, see the `webgpu-local-testing` skill. |
| `--cmake_extra_defines onnxruntime_QUICK_BUILD=ON` | Faster CUDA build: instantiates a reduced kernel set. **Side effect:** Flash is compiled for head_dim 128 only, so most attention shapes fall back to **MEA** (changes which attention kernel is compiled/dispatched). Don't use it to characterize Flash-vs-arch behavior. |
| `--build_dir` | Build output directory |

## Build output path
@@ -79,6 +80,15 @@ It may be customized with `--build_dir`.
## Agent tips

- **Activate a Python virtual environment** before building. See "Python > Virtual environment" in `AGENTS.md`.
- **Build flags can silently reroute which kernel/code path executes.** A build option can
change *which* kernel is compiled, and therefore which code path actually runs — so a CI
failure can live in a different code path than your local build exercises. Before
hypothesizing a hardware- or algorithm-specific cause (e.g. "this GPU arch miscomputes"),
first identify **which kernel actually ran** for the failing configuration (see the
`ort-test` skill → "Verify which path/kernel actually executed"). Concrete instance:
`onnxruntime_QUICK_BUILD=ON` compiles FlashAttention for head_dim 128 only, so most
attention shapes silently dispatch to Memory-Efficient Attention instead of Flash —
details in the `cuda-attention-kernel-patterns` skill.
- **Prefer `python tools/ci_build/build.py` directly** over `build.bat`/`build.sh` when redirecting output. The `.bat` wrapper runs in `cmd.exe`, which breaks PowerShell redirection.
- **Redirect output to a file** (e.g., `> build_log.txt 2>&1`). Build output is large and will overflow terminal buffers.
- **Run builds in the background** — a full build can take tens of minutes to over an hour. Poll the log for `"Build complete"` or errors.
69 changes: 69 additions & 0 deletions .agents/skills/ort-test/SKILL.md
@@ -16,6 +16,20 @@ ONNX Runtime uses **Google Test** for C++ and **unittest** (preferred) / **pytes
| `onnxruntime_test_all` | Core framework, graph, optimizer, session tests |
| `onnxruntime_provider_test` | Operator/kernel tests (Conv, MatMul, etc.) across execution providers |

### Two `attention_op_test.cc` files — don't confuse them

There are two same-named files testing **different operators**. Both build into
`onnxruntime_provider_test`:

| Path | Operator | gtest suite |
|---|---|---|
| `test/providers/cpu/llm/attention_op_test.cc` | **ONNX-domain** `Attention` (opset 23/24) | `AttentionTest.*` |
| `test/contrib_ops/attention_op_test.cc` | **contrib** MultiHeadAttention / GroupQueryAttention | `ContribOpAttentionTest.*` |

The MEA negative-offset regression tests (`Attention_Causal_NonPadKVSeqLen_MEA_*`,
e.g. `..._MEA_NegOffset_ForceFlashDisabled_FP16_CUDA`) live in the **providers/cpu/llm** file —
the ONNX-domain op.

Use `--gtest_filter` to select specific tests:

```bash
@@ -76,7 +90,62 @@ Python test naming convention: `test_<method>_<expected_behavior>_[when_<conditi
## Agent tips

- **Activate a Python virtual environment** before running tests. See "Python > Virtual environment" in `AGENTS.md`.
- **Beware false-green results** — a green run does not always prove anything. See the
"False-green taxonomy" section below for the five ways a test can pass without testing
your change.
- **Redirect test output to a file** (e.g., `> test_output.txt 2>&1`) — output can be large.
- For C++ tests, verify the build directory exists and a prior build completed before running.
- Use `--gtest_filter` to run a targeted subset when the full suite takes too long.
- **Running WebGPU tests locally on Linux without a GPU** — WebGPU op tests build into `onnxruntime_provider_test` and can run against a software Vulkan adapter (Mesa lavapipe). See the `webgpu-local-testing` skill.

## False-green taxonomy — ways a test can "pass" without proving anything

A green result is not always a real pass. Watch for all five modes:

1. **Zero-match filter.** A `--gtest_filter` that matches no tests still exits 0 (green).
Confirm the `[==========] N tests ran` line is non-zero — a zero-match run prints
`0 tests from 0 test suites`. Many operator/kernel gtests run only in
**`onnxruntime_provider_test`** (CI runs this), NOT `onnxruntime_test_all`; the wrong
binary matches nothing and looks green.
2. **Stale binary from an incremental build.** If the build did not actually recompile your
change (e.g. a header not tracked by the compiler's depfile), the "passing" run executes
the OLD code. A test that was failing cannot truly flip to passing without a real
rebuild — treat an unexpected FAIL→PASS with suspicion and confirm the linked artifact's
mtime advanced. CUDA/CUTLASS instance (nvcc depfiles don't track `cutlass_fmha/*.h`): see
the `cuda-cutlass-fmha-incremental-rebuild` skill.
3. **Checking the wrong artifact's freshness.** With a dlopen'd shared provider (e.g.
`libonnxruntime_providers_cuda.so`), the test executable is NOT relinked when the provider
recompiles — its mtime stays old while the `.so` advances. Verify the artifact that
actually links your change, not the test exe. Detail: `cuda-cutlass-fmha-incremental-rebuild`
skill.
4. **A correct fallback path masks the intended path.** A value-only assertion can pass via a
*different, correct* code path without ever exercising the one you meant to test (e.g. a
test meant for MEA silently handled by the unfused fallback). Assert/verify **which path
ran**, not just the output value — see "Verify which path/kernel actually executed" below.
5. **Arch-portability false-green (verified on only one GPU arch).** A CUDA kernel that
launches on a large-dynamic-smem arch (e.g. sm90/H100, ~227KB) can **fail to launch** on a
smaller opt-in cap (sm86/89 ~99KB, sm80 ~163KB) with `CUDA failure 1: invalid argument` —
and a path with no fallback (e.g. ORT's MEA) turns that into a hard error, not a silent
degrade. So a green run on your local GPU can mask a launch failure on CI's arch. Verify
arch-portability, or pick a config whose shared-memory footprint fits **every** target arch
(e.g. a small `head_size`). Concrete instance: CUTLASS MEA `head_size=512` FP16 exceeds
sm86's smem opt-in cap and dies at launch — live bug #28388 (the
`cuda-attention-kernel-patterns` skill §1 has the dispatch detail).
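
Mode 1 lends itself to an automated check on the captured test log. The sketch below is an illustration under assumptions: the helper names are hypothetical, and the pattern assumes GoogleTest's closing summary line has the form `[==========] N tests from M test suites ran.` (the shorter `N tests ran` form is also accepted).

```bash
# Print the test count from the closing gtest summary line of a log file.
# Prints nothing if no summary line is present.
gtest_tests_ran() {
  sed -n 's/^\[==========] \([0-9][0-9]*\) tests\{0,1\}.* ran\..*/\1/p' "$1" | tail -n 1
}

# Succeed only when the log shows that at least one test actually ran.
gtest_ran_something() {
  local count
  count="$(gtest_tests_ran "$1")"
  [ -n "$count" ] && [ "$count" -gt 0 ]
}
```

Running `gtest_ran_something test_output.txt` after a filtered run turns a zero-match filter into a non-zero exit status instead of a silent green.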

## Verify which path/kernel actually executed

Value equality alone does not prove the intended code path ran — a correct fallback can
produce the right answer (false-green mode 4 above). When a test targets a specific
kernel/path, confirm it actually dispatched there instead of trusting the output:

- Enable verbose logging and check the dispatch log line. ORT attention logs one of these
exact strings (`core/providers/cuda/llm/attention.cc`):
- `ONNX Attention: using Flash Attention` (:1400)
- `ONNX Attention: using Memory Efficient Attention` (:1451)
- `Attention: using unified unfused path` (:1482) — note: **no `ONNX ` prefix** and it
reads "unified unfused path", not "Unfused".
- Or force the path via the relevant env var / build config AND add a compile-time guard so
the test **SKIPs** (not silently passes) when the target path is unavailable — e.g.
`SKIP_IF_MEA_NOT_COMPILED`.
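
The log check in the first bullet can be sketched as a small script. It assumes the verbose output has already been captured to a file. The helper names are hypothetical, and only the three dispatch strings quoted above are matched.

```bash
# Print one keyword per attention path whose dispatch string appears in the log:
# "flash", "mea", or "unfused". Prints nothing if none is found.
attention_paths_in_log() {
  local log="$1"
  grep -qF 'ONNX Attention: using Flash Attention' "$log" && echo flash
  grep -qF 'ONNX Attention: using Memory Efficient Attention' "$log" && echo mea
  grep -qF 'Attention: using unified unfused path' "$log" && echo unfused
  return 0
}

# Succeed only if exactly the expected path (and no other) was logged.
assert_attention_path() {
  local expected="$1" log="$2" actual
  actual="$(attention_paths_in_log "$log")"
  [ "$actual" = "$expected" ]
}
```

For a test meant to exercise MEA, `assert_attention_path mea test_output.txt` fails when the unfused fallback handled the case, or when more than one path shows up in the same log.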

Operator-specific routing/forcing details: `cuda-attention-kernel-patterns` skill §1/§7.
9 changes: 9 additions & 0 deletions AGENTS.md
@@ -66,6 +66,15 @@ Use `reserve()` not `resize()`. Do not use `absl::` directly — use the ORT typ
- Prefer `gsl::span<const T>` over `const std::vector<T>&` for input parameters
- Prefer `std::string_view` by value over `const std::string&`
- `SafeInt<size_t>` (from `core/common/safeint.h`) for memory size arithmetic
- **Signed vs unsigned on negative-capable differences.** Any expression of the form `a - b`
that can be negative (an offset or remaining-budget computed from counts, e.g.
`num_keys - num_queries`) must be stored and compared using a **signed** type
(`int32_t`/`int64_t`), and any unsigned operand must be `static_cast` to signed *before*
the subtraction/comparison. An unsigned result silently wraps to a huge value (`~4.29e9`
for `uint32_t`), which can permanently satisfy or skip a relational guard with **no crash
and no warning** — a correct-looking-but-wrong result. Concrete ORT instance + the exact
fix sites: CUTLASS FMHA `causal_diagonal_offset`, see the `cuda-attention-kernel-patterns`
skill §12.
- Don't use `else` after `return`
- Avoid `long` (ambiguous width) — use `int64_t` for dimensions, `size_t` for counts
- `using namespace` allowed in limited scope but never at global scope in headers
13 changes: 13 additions & 0 deletions cmake/external/onnxruntime_external_deps.cmake
@@ -797,6 +797,19 @@ if (onnxruntime_USE_WEBGPU)
#
${Patch_EXECUTABLE} --binary --ignore-whitespace -p1 < ${PROJECT_SOURCE_DIR}/patches/dawn/dawn_buffer_fix_injection.patch &&

# The dawn_parallel_build_fix.patch contains the following changes:
#
# - (private) Fix parallel build race condition in emdawnwebgpu header copy
# The emdawnwebgpu_headers_gen_add macro's add_custom_command uses cmake -E copy
# without ensuring the destination directory exists first. When building with
# parallel jobs (-j32), the copy commands for webgpu_glfw.h and
# webgpu_enum_class_bitmasks.h can run before any DawnJSONGenerator command
# has created gen/src/emdawnwebgpu/include/webgpu/, causing the copy to fail.
# This patch adds cmake -E make_directory before the copy so the directory is
# always present regardless of parallel build ordering.
#
${Patch_EXECUTABLE} --binary --ignore-whitespace -p1 < ${PROJECT_SOURCE_DIR}/patches/dawn/dawn_parallel_build_fix.patch &&

# Remove the test folder to speed up potential file scan operations (70k+ files not needed for build).
# Using <SOURCE_DIR> token ensures the correct absolute path regardless of working directory.
${CMAKE_COMMAND} -E rm -rf <SOURCE_DIR>/test)
11 changes: 11 additions & 0 deletions cmake/patches/dawn/dawn_parallel_build_fix.patch
@@ -0,0 +1,11 @@
diff --git a/src/emdawnwebgpu/CMakeLists.txt b/src/emdawnwebgpu/CMakeLists.txt
--- a/src/emdawnwebgpu/CMakeLists.txt
+++ b/src/emdawnwebgpu/CMakeLists.txt
@@ -36,6 +36,7 @@ macro(emdawnwebgpu_headers_gen_add filename)
"${EM_BUILD_GEN_DIR}/include/webgpu/${filename}"
MAIN_DEPENDENCY
"${DAWN_INCLUDE_DIR}/webgpu/${filename}"
+ COMMAND ${CMAKE_COMMAND} -E make_directory "${EM_BUILD_GEN_DIR}/include/webgpu"
COMMAND ${CMAKE_COMMAND} -E copy
"${DAWN_INCLUDE_DIR}/webgpu/${filename}"
"${EM_BUILD_GEN_DIR}/include/webgpu/${filename}"
23 changes: 23 additions & 0 deletions docs/contrib_ops/cuda/moe_qmoe.md
@@ -536,6 +536,15 @@ The operator supports three fusion modes via the `swiglu_fusion` attribute:
| 1 (interleaved) | `fc1`, `fc2` | `[Gate_0, Value_0, Gate_1, Value_1, …]` — `[E, 2×inter, hidden]` | GPT-OSS layout. |
| 2 (block) | `fc1`, `fc2` | `[Gate_0…Gate_N | Value_0…Value_N]` — `[E, 2×inter, hidden]` | Concatenated halves; Llama/Gemma layout. |

> **Backward-compatibility remap (`swiglu_fusion=0` + SwiGLU)**: The gpt-oss-20b model
> published before June 2025 uses the **interleaved** SwiGLU layout but sets the
> `swiglu_fusion` attribute to `0`. To keep such models working, when
> `activation_type="swiglu"` and `swiglu_fusion=0`, the CUDA op treats the FC1 weights as
> interleaved (i.e. as if `swiglu_fusion=1`). For **QMoE**, which never has a separate
> `fc3`, this remap is unconditional.
> A one-time warning is logged. Consequently a SwiGLU model that genuinely intended the
> non-interleaved split must provide a separate `fc3` (standard MoE) rather than rely on
> `swiglu_fusion=0`. New exporters should set `swiglu_fusion` explicitly.

> **CPU note**: The CPU MoE/QMoE implementation only supports the **interleaved**
> SwiGLU layout (`swiglu_fusion=1`). The concatenated layout (`swiglu_fusion=2`)
> throws `ORT_NOT_IMPLEMENTED` on CPU; use the CUDA EP for concatenated SwiGLU.
@@ -980,6 +989,20 @@ per-column INT4, block-wise INT4/INT8, and interleaved-SwiGLU GEMV kernels.
| Kernel instantiation | `moe_gemv.cu` adds `__nv_bfloat16` details/instantiations (group sizes 0/32/64/128, INT4/INT8, bias on/off) under `ENABLE_BF16`. | The custom FC1/FC2 GEMV kernels run for BF16; no grouped-GEMM fallback when the FP16 gate would route. |
| Profiling | GPT-OSS-20B, Qwen3.6-35B-A3B, and Gemma model shapes profiled with `block_size=64` for both dtypes. | BF16 matches FP16 routing and latency within noise (about 1.3x–1.5x faster than grouped GEMM); SwiGLU BF16 parity tests pass. |

#### Accumulation policy

The QMoE GEMV fast path accumulates fp16 activations in fp16 by default. Set
`ORT_MOE_GEMV_FP32_ACCUM=1` before process start to restore the previous fp32
accumulation path for fp16 activations. BF16 activations always use fp32
accumulation because bf16 accumulation is too lossy.

On the GPT-OSS-20B decode-shaped helper case
`gpt_oss_20b_m1_top4_fp16_2880x2880_e32`, the default fp16-accumulation path was
0.0708 ms versus 0.0812 ms with `ORT_MOE_GEMV_FP32_ACCUM=1`. In a full GPT-OSS
CUDA-graph decode run, default fp16 accumulation reached 386.26 tok/s versus
353.70 tok/s with the fp32 fallback. A 1000-sample MMLU smoke test matched pooled
accuracy at 0.8260 for both modes.
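
To put the figures above in relative terms, the ratios can be computed directly. This is only a worked-arithmetic sketch using the numbers quoted in this section. The `ratio` helper is a hypothetical name.

```bash
# Print a / b to three decimal places.
ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.3f\n", a / b }'
}

ratio 0.0812 0.0708   # helper-case kernel time: fp32 accumulation / fp16 accumulation
ratio 386.26 353.70   # decode throughput: fp16 accumulation / fp32 accumulation
```

The two calls print `1.147` and `1.092`, so the default fp16 accumulation is roughly 15% faster on the helper case and about 9% faster in end-to-end decode throughput.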

#### Experiments rejected after profiling

| Experiment | Why it was rejected |