Sync with Microsoft ONNX Runtime - 27062026 #1167
Open
ai-fw-intg wants to merge 11 commits into
…icrosoft#28771)

### Description

Relax the input validation in OrtApi::CompileModel to accept OrtModel instances with zero graph inputs. Previously, ModelCompilationOptions::Check() rejected such models with "OrtModel graph must have at least one input and one output defined." The check now requires only at least one graph output; the zero-input case is legal.

Tests in test_model_builder_api.cc are restructured:

- The old CompileFromModelWithEmptyInputsOutputs_Fails is renamed to CompileFromModelWithEmptyOutputs_Fails and reshaped to provide 1 input + 0 outputs, isolating the output-only check.
- A new regression test CompileFromModelWithEmptyInputs_Succeeds builds a 0-input model with a RandomNormal node and verifies compilation succeeds.

### Motivation and Context

Fixes microsoft#28135

The original check was too restrictive and impacts callers (e.g., WebNN/Chromium needs to call CompileModel on such models in a separate compiler process, and then load the compiled artifact via CreateSessionFromArray in the GPU process).
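The relaxed rule described above can be modeled in a few lines. This is only a Python sketch of the logic, not the C++ in `ModelCompilationOptions::Check()`; the names `check_graph_io` and `is_compilable` and the error text are hypothetical.

```python
def check_graph_io(num_inputs: int, num_outputs: int) -> None:
    """Hypothetical model of the relaxed check: only outputs are required."""
    if num_inputs < 0 or num_outputs < 0:
        raise ValueError("counts must be non-negative")
    if num_outputs == 0:
        # Illustrative message; the new error string is not quoted in the PR.
        raise ValueError("graph must have at least one output defined")


def is_compilable(num_inputs: int, num_outputs: int) -> bool:
    """Return True when the model passes the input/output count check."""
    try:
        check_graph_io(num_inputs, num_outputs)
        return True
    except ValueError:
        return False
```

Under this model, a 0-input graph with one output (the RandomNormal case) passes, while a 1-input graph with no outputs is still rejected.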
…ttention (microsoft#29240)

### Description

The CUDA `GroupQueryAttention` kernel derives a KV-cache append offset from the `seqlens_k` input (`past_seq_lens = (seqlens_k + 1) - sequence_length`). On the CUDA EP `seqlens_k` is device-resident (only `total_sequence_length` is a CPU input), so the host-side range validation in the operator/helper is skipped. The device kernel `UnpackRoPEAppend` then guarded the cache store with only a one-sided upper bound (`cache_s < max_seqlen`), so an out-of-range `seqlens_k` could produce a negative offset that is sign-extended into the cache-index arithmetic.

The CPU operator already validates `seqlens_k` host-side; this change brings the CUDA path to parity by guarding on the device.

### Changes

- `group_query_attention_impl.cu` (`GetSequenceLengths`): clamp the negative case at the source so both `total_seq_lens` and the append offset `past_seq_lens` stay non-negative for all downstream consumers.
- `group_query_attention_qkv.cuh` (`UnpackRoPEAppend`): make the KV-cache store bound two-sided (`cache_s >= 0 && cache_s < max_seqlen`), mirroring the existing position-index guard a few lines above. This also covers the fast-decode path, where `past_seq_lens` points directly at the raw input and bypasses `GetSequenceLengths`.
- Added `NegativeSeqlensK_CacheAppend_NoOOB_CUDA` regression test exercising the KV-cache append path with an out-of-range `seqlens_k` (CUDA-guarded; skips when CUDA EP is unavailable).

### Notes

- The two-sided guard matches the pattern introduced for the rotary position index in microsoft#27597.
- CPU is unaffected (already validated host-side); WebGPU relies on the CPU-validated `total_sequence_length`. The CUDA implementation is shared with ROCm via hipify.
- The regression is a device-memory write best observed under `compute-sanitizer`; the test asserts the run completes with finite outputs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
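The offset arithmetic and the two-sided guard from this commit can be illustrated with a small host-side model. This is a Python sketch rather than the CUDA code: the function names are hypothetical, and the exact form of the clamp in `GetSequenceLengths` is an assumption (the PR only states that both values stay non-negative).

```python
def sequence_lengths(seqlens_k, sequence_length):
    """Return (total_seq_len, past_seq_len), both clamped to be non-negative.

    Assumed clamp; mirrors past_seq_lens = (seqlens_k + 1) - sequence_length.
    """
    total = max(seqlens_k + 1, 0)
    past = max(total - sequence_length, 0)
    return total, past


def can_store(cache_s, max_seqlen):
    """Two-sided bound on the cache slot, as in the updated store guard."""
    return 0 <= cache_s < max_seqlen


def append_positions(seqlens_k, sequence_length, max_seqlen):
    """Cache slots that would actually be written for the new tokens."""
    _, past = sequence_lengths(seqlens_k, sequence_length)
    slots = (past + s for s in range(sequence_length))
    return [slot for slot in slots if can_store(slot, max_seqlen)]
```

With only the upper bound `cache_s < max_seqlen`, a slot of `-1` would pass the check; the two-sided form rejects it even when the raw input bypasses the clamp.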
…oft#28962)

## Summary

Adds an FP32 flash attention path for the CPU `com.microsoft.GroupQueryAttention` (GQA) contrib op, mirroring the existing quantized-KV flash attention path. The new tiled, online-softmax kernel avoids materializing the full `[S, T]` attention score matrix. It is restricted to prefill / chunked-prefill (`sequence_length > 1`); single-token decode falls back to the naive path. With causal early-termination it is faster than the naive path across all measured prefill lengths while using a fraction of the memory.

## Key changes

- **New MLAS kernel** `onnxruntime/core/mlas/lib/flashattn_gqa.cpp` (`MlasFlashAttentionGQA`):
  - Tiled QK / softmax / SV with online-softmax (running max/sum rescaling).
  - GQA head grouping (`num_heads % kv_num_heads == 0`), causal masking, local window, additive attention bias, and packed-QKV input.
  - **Causal early-termination**: during prefill, KV blocks that fall entirely in the causally masked upper triangle are skipped (`break` once `ir >= past_seqlen + q_idx + row_size_q`), avoiding the wasted QK/SV GEMMs over roughly half of the square prefill attention matrix.
  - Per-batch invocation for ragged / shared-buffer `seqlens_k`.
- **MLAS API** `onnxruntime/core/mlas/inc/mlas.h`: new `MlasFlashAttentionGQAArgs` struct and `MlasFlashAttentionGQA` declaration.
- **Dispatch** `onnxruntime/contrib_ops/cpu/bert/gqa_attention_base.h`: new `ApplyAttentionFlash` that concatenates new K/V into the FP32 present cache and invokes the kernel. The per-thread scratch buffer size is computed with `SafeInt<size_t>` to guard against `size_t` overflow on large/malformed shapes before allocation.
- **Wiring** `onnxruntime/contrib_ops/cpu/bert/group_query_attention.cc`: float-only flash dispatch, active only for prefill (`sequence_length > 1`) and when `softcap == 0`, no smooth softmax, no head sink, no QK output; falls back to the naive path otherwise. The existing `ORT_GQA_DISABLE_FLASH_ATTENTION` env var disables it.
- **CMake** `cmake/onnxruntime_mlas.cmake`: register the new source file.
- **Docs** `docs/contrib_ops/cpu/gqa.md`: document the non-quantized flash attention path, activation conditions, causal early-termination, file list, and FP32 flash-vs-naive benchmark results.
- **Benchmark** `onnxruntime/test/python/transformers/benchmark_gqa_cpu_flash.py`: add an FP32 (non-quantized) mode (`--fp32`) for operator-level flash-vs-naive comparison.

### Why prefill-only (`sequence_length > 1`)

Single-token decode (`sequence_length == 1`) produces only a `[1, total_sequence_length]` score row per head, so there is nothing to tile away and the extra online-softmax bookkeeping makes the flash kernel slower and noisier than naive in practice. Restricting the flash path to prefill keeps the consistent prefill win without regressing decode. Because decode is excluded, the two-phase flash-decoding kernels are unreachable and have been removed for a smaller, simpler implementation. `float16` continues to use the naive path (the kernel is float-only, matching the quantized flash constraint).

## Performance

Operator-level, AMD EPYC 7763 (16 physical cores), threads=8, FP32 KV cache, `B=1, num_heads=16, kv_num_heads=8, head_size=128`. Flash is faster than naive across all measured prefill lengths (and single-threaded as well, 1.4-1.8x), confirming the gain is algorithmic: the causal early-termination removes the wasted upper-triangle work that previously made flash slower than naive at short sequences.

| Prefill Seq Length | Naive (ms) | Flash (ms) | Speedup |
|---:|---:|---:|---:|
| 512 | 5.8-8.4 | 4.2-5.3 | 1.4-1.6x |
| 1024 | 25-29 | 13-18 | 1.6-2.0x |
| 2048 | 87-118 | 52-65 | 1.5-2.0x |
| 4096 | 365-380 | 213-234 | 1.6-1.7x |

The flash path's primary structural benefit is memory: it never allocates the full O(N x S x T) attention matrix (~1 GB at S=4096, N=16) and instead uses an O(S x Bc) per-thread tile.
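The "~1 GB" figure can be reproduced with a one-line calculation. The sketch below assumes FP32 scores (4 bytes each) and a square prefill with no past tokens, so `T = S`; the helper name `score_matrix_bytes` is hypothetical.

```python
def score_matrix_bytes(num_heads, q_len, kv_len, bytes_per_element=4):
    """Size of a fully materialized N x S x T score matrix, in bytes."""
    return num_heads * q_len * kv_len * bytes_per_element


# N=16 heads, S=T=4096, FP32 scores.
full_bytes = score_matrix_bytes(16, 4096, 4096)
full_gib = full_bytes / 2 ** 30
print(f"{full_bytes} bytes = {full_gib} GiB")
```

Since the cost grows with `S * T`, halving the prefill length to 2048 cuts the naive score matrix to a quarter of that size.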
## Testing

- **C++ op tests**: `onnxruntime_provider_test --gtest_filter="GroupQueryAttentionTest.*"`: 38 passed (12 GPU/WebGPU skipped) with flash on (default) and with `ORT_GQA_DISABLE_FLASH_ATTENTION=1`.
- **Flash vs. naive parity** (FP32): output of the flash path matches the naive path (max abs diff ~1e-7) across prefill (block-aligned and non-aligned `S`), MHA and GQA head ratios, and local window. Decode now uses the naive path on both sides (diff 0).
- **Python parity** (`test_gqa_cpu.py`, flash vs. naive reference): focused FP32 sweep of 600 prompt configurations covering all head sizes (32-256), GQA ratios `(6,6)/(6,3)/(9,9)/(9,3)`, batches `1/3/5`, causal/local window, attention bias, position ids, packed QKV, and with/without KV buffer; all passed. The official `test_gqa_cpu.py` suite passes.

Two correctness bugs were found and fixed via the parity sweep while developing this path:

1. Attention-bias batch stride ignored head broadcasting for `[batch, 1, S, T]` bias.
2. Query batch stride was hardcoded to `num_heads * S * H`, which is incorrect for packed-QKV input (correct stride is `(num_heads + 2 * kv_num_heads) * S * H`).
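The tiled online-softmax idea, the causal early-termination, and the flash-vs-naive parity check described in this commit can be sketched for a single head in plain Python. This is a simplified illustration, not the MLAS kernel: it processes one query row at a time (so `row_size_q` is effectively 1), omits local window, bias, and GQA grouping, and uses hypothetical function names.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def naive_attention(q, k, v, past_len=0):
    """Reference path: materialize each full score row, then softmax."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        visible = past_len + i + 1  # causal: keys 0 .. past_len + i
        scores = [scale * _dot(qi, k[j]) for j in range(visible)]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        denom = sum(w)
        out.append([sum(w[j] * v[j][d] for j in range(visible)) / denom
                    for d in range(len(v[0]))])
    return out


def flash_attention(q, k, v, past_len=0, block=2):
    """Tiled path: visit KV blocks with a running max and running sum."""
    scale = 1.0 / math.sqrt(len(q[0]))
    head = len(v[0])
    out = []
    for i, qi in enumerate(q):
        visible = past_len + i + 1
        m, denom, acc = -math.inf, 0.0, [0.0] * head
        for start in range(0, len(k), block):
            if start >= visible:
                break  # causal early termination: block is fully masked
            end = min(start + block, visible)
            scores = [scale * _dot(qi, k[j]) for j in range(start, end)]
            m_new = max(m, max(scores))
            alpha = math.exp(m - m_new)  # rescale earlier partial results
            denom *= alpha
            acc = [a * alpha for a in acc]
            for s, j in zip(scores, range(start, end)):
                p = math.exp(s - m_new)
                denom += p
                for d in range(head):
                    acc[d] += p * v[j][d]
            m = m_new
        out.append([a / denom for a in acc])
    return out


def max_abs_diff(a, b):
    return max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
```

Only one block of scores is alive at a time in `flash_attention`, which is the property that lets the real kernel avoid the full `[S, T]` matrix.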
…, GQA underflow, and ep_weight_sharing_ctx_gen build (microsoft#28245)

### Description

This PR contains three commits:

**Commit 1: Miscellaneous fixes**

- Downgrade QNN ETW profiling mismatch logs from ERROR to VERBOSE to reduce excessive telemetry noise (~1 billion events/week across Windows devices)
- Add bounds checking in GQA attention to prevent `size_t` underflow when `seqlens_k` contains invalid data (fixes microsoft#27170)
- Build `ep_weight_sharing_ctx_gen` for TensorRT, OpenVINO, and VitisAI in addition to QNN

**Commit 2: Bump cpuinfo and add `cpuinfo_deinitialize()` integration**

Applications that dynamically load and unload the onnxruntime DLL leave orphaned heap allocations from cpuinfo when the library is unloaded mid-process. These are flagged as memory leaks by App Verifier, Valgrind, AddressSanitizer, and LeakSanitizer.

This commit bumps `pytorch/cpuinfo` to a version that implements `cpuinfo_deinitialize()` ([pytorch/cpuinfo#387](pytorch/cpuinfo#387)) and adds ORT integration:

- `CPUIDInfo::ShutDown()` calls `cpuinfo_deinitialize()` to free heap-allocated globals
- `DllMain` calls `ShutdownCpuInfo()` on `DLL_PROCESS_DETACH`
- In memleak-check builds, shutdown also runs during process termination
- `InstanceCreated` atomic guard prevents singleton creation during DLL unload

**Commit 3: Update to official cpuinfo merged fix**

After [pytorch/cpuinfo#387](pytorch/cpuinfo#387) merged upstream, updated the dependency to point to `pytorch/cpuinfo` main (`4628dc06`).

Patch changes:

- **Removed** `win_arm_fp16_detection_fallback.patch`: upstreamed via [pytorch/cpuinfo#348](pytorch/cpuinfo#348)
- **Updated** `patch_vcpkg_arm64ec_support.patch`: regenerated for new cpuinfo; still needed ([pytorch/cpuinfo#324](pytorch/cpuinfo#324) not yet merged)
- **Updated** `patch_cpuinfo_h_for_arm64ec.patch`: retained, not yet upstream
- **Regenerated** `fix_missing_sysfs_fallback.patch`: updated context lines for new cpuinfo code

### Motivation and Context

- pytorch/cpuinfo#150
- microsoft#16117
- microsoft#23762
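The `size_t` underflow fixed in Commit 1 can be shown with a model of unsigned arithmetic. This sketch assumes a 64-bit `size_t`; the function names are hypothetical, and the check shown is one plausible form of the bounds check, not the code from the PR.

```python
SIZE_MAX = 2 ** 64 - 1  # assumes a 64-bit size_t


def size_t_sub(a, b):
    """Unsigned subtraction as C++ performs it: wraps modulo 2**64."""
    return (a - b) & SIZE_MAX


def checked_past_length(total_len, new_len):
    """Validate before subtracting, so bad seqlens_k data is rejected."""
    if total_len < 0 or new_len < 0 or new_len > total_len:
        raise ValueError("invalid sequence lengths")
    return total_len - new_len


print(size_t_sub(3, 5))           # a huge value instead of -2
print(checked_past_length(5, 3))  # 2
```

Without a check, the wrapped value would flow into buffer sizes or indices, which is why the validation has to happen before the subtraction.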
…icrosoft#29221)

## Description

The CUDA plugin EP previously rejected combining a user-provided compute stream (`user_compute_stream`) with CUDA graph capture (`enable_cuda_graph`), returning `ORT_INVALID_ARGUMENT`. This PR removes that restriction so the two options can be used together: when both are set, graph capture and replay run on the user-owned stream (the same stream the kernels are issued to), matching the bundled (non-plugin) CUDA EP behavior. Several supporting fixes make capture on a shared stream stable and Memcpy-free.

## Summary of Changes

### Allow user stream + CUDA graph

| File | Change |
|------|--------|
| [onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc](onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc) | Remove the validation that rejected `user_compute_stream` + `enable_cuda_graph` together. |
| [onnxruntime/core/providers/cuda/plugin/cuda_ep.cc](onnxruntime/core/providers/cuda/plugin/cuda_ep.cc) | `PerThreadContext` accepts an optional external graph stream. When both options are set it captures/replays on the user stream and does **not** create or destroy it (the user owns its lifetime); otherwise it owns a dedicated graph stream as before. |

### Stable, Memcpy-free CUDA graph capture

| File | Change |
|------|--------|
| [onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h](onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h) | Route kernel scratch/workspace allocations through the EP allocator (BFC arena) instead of raw `cudaMallocAsync`/`cudaMalloc`. After warmup the arena reaches steady state, so the capture run serves scratch from already-reserved chunks and the device free-memory footprint stays stable, which is required for correct capture. Matches the built-in CUDA EP. |
| [onnxruntime/core/providers/cuda/tensor/shape_op.cc](onnxruntime/core/providers/cuda/tensor/shape_op.cc) | Add an adapter-based `Shape` kernel under `#ifdef BUILD_CUDA_EP_AS_PLUGIN` with identical semantics to the CPU `Shape`. Registering `Shape` on the EP keeps it off the CPU EP and avoids the Memcpy nodes that would otherwise break CUDA graph capture. |
| [cmake/onnxruntime_providers_cuda_plugin.cmake](cmake/onnxruntime_providers_cuda_plugin.cmake) | Stop excluding `shape_op.cc` from the plugin build so the adapter-based `Shape` kernel is compiled in. |

### Null-allocator fallback in PrePack (plugin boundary)

In the plugin build the `AllocatorPtr` passed to `PrePack` can arrive null across the library boundary. Each kernel now falls back to its own default-memory allocator (`Info().GetAllocator(OrtMemTypeDefault)`), which is always valid.

- [onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc](onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc)
- [onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc](onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc)
- [onnxruntime/contrib_ops/cuda/quantization/matmul_nbits.cc](onnxruntime/contrib_ops/cuda/quantization/matmul_nbits.cc)

### Misc

- [onnxruntime/core/framework/session_state.cc](onnxruntime/core/framework/session_state.cc): wrap a long line (no behavior change).

## Testing

- New test: [onnxruntime/test/providers/cuda/plugin/cuda_plugin_user_stream_graph_test.cc](onnxruntime/test/providers/cuda/plugin/cuda_plugin_user_stream_graph_test.cc) covering:
  1. Session creation succeeds with both `user_compute_stream` and `enable_cuda_graph` set (regression for the removed validation).
  2. Capture + replay on the user stream produce correct results.
  3. Replay after an in-place input update on the user stream is correct.
- Tests are gated on `ORT_UNIT_TEST_HAS_CUDA_PLUGIN_EP` and skip gracefully when no CUDA device or plugin library is available.

## Motivation and Context

Users that drive ORT from their own CUDA stream (e.g. to interleave ORT inference with their own kernels) previously could not also benefit from CUDA graph capture on the plugin EP. This change brings the plugin EP to parity with the bundled CUDA EP for that workflow.

## Checklist

- [x] Tests added/updated
- [x] No breaking changes (relaxes a previously rejected option combination)
- [ ] Documentation updated (if applicable)
## Summary

- align CPU ONNX Attention causal masking with upper-left behavior for q_len=1, kv_len>1, no past
- preserve the existing `nonpad_kv_seqlen` / TensorScatter single-query causal behavior
- update Python attention reference causal mask to model ONNX upper-left alignment with an explicit past offset
- add a regression test for issue microsoft#29020

Fixes microsoft#29020

## Validation

- `python -m py_compile onnxruntime/test/python/transformers/test_onnx_attention/common.py onnxruntime/test/python/transformers/test_onnx_attention/test_mha.py onnxruntime/test/python/transformers/test_onnx_attention/test_gqa.py onnxruntime/test/python/transformers/test_onnx_attention/test_tensorscatter_attention.py`
- `git diff --check`

Notes:

- `pytest onnxruntime/test/python/transformers/test_onnx_attention/test_tensorscatter_attention.py -k "cpu_fp32 and causal" -q` could not run locally because this Python environment does not have `onnx` / `onnxruntime` installed.
- After the latest follow-up commit, an incremental rebuild of `onnxruntime_provider_test` was attempted but failed in MSBuild before compiling this change due to a local environment issue: duplicate `Path` / `PATH` environment keys when launching `CL.exe`.
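A minimal sketch of a causal mask with an explicit past offset may help make the alignment concrete. It assumes that "upper-left" means query row `i` may attend to keys `0 .. i + past_len`; the helper `causal_mask` is hypothetical and is not the reference code in `common.py`.

```python
def causal_mask(q_len, kv_len, past_len=0):
    """mask[i][j] is True when query i may attend to key j.

    Assumed rule: upper-left aligned, shifted right by an explicit past offset.
    """
    return [[j <= i + past_len for j in range(kv_len)] for i in range(q_len)]


# q_len=1, kv_len=4, no past: the single query sees only the first key.
print(causal_mask(1, 4))
# The same shapes with past_len=3: the query sees every cached key.
print(causal_mask(1, 4, past_len=3))
```

The two calls differ only in the offset, which is why the reference makes the past length an explicit argument instead of inferring it from `kv_len - q_len`.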
…ts (microsoft#29247)

## Summary

Lift WebGPU FlashAttention's `batch_size == 1` restriction so batched GQA with right-padded prompts (the common GenAI batched-prefill shape) takes the fused FlashAttention path instead of falling back to `ApplyAttention`.

- **Per-batch seqlens in FlashAttention shaders.** Prefill, decode split-reduce, CopyKVCache, and the fused rotary-and-copyKV template now read `seqlens_k[batch_idx]` instead of hardcoding `seqlens_k[0]`. All `past_X = total_X - new_X` subtractions are clamped to avoid u32 underflow when a short batch's per-batch total is less than the batch-wide `sequence_length`.
- **Indirect-dispatch sizing uses GQA's `total_sequence_length` input.** `CopyKVCache`, `SplitPackedQKVWithRotaryEmbeddingAndCopyKV`, and `FlashAttentionDecodeQKV` now take a new `total_sequence_length_input` binding (GQA input #6, GPU-resident under graph capture) for the indirect-dispatch grid sizing. This is the global max KV span across the batch by construction, replacing the previous `seqlens_k[0] + 1u` that under-dispatched whenever batch 0 wasn't the longest. Per-batch `seqlens_k[batch] + 1` still drives causal masking and K/V bounds inside the kernels. GQA now enforces `graph_capture_enabled -> past_present_share_buffer_` so the host-side `use_indirect_dispatch` predicate stays simple.
- **Decoupled attention_bias stride from per-batch OOB.** `attention_bias` is still allocated to the global max `total_sequence_length`; only the causal-mask / softmax tile loops are gated by the per-batch total. The one-past-end fallback was tightened to clamp inside the same row (`offset_base + stride_total_seq - 1u`).
- **Decode workgroup grid stays at global max.** `decode_qkv` keeps a workgroup grid sized to the global max tile count to keep `workgroup_idx` slicing consistent across batches, with neutral `(-inf, 0)` early-exit for tiles beyond a short batch's per-batch total so the `VxReduce` online softmax rescaling is not skewed.
- **New `use_seqlen_k` template parameter** (separate from `use_indirect_dispatch` which still requires graph capture). It is enabled whenever `seqlen_k` is provided and (`graph_capture || batch_size_ > 1`).
- **Rotary fix prerequisite** (`webgpu: fix GQA batched right-padded prefill with do_rotary`, 591df5b): clamps `past_seqlen` to 0 in `RotaryEmbeddingProgram`, `FusedQKRotaryEmbeddingProgram`, and `split_packed_qkv_with_rotary_embedding`, which previously produced gibberish for the shorter batches.

## Motivation

GenAI's batched prefill right-pads short prompts to the batch max and reports each batch's real length via `seqlens_k[b] = real_len[b] - 1`. The previous FlashAttention gate forced every batched call onto the slower `ApplyAttention` path, and the rotary shaders underflowed `u32` for any batch shorter than the batch-wide `sequence_length`, producing garbage Q/K positions and gibberish output text for the shorter batches.

## Test plan

- [x] All `GroupQueryAttentionTest.WebGPU_*` op tests pass, including `BatchedRightPaddedRotaryPrefill` (FlashAttention path) and the new `BatchedRightPaddedRotaryPrefillFlashAttentionLargeSpread_WebGPU` covering a `real_lens` spread > tile_size
- [x] phi4-prune three-prompt batched generation: coherent outputs on WebGPU matching CPU reference (3 prompts, 384 tokens, 173 tps)
- [x] phi4-prune single-prompt generation regression: coherent
- [x] phi4-graph-prune (graph capture enabled): `verify_model_correctness.py` 4/4 PASS; `verify_multi_gen.py` sequential + overlapping both PASS
- [x] whisper-tiny-int4 transcription regression: 2/2 byte-exact with CPU
- [x] Lintrunner clean on all changed files
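The per-batch length bookkeeping and the `u32` clamp described above can be modeled on the host. This is a Python sketch with hypothetical helper names, not the WGSL shaders; `dispatch_span` only models the fact that the `total_sequence_length` input equals the largest per-batch total.

```python
U32_MOD = 2 ** 32


def wrapping_sub_u32(a, b):
    """What an unclamped u32 subtraction yields in a shader."""
    return (a - b) % U32_MOD


def clamped_sub_u32(a, b):
    """Clamped form: never goes below zero."""
    return a - b if a > b else 0


def seqlens_k_from_real_lens(real_lens):
    """GenAI convention from the PR: seqlens_k[b] = real_len[b] - 1."""
    return [n - 1 for n in real_lens]


def per_batch_past(seqlens_k, sequence_length):
    """past = total - new per batch, clamped for right-padded short prompts."""
    return [clamped_sub_u32(s + 1, sequence_length) for s in seqlens_k]


def dispatch_span(seqlens_k):
    """Global max KV span across the batch (model of total_sequence_length)."""
    return max(s + 1 for s in seqlens_k)


seqlens = seqlens_k_from_real_lens([2, 5, 3])  # padded to sequence_length=5
print(per_batch_past(seqlens, 5))   # clamped past lengths
print(wrapping_sub_u32(2, 5))       # the unclamped value for the short batch
print(dispatch_span(seqlens), seqlens[0] + 1)  # global max vs. batch 0 only
```

In this example batch 0 is the shortest prompt, so sizing the dispatch from `seqlens_k[0] + 1` would cover only 2 of the 5 positions the longest batch needs.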
…29216)

### Description

PR1 microsoft#28962 added flash attention for **prefill** and removed flash decoding. This PR adds an optimized kernel for **single-token decode**, which will be faster than the other kernels, including flash decoding. It builds on the prefill-only flash attention change and introduces a dedicated decode kernel.

#### What's included

- **Decode (GEMV) kernel**: A dedicated single-token decode kernel (`MlasGQADecodeGQAThreaded`) for `sequence_length == 1`, parallelized over (batch, head) with a two-pass softmax, using GEMV (`acc[8]`-lane dot product / AXPY) helpers instead of per-block M=1 SGEMM calls. This fixes the per-block SGEMM decode regression.
- The FP32 flash gate (`group_query_attention.cc`) is enabled for `total_sequence_length > 1`, routing prefill to the tiled kernel and decode to the GEMV kernel.
- The quantized KV-cache path is unchanged (FP32-only scope).

#### Results (AMD EPYC 7763, AVX2, 8 threads)

- **Decode:** correctness ~1e-8 vs naive; long-context decode ~1.0–1.5x (T = 4097 ~1.3–1.5x).

### Motivation and Context

The naive GQA path materializes the full score matrix, which is memory-bound for long sequences. Flash attention reduces memory traffic for prefill, and the GEMV decode kernel avoids SGEMM overhead for the M=1 decode case.

### Testing

- Built with `--compile_no_warning_as_error`.
- Correctness verified against the naive path for both prefill and decode (max abs diff ~1e-8).
- Benchmarked via `benchmark_gqa_cpu_flash.py`.
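A single-token decode step with a two-pass softmax built from dot-product and AXPY helpers can be sketched as below. This is an illustrative Python model, not `MlasGQADecodeGQAThreaded`: the function names are hypothetical, the contiguous query-to-KV head mapping is an assumption, and the exact split of work between the two passes in the real kernel may differ.

```python
import math


def dot(a, b):
    """Dot product of one query row with one key row."""
    return sum(x * y for x, y in zip(a, b))


def axpy(alpha, x, y):
    """In-place y += alpha * x."""
    for d in range(len(y)):
        y[d] += alpha * x[d]


def kv_head_for(q_head, num_heads, kv_num_heads):
    """Assumed GQA grouping: contiguous query heads share one KV head."""
    return q_head // (num_heads // kv_num_heads)


def decode_one_head(q, k_cache, v_cache):
    """One (batch, head) work item for sequence_length == 1."""
    scale = 1.0 / math.sqrt(len(q))
    # Pass 1: one dot product per cached key, tracking the row maximum.
    scores = [scale * dot(q, k_row) for k_row in k_cache]
    row_max = max(scores)
    # Pass 2: exponentiate, accumulate the sum, and AXPY the values.
    out = [0.0] * len(v_cache[0])
    denom = 0.0
    for s, v_row in zip(scores, v_cache):
        p = math.exp(s - row_max)
        denom += p
        axpy(p, v_row, out)
    return [o / denom for o in out]
```

Each (batch, head) pair is independent, so the outer loop over those pairs is what a threaded implementation would parallelize.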
…t#29251)

# Fix unbounded lifetime on WithOutputTensor in Rust bindings

## Description

The `WithOutputTensor<'a, T>` struct had a free lifetime parameter `'a` on its `TryFrom<OrtOutputTensor>` impl that was unconstrained by any input. Combined with the `Deref` impl (whose `Target = ArrayView<'a, T, IxDyn>` exposed a `Clone`-able view), it was possible for the `ArrayView` to outlive the underlying `OrtOutputTensor` buffer owner.

This change restructures `WithOutputTensor` to eliminate the unbounded lifetime:

- Removes the `'a` lifetime parameter from `WithOutputTensor`, `OrtOutput`, and `Session::run`
- Removes the `Deref` impl (the escape hatch)
- Replaces the stored `ArrayView<'a, T>` with a raw pointer + shape
- Adds a `view(&self)` method returning `ArrayView<'_, T, IxDyn>`; the view lifetime is now tied to `&self`
- Updates all call sites (examples, integration tests) to use `.view()`

## Motivation

The C API contract (`onnxruntime_c_api.h`) explicitly bounds the data pointer lifetime to the `OrtValue`: the pointer is only valid until the value is destroyed. The Rust type system must enforce this invariant. Previously it did not: the `ArrayView` could be cloned out and observed after the `OrtValue` was freed.

## API Change

```rust
// Before: Deref-based access
let output = outputs[0].float_array().unwrap();
let sum: f32 = output.iter().sum();

// After: explicit view() call
let output = outputs[0].float_array().unwrap();
let sum: f32 = output.view().iter().sum();
```

## Testing

Existing integration tests updated to use the new `view()` API. The fix is enforced at compile time by the borrow checker: the previously problematic pattern now produces a lifetime error.

Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>
### Description

Use new GitHub CI identity for azcopy.

### Motivation and Context

GitHub CI pools have been assigned a new identity.
Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.