Sync with Microsoft ONNX Runtime - 28062026#1169
Open
ai-fw-intg wants to merge 20 commits into
Open
Conversation
…icrosoft#28771) ### Description <!-- Describe your changes. --> Relax the input-validation in OrtApi::CompileModel to accept OrtModel instances with zero graph inputs. Previously, ModelCompilationOptions::Check() rejected such models with "OrtModel graph must have at least one input and one output defined." The check now requires only at least one graph output; the zero-input case is legal. Tests in test_model_builder_api.cc are restructured: - The old CompileFromModelWithEmptyInputsOutputs_Fails is renamed to CompileFromModelWithEmptyOutputs_Fails and reshaped to provide 1 input + 0 outputs, isolating the output-only check. - A new regression test CompileFromModelWithEmptyInputs_Succeeds builds a 0-input model with a RandomNormal node and verifies compilation succeeds. ### Motivation and Context <!-- - Why is this change required? What problem does it solve? - If it fixes an open issue, please link to the issue here. --> Fixes microsoft#28135 The original check was too restrictive and impacts callers (e.g., WebNN/Chromium needs to call CompileModel on such models in a separate compiler process (and then load the compiled artifact via CreateSessionFromArray in the GPU process)).
…ttention (microsoft#29240) ### Description The CUDA `GroupQueryAttention` kernel derives a KV-cache append offset from the `seqlens_k` input (`past_seq_lens = (seqlens_k + 1) - sequence_length`). On the CUDA EP `seqlens_k` is device-resident (only `total_sequence_length` is a CPU input), so the host-side range validation in the operator/helper is skipped. The device kernel `UnpackRoPEAppend` then guarded the cache store with only a one-sided upper bound (`cache_s < max_seqlen`), so an out-of-range `seqlens_k` could produce a negative offset that is sign-extended into the cache-index arithmetic. The CPU operator already validates `seqlens_k` host-side; this change brings the CUDA path to parity by guarding on the device. ### Changes - `group_query_attention_impl.cu` (`GetSequenceLengths`): clamp the negative case at the source so both `total_seq_lens` and the append offset `past_seq_lens` stay non-negative for all downstream consumers. - `group_query_attention_qkv.cuh` (`UnpackRoPEAppend`): make the KV-cache store bound two-sided (`cache_s >= 0 && cache_s < max_seqlen`), mirroring the existing position-index guard a few lines above. This also covers the fast-decode path, where `past_seq_lens` points directly at the raw input and bypasses `GetSequenceLengths`. - Added `NegativeSeqlensK_CacheAppend_NoOOB_CUDA` regression test exercising the KV-cache append path with an out-of-range `seqlens_k` (CUDA-guarded; skips when CUDA EP is unavailable). ### Notes - The two-sided guard matches the pattern introduced for the rotary position index in microsoft#27597. - CPU is unaffected (already validated host-side); WebGPU relies on the CPU-validated `total_sequence_length`. The CUDA implementation is shared with ROCm via hipify. - The regression is a device-memory write best observed under `compute-sanitizer`; the test asserts the run completes with finite outputs. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…oft#28962) ## Summary Adds an FP32 flash attention path for the CPU `com.microsoft.GroupQueryAttention` (GQA) contrib op, mirroring the existing quantized-KV flash attention path. The new tiled, online-softmax kernel avoids materializing the full `[S, T]` attention score matrix. It is restricted to prefill / chunked-prefill (`sequence_length > 1`); single-token decode falls back to the naive path. With causal early-termination it is faster than the naive path across all measured prefill lengths while using a fraction of the memory. ## Key changes - **New MLAS kernel** `onnxruntime/core/mlas/lib/flashattn_gqa.cpp` (`MlasFlashAttentionGQA`): - Tiled QK / softmax / SV with online-softmax (running max/sum rescaling). - GQA head grouping (`num_heads % kv_num_heads == 0`), causal masking, local window, additive attention bias, and packed-QKV input. - **Causal early-termination**: during prefill, KV blocks that fall entirely in the causally masked upper triangle are skipped (`break` once `ir >= past_seqlen + q_idx + row_size_q`), avoiding the wasted QK/SV GEMMs over roughly half of the square prefill attention matrix. - Per-batch invocation for ragged / shared-buffer `seqlens_k`. - **MLAS API** `onnxruntime/core/mlas/inc/mlas.h`: new `MlasFlashAttentionGQAArgs` struct and `MlasFlashAttentionGQA` declaration. - **Dispatch** `onnxruntime/contrib_ops/cpu/bert/gqa_attention_base.h`: new `ApplyAttentionFlash` that concatenates new K/V into the FP32 present cache and invokes the kernel. The per-thread scratch buffer size is computed with `SafeInt<size_t>` to guard against `size_t` overflow on large/malformed shapes before allocation. - **Wiring** `onnxruntime/contrib_ops/cpu/bert/group_query_attention.cc`: float-only flash dispatch, active only for prefill (`sequence_length > 1`) and when `softcap == 0`, no smooth softmax, no head sink, no QK output; falls back to the naive path otherwise. The existing `ORT_GQA_DISABLE_FLASH_ATTENTION` env var disables it. - **CMake** `cmake/onnxruntime_mlas.cmake`: register the new source file. - **Docs** `docs/contrib_ops/cpu/gqa.md`: document the non-quantized flash attention path, activation conditions, causal early-termination, file list, and FP32 flash-vs-naive benchmark results. - **Benchmark** `onnxruntime/test/python/transformers/benchmark_gqa_cpu_flash.py`: add an FP32 (non-quantized) mode (`--fp32`) for operator-level flash-vs-naive comparison. ### Why prefill-only (`sequence_length > 1`) Single-token decode (`sequence_length == 1`) produces only a `[1, total_sequence_length]` score row per head, so there is nothing to tile away and the extra online-softmax bookkeeping makes the flash kernel slower and noisier than naive in practice. Restricting the flash path to prefill keeps the consistent prefill win without regressing decode. Because decode is excluded, the two-phase flash-decoding kernels are unreachable and have been removed for a smaller, simpler implementation. `float16` continues to use the naive path (the kernel is float-only, matching the quantized flash constraint). ## Performance Operator-level, AMD EPYC 7763 (16 physical cores), threads=8, FP32 KV cache, `B=1, num_heads=16, kv_num_heads=8, head_size=128`. Flash is faster than naive across all measured prefill lengths (and single-threaded as well, 1.4-1.8x), confirming the gain is algorithmic - the causal early-termination removes the wasted upper-triangle work that previously made flash slower than naive at short sequences. | Prefill Seq Length | Naive (ms) | Flash (ms) | Speedup | |---:|---:|---:|---:| | 512 | 5.8-8.4 | 4.2-5.3 | 1.4-1.6x | | 1024 | 25-29 | 13-18 | 1.6-2.0x | | 2048 | 87-118 | 52-65 | 1.5-2.0x | | 4096 | 365-380 | 213-234 | 1.6-1.7x | The flash path's primary structural benefit is memory: it never allocates the full O(N x S x T) attention matrix (~1 GB at S=4096, N=16) and instead uses an O(S x Bc) per-thread tile. ## Testing - **C++ op tests**: `onnxruntime_provider_test --gtest_filter="GroupQueryAttentionTest.*"` - 38 passed (12 GPU/WebGPU skipped) with flash on (default) and with `ORT_GQA_DISABLE_FLASH_ATTENTION=1`. - **Flash vs. naive parity** (FP32): output of the flash path matches the naive path (max abs diff ~1e-7) across prefill (block-aligned and non-aligned `S`), MHA and GQA head ratios, and local window. Decode now uses the naive path on both sides (diff 0). - **Python parity** (`test_gqa_cpu.py`, flash vs. naive reference): focused FP32 sweep of 600 prompt configurations covering all head sizes (32-256), GQA ratios `(6,6)/(6,3)/(9,9)/(9,3)`, batches `1/3/5`, causal/local window, attention bias, position ids, packed QKV, and with/without KV buffer - all passed. The official `test_gqa_cpu.py` suite passes. Two correctness bugs were found and fixed via the parity sweep while developing this path: 1. Attention-bias batch stride ignored head broadcasting for `[batch, 1, S, T]` bias. 2. Query batch stride was hardcoded to `num_heads * S * H`, which is incorrect for packed-QKV input (correct stride is `(num_heads + 2 * kv_num_heads) * S * H`).
…, GQA underflow, and ep_weight_sharing_ctx_gen build (microsoft#28245) ### Description This PR contains three commits: **Commit 1: Miscellaneous fixes** - Downgrade QNN ETW profiling mismatch logs from ERROR to VERBOSE to reduce excessive telemetry noise (~1 billion events/week across Windows devices) - Add bounds checking in GQA attention to prevent `size_t` underflow when `seqlens_k` contains invalid data (fixes microsoft#27170) - Build `ep_weight_sharing_ctx_gen` for TensorRT, OpenVINO, and VitisAI in addition to QNN **Commit 2: Bump cpuinfo and add `cpuinfo_deinitialize()` integration** Applications that dynamically load and unload the onnxruntime DLL leave orphaned heap allocations from cpuinfo when the library is unloaded mid-process. These are flagged as memory leaks by App Verifier, Valgrind, AddressSanitizer, and LeakSanitizer. This commit bumps `pytorch/cpuinfo` to a version that implements `cpuinfo_deinitialize()` ([pytorch/cpuinfo#387](pytorch/cpuinfo#387)) and adds ORT integration: - `CPUIDInfo::ShutDown()` calls `cpuinfo_deinitialize()` to free heap-allocated globals - `DllMain` calls `ShutdownCpuInfo()` on `DLL_PROCESS_DETACH` - In memleak-check builds, shutdown also runs during process termination - `InstanceCreated` atomic guard prevents singleton creation during DLL unload **Commit 3: Update to official cpuinfo merged fix** After [pytorch/cpuinfo#387](pytorch/cpuinfo#387) merged upstream, updated the dependency to point to `pytorch/cpuinfo` main (`4628dc06`). Patch changes: - **Removed** `win_arm_fp16_detection_fallback.patch` — upstreamed via [pytorch/cpuinfo#348](pytorch/cpuinfo#348) - **Updated** `patch_vcpkg_arm64ec_support.patch` — regenerated for new cpuinfo; still needed ([pytorch/cpuinfo#324](pytorch/cpuinfo#324) not yet merged) - **Updated** `patch_cpuinfo_h_for_arm64ec.patch` — retained, not yet upstream - **Regenerated** `fix_missing_sysfs_fallback.patch` — updated context lines for new cpuinfo code ### Motivation and Context - pytorch/cpuinfo#150 - microsoft#16117 - microsoft#23762
…icrosoft#29221) ## Description The CUDA plugin EP previously rejected combining a user-provided compute stream (`user_compute_stream`) with CUDA graph capture (`enable_cuda_graph`), returning `ORT_INVALID_ARGUMENT`. This PR removes that restriction so the two options can be used together: when both are set, graph capture and replay run on the user-owned stream (the same stream the kernels are issued to), matching the bundled (non-plugin) CUDA EP behavior. Several supporting fixes make capture on a shared stream stable and Memcpy-free. ## Summary of Changes ### Allow user stream + CUDA graph | File | Change | |------|--------| | [onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc](onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc) | Remove the validation that rejected `user_compute_stream` + `enable_cuda_graph` together. | | [onnxruntime/core/providers/cuda/plugin/cuda_ep.cc](onnxruntime/core/providers/cuda/plugin/cuda_ep.cc) | `PerThreadContext` accepts an optional external graph stream. When both options are set it captures/replays on the user stream and does **not** create or destroy it (the user owns its lifetime); otherwise it owns a dedicated graph stream as before. | ### Stable, Memcpy-free CUDA graph capture | File | Change | |------|--------| | [onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h](onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h) | Route kernel scratch/workspace allocations through the EP allocator (BFC arena) instead of raw `cudaMallocAsync`/`cudaMalloc`. After warmup the arena reaches steady state, so the capture run serves scratch from already-reserved chunks and the device free-memory footprint stays stable — required for correct capture. Matches the built-in CUDA EP. | | [onnxruntime/core/providers/cuda/tensor/shape_op.cc](onnxruntime/core/providers/cuda/tensor/shape_op.cc) | Add an adapter-based `Shape` kernel under `#ifdef BUILD_CUDA_EP_AS_PLUGIN` with identical semantics to the CPU `Shape`. Registering `Shape` on the EP keeps it off the CPU EP and avoids the Memcpy nodes that would otherwise break CUDA graph capture. | | [cmake/onnxruntime_providers_cuda_plugin.cmake](cmake/onnxruntime_providers_cuda_plugin.cmake) | Stop excluding `shape_op.cc` from the plugin build so the adapter-based `Shape` kernel is compiled in. | ### Null-allocator fallback in PrePack (plugin boundary) In the plugin build the `AllocatorPtr` passed to `PrePack` can arrive null across the library boundary. Each kernel now falls back to its own default-memory allocator (`Info().GetAllocator(OrtMemTypeDefault)`), which is always valid. - [onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc](onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc) - [onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc](onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc) - [onnxruntime/contrib_ops/cuda/quantization/matmul_nbits.cc](onnxruntime/contrib_ops/cuda/quantization/matmul_nbits.cc) ### Misc - [onnxruntime/core/framework/session_state.cc](onnxruntime/core/framework/session_state.cc) — wrap a long line (no behavior change). ## Testing - New test: [onnxruntime/test/providers/cuda/plugin/cuda_plugin_user_stream_graph_test.cc](onnxruntime/test/providers/cuda/plugin/cuda_plugin_user_stream_graph_test.cc) covering: 1. Session creation succeeds with both `user_compute_stream` and `enable_cuda_graph` set (regression for the removed validation). 2. Capture + replay on the user stream produce correct results. 3. Replay after an in-place input update on the user stream is correct. - Tests are gated on `ORT_UNIT_TEST_HAS_CUDA_PLUGIN_EP` and skip gracefully when no CUDA device or plugin library is available. ## Motivation and Context Users that drive ORT from their own CUDA stream (e.g. to interleave ORT inference with their own kernels) previously could not also benefit from CUDA graph capture on the plugin EP. This change brings the plugin EP to parity with the bundled CUDA EP for that workflow. ## Checklist - [x] Tests added/updated - [x] No breaking changes (relaxes a previously rejected option combination) - [ ] Documentation updated (if applicable)
## Summary - align CPU ONNX Attention causal masking with upper-left behavior for q_len=1, kv_len>1, no past - preserve the existing `nonpad_kv_seqlen` / TensorScatter single-query causal behavior - update Python attention reference causal mask to model ONNX upper-left alignment with an explicit past offset - add a regression test for issue microsoft#29020 Fixes microsoft#29020 ## Validation - `python -m py_compile onnxruntime/test/python/transformers/test_onnx_attention/common.py onnxruntime/test/python/transformers/test_onnx_attention/test_mha.py onnxruntime/test/python/transformers/test_onnx_attention/test_gqa.py onnxruntime/test/python/transformers/test_onnx_attention/test_tensorscatter_attention.py` - `git diff --check` Notes: - `pytest onnxruntime/test/python/transformers/test_onnx_attention/test_tensorscatter_attention.py -k "cpu_fp32 and causal" -q` could not run locally because this Python environment does not have `onnx` / `onnxruntime` installed. - After the latest follow-up commit, an incremental rebuild of `onnxruntime_provider_test` was attempted but failed in MSBuild before compiling this change due to a local environment issue: duplicate `Path` / `PATH` environment keys when launching `CL.exe`.
…ts (microsoft#29247) ## Summary Lift WebGPU FlashAttention's `batch_size == 1` restriction so batched GQA with right-padded prompts (the common GenAI batched-prefill shape) takes the fused FlashAttention path instead of falling back to `ApplyAttention`. - **Per-batch seqlens in FlashAttention shaders.** Prefill, decode split-reduce, CopyKVCache, and the fused rotary-and-copyKV template now read `seqlens_k[batch_idx]` instead of hardcoding `seqlens_k[0]`. All `past_X = total_X - new_X` subtractions are clamped to avoid u32 underflow when a short batch's per-batch total is less than the batch-wide `sequence_length`. - **Indirect-dispatch sizing uses GQA's `total_sequence_length` input.** `CopyKVCache`, `SplitPackedQKVWithRotaryEmbeddingAndCopyKV`, and `FlashAttentionDecodeQKV` now take a new `total_sequence_length_input` binding (GQA input #6, GPU-resident under graph capture) for the indirect-dispatch grid sizing. This is the global max KV span across the batch by construction, replacing the previous `seqlens_k[0] + 1u` that under-dispatched whenever batch 0 wasn't the longest. Per-batch `seqlens_k[batch] + 1` still drives causal masking and K/V bounds inside the kernels. GQA now enforces `graph_capture_enabled -> past_present_share_buffer_` so the host-side `use_indirect_dispatch` predicate stays simple. - **Decoupled attention_bias stride from per-batch OOB.** `attention_bias` is still allocated to the global max `total_sequence_length`; only the causal-mask / softmax tile loops are gated by the per-batch total. The one-past-end fallback was tightened to clamp inside the same row (`offset_base + stride_total_seq - 1u`). - **Decode workgroup grid stays at global max.** `decode_qkv` keeps a workgroup grid sized to the global max tile count to keep `workgroup_idx` slicing consistent across batches, with neutral `(-inf, 0)` early-exit for tiles beyond a short batch's per-batch total so the `VxReduce` online softmax rescaling is not skewed. - **New `use_seqlen_k` template parameter** (separate from `use_indirect_dispatch` which still requires graph capture). It is enabled whenever `seqlen_k` is provided and (`graph_capture || batch_size_ > 1`). - **Rotary fix prerequisite** (`webgpu: fix GQA batched right-padded prefill with do_rotary`, 591df5b): clamps `past_seqlen` to 0 in `RotaryEmbeddingProgram`, `FusedQKRotaryEmbeddingProgram`, and `split_packed_qkv_with_rotary_embedding`, which previously produced gibberish for the shorter batches. ## Motivation GenAI's batched prefill right-pads short prompts to the batch max and reports each batch's real length via `seqlens_k[b] = real_len[b] - 1`. The previous FlashAttention gate forced every batched call onto the slower `ApplyAttention` path, and the rotary shaders underflowed `u32` for any batch shorter than the batch-wide `sequence_length`, producing garbage Q/K positions and gibberish output text for the shorter batches. ## Test plan - [x] All `GroupQueryAttentionTest.WebGPU_*` op tests pass, including `BatchedRightPaddedRotaryPrefill` (FlashAttention path) and the new `BatchedRightPaddedRotaryPrefillFlashAttentionLargeSpread_WebGPU` covering a `real_lens` spread > tile_size - [x] phi4-prune three-prompt batched generation: coherent outputs on WebGPU matching CPU reference (3 prompts, 384 tokens, 173 tps) - [x] phi4-prune single-prompt generation regression: coherent - [x] phi4-graph-prune (graph capture enabled): `verify_model_correctness.py` 4/4 PASS; `verify_multi_gen.py` sequential + overlapping both PASS - [x] whisper-tiny-int4 transcription regression: 2/2 byte-exact with CPU - [x] Lintrunner clean on all changed files
…29216) ### Description PR1 microsoft#28962 adds flash attention for **prefill**, and removed flash decoding. This PR will add optimized kernel for **single-token decode**, which will be faster than other kernels including flash decoding. This PR builds on the prefill-only flash attention change and additionally introduces a dedicated decode kernel. #### What's included - **Decode (GEMV) kernel** — A dedicated single-token decode kernel (`MlasGQADecodeGQAThreaded`) for `sequence_length == 1`, parallelized over (batch, head) with a two-pass softmax, using GEMV (`acc[8]`-lane dot product / AXPY) helpers instead of per-block M=1 SGEMM calls. This fixes the per-block SGEMM decode regression. - The FP32 flash gate (`group_query_attention.cc`) is enabled for `total_sequence_length > 1`, routing prefill to the tiled kernel and decode to the GEMV kernel. - The quantized KV-cache path is unchanged (FP32-only scope). #### Results (AMD EPYC 7763, AVX2, 8 threads) - **Decode:** correctness ~1e-8 vs naive; long-context decode ~1.0–1.5x (T = 4097 ~1.3–1.5x). ### Motivation and Context The naive GQA path materializes the full score matrix, which is memory-bound for long sequences. Flash attention reduces memory traffic for prefill, and the GEMV decode kernel avoids SGEMM overhead for the M=1 decode case. ### Testing - Built with `--compile_no_warning_as_error`. - Correctness verified against the naive path for both prefill and decode (max abs diff ~1e-8). - Benchmarked via `benchmark_gqa_cpu_flash.py`.
…t#29251) # Fix unbounded lifetime on WithOutputTensor in Rust bindings ## Description The `WithOutputTensor<'a, T>` struct had a free lifetime parameter `'a` on its `TryFrom<OrtOutputTensor>` impl that was unconstrained by any input. Combined with the `Deref` impl (whose `Target = ArrayView<'a, T, IxDyn>` exposed a `Clone`-able view), it was possible for the `ArrayView` to outlive the underlying `OrtOutputTensor` buffer owner. This change restructures `WithOutputTensor` to eliminate the unbounded lifetime: - Removes the `'a` lifetime parameter from `WithOutputTensor`, `OrtOutput`, and `Session::run` - Removes the `Deref` impl (the escape hatch) - Replaces the stored `ArrayView<'a, T>` with a raw pointer + shape - Adds a `view(&self)` method returning `ArrayView<'_, T, IxDyn>` — the view lifetime is now tied to `&self` - Updates all call sites (examples, integration tests) to use `.view()` ## Motivation The C API contract (`onnxruntime_c_api.h`) explicitly bounds the data pointer lifetime to the `OrtValue`: the pointer is only valid until the value is destroyed. The Rust type system must enforce this invariant. Previously it did not — the `ArrayView` could be cloned out and observed after the `OrtValue` was freed. ## API Change ```rust // Before: Deref-based access let output = outputs[0].float_array().unwrap(); let sum: f32 = output.iter().sum(); // After: explicit view() call let output = outputs[0].float_array().unwrap(); let sum: f32 = output.view().iter().sum(); ``` ## Testing Existing integration tests updated to use the new `view()` API. The fix is enforced at compile time by the borrow checker — the previously problematic pattern now produces a lifetime error. Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>
### Description Use new GitHub CI identity for azcopy. ### Motivation and Context GitHub CI pools have been assigned a new identity.
…crosoft#29244) ## Description This PR adds a kernel-context C API accessor for the framework `OrtSyncStream*` and uses it in the CUDA plugin EP so scratch allocations can be tagged with the actual compute stream selected for the kernel. It is stacked on microsoft#29221 and turns the previously documented concurrent multi-stream limitation into a gated capability: older runtimes keep the conservative fallback, while runtimes with the new API can safely advertise concurrent runs when EP-level unified stream mode is not forced. ## Summary of Changes ### Public API and Adapters | File | Change | |------|--------| | `include/onnxruntime/core/session/onnxruntime_c_api.h` | Adds `KernelContext_GetSyncStream` to expose the borrowed framework stream for stream-aware allocation and synchronization bookkeeping. | | `onnxruntime/core/session/custom_ops.cc` | Implements the API by retrieving the kernel's `OpKernelContext::GetComputeStream()` inside ORT core. | | `onnxruntime/core/session/ort_apis.h` and `onnxruntime/core/session/onnxruntime_c_api.cc` | Declares and wires the new API entry. | | `include/onnxruntime/core/session/onnxruntime_cxx_api.h` and `include/onnxruntime/core/session/onnxruntime_cxx_inline.h` | Adds the C++ `Ort::KernelContext::GetSyncStream()` wrapper. | | `include/onnxruntime/ep/adapter/op_kernel.h` | Adds a version-gated EP adapter accessor so plugins can use the API when available and fall back safely otherwise. | ### CUDA Plugin EP - Tracks the framework stream corresponding to both raw CUDA stream handles and `OrtStreamAdapter` stream arguments. - Passes the framework stream to scratch allocation so arena chunks are stream-tagged instead of using a null stream tag. - Re-enables concurrent run support only when `KernelContext_GetSyncStream` is available and EP-level unified stream mode is not forced. ### Tests and Docs - Extends the shared-lib custom-op test helper to exercise `Ort::KernelContext::GetSyncStream()`. - Updates CUDA plugin EP docs to describe stream-tagged scratch allocation, compatibility fallback, and the new API audit entry. ## Why a C API is needed The implementation of `KernelContext_GetSyncStream` is intentionally small, but the API boundary is the important part. ORT core can safely cast `OrtKernelContext*` back to `onnxruntime::OpKernelContext*` because it owns both the opaque C handle and the private C++ implementation. A plugin kernel should not perform that cast directly: it would make the plugin depend on ORT-core private C++ layout, vtables, and exact build compatibility. The new API keeps that private cast inside ORT core and gives plugin kernels a stable ABI entry point: ```text plugin kernel -> opaque OrtKernelContext* -> OrtApi::KernelContext_GetSyncStream -> ORT core retrieves the actual framework stream ``` This also lets the plugin use runtime version gating. When loaded by an older ORT runtime that does not expose the API, the adapter returns null, scratch allocation uses the conservative fallback, and concurrent runs are not advertised. ## Testing - `lintrunner -a` - `ninja -C build/cu130_plugin/Debug onnxruntime_providers_cuda_plugin` - `ninja -C build/cu130_plugin/Debug onnxruntime_shared_lib_test` - `cd build/cu130_plugin/Debug && ./onnxruntime_shared_lib_test --gtest_filter=CApiTest.custom_op_handler --gtest_color=no` - VS Code diagnostics on touched C++ and header files ## Checklist - [x] Tests added/updated - [x] Documentation updated - [x] Backward compatibility guarded by runtime API-version checks - [ ] CI passes
…icrosoft#29242) This pull request introduces device-side and host-side validation for sparse attention input indices and key sequence lengths, improving error detection and robustness for both CPU and CUDA implementations. It also adds an environment variable to optionally disable the new device-side validation for performance reasons. Some related test code has been updated to match these changes. **Validation improvements:** * Added host-side validation functions (`ValidateCSRIndices`, `ValidateKeyLengths`) for CSR indices and key sequence lengths in the CPU kernel, and integrated them into the compute path. This ensures invalid inputs are caught early on the CPU. [[1]](diffhunk://#diff-44554dbe530c593f0f0b85a591859bb6b0a21e62992c61f9622e5456a144cb45R39-R106) [[2]](diffhunk://#diff-44554dbe530c593f0f0b85a591859bb6b0a21e62992c61f9622e5456a144cb45R146-R150) * Implemented a CUDA kernel (`ValidateCSRIndicesKernel`) and supporting function (`ValidateCSRIndicesOnDevice`) to check CSR row-pointer monotonicity, column-index range, and key lengths on device, with detailed error codes and messages. The CUDA kernel is invoked in the compute path unless disabled by environment variable. [[1]](diffhunk://#diff-0c8a322cd4611e589f38a67876f309f9d83869a6d2239cadf86970ed2005ebd0R336-R449) [[2]](diffhunk://#diff-fff06841efe15d5f95c02bb38daa1d5aa0775de0e1777d9d418222e44ebc88feR71-R95) [[3]](diffhunk://#diff-08ea97fecd6c161add2607d5b7d406c0f3d2b0f1280ebb49d801f085d941770aR220-R236) **Configurability:** * Introduced the environment variable `ORT_DISABLE_SPARSE_ATTENTION_INPUT_VALIDATION` to allow users to skip device-side validation for performance when inputs are known to be valid. This is parsed and used in the CUDA kernel. [[1]](diffhunk://#diff-56cfb57cdd5f9134a8fea24bd006c691860c64a1f78f4b2c69a861d847dee9ddR94-R99) [[2]](diffhunk://#diff-08ea97fecd6c161add2607d5b7d406c0f3d2b0f1280ebb49d801f085d941770aR59-R60) [[3]](diffhunk://#diff-8e47232d826ceae90a93aa6cde8f2869dcb1998f37804055fcfb584d95e21a96R29) **Test updates:** * Refactored test helper function names and test input shapes to match the new validation logic and error messages. [[1]](diffhunk://#diff-21bdb8e3ad8b50a72c5da77349f280c15a6938ece2bea4e987bd12ff8bcb2a0eL27-R27) [[2]](diffhunk://#diff-21bdb8e3ad8b50a72c5da77349f280c15a6938ece2bea4e987bd12ff8bcb2a0eL45-R45) [[3]](diffhunk://#diff-21bdb8e3ad8b50a72c5da77349f280c15a6938ece2bea4e987bd12ff8bcb2a0eL212-R220) **Other:** * Removed redundant key length value checks from `CheckInputs` in favor of the new validation routines.
microsoft#29239) ### Description `WhisperDecoderSubgraph::CreateInitialFeeds` constructed the decoder initial feeds using a single value that mixed a **byte count** with an **element count**. The total size was computed as `cur_len * batch_beam_size * sizeof(int)` (bytes) and then reused as: - the element count for the int32 staging buffer (`MakeUniquePtr<int>`), and - the element count for the `gsl::span<int>` source/destination passed to the device copy. Because the `input_ids` tensor is allocated for exactly `batch_beam_size * cur_len` int32 elements, the spans claimed 4x the real extent, so the device copy ran past the end of the buffer. The per-beam `memcpy` also used the same combined value as its length instead of a single sequence's byte size. This mirrors the correct T5 sibling (`subgraph_t5_decoder.cc`), which separates the element count (used for the spans/staging allocation) from the per-sequence byte count (used for the `memcpy`). ### Changes - `subgraph_whisper_decoder.cc`: `total_size` is now the element count `cur_len * batch_beam_size`; introduced `sequence_bytes = cur_len * sizeof(int32_t)` for the per-beam `memcpy`. The staging buffer and spans use `int32_t` consistently to match the `int32_t` tensors/sequences. - Added regression test `BeamSearchTest.DummyWhisperWithSequenceInputIds` (CPU, and CUDA under `USE_CUDA`) exercising the `use_sequence_as_input_ids` path, with a deterministic dummy model and its generator script. The test validates both the `sequences` and `scores` outputs. ### Related bool-tensor normalization fixes While exercising the Whisper path, bool tensors copied from raw data could hold non-canonical byte values (anything non-zero rather than strictly `{0, 1}`), causing provider-dependent behavior. To keep the fix self-contained, the following normalization changes are included: - `tensorprotoutils.cc`: `UnpackTensor<bool>` normalizes raw-data bytes to `{0, 1}` (with a `static_assert(sizeof(bool) == 1)` guarding the byte-wise loop). - `compress_impl.cu` (CUDA `Compress`): the prefix-sum sizing predicate normalizes bool bytes to `{0, 1}` so the output sizing agrees with the element-selection truthiness check. Since bool initializers are now normalized on unpack, the remaining exposure is runtime-produced bool condition tensors. - Added `CompressTest.Compress_cuda_non_canonical_bool_condition` (under `USE_CUDA`), which feeds a raw `0xFF` condition byte through a session-level run (`OpTester` normalizes bool inputs and so cannot reproduce this) and asserts the Compress output is sized by truthiness rather than by the sign-extended byte value. ### Motivation The decoder shares one implementation file across CPU/CUDA/ROCm, so this single change covers all execution providers. The previous behavior could overrun the staging/feed buffers for models that drive the sequence-as-input-ids decoder path. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
) ### Description Enable pre-packed weights sharing for `MatMulNBits` operator on CPU. When performing DQ + MatMul -> MatMulNBits fusion, the original weight names are lost, so the standard `AddInitializer` approach does not work. To overcome this, introduced the option for graph optimization pass to tag weights which are sharable across sessions (hashing the content and matching it across the sessions). ### Motivation and Context For executing ASG SLMs on CPU - there are two sessions, one for prefill stage and for decode stage (due to different shapes and session options). With this change, storing the weights in memory twice is avoided. The first sessions pre-packs the weights which the second session can reuse. Confirmed memory reduction through the WPA memory traces. --------- Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description The RKNPU execution provider's ONNX converter creates implicit (all-zero) bias buffers when a `Conv` / `Gemm` / `QLinearConv` node omits its bias input. The buffer size was computed as `sizeof(T) * dim` (where `dim` derives from a model weight's shape) with no overflow check, and the raw allocation was tracked in a manually-freed `void*` list. This PR hardens the converter: - **Dimension validation:** ONNX `int64_t` dimensions are validated (via `ORT_ENFORCE` in a new `ToRknpuDim` helper) before being narrowed to the RKNPU `uint32_t` shape representation, rejecting negative or out-of-range values. This covers all four ingestion points (`HandleInitializer`, `GetInputOfOnnxModel`, `GetShape`, `GetSupportedNodes`). - **Overflow-checked allocation:** all four implicit-bias sites (`AddLayerConvImpl`, `AddLayerQLinearConvImpl`, `AddLayerDepthwiseConvImpl`, `AddLayerFC`) go through a shared `AllocZeroedBias` helper that computes the byte count with `SafeInt<size_t>` (throws on overflow) and returns a zero-initialized `std::make_unique<uint8_t[]>` buffer. - **RAII ownership:** `free_list_` is now `std::vector<std::unique_ptr<uint8_t[]>>`, so the bias buffers are freed automatically and `Clear()` no longer walks/`free()`s raw pointers. ### Motivation and Context A malicious ONNX model can provide dimensions that are unsafe for the RKNPU converter's 32-bit shape representation or for byte-size allocation arithmetic: - ONNX stores dimensions as `int64_t`, while the RKNPU converter/DDK uses `uint32_t` shape values. Silently narrowing a large or negative `int64_t` value can produce a misleading `uint32_t` dimension. - Even after a dimension is represented as `uint32_t`, the original `sizeof(T) * dim` could overflow `size_t` on 32-bit RKNPU targets. For example, `dim = 0x40000400` makes `sizeof(float) * dim` wrap to **4096 bytes** while the created tensor still advertises `dim` elements, which would corrupt the heap when the bias is consumed by the driver. - The original code also passed the `malloc` result to `memset` without a null check. The fix uses `SafeInt<size_t>` (the ORT-standard idiom for memory-size arithmetic), validates ONNX dimensions before they enter the RKNPU `uint32_t` shape model, and replaces the manual `malloc`/`free` list with zero-initialized, RAII-owned `std::make_unique<uint8_t[]>` buffers. **Validation:** the RKNPU EP requires the Rockchip DDK (`RKNPU_DDK_PATH`) and an ARM target, so it does not build on a typical x64 dev box and has no GPU-free CI leg. The change was validated with `clang-format`, `git diff --check`, and a standalone test against ORT's `SafeInt.hpp`, confirming the guard throws on the overflowing dimensions (`0x40000400`, `0xFFFFFFFF`) while preserving normal and zero-dimension behavior. **Testing:** No unit test is included because the RKNPU EP is not compiled in any CI leg (it requires the proprietary Rockchip DDK and an ARM target), and the `ToRknpuDim` / `AllocZeroedBias` helpers have internal (file-`static`) linkage, so they are not reachable from the gtest suites. The overflow/validation logic was instead exercised with a standalone test against ORT's `SafeInt.hpp` as noted above. --------- Co-authored-by: Gopalakrishnan Nallasamy <gopalakrishnan.nallasamy@microsoft.com>
microsoft#29202) ## Description Fix microsoft#29198. NVIDIA restructured the CUDA Python wheels starting with CUDA 13: the per-component CUDA Toolkit packages (cublas, cufft, cuda_runtime, cuda_nvrtc, curand, ...) were consolidated into a single `nvidia/cu{major}` package and the `-cuNN` suffix was dropped from those package names. This PR updates the DLL/shared-library preload logic and the wheel dependency metadata so `onnxruntime-gpu` (and `onnxruntime-trt-rtx`) keep working on both the legacy CUDA 12 layout and the new CUDA 13 consolidated layout. ## Summary of Changes ### Preload logic (`onnxruntime/__init__.py`) | File | Change | |------|--------| | `onnxruntime/__init__.py` | `_get_nvidia_dll_paths` now detects the CUDA 13+ consolidated layout and resolves CUDA libraries under `nvidia/cu{major}` — Windows uses an architecture sub-folder (`bin/<arch>`, e.g. `bin/x86_64`), Linux uses a flat `lib`. The legacy CUDA 12 per-component paths are preserved. | | `onnxruntime/__init__.py` | Added `build_cuda_version` and `arch` parameters (for testability/arch override); cuDNN paths factored out since cuDNN keeps its own `nvidia/cudnn` package layout in both schemes. | | `onnxruntime/__init__.py` | `print_debug_info` drops the `-cuNN` suffix from CUDA Toolkit package names for CUDA 13+ (cuDNN keeps its suffixed name). | ### Wheel dependency metadata (`setup.py`) | File | Change | |------|--------| | `setup.py` | `onnxruntime-gpu` `cuda` extras drop the `-cuNN` suffix for CUDA 13+ (`nvidia-cuda-nvrtc`, `nvidia-cuda-runtime`, `nvidia-cufft`, `nvidia-curand`); cuDNN dependency keeps the suffixed name. | | `setup.py` | `onnxruntime-trt-rtx` CUDA Runtime dependency drops the `-cuNN` suffix for CUDA 13+. | ### Tests (`onnxruntime/test/python/onnxruntime_test_python_preload_dlls.py`) - New unit tests pin the expected relative paths for the CUDA 12 (legacy) and CUDA 13 (consolidated) layouts on both Windows and Linux, the Windows arch override, the Linux flat-`lib` layout, the unchanged cuDNN layout, and the `cuda`/`cudnn` toggles. ## Testing - Run the new tests: `python -m pytest onnxruntime/test/python/onnxruntime_test_python_preload_dlls.py` (or `python -m unittest onnxruntime.test.python.onnxruntime_test_python_preload_dlls`). - Backward compatibility: CUDA 12 paths and the cuDNN layout are unchanged; only CUDA 13+ takes the new consolidated paths and unsuffixed package names. - Build in Linux and Windows, and `pip install onnxruntime-gpu*.whl[cuda,cudnn]`, then `import onnxruntime; onnxruntime.preload_dlls()` can run successfully in python. ## Checklist - [x] Tests added/updated - [x] No breaking changes (CUDA 12 behavior preserved)
…ck helpers (microsoft#28624) ## Summary This PR adds an opt-in mechanism that lets an application supply its own I/O callbacks for an execution provider's EPContext binary data, so the data can live somewhere other than a plain file on disk (for example, an encrypted store or an in-memory buffer). It introduces the callback APIs end-to-end and demonstrates their use with a sample helper in the AutoEP example plugin EP. When an EP compiles a model into an EPContext model, it may emit the compiled blob either embedded in the ONNX model or as a separate external payload. For the external case, ORT previously assumed the payload is a file. These callbacks let the application own that read/write instead, while ORT core stays policy-neutral and never imposes a storage format. ### What this PR adds - **Write callback (`OrtWriteNamedBufferFunc`) + setter `OrtCompileApi::ModelCompilationOptions_SetEpContextDataWriteFunc`.** Set on `OrtModelCompilationOptions`, because writing EPContext binary data happens only during **compilation**. Passing a NULL callback clears a previously set one. - **Read callback (`OrtReadNamedBufferFunc`) + setter `OrtApi::SessionOptions_SetEpContextDataReadFunc`.** Set on `OrtSessionOptions`, because reading external EPContext binary data happens during **session load / inference**. Passing a NULL callback clears a previously set one. - **EP-facing access via `OrtEpContextConfig`.** Both callbacks are surfaced to execution providers through a single unified handle, `OrtEpContextConfig`, obtained via `OrtEpApi::SessionOptions_GetEpContextConfig` (getters `EpContextConfig_GetEpContextDataReadFunc` / `EpContextConfig_GetEpContextDataWriteFunc`, released with `ReleaseEpContextConfig`). This keeps the application-facing setters scoped to the correct lifecycle while giving EPs one consistent place to retrieve both callbacks. Each setter's doc comment cross-references the other so the split is discoverable. - **Experimental API surface + C++ accessors.** These functions ship through ORT's experimental API mechanism (declared in `include/onnxruntime/core/session/onnxruntime_experimental_c_api.inc`), so they are reached via the generated `Ort::Experimental::Get_<name>_SinceV28_Fn(...)` / `...FnOrThrow(...)` accessors rather than fixed `OrtApi` slots. A move-only RAII wrapper, **`Ort::Experimental::EpContextConfig`** (in `onnxruntime_experimental_cxx_api.h`), owns an `OrtEpContextConfig` and exposes `GetReadFunc()` / `GetWriteFunc()`; it can be constructed directly from a C++ `SessionOptions` / `ConstSessionOptions`. - **Sample-only helper utilities** (`onnxruntime/test/autoep/library/ep_context_data_utils.h`) implementing callback-or-file fallback behavior: if a callback is supplied it is used, otherwise the helper falls back to direct file I/O. The AutoEP example plugin EP uses this helper for its external EPContext read/write paths. Because the names read on the load side originate from the untrusted EPContext model (`ep_cache_context` attribute), the helper validates them: it rejects absolute/rooted paths, `..` traversal, and directory-like names (`.` or a trailing separator), and confines model-relative names to the model directory (resolving `.`/`..` and symlinks via `std::filesystem::weakly_canonical`). It reports all failures via `OrtStatus*` (no exceptions) and lives outside the public C API / EP ABI, so it is purely illustrative and imposes no policy on ORT core; its doc comments note that production EPs should still apply their own sandboxing and payload size limits. The callback typedef names (`OrtReadNamedBufferFunc` / `OrtWriteNamedBufferFunc`) are intentionally generic. They are currently used for EPContext binary data, but the contract is deliberately storage-agnostic so future APIs can reuse the same callback shape for other named data payloads. ### Note on the Android workflow change `.github/workflows/android.yml` bumps the minimal-build binary-size threshold (`1436672` -> `1438720` bytes) to accommodate the small size increase from compiling the new experimental API into the Android minimal build. ## Testing - Built and tested in RelWithDebInfo: `python tools/ci_build/build.py --config RelWithDebInfo --build --parallel --test --build_dir build\Windows`. - Focused EPContext suites: - Public C/C++ API: `onnxruntime_shared_lib_test.exe --gtest_filter=EpContextDataApiTest.*` -> 9 passed. - AutoEP helper + compile/load end-to-end (callbacks and file fallback): `onnxruntime_autoep_test.exe --gtest_filter=*EpContext*` -> 17 passed, 1 skipped (`EpContextDataUtils_ResolvePathRejectsSymlinkEscape` requires the Windows "create symbolic link" privilege). - `clang-format` clean on touched C++ files; `git diff --check`: clean. Test layout: public EPContext API tests in `onnxruntime/test/shared_lib/test_ep_context_data_api.cc`; sample-helper unit tests in `onnxruntime/test/autoep/ep_context_data_utils_test.cc`; compile/load end-to-end tests in `onnxruntime/test/autoep/test_execution.cc`. --------- Co-authored-by: Gopalakrishnan Nallasamy <gnallasamy@microsoft.com> Co-authored-by: Gopalakrishnan Nallasamy <gopalakrishnan.nallasamy@microsoft.com> Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
### Description
Bool initializers supplied via `TensorProto` `raw_data` are copied
verbatim by `UnpackTensor<bool>`, so their bytes are not guaranteed to
be the canonical `{0, 1}` (the `int32_data` path normalizes via
`static_cast<bool>`, but the `raw_data` path did not). Kernels across
the codebase assume bool tensors hold `{0, 1}`.
The CUDA `Compress` kernel is concretely affected: its output-sizing
path sign-extends the condition bytes (`int8_t` -> `int32_t`) through
`cub::DeviceScan::InclusiveSum`, while `_CompressKernel` selects
elements using bool truthiness (`condition_data[div]`). For condition
bytes outside `{0, 1}` the two interpretations disagree and the output
is sized inconsistently with how elements are written. The CPU kernel
uses truthiness for both sizing and selection and is unaffected.
### Changes
- `UnpackTensor<bool>` (`tensorprotoutils.cc`): normalize `raw_data`
bytes to `{0, 1}` after copy. The `UnpackTensorWithExternalData<bool>`
specialization does the same for external data read through that path.
- CUDA `Compress` `CastToInt32` (`compress_impl.cu`): normalize to `{0,
1}` (still returns `int32_t`, preserving the accumulator-widening intent
of microsoft#9295) so the sizing path matches the kernel's write predicate,
matching the CPU kernel and the CUDA `NonZero` `bool(x)` convention.
This makes the CUDA `Compress` kernel correct independently of how its
bool condition initializer was materialized.
- Shared helper `utils::NormalizeBoolTensorIfNeeded(Tensor&)`
(`tensorprotoutils.{h,cc}`): single normalization point, reused by
`TensorProtoToTensor()` (external branch, after `MakeCpuTensorCopy`) and
by the session-init external device-copy path.
- `session_state_utils.cc::DeserializeTensorProto`: for **external**
bool initializers loaded onto a non-CPU device, normalize the writable
CPU staging copy before `CopyTensorFromCPUToDevice`. The
`GetExtDataFromTensorProto` buffer may be a read-only mmap, so it is
normalized via a writable copy rather than in place.
- Unit tests in `tensorutils_test.cc` for bool `raw_data` with
non-canonical bytes and for the `NormalizeBoolTensorIfNeeded` helper. A
`Compress` `OpTester` test cannot reproduce the original bug because the
test harness itself normalizes bool during input construction, so
coverage is placed at the deserialization layer. Tests use only Status
returns and gtest assertions, so they build and run in no-exception
builds.
### Coverage / scope of bool normalization
Fully covered:
- In-proto `raw_data` bool initializers (all EPs), via
`UnpackTensor<bool>`.
- External bool initializers reaching `TensorProtoToTensor()`.
- External bool initializers copied to a non-CPU device through the
session-init device-copy path.
Intentionally **not** normalized (by design):
- The CPU zero-copy mmap path for external initializers
(`GetExtDataFromTensorProto` returns a read-only/shared mapping that
cannot be safely modified in place).
- The custom external-data-loader path
(`LoadExtDataToTensorFromTensorProto`), which loads directly into a
device tensor.
These remaining paths are safe for the concrete bug this PR targets
because the CUDA `Compress` kernel is hardened in `compress_impl.cu`
regardless of initializer storage. Other byte-comparing bool consumers
fed by an external mmap/custom-loader initializer with non-canonical
bytes are out of scope here.
### Motivation and Context
`CastToInt32` was introduced in microsoft#9295 to widen the `cub::InclusiveSum`
accumulator (an int8 overflow fix); it did not normalize the bool
interpretation. The accumulator-width and bool-normalization concerns
are independent. This change addresses the latter at the deserialization
source and hardens the CUDA `Compress` kernel.
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This removes some unused code for MinLatency, which is not used in MoE or QMoE.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.