Sync with Microsoft ONNX Runtime - 12062026 #1134
Open
ai-fw-intg wants to merge 9 commits into
### Description

Move the existing model package C API off the stable `OrtApi` onto the experimental name-based lookup mechanism added in microsoft#28746. Each model package function is registered individually in `include/onnxruntime/core/session/onnxruntime_experimental_c_api.inc` with the `OrtModelPackageApi_` prefix and the `_SinceV28` version suffix, following the lifecycle rules in `docs/design/Experimental_C_API.md`.

Headline changes:

- `OrtApi::GetModelPackageApi`, the `OrtModelPackageApi` struct, `OrtApis::GetModelPackageApi`, the `OrtModelPackageAPI` namespace, `onnxruntime/core/session/model_package_api.h`, and the C++ wrappers (`Ort::GetModelPackageApi`, `ORT_DEFINE_RELEASE_FROM_API_STRUCT(ModelPackage*)`, `ModelPackageOptions/Context/ComponentContext`) are removed.
- Opaque handle types (`OrtModelPackageOptions`, `OrtModelPackageContext`, `OrtModelPackageComponentContext`) move into `onnxruntime_experimental_c_api.h`.
- All 15 model package functions are registered in `onnxruntime_experimental_c_api.inc`. Impls move into `namespace OrtExperimentalApis` with `_SinceV28`-suffixed names in `model_package_api.cc`; bodies are unchanged.
- `experimental_c_api.cc` gains a forward-decl block (driven by the same `.inc` X-macro) so the auto-generated registration table can take the address of every entry, even those defined in `model_package_api.cc`.
- The Python bindings (`PyModelPackageContext` / `PyModelPackageOptions` / `PyModelPackageComponentContext` and their `onnxruntime.__init__` exports) are removed. Per the design doc we start the experimental API in C/C++ only.
- `onnxruntime/test/autoep/test_model_package.cc` switches to a local `ModelPackageFns` struct populated through the `Ort::Experimental::Get_OrtModelPackageApi_*_Fn(api)` typed accessors.
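The X-macro-driven registration and name-based lookup described above can be illustrated with a minimal, self-contained sketch. Everything in it is hypothetical (`DEMO_EXPERIMENTAL_FUNCS`, `DemoEntry`, `LookupDemoFunction`, and the two `DemoApi_*` functions); the real `.inc` format and lookup code are not shown in this description. For simplicity it assumes every function shares one signature, which the real API does not.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical stand-in for an .inc file: one X(name) line per function.
#define DEMO_EXPERIMENTAL_FUNCS(X)  \
  X(DemoApi_CreateContext_SinceV28) \
  X(DemoApi_ReleaseContext_SinceV28)

namespace demo_impl {
// Forward declarations generated from the list, so the table below can take
// each function's address even when the bodies are defined elsewhere.
#define X(name) int name(int);
DEMO_EXPERIMENTAL_FUNCS(X)
#undef X
}  // namespace demo_impl

using DemoFn = int (*)(int);

struct DemoEntry {
  const char* name;
  DemoFn fn;
};

// Registration table generated from the same list.
static const DemoEntry kDemoTable[] = {
#define X(name) {#name, &demo_impl::name},
    DEMO_EXPERIMENTAL_FUNCS(X)
#undef X
};

// Name-based lookup: returns nullptr when the exact versioned name is unknown.
DemoFn LookupDemoFunction(const char* name) {
  for (const DemoEntry& entry : kDemoTable) {
    if (std::strcmp(entry.name, name) == 0) return entry.fn;
  }
  return nullptr;
}

// Definitions, standing in for bodies that live in a separate .cc file.
namespace demo_impl {
int DemoApi_CreateContext_SinceV28(int x) { return x + 1; }
int DemoApi_ReleaseContext_SinceV28(int x) { return x - 1; }
}  // namespace demo_impl
```

Because each entry is keyed by its full versioned name, a caller asking for a name that was never registered gets a null pointer rather than a mismatched function.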
Consumer usage going forward, in C++:

```cpp
#include "onnxruntime_c_api.h"
#include "onnxruntime_experimental_c_api.h"

const OrtApi* ort = OrtGetApiBase()->GetApi(ORT_API_VERSION);
if (auto* fn = Ort::Experimental::Get_OrtModelPackageApi_CreateModelPackageContext_SinceV28_Fn(ort)) {
  OrtModelPackageContext* ctx = nullptr;
  Ort::ThrowOnError(fn(ORT_TSTR("/path/to/pkg"), &ctx));
  // ...
}
```

### Motivation and Context

The model package API was added to the stable `OrtApi` in 1.27 but has not shipped in a release yet. Now that microsoft#28746 has landed the experimental C API framework, the right home for an iterating preview surface like model package is behind `OrtApi::GetExperimentalFunction`, not on the stable struct. Moving it to experimental:

- frees us to change signatures (each name is uniquely versioned) without breaking the stable ABI;
- gives consumers a clear "is this specific thing available?" contract instead of a struct that *looks* stable but isn't;
- lets the surface be promoted to stable cleanly later (move entries to `OrtApi`, drop the `_SinceV<N>` suffix, remove the experimental entries).

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…icrosoft#28978)

## Summary

The CUDA QMoE INT4/INT8 grouped GEMM always dispatches to the Ampere (SM80) CUTLASS kernel, even on Hopper (SM90), because mixed int-weight + fp16/bf16 activation is not a valid Hopper TMA warp-specialized specialisation. This PR makes weight prepacking always emit the SM80 (column-interleaved) `fpA_intB` layout regardless of the runtime device SM, fixing silently-wrong output on Hopper, and centralizes the arch-clamping logic in a single shared helper. It also cleans up the related tests and tightens MoE parity tolerances that were too loose to catch the layout bug.

## Motivation

microsoft#28749 uses 90 for sm90 weight prepacking. On SM90, `isValidHopperMOESpecialisation<half_t, uint4b_t/uint8_t>()` is `false`, so the grouped MoE GEMM falls back to the SM80 kernel. The weight preprocessor, however, skips column interleaving for `arch == 90`, so an auto-detected (`force_arch=-1`) pack on an H200 produced the non-interleaved SM90 layout that the SM80 kernel cannot consume, yielding wrong results. The previous `PrePackIntExpertWeights` logic clamped to `sm_` (passing SM90 through), and the test that exercised the offline packer used auto-detect, so both could emit the wrong layout.

## Key Changes

| Area | Change |
|---|---|
| `fpA_intB_gemm_preprocessors{.h,_impl.cu}` | Extracted `get_arch_for_mixed_gemm_weight_preprocess(int arch)` as a shared, declared helper (clamps SM to the layout group: `<80→75`, `90→90`, else `80`). |
| `fpA_intB_gemm_preprocessors_impl.h` | `getLayoutDetailsForTransform` now routes through the shared helper instead of duplicating the arch-range logic. |
| `moe_quantization.cc` (`PrePackIntExpertWeights`) | Always packs INT4/INT8 expert weights for the SM80 layout (`get_arch_for_mixed_gemm_weight_preprocess(80)`) instead of clamping to the runtime `sm_`, since the SM80 kernel runs on every GPU. |
| `onnxruntime_pybind_quant.cc` (`PackWeightsForMixedGemm`) | Replaced the ad-hoc `{75,80,90}` allowlist with the shared helper, so `force_arch` is clamped consistently with the runtime dispatch (removes the now-unused `<set>` include). |
| `contrib_defs.cc` / `moe_quantization.h` | Updated `weights_prepacked` schema/field docs: layouts for `-1`/`1` are EP-determined; for the CUDA EP `-1` and `1` are equivalent today (both SM80), `1` reserved for a future Hopper-specific layout. |
| `test_qmoe_cuda.py` | Removed the dead, never-called `preprocess_weights_for_mixed_gemm` helper; the real path (`quant_dequant_blockwise`) already pins `sm=80`. |
| `test_moe_cuda.py` | Pinned the offline packer to `arch=80`, and tightened FP16 QMoE parity tolerance from `atol 3.0 (4-bit)` / `2.0 (8-bit)` to `0.5` now that the layout is correct. |
| `docs/` | Regenerated `ContribOperators.md` and updated `moe_qmoe.md` to match the new schema docs and SM80-always packing rationale. |

## Testing Notes

On an H200 (SM90), with the CUDA 12.x/13.x Python wheel:

```bash
python -m pytest onnxruntime/test/python/transformers/test_qmoe_cuda.py
python -m pytest onnxruntime/test/python/transformers/test_moe_cuda.py -k "PhiQMoE or qmoe"
```

- `test_qmoe_cuda.py` SwiGLU parity: SM80 layout → max diff ~0.001 (pass, tol 0.1); the prior SM90 layout produced max diff ~1.2 (fail), confirming the fix.
- `test_moe_cuda.py` `TestPhiQMoE` (4-bit and 8-bit, all batch/seq combinations): worst observed `max_diff` ≈ 0.375 with the fixed layout, comfortably under the new `atol=0.5`.
- `ruff check` passes on both edited test files.

---------

Co-authored-by: tlwu <tlwu@example.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
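The clamping rule quoted in the Key Changes table (`<80→75`, `90→90`, else `80`) and the SM80-always prepack policy can be sketched as follows. The function names `ClampArchForLayoutDemo` and `ArchForIntMoePrepackDemo` are hypothetical stand-ins; the body of the real `get_arch_for_mixed_gemm_weight_preprocess` is not shown in this description, so this is only an illustration of the stated mapping.

```cpp
#include <cassert>

// Hypothetical stand-in for the shared helper: maps an SM version to its
// weight-layout group (<80 -> 75, 90 -> 90, anything else -> 80).
int ClampArchForLayoutDemo(int arch) {
  if (arch < 80) return 75;
  if (arch == 90) return 90;
  return 80;
}

// Mirrors the described prepack policy: INT4/INT8 MoE expert weights are
// always packed for the SM80 layout, whatever SM the device reports, because
// the grouped GEMM dispatches to the SM80 kernel on every GPU.
int ArchForIntMoePrepackDemo(int /*runtime_sm*/) {
  return ClampArchForLayoutDemo(80);
}
```

Under this sketch, clamping the runtime SM directly would return 90 on Hopper, which is the layout mismatch the PR describes; pinning the input to 80 avoids it.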
… length (microsoft#28389)

## Summary

- Extend the FlashAttention decode path to work with any sequence length (not just seq_len=1), with causal masking and `use_seqlen_k` support for static KV cache
- Add m_tile optimization to process multiple Q rows per workgroup (m_tile=1/2/4), amortizing K/V loads
- Fuse the separate QKT and SplitVx shaders into a single QKV kernel using online softmax, eliminating the intermediate `qk` tensor (`B×H×seq×present_seq`) and reducing dispatch count from 3 to 2
- Route between prefill (FlashAttentionProgram) and split-reduce (fused QKV + VxReduce) paths based on sequence length

## Resolved Issues

**Whisper decoding prefill improved from 4.68ms to 1.09ms.** Whisper's decoder attention has a small sequence length but large total sequence length (seq_len=4, total_seq_len=1500). The default prefill shader (FlashAttentionProgram) has low parallelism in this case because each workgroup iterates serially over the full KV cache. The split-reduce path tiles the KV dimension across workgroups, achieving much higher GPU occupancy for this workload shape.

## Details

**Fused QKV kernel**: Each workgroup computes QK^T dot products, applies attention bias and causal mask, computes local softmax (per-tile max and sum), normalizes, and multiplies by V, all in one kernel. Per-tile metadata (max, sum) is written for the VxReduce shader to rescale partial outputs using online softmax: `output = Σ(partial_i × local_sum_i × exp(local_max_i - global_max)) / global_sum`.

**Path routing** (`use_split_reduce`): The split-reduce path is used when `sequence_length_ < 32`; otherwise the single-kernel FlashAttentionProgram prefill path is used. Microbenchmarks on Phi-4 (32 heads, head_size 128, GQA group 3) show split-reduce is 1.13×-2.07× faster than the fused prefill kernel across `sequence_length ∈ {16, 30, 31}` × `total_sequence_length ∈ {128, 500, 2000}`. The previous heuristic additionally gated on `total_sequence_length_ > 1000`, but that signal is 0 under graph capture (seqlen_k lives on the GPU) and the carve-out is unnecessary because split-reduce is uniformly faster for short Q.

## Test plan

- [x] 30/30 MHA unit tests pass
- [x] phi4-graph-prune produces correct output
- [x] whisper-tiny-int4 produces correct transcription
- [x] clang-format clean
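The online-softmax rescaling formula in the Details section can be checked with a small CPU sketch. It is not the WebGPU shader: it assumes a single query row, a scalar value per key (head size 1), and no attention bias or causal mask, and all names (`TilePartial`, `ComputeTile`, `ReduceTiles`, `SplitReduceAttention`) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Per-tile result, matching the metadata described above.
struct TilePartial {
  double partial;    // locally normalized softmax(scores) * values for the tile
  double local_max;  // max score within the tile
  double local_sum;  // sum of exp(score - local_max) within the tile
};

// Fused step for one KV tile [begin, end): local softmax, then multiply by V.
TilePartial ComputeTile(const std::vector<double>& scores,
                        const std::vector<double>& values,
                        std::size_t begin, std::size_t end) {
  double local_max = scores[begin];
  for (std::size_t i = begin; i < end; ++i) local_max = std::max(local_max, scores[i]);
  double local_sum = 0.0, acc = 0.0;
  for (std::size_t i = begin; i < end; ++i) {
    const double p = std::exp(scores[i] - local_max);
    local_sum += p;
    acc += p * values[i];
  }
  return {acc / local_sum, local_max, local_sum};
}

// Reduce step: output = sum(partial_i * local_sum_i * exp(local_max_i - global_max)) / global_sum.
double ReduceTiles(const std::vector<TilePartial>& tiles) {
  double global_max = tiles[0].local_max;
  for (const TilePartial& t : tiles) global_max = std::max(global_max, t.local_max);
  double global_sum = 0.0, acc = 0.0;
  for (const TilePartial& t : tiles) {
    const double weight = t.local_sum * std::exp(t.local_max - global_max);
    global_sum += weight;
    acc += t.partial * weight;
  }
  return acc / global_sum;
}

// Splits the KV dimension into tiles of tile_size and combines the partials.
double SplitReduceAttention(const std::vector<double>& scores,
                            const std::vector<double>& values,
                            std::size_t tile_size) {
  std::vector<TilePartial> tiles;
  for (std::size_t begin = 0; begin < scores.size(); begin += tile_size) {
    tiles.push_back(ComputeTile(scores, values, begin,
                                std::min(begin + tile_size, scores.size())));
  }
  return ReduceTiles(tiles);
}
```

Running the whole KV range as a single tile gives the ordinary softmax-weighted sum, so comparing `SplitReduceAttention(scores, values, 2)` against `SplitReduceAttention(scores, values, scores.size())` is a quick way to confirm the rescaling is exact up to floating-point rounding.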
This pull request introduces important safety checks to prevent out-of-bounds access in the logits processing code for transformers. The main updates ensure that token IDs are validated against the vocabulary size before being used, which improves robustness and prevents potential crashes.

**Safety and robustness improvements:**

* Added bounds checking for token IDs in the `RepetitionPenaltyLogitsProcessor<T>::Process` method to ensure only valid IDs are used when accessing `beam_token_scores`.
* Added bounds checking for token IDs in the `NoRepeatNGramLogitsProcessor<T>::Process` method to prevent out-of-bounds writes to `beam_token_scores`.
* Updated the `NextTokenScores::SetScore` method to return early if the provided `token_id` is out of bounds, replacing the previous assert with a safe check.
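The kind of check described in the list above might look like the following sketch. `ScoresDemo` and `ApplyRepetitionPenaltyDemo` are hypothetical, simplified stand-ins rather than the actual ONNX Runtime classes, and the penalty rule used here (multiply negative scores, divide non-negative ones) is the conventional one and is an assumption about the real processor.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a per-beam score buffer of vocab_size entries.
struct ScoresDemo {
  std::vector<float> scores;

  // Returns early for ids outside [0, vocab_size) instead of asserting.
  void SetScore(int32_t token_id, float value) {
    if (token_id < 0 || static_cast<std::size_t>(token_id) >= scores.size()) return;
    scores[static_cast<std::size_t>(token_id)] = value;
  }
};

// Applies a repetition penalty to previously seen ids, skipping invalid ids
// so that no read or write lands outside the score buffer.
void ApplyRepetitionPenaltyDemo(ScoresDemo& s, const std::vector<int32_t>& seen_ids,
                                float penalty) {
  for (int32_t id : seen_ids) {
    if (id < 0 || static_cast<std::size_t>(id) >= s.scores.size()) continue;
    float& score = s.scores[static_cast<std::size_t>(id)];
    score = (score < 0.0f) ? score * penalty : score / penalty;
  }
}
```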
…8703)

## Description

This PR adds Linux NPU discovery through sysfs accel devices.

Currently, `DeviceDiscovery::DiscoverDevicesForPlatform()` on Linux discovers CPU and GPU devices, but NPU discovery is still missing. As a result, plugin execution providers that filter devices by `OrtHardwareDeviceType_NPU` do not receive any NPU hardware devices on Linux, even when the NPU is present and exposed by the kernel.

This change scans `/sys/class/accel` for `accelN` devices and creates `OrtHardwareDevice` entries with:

- `type = OrtHardwareDeviceType_NPU`
- PCI `vendor_id`
- PCI `device_id`
- `accel_idx` metadata
- `pci_bus_id` metadata when available

This enables Linux systems with NPUs exposed through the accel subsystem, such as AMD Ryzen AI / XDNA devices, to be reported through ORT device discovery and made available to plugin EP factories.

## Changes

- Add Linux sysfs discovery for NPU devices under `/sys/class/accel`.
- Read NPU PCI vendor and device IDs from the underlying sysfs device path.
- Add NPU metadata including `accel_idx` and `pci_bus_id`.
- Include discovered NPU devices in `DeviceDiscovery::DiscoverDevicesForPlatform()`.
- Add a `kSysfsAccelPath` constant for the Linux accel sysfs path.

## Motivation

Linux plugin EPs that target NPUs rely on ORT passing `OrtHardwareDeviceType_NPU` devices into `GetSupportedDevices()`. Without Linux NPU discovery, those EPs cannot claim NPU devices and provider selection policies such as `PREFER_NPU` silently fall back to CPU.

Fixes microsoft#28660.
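A scan of this kind could be sketched with `std::filesystem` (C++17) as below. This is not the PR's implementation: `AccelDeviceDemo`, `ReadHexFile`, and `ScanAccelDevices` are hypothetical names, and the assumption that the IDs live in `accelN/device/vendor` and `accelN/device/device` follows the usual sysfs layout for PCI devices rather than anything stated in the description. The root directory is a parameter so the function can be exercised against a temporary directory.

```cpp
#include <cassert>
#include <cstdint>
#include <exception>
#include <filesystem>
#include <fstream>
#include <optional>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

struct AccelDeviceDemo {
  int accel_idx;
  uint32_t vendor_id;
  uint32_t device_id;
};

// Reads a sysfs-style attribute such as "0x1234\n" and parses it as hex.
std::optional<uint32_t> ReadHexFile(const fs::path& path) {
  std::ifstream in(path);
  std::string text;
  if (!(in >> text)) return std::nullopt;
  try {
    return static_cast<uint32_t>(std::stoul(text, nullptr, 16));
  } catch (const std::exception&) {
    return std::nullopt;
  }
}

// Scans accel_root (e.g. /sys/class/accel) for accelN entries with readable IDs.
std::vector<AccelDeviceDemo> ScanAccelDevices(const fs::path& accel_root) {
  std::vector<AccelDeviceDemo> found;
  std::error_code ec;
  if (!fs::is_directory(accel_root, ec)) return found;
  for (const auto& entry : fs::directory_iterator(accel_root, ec)) {
    const std::string name = entry.path().filename().string();
    if (name.size() <= 5 || name.size() > 9 || name.compare(0, 5, "accel") != 0) continue;
    const std::string digits = name.substr(5);
    if (digits.find_first_not_of("0123456789") != std::string::npos) continue;
    const auto vendor = ReadHexFile(entry.path() / "device" / "vendor");
    const auto device = ReadHexFile(entry.path() / "device" / "device");
    if (!vendor || !device) continue;  // skip entries without PCI IDs
    found.push_back({std::stoi(digits), *vendor, *device});
  }
  return found;
}
```

On a real system the call would be `ScanAccelDevices("/sys/class/accel")`; an empty result simply means no accel devices were exposed.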
…RT version (microsoft#28794)

### Description

Adds a new telemetry event for inference failures that logs EP versions and types along with the runtime error. Adds logging of the ORT version to the other telemetry events. Adds logging of EP versions to the SessionCreation telemetry event.

### Motivation and Context

To better diagnose inference failures.

---------

Co-authored-by: Darshak Bhatti <dabhatti@micorsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description

Fix STFT frame pointer arithmetic for complex-valued input so frame starts are computed in input samples, not trailing real/imag components. Since the frame view pointer is `U*`, one pointer increment advances one full real or complex sample.

Also add validation that `frame_step` is positive and keep a defensive bounds check before creating non-owning tensor views.

Review feedback addressed: simplified the frame pointer arithmetic, fixed the swapped STFT input comments, documented the defensive bounds check, and added double-complex regression coverage. The new STFT validation/regression tests exclude `kDmlExecutionProvider` because these CPU STFT validation/regression paths do not consistently match DirectML behavior in Windows GPU CI.

### Motivation and Context

For complex input shaped `[batch_size, signal_length, 2]`, pointer increments already advance by one real/imag pair. Multiplying frame offsets by `signal_components == 2` again can advance past the valid frame start, allowing later frames to read across batches or beyond the input allocation.

### Testing

- `git diff --check -- onnxruntime/core/providers/cpu/signal/dft.cc onnxruntime/test/providers/cpu/signal/signal_ops_test.cc`
- `.\.venv\Scripts\python.exe tools\ci_build\build.py --config RelWithDebInfo --build --parallel --target onnxruntime_provider_test --build_dir build\Windows`
- `.\onnxruntime_provider_test.exe --gtest_filter="SignalOpsTest.STFTFloat:SignalOpsTest.STFTFrameStepMustBePositive:SignalOpsTest.STFTFloatComplexInputBatched:SignalOpsTest.STFTDoubleComplexInputBatched"` from `build\Windows\RelWithDebInfo\RelWithDebInfo`

---------

Co-authored-by: Gopalakrishnan Nallasamy <gnallasamy@microsoft.com>
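The pointer arithmetic at issue can be illustrated with a typed-pointer sketch. `FrameStartDemo` is a hypothetical helper, not the code in `dft.cc`; it only shows why the offset must be counted in samples when the pointer type is already the sample type, together with a positivity check on `frame_step` and a bounds check.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Returns a pointer to the first sample of frame frame_idx in batch batch_idx,
// or nullptr if frame_step is zero or the frame would run past the end of that
// batch's signal. U is a real type or std::complex<...>, so one pointer
// increment already advances one whole sample.
template <typename U>
const U* FrameStartDemo(const U* input, std::size_t signal_length,
                        std::size_t batch_idx, std::size_t frame_idx,
                        std::size_t frame_step, std::size_t frame_length) {
  if (frame_step == 0) return nullptr;  // frame_step must be positive
  // Offset in samples; it is NOT multiplied by the number of components.
  const std::size_t start = frame_idx * frame_step;
  if (start > signal_length || frame_length > signal_length - start) return nullptr;
  return input + batch_idx * signal_length + start;
}
```

With `U = std::complex<float>`, multiplying `start` by 2 would double every offset, so frames in the second half of a batch would point into the next batch or past the allocation, which is the failure mode described above.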
…icrosoft#28965)

### Description

When a QMoE model sets `weights_prepacked=0` (raw `[E, N, K/pack]` int weights) and the session has `session.disable_prepacking`, `PrePack()` never runs, so `packed_fc{1,2}_weights_` stay null and `int_weights_consumed_by_prepack` is false. The code then falls through to the raw initializer pointers, but those bytes are not in CUTLASS layout, so the runner consumes them as-if-prepacked and produces silently wrong output with no diagnostic.

Changes in `onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc` (`QMoE::ComputeInternal`):

- **Int path**: Added a defensive `INVALID_ARGUMENT` guard: when `is_int && !weights_prepacked_` but either prepack buffer is null, return a clear error instead of feeding non-CUTLASS bytes to the runner.
- **wfp4afp8 native path**: Same fall-through (`packed_fp4_fc{1,2}_weights_ ? ... : raw`) replaced with an explicit guard that errors when the repacked FP4 buffers were not produced.

Also added a focused regression test in `onnxruntime/test/contrib_ops/moe_test.cc` covering `quant_type='int'` with `weights_prepacked=0` and `session.disable_prepacking=1`, asserting that QMoE fails with an actionable error instead of producing output.

Merged the branch with the latest `main`.

### Motivation and Context

A prior fix removed the null-pointer crash on this path but left a misleading-success outcome that is newly user-reachable via the `weights_prepacked=0` contract, the exact silent-failure mode the offline-path work set out to eliminate. These guards convert that into a loud, actionable error. The wfp4afp8 branch shares the same fall-through and is hardened for consistency.

The added regression test ensures this fail-loudly behavior remains covered going forward, especially when prepacking is disabled at the session level.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
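The int-path guard described above can be sketched in isolation. `StatusDemo`, `QMoEStateDemo`, and `SelectIntWeightsDemo` are hypothetical; the real code returns an ORT status from `QMoE::ComputeInternal`, and the error text here is invented for illustration. The sketch also assumes that when `weights_prepacked` is nonzero the initializer bytes are already in the kernel's layout.

```cpp
#include <cassert>
#include <string>

// Hypothetical minimal status type standing in for an ORT status.
struct StatusDemo {
  bool ok;
  std::string message;
};

struct QMoEStateDemo {
  bool is_int;
  bool weights_prepacked;  // model attribute: initializer already in kernel layout
  const void* packed_fc1;  // produced by PrePack(); null if PrePack() never ran
  const void* packed_fc2;
};

// Picks the weight pointers to hand to the GEMM runner, failing loudly when
// raw weights were supplied but the prepacked buffers were never produced.
StatusDemo SelectIntWeightsDemo(const QMoEStateDemo& s, const void* raw_fc1,
                                const void* raw_fc2, const void** fc1,
                                const void** fc2) {
  if (s.is_int && !s.weights_prepacked) {
    if (s.packed_fc1 == nullptr || s.packed_fc2 == nullptr) {
      return {false,
              "INVALID_ARGUMENT: raw int weights require PrePack(), but "
              "prepacked buffers are missing (is prepacking disabled?)"};
    }
    *fc1 = s.packed_fc1;
    *fc2 = s.packed_fc2;
    return {true, ""};
  }
  // Weights were prepacked offline: the initializer bytes are used directly.
  *fc1 = raw_fc1;
  *fc2 = raw_fc2;
  return {true, ""};
}
```

The point of the guard is that the raw pointers are never used as a fallback on the `weights_prepacked=0` path; the only outcomes are the prepacked buffers or an error.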
Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.