
Sync with Microsoft ONNX Runtime - 12062026 #1134

Open
ai-fw-intg wants to merge 9 commits into ovep-develop from sync_msft_12062026

Conversation

@ai-fw-intg


Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.

jambayk and others added 9 commits June 10, 2026 17:13
### Description

Move the existing model package C API off the stable `OrtApi` onto the
experimental name-based lookup mechanism added in microsoft#28746. Each model
package function is registered individually in
`include/onnxruntime/core/session/onnxruntime_experimental_c_api.inc`
with the `OrtModelPackageApi_` prefix and the `_SinceV28` version
suffix, following the lifecycle rules in
`docs/design/Experimental_C_API.md`.

Headline changes:

- `OrtApi::GetModelPackageApi`, the `OrtModelPackageApi` struct,
`OrtApis::GetModelPackageApi`, the `OrtModelPackageAPI` namespace,
`onnxruntime/core/session/model_package_api.h`, and the C++ wrappers
(`Ort::GetModelPackageApi`,
`ORT_DEFINE_RELEASE_FROM_API_STRUCT(ModelPackage*)`,
`ModelPackageOptions/Context/ComponentContext`) are removed.
- Opaque handle types (`OrtModelPackageOptions`,
`OrtModelPackageContext`, `OrtModelPackageComponentContext`) move into
`onnxruntime_experimental_c_api.h`.
- All 15 model package functions are registered in
`onnxruntime_experimental_c_api.inc`. Impls move into `namespace
OrtExperimentalApis` with `_SinceV28`-suffixed names in
`model_package_api.cc`; bodies are unchanged.
- `experimental_c_api.cc` gains a forward-decl block (driven by the same
`.inc` X-macro) so the auto-generated registration table can take the
address of every entry, even those defined in `model_package_api.cc`.
- The Python bindings (`PyModelPackageContext` / `PyModelPackageOptions`
/ `PyModelPackageComponentContext` and their `onnxruntime.__init__`
exports) are removed. Per the design doc we start the experimental API
in C/C++ only.
- `onnxruntime/test/autoep/test_model_package.cc` switches to a local
`ModelPackageFns` struct populated through the
`Ort::Experimental::Get_OrtModelPackageApi_*_Fn(api)` typed accessors.
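
The X-macro pattern mentioned above (one list driving both the forward declarations and the name-based registration table) can be sketched as follows. This is a minimal illustration, not the actual contents of `onnxruntime_experimental_c_api.inc` or `experimental_c_api.cc`: every identifier containing `Sketch` or `SKETCH` is hypothetical, and the real macro shape may differ.

```cpp
#include <cstring>

// Hypothetical stand-in for the .inc file: one row per experimental function.
#define SKETCH_EXPERIMENTAL_API_LIST(X)          \
  X(int, SketchApi_Add_SinceV28, (int a, int b)) \
  X(int, SketchApi_Negate_SinceV28, (int a))

// Pass 1: forward-declare every entry so the table below can take its address,
// even when the definition lives in another translation unit.
#define SKETCH_DECLARE(ret, name, params) ret name params;
SKETCH_EXPERIMENTAL_API_LIST(SKETCH_DECLARE)
#undef SKETCH_DECLARE

using GenericFnSketch = void (*)();

struct ExperimentalEntrySketch {
  const char* name;
  GenericFnSketch fn;
};

// Pass 2: generate the name -> function pointer table from the same list.
#define SKETCH_ROW(ret, name, params) {#name, reinterpret_cast<GenericFnSketch>(&name)},
static const ExperimentalEntrySketch kExperimentalTableSketch[] = {
    SKETCH_EXPERIMENTAL_API_LIST(SKETCH_ROW)};
#undef SKETCH_ROW

// Name-based lookup: returns nullptr when the versioned name is unknown.
GenericFnSketch SketchGetExperimentalFunction(const char* name) {
  for (const ExperimentalEntrySketch& entry : kExperimentalTableSketch) {
    if (std::strcmp(entry.name, name) == 0) return entry.fn;
  }
  return nullptr;
}

// Definitions; in the PR these would sit in a separate .cc file.
int SketchApi_Add_SinceV28(int a, int b) { return a + b; }
int SketchApi_Negate_SinceV28(int a) { return -a; }
```

A caller casts the returned pointer back to the expected signature and treats `nullptr` as "not available in this build", which mirrors the availability contract described in the PR.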

Consumer usage going forward, in C++:

```cpp
#include "onnxruntime_c_api.h"
#include "onnxruntime_experimental_c_api.h"

const OrtApi* ort = OrtGetApiBase()->GetApi(ORT_API_VERSION);

if (auto* fn = Ort::Experimental::Get_OrtModelPackageApi_CreateModelPackageContext_SinceV28_Fn(ort)) {
  OrtModelPackageContext* ctx = nullptr;
  Ort::ThrowOnError(fn(ORT_TSTR("/path/to/pkg"), &ctx));
  // ...
}
```

### Motivation and Context

The model package API was added to the stable `OrtApi` in 1.27 but has
not shipped in a release yet. Now that microsoft#28746 has landed the
experimental C API framework, the right home for an iterating preview
surface like model package is behind `OrtApi::GetExperimentalFunction`,
not on the stable struct.

Moving it to experimental:

- frees us to change signatures (each name is uniquely versioned)
without breaking the stable ABI;
- gives consumers a clear "is this specific thing available?" contract
instead of a struct that *looks* stable but isn't;
- lets the surface be promoted to stable cleanly later (move entries to
`OrtApi`, drop the `_SinceV<N>` suffix, remove the experimental
entries).

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…icrosoft#28978)

## Summary

The CUDA QMoE INT4/INT8 grouped GEMM always dispatches to the Ampere
(SM80) CUTLASS kernel — even on Hopper (SM90) — because mixed int-weight
+ fp16/bf16 activation is not a valid Hopper TMA warp-specialized
specialisation. This PR makes weight prepacking always emit the SM80
(column-interleaved) `fpA_intB` layout regardless of the runtime device
SM, fixing silently-wrong output on Hopper, and centralizes the
arch-clamping logic in a single shared helper. It also cleans up the
related tests and tightens MoE parity tolerances that were too loose to
catch the layout bug.

## Motivation

microsoft#28749 uses arch `90` for SM90
weight prepacking.

On SM90, `isValidHopperMOESpecialisation<half_t, uint4b_t/uint8_t>()` is
`false`, so the grouped MoE GEMM falls back to the SM80 kernel. The
weight preprocessor, however, skips column interleaving for `arch ==
90`, so an auto-detected (`force_arch=-1`) pack on an H200 produced the
non-interleaved SM90 layout that the SM80 kernel cannot consume —
yielding wrong results. The previous `PrePackIntExpertWeights` logic
clamped to `sm_` (passing SM90 through), and the test that exercised the
offline packer used auto-detect, so both could emit the wrong layout.

## Key Changes

| Area | Change |
|---|---|
| `fpA_intB_gemm_preprocessors{.h,_impl.cu}` | Extracted
`get_arch_for_mixed_gemm_weight_preprocess(int arch)` as a shared,
declared helper (clamps SM to the layout group: `<80→75`, `90→90`, else
`80`). |
| `fpA_intB_gemm_preprocessors_impl.h` | `getLayoutDetailsForTransform`
now routes through the shared helper instead of duplicating the
arch-range logic. |
| `moe_quantization.cc` (`PrePackIntExpertWeights`) | Always packs
INT4/INT8 expert weights for the SM80 layout
(`get_arch_for_mixed_gemm_weight_preprocess(80)`) instead of clamping to
the runtime `sm_`, since the SM80 kernel runs on every GPU. |
| `onnxruntime_pybind_quant.cc` (`PackWeightsForMixedGemm`) | Replaced
the ad-hoc `{75,80,90}` allowlist with the shared helper, so
`force_arch` is clamped consistently with the runtime dispatch (removes
the now-unused `<set>` include). |
| `contrib_defs.cc` / `moe_quantization.h` | Updated `weights_prepacked`
schema/field docs: layouts for `-1`/`1` are EP-determined; for the CUDA
EP `-1` and `1` are equivalent today (both SM80), `1` reserved for a
future Hopper-specific layout. |
| `test_qmoe_cuda.py` | Removed the dead, never-called
`preprocess_weights_for_mixed_gemm` helper; the real path
(`quant_dequant_blockwise`) already pins `sm=80`. |
| `test_moe_cuda.py` | Pinned the offline packer to `arch=80`, and
tightened FP16 QMoE parity tolerance from `atol 3.0 (4-bit)` / `2.0
(8-bit)` to `0.5` now that the layout is correct. |
| `docs/` | Regenerated `ContribOperators.md` and updated `moe_qmoe.md`
to match the new schema docs and SM80-always packing rationale. |
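
The clamping rule in the table (`<80→75`, `90→90`, else `80`) and the "always pack for SM80" decision can be sketched as below. The function names carry a `Sketch` suffix because they are illustrative stand-ins, not the code in `fpA_intB_gemm_preprocessors_impl.cu` or `moe_quantization.cc`.

```cpp
// Maps a device SM version to the weight-layout group described in the table.
int GetArchForMixedGemmWeightPreprocessSketch(int arch) {
  if (arch < 80) return 75;   // pre-Ampere layout group
  if (arch == 90) return 90;  // Hopper layout group
  return 80;                  // everything else uses the SM80 layout
}

// Per the PR, INT4/INT8 expert weights are packed for SM80 regardless of the
// runtime device, because the grouped GEMM dispatches to the SM80 kernel.
int ArchForIntExpertWeightPrepackSketch(int /*runtime_sm*/) {
  return GetArchForMixedGemmWeightPreprocessSketch(80);
}
```

Under this rule an H200 (SM90) still receives the column-interleaved SM80 layout for int expert weights, which is the layout the dispatched kernel expects.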

## Testing Notes

On an H200 (SM90), with the CUDA 12.x/13.x Python wheel:

```bash
python -m pytest onnxruntime/test/python/transformers/test_qmoe_cuda.py
python -m pytest onnxruntime/test/python/transformers/test_moe_cuda.py -k "PhiQMoE or qmoe"
```

- `test_qmoe_cuda.py` SwiGLU parity: SM80 layout → max diff ~0.001
(pass, tol 0.1); the prior SM90 layout produced max diff ~1.2 (fail),
confirming the fix.
- `test_moe_cuda.py` `TestPhiQMoE` (4-bit and 8-bit, all batch/seq
combinations): worst observed `max_diff` ≈ 0.375 with the fixed layout,
comfortably under the new `atol=0.5`.
- `ruff check` passes on both edited test files.

---------

Co-authored-by: tlwu <tlwu@example.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
… length (microsoft#28389)

## Summary

- Extend the FlashAttention decode path to work with any sequence length
(not just seq_len=1), with causal masking and `use_seqlen_k` support for
static KV cache
- Add m_tile optimization to process multiple Q rows per workgroup
(m_tile=1/2/4), amortizing K/V loads
- Fuse the separate QKT and SplitVx shaders into a single QKV kernel
using online softmax, eliminating the intermediate `qk` tensor
(`B×H×seq×present_seq`) and reducing dispatch count from 3 to 2
- Route between prefill (FlashAttentionProgram) and split-reduce (fused
QKV + VxReduce) paths based on sequence length

## Resolved Issues

**Whisper decoding prefill improved from 4.68ms to 1.09ms.** Whisper's
decoder attention has a small sequence length but large total sequence
length (seq_len=4, total_seq_len=1500). The default prefill shader
(FlashAttentionProgram) has low parallelism in this case because each
workgroup iterates serially over the full KV cache. The split-reduce
path tiles the KV dimension across workgroups, achieving much higher GPU
occupancy for this workload shape.

## Details

**Fused QKV kernel**: Each workgroup computes QK^T dot products, applies
attention bias and causal mask, computes local softmax (per-tile max and
sum), normalizes, and multiplies by V — all in one kernel. Per-tile
metadata (max, sum) is written for the VxReduce shader to rescale
partial outputs using online softmax: `output = Σ(partial_i ×
local_sum_i × exp(local_max_i - global_max)) / global_sum`.
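
The rescaling formula above can be checked on the CPU with a small sketch. It assumes a single query row and scalar V values for brevity; `TilePartialSketch`, `ComputeTileSketch`, and `ReduceTilesSketch` are hypothetical names, and `global_sum` is taken to be `Σ local_sum_i × exp(local_max_i - global_max)`, which is the standard online-softmax definition rather than something the PR text states explicitly.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct TilePartialSketch {
  float local_max;  // max score inside the tile
  float local_sum;  // sum of exp(score - local_max) inside the tile
  float partial;    // tile output, already normalized by local_sum
};

// Per-tile work: local softmax over scores[begin, end) multiplied by V.
TilePartialSketch ComputeTileSketch(const std::vector<float>& scores,
                                    const std::vector<float>& values,
                                    std::size_t begin, std::size_t end) {
  float local_max = scores[begin];
  for (std::size_t i = begin + 1; i < end; ++i) local_max = std::max(local_max, scores[i]);
  float local_sum = 0.0f;
  float acc = 0.0f;
  for (std::size_t i = begin; i < end; ++i) {
    const float w = std::exp(scores[i] - local_max);
    local_sum += w;
    acc += w * values[i];
  }
  return {local_max, local_sum, acc / local_sum};
}

// Reduce step: rescale each partial output by its tile's share of the global sum.
float ReduceTilesSketch(const std::vector<TilePartialSketch>& tiles) {
  float global_max = tiles[0].local_max;
  for (const TilePartialSketch& t : tiles) global_max = std::max(global_max, t.local_max);
  float global_sum = 0.0f;
  float out = 0.0f;
  for (const TilePartialSketch& t : tiles) {
    const float scale = t.local_sum * std::exp(t.local_max - global_max);
    global_sum += scale;
    out += t.partial * scale;
  }
  return out / global_sum;
}
```

Running `ComputeTileSketch` over the whole score range gives the ordinary softmax-weighted result, so it doubles as a reference for the tiled path.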

**Path routing** (`use_split_reduce`): The split-reduce path is used
when `sequence_length_ < 32`; otherwise the single-kernel
FlashAttentionProgram prefill path is used. Microbenchmarks on Phi-4 (32
heads, head_size 128, GQA group 3) show split-reduce is 1.13×-2.07×
faster than the fused prefill kernel across `sequence_length ∈ {16, 30,
31}` × `total_sequence_length ∈ {128, 500, 2000}`. The previous
heuristic additionally gated on `total_sequence_length_ > 1000`, but
that signal is 0 under graph capture (seqlen_k lives on the GPU) and the
carve-out is unnecessary because split-reduce is uniformly faster for
short Q.
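
The routing rule described above reduces to a single comparison on the query sequence length. A minimal sketch, with a hypothetical enum and function name (the PR only names the `use_split_reduce` flag and the `sequence_length_ < 32` condition):

```cpp
enum class AttentionPathSketch { kPrefill, kSplitReduce };

// Threshold taken from the PR description; total_sequence_length is
// deliberately not consulted because it reads as 0 under graph capture.
constexpr int kSplitReduceSeqLenLimitSketch = 32;

AttentionPathSketch ChooseAttentionPathSketch(int sequence_length) {
  return sequence_length < kSplitReduceSeqLenLimitSketch
             ? AttentionPathSketch::kSplitReduce
             : AttentionPathSketch::kPrefill;
}
```

With this rule the Whisper decoder shape mentioned earlier (`seq_len=4`) takes the split-reduce path.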

## Test plan

- [x] 30/30 MHA unit tests pass
- [x] phi4-graph-prune produces correct output
- [x] whisper-tiny-int4 produces correct transcription
- [x] clang-format clean
This pull request adds bounds checks that prevent out-of-bounds access
in the logits processing code for transformers. Token IDs are now
validated against the vocabulary size before they are used to index
scores, which prevents potential crashes.

**Safety and robustness improvements:**

* Added bounds checking for token IDs in the
`RepetitionPenaltyLogitsProcessor<T>::Process` method to ensure only
valid IDs are used when accessing `beam_token_scores`.
* Added bounds checking for token IDs in the
`NoRepeatNGramLogitsProcessor<T>::Process` method to prevent
out-of-bounds writes to `beam_token_scores`.
* Updated the `NextTokenScores::SetScore` method to return early if the
provided `token_id` is out of bounds, replacing the previous assert with
a safe check.
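
The kind of check described in these bullets can be sketched as follows. This is not the ORT implementation: the function names are hypothetical, a `std::vector<float>` stands in for `beam_token_scores`, and the penalty rule (multiply negative scores, divide non-negative ones) is the conventional repetition-penalty formula, assumed here for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns false and leaves scores untouched when token_id is out of range.
bool SetScoreCheckedSketch(std::vector<float>& scores, int32_t token_id, float value) {
  if (token_id < 0 || static_cast<std::size_t>(token_id) >= scores.size()) return false;
  scores[static_cast<std::size_t>(token_id)] = value;
  return true;
}

// Applies a repetition penalty, skipping any token ID outside the vocabulary.
void ApplyRepetitionPenaltySketch(std::vector<float>& scores,
                                  const std::vector<int32_t>& seen_token_ids,
                                  float penalty) {
  for (int32_t id : seen_token_ids) {
    if (id < 0 || static_cast<std::size_t>(id) >= scores.size()) continue;
    float& score = scores[static_cast<std::size_t>(id)];
    score = score < 0.0f ? score * penalty : score / penalty;
  }
}
```

Invalid IDs are silently skipped in this sketch; whether to skip, log, or return an error is a design choice the PR text does not spell out beyond "return early".
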
…8703)

## Description

This PR adds Linux NPU discovery through sysfs accel devices.

Currently, `DeviceDiscovery::DiscoverDevicesForPlatform()` on Linux
discovers CPU and GPU devices, but NPU discovery is still missing. As a
result, plugin execution providers that filter devices by
`OrtHardwareDeviceType_NPU` do not receive any NPU hardware devices on
Linux, even when the NPU is present and exposed by the kernel.

This change scans `/sys/class/accel` for `accelN` devices and creates
`OrtHardwareDevice` entries with:

- `type = OrtHardwareDeviceType_NPU`
- PCI `vendor_id`
- PCI `device_id`
- `accel_idx` metadata
- `pci_bus_id` metadata when available

This enables Linux systems with NPUs exposed through the accel
subsystem, such as AMD Ryzen AI / XDNA devices, to be reported through
ORT device discovery and made available to plugin EP factories.
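
A rough sketch of such a scan, using `std::filesystem`, is shown below. It is not the code added to `DeviceDiscovery`: the struct and function names are hypothetical, and it assumes the usual sysfs layout in which `accelN/device` links to a PCI device directory that contains `vendor` and `device` files holding hex IDs.

```cpp
#include <algorithm>
#include <cctype>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <optional>
#include <stdexcept>
#include <string>
#include <system_error>
#include <vector>

namespace fs = std::filesystem;

struct NpuDeviceSketch {
  uint32_t vendor_id = 0;
  uint32_t device_id = 0;
  std::string accel_idx;   // the N in accelN
  std::string pci_bus_id;  // empty when the link cannot be resolved
};

// Reads a file such as "0x1022\n" and parses it as a hex number.
std::optional<uint32_t> ReadHexFileSketch(const fs::path& file) {
  std::ifstream in(file);
  std::string text;
  if (!(in >> text)) return std::nullopt;
  try {
    return static_cast<uint32_t>(std::stoul(text, nullptr, 16));
  } catch (const std::exception&) {
    return std::nullopt;
  }
}

std::vector<NpuDeviceSketch> DiscoverNpusSketch(const fs::path& accel_root = "/sys/class/accel") {
  std::vector<NpuDeviceSketch> devices;
  std::error_code ec;
  if (!fs::is_directory(accel_root, ec)) return devices;
  for (const fs::directory_entry& entry : fs::directory_iterator(accel_root, ec)) {
    const std::string name = entry.path().filename().string();
    if (name.rfind("accel", 0) != 0) continue;
    const std::string index = name.substr(5);
    const bool all_digits = !index.empty() &&
        std::all_of(index.begin(), index.end(),
                    [](unsigned char c) { return std::isdigit(c) != 0; });
    if (!all_digits) continue;

    const fs::path device_dir = entry.path() / "device";
    const auto vendor = ReadHexFileSketch(device_dir / "vendor");
    const auto device = ReadHexFileSketch(device_dir / "device");
    if (!vendor || !device) continue;  // not a PCI-backed accel device

    NpuDeviceSketch npu;
    npu.vendor_id = *vendor;
    npu.device_id = *device;
    npu.accel_idx = index;
    const fs::path resolved = fs::canonical(device_dir, ec);
    if (!ec) npu.pci_bus_id = resolved.filename().string();
    devices.push_back(npu);
  }
  return devices;
}
```

Taking the root path as a parameter keeps the scan testable against a temporary directory tree instead of the live `/sys` hierarchy.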

## Changes

- Add Linux sysfs discovery for NPU devices under `/sys/class/accel`.
- Read NPU PCI vendor and device IDs from the underlying sysfs device
path.
- Add NPU metadata including `accel_idx` and `pci_bus_id`.
- Include discovered NPU devices in
`DeviceDiscovery::DiscoverDevicesForPlatform()`.
- Add a `kSysfsAccelPath` constant for the Linux accel sysfs path.

## Motivation

Linux plugin EPs that target NPUs rely on ORT passing
`OrtHardwareDeviceType_NPU` devices into `GetSupportedDevices()`.
Without Linux NPU discovery, those EPs cannot claim NPU devices and
provider selection policies such as `PREFER_NPU` silently fall back to
CPU.

Fixes microsoft#28660.
…RT version (microsoft#28794)

### Description
Adds a new telemetry event for inference failures that logs EP versions
and types along with the runtime error.
Adds logging of the ORT version to other telemetry events.
Adds logging of EP versions to the SessionCreation telemetry event.



### Motivation and Context
To better diagnose failures in inference.

---------

Co-authored-by: Darshak Bhatti <dabhatti@micorsoft.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description
Fix STFT frame pointer arithmetic for complex-valued input so frame
starts are computed in input samples, not trailing real/imag components.
Since the frame view pointer is `U*`, one pointer increment advances one
full real or complex sample.

Also add validation that `frame_step` is positive and keep a defensive
bounds check before creating non-owning tensor views.

Review feedback addressed: simplified the frame pointer arithmetic,
fixed the swapped STFT input comments, documented the defensive bounds
check, and added double-complex regression coverage. The new STFT
validation/regression tests exclude `kDmlExecutionProvider` because
these CPU STFT validation/regression paths do not consistently match
DirectML behavior in Windows GPU CI.

### Motivation and Context
For complex input shaped `[batch_size, signal_length, 2]`, pointer
increments already advance by one real/imag pair. Multiplying frame
offsets by `signal_components == 2` again can advance past the valid
frame start, allowing later frames to read across batches or beyond the
input allocation.
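
The corrected arithmetic can be illustrated with a small helper that computes a frame start in units of samples and rejects frames that would leave the current batch. The helper name and signature are hypothetical; the point is only that when the pointer type is already `std::complex<float>`, the offset must not be multiplied by the number of components again.

```cpp
#include <complex>
#include <cstdint>
#include <optional>

// Returns the offset, in samples, of frame `frame_idx` within batch `batch_idx`,
// or nullopt when the arguments are invalid or the frame would overrun the signal.
std::optional<int64_t> FrameStartOffsetSketch(int64_t batch_idx, int64_t frame_idx,
                                              int64_t signal_length, int64_t frame_step,
                                              int64_t frame_length) {
  if (frame_step <= 0 || frame_length <= 0 || signal_length <= 0) return std::nullopt;
  if (batch_idx < 0 || frame_idx < 0) return std::nullopt;
  const int64_t start_in_signal = frame_idx * frame_step;
  // Defensive bounds check: the whole frame must fit inside this batch's signal.
  if (start_in_signal > signal_length - frame_length) return std::nullopt;
  return batch_idx * signal_length + start_in_signal;
}

// For complex input, one pointer increment already skips a full real/imag pair.
const std::complex<float>* FramePointerSketch(const std::complex<float>* input,
                                              int64_t offset_in_samples) {
  return input + offset_in_samples;
}
```

For input shaped `[2, 8, 2]` with `frame_step = 2` and `frame_length = 4`, frame 2 of batch 1 starts at sample 12. Doubling that offset would point at sample 24, beyond the 16 samples in the buffer.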

### Testing
- `git diff --check -- onnxruntime/core/providers/cpu/signal/dft.cc
onnxruntime/test/providers/cpu/signal/signal_ops_test.cc`
- `.\.venv\Scripts\python.exe tools\ci_build\build.py --config
RelWithDebInfo --build --parallel --target onnxruntime_provider_test
--build_dir build\Windows`
- `.\onnxruntime_provider_test.exe
--gtest_filter="SignalOpsTest.STFTFloat:SignalOpsTest.STFTFrameStepMustBePositive:SignalOpsTest.STFTFloatComplexInputBatched:SignalOpsTest.STFTDoubleComplexInputBatched"`
from `build\Windows\RelWithDebInfo\RelWithDebInfo`

---------

Co-authored-by: Gopalakrishnan Nallasamy <gnallasamy@microsoft.com>
…icrosoft#28965)

### Description

When a QMoE model sets `weights_prepacked=0` (raw `[E, N, K/pack]` int
weights) and the session has `session.disable_prepacking`, `PrePack()`
never runs, so `packed_fc{1,2}_weights_` stay null and
`int_weights_consumed_by_prepack` is false. The code then falls through
to the raw initializer pointers — but those bytes are not in CUTLASS
layout, so the runner consumes them as-if-prepacked and produces
silently wrong output with no diagnostic.

Changes in `onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc`
(`QMoE::ComputeInternal`):

- **Int path**: Added a defensive `INVALID_ARGUMENT` guard — when
`is_int && !weights_prepacked_` but either prepack buffer is null,
return a clear error instead of feeding non-CUTLASS bytes to the runner.
- **wfp4afp8 native path**: Same fall-through
(`packed_fp4_fc{1,2}_weights_ ? ... : raw`) replaced with an explicit
guard that errors when the repacked FP4 buffers were not produced.
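
The int-path guard can be sketched as a plain predicate that turns the silent fall-through into an error. `StatusSketch`, the function name, and the message text are hypothetical stand-ins; the real code returns an ORT status with `INVALID_ARGUMENT` from `QMoE::ComputeInternal`.

```cpp
#include <string>

struct StatusSketch {
  bool ok;
  std::string message;
};

// Fails loudly when raw int weights would otherwise be fed to the runner
// as if they were already in the prepacked layout.
StatusSketch CheckIntWeightsReadySketch(bool is_int, bool weights_prepacked,
                                        const void* packed_fc1, const void* packed_fc2) {
  if (is_int && !weights_prepacked && (packed_fc1 == nullptr || packed_fc2 == nullptr)) {
    return {false,
            "INVALID_ARGUMENT: weights_prepacked=0 but PrePack did not run "
            "(is session.disable_prepacking set?)"};
  }
  return {true, ""};
}
```

The same shape applies to the wfp4afp8 branch, with the repacked FP4 buffers taking the place of the int prepack buffers.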

Also added a focused regression test in
`onnxruntime/test/contrib_ops/moe_test.cc` covering `quant_type='int'`
with `weights_prepacked=0` and `session.disable_prepacking=1`, asserting
that QMoE fails with an actionable error instead of producing output.

Merged the branch with the latest `main`.


### Motivation and Context

A prior fix removed the null-pointer crash on this path but left a
misleading-success outcome that is newly user-reachable via the
`weights_prepacked=0` contract — the exact silent-failure mode the
offline-path work set out to eliminate. These guards convert that into a
loud, actionable error. The wfp4afp8 branch shares the same fall-through
and is hardened for consistency.

The added regression test ensures this fail-loudly behavior remains
covered going forward, especially when prepacking is disabled at the
session level.

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>


10 participants