Skip to content

Sync with Microsoft ONNX Runtime - 28062026#1169

Open
ai-fw-intg wants to merge 20 commits into
ovep-developfrom
sync_msft_28062026
Open

Sync with Microsoft ONNX Runtime - 28062026#1169
ai-fw-intg wants to merge 20 commits into
ovep-developfrom
sync_msft_28062026

Conversation

@ai-fw-intg

Copy link
Copy Markdown

Automated daily backmerge from ORT main to ovep-develop. No conflicts detected. Do NOT squash or rebase - use merge commit only.

adrastogi and others added 20 commits June 24, 2026 14:47
…icrosoft#28771)

### Description
<!-- Describe your changes. -->
Relax the input-validation in OrtApi::CompileModel to accept OrtModel
instances with zero graph inputs. Previously,
ModelCompilationOptions::Check() rejected such models with "OrtModel
graph must have at least one input and one output defined." The check
now requires only at least one graph output; the zero-input case is
legal.

Tests in test_model_builder_api.cc are restructured:

- The old CompileFromModelWithEmptyInputsOutputs_Fails is renamed to
CompileFromModelWithEmptyOutputs_Fails and reshaped to provide 1 input +
0 outputs, isolating the output-only check.
- A new regression test CompileFromModelWithEmptyInputs_Succeeds builds
a 0-input model with a RandomNormal node and verifies compilation
succeeds.

### Motivation and Context
<!-- - Why is this change required? What problem does it solve?
- If it fixes an open issue, please link to the issue here. -->
Fixes microsoft#28135 
The original check was too restrictive and impacts callers (e.g.,
WebNN/Chromium needs to call CompileModel on such models in a separate
compiler process (and then load the compiled artifact via
CreateSessionFromArray in the GPU process)).
…ttention (microsoft#29240)

### Description

The CUDA `GroupQueryAttention` kernel derives a KV-cache append offset
from the `seqlens_k` input (`past_seq_lens = (seqlens_k + 1) -
sequence_length`). On the CUDA EP `seqlens_k` is device-resident (only
`total_sequence_length` is a CPU input), so the host-side range
validation in the operator/helper is skipped. The device kernel
`UnpackRoPEAppend` then guarded the cache store with only a one-sided
upper bound (`cache_s < max_seqlen`), so an out-of-range `seqlens_k`
could produce a negative offset that is sign-extended into the
cache-index arithmetic.

The CPU operator already validates `seqlens_k` host-side; this change
brings the CUDA path to parity by guarding on the device.

### Changes
- `group_query_attention_impl.cu` (`GetSequenceLengths`): clamp the
negative case at the source so both `total_seq_lens` and the append
offset `past_seq_lens` stay non-negative for all downstream consumers.
- `group_query_attention_qkv.cuh` (`UnpackRoPEAppend`): make the
KV-cache store bound two-sided (`cache_s >= 0 && cache_s < max_seqlen`),
mirroring the existing position-index guard a few lines above. This also
covers the fast-decode path, where `past_seq_lens` points directly at
the raw input and bypasses `GetSequenceLengths`.
- Added `NegativeSeqlensK_CacheAppend_NoOOB_CUDA` regression test
exercising the KV-cache append path with an out-of-range `seqlens_k`
(CUDA-guarded; skips when CUDA EP is unavailable).

### Notes
- The two-sided guard matches the pattern introduced for the rotary
position index in microsoft#27597.
- CPU is unaffected (already validated host-side); WebGPU relies on the
CPU-validated `total_sequence_length`. The CUDA implementation is shared
with ROCm via hipify.
- The regression is a device-memory write best observed under
`compute-sanitizer`; the test asserts the run completes with finite
outputs.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…oft#28962)

## Summary

Adds an FP32 flash attention path for the CPU
`com.microsoft.GroupQueryAttention` (GQA) contrib op, mirroring the
existing quantized-KV flash attention path. The new tiled,
online-softmax kernel avoids materializing the full `[S, T]` attention
score matrix. It is restricted to prefill / chunked-prefill
(`sequence_length > 1`); single-token decode falls back to the naive
path. With causal early-termination it is faster than the naive path
across all measured prefill lengths while using a fraction of the
memory.

## Key changes

- **New MLAS kernel** `onnxruntime/core/mlas/lib/flashattn_gqa.cpp`
(`MlasFlashAttentionGQA`):
- Tiled QK / softmax / SV with online-softmax (running max/sum
rescaling).
- GQA head grouping (`num_heads % kv_num_heads == 0`), causal masking,
local window, additive attention bias, and packed-QKV input.
- **Causal early-termination**: during prefill, KV blocks that fall
entirely in the causally masked upper triangle are skipped (`break` once
`ir >= past_seqlen + q_idx + row_size_q`), avoiding the wasted QK/SV
GEMMs over roughly half of the square prefill attention matrix.
  - Per-batch invocation for ragged / shared-buffer `seqlens_k`.
- **MLAS API** `onnxruntime/core/mlas/inc/mlas.h`: new
`MlasFlashAttentionGQAArgs` struct and `MlasFlashAttentionGQA`
declaration.
- **Dispatch** `onnxruntime/contrib_ops/cpu/bert/gqa_attention_base.h`:
new `ApplyAttentionFlash` that concatenates new K/V into the FP32
present cache and invokes the kernel. The per-thread scratch buffer size
is computed with `SafeInt<size_t>` to guard against `size_t` overflow on
large/malformed shapes before allocation.
- **Wiring**
`onnxruntime/contrib_ops/cpu/bert/group_query_attention.cc`: float-only
flash dispatch, active only for prefill (`sequence_length > 1`) and when
`softcap == 0`, no smooth softmax, no head sink, no QK output; falls
back to the naive path otherwise. The existing
`ORT_GQA_DISABLE_FLASH_ATTENTION` env var disables it.
- **CMake** `cmake/onnxruntime_mlas.cmake`: register the new source
file.
- **Docs** `docs/contrib_ops/cpu/gqa.md`: document the non-quantized
flash attention path, activation conditions, causal early-termination,
file list, and FP32 flash-vs-naive benchmark results.
- **Benchmark**
`onnxruntime/test/python/transformers/benchmark_gqa_cpu_flash.py`: add
an FP32 (non-quantized) mode (`--fp32`) for operator-level
flash-vs-naive comparison.

### Why prefill-only (`sequence_length > 1`)

Single-token decode (`sequence_length == 1`) produces only a `[1,
total_sequence_length]` score row per head, so there is nothing to tile
away and the extra online-softmax bookkeeping makes the flash kernel
slower and noisier than naive in practice. Restricting the flash path to
prefill keeps the consistent prefill win without regressing decode.
Because decode is excluded, the two-phase flash-decoding kernels are
unreachable and have been removed for a smaller, simpler implementation.

`float16` continues to use the naive path (the kernel is float-only,
matching the quantized flash constraint).

## Performance

Operator-level, AMD EPYC 7763 (16 physical cores), threads=8, FP32 KV
cache, `B=1, num_heads=16, kv_num_heads=8, head_size=128`. Flash is
faster than naive across all measured prefill lengths (and
single-threaded as well, 1.4-1.8x), confirming the gain is algorithmic -
the causal early-termination removes the wasted upper-triangle work that
previously made flash slower than naive at short sequences.

| Prefill Seq Length | Naive (ms) | Flash (ms) | Speedup |
|---:|---:|---:|---:|
| 512  | 5.8-8.4 | 4.2-5.3 | 1.4-1.6x |
| 1024 | 25-29   | 13-18   | 1.6-2.0x |
| 2048 | 87-118  | 52-65   | 1.5-2.0x |
| 4096 | 365-380 | 213-234 | 1.6-1.7x |

The flash path's primary structural benefit is memory: it never
allocates the full O(N x S x T) attention matrix (~1 GB at S=4096, N=16)
and instead uses an O(S x Bc) per-thread tile.

## Testing

- **C++ op tests**: `onnxruntime_provider_test
--gtest_filter="GroupQueryAttentionTest.*"` - 38 passed (12 GPU/WebGPU
skipped) with flash on (default) and with
`ORT_GQA_DISABLE_FLASH_ATTENTION=1`.
- **Flash vs. naive parity** (FP32): output of the flash path matches
the naive path (max abs diff ~1e-7) across prefill (block-aligned and
non-aligned `S`), MHA and GQA head ratios, and local window. Decode now
uses the naive path on both sides (diff 0).
- **Python parity** (`test_gqa_cpu.py`, flash vs. naive reference):
focused FP32 sweep of 600 prompt configurations covering all head sizes
(32-256), GQA ratios `(6,6)/(6,3)/(9,9)/(9,3)`, batches `1/3/5`,
causal/local window, attention bias, position ids, packed QKV, and
with/without KV buffer - all passed. The official `test_gqa_cpu.py`
suite passes.

Two correctness bugs were found and fixed via the parity sweep while
developing this path:
1. Attention-bias batch stride ignored head broadcasting for `[batch, 1,
S, T]` bias.
2. Query batch stride was hardcoded to `num_heads * S * H`, which is
incorrect for packed-QKV input (correct stride is `(num_heads + 2 *
kv_num_heads) * S * H`).
…, GQA underflow, and ep_weight_sharing_ctx_gen build (microsoft#28245)

### Description

This PR contains three commits:

**Commit 1: Miscellaneous fixes**
- Downgrade QNN ETW profiling mismatch logs from ERROR to VERBOSE to
reduce excessive telemetry noise (~1 billion events/week across Windows
devices)
- Add bounds checking in GQA attention to prevent `size_t` underflow
when `seqlens_k` contains invalid data (fixes microsoft#27170)
- Build `ep_weight_sharing_ctx_gen` for TensorRT, OpenVINO, and VitisAI
in addition to QNN

**Commit 2: Bump cpuinfo and add `cpuinfo_deinitialize()` integration**

Applications that dynamically load and unload the onnxruntime DLL leave
orphaned heap allocations from cpuinfo when the library is unloaded
mid-process. These are flagged as memory leaks by App Verifier,
Valgrind, AddressSanitizer, and LeakSanitizer.

This commit bumps `pytorch/cpuinfo` to a version that implements
`cpuinfo_deinitialize()`
([pytorch/cpuinfo#387](pytorch/cpuinfo#387)) and
adds ORT integration:
- `CPUIDInfo::ShutDown()` calls `cpuinfo_deinitialize()` to free
heap-allocated globals
- `DllMain` calls `ShutdownCpuInfo()` on `DLL_PROCESS_DETACH`
- In memleak-check builds, shutdown also runs during process termination
- `InstanceCreated` atomic guard prevents singleton creation during DLL
unload

**Commit 3: Update to official cpuinfo merged fix**

After [pytorch/cpuinfo#387](pytorch/cpuinfo#387)
merged upstream, updated the dependency to point to `pytorch/cpuinfo`
main (`4628dc06`).

Patch changes:
- **Removed** `win_arm_fp16_detection_fallback.patch` — upstreamed via
[pytorch/cpuinfo#348](pytorch/cpuinfo#348)
- **Updated** `patch_vcpkg_arm64ec_support.patch` — regenerated for new
cpuinfo; still needed
([pytorch/cpuinfo#324](pytorch/cpuinfo#324) not
yet merged)
- **Updated** `patch_cpuinfo_h_for_arm64ec.patch` — retained, not yet
upstream
- **Regenerated** `fix_missing_sysfs_fallback.patch` — updated context
lines for new cpuinfo code

### Motivation and Context

- pytorch/cpuinfo#150
- microsoft#16117
- microsoft#23762
…icrosoft#29221)

## Description

The CUDA plugin EP previously rejected combining a user-provided compute
stream
(`user_compute_stream`) with CUDA graph capture (`enable_cuda_graph`),
returning
`ORT_INVALID_ARGUMENT`. This PR removes that restriction so the two
options can
be used together: when both are set, graph capture and replay run on the
user-owned stream (the same stream the kernels are issued to), matching
the
bundled (non-plugin) CUDA EP behavior. Several supporting fixes make
capture on a
shared stream stable and Memcpy-free.

## Summary of Changes

### Allow user stream + CUDA graph

| File | Change |
|------|--------|
|
[onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc](onnxruntime/core/providers/cuda/plugin/cuda_ep_factory.cc)
| Remove the validation that rejected `user_compute_stream` +
`enable_cuda_graph` together. |
|
[onnxruntime/core/providers/cuda/plugin/cuda_ep.cc](onnxruntime/core/providers/cuda/plugin/cuda_ep.cc)
| `PerThreadContext` accepts an optional external graph stream. When
both options are set it captures/replays on the user stream and does
**not** create or destroy it (the user owns its lifetime); otherwise it
owns a dedicated graph stream as before. |

### Stable, Memcpy-free CUDA graph capture

| File | Change |
|------|--------|
|
[onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h](onnxruntime/core/providers/cuda/plugin/cuda_kernel_adapter.h)
| Route kernel scratch/workspace allocations through the EP allocator
(BFC arena) instead of raw `cudaMallocAsync`/`cudaMalloc`. After warmup
the arena reaches steady state, so the capture run serves scratch from
already-reserved chunks and the device free-memory footprint stays
stable — required for correct capture. Matches the built-in CUDA EP. |
|
[onnxruntime/core/providers/cuda/tensor/shape_op.cc](onnxruntime/core/providers/cuda/tensor/shape_op.cc)
| Add an adapter-based `Shape` kernel under `#ifdef
BUILD_CUDA_EP_AS_PLUGIN` with identical semantics to the CPU `Shape`.
Registering `Shape` on the EP keeps it off the CPU EP and avoids the
Memcpy nodes that would otherwise break CUDA graph capture. |
|
[cmake/onnxruntime_providers_cuda_plugin.cmake](cmake/onnxruntime_providers_cuda_plugin.cmake)
| Stop excluding `shape_op.cc` from the plugin build so the
adapter-based `Shape` kernel is compiled in. |

### Null-allocator fallback in PrePack (plugin boundary)

In the plugin build the `AllocatorPtr` passed to `PrePack` can arrive
null across
the library boundary. Each kernel now falls back to its own
default-memory
allocator (`Info().GetAllocator(OrtMemTypeDefault)`), which is always
valid.

-
[onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc](onnxruntime/contrib_ops/cuda/bert/group_query_attention.cc)
-
[onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc](onnxruntime/contrib_ops/cuda/moe/moe_quantization.cc)
-
[onnxruntime/contrib_ops/cuda/quantization/matmul_nbits.cc](onnxruntime/contrib_ops/cuda/quantization/matmul_nbits.cc)

### Misc

-
[onnxruntime/core/framework/session_state.cc](onnxruntime/core/framework/session_state.cc)
— wrap a long line (no behavior change).

## Testing

- New test:
[onnxruntime/test/providers/cuda/plugin/cuda_plugin_user_stream_graph_test.cc](onnxruntime/test/providers/cuda/plugin/cuda_plugin_user_stream_graph_test.cc)
covering:
1. Session creation succeeds with both `user_compute_stream` and
`enable_cuda_graph` set (regression for the removed validation).
  2. Capture + replay on the user stream produce correct results.
3. Replay after an in-place input update on the user stream is correct.
- Tests are gated on `ORT_UNIT_TEST_HAS_CUDA_PLUGIN_EP` and skip
gracefully when no CUDA device or plugin library is available.

## Motivation and Context

Users that drive ORT from their own CUDA stream (e.g. to interleave ORT
inference
with their own kernels) previously could not also benefit from CUDA
graph capture
on the plugin EP. This change brings the plugin EP to parity with the
bundled
CUDA EP for that workflow.

## Checklist

- [x] Tests added/updated
- [x] No breaking changes (relaxes a previously rejected option
combination)
- [ ] Documentation updated (if applicable)
## Summary
- align CPU ONNX Attention causal masking with upper-left behavior for
q_len=1, kv_len>1, no past
- preserve the existing `nonpad_kv_seqlen` / TensorScatter single-query
causal behavior
- update Python attention reference causal mask to model ONNX upper-left
alignment with an explicit past offset
- add a regression test for issue microsoft#29020

Fixes microsoft#29020

## Validation
- `python -m py_compile
onnxruntime/test/python/transformers/test_onnx_attention/common.py
onnxruntime/test/python/transformers/test_onnx_attention/test_mha.py
onnxruntime/test/python/transformers/test_onnx_attention/test_gqa.py
onnxruntime/test/python/transformers/test_onnx_attention/test_tensorscatter_attention.py`
- `git diff --check`

Notes:
- `pytest
onnxruntime/test/python/transformers/test_onnx_attention/test_tensorscatter_attention.py
-k "cpu_fp32 and causal" -q` could not run locally because this Python
environment does not have `onnx` / `onnxruntime` installed.
- After the latest follow-up commit, an incremental rebuild of
`onnxruntime_provider_test` was attempted but failed in MSBuild before
compiling this change due to a local environment issue: duplicate `Path`
/ `PATH` environment keys when launching `CL.exe`.
…ts (microsoft#29247)

## Summary

Lift WebGPU FlashAttention's `batch_size == 1` restriction so batched
GQA with right-padded prompts (the common GenAI batched-prefill shape)
takes the fused FlashAttention path instead of falling back to
`ApplyAttention`.

- **Per-batch seqlens in FlashAttention shaders.** Prefill, decode
split-reduce, CopyKVCache, and the fused rotary-and-copyKV template now
read `seqlens_k[batch_idx]` instead of hardcoding `seqlens_k[0]`. All
`past_X = total_X - new_X` subtractions are clamped to avoid u32
underflow when a short batch's per-batch total is less than the
batch-wide `sequence_length`.
- **Indirect-dispatch sizing uses GQA's `total_sequence_length` input.**
`CopyKVCache`, `SplitPackedQKVWithRotaryEmbeddingAndCopyKV`, and
`FlashAttentionDecodeQKV` now take a new `total_sequence_length_input`
binding (GQA input #6, GPU-resident under graph capture) for the
indirect-dispatch grid sizing. This is the global max KV span across the
batch by construction, replacing the previous `seqlens_k[0] + 1u` that
under-dispatched whenever batch 0 wasn't the longest. Per-batch
`seqlens_k[batch] + 1` still drives causal masking and K/V bounds inside
the kernels. GQA now enforces `graph_capture_enabled ->
past_present_share_buffer_` so the host-side `use_indirect_dispatch`
predicate stays simple.
- **Decoupled attention_bias stride from per-batch OOB.**
`attention_bias` is still allocated to the global max
`total_sequence_length`; only the causal-mask / softmax tile loops are
gated by the per-batch total. The one-past-end fallback was tightened to
clamp inside the same row (`offset_base + stride_total_seq - 1u`).
- **Decode workgroup grid stays at global max.** `decode_qkv` keeps a
workgroup grid sized to the global max tile count to keep
`workgroup_idx` slicing consistent across batches, with neutral `(-inf,
0)` early-exit for tiles beyond a short batch's per-batch total so the
`VxReduce` online softmax rescaling is not skewed.
- **New `use_seqlen_k` template parameter** (separate from
`use_indirect_dispatch` which still requires graph capture). It is
enabled whenever `seqlen_k` is provided and (`graph_capture ||
batch_size_ > 1`).
- **Rotary fix prerequisite** (`webgpu: fix GQA batched right-padded
prefill with do_rotary`, 591df5b): clamps `past_seqlen` to 0 in
`RotaryEmbeddingProgram`, `FusedQKRotaryEmbeddingProgram`, and
`split_packed_qkv_with_rotary_embedding`, which previously produced
gibberish for the shorter batches.

## Motivation

GenAI's batched prefill right-pads short prompts to the batch max and
reports each batch's real length via `seqlens_k[b] = real_len[b] - 1`.
The previous FlashAttention gate forced every batched call onto the
slower `ApplyAttention` path, and the rotary shaders underflowed `u32`
for any batch shorter than the batch-wide `sequence_length`, producing
garbage Q/K positions and gibberish output text for the shorter batches.

## Test plan

- [x] All `GroupQueryAttentionTest.WebGPU_*` op tests pass, including
`BatchedRightPaddedRotaryPrefill` (FlashAttention path) and the new
`BatchedRightPaddedRotaryPrefillFlashAttentionLargeSpread_WebGPU`
covering a `real_lens` spread > tile_size
- [x] phi4-prune three-prompt batched generation: coherent outputs on
WebGPU matching CPU reference (3 prompts, 384 tokens, 173 tps)
- [x] phi4-prune single-prompt generation regression: coherent
- [x] phi4-graph-prune (graph capture enabled):
`verify_model_correctness.py` 4/4 PASS; `verify_multi_gen.py` sequential
+ overlapping both PASS
- [x] whisper-tiny-int4 transcription regression: 2/2 byte-exact with
CPU
- [x] Lintrunner clean on all changed files
…29216)

### Description

PR1 microsoft#28962 adds flash
attention for **prefill**, and removed flash decoding. This PR will add
optimized kernel for **single-token decode**, which will be faster than
other kernels including flash decoding.

This PR builds on the prefill-only flash attention change and
additionally introduces a dedicated decode kernel.

#### What's included
- **Decode (GEMV) kernel** — A dedicated single-token decode kernel
(`MlasGQADecodeGQAThreaded`) for `sequence_length == 1`, parallelized
over (batch, head) with a two-pass softmax, using GEMV (`acc[8]`-lane
dot product / AXPY) helpers instead of per-block M=1 SGEMM calls. This
fixes the per-block SGEMM decode regression.
- The FP32 flash gate (`group_query_attention.cc`) is enabled for
`total_sequence_length > 1`, routing prefill to the tiled kernel and
decode to the GEMV kernel.
- The quantized KV-cache path is unchanged (FP32-only scope).

#### Results (AMD EPYC 7763, AVX2, 8 threads)
- **Decode:** correctness ~1e-8 vs naive; long-context decode ~1.0–1.5x
(T = 4097 ~1.3–1.5x).

### Motivation and Context

The naive GQA path materializes the full score matrix, which is
memory-bound for long sequences. Flash attention reduces memory traffic
for prefill, and the GEMV decode kernel avoids SGEMM overhead for the
M=1 decode case.

### Testing

- Built with `--compile_no_warning_as_error`.
- Correctness verified against the naive path for both prefill and
decode (max abs diff ~1e-8).
- Benchmarked via `benchmark_gqa_cpu_flash.py`.
…t#29251)

# Fix unbounded lifetime on WithOutputTensor in Rust bindings

## Description

The `WithOutputTensor<'a, T>` struct had a free lifetime parameter `'a`
on its `TryFrom<OrtOutputTensor>` impl that was unconstrained by any
input. Combined with the `Deref` impl (whose `Target = ArrayView<'a, T,
IxDyn>` exposed a `Clone`-able view), it was possible for the
`ArrayView` to outlive the underlying `OrtOutputTensor` buffer owner.

This change restructures `WithOutputTensor` to eliminate the unbounded
lifetime:

- Removes the `'a` lifetime parameter from `WithOutputTensor`,
`OrtOutput`, and `Session::run`
- Removes the `Deref` impl (the escape hatch)
- Replaces the stored `ArrayView<'a, T>` with a raw pointer + shape
- Adds a `view(&self)` method returning `ArrayView<'_, T, IxDyn>` — the
view lifetime is now tied to `&self`
- Updates all call sites (examples, integration tests) to use `.view()`

## Motivation

The C API contract (`onnxruntime_c_api.h`) explicitly bounds the data
pointer lifetime to the `OrtValue`: the pointer is only valid until the
value is destroyed. The Rust type system must enforce this invariant.
Previously it did not — the `ArrayView` could be cloned out and observed
after the `OrtValue` was freed.

## API Change

```rust
// Before: Deref-based access
let output = outputs[0].float_array().unwrap();
let sum: f32 = output.iter().sum();

// After: explicit view() call
let output = outputs[0].float_array().unwrap();
let sum: f32 = output.view().iter().sum();
```

## Testing

Existing integration tests updated to use the new `view()` API. The fix
is enforced at compile time by the borrow checker — the previously
problematic pattern now produces a lifetime error.

Co-authored-by: Sayan Shaw <sayanshaw@microsoft.com>
### Description

Use new GitHub CI identity for azcopy.

### Motivation and Context

GitHub CI pools have been assigned a new identity.
…crosoft#29244)

## Description

This PR adds a kernel-context C API accessor for the framework
`OrtSyncStream*` and uses it in the CUDA plugin EP so scratch
allocations can be tagged with the actual compute stream selected for
the kernel. It is stacked on microsoft#29221 and turns the previously documented
concurrent multi-stream limitation into a gated capability: older
runtimes keep the conservative fallback, while runtimes with the new API
can safely advertise concurrent runs when EP-level unified stream mode
is not forced.

## Summary of Changes

### Public API and Adapters

| File | Change |
|------|--------|
| `include/onnxruntime/core/session/onnxruntime_c_api.h` | Adds
`KernelContext_GetSyncStream` to expose the borrowed framework stream
for stream-aware allocation and synchronization bookkeeping. |
| `onnxruntime/core/session/custom_ops.cc` | Implements the API by
retrieving the kernel's `OpKernelContext::GetComputeStream()` inside ORT
core. |
| `onnxruntime/core/session/ort_apis.h` and
`onnxruntime/core/session/onnxruntime_c_api.cc` | Declares and wires the
new API entry. |
| `include/onnxruntime/core/session/onnxruntime_cxx_api.h` and
`include/onnxruntime/core/session/onnxruntime_cxx_inline.h` | Adds the
C++ `Ort::KernelContext::GetSyncStream()` wrapper. |
| `include/onnxruntime/ep/adapter/op_kernel.h` | Adds a version-gated EP
adapter accessor so plugins can use the API when available and fall back
safely otherwise. |

### CUDA Plugin EP

- Tracks the framework stream corresponding to both raw CUDA stream
handles and `OrtStreamAdapter` stream arguments.
- Passes the framework stream to scratch allocation so arena chunks are
stream-tagged instead of using a null stream tag.
- Re-enables concurrent run support only when
`KernelContext_GetSyncStream` is available and EP-level unified stream
mode is not forced.

### Tests and Docs

- Extends the shared-lib custom-op test helper to exercise
`Ort::KernelContext::GetSyncStream()`.
- Updates CUDA plugin EP docs to describe stream-tagged scratch
allocation, compatibility fallback, and the new API audit entry.

## Why a C API is needed

The implementation of `KernelContext_GetSyncStream` is intentionally
small, but the API boundary is the important part. ORT core can safely
cast `OrtKernelContext*` back to `onnxruntime::OpKernelContext*` because
it owns both the opaque C handle and the private C++ implementation. A
plugin kernel should not perform that cast directly: it would make the
plugin depend on ORT-core private C++ layout, vtables, and exact build
compatibility.

The new API keeps that private cast inside ORT core and gives plugin
kernels a stable ABI entry point:

```text
plugin kernel -> opaque OrtKernelContext* -> OrtApi::KernelContext_GetSyncStream -> ORT core retrieves the actual framework stream
```

This also lets the plugin use runtime version gating. When loaded by an
older ORT runtime that does not expose the API, the adapter returns
null, scratch allocation uses the conservative fallback, and concurrent
runs are not advertised.

## Testing

- `lintrunner -a`
- `ninja -C build/cu130_plugin/Debug onnxruntime_providers_cuda_plugin`
- `ninja -C build/cu130_plugin/Debug onnxruntime_shared_lib_test`
- `cd build/cu130_plugin/Debug && ./onnxruntime_shared_lib_test
--gtest_filter=CApiTest.custom_op_handler --gtest_color=no`
- VS Code diagnostics on touched C++ and header files

## Checklist

- [x] Tests added/updated
- [x] Documentation updated
- [x] Backward compatibility guarded by runtime API-version checks
- [ ] CI passes
…icrosoft#29242)

This pull request introduces device-side and host-side validation for
sparse attention input indices and key sequence lengths, improving error
detection and robustness for both CPU and CUDA implementations. It also
adds an environment variable to optionally disable the new device-side
validation for performance reasons. Some related test code has been
updated to match these changes.

**Validation improvements:**

* Added host-side validation functions (`ValidateCSRIndices`,
`ValidateKeyLengths`) for CSR indices and key sequence lengths in the
CPU kernel, and integrated them into the compute path. This ensures
invalid inputs are caught early on the CPU.
[[1]](diffhunk://#diff-44554dbe530c593f0f0b85a591859bb6b0a21e62992c61f9622e5456a144cb45R39-R106)
[[2]](diffhunk://#diff-44554dbe530c593f0f0b85a591859bb6b0a21e62992c61f9622e5456a144cb45R146-R150)
* Implemented a CUDA kernel (`ValidateCSRIndicesKernel`) and supporting
function (`ValidateCSRIndicesOnDevice`) to check CSR row-pointer
monotonicity, column-index range, and key lengths on device, with
detailed error codes and messages. The CUDA kernel is invoked in the
compute path unless disabled by environment variable.
[[1]](diffhunk://#diff-0c8a322cd4611e589f38a67876f309f9d83869a6d2239cadf86970ed2005ebd0R336-R449)
[[2]](diffhunk://#diff-fff06841efe15d5f95c02bb38daa1d5aa0775de0e1777d9d418222e44ebc88feR71-R95)
[[3]](diffhunk://#diff-08ea97fecd6c161add2607d5b7d406c0f3d2b0f1280ebb49d801f085d941770aR220-R236)

**Configurability:**

* Introduced the environment variable
`ORT_DISABLE_SPARSE_ATTENTION_INPUT_VALIDATION` to allow users to skip
device-side validation for performance when inputs are known to be
valid. This is parsed and used in the CUDA kernel.
[[1]](diffhunk://#diff-56cfb57cdd5f9134a8fea24bd006c691860c64a1f78f4b2c69a861d847dee9ddR94-R99)
[[2]](diffhunk://#diff-08ea97fecd6c161add2607d5b7d406c0f3d2b0f1280ebb49d801f085d941770aR59-R60)
[[3]](diffhunk://#diff-8e47232d826ceae90a93aa6cde8f2869dcb1998f37804055fcfb584d95e21a96R29)

**Test updates:**

* Refactored test helper function names and test input shapes to match
the new validation logic and error messages.
[[1]](diffhunk://#diff-21bdb8e3ad8b50a72c5da77349f280c15a6938ece2bea4e987bd12ff8bcb2a0eL27-R27)
[[2]](diffhunk://#diff-21bdb8e3ad8b50a72c5da77349f280c15a6938ece2bea4e987bd12ff8bcb2a0eL45-R45)
[[3]](diffhunk://#diff-21bdb8e3ad8b50a72c5da77349f280c15a6938ece2bea4e987bd12ff8bcb2a0eL212-R220)

**Other:**

* Removed redundant key length value checks from `CheckInputs` in favor
of the new validation routines.
microsoft#29239)

### Description

`WhisperDecoderSubgraph::CreateInitialFeeds` constructed the decoder
initial feeds using a single value that mixed a **byte count** with an
**element count**. The total size was computed as `cur_len *
batch_beam_size * sizeof(int)` (bytes) and then reused as:

- the element count for the int32 staging buffer (`MakeUniquePtr<int>`),
and
- the element count for the `gsl::span<int>` source/destination passed
to the device copy.

Because the `input_ids` tensor is allocated for exactly `batch_beam_size
* cur_len` int32 elements, the spans claimed 4x the real extent, so the
device copy ran past the end of the buffer. The per-beam `memcpy` also
used the same combined value as its length instead of a single
sequence's byte size.

This mirrors the correct T5 sibling (`subgraph_t5_decoder.cc`), which
separates the element count (used for the spans/staging allocation) from
the per-sequence byte count (used for the `memcpy`).

### Changes
- `subgraph_whisper_decoder.cc`: `total_size` is now the element count
`cur_len * batch_beam_size`; introduced `sequence_bytes = cur_len *
sizeof(int32_t)` for the per-beam `memcpy`. The staging buffer and spans
use `int32_t` consistently to match the `int32_t` tensors/sequences.
- Added regression test
`BeamSearchTest.DummyWhisperWithSequenceInputIds` (CPU, and CUDA under
`USE_CUDA`) exercising the `use_sequence_as_input_ids` path, with a
deterministic dummy model and its generator script. The test validates
both the `sequences` and `scores` outputs.

### Related bool-tensor normalization fixes
While exercising the Whisper path, bool tensors copied from raw data
could hold non-canonical byte values (anything non-zero rather than
strictly `{0, 1}`), causing provider-dependent behavior. To keep the fix
self-contained, the following normalization changes are included:
- `tensorprotoutils.cc`: `UnpackTensor<bool>` normalizes raw-data bytes
to `{0, 1}` (with a `static_assert(sizeof(bool) == 1)` guarding the
byte-wise loop).
- `compress_impl.cu` (CUDA `Compress`): the prefix-sum sizing predicate
normalizes bool bytes to `{0, 1}` so the output sizing agrees with the
element-selection truthiness check. Since bool initializers are now
normalized on unpack, the remaining exposure is runtime-produced bool
condition tensors.
- Added `CompressTest.Compress_cuda_non_canonical_bool_condition` (under
`USE_CUDA`), which feeds a raw `0xFF` condition byte through a
session-level run (`OpTester` normalizes bool inputs and so cannot
reproduce this) and asserts the Compress output is sized by truthiness
rather than by the sign-extended byte value.

### Motivation
The decoder shares one implementation file across CPU/CUDA/ROCm, so this
single change covers all execution providers. The previous behavior
could overrun the staging/feed buffers for models that drive the
sequence-as-input-ids decoder path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
)

### Description

Enable pre-packed weights sharing for `MatMulNBits` operator on CPU.
When performing DQ + MatMul -> MatMulNBits fusion, the original weight
names are lost, so the standard `AddInitializer` approach does not work.
To overcome this, introduced the option for graph optimization pass to
tag weights which are sharable across sessions (hashing the content and
matching it across the sessions).

### Motivation and Context

For executing ASG SLMs on CPU - there are two sessions, one for prefill
stage and for decode stage (due to different shapes and session
options). With this change, storing the weights in memory twice is
avoided. The first sessions pre-packs the weights which the second
session can reuse.

Confirmed memory reduction through the WPA memory traces.

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
### Description

The RKNPU execution provider's ONNX converter creates implicit
(all-zero) bias
buffers when a `Conv` / `Gemm` / `QLinearConv` node omits its bias
input. The
buffer size was computed as `sizeof(T) * dim` (where `dim` derives from
a model
weight's shape) with no overflow check, and the raw allocation was
tracked in a
manually-freed `void*` list.

This PR hardens the converter:

- **Dimension validation:** ONNX `int64_t` dimensions are validated (via
`ORT_ENFORCE` in a new `ToRknpuDim` helper) before being narrowed to the
RKNPU
`uint32_t` shape representation, rejecting negative or out-of-range
values.
  This covers all four ingestion points (`HandleInitializer`,
  `GetInputOfOnnxModel`, `GetShape`, `GetSupportedNodes`).
- **Overflow-checked allocation:** all four implicit-bias sites
(`AddLayerConvImpl`, `AddLayerQLinearConvImpl`,
`AddLayerDepthwiseConvImpl`,
`AddLayerFC`) go through a shared `AllocZeroedBias` helper that computes
the
  byte count with `SafeInt<size_t>` (throws on overflow) and returns a
  zero-initialized `std::make_unique<uint8_t[]>` buffer.
- **RAII ownership:** `free_list_` is now
`std::vector<std::unique_ptr<uint8_t[]>>`, so the bias buffers are freed
  automatically and `Clear()` no longer walks/`free()`s raw pointers.

### Motivation and Context

A malicious ONNX model can provide dimensions that are unsafe for the
RKNPU
converter's 32-bit shape representation or for byte-size allocation
arithmetic:

- ONNX stores dimensions as `int64_t`, while the RKNPU converter/DDK
uses
`uint32_t` shape values. Silently narrowing a large or negative
`int64_t`
  value can produce a misleading `uint32_t` dimension.
- Even after a dimension is represented as `uint32_t`, the original
`sizeof(T) * dim` could overflow `size_t` on 32-bit RKNPU targets. For
example,
`dim = 0x40000400` makes `sizeof(float) * dim` wrap to **4096 bytes**
while the
created tensor still advertises `dim` elements, which would corrupt the
heap
  when the bias is consumed by the driver.
- The original code also passed the `malloc` result to `memset` without
a null
  check.

The fix uses `SafeInt<size_t>` (the ORT-standard idiom for memory-size
arithmetic), validates ONNX dimensions before they enter the RKNPU
`uint32_t`
shape model, and replaces the manual `malloc`/`free` list with
zero-initialized,
RAII-owned `std::make_unique<uint8_t[]>` buffers.

**Validation:** the RKNPU EP requires the Rockchip DDK
(`RKNPU_DDK_PATH`) and an
ARM target, so it does not build on a typical x64 dev box and has no
GPU-free CI
leg. The change was validated with `clang-format`, `git diff --check`,
and a
standalone test against ORT's `SafeInt.hpp`, confirming the guard throws
on the
overflowing dimensions (`0x40000400`, `0xFFFFFFFF`) while preserving
normal and
zero-dimension behavior.

**Testing:** No unit test is included because the RKNPU EP is not
compiled in
any CI leg (it requires the proprietary Rockchip DDK and an ARM target),
and the
`ToRknpuDim` / `AllocZeroedBias` helpers have internal (file-`static`)
linkage,
so they are not reachable from the gtest suites. The overflow/validation
logic
was instead exercised with a standalone test against ORT's `SafeInt.hpp`
as
noted above.

---------

Co-authored-by: Gopalakrishnan Nallasamy <gopalakrishnan.nallasamy@microsoft.com>
microsoft#29202)

## Description

Fix microsoft#29198.

NVIDIA restructured the CUDA Python wheels starting with CUDA 13: the
per-component CUDA Toolkit packages (cublas, cufft, cuda_runtime,
cuda_nvrtc, curand, ...) were consolidated into a single
`nvidia/cu{major}` package and the `-cuNN` suffix was dropped from those
package names. This PR updates the DLL/shared-library preload logic and
the wheel dependency metadata so `onnxruntime-gpu` (and
`onnxruntime-trt-rtx`) keep working on both the legacy CUDA 12 layout
and the new CUDA 13 consolidated layout.

## Summary of Changes

### Preload logic (`onnxruntime/__init__.py`)

| File | Change |
|------|--------|
| `onnxruntime/__init__.py` | `_get_nvidia_dll_paths` now detects the
CUDA 13+ consolidated layout and resolves CUDA libraries under
`nvidia/cu{major}` — Windows uses an architecture sub-folder
(`bin/<arch>`, e.g. `bin/x86_64`), Linux uses a flat `lib`. The legacy
CUDA 12 per-component paths are preserved. |
| `onnxruntime/__init__.py` | Added `build_cuda_version` and `arch`
parameters (for testability/arch override); cuDNN paths factored out
since cuDNN keeps its own `nvidia/cudnn` package layout in both schemes.
|
| `onnxruntime/__init__.py` | `print_debug_info` drops the `-cuNN`
suffix from CUDA Toolkit package names for CUDA 13+ (cuDNN keeps its
suffixed name). |

### Wheel dependency metadata (`setup.py`)

| File | Change |
|------|--------|
| `setup.py` | `onnxruntime-gpu` `cuda` extras drop the `-cuNN` suffix
for CUDA 13+ (`nvidia-cuda-nvrtc`, `nvidia-cuda-runtime`,
`nvidia-cufft`, `nvidia-curand`); cuDNN dependency keeps the suffixed
name. |
| `setup.py` | `onnxruntime-trt-rtx` CUDA Runtime dependency drops the
`-cuNN` suffix for CUDA 13+. |

### Tests
(`onnxruntime/test/python/onnxruntime_test_python_preload_dlls.py`)

- New unit tests pin the expected relative paths for the CUDA 12
(legacy) and CUDA 13 (consolidated) layouts on both Windows and Linux,
the Windows arch override, the Linux flat-`lib` layout, the unchanged
cuDNN layout, and the `cuda`/`cudnn` toggles.

## Testing

- Run the new tests: `python -m pytest
onnxruntime/test/python/onnxruntime_test_python_preload_dlls.py` (or
`python -m unittest
onnxruntime.test.python.onnxruntime_test_python_preload_dlls`).
- Backward compatibility: CUDA 12 paths and the cuDNN layout are
unchanged; only CUDA 13+ takes the new consolidated paths and unsuffixed
package names.
- Build in Linux and Windows, and `pip install
onnxruntime-gpu*.whl[cuda,cudnn]`, then `import onnxruntime;
onnxruntime.preload_dlls()` can run successfully in python.

## Checklist

- [x] Tests added/updated
- [x] No breaking changes (CUDA 12 behavior preserved)
…ck helpers (microsoft#28624)

## Summary

This PR adds an opt-in mechanism that lets an application supply its own
I/O callbacks for an execution provider's EPContext binary data, so the
data can live somewhere other than a plain file on disk (for example, an
encrypted store or an in-memory buffer). It introduces the callback APIs
end-to-end and demonstrates their use with a sample helper in the AutoEP
example plugin EP.

When an EP compiles a model into an EPContext model, it may emit the
compiled blob either embedded in the ONNX model or as a separate
external payload. For the external case, ORT previously assumed the
payload is a file. These callbacks let the application own that
read/write instead, while ORT core stays policy-neutral and never
imposes a storage format.

### What this PR adds

- **Write callback (`OrtWriteNamedBufferFunc`) + setter
`OrtCompileApi::ModelCompilationOptions_SetEpContextDataWriteFunc`.**
Set on `OrtModelCompilationOptions`, because writing EPContext binary
data happens only during **compilation**. Passing a NULL callback clears
a previously set one.
- **Read callback (`OrtReadNamedBufferFunc`) + setter
`OrtApi::SessionOptions_SetEpContextDataReadFunc`.** Set on
`OrtSessionOptions`, because reading external EPContext binary data
happens during **session load / inference**. Passing a NULL callback
clears a previously set one.
- **EP-facing access via `OrtEpContextConfig`.** Both callbacks are
surfaced to execution providers through a single unified handle,
`OrtEpContextConfig`, obtained via
`OrtEpApi::SessionOptions_GetEpContextConfig` (getters
`EpContextConfig_GetEpContextDataReadFunc` /
`EpContextConfig_GetEpContextDataWriteFunc`, released with
`ReleaseEpContextConfig`). This keeps the application-facing setters
scoped to the correct lifecycle while giving EPs one consistent place to
retrieve both callbacks. Each setter's doc comment cross-references the
other so the split is discoverable.
- **Experimental API surface + C++ accessors.** These functions ship
through ORT's experimental API mechanism (declared in
`include/onnxruntime/core/session/onnxruntime_experimental_c_api.inc`),
so they are reached via the generated
`Ort::Experimental::Get_<name>_SinceV28_Fn(...)` / `...FnOrThrow(...)`
accessors rather than fixed `OrtApi` slots. A move-only RAII wrapper,
**`Ort::Experimental::EpContextConfig`** (in
`onnxruntime_experimental_cxx_api.h`), owns an `OrtEpContextConfig` and
exposes `GetReadFunc()` / `GetWriteFunc()`; it can be constructed
directly from a C++ `SessionOptions` / `ConstSessionOptions`.
- **Sample-only helper utilities**
(`onnxruntime/test/autoep/library/ep_context_data_utils.h`) implementing
callback-or-file fallback behavior: if a callback is supplied it is
used, otherwise the helper falls back to direct file I/O. The AutoEP
example plugin EP uses this helper for its external EPContext read/write
paths. Because the names read on the load side originate from the
untrusted EPContext model (`ep_cache_context` attribute), the helper
validates them: it rejects absolute/rooted paths, `..` traversal, and
directory-like names (`.` or a trailing separator), and confines
model-relative names to the model directory (resolving `.`/`..` and
symlinks via `std::filesystem::weakly_canonical`). It reports all
failures via `OrtStatus*` (no exceptions) and lives outside the public C
API / EP ABI, so it is purely illustrative and imposes no policy on ORT
core; its doc comments note that production EPs should still apply their
own sandboxing and payload size limits.

The callback typedef names (`OrtReadNamedBufferFunc` /
`OrtWriteNamedBufferFunc`) are intentionally generic. They are currently
used for EPContext binary data, but the contract is deliberately
storage-agnostic so future APIs can reuse the same callback shape for
other named data payloads.

### Note on the Android workflow change

`.github/workflows/android.yml` bumps the minimal-build binary-size
threshold (`1436672` -> `1438720` bytes) to accommodate the small size
increase from compiling the new experimental API into the Android
minimal build.

## Testing

- Built and tested in RelWithDebInfo: `python tools/ci_build/build.py
--config RelWithDebInfo --build --parallel --test --build_dir
build\Windows`.
- Focused EPContext suites:
- Public C/C++ API: `onnxruntime_shared_lib_test.exe
--gtest_filter=EpContextDataApiTest.*` -> 9 passed.
- AutoEP helper + compile/load end-to-end (callbacks and file fallback):
`onnxruntime_autoep_test.exe --gtest_filter=*EpContext*` -> 17 passed, 1
skipped (`EpContextDataUtils_ResolvePathRejectsSymlinkEscape` requires
the Windows "create symbolic link" privilege).
- `clang-format` clean on touched C++ files; `git diff --check`: clean.

Test layout: public EPContext API tests in
`onnxruntime/test/shared_lib/test_ep_context_data_api.cc`; sample-helper
unit tests in `onnxruntime/test/autoep/ep_context_data_utils_test.cc`;
compile/load end-to-end tests in
`onnxruntime/test/autoep/test_execution.cc`.

---------

Co-authored-by: Gopalakrishnan Nallasamy <gnallasamy@microsoft.com>
Co-authored-by: Gopalakrishnan Nallasamy <gopalakrishnan.nallasamy@microsoft.com>
Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
### Description

Bool initializers supplied via `TensorProto` `raw_data` are copied
verbatim by `UnpackTensor<bool>`, so their bytes are not guaranteed to
be the canonical `{0, 1}` (the `int32_data` path normalizes via
`static_cast<bool>`, but the `raw_data` path did not). Kernels across
the codebase assume bool tensors hold `{0, 1}`.

The CUDA `Compress` kernel is concretely affected: its output-sizing
path sign-extends the condition bytes (`int8_t` -> `int32_t`) through
`cub::DeviceScan::InclusiveSum`, while `_CompressKernel` selects
elements using bool truthiness (`condition_data[div]`). For condition
bytes outside `{0, 1}` the two interpretations disagree and the output
is sized inconsistently with how elements are written. The CPU kernel
uses truthiness for both sizing and selection and is unaffected.

### Changes

- `UnpackTensor<bool>` (`tensorprotoutils.cc`): normalize `raw_data`
bytes to `{0, 1}` after copy. The `UnpackTensorWithExternalData<bool>`
specialization does the same for external data read through that path.
- CUDA `Compress` `CastToInt32` (`compress_impl.cu`): normalize to `{0,
1}` (still returns `int32_t`, preserving the accumulator-widening intent
of microsoft#9295) so the sizing path matches the kernel's write predicate,
matching the CPU kernel and the CUDA `NonZero` `bool(x)` convention.
This makes the CUDA `Compress` kernel correct independently of how its
bool condition initializer was materialized.
- Shared helper `utils::NormalizeBoolTensorIfNeeded(Tensor&)`
(`tensorprotoutils.{h,cc}`): single normalization point, reused by
`TensorProtoToTensor()` (external branch, after `MakeCpuTensorCopy`) and
by the session-init external device-copy path.
- `session_state_utils.cc::DeserializeTensorProto`: for **external**
bool initializers loaded onto a non-CPU device, normalize the writable
CPU staging copy before `CopyTensorFromCPUToDevice`. The
`GetExtDataFromTensorProto` buffer may be a read-only mmap, so it is
normalized via a writable copy rather than in place.
- Unit tests in `tensorutils_test.cc` for bool `raw_data` with
non-canonical bytes and for the `NormalizeBoolTensorIfNeeded` helper. A
`Compress` `OpTester` test cannot reproduce the original bug because the
test harness itself normalizes bool during input construction, so
coverage is placed at the deserialization layer. Tests use only Status
returns and gtest assertions, so they build and run in no-exception
builds.

### Coverage / scope of bool normalization

Fully covered:
- In-proto `raw_data` bool initializers (all EPs), via
`UnpackTensor<bool>`.
- External bool initializers reaching `TensorProtoToTensor()`.
- External bool initializers copied to a non-CPU device through the
session-init device-copy path.

Intentionally **not** normalized (by design):
- The CPU zero-copy mmap path for external initializers
(`GetExtDataFromTensorProto` returns a read-only/shared mapping that
cannot be safely modified in place).
- The custom external-data-loader path
(`LoadExtDataToTensorFromTensorProto`), which loads directly into a
device tensor.

These remaining paths are safe for the concrete bug this PR targets
because the CUDA `Compress` kernel is hardened in `compress_impl.cu`
regardless of initializer storage. Other byte-comparing bool consumers
fed by an external mmap/custom-loader initializer with non-canonical
bytes are out of scope here.

### Motivation and Context

`CastToInt32` was introduced in microsoft#9295 to widen the `cub::InclusiveSum`
accumulator (an int8 overflow fix); it did not normalize the bool
interpretation. The accumulator-width and bool-normalization concerns
are independent. This change addresses the latter at the deserialization
source and hardens the CUDA `Compress` kernel.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This removes some unused code for MinLatency, which is not used in MoE
or QMoE.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.