Normalize bool tensor raw_data to {0, 1} on unpack by jiafatom · Pull Request #29238 · microsoft/onnxruntime

jiafatom · 2026-06-23T20:50:45Z

Description

Bool initializers supplied via TensorProto raw_data are copied verbatim by UnpackTensor<bool>, so their bytes are not guaranteed to be the canonical {0, 1} (the int32_data path normalizes via static_cast<bool>, but the raw_data path did not). Kernels across the codebase assume bool tensors hold {0, 1}.

The CUDA Compress kernel is concretely affected: its output-sizing path sign-extends the condition bytes (int8_t -> int32_t) through cub::DeviceScan::InclusiveSum, while _CompressKernel selects elements using bool truthiness (condition_data[div]). For condition bytes outside {0, 1} the two interpretations disagree and the output is sized inconsistently with how elements are written. The CPU kernel uses truthiness for both sizing and selection and is unaffected.

Changes

UnpackTensor<bool> (tensorprotoutils.cc): normalize raw_data bytes to {0, 1} after copy. The UnpackTensorWithExternalData<bool> specialization does the same for external data read through that path.
CUDA Compress CastToInt32 (compress_impl.cu): normalize to {0, 1} (still returns int32_t, preserving the accumulator-widening intent of Fix inclusive sum overlfow when applied on int8_t buffer in Compress #9295) so the sizing path matches the kernel's write predicate, matching the CPU kernel and the CUDA NonZero bool(x) convention. This makes the CUDA Compress kernel correct independently of how its bool condition initializer was materialized.
Shared helper utils::NormalizeBoolTensorIfNeeded(Tensor&) (tensorprotoutils.{h,cc}): single normalization point, reused by TensorProtoToTensor() (external branch, after MakeCpuTensorCopy) and by the session-init external device-copy path.
session_state_utils.cc::DeserializeTensorProto: for external bool initializers loaded onto a non-CPU device, normalize the writable CPU staging copy before CopyTensorFromCPUToDevice. The GetExtDataFromTensorProto buffer may be a read-only mmap, so it is normalized via a writable copy rather than in place.
Unit tests in tensorutils_test.cc for bool raw_data with non-canonical bytes and for the NormalizeBoolTensorIfNeeded helper. A Compress OpTester test cannot reproduce the original bug because the test harness itself normalizes bool during input construction, so coverage is placed at the deserialization layer. Tests use only Status returns and gtest assertions, so they build and run in no-exception builds.

Coverage / scope of bool normalization

Fully covered:

In-proto raw_data bool initializers (all EPs), via UnpackTensor<bool>.
External bool initializers reaching TensorProtoToTensor().
External bool initializers copied to a non-CPU device through the session-init device-copy path.

Intentionally not normalized (by design):

The CPU zero-copy mmap path for external initializers (GetExtDataFromTensorProto returns a read-only/shared mapping that cannot be safely modified in place).
The custom external-data-loader path (LoadExtDataToTensorFromTensorProto), which loads directly into a device tensor.

These remaining paths are safe for the concrete bug this PR targets because the CUDA Compress kernel is hardened in compress_impl.cu regardless of initializer storage. Other byte-comparing bool consumers fed by an external mmap/custom-loader initializer with non-canonical bytes are out of scope here.

Motivation and Context

CastToInt32 was introduced in #9295 to widen the cub::InclusiveSum accumulator (an int8 overflow fix); it did not normalize the bool interpretation. The accumulator-width and bool-normalization concerns are independent. This change addresses the latter at the deserialization source and hardens the CUDA Compress kernel.

Copilot

Pull request overview

This PR hardens ONNX Runtime’s deserialization and CUDA Compress behavior by ensuring boolean tensors always use canonical byte values {0, 1} when coming from TensorProto.raw_data. This prevents downstream kernels (notably CUDA Compress sizing vs. selection) from disagreeing when encountering non-canonical nonzero bytes.

Changes:

Normalize unpacked bool tensor raw_data bytes to {0, 1} in UnpackTensor<bool> after copying.
Normalize CUDA Compress prefix-sum input values to {0, 1} (while still widening to int32_t) so sizing matches the write predicate.
Add a unit test verifying non-canonical raw_data bytes are normalized on unpack.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File	Description
onnxruntime/core/framework/tensorprotoutils.cc	Normalizes `bool` `raw_data` bytes to `{0, 1}` after unpacking so all consumers observe consistent canonical bool storage.
onnxruntime/core/providers/cuda/tensor/compress_impl.cu	Makes `CastToInt32` treat any nonzero condition byte as `1` to keep CUDA `Compress` sizing consistent with its boolean predicate.
onnxruntime/test/framework/tensorutils_test.cc	Adds coverage for bool `TensorProto.raw_data` containing non-canonical bytes, validating normalization at the deserialization layer.

tianleiwu

Thanks for the fix. The CUDA Compress change is correct and self-consistent on its own: CastToInt32 now returns v != 0 ? 1 : 0, so the cub::InclusiveSum sizing path and the _CompressKernel write predicate (condition_data[div]) agree for any non-canonical byte. That alone fixes the concrete Compress bug. The UnpackTensor<bool> raw_data normalization and the new external-data specialization are correct, and the no-exception-safe tests are a good addition.

One scope note: the primary initializer-materialization path does not go through the new UnpackTensorWithExternalData<bool>. TensorProtoToTensor() handles external data via GetExtDataFromTensorProto() + MakeCpuTensorCopy(), which copies the raw external bytes verbatim and never normalizes bool. So an external bool initializer with non-canonical bytes still reaches kernels un-normalized — the in-proto raw_data path is fully covered, but the external path is not. Either broaden the fix to normalize in that path too, or soften the PR description's "root-cause fix applies to all EPs" wording. The Compress kernel itself is safe regardless because of the .cu change. Details inline.

tianleiwu

Thanks for the follow-up — both round-1 points are addressed: the shared NormalizeBoolBytes helper removes the duplication (and unifies the sizeof(bool) == 1 guard), and the external branch in TensorProtoToTensor() now normalizes after MakeCpuTensorCopy.

One remaining coverage note (non-blocking): the primary session-init flow for external initializers does not actually route through TensorProtoToTensor(). In session_state_utils.cc::DeserializeTensorProto, external data is materialized directly via GetExtDataFromTensorProto / LoadExtDataToTensorFromTensorProto (CPU zero-copy mmap; non-CPU read-then-CopyTensorFromCPUToDevice; custom loader), all of which bypass the new normalization. So a bool external initializer consumed on CPU or copied to GPU can still reach kernels un-normalized. The CUDA Compress kernel is safe regardless thanks to the .cu CastToInt32 fix, so this isn't blocking — just flagging that the "normalize in the primary weight path" framing doesn't fully cover the external-initializer flow, and other byte-comparing bool consumers wouldn't be covered for external initializers. Detail inline.

jiafatom · 2026-06-26T17:36:46Z

Thanks @tianleiwu. Addressed the external-initializer scope note:

Extracted a shared utils::NormalizeBoolTensorIfNeeded(Tensor&) helper (reused by TensorProtoToTensor() and the new path).
In session_state_utils.cc::DeserializeTensorProto, the non-CPU device-copy path for external bool initializers now normalizes the writable CPU staging copy before CopyTensorFromCPUToDevice. Since GetExtDataFromTensorProto may hand back a read-only mmap, I normalize via a writable copy rather than in place.
Added a unit test for the helper.

The two remaining external paths are intentionally left un-normalized and I've tightened the PR description accordingly:

CPU zero-copy mmap (GetExtDataFromTensorProto read-only/shared mapping) — cannot be modified in place safely.
Custom external-data-loader (LoadExtDataToTensorFromTensorProto) — loads directly into a device tensor.

Both are safe for the concrete bug here because CUDA Compress is hardened in compress_impl.cu regardless of initializer storage; the description now scopes the claim to those covered paths instead of "all EPs / primary weight path."

tianleiwu

Re-reviewed at head 4232193c. All previously raised concerns are addressed:

The shared NormalizeBoolBytes(uint8_t*, size_t) helper now backs both UnpackTensor<bool> (raw_data) and the UnpackTensorWithExternalData<bool> specialization, keeping the static_assert(sizeof(bool) == 1) guard in one place instead of two duplicated loops.
TensorProtoToTensor() now normalizes external bool data right after MakeCpuTensorCopy, so the primary weight-materialization path is covered, not just UnpackTensor.
The session-init non-CPU device-copy path (session_state_utils.cc::DeserializeTensorProto) normalizes a writable CPU staging copy before CopyTensorFromCPUToDevice. I verified this branch is actually reachable for CUDA: the CUDA EP does not override GetExternalDataLoader, so external bool initializers headed to a CUDA device do not divert through the un-normalized custom-loader branch.

Additional verification:

The CUDA Compress kernel hardening (CastToInt32 returning v != 0 ? 1 : 0) is present on main, so the intentionally-out-of-scope paths (CPU zero-copy mmap; custom external-data loader, used only by WebGPU/JS which don't run the affected CUDA Compress sizing path) don't expose the inconsistency this PR targets.
UnpackTensorWithExternalData is file-local to tensorprotoutils.cc and the explicit <bool> specialization precedes its only call site, so dropping INSTANTIATE_UNPACK_EXTERNAL_TENSOR(bool) (to avoid -Winstantiation-after-specialization) is safe with no ODR/implicit-instantiation risk.
Tests cover raw_data, external-data, and the NormalizeBoolTensorIfNeeded helper, and use only Status/gtest assertions so they hold under no-exception builds.

LGTM.

Addresses review comment: apply the same {0, 1} normalization to the UnpackTensorWithExternalData path, since external data is memcpy'd verbatim and may contain non-canonical bool bytes. Add a test covering this path. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…rmalizeBoolBytes helper Address review feedback: - Normalize bool bytes in TensorProtoToTensor's external-data path (GetExtDataFromTensorProto + MakeCpuTensorCopy), which copied external bytes verbatim and bypassed normalization for bool initializers. - Extract shared NormalizeBoolBytes() helper (with sizeof(bool)==1 guard) used by UnpackTensorWithExternalData<bool>, UnpackTensor<bool> raw-data path, and the external-data weight path, removing duplicated loops. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…nit path The primary session-init flow for external initializers does not route through TensorProtoToTensor(): session_state_utils.cc::DeserializeTensorProto materializes external data directly via GetExtDataFromTensorProto. For the non-CPU device-copy path, normalize bool bytes to canonical {0, 1} in a writable CPU staging copy before CopyTensorFromCPUToDevice (the GetExtDataFromTensorProto buffer may be a read-only mmap, so it cannot be normalized in place). Extract a shared utils::NormalizeBoolTensorIfNeeded helper and reuse it in both TensorProtoToTensor() and the new device-copy path. Add a unit test for the helper. The CPU zero-copy mmap path and the custom external-data-loader path are intentionally left un-normalized (read-only mapping / device-side load); the CUDA Compress kernel is safe regardless thanks to the CastToInt32 fix in compress_impl.cu. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…ation The bool {0,1} normalization changes add a shared NormalizeBoolBytes helper and NormalizeBoolTensorIfNeeded calls, growing the minimal-baseline libonnxruntime.so to 1439258 bytes, just over the current 1438720 threshold. Bump to 1440768 (1407 KiB) to accommodate the increase with headroom. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

jiafatom requested a review from Copilot June 24, 2026 04:00

Copilot started reviewing on behalf of jiafatom June 24, 2026 04:00 View session

Copilot AI reviewed Jun 24, 2026

View reviewed changes

apsonawane reviewed Jun 24, 2026

View reviewed changes

Comment thread onnxruntime/core/framework/tensorprotoutils.cc

jiafatom force-pushed the jiafatom/compress-bool-normalize branch 2 times, most recently from 46a5ef4 to 9f1ceea Compare June 24, 2026 22:55

tianleiwu reviewed Jun 25, 2026

View reviewed changes

Comment thread onnxruntime/core/framework/tensorprotoutils.cc Outdated

Comment thread onnxruntime/core/framework/tensorprotoutils.cc Outdated

tianleiwu reviewed Jun 25, 2026

View reviewed changes

Comment thread onnxruntime/core/framework/tensorprotoutils.cc Outdated

jiafatom force-pushed the jiafatom/compress-bool-normalize branch 2 times, most recently from f395dfa to 4232193 Compare June 26, 2026 22:30

tianleiwu previously approved these changes Jun 26, 2026

View reviewed changes

jiafatom and others added 3 commits June 27, 2026 00:31

jiafatom dismissed tianleiwu’s stale review via 47061a9 June 27, 2026 00:31

jiafatom force-pushed the jiafatom/compress-bool-normalize branch from 4232193 to 47061a9 Compare June 27, 2026 00:31

jiafatom enabled auto-merge (squash) June 27, 2026 04:53

tianleiwu approved these changes Jun 27, 2026

View reviewed changes

jiafatom merged commit c65d9b6 into microsoft:main Jun 27, 2026
86 checks passed

jiafatom deleted the jiafatom/compress-bool-normalize branch June 27, 2026 04:54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Normalize bool tensor raw_data to {0, 1} on unpack#29238

Normalize bool tensor raw_data to {0, 1} on unpack#29238
jiafatom merged 4 commits into
microsoft:mainfrom
jiafatom:jiafatom/compress-bool-normalize

jiafatom commented Jun 23, 2026 •

edited

Loading

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

tianleiwu left a comment

Uh oh!

Uh oh!

Uh oh!

tianleiwu left a comment

Uh oh!

Uh oh!

jiafatom commented Jun 26, 2026

Uh oh!

tianleiwu left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Conversation

jiafatom commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Changes

Coverage / scope of bool normalization

Motivation and Context

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

tianleiwu left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

tianleiwu left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

jiafatom commented Jun 26, 2026

Uh oh!

tianleiwu left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

jiafatom commented Jun 23, 2026 •

edited

Loading