Skip to content

Normalize bool tensor raw_data to {0, 1} on unpack#29238

Merged
jiafatom merged 4 commits into
microsoft:mainfrom
jiafatom:jiafatom/compress-bool-normalize
Jun 27, 2026
Merged

Normalize bool tensor raw_data to {0, 1} on unpack#29238
jiafatom merged 4 commits into
microsoft:mainfrom
jiafatom:jiafatom/compress-bool-normalize

Conversation

@jiafatom

@jiafatom jiafatom commented Jun 23, 2026

Copy link
Copy Markdown
Contributor

Description

Bool initializers supplied via TensorProto raw_data are copied verbatim by UnpackTensor<bool>, so their bytes are not guaranteed to be the canonical {0, 1} (the int32_data path normalizes via static_cast<bool>, but the raw_data path did not). Kernels across the codebase assume bool tensors hold {0, 1}.

The CUDA Compress kernel is concretely affected: its output-sizing path sign-extends the condition bytes (int8_t -> int32_t) through cub::DeviceScan::InclusiveSum, while _CompressKernel selects elements using bool truthiness (condition_data[div]). For condition bytes outside {0, 1} the two interpretations disagree and the output is sized inconsistently with how elements are written. The CPU kernel uses truthiness for both sizing and selection and is unaffected.

Changes

  • UnpackTensor<bool> (tensorprotoutils.cc): normalize raw_data bytes to {0, 1} after copy. The UnpackTensorWithExternalData<bool> specialization does the same for external data read through that path.
  • CUDA Compress CastToInt32 (compress_impl.cu): normalize to {0, 1} (still returns int32_t, preserving the accumulator-widening intent of Fix inclusive sum overlfow when applied on int8_t buffer in Compress #9295) so the sizing path matches the kernel's write predicate, matching the CPU kernel and the CUDA NonZero bool(x) convention. This makes the CUDA Compress kernel correct independently of how its bool condition initializer was materialized.
  • Shared helper utils::NormalizeBoolTensorIfNeeded(Tensor&) (tensorprotoutils.{h,cc}): single normalization point, reused by TensorProtoToTensor() (external branch, after MakeCpuTensorCopy) and by the session-init external device-copy path.
  • session_state_utils.cc::DeserializeTensorProto: for external bool initializers loaded onto a non-CPU device, normalize the writable CPU staging copy before CopyTensorFromCPUToDevice. The GetExtDataFromTensorProto buffer may be a read-only mmap, so it is normalized via a writable copy rather than in place.
  • Unit tests in tensorutils_test.cc for bool raw_data with non-canonical bytes and for the NormalizeBoolTensorIfNeeded helper. A Compress OpTester test cannot reproduce the original bug because the test harness itself normalizes bool during input construction, so coverage is placed at the deserialization layer. Tests use only Status returns and gtest assertions, so they build and run in no-exception builds.

Coverage / scope of bool normalization

Fully covered:

  • In-proto raw_data bool initializers (all EPs), via UnpackTensor<bool>.
  • External bool initializers reaching TensorProtoToTensor().
  • External bool initializers copied to a non-CPU device through the session-init device-copy path.

Intentionally not normalized (by design):

  • The CPU zero-copy mmap path for external initializers (GetExtDataFromTensorProto returns a read-only/shared mapping that cannot be safely modified in place).
  • The custom external-data-loader path (LoadExtDataToTensorFromTensorProto), which loads directly into a device tensor.

These remaining paths are safe for the concrete bug this PR targets because the CUDA Compress kernel is hardened in compress_impl.cu regardless of initializer storage. Other byte-comparing bool consumers fed by an external mmap/custom-loader initializer with non-canonical bytes are out of scope here.

Motivation and Context

CastToInt32 was introduced in #9295 to widen the cub::InclusiveSum accumulator (an int8 overflow fix); it did not normalize the bool interpretation. The accumulator-width and bool-normalization concerns are independent. This change addresses the latter at the deserialization source and hardens the CUDA Compress kernel.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens ONNX Runtime’s deserialization and CUDA Compress behavior by ensuring boolean tensors always use canonical byte values {0, 1} when coming from TensorProto.raw_data. This prevents downstream kernels (notably CUDA Compress sizing vs. selection) from disagreeing when encountering non-canonical nonzero bytes.

Changes:

  • Normalize unpacked bool tensor raw_data bytes to {0, 1} in UnpackTensor<bool> after copying.
  • Normalize CUDA Compress prefix-sum input values to {0, 1} (while still widening to int32_t) so sizing matches the write predicate.
  • Add a unit test verifying non-canonical raw_data bytes are normalized on unpack.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
onnxruntime/core/framework/tensorprotoutils.cc Normalizes bool raw_data bytes to {0, 1} after unpacking so all consumers observe consistent canonical bool storage.
onnxruntime/core/providers/cuda/tensor/compress_impl.cu Makes CastToInt32 treat any nonzero condition byte as 1 to keep CUDA Compress sizing consistent with its boolean predicate.
onnxruntime/test/framework/tensorutils_test.cc Adds coverage for bool TensorProto.raw_data containing non-canonical bytes, validating normalization at the deserialization layer.

Comment thread onnxruntime/core/framework/tensorprotoutils.cc
@jiafatom jiafatom force-pushed the jiafatom/compress-bool-normalize branch 2 times, most recently from 46a5ef4 to 9f1ceea Compare June 24, 2026 22:55

@tianleiwu tianleiwu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix. The CUDA Compress change is correct and self-consistent on its own: CastToInt32 now returns v != 0 ? 1 : 0, so the cub::InclusiveSum sizing path and the _CompressKernel write predicate (condition_data[div]) agree for any non-canonical byte. That alone fixes the concrete Compress bug. The UnpackTensor<bool> raw_data normalization and the new external-data specialization are correct, and the no-exception-safe tests are a good addition.

One scope note: the primary initializer-materialization path does not go through the new UnpackTensorWithExternalData<bool>. TensorProtoToTensor() handles external data via GetExtDataFromTensorProto() + MakeCpuTensorCopy(), which copies the raw external bytes verbatim and never normalizes bool. So an external bool initializer with non-canonical bytes still reaches kernels un-normalized — the in-proto raw_data path is fully covered, but the external path is not. Either broaden the fix to normalize in that path too, or soften the PR description's "root-cause fix applies to all EPs" wording. The Compress kernel itself is safe regardless because of the .cu change. Details inline.

Comment thread onnxruntime/core/framework/tensorprotoutils.cc Outdated
Comment thread onnxruntime/core/framework/tensorprotoutils.cc Outdated

@tianleiwu tianleiwu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the follow-up — both round-1 points are addressed: the shared NormalizeBoolBytes helper removes the duplication (and unifies the sizeof(bool) == 1 guard), and the external branch in TensorProtoToTensor() now normalizes after MakeCpuTensorCopy.

One remaining coverage note (non-blocking): the primary session-init flow for external initializers does not actually route through TensorProtoToTensor(). In session_state_utils.cc::DeserializeTensorProto, external data is materialized directly via GetExtDataFromTensorProto / LoadExtDataToTensorFromTensorProto (CPU zero-copy mmap; non-CPU read-then-CopyTensorFromCPUToDevice; custom loader), all of which bypass the new normalization. So a bool external initializer consumed on CPU or copied to GPU can still reach kernels un-normalized. The CUDA Compress kernel is safe regardless thanks to the .cu CastToInt32 fix, so this isn't blocking — just flagging that the "normalize in the primary weight path" framing doesn't fully cover the external-initializer flow, and other byte-comparing bool consumers wouldn't be covered for external initializers. Detail inline.

Comment thread onnxruntime/core/framework/tensorprotoutils.cc Outdated
@jiafatom

Copy link
Copy Markdown
Contributor Author

Thanks @tianleiwu. Addressed the external-initializer scope note:

  • Extracted a shared utils::NormalizeBoolTensorIfNeeded(Tensor&) helper (reused by TensorProtoToTensor() and the new path).
  • In session_state_utils.cc::DeserializeTensorProto, the non-CPU device-copy path for external bool initializers now normalizes the writable CPU staging copy before CopyTensorFromCPUToDevice. Since GetExtDataFromTensorProto may hand back a read-only mmap, I normalize via a writable copy rather than in place.
  • Added a unit test for the helper.

The two remaining external paths are intentionally left un-normalized and I've tightened the PR description accordingly:

  • CPU zero-copy mmap (GetExtDataFromTensorProto read-only/shared mapping) — cannot be modified in place safely.
  • Custom external-data-loader (LoadExtDataToTensorFromTensorProto) — loads directly into a device tensor.

Both are safe for the concrete bug here because CUDA Compress is hardened in compress_impl.cu regardless of initializer storage; the description now scopes the claim to those covered paths instead of "all EPs / primary weight path."

@jiafatom jiafatom force-pushed the jiafatom/compress-bool-normalize branch 2 times, most recently from f395dfa to 4232193 Compare June 26, 2026 22:30
tianleiwu
tianleiwu previously approved these changes Jun 26, 2026

@tianleiwu tianleiwu left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reviewed at head 4232193c. All previously raised concerns are addressed:

  • The shared NormalizeBoolBytes(uint8_t*, size_t) helper now backs both UnpackTensor<bool> (raw_data) and the UnpackTensorWithExternalData<bool> specialization, keeping the static_assert(sizeof(bool) == 1) guard in one place instead of two duplicated loops.
  • TensorProtoToTensor() now normalizes external bool data right after MakeCpuTensorCopy, so the primary weight-materialization path is covered, not just UnpackTensor.
  • The session-init non-CPU device-copy path (session_state_utils.cc::DeserializeTensorProto) normalizes a writable CPU staging copy before CopyTensorFromCPUToDevice. I verified this branch is actually reachable for CUDA: the CUDA EP does not override GetExternalDataLoader, so external bool initializers headed to a CUDA device do not divert through the un-normalized custom-loader branch.

Additional verification:

  • The CUDA Compress kernel hardening (CastToInt32 returning v != 0 ? 1 : 0) is present on main, so the intentionally-out-of-scope paths (CPU zero-copy mmap; custom external-data loader, used only by WebGPU/JS which don't run the affected CUDA Compress sizing path) don't expose the inconsistency this PR targets.
  • UnpackTensorWithExternalData is file-local to tensorprotoutils.cc and the explicit <bool> specialization precedes its only call site, so dropping INSTANTIATE_UNPACK_EXTERNAL_TENSOR(bool) (to avoid -Winstantiation-after-specialization) is safe with no ODR/implicit-instantiation risk.
  • Tests cover raw_data, external-data, and the NormalizeBoolTensorIfNeeded helper, and use only Status/gtest assertions so they hold under no-exception builds.

LGTM.

jiafatom and others added 3 commits June 27, 2026 00:31
Addresses review comment: apply the same {0, 1} normalization to the
UnpackTensorWithExternalData path, since external data is memcpy'd verbatim
and may contain non-canonical bool bytes. Add a test covering this path.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rmalizeBoolBytes helper

Address review feedback:
- Normalize bool bytes in TensorProtoToTensor's external-data path
  (GetExtDataFromTensorProto + MakeCpuTensorCopy), which copied external
  bytes verbatim and bypassed normalization for bool initializers.
- Extract shared NormalizeBoolBytes() helper (with sizeof(bool)==1 guard)
  used by UnpackTensorWithExternalData<bool>, UnpackTensor<bool> raw-data
  path, and the external-data weight path, removing duplicated loops.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nit path

The primary session-init flow for external initializers does not route through
TensorProtoToTensor(): session_state_utils.cc::DeserializeTensorProto materializes
external data directly via GetExtDataFromTensorProto. For the non-CPU device-copy
path, normalize bool bytes to canonical {0, 1} in a writable CPU staging copy before
CopyTensorFromCPUToDevice (the GetExtDataFromTensorProto buffer may be a read-only
mmap, so it cannot be normalized in place).

Extract a shared utils::NormalizeBoolTensorIfNeeded helper and reuse it in both
TensorProtoToTensor() and the new device-copy path. Add a unit test for the helper.

The CPU zero-copy mmap path and the custom external-data-loader path are intentionally
left un-normalized (read-only mapping / device-side load); the CUDA Compress kernel is
safe regardless thanks to the CastToInt32 fix in compress_impl.cu.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jiafatom jiafatom force-pushed the jiafatom/compress-bool-normalize branch from 4232193 to 47061a9 Compare June 27, 2026 00:31
…ation

The bool {0,1} normalization changes add a shared NormalizeBoolBytes helper
and NormalizeBoolTensorIfNeeded calls, growing the minimal-baseline
libonnxruntime.so to 1439258 bytes, just over the current 1438720 threshold.
Bump to 1440768 (1407 KiB) to accommodate the increase with headroom.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jiafatom jiafatom enabled auto-merge (squash) June 27, 2026 04:53
@jiafatom jiafatom merged commit c65d9b6 into microsoft:main Jun 27, 2026
86 checks passed
@jiafatom jiafatom deleted the jiafatom/compress-bool-normalize branch June 27, 2026 04:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants