Normalize bool tensor raw_data to {0, 1} on unpack#29238
Conversation
There was a problem hiding this comment.
Pull request overview
This PR hardens ONNX Runtime’s deserialization and CUDA Compress behavior by ensuring boolean tensors always use canonical byte values {0, 1} when coming from TensorProto.raw_data. This prevents downstream kernels (notably CUDA Compress sizing vs. selection) from disagreeing when encountering non-canonical nonzero bytes.
Changes:
- Normalize unpacked
booltensorraw_databytes to{0, 1}inUnpackTensor<bool>after copying. - Normalize CUDA
Compressprefix-sum input values to{0, 1}(while still widening toint32_t) so sizing matches the write predicate. - Add a unit test verifying non-canonical
raw_databytes are normalized on unpack.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| onnxruntime/core/framework/tensorprotoutils.cc | Normalizes bool raw_data bytes to {0, 1} after unpacking so all consumers observe consistent canonical bool storage. |
| onnxruntime/core/providers/cuda/tensor/compress_impl.cu | Makes CastToInt32 treat any nonzero condition byte as 1 to keep CUDA Compress sizing consistent with its boolean predicate. |
| onnxruntime/test/framework/tensorutils_test.cc | Adds coverage for bool TensorProto.raw_data containing non-canonical bytes, validating normalization at the deserialization layer. |
46a5ef4 to
9f1ceea
Compare
tianleiwu
left a comment
There was a problem hiding this comment.
Thanks for the fix. The CUDA Compress change is correct and self-consistent on its own: CastToInt32 now returns v != 0 ? 1 : 0, so the cub::InclusiveSum sizing path and the _CompressKernel write predicate (condition_data[div]) agree for any non-canonical byte. That alone fixes the concrete Compress bug. The UnpackTensor<bool> raw_data normalization and the new external-data specialization are correct, and the no-exception-safe tests are a good addition.
One scope note: the primary initializer-materialization path does not go through the new UnpackTensorWithExternalData<bool>. TensorProtoToTensor() handles external data via GetExtDataFromTensorProto() + MakeCpuTensorCopy(), which copies the raw external bytes verbatim and never normalizes bool. So an external bool initializer with non-canonical bytes still reaches kernels un-normalized — the in-proto raw_data path is fully covered, but the external path is not. Either broaden the fix to normalize in that path too, or soften the PR description's "root-cause fix applies to all EPs" wording. The Compress kernel itself is safe regardless because of the .cu change. Details inline.
tianleiwu
left a comment
There was a problem hiding this comment.
Thanks for the follow-up — both round-1 points are addressed: the shared NormalizeBoolBytes helper removes the duplication (and unifies the sizeof(bool) == 1 guard), and the external branch in TensorProtoToTensor() now normalizes after MakeCpuTensorCopy.
One remaining coverage note (non-blocking): the primary session-init flow for external initializers does not actually route through TensorProtoToTensor(). In session_state_utils.cc::DeserializeTensorProto, external data is materialized directly via GetExtDataFromTensorProto / LoadExtDataToTensorFromTensorProto (CPU zero-copy mmap; non-CPU read-then-CopyTensorFromCPUToDevice; custom loader), all of which bypass the new normalization. So a bool external initializer consumed on CPU or copied to GPU can still reach kernels un-normalized. The CUDA Compress kernel is safe regardless thanks to the .cu CastToInt32 fix, so this isn't blocking — just flagging that the "normalize in the primary weight path" framing doesn't fully cover the external-initializer flow, and other byte-comparing bool consumers wouldn't be covered for external initializers. Detail inline.
|
Thanks @tianleiwu. Addressed the external-initializer scope note:
The two remaining external paths are intentionally left un-normalized and I've tightened the PR description accordingly:
Both are safe for the concrete bug here because CUDA |
f395dfa to
4232193
Compare
tianleiwu
left a comment
There was a problem hiding this comment.
Re-reviewed at head 4232193c. All previously raised concerns are addressed:
- The shared
NormalizeBoolBytes(uint8_t*, size_t)helper now backs bothUnpackTensor<bool>(raw_data) and theUnpackTensorWithExternalData<bool>specialization, keeping thestatic_assert(sizeof(bool) == 1)guard in one place instead of two duplicated loops. TensorProtoToTensor()now normalizes external bool data right afterMakeCpuTensorCopy, so the primary weight-materialization path is covered, not justUnpackTensor.- The session-init non-CPU device-copy path (
session_state_utils.cc::DeserializeTensorProto) normalizes a writable CPU staging copy beforeCopyTensorFromCPUToDevice. I verified this branch is actually reachable for CUDA: the CUDA EP does not overrideGetExternalDataLoader, so external bool initializers headed to a CUDA device do not divert through the un-normalized custom-loader branch.
Additional verification:
- The CUDA
Compresskernel hardening (CastToInt32returningv != 0 ? 1 : 0) is present onmain, so the intentionally-out-of-scope paths (CPU zero-copy mmap; custom external-data loader, used only by WebGPU/JS which don't run the affected CUDACompresssizing path) don't expose the inconsistency this PR targets. UnpackTensorWithExternalDatais file-local totensorprotoutils.ccand the explicit<bool>specialization precedes its only call site, so droppingINSTANTIATE_UNPACK_EXTERNAL_TENSOR(bool)(to avoid-Winstantiation-after-specialization) is safe with no ODR/implicit-instantiation risk.- Tests cover raw_data, external-data, and the
NormalizeBoolTensorIfNeededhelper, and use only Status/gtest assertions so they hold under no-exception builds.
LGTM.
Addresses review comment: apply the same {0, 1} normalization to the
UnpackTensorWithExternalData path, since external data is memcpy'd verbatim
and may contain non-canonical bool bytes. Add a test covering this path.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rmalizeBoolBytes helper Address review feedback: - Normalize bool bytes in TensorProtoToTensor's external-data path (GetExtDataFromTensorProto + MakeCpuTensorCopy), which copied external bytes verbatim and bypassed normalization for bool initializers. - Extract shared NormalizeBoolBytes() helper (with sizeof(bool)==1 guard) used by UnpackTensorWithExternalData<bool>, UnpackTensor<bool> raw-data path, and the external-data weight path, removing duplicated loops. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nit path
The primary session-init flow for external initializers does not route through
TensorProtoToTensor(): session_state_utils.cc::DeserializeTensorProto materializes
external data directly via GetExtDataFromTensorProto. For the non-CPU device-copy
path, normalize bool bytes to canonical {0, 1} in a writable CPU staging copy before
CopyTensorFromCPUToDevice (the GetExtDataFromTensorProto buffer may be a read-only
mmap, so it cannot be normalized in place).
Extract a shared utils::NormalizeBoolTensorIfNeeded helper and reuse it in both
TensorProtoToTensor() and the new device-copy path. Add a unit test for the helper.
The CPU zero-copy mmap path and the custom external-data-loader path are intentionally
left un-normalized (read-only mapping / device-side load); the CUDA Compress kernel is
safe regardless thanks to the CastToInt32 fix in compress_impl.cu.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
4232193 to
47061a9
Compare
…ation
The bool {0,1} normalization changes add a shared NormalizeBoolBytes helper
and NormalizeBoolTensorIfNeeded calls, growing the minimal-baseline
libonnxruntime.so to 1439258 bytes, just over the current 1438720 threshold.
Bump to 1440768 (1407 KiB) to accommodate the increase with headroom.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Description
Bool initializers supplied via
TensorProtoraw_dataare copied verbatim byUnpackTensor<bool>, so their bytes are not guaranteed to be the canonical{0, 1}(theint32_datapath normalizes viastatic_cast<bool>, but theraw_datapath did not). Kernels across the codebase assume bool tensors hold{0, 1}.The CUDA
Compresskernel is concretely affected: its output-sizing path sign-extends the condition bytes (int8_t->int32_t) throughcub::DeviceScan::InclusiveSum, while_CompressKernelselects elements using bool truthiness (condition_data[div]). For condition bytes outside{0, 1}the two interpretations disagree and the output is sized inconsistently with how elements are written. The CPU kernel uses truthiness for both sizing and selection and is unaffected.Changes
UnpackTensor<bool>(tensorprotoutils.cc): normalizeraw_databytes to{0, 1}after copy. TheUnpackTensorWithExternalData<bool>specialization does the same for external data read through that path.CompressCastToInt32(compress_impl.cu): normalize to{0, 1}(still returnsint32_t, preserving the accumulator-widening intent of Fix inclusive sum overlfow when applied on int8_t buffer in Compress #9295) so the sizing path matches the kernel's write predicate, matching the CPU kernel and the CUDANonZerobool(x)convention. This makes the CUDACompresskernel correct independently of how its bool condition initializer was materialized.utils::NormalizeBoolTensorIfNeeded(Tensor&)(tensorprotoutils.{h,cc}): single normalization point, reused byTensorProtoToTensor()(external branch, afterMakeCpuTensorCopy) and by the session-init external device-copy path.session_state_utils.cc::DeserializeTensorProto: for external bool initializers loaded onto a non-CPU device, normalize the writable CPU staging copy beforeCopyTensorFromCPUToDevice. TheGetExtDataFromTensorProtobuffer may be a read-only mmap, so it is normalized via a writable copy rather than in place.tensorutils_test.ccfor boolraw_datawith non-canonical bytes and for theNormalizeBoolTensorIfNeededhelper. ACompressOpTestertest cannot reproduce the original bug because the test harness itself normalizes bool during input construction, so coverage is placed at the deserialization layer. Tests use only Status returns and gtest assertions, so they build and run in no-exception builds.Coverage / scope of bool normalization
Fully covered:
raw_databool initializers (all EPs), viaUnpackTensor<bool>.TensorProtoToTensor().Intentionally not normalized (by design):
GetExtDataFromTensorProtoreturns a read-only/shared mapping that cannot be safely modified in place).LoadExtDataToTensorFromTensorProto), which loads directly into a device tensor.These remaining paths are safe for the concrete bug this PR targets because the CUDA
Compresskernel is hardened incompress_impl.curegardless of initializer storage. Other byte-comparing bool consumers fed by an external mmap/custom-loader initializer with non-canonical bytes are out of scope here.Motivation and Context
CastToInt32was introduced in #9295 to widen thecub::InclusiveSumaccumulator (an int8 overflow fix); it did not normalize the bool interpretation. The accumulator-width and bool-normalization concerns are independent. This change addresses the latter at the deserialization source and hardens the CUDACompresskernel.