[cuda] grow data-partition block-offset buffers for large per-leaf grids by maxwbuckley · Pull Request #22 · BelixRogner/ExaBoost

maxwbuckley · 2026-06-03T22:43:18Z

Problem

CUDADataPartition::CalcBlockDim is non-monotonic in num_data_in_leaf: it
rounds the per-block data count up to a power of two, so a smaller leaf can
require more blocks than the full dataset. With SPLIT_INDICES_BLOCK_SIZE_DATA_PARTITION = 1024:

leaf size	grid_dim_
200	50
159	80
140	70
120	61

The per-block offset buffers cuda_block_data_to_{left,right}_offset_ are sized
once for the full-dataset grid (max_num_split_indices_blocks_, set at
construction and in ResetTrainingData). When Split() calls
CalcBlockDim(num_data_in_leaf) for a leaf whose grid_dim_ exceeds that
capacity, GenDataToLeftBitVectorKernel's PrepareOffset writes
block_to_{left,right}_offset_buffer[blockIdx.x + 1] past the end of those
buffers.

This is reached by the bagged root (~bagging_fraction * num_data) and by
any non-bagged leaf in the ~101–160 range on a dataset whose full-data grid
is smaller (e.g. num_data = 200, num_leaves = 7 splits a 122-row leaf →
grid 61 > 50). compute-sanitizer reports:

Invalid __global__ write of size 4 bytes
  at ... GenDataToLeftBitVectorKernel<...> ... in cuda_data_partition.cu
  ... is N bytes after the nearest allocation ...

On most allocators the overflow lands in the allocation's slack and silently
corrupts adjacent device memory — predictions remain bit-identical, which is why
it has gone unnoticed — but it is undefined behaviour and can fault outright on a
stricter allocation layout / larger overflow.

Fix

Grow the two block-offset buffers on demand in Split(), right after the
per-leaf CalcBlockDim, mirroring the existing resize logic in
ResetTrainingData. The buffers only ever grow, so it is a no-op once the
high-water mark is reached.

Test

Adds test_cuda_data_partition_block_offset_no_overflow to test_dual.py,
parametrized over leaf sizes that hit the overflow band, asserting bit-identical
CPU/CUDA parity. Run under compute-sanitizer to observe the underlying invalid
write without the fix.

🤖 Generated with Claude Code

CUDADataPartition::CalcBlockDim is non-monotonic in num_data_in_leaf: it rounds the per-block data count up to a power of two, so a *smaller* leaf can require *more* blocks than the full dataset (e.g. with SPLIT_INDICES_BLOCK_SIZE=1024, grid(200)=50 but grid(140)=70, and grid(159)=80). The per-block offset buffers cuda_block_data_to_{left,right}_offset_ are sized only for the full-dataset grid (max_num_split_indices_blocks_, set at construction and in ResetTrainingData), so when Split() runs CalcBlockDim(num_data_in_leaf) for a leaf whose grid_dim_ exceeds that capacity, GenDataToLeftBitVectorKernel's PrepareOffset writes block_to_{left,right}_offset_buffer[blockIdx.x + 1] past the end of those buffers. This is reached by the bagged root (~bagging_fraction * num_data) and by any non-bagged leaf in the ~101-160 range on a dataset whose full-data grid is smaller. compute-sanitizer reports it as an invalid __global__ write of size 4; on most allocators the overflow lands in the allocation's slack and silently corrupts adjacent device memory (predictions stay bit-identical, which is why it went unnoticed), but it is undefined behaviour and can fault on a stricter allocation layout. Fix: grow the two block-offset buffers on demand in Split(), right after the per-leaf CalcBlockDim, mirroring the existing resize logic in ResetTrainingData. The buffers only ever grow, so this is a no-op once the high-water mark is hit. Adds a parametrized CPU/CUDA parity test over leaf sizes that hit the overflow band; it pins bit-identical parity for the scenario and is clean under compute-sanitizer with the fix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…block-offset-overflow # Conflicts: # tests/python_package_test/test_dual.py

BelixRogner · 2026-06-14T19:52:39Z

Thank you, Max — and thank you, Claude Code (separately 🙂). Deep-reviewed this one; verdict: NEEDS-FIX (the C++ is right, the rest needs a touch-up).

The fix itself is correct and fundamental. CalcBlockDim is non-monotonic — per-block size rounds up to a power of two, so a smaller leaf can need more blocks than the full dataset (grid(200)=50 but grid(140)=70, peaking ~80 around n∈[101,160]). Growing cuda_block_data_to_{left,right}_offset_ to the exact grid_dim_ + 1 in Split() (cuda_data_partition.cpp:156-161) is the right, magic-number-free fix, and it mirrors the existing grow-on-demand in ResetTrainingData. No int overflow at scale. 👍

Two things before merge:

Run ruff format. The bundled test_dual.py reformatting is backwards — it hand-wraps statements that fit inside 120 chars, which is exactly what trips the ruff format CI job. Letting the formatter re-collapse them clears the red. (I'll gently note this is the recurring "ran the kernels but not pre-commit" special 😄 — pre-commit run -a locally and CI goes green on the first try.)
Strengthen the parametrization. Of the five n values, 120/150/159 can't actually trigger the overflow — for those, the full-dataset grid is already the max possible, so no child leaf can exceed it. Only n=200/250 exercise the bug. Swap the dead values for ones where a child leaf lands in the 101-160 band while the root grid is smaller, and ideally add the bagged-root case your own comment calls out.

Heads up: master moved (5 PRs landed), so this needs a rebase onto current master. Good news — you now have write access to the repo, so once it's green you can rebase + merge it yourself. 🚀

maxwbuckley and others added 2 commits June 4, 2026 00:42

Merge remote-tracking branch 'belix/master' into cuda/data-partition-…

52fda52

…block-offset-overflow # Conflicts: # tests/python_package_test/test_dual.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[cuda] grow data-partition block-offset buffers for large per-leaf grids#22

[cuda] grow data-partition block-offset buffers for large per-leaf grids#22
maxwbuckley wants to merge 2 commits into
BelixRogner:masterfrom
maxwbuckley:cuda/data-partition-block-offset-overflow

maxwbuckley commented Jun 3, 2026

Uh oh!

BelixRogner commented Jun 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

maxwbuckley commented Jun 3, 2026

Problem

Fix

Test

Uh oh!

BelixRogner commented Jun 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants