[cuda] grow data-partition block-offset buffers for large per-leaf grids#22
Conversation
CUDADataPartition::CalcBlockDim is non-monotonic in num_data_in_leaf: it rounds
the per-block data count up to a power of two, so a *smaller* leaf can require
*more* blocks than the full dataset (e.g. with SPLIT_INDICES_BLOCK_SIZE=1024,
grid(200)=50 but grid(140)=70, and grid(159)=80). The per-block offset buffers
cuda_block_data_to_{left,right}_offset_ are sized only for the full-dataset grid
(max_num_split_indices_blocks_, set at construction and in ResetTrainingData),
so when Split() runs CalcBlockDim(num_data_in_leaf) for a leaf whose grid_dim_
exceeds that capacity, GenDataToLeftBitVectorKernel's PrepareOffset writes
block_to_{left,right}_offset_buffer[blockIdx.x + 1] past the end of those
buffers.
This is reached by the bagged root (~bagging_fraction * num_data) and by any
non-bagged leaf in the ~101-160 range on a dataset whose full-data grid is
smaller. compute-sanitizer reports it as an invalid __global__ write of size 4;
on most allocators the overflow lands in the allocation's slack and silently
corrupts adjacent device memory (predictions stay bit-identical, which is why it
went unnoticed), but it is undefined behaviour and can fault on a stricter
allocation layout.
Fix: grow the two block-offset buffers on demand in Split(), right after the
per-leaf CalcBlockDim, mirroring the existing resize logic in ResetTrainingData.
The buffers only ever grow, so this is a no-op once the high-water mark is hit.
Adds a parametrized CPU/CUDA parity test over leaf sizes that hit the overflow
band; it pins bit-identical parity for the scenario and is clean under
compute-sanitizer with the fix.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…block-offset-overflow # Conflicts: # tests/python_package_test/test_dual.py
|
Thank you, Max — and thank you, Claude Code (separately 🙂). Deep-reviewed this one; verdict: NEEDS-FIX (the C++ is right, the rest needs a touch-up). The fix itself is correct and fundamental. Two things before merge:
Heads up: master moved (5 PRs landed), so this needs a rebase onto current |
Problem
CUDADataPartition::CalcBlockDimis non-monotonic innum_data_in_leaf: itrounds the per-block data count up to a power of two, so a smaller leaf can
require more blocks than the full dataset. With
SPLIT_INDICES_BLOCK_SIZE_DATA_PARTITION = 1024:The per-block offset buffers
cuda_block_data_to_{left,right}_offset_are sizedonce for the full-dataset grid (
max_num_split_indices_blocks_, set atconstruction and in
ResetTrainingData). WhenSplit()callsCalcBlockDim(num_data_in_leaf)for a leaf whosegrid_dim_exceeds thatcapacity,
GenDataToLeftBitVectorKernel'sPrepareOffsetwritesblock_to_{left,right}_offset_buffer[blockIdx.x + 1]past the end of thosebuffers.
This is reached by the bagged root (
~bagging_fraction * num_data) and byany non-bagged leaf in the ~101–160 range on a dataset whose full-data grid
is smaller (e.g.
num_data= 200,num_leaves= 7 splits a 122-row leaf →grid 61 > 50).
compute-sanitizerreports:On most allocators the overflow lands in the allocation's slack and silently
corrupts adjacent device memory — predictions remain bit-identical, which is why
it has gone unnoticed — but it is undefined behaviour and can fault outright on a
stricter allocation layout / larger overflow.
Fix
Grow the two block-offset buffers on demand in
Split(), right after theper-leaf
CalcBlockDim, mirroring the existing resize logic inResetTrainingData. The buffers only ever grow, so it is a no-op once thehigh-water mark is reached.
Test
Adds
test_cuda_data_partition_block_offset_no_overflowtotest_dual.py,parametrized over leaf sizes that hit the overflow band, asserting bit-identical
CPU/CUDA parity. Run under
compute-sanitizerto observe the underlying invalidwrite without the fix.
🤖 Generated with Claude Code