Skip to content

[cuda] grow data-partition block-offset buffers for large per-leaf grids#22

Open
maxwbuckley wants to merge 2 commits into
BelixRogner:masterfrom
maxwbuckley:cuda/data-partition-block-offset-overflow
Open

[cuda] grow data-partition block-offset buffers for large per-leaf grids#22
maxwbuckley wants to merge 2 commits into
BelixRogner:masterfrom
maxwbuckley:cuda/data-partition-block-offset-overflow

Conversation

@maxwbuckley

Copy link
Copy Markdown
Collaborator

Problem

CUDADataPartition::CalcBlockDim is non-monotonic in num_data_in_leaf: it
rounds the per-block data count up to a power of two, so a smaller leaf can
require more blocks than the full dataset. With SPLIT_INDICES_BLOCK_SIZE_DATA_PARTITION = 1024:

leaf size grid_dim_
200 50
159 80
140 70
120 61

The per-block offset buffers cuda_block_data_to_{left,right}_offset_ are sized
once for the full-dataset grid (max_num_split_indices_blocks_, set at
construction and in ResetTrainingData). When Split() calls
CalcBlockDim(num_data_in_leaf) for a leaf whose grid_dim_ exceeds that
capacity, GenDataToLeftBitVectorKernel's PrepareOffset writes
block_to_{left,right}_offset_buffer[blockIdx.x + 1] past the end of those
buffers.

This is reached by the bagged root (~bagging_fraction * num_data) and by
any non-bagged leaf in the ~101–160 range on a dataset whose full-data grid
is smaller (e.g. num_data = 200, num_leaves = 7 splits a 122-row leaf →
grid 61 > 50). compute-sanitizer reports:

Invalid __global__ write of size 4 bytes
  at ... GenDataToLeftBitVectorKernel<...> ... in cuda_data_partition.cu
  ... is N bytes after the nearest allocation ...

On most allocators the overflow lands in the allocation's slack and silently
corrupts adjacent device memory — predictions remain bit-identical, which is why
it has gone unnoticed — but it is undefined behaviour and can fault outright on a
stricter allocation layout / larger overflow.

Fix

Grow the two block-offset buffers on demand in Split(), right after the
per-leaf CalcBlockDim, mirroring the existing resize logic in
ResetTrainingData. The buffers only ever grow, so it is a no-op once the
high-water mark is reached.

Test

Adds test_cuda_data_partition_block_offset_no_overflow to test_dual.py,
parametrized over leaf sizes that hit the overflow band, asserting bit-identical
CPU/CUDA parity. Run under compute-sanitizer to observe the underlying invalid
write without the fix.

🤖 Generated with Claude Code

maxwbuckley and others added 2 commits June 4, 2026 00:42
CUDADataPartition::CalcBlockDim is non-monotonic in num_data_in_leaf: it rounds
the per-block data count up to a power of two, so a *smaller* leaf can require
*more* blocks than the full dataset (e.g. with SPLIT_INDICES_BLOCK_SIZE=1024,
grid(200)=50 but grid(140)=70, and grid(159)=80). The per-block offset buffers
cuda_block_data_to_{left,right}_offset_ are sized only for the full-dataset grid
(max_num_split_indices_blocks_, set at construction and in ResetTrainingData),
so when Split() runs CalcBlockDim(num_data_in_leaf) for a leaf whose grid_dim_
exceeds that capacity, GenDataToLeftBitVectorKernel's PrepareOffset writes
block_to_{left,right}_offset_buffer[blockIdx.x + 1] past the end of those
buffers.

This is reached by the bagged root (~bagging_fraction * num_data) and by any
non-bagged leaf in the ~101-160 range on a dataset whose full-data grid is
smaller. compute-sanitizer reports it as an invalid __global__ write of size 4;
on most allocators the overflow lands in the allocation's slack and silently
corrupts adjacent device memory (predictions stay bit-identical, which is why it
went unnoticed), but it is undefined behaviour and can fault on a stricter
allocation layout.

Fix: grow the two block-offset buffers on demand in Split(), right after the
per-leaf CalcBlockDim, mirroring the existing resize logic in ResetTrainingData.
The buffers only ever grow, so this is a no-op once the high-water mark is hit.

Adds a parametrized CPU/CUDA parity test over leaf sizes that hit the overflow
band; it pins bit-identical parity for the scenario and is clean under
compute-sanitizer with the fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…block-offset-overflow

# Conflicts:
#	tests/python_package_test/test_dual.py
@BelixRogner

Copy link
Copy Markdown
Owner

Thank you, Max — and thank you, Claude Code (separately 🙂). Deep-reviewed this one; verdict: NEEDS-FIX (the C++ is right, the rest needs a touch-up).

The fix itself is correct and fundamental. CalcBlockDim is non-monotonic — per-block size rounds up to a power of two, so a smaller leaf can need more blocks than the full dataset (grid(200)=50 but grid(140)=70, peaking ~80 around n∈[101,160]). Growing cuda_block_data_to_{left,right}_offset_ to the exact grid_dim_ + 1 in Split() (cuda_data_partition.cpp:156-161) is the right, magic-number-free fix, and it mirrors the existing grow-on-demand in ResetTrainingData. No int overflow at scale. 👍

Two things before merge:

  1. Run ruff format. The bundled test_dual.py reformatting is backwards — it hand-wraps statements that fit inside 120 chars, which is exactly what trips the ruff format CI job. Letting the formatter re-collapse them clears the red. (I'll gently note this is the recurring "ran the kernels but not pre-commit" special 😄 — pre-commit run -a locally and CI goes green on the first try.)
  2. Strengthen the parametrization. Of the five n values, 120/150/159 can't actually trigger the overflow — for those, the full-dataset grid is already the max possible, so no child leaf can exceed it. Only n=200/250 exercise the bug. Swap the dead values for ones where a child leaf lands in the 101-160 band while the root grid is smaller, and ideally add the bagged-root case your own comment calls out.

Heads up: master moved (5 PRs landed), so this needs a rebase onto current master. Good news — you now have write access to the repo, so once it's green you can rebase + merge it yourself. 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants