
[cuda] Fix two illegal-memory-access crashes in CUDA bagging #28

Open
maxwbuckley wants to merge 2 commits into BelixRogner:master from maxwbuckley:cuda/fix-bagging-oob-score-crash

Conversation

@maxwbuckley

Collaborator

Problem

CUDA training with bagging (bagging_fraction < 1, bagging_freq > 0) aborts with "[CUDA] an illegal memory access was encountered" on any non-trivial dataset, independent of tree depth. There are two distinct root causes, both in the bagged code path:

1. Out-of-bag score update dereferences freed tree structure (cuda_tree.cu)

CUDATree::ToHost() frees the per-tree GPU tree-structure arrays to bound device memory across rounds, keeping only cuda_leaf_value_. But AddPredictionToScoreKernel traverses the whole tree, and GBDT::UpdateScore launches it post-ToHost for the out-of-bag samples under bagging — dereferencing the freed/null device pointers. Fix: re-upload the structure from the host arrays for the duration of that launch, then free it again.

2. Data-partition block-offset buffer overflow (cuda_data_partition.cpp)

CalcBlockDim is non-monotonic, so a bagged leaf can need more blocks than the full dataset, overflowing the per-block offset buffers. Fix: grow the buffers to the max grid any leaf can need.
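
To make the non-monotonicity concrete, here is a minimal host-only sketch. ToyCalcBlockDim is a hypothetical stand-in, not the real CUDADataPartition::CalcBlockDim: its constants (kMaxBlockSize, kMinBlocks) and its power-of-two rounding are assumptions chosen only to reproduce the effect. MaxGridDim is likewise an illustrative brute-force bound, not necessarily how this PR sizes the buffers.

```cpp
#include <algorithm>
#include <cassert>

// Toy launch-dimension heuristic. NOT LightGBM's CalcBlockDim: the constants
// and the rounding rule are assumptions picked to show the non-monotonicity.
struct LaunchDims {
  int grid_dim;
  int block_dim;
};

LaunchDims ToyCalcBlockDim(int num_data) {
  assert(num_data > 0);
  const int kMaxBlockSize = 512;  // assumed upper bound on threads per block
  const int kMinBlocks = 80;      // assumed floor on the target block count
  const int target_blocks =
      std::max(kMinBlocks, (num_data + kMaxBlockSize - 1) / kMaxBlockSize);
  const int rows_per_block = (num_data + target_blocks - 1) / target_blocks;
  // Round the block size up to a power of two, with one warp as the minimum.
  int block_dim = 32;
  while (block_dim < rows_per_block) block_dim <<= 1;
  // The grid is derived from the rounded block size, so it can drop sharply
  // when rows_per_block crosses a power of two.
  const int grid_dim = (num_data + block_dim - 1) / block_dim;
  return {grid_dim, block_dim};
}

// Largest grid that any leaf of size 1..max_num_data can request.
// Brute-force scan for clarity; illustrative only.
int MaxGridDim(int max_num_data) {
  int max_grid = 0;
  for (int n = 1; n <= max_num_data; ++n) {
    max_grid = std::max(max_grid, ToyCalcBlockDim(n).grid_dim);
  }
  return max_grid;
}
```

In this toy, a full dataset of 5121 rows requests 41 blocks while a bagged leaf of 4000 rows requests 63, so per-block buffers sized from the full dataset alone would be too small. Sizing them from MaxGridDim(5121), which is 80 here, covers every leaf.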

Test

Parametrized CPU/CUDA parity test (test_dual.py::test_cuda_bagging_does_not_crash_and_matches_cpu). Before: the run aborts the interpreter. After: CUDA output is finite and tracks CPU within the documented bag-sampling divergence (lightgbm-org#6055).

Verification

RTX 5090 (sm_120, CUDA 13.2): before, every bagged run crashed; after, deep (depth-12 / 1024-leaf) bagged training on 2.75M × 2748 runs to completion.

🤖 Generated with Claude Code

maxwbuckley and others added 2 commits June 14, 2026 17:27
CUDA training with bagging (bagging_fraction < 1, bagging_freq > 0) aborted
with "[CUDA] an illegal memory access was encountered" for any non-trivial
dataset.

Root cause: CUDATree::ToHost() frees the per-tree GPU tree-structure arrays
(cuda_split_feature_inner_, cuda_left_child_, cuda_right_child_,
cuda_threshold_in_bin_, cuda_decision_type_, ...) to bound device memory
across many boosting rounds, keeping only cuda_leaf_value_ -- with the
assumption that leaf values are "the only field needed post-train". But
AddPredictionToScoreKernel traverses the *whole* tree, and GBDT::UpdateScore
launches it post-ToHost for the out-of-bag samples whenever bagging is on
(see gbdt.cpp: train_score_updater_->AddScore(tree, cuda_bag_data_indices ...)).
Those freed/null device pointers are then dereferenced inside the kernel
(cuda_tree.cu: cuda_split_feature_inner[node]), crashing the process.

Fix: in LaunchAddPredictionToScoreKernel, when the structure arrays have been
freed (Size()==0 with internal nodes present), re-upload them from the
still-populated host arrays for the duration of the launch and free them again
afterwards, so the memory optimization is preserved while the out-of-bag pass
gets a valid device tree to traverse. Stumps (num_leaves_ <= 1) have no
internal nodes and skip the traversal.
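
The re-upload-then-free pattern described above can be sketched on the host as a scope guard. This is an illustration under stated assumptions, not the code in this commit: FakeDeviceArray, ToyTree, ScopedStructureUpload and AddPredictionToScore are hypothetical names, "device" memory is simulated with std::vector, leaves are assumed to be encoded as bitwise-complemented child indices, and a stump is assumed to contribute its single leaf value.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side stand-in for a device array. A real version would wrap
// cudaMalloc / cudaMemcpy / cudaFree; everything here is hypothetical.
template <typename T>
struct FakeDeviceArray {
  std::vector<T> data;
  std::size_t Size() const { return data.size(); }
  void Upload(const std::vector<T>& host) { data = host; }
  void Free() { std::vector<T>().swap(data); }
};

struct ToyTree {
  int num_leaves = 1;
  // Host copies stay populated for the tree's whole lifetime.
  std::vector<int> host_split_feature, host_left_child, host_right_child;
  std::vector<double> host_threshold;
  // "Device" copies: structure may have been freed, leaf values are kept.
  FakeDeviceArray<int> dev_split_feature, dev_left_child, dev_right_child;
  FakeDeviceArray<double> dev_threshold, dev_leaf_value;
};

// Re-uploads the structure arrays only if they are missing, and frees them
// on scope exit only if this guard was the one that uploaded them.
class ScopedStructureUpload {
 public:
  explicit ScopedStructureUpload(ToyTree* tree) : tree_(tree) {
    const bool has_internal_nodes = tree_->num_leaves > 1;
    if (has_internal_nodes && tree_->dev_left_child.Size() == 0) {
      tree_->dev_split_feature.Upload(tree_->host_split_feature);
      tree_->dev_left_child.Upload(tree_->host_left_child);
      tree_->dev_right_child.Upload(tree_->host_right_child);
      tree_->dev_threshold.Upload(tree_->host_threshold);
      uploaded_ = true;
    }
  }
  ~ScopedStructureUpload() {
    if (uploaded_) {
      tree_->dev_split_feature.Free();
      tree_->dev_left_child.Free();
      tree_->dev_right_child.Free();
      tree_->dev_threshold.Free();
    }
  }
  ScopedStructureUpload(const ScopedStructureUpload&) = delete;
  ScopedStructureUpload& operator=(const ScopedStructureUpload&) = delete;

 private:
  ToyTree* tree_;
  bool uploaded_ = false;
};

// Stand-in for the kernel launch: returns score plus the tree's prediction
// for a single row.
double AddPredictionToScore(ToyTree* tree, const std::vector<double>& row,
                            double score) {
  if (tree->num_leaves <= 1) {
    // Stump: no internal nodes, so no traversal and no re-upload (assumed to
    // add its single leaf value).
    return score + tree->dev_leaf_value.data[0];
  }
  ScopedStructureUpload guard(tree);  // structure is valid within this scope
  int node = 0;
  while (node >= 0) {  // a negative child marks a leaf, stored as ~leaf_index
    const bool go_left =
        row[tree->dev_split_feature.data[node]] <= tree->dev_threshold.data[node];
    node = go_left ? tree->dev_left_child.data[node]
                   : tree->dev_right_child.data[node];
  }
  return score + tree->dev_leaf_value.data[~node];
}
```

Tying the free to the guard's destructor means the structure arrays are released on every exit path from the launch, which keeps the memory bound that ToHost() was introduced for.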

Adds a parametrized CPU/CUDA parity regression test in test_dual.py that trains
with bagging across several sizes/fractions. Before the fix it aborted the
interpreter; after the fix, CPU and CUDA agree within the bag-sampling divergence
documented as expected in upstream lightgbm-org#6055.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Second of two CUDA bagging crash fixes. CUDADataPartition::CalcBlockDim is non-monotonic, so a bagged leaf (~bagging_fraction*n) can need more blocks than the full dataset, overflowing the per-block offset buffers and crashing with an illegal memory access. Grow the buffers to the max grid any leaf can need. The bagging parity test added in the previous commit guards both this and the OOB-score-update crash.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>