[cuda] Fix two illegal-memory-access crashes in CUDA bagging by maxwbuckley · Pull Request #28 · BelixRogner/ExaBoost

maxwbuckley · 2026-06-14T15:35:21Z

Problem

CUDA training with bagging (`bagging_fraction < 1`, `bagging_freq > 0`) aborts with `[CUDA] an illegal memory access was encountered` on any non-trivial dataset, independent of tree depth. There are two distinct root causes, both in the bagged code path:

1. Out-of-bag score update dereferences freed tree structure (`cuda_tree.cu`)

`CUDATree::ToHost()` frees the per-tree GPU tree-structure arrays to bound device memory across rounds, keeping only `cuda_leaf_value_`. But `AddPredictionToScoreKernel` traverses the whole tree, and under bagging `GBDT::UpdateScore` launches it after `ToHost()` for the out-of-bag samples, so the kernel dereferences the freed/null device pointers. Fix: re-upload the structure from the host arrays for the duration of that launch, then free it again.
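The re-upload-then-free pattern can be illustrated with a small host-side model. This is a minimal Python sketch, not the PR's CUDA code: `FakeDeviceTree`, `temporarily_resident`, and `add_prediction_to_score` are hypothetical names, a plain dict stands in for device memory, and the stump branch adding its single leaf value is an assumption of the sketch.

```python
from contextlib import contextmanager


class FakeDeviceTree:
    """Hypothetical stand-in for a tree whose device-side structure can be freed."""

    def __init__(self, split_feature, threshold, left_child, right_child, leaf_value):
        # Host copies stay populated for the tree's whole lifetime.
        self.host = {
            "split_feature": split_feature,
            "threshold": threshold,
            "left_child": left_child,
            "right_child": right_child,
        }
        self.leaf_value = leaf_value   # the only array kept after to_host()
        self.device = dict(self.host)  # simulated device arrays

    @property
    def num_leaves(self):
        return len(self.leaf_value)

    def to_host(self):
        # Models the memory optimization: drop the structure arrays.
        self.device = None


@contextmanager
def temporarily_resident(tree):
    """Re-upload the structure if it was freed, and free it again on exit."""
    reuploaded = tree.device is None and tree.num_leaves > 1
    if reuploaded:
        tree.device = dict(tree.host)
    try:
        yield tree
    finally:
        if reuploaded:
            tree.device = None


def add_prediction_to_score(tree, rows, scores):
    """Traverse the tree for each row and add the reached leaf value to its score."""
    if tree.num_leaves <= 1:
        # A stump has no internal nodes, so the traversal is skipped.
        # Sketch assumption: it still contributes its single leaf value.
        for i in range(len(rows)):
            scores[i] += tree.leaf_value[0]
        return
    with temporarily_resident(tree) as t:
        d = t.device
        for i, row in enumerate(rows):
            node = 0
            while node >= 0:  # a negative child index ~k encodes leaf k
                go_left = row[d["split_feature"][node]] <= d["threshold"][node]
                node = d["left_child"][node] if go_left else d["right_child"][node]
            scores[i] += t.leaf_value[~node]


# Usage: one split on feature 0 at 0.5, two leaves, structure already freed.
tree = FakeDeviceTree([0], [0.5], [-1], [-2], [1.0, 2.0])
tree.to_host()
oob_scores = [0.0, 0.0]
add_prediction_to_score(tree, [[0.2], [0.9]], oob_scores)
```

After the call, the simulated structure is freed again, so the memory saving from `to_host()` is preserved between launches.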

2. Data-partition block-offset buffer overflow (`cuda_data_partition.cpp`)

`CUDADataPartition::CalcBlockDim` is non-monotonic in the row count, so a bagged leaf (roughly `bagging_fraction * n` rows) can need more blocks than the full dataset, overflowing the per-block offset buffers. Fix: grow the buffers to the largest grid any leaf can need.
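To see how a smaller leaf can request a larger grid, consider a toy block-size heuristic. This is a sketch under stated assumptions: `calc_block_dim` below is a hypothetical heuristic written for illustration, not the body of `CalcBlockDim`, and all three constants are made up.

```python
MIN_THREADS = 32     # hypothetical smallest block size
MAX_THREADS = 1024   # hypothetical largest block size
TARGET_BLOCKS = 64   # hypothetical grid size the heuristic tries to stay under


def calc_block_dim(num_rows):
    """Hypothetical heuristic: double the block size until the grid is small enough."""
    threads = MIN_THREADS
    while -(-num_rows // threads) > TARGET_BLOCKS and threads < MAX_THREADS:
        threads *= 2
    num_blocks = -(-num_rows // threads)  # ceiling division
    return num_blocks, threads


def max_blocks_for_any_leaf(num_rows):
    """Largest grid that any leaf holding 1..num_rows rows can request."""
    return max(calc_block_dim(n)[0] for n in range(1, num_rows + 1))


full = 2560
bagged = full * 4 // 5  # a leaf holding a bagging_fraction = 0.8 sample

blocks_full, _ = calc_block_dim(full)      # 40 blocks of 64 threads
blocks_bagged, _ = calc_block_dim(bagged)  # 64 blocks of 32 threads
safe_capacity = max_blocks_for_any_leaf(full)
```

A buffer sized from `blocks_full` would be too small for the bagged leaf here, while `safe_capacity` covers every leaf size up to the full dataset.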

Test

Parametrized CPU/CUDA parity test (`test_dual.py::test_cuda_bagging_does_not_crash_and_matches_cpu`). Before: the run aborts the interpreter. After: CUDA output is finite and tracks CPU within the documented bag-sampling divergence (lightgbm-org#6055).
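The kind of check such a parity test performs could look like the helper below. This is an illustrative sketch, not the code in `test_dual.py`: the name `check_finite_and_close`, the mean-absolute-difference metric, and the tolerance argument are all assumptions.

```python
import math


def check_finite_and_close(cpu_pred, cuda_pred, max_mean_abs_diff):
    """Raise if CUDA output has NaN/inf or drifts too far from CPU on average."""
    if len(cpu_pred) != len(cuda_pred):
        raise AssertionError("prediction lengths differ")
    if not all(math.isfinite(p) for p in cuda_pred):
        raise AssertionError("CUDA produced NaN or inf")
    total = sum(abs(a - b) for a, b in zip(cpu_pred, cuda_pred))
    mean_abs_diff = total / len(cpu_pred)
    if mean_abs_diff > max_mean_abs_diff:
        raise AssertionError(f"mean |cpu - cuda| = {mean_abs_diff}")
    return mean_abs_diff
```

Comparing an aggregate difference rather than demanding bitwise equality leaves room for the two backends to draw different bags.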

Verification

RTX 5090 (sm_120, CUDA 13.2): before, every bagged run crashed; after, deep (depth-12 / 1024-leaf) bagged training on 2.75M × 2748 runs to completion.

🤖 Generated with Claude Code

CUDA training with bagging (bagging_fraction < 1, bagging_freq > 0) aborted with "[CUDA] an illegal memory access was encountered" for any non-trivial dataset.

Root cause: CUDATree::ToHost() frees the per-tree GPU tree-structure arrays (cuda_split_feature_inner_, cuda_left_child_, cuda_right_child_, cuda_threshold_in_bin_, cuda_decision_type_, ...) to bound device memory across many boosting rounds, keeping only cuda_leaf_value_ -- with the assumption that leaf values are "the only field needed post-train". But AddPredictionToScoreKernel traverses the *whole* tree, and GBDT::UpdateScore launches it post-ToHost for the out-of-bag samples whenever bagging is on (see gbdt.cpp: train_score_updater_->AddScore(tree, cuda_bag_data_indices ...)). Those freed/null device pointers are then dereferenced inside the kernel (cuda_tree.cu: cuda_split_feature_inner[node]), crashing the process.

Fix: in LaunchAddPredictionToScoreKernel, when the structure arrays have been freed (Size()==0 with internal nodes present), re-upload them from the still-populated host arrays for the duration of the launch and free them again afterwards, so the memory optimization is preserved while the out-of-bag pass gets a valid device tree to traverse. Stumps (num_leaves_ <= 1) have no internal nodes and skip the traversal.

Adds a parametrized CPU/CUDA parity regression test in test_dual.py that trains with bagging across several sizes/fractions; before the fix it aborted the interpreter, and after the fix CPU and CUDA agree within the bag-sampling divergence documented as expected in upstream lightgbm-org#6055.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Second of two CUDA bagging crash fixes.

CUDADataPartition::CalcBlockDim is non-monotonic, so a bagged leaf (~bagging_fraction*n) can need more blocks than the full dataset, overflowing the per-block offset buffers and crashing with an illegal memory access. Grow the buffers to the max grid any leaf can need. The bagging parity test added in the previous commit guards both this and the OOB-score-update crash.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

maxwbuckley and others added 2 commits June 14, 2026 17:27

maxwbuckley wants to merge 2 commits into BelixRogner:master from maxwbuckley:cuda/fix-bagging-oob-score-crash
