[cuda] Fix two illegal-memory-access crashes in CUDA bagging #28
Open
maxwbuckley wants to merge 2 commits into
CUDA training with bagging (bagging_fraction < 1, bagging_freq > 0) aborted with "[CUDA] an illegal memory access was encountered" for any non-trivial dataset.

Root cause: CUDATree::ToHost() frees the per-tree GPU tree-structure arrays (cuda_split_feature_inner_, cuda_left_child_, cuda_right_child_, cuda_threshold_in_bin_, cuda_decision_type_, ...) to bound device memory across many boosting rounds, keeping only cuda_leaf_value_ -- on the assumption that leaf values are "the only field needed post-train". But AddPredictionToScoreKernel traverses the *whole* tree, and GBDT::UpdateScore launches it post-ToHost for the out-of-bag samples whenever bagging is on (see gbdt.cpp: train_score_updater_->AddScore(tree, cuda_bag_data_indices ...)). Those freed/null device pointers are then dereferenced inside the kernel (cuda_tree.cu: cuda_split_feature_inner[node]), crashing the process.

Fix: in LaunchAddPredictionToScoreKernel, when the structure arrays have been freed (Size()==0 with internal nodes present), re-upload them from the still-populated host arrays for the duration of the launch and free them again afterwards, so the memory optimization is preserved while the out-of-bag pass gets a valid device tree to traverse. Stumps (num_leaves_ <= 1) have no internal nodes and skip the traversal.

Adds a parametrized CPU/CUDA parity regression test in test_dual.py that trains with bagging across several sizes/fractions; before the fix it aborted the interpreter, after it CPU and CUDA agree within the bag-sampling divergence documented as expected in upstream lightgbm-org#6055.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
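The re-upload-then-free pattern described in this commit can be modelled in a few lines of plain Python. This is only a toy sketch, not the CUDA code from the PR: ToyDeviceTree, to_host, structure_on_device and add_prediction_to_score are hypothetical names, the "device" is an ordinary dict, and the convention that a negative child index encodes leaf ~index is an assumption of the toy.

```python
from contextlib import contextmanager


class ToyDeviceTree:
    """Hypothetical stand-in for a tree whose device-side structure can be freed."""

    def __init__(self, split_feature, threshold, left_child, right_child, leaf_value):
        # "Host" copies stay populated for the lifetime of the tree.
        self.host = {
            "split_feature": list(split_feature),
            "threshold": list(threshold),
            "left_child": list(left_child),
            "right_child": list(right_child),
        }
        self.leaf_value = list(leaf_value)  # kept even after to_host()
        self.device = {k: list(v) for k, v in self.host.items()}

    @property
    def num_leaves(self):
        return len(self.leaf_value)

    def to_host(self):
        # Models the memory optimization: drop the structure, keep leaf values.
        self.device = None

    @contextmanager
    def structure_on_device(self):
        # Re-upload only when the structure was freed and internal nodes exist.
        reuploaded = self.device is None and self.num_leaves > 1
        if reuploaded:
            self.device = {k: list(v) for k, v in self.host.items()}
        try:
            yield
        finally:
            if reuploaded:
                self.device = None  # free again so the optimization is preserved

    def predict_row(self, row):
        if self.num_leaves <= 1:
            return self.leaf_value[0]  # stump: no traversal needed
        d = self.device  # None here plays the role of the freed device pointers
        node = 0
        while node >= 0:
            if row[d["split_feature"][node]] <= d["threshold"][node]:
                node = d["left_child"][node]
            else:
                node = d["right_child"][node]
        return self.leaf_value[~node]  # toy assumption: leaf index is ~node


def add_prediction_to_score(tree, rows, score):
    # The traversal needs the whole structure, so guard it with the re-upload.
    with tree.structure_on_device():
        for i, row in enumerate(rows):
            score[i] += tree.predict_row(row)


tree = ToyDeviceTree([0], [0.5], [-1], [-2], [1.0, 2.0])
tree.to_host()
score = [0.0, 0.0]
add_prediction_to_score(tree, [[0.2], [0.9]], score)
print(score)  # [1.0, 2.0]
```

Calling tree.predict_row directly after to_host() fails in the toy because the structure is gone, which mirrors the crash; the guarded path succeeds and leaves the structure freed afterwards.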
Second of two CUDA bagging crash fixes.

CUDADataPartition::CalcBlockDim is non-monotonic, so a bagged leaf (~bagging_fraction*n) can need more blocks than the full dataset, overflowing the per-block offset buffers and crashing with an illegal memory access. Grow the buffers to the max grid any leaf can need.

The bagging parity test added in the previous commit guards both this and the OOB-score-update crash.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
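To see how a smaller leaf can need a larger grid, consider the toy sizing rule below. It is a hypothetical illustration, not the actual CalcBlockDim: toy_grid_dim, the 1024-thread cap and the 80-block floor are all assumptions chosen so that the block size snaps to a power of two, which is enough to make the grid non-monotonic in the leaf size.

```python
def next_pow2(x):
    """Smallest power of two that is >= x (for x >= 1)."""
    p = 1
    while p < x:
        p <<= 1
    return p


def toy_grid_dim(num_data, max_block=1024, min_blocks=80):
    """Hypothetical grid-size rule; NOT the real CalcBlockDim."""
    num_blocks = max(min_blocks, -(-num_data // max_block))  # ceil division
    per_block = -(-num_data // num_blocks)
    block_dim = min(max_block, next_pow2(per_block))  # snap to a power of two
    return -(-num_data // block_dim)


full_n = 41040    # full dataset: 513 rows per block, so block_dim snaps to 1024
bagged_n = 40960  # slightly smaller bagged leaf: exactly 512 rows per block

grid_full = toy_grid_dim(full_n)
grid_bagged = toy_grid_dim(bagged_n)
print(grid_full, grid_bagged)  # 41 80

# Sizing the per-block buffers from the full dataset alone is too small here.
unsafe_buffer_len = grid_full
# Sizing from the largest grid any leaf size can request is safe.
safe_buffer_len = max(toy_grid_dim(n) for n in range(1, full_n + 1))
print(unsafe_buffer_len, safe_buffer_len)  # 41 80
```

The brute-force maximum is only for illustration; a real implementation would more plausibly derive a closed-form upper bound rather than scan every possible leaf size.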
Problem

CUDA training with bagging (bagging_fraction < 1, bagging_freq > 0) aborts with "[CUDA] an illegal memory access was encountered" for any non-trivial dataset, independent of tree depth. There are two distinct root causes, both in the bagged code path:

1. Out-of-bag score update dereferences freed tree structure (cuda_tree.cu)

CUDATree::ToHost() frees the per-tree GPU tree-structure arrays to bound device memory across rounds, keeping only cuda_leaf_value_. But AddPredictionToScoreKernel traverses the whole tree, and GBDT::UpdateScore launches it post-ToHost for the out-of-bag samples under bagging, dereferencing the freed/null device pointers. Fix: re-upload the structure from the host arrays for the duration of that launch, then free it again.

2. Data-partition block-offset buffer overflow (cuda_data_partition.cpp)

CalcBlockDim is non-monotonic, so a bagged leaf can need more blocks than the full dataset, overflowing the per-block offset buffers. Fix: grow the buffers to the max grid any leaf can need.

Test

Parametrized CPU/CUDA parity test (test_dual.py::test_cuda_bagging_does_not_crash_and_matches_cpu). Before: aborts the interpreter. After: CUDA output is finite and tracks CPU within the documented bag-sampling divergence (lightgbm-org#6055).

Verification

RTX 5090 (sm_120, CUDA 13.2): before, every bagged run crashed; after, deep (depth-12 / 1024-leaf) bagged training on 2.75M × 2748 runs to completion.
🤖 Generated with Claude Code