[cuda] implement feature_contri per-feature gain scaling on the CUDA tree learner#19
Open
maxwbuckley wants to merge 36 commits into
Open
[cuda] implement feature_contri per-feature gain scaling on the CUDA tree learner#19maxwbuckley wants to merge 36 commits into
maxwbuckley wants to merge 36 commits into
Conversation
Two related bugs caused CUDA to ignore the `max_depth` parameter: 1. CUDABestSplitFinder::FindBestSplitsForLeaf had no max_depth check. CPU's SerialTreeLearner::BeforeFindBestSplit invalidates a leaf's gain when its depth has reached config_->max_depth, but the CUDA path never did the equivalent. 2. CUDATree::Split / SplitCategorical updated the GPU-side cuda_leaf_depth_ via the launch kernel but never updated the host-side leaf_depth_ vector, so tree->leaf_depth(idx) always returned 0 on CUDA. Without (2), even adding the check at (1) would have done nothing. Symptom (max_depth=2, varying num_leaves): num_leaves= 4: cpu depth=2 leaves=4 | cuda depth=2 leaves=4 num_leaves= 7: cpu depth=2 leaves=4 | cuda depth=3 leaves=7 num_leaves=15: cpu depth=2 leaves=4 | cuda depth=5 leaves=15 num_leaves=31: cpu depth=2 leaves=4 | cuda depth=7 leaves=31 After fix, CUDA caps at the requested depth (2) for every num_leaves. Fix: * Mirror the host-side leaf_depth_ update in CUDATree::Split and CUDATree::SplitCategorical (matching CPU Tree::Split's behavior in include/LightGBM/tree.h). * Plumb a `smaller_leaf_below_max_depth` / `larger_leaf_below_max_depth` flag pair into FindBestSplitsForLeaf and AND them into the is_*_leaf_valid checks. The caller in cuda_single_gpu_tree_learner.cpp computes them as `config_->max_depth <= 0 || tree->leaf_depth(idx) < config_->max_depth`. Verified with the cpu/cuda parity sweep: reg_max_depth case (which used max_depth=3 with num_leaves=7) now matches CPU at FP epsilon, down from max|Δ|=0.25 raw_score. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Regression coverage for the prior commit (CUDA tree learner now
enforces max_depth). Two parametrized tests, gated on
LIGHTGBM_TEST_CUDA=1:
- test_cuda_respects_max_depth: across (max_depth, num_leaves)
combinations from {1,2,3,5} x {2,4,7,31}, asserts CUDA tree depth
is at most max_depth and matches CPU depth + leaf count exactly.
- test_cuda_max_depth_matches_cpu_predictions: end-to-end check
that 5 boosting rounds with max_depth=3 produce CPU/CUDA
predictions matching at FP epsilon. Without the fix, this
diverged by max|Δ|=0.47.
Verified: with the prior commit reverted, 5 of 9 cases fail
(those where num_leaves > 2^max_depth, i.e. where the bug actually
triggered). With the fix applied, all 9 pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CUDA 13.0 removed offline-compilation support for Maxwell (sm_50/52/53),
Pascal (sm_60/61/62), and Volta (sm_70/72). With nvcc 13.x, the
unconditional inclusion of sm_60/61/62/70 in CUDA_ARCHS causes the
build to fail with:
nvcc fatal : Unsupported gpu architecture 'compute_60'
Gate those architectures behind a CUDAToolkit_VERSION VERSION_LESS
"13.0" check. With CUDA >= 13.0 the initial list starts at "75"
(Turing); the existing version-conditional appends below add 80, 86,
87, 89, 90, 100, 120 as appropriate.
Verified locally with CUDA 13.2 + RTX 5090 (sm_120): builds and
installs cleanly without any other changes.
Reference for the dropped capabilities:
https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cpplint's --root=.. workaround derived the expected header-guard prefix from the parent directory name. After renaming the repo to ExaBoost, that prefix changed from LIGHTGBM_INCLUDE_*_H_ to EXABOOST_INCLUDE_*_H_ and every header now fails build/header_guard. We deliberately did not rename the C/C++ symbols (still LightGBM, LGBM_*, import lightgbm) to keep ExaBoost binary-compatible. Disable the header-guard check in the cpplint pre-commit hook to match the existing setup in .ci/lint-cpp.sh. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Builds with -DCMAKE_CUDA_ARCHITECTURES (e.g. "120-real" for a single-GPU local iteration on RTX 5090) currently get overwritten unconditionally by the toolkit-version-driven CUDA_ARCHS list, producing a multi-arch build that takes much longer to compile and isn't what the user asked for. Wrap the existing toolkit-version logic in a check that only applies it when CMAKE_CUDA_ARCHITECTURES is unset or empty. When the user passes it explicitly, use their value verbatim. No behavior change for users who don't pass the flag. Composes with lightgbm-org#5 (the toolkit-version gating for CUDA 13.x dropped archs) — both branches together give a sane default that adapts to the toolkit, plus an escape hatch for fast local iteration. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aligns with the existing convention used by test_engine.py's CUDA-only tests. Addresses Felix's review note (same change going on lightgbm-org#6/lightgbm-org#8/lightgbm-org#9/lightgbm-org#10). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
sklearn>=1.9 dev routes check_classification_targets and LabelEncoder
through narwhals, which raises TypeError on a bare pyarrow Array /
ChunkedArray ("Please set `allow_series=True` or `series_only=True`")
because sklearn does not pass that flag. The Python - latest versions
(manylinux_2_28) CI job has been failing for 18 test variants of
test_classification_and_regression_minimally_work_with_all_accepted_data_types
on every PR for this reason.
We advertise pyarrow Array / ChunkedArray as accepted label types
(_LGBM_LabelType), so the user-facing contract should be preserved.
Convert eagerly to numpy at the top of LGBMClassifier.fit, before
calling into sklearn — _LGBMAssertAllFinite, _LGBMCheckClassificationTargets,
and _LGBMLabelEncoder all see a familiar 1-D array.
No behavior change for non-pyarrow y. Regression tests (LGBMRegressor)
don't hit this path because they don't call check_classification_targets;
they were already passing.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Compares matched configurations on CPU and CUDA at tight tolerance (1e-5 raw_score, exact tree structure). Initial run on 19 tiny configs finds 6 with real prediction divergence (reg_quantile, reg_categorical, reg_l1, reg_bagging, reg_max_depth, multi_dense) and 13 where predictions match at FP epsilon despite tree-dump threshold differences. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The CUDA categorical split-finder kernels accepted min_data_per_group as a function parameter but never referenced it in the function body, so the constraint had zero effect on CUDA training. CPU correctly enforces it via FindBestThresholdCategoricalInner in feature_histogram.cpp. Add the missing left/right count check to the candidate-acceptance condition in both the shared-memory and global-memory variants of the categorical kernel, in both the left-to-right and right-to-left scans. Verified with scratch/probe_categorical3.py: across min_data_per_group values from 1 to 1,000,000, CPU and CUDA now produce identical splits or both correctly decline to split. Also closes the reg_categorical case in the broader CPU/CUDA parity sweep. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Felix asked for a real CI-runnable regression test that locks in the
categorical-kernel fix. Mirrors the scratch/probe_categorical3.py probe:
on a 200-row, 5-category dataset (~40 rows per group), train one round
on CPU and CUDA at min_data_per_group in {10, 41, 100, 1000} and assert
both produce the same split decision.
Before the fix, CUDA accepted the split at mdpg in {100, 1000, 1_000_000}
while CPU correctly refused; the assertion (None, None) != (0, 44.910)
trips loudly.
Gated on TASK=cuda to match the existing CUDA-only test pattern in
test_engine.py.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The reg_categorical case now lives in test_engine.py as a real regression test, so the dev-only parity script no longer needs to ship in the production tree. Removing it also clears the lint errors (T201 print, F841 unused var) that were blocking CI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The CUDA PercentileDevice (used by L1 and quantile leaf-value renewal)
computed the percentile position against `len` instead of `len - 1`,
and indexed it as 0-based instead of CPU's 1-based-with-+1 offset.
For alpha=0.5 (median), this returned the upper-middle element on
even-length arrays and the average of the upper-middle and median on
odd-length arrays - i.e., systematically biased upward in the
descending-sort convention that PercentileDevice uses.
CPU PercentileFun (src/objective/regression_objective.hpp:28-29):
const double float_pos = static_cast<double>(cnt_data - 1) * (1.0 - alpha);
const data_size_t pos = static_cast<data_size_t>(float_pos) + 1;
...
const double bias = float_pos - (pos - 1);
This matches the standard Type-7 interpolated quantile (numpy.median,
R's quantile() default).
Verified against numpy:
reg_l1 leaf-value max delta vs np.median: 0.5 -> 0.0 (after fix)
reg_quantile leaf-value max delta vs np.quantile: 0.6 -> 0.0 (after fix)
After this fix every leaf in the parity benchmark reproducer matches
its numpy counterpart to FP epsilon. There is a residual structural
divergence on reg_l1 (CPU and CUDA disagree on a few splits) which
will be investigated separately - this PR fixes only the leaf-value
calculation.
The weighted-percentile path uses different conventions on CPU and
CUDA (ascending vs descending sort, alpha vs 1-alpha threshold) and
is left untouched here. None of our parity tests exercise it.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Regression coverage for the unweighted PercentileDevice formula fix (prior commit). Three parametrized tests, all gated on LIGHTGBM_TEST_CUDA=1 so they only run on a CUDA-enabled build: - test_cuda_l1_leaf_renewal_matches_numpy_median: across 3 random seeds, asserts every leaf value on both CPU and CUDA matches numpy.median over the leaf's data points. - test_cuda_quantile_leaf_renewal_matches_numpy_quantile: same shape but parametrized over alpha = 0.1, 0.25, 0.5, 0.7, 0.9 to cover every even/odd leaf-size combination of the percentile bias. - test_cuda_l1_median_handles_small_even_and_odd_leaves: targets the exact failure mode of the old formula (even-length leaves returned sorted[1] instead of avg(sorted[1], sorted[2])) by sweeping leaves of size 2, 3, 4, 5, 8, 9. Tolerance is 1e-6 - well below the ~0.3 bias the old formula produced, but loose enough to absorb label_t float32 quantization inside the renewal kernel. Verified: with the prior commit reverted, 13 of 14 cases fail with bias > 1e-6; with the fix applied, all 14 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aligns with the existing convention used by test_engine.py's CUDA-only tests. Addresses Felix's review note (same change going on lightgbm-org#7/lightgbm-org#8/lightgbm-org#9/lightgbm-org#10). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same bug as PR lightgbm-org#6 fixed for the in-block PercentileDevice, but in the global-memory kernel used for init-score computation. The unweighted branch of PercentileGlobalKernel computed the percentile position against `len` instead of `len - 1`, biasing alpha=0.5 toward the upper-middle element on descending-sort layouts. Reproducer (with the Python wrapper's optimization that drops uniform weights, this is the path actually executed by `objective=regression_l1` or `quantile` when sample weights aren't supplied or are all 1): y = [1, 2, 3, 4, 5] init_score (numpy median): 3.0 CPU init_score: 3.0 (correct) CUDA init_score (before): 3.5 (biased toward upper) CUDA init_score (after): 3.0 (correct) This fix mirrors PR lightgbm-org#6 in PercentileDevice and uses the same Type-7 interpolated-quantile formula: float_pos = (1 - alpha) * (len - 1) pos = floor(float_pos) + 1 bias = float_pos - (pos - 1) Parity-sweep impact: reg_l1 max|Δ|: 0.25 -> 0.000e+00 reg_quantile max|Δ|: 0.54 -> 0.000e+00 The weighted branch of PercentileGlobalKernel uses different conventions and is not touched by this PR. There appears to be an unrelated bug in the CPU `WeightedPercentileFun` macro (off-by-one in which cdf delta is used in the interpolation), but that affects only non-uniform-weight workloads and is out of scope here - the Python wrapper drops uniform weights, so this PR's unweighted-formula fix already covers the common path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Regression coverage for the prior commit. 24 parametrized cases across (objective, alpha, n) verifying the init score logged by 'Start training from score' matches between CPU and CUDA at FP epsilon. Without the fix, regression_l1 (alpha=0.5) and quantile failed for small n where the formula bias landed on a different element. Gated on LIGHTGBM_TEST_CUDA=1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aligns with the existing convention used by test_engine.py's CUDA-only tests. Addresses Felix's review note (same change going on lightgbm-org#6/lightgbm-org#7/lightgbm-org#8/lightgbm-org#10). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…PT006 fix Squashes four local iterations: drop the prediction-parity test and num_leaves parity assertion (keep only the two depth assertions), drop redundant objective=regression (default value), use tuple for parametrize argnames (ruff PT006), and shrink fixture to n=64 / 4 features / min_data_in_leaf=1 — cuts runtime ~6x. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts: # tests/python_package_test/test_dual.py
The function had two related bugs:
1. shared_buffer is declared __shared__ REDUCE_VAL_T shared_buffer[WARPSIZE]
(32 entries), but the line `const REDUCE_VAL_T thread_base =
shared_buffer[threadIdx.x]` reads at threadIdx.x in [0, blockDim.x).
When blockDim.x > WARPSIZE (e.g. 256 for the L1/quantile renewal
kernels), threadIdx.x in [WARPSIZE, blockDim.x) reads out-of-bounds
shared memory.
2. The loop body `out_values[index] = thread_base + in_values[...]`
does not cumulate within the per-thread chunk. It is correct only
when num_data_per_thread == 1.
Together these manifest as an "illegal memory access" crash on weighted
L1 / weighted quantile training with n >= ~100 samples. Symptom:
[LightGBM] [Fatal] [CUDA] an illegal memory access was encountered
.../cuda_regression_objective.cu 225 (SynchronizeCUDADevice after
RenewTreeOutputCUDAKernel_RegressionL1<USE_WEIGHT=true>)
Fix: use the per-thread exclusive prefix sum already returned by
ShufflePrefixSumExclusive (matching the existing correct usage in
GlobalMemoryPrefixSum at line 183), and cumulate inclusively across
the chunk.
Verified: weighted L1 and weighted quantile now train successfully on
n in {100, 200, 500, 1000} on RTX 5090 / CUDA 13.2. Predictions match
CPU within the typical L1/quantile FP-precision range.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verifies CUDA weighted L1 / weighted quantile training does not raise
"illegal memory access" for n in {100, 200, 500, 1000}. Without the
prior fix, these all crashed in ShuffleSortedPrefixSumDevice.
Gated on LIGHTGBM_TEST_CUDA=1.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aligns with the existing convention used by test_engine.py's CUDA-only
tests (getenv("TASK", "") != "cuda"). Addresses Felix's review note on
PR lightgbm-org#8 (and the matching note on lightgbm-org#6, lightgbm-org#7, lightgbm-org#9, lightgbm-org#10).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…orcement [cuda] enforce max_depth on CUDA tree learner
feature_contri scales each feature's split gain by max(0, contri[f]), applied only in the CPU FeatureHistogram (output->gain *= meta_->penalty). The CUDA best-split finder has no equivalent multiplier, so on CUDA the parameter was silently ignored and split selection diverged from CPU. Fail fast in Config::CheckParamConflict until on-GPU support is implemented, matching the house convention of Log::Fatal for unsupported CUDA features. feature_contri defaults to empty (inactive), so only users who set it are affected. CPU and GPU(OpenCL) behavior is unchanged. Adds a parametrized regression test in test_dual.py asserting CPU excludes a zero-contri feature while CUDA raises. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tree learner Implements feature_contri (per-feature split-gain scaling) in the CUDA best split finder, matching the CPU FeatureHistogram behavior: - a per-task `double penalty` field is added to SplitFindTask, computed on host from config->feature_contri indexed by the real feature index (mirroring FeatureMetainfo::penalty in SetFeatureInfo), set when tasks are built and refreshed (with a device re-copy) in ResetConfig - each kernel multiplies the finalized per-feature gain by the task penalty after the per-feature best threshold is found, exactly where CPU applies output->gain *= meta_->penalty: within-feature threshold selection and the min_gain checks use the unscaled gain, only the cross-feature/cross-leaf comparison sees the scaled gain This replaces the previous Log::Fatal guard: feature_contri now works on CUDA instead of being rejected. Before this change CUDA silently ignored feature_contri: with feature_contri=[0,1,1,1] on data dominated by feature 0, CPU excluded feature 0 from all splits while CUDA still split on it almost exclusively, with prediction divergence up to 1.22. After this change both backends select identical features and predictions match bit-for-bit (max delta 8.9e-16 across all test cases). Adds a parametrized regression test (5 contri vectors) to test_dual.py asserting identical feature selection and prediction parity at atol=1e-10. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Owner
|
Thank you, Max — and thank you, Claude Code (separately 🙂). Verdict: SOLID, no blocking bug. Reviewed against the CPU
Two non-blocking follow-ups (neither is introduced by this PR):
Blocker is just the required P.S. — smallest diff of the five (+124/−6) and it still failed lint. 😄 At this point I'm convinced |
…error # Conflicts: # tests/python_package_test/test_dual.py
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem: the delta
When training with
device_type="cuda",feature_contri(per-feature gain scaling, documented asgain[i] = feature_contri[i] * gain[i]) was silently ignored. The scaling is applied only in the CPUFeatureHistogram(output->gain *= meta_->penaltyafter the per-feature best threshold is found); the CUDA best-split finder compared raw gains with no per-feature multiplier.Measured delta before this PR (500×4 synthetic data dominated by feature 0;
learning_rate=0.1, 5 rounds,gpu_use_dp=true):CPU honors the scaling (a zero-contri feature is never split on, a down-weighted feature is used less); CUDA ignored it entirely — different features selected, predictions diverging by up to 1.22.
Fix
Implements per-feature gain scaling in the CUDA best-split finder:
SplitFindTaskgains adouble penaltyfield, computed on host fromconfig->feature_contriindexed by the real feature index — mirroringFeatureMetainfo::penaltyin the CPUSetFeatureInfo. Set when the task array is built; refreshed (with a device re-copy) inResetConfig.output->gain *= meta_->penalty. Within-feature threshold selection and themin_gain_to_splitchecks use the unscaled gain; only the cross-feature / cross-leaf comparison sees the scaled gain. All six gain-finalization sites are covered (numerical, categorical, global-memory, and discretized variants).The first commit's
Log::Fatalguard is removed:feature_contrinow works on CUDA instead of being rejected.Result: the delta after the fix
Same measurement, post-fix:
Identical feature selection on every case, and predictions match bit-for-bit (≤ 8.9e-16 = one fp64 ULP).
Tests (in
test_dual.py, gated onTASK=cuda)test_cuda_feature_contri_matches_cpu— 5 parametrized contri vectors, asserting (a) identical feature-selection sets, (b) zero-contri features excluded on both backends, (c) prediction parity atatol=1e-10. All 5 pass with the fix; the [0,1,1,1] and [0.5,1,1,1] cases fail on the unpatched build. Fulltest_dual.pysuite (59 tests) passes.Notes
feature_contrivalue to the penalty (nomax(0, ·)despite the parameter docs); the CUDA implementation mirrors the actual CPU code so both backends stay in lockstep.gainat all — that's an upstream issue independent offeature_contri.🤖 Generated with Claude Code