[cuda] implement forced splits (forcedsplits_filename) on the CUDA tree learner by maxwbuckley · Pull Request #17 · BelixRogner/ExaBoost

maxwbuckley · 2026-05-31T23:04:21Z

Problem: the delta

When training with device_type="cuda", forcedsplits_filename was silently ignored. ForceSplits exists only in SerialTreeLearner::Train; the CUDA learner overrides Train and never consults the forced-split JSON.

Measured delta before this PR (400×6 synthetic data, 30 rounds, gpu_use_dp=true; "roots" = the set of root split features across all 30 trees):

forced-split case	forced root feature	CPU tree roots	CUDA tree roots	max pred delta
root only (feat 2 @ 0.5)	2	{2} ✓	{0, 1} ✗	0.49
root + nested children	2	{2} ✓	{0, 1} ✗	0.91
mixed features (root 1, children 0/3)	1	{1} ✓	{0, 1} ✗	0.74

CPU honors the forced structure in every tree; CUDA splits on whatever has the best gain — the user's forced structure is silently discarded.

Fix

Implements forced splits on the CUDA tree learner, mirroring SerialTreeLearner::ForceSplits:

ComputeForcedSplitKernel — computes split info for a given (feature, bin threshold) from the leaf's histogram: the CUDA analogue of FeatureHistogram::GatherInfoForThresholdNumerical. Single-threaded with the same right-to-left bin accumulation order as the CPU loop, so the arithmetic is bit-identical.
ForceSplitsCUDA — a BFS pre-pass over the forced-split JSON before the main split loop, using the CPU's compute-before-apply structure (forceSplitMap pattern): each iteration constructs histograms for the active (smaller, larger) leaf pair, runs the regular best-split search, syncs host best-split arrays with the device cache, computes forced split infos for pending JSON nodes, then applies one forced split.
ApplySplit — the main loop's split-application block, extracted and shared with the pre-pass.
Unsupported configurations (quantized gradients, multi-GPU, categorical features) fall back with a clear warning.

The first commit's Log::Fatal guard is removed: forced splits now work on CUDA.

Result: the delta after the fix

Same measurement, post-fix — across 4 forced-split shapes × {8, 31} leaves × 2 seeds × {1, 30} rounds (32 configs):

forced-split case	CUDA tree roots	tree-0 structure vs CPU	max pred delta
root only	{forced feature} ✓	identical (features/gains/counts/leaf values)	≤ 8.9e-16
root + nested children	{forced feature} ✓	identical	≤ 8.9e-16
mixed features	{forced feature} ✓	identical	≤ 8.9e-16
3-deep chain	{forced feature} ✓	identical	≤ 8.9e-16

Every tree honors the forced structure, the first tree's structure matches CPU node-for-node, and predictions match at FP epsilon (≤ 1 fp64 ULP) in all 32 configurations.

(Only difference vs CPU: the displayed real-valued threshold for the same bin boundary can differ — e.g. 0.4433 vs 0.4365 for the same bin — a known threshold-encoding cosmetic difference that does not affect predictions.)

Bring-up notes for reviewers

Three bugs were found and fixed during development — they explain the design:

Forced split infos must be computed before applying any split of the same BFS level (CPU's forceSplitMap pattern), because applying a split re-targets the active (smaller, larger) leaf pair.
Every forced leaf needs a regular best-split search during the pre-pass, or the main loop never considers it for further growth (the remaining num_leaves budget would be spent only under the last forced split).
The host-side best-split arrays must be synced with the device-side cache for pre-pass-searched leaves (SyncLeafBestSplitToHost); otherwise the main loop applies splits with mismatched host (feature/threshold) and device (gain/sums) state, corrupting trees.

Tests (in `test_dual.py`, gated on `TASK=cuda`)

Test	Cases	What it pins down
`test_cuda_forced_splits_honored`	8	every tree's root is the forced feature — fails on the old build
`test_cuda_forced_splits_match_cpu`	16	tree-0 structural equality + prediction parity (atol=1e-10) over 30 rounds

Full test_dual.py suite (78 tests) passes.

🤖 Generated with Claude Code

Two related bugs caused CUDA to ignore the `max_depth` parameter: 1. CUDABestSplitFinder::FindBestSplitsForLeaf had no max_depth check. CPU's SerialTreeLearner::BeforeFindBestSplit invalidates a leaf's gain when its depth has reached config_->max_depth, but the CUDA path never did the equivalent. 2. CUDATree::Split / SplitCategorical updated the GPU-side cuda_leaf_depth_ via the launch kernel but never updated the host-side leaf_depth_ vector, so tree->leaf_depth(idx) always returned 0 on CUDA. Without (2), even adding the check at (1) would have done nothing. Symptom (max_depth=2, varying num_leaves): num_leaves= 4: cpu depth=2 leaves=4 | cuda depth=2 leaves=4 num_leaves= 7: cpu depth=2 leaves=4 | cuda depth=3 leaves=7 num_leaves=15: cpu depth=2 leaves=4 | cuda depth=5 leaves=15 num_leaves=31: cpu depth=2 leaves=4 | cuda depth=7 leaves=31 After fix, CUDA caps at the requested depth (2) for every num_leaves. Fix: * Mirror the host-side leaf_depth_ update in CUDATree::Split and CUDATree::SplitCategorical (matching CPU Tree::Split's behavior in include/LightGBM/tree.h). * Plumb a `smaller_leaf_below_max_depth` / `larger_leaf_below_max_depth` flag pair into FindBestSplitsForLeaf and AND them into the is_*_leaf_valid checks. The caller in cuda_single_gpu_tree_learner.cpp computes them as `config_->max_depth <= 0 || tree->leaf_depth(idx) < config_->max_depth`. Verified with the cpu/cuda parity sweep: reg_max_depth case (which used max_depth=3 with num_leaves=7) now matches CPU at FP epsilon, down from max|Δ|=0.25 raw_score. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Regression coverage for the prior commit (CUDA tree learner now enforces max_depth). Two parametrized tests, gated on LIGHTGBM_TEST_CUDA=1: - test_cuda_respects_max_depth: across (max_depth, num_leaves) combinations from {1,2,3,5} x {2,4,7,31}, asserts CUDA tree depth is at most max_depth and matches CPU depth + leaf count exactly. - test_cuda_max_depth_matches_cpu_predictions: end-to-end check that 5 boosting rounds with max_depth=3 produce CPU/CUDA predictions matching at FP epsilon. Without the fix, this diverged by max|Δ|=0.47. Verified: with the prior commit reverted, 5 of 9 cases fail (those where num_leaves > 2^max_depth, i.e. where the bug actually triggered). With the fix applied, all 9 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CUDA 13.0 removed offline-compilation support for Maxwell (sm_50/52/53), Pascal (sm_60/61/62), and Volta (sm_70/72). With nvcc 13.x, the unconditional inclusion of sm_60/61/62/70 in CUDA_ARCHS causes the build to fail with: nvcc fatal : Unsupported gpu architecture 'compute_60' Gate those architectures behind a CUDAToolkit_VERSION VERSION_LESS "13.0" check. With CUDA >= 13.0 the initial list starts at "75" (Turing); the existing version-conditional appends below add 80, 86, 87, 89, 90, 100, 120 as appropriate. Verified locally with CUDA 13.2 + RTX 5090 (sm_120): builds and installs cleanly without any other changes. Reference for the dropped capabilities: https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/ Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cpplint's --root=.. workaround derived the expected header-guard prefix from the parent directory name. After renaming the repo to ExaBoost, that prefix changed from LIGHTGBM_INCLUDE_*_H_ to EXABOOST_INCLUDE_*_H_ and every header now fails build/header_guard. We deliberately did not rename the C/C++ symbols (still LightGBM, LGBM_*, import lightgbm) to keep ExaBoost binary-compatible. Disable the header-guard check in the cpplint pre-commit hook to match the existing setup in .ci/lint-cpp.sh. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Builds with -DCMAKE_CUDA_ARCHITECTURES (e.g. "120-real" for a single-GPU local iteration on RTX 5090) currently get overwritten unconditionally by the toolkit-version-driven CUDA_ARCHS list, producing a multi-arch build that takes much longer to compile and isn't what the user asked for. Wrap the existing toolkit-version logic in a check that only applies it when CMAKE_CUDA_ARCHITECTURES is unset or empty. When the user passes it explicitly, use their value verbatim. No behavior change for users who don't pass the flag. Composes with lightgbm-org#5 (the toolkit-version gating for CUDA 13.x dropped archs) — both branches together give a sane default that adapts to the toolkit, plus an escape hatch for fast local iteration. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Aligns with the existing convention used by test_engine.py's CUDA-only tests. Addresses Felix's review note (same change going on lightgbm-org#6/lightgbm-org#8/lightgbm-org#9/lightgbm-org#10). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

sklearn>=1.9 dev routes check_classification_targets and LabelEncoder through narwhals, which raises TypeError on a bare pyarrow Array / ChunkedArray ("Please set `allow_series=True` or `series_only=True`") because sklearn does not pass that flag. The Python - latest versions (manylinux_2_28) CI job has been failing for 18 test variants of test_classification_and_regression_minimally_work_with_all_accepted_data_types on every PR for this reason. We advertise pyarrow Array / ChunkedArray as accepted label types (_LGBM_LabelType), so the user-facing contract should be preserved. Convert eagerly to numpy at the top of LGBMClassifier.fit, before calling into sklearn — _LGBMAssertAllFinite, _LGBMCheckClassificationTargets, and _LGBMLabelEncoder all see a familiar 1-D array. No behavior change for non-pyarrow y. Regression tests (LGBMRegressor) don't hit this path because they don't call check_classification_targets; they were already passing. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Compares matched configurations on CPU and CUDA at tight tolerance (1e-5 raw_score, exact tree structure). Initial run on 19 tiny configs finds 6 with real prediction divergence (reg_quantile, reg_categorical, reg_l1, reg_bagging, reg_max_depth, multi_dense) and 13 where predictions match at FP epsilon despite tree-dump threshold differences. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The CUDA categorical split-finder kernels accepted min_data_per_group as a function parameter but never referenced it in the function body, so the constraint had zero effect on CUDA training. CPU correctly enforces it via FindBestThresholdCategoricalInner in feature_histogram.cpp. Add the missing left/right count check to the candidate-acceptance condition in both the shared-memory and global-memory variants of the categorical kernel, in both the left-to-right and right-to-left scans. Verified with scratch/probe_categorical3.py: across min_data_per_group values from 1 to 1,000,000, CPU and CUDA now produce identical splits or both correctly decline to split. Also closes the reg_categorical case in the broader CPU/CUDA parity sweep. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Felix asked for a real CI-runnable regression test that locks in the categorical-kernel fix. Mirrors the scratch/probe_categorical3.py probe: on a 200-row, 5-category dataset (~40 rows per group), train one round on CPU and CUDA at min_data_per_group in {10, 41, 100, 1000} and assert both produce the same split decision. Before the fix, CUDA accepted the split at mdpg in {100, 1000, 1_000_000} while CPU correctly refused; the assertion (None, None) != (0, 44.910) trips loudly. Gated on TASK=cuda to match the existing CUDA-only test pattern in test_engine.py. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The reg_categorical case now lives in test_engine.py as a real regression test, so the dev-only parity script no longer needs to ship in the production tree. Removing it also clears the lint errors (T201 print, F841 unused var) that were blocking CI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The CUDA PercentileDevice (used by L1 and quantile leaf-value renewal) computed the percentile position against `len` instead of `len - 1`, and indexed it as 0-based instead of CPU's 1-based-with-+1 offset. For alpha=0.5 (median), this returned the upper-middle element on even-length arrays and the average of the upper-middle and median on odd-length arrays - i.e., systematically biased upward in the descending-sort convention that PercentileDevice uses. CPU PercentileFun (src/objective/regression_objective.hpp:28-29): const double float_pos = static_cast<double>(cnt_data - 1) * (1.0 - alpha); const data_size_t pos = static_cast<data_size_t>(float_pos) + 1; ... const double bias = float_pos - (pos - 1); This matches the standard Type-7 interpolated quantile (numpy.median, R's quantile() default). Verified against numpy: reg_l1 leaf-value max delta vs np.median: 0.5 -> 0.0 (after fix) reg_quantile leaf-value max delta vs np.quantile: 0.6 -> 0.0 (after fix) After this fix every leaf in the parity benchmark reproducer matches its numpy counterpart to FP epsilon. There is a residual structural divergence on reg_l1 (CPU and CUDA disagree on a few splits) which will be investigated separately - this PR fixes only the leaf-value calculation. The weighted-percentile path uses different conventions on CPU and CUDA (ascending vs descending sort, alpha vs 1-alpha threshold) and is left untouched here. None of our parity tests exercise it. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Regression coverage for the unweighted PercentileDevice formula fix (prior commit). Three parametrized tests, all gated on LIGHTGBM_TEST_CUDA=1 so they only run on a CUDA-enabled build: - test_cuda_l1_leaf_renewal_matches_numpy_median: across 3 random seeds, asserts every leaf value on both CPU and CUDA matches numpy.median over the leaf's data points. - test_cuda_quantile_leaf_renewal_matches_numpy_quantile: same shape but parametrized over alpha = 0.1, 0.25, 0.5, 0.7, 0.9 to cover every even/odd leaf-size combination of the percentile bias. - test_cuda_l1_median_handles_small_even_and_odd_leaves: targets the exact failure mode of the old formula (even-length leaves returned sorted[1] instead of avg(sorted[1], sorted[2])) by sweeping leaves of size 2, 3, 4, 5, 8, 9. Tolerance is 1e-6 - well below the ~0.3 bias the old formula produced, but loose enough to absorb label_t float32 quantization inside the renewal kernel. Verified: with the prior commit reverted, 13 of 14 cases fail with bias > 1e-6; with the fix applied, all 14 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Aligns with the existing convention used by test_engine.py's CUDA-only tests. Addresses Felix's review note (same change going on lightgbm-org#7/lightgbm-org#8/lightgbm-org#9/lightgbm-org#10). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Same bug as PR lightgbm-org#6 fixed for the in-block PercentileDevice, but in the global-memory kernel used for init-score computation. The unweighted branch of PercentileGlobalKernel computed the percentile position against `len` instead of `len - 1`, biasing alpha=0.5 toward the upper-middle element on descending-sort layouts. Reproducer (with the Python wrapper's optimization that drops uniform weights, this is the path actually executed by `objective=regression_l1` or `quantile` when sample weights aren't supplied or are all 1): y = [1, 2, 3, 4, 5] init_score (numpy median): 3.0 CPU init_score: 3.0 (correct) CUDA init_score (before): 3.5 (biased toward upper) CUDA init_score (after): 3.0 (correct) This fix mirrors PR lightgbm-org#6 in PercentileDevice and uses the same Type-7 interpolated-quantile formula: float_pos = (1 - alpha) * (len - 1) pos = floor(float_pos) + 1 bias = float_pos - (pos - 1) Parity-sweep impact: reg_l1 max|Δ|: 0.25 -> 0.000e+00 reg_quantile max|Δ|: 0.54 -> 0.000e+00 The weighted branch of PercentileGlobalKernel uses different conventions and is not touched by this PR. There appears to be an unrelated bug in the CPU `WeightedPercentileFun` macro (off-by-one in which cdf delta is used in the interpolation), but that affects only non-uniform-weight workloads and is out of scope here - the Python wrapper drops uniform weights, so this PR's unweighted-formula fix already covers the common path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Regression coverage for the prior commit. 24 parametrized cases across (objective, alpha, n) verifying the init score logged by 'Start training from score' matches between CPU and CUDA at FP epsilon. Without the fix, regression_l1 (alpha=0.5) and quantile failed for small n where the formula bias landed on a different element. Gated on LIGHTGBM_TEST_CUDA=1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Aligns with the existing convention used by test_engine.py's CUDA-only tests. Addresses Felix's review note (same change going on lightgbm-org#6/lightgbm-org#7/lightgbm-org#8/lightgbm-org#10). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…PT006 fix Squashes four local iterations: drop the prediction-parity test and num_leaves parity assertion (keep only the two depth assertions), drop redundant objective=regression (default value), use tuple for parametrize argnames (ruff PT006), and shrink fixture to n=64 / 4 features / min_data_in_leaf=1 — cuts runtime ~6x. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

# Conflicts: # tests/python_package_test/test_dual.py

The function had two related bugs: 1. shared_buffer is declared __shared__ REDUCE_VAL_T shared_buffer[WARPSIZE] (32 entries), but the line `const REDUCE_VAL_T thread_base = shared_buffer[threadIdx.x]` reads at threadIdx.x in [0, blockDim.x). When blockDim.x > WARPSIZE (e.g. 256 for the L1/quantile renewal kernels), threadIdx.x in [WARPSIZE, blockDim.x) reads out-of-bounds shared memory. 2. The loop body `out_values[index] = thread_base + in_values[...]` does not cumulate within the per-thread chunk. It is correct only when num_data_per_thread == 1. Together these manifest as an "illegal memory access" crash on weighted L1 / weighted quantile training with n >= ~100 samples. Symptom: [LightGBM] [Fatal] [CUDA] an illegal memory access was encountered .../cuda_regression_objective.cu 225 (SynchronizeCUDADevice after RenewTreeOutputCUDAKernel_RegressionL1<USE_WEIGHT=true>) Fix: use the per-thread exclusive prefix sum already returned by ShufflePrefixSumExclusive (matching the existing correct usage in GlobalMemoryPrefixSum at line 183), and cumulate inclusively across the chunk. Verified: weighted L1 and weighted quantile now train successfully on n in {100, 200, 500, 1000} on RTX 5090 / CUDA 13.2. Predictions match CPU within the typical L1/quantile FP-precision range. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Verifies CUDA weighted L1 / weighted quantile training does not raise "illegal memory access" for n in {100, 200, 500, 1000}. Without the prior fix, these all crashed in ShuffleSortedPrefixSumDevice. Gated on LIGHTGBM_TEST_CUDA=1. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Aligns with the existing convention used by test_engine.py's CUDA-only tests (getenv("TASK", "") != "cuda"). Addresses Felix's review note on PR lightgbm-org#8 (and the matching note on lightgbm-org#6, lightgbm-org#7, lightgbm-org#9, lightgbm-org#10). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…orcement [cuda] enforce max_depth on CUDA tree learner

ForceSplits is only applied in SerialTreeLearner::Train. The CUDA tree learner overrides Train with its own kernel-based split loop and never applies the forced-split JSON, so on CUDA forced splits were silently dropped. Fail fast in Config::CheckParamConflict (matching the existing data-parallel forced-splits fatal and the house convention of Log::Fatal for unsupported CUDA features) until on-GPU forced splits are implemented. CPU and GPU(OpenCL) behavior is unchanged. Adds a parametrized regression test in test_dual.py asserting CPU honors the forced split while CUDA raises. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ee learner Implements forced splits on CUDA, mirroring SerialTreeLearner::ForceSplits: - ComputeForcedSplitKernel computes split information for a given (feature, bin threshold) from the leaf's histogram - the CUDA analogue of FeatureHistogram::GatherInfoForThresholdNumerical. Single-threaded with the same right-to-left bin accumulation order as the CPU loop, so results are bit-identical - ForceSplitsCUDA runs a BFS over the forced-split JSON before the main split loop, with the CPU's compute-before-apply structure: each iteration constructs histograms for the active (smaller, larger) leaf pair, runs the regular best-split search (so every forced leaf has a valid cached best split for the main loop), syncs the host-side best-split arrays with the device cache, computes forced split infos for pending JSON nodes, then applies one forced split. Forced split infos are stored per-leaf so they survive subsequent searches - the split-application block of the main loop is extracted into ApplySplit and shared with the pre-pass - after each forced split the children's cached best-split entries are invalidated; they are re-searched when they next become the active pair - unsupported forced-split configurations (quantized gradients, multi-GPU, categorical features) fall back with a clear warning This replaces the previous Log::Fatal guard: forced splits now work on CUDA. Before this change CUDA silently ignored forcedsplits_filename: with a forced root split on feature 2, CPU put feature 2 at the root of every tree while CUDA used features 0/1 (whatever had best gain), with predictions diverging by up to 0.91. After this change, CUDA honors the forced structure in every tree, the first tree's structure (features, gains, counts, leaf values) matches CPU exactly, and predictions match at FP epsilon (<= 8.9e-16) across all tested forced-split shapes (root-only, nested left+right, mixed-feature, 3-deep chains) x num_leaves x seeds over 30 boosting rounds. The three key bugs found and fixed during bring-up, for reviewers: 1. forced split infos must be computed BEFORE applying any split of the same BFS level (the CPU's forceSplitMap pattern) because applying a split re-targets the active leaf pair 2. every forced leaf needs a regular best-split search during the pre-pass, or the main loop never considers it for further growth 3. the host-side best-split arrays (leaf_best_split_feature_/threshold_) must be synced with the device-side cache for pre-pass-searched leaves; otherwise the main loop applies splits with mismatched host/device state, corrupting trees Adds 24 regression tests to test_dual.py: 8 forced-structure tests (every tree's root must be the forced feature) and 16 CPU-parity tests (tree-structure equality + prediction parity at atol=1e-10). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

BelixRogner · 2026-06-02T21:03:42Z

Thank you, Max — and thank you, Claude Code (separately 🙂). Really strong work here, but this is the one of the five that needs a fix before merge.

Deep review against SerialTreeLearner::ForceSplits / GatherInfoForThresholdNumerical. The good news: the BFS pre-pass structure is faithful, the right-to-left bin accumulation in ComputeForcedSplitKernel is bit-identical, and all three documented bring-up bugs (compute-before-apply, every-forced-leaf-searched, host/device sync via SyncLeafBestSplitToHost) are genuinely handled, not just described. Verdict: MINOR-CONCERNS — two things:

Latent bug: ComputeForcedSplitKernel sets the split fields but never assigns out->inner_feature_index (nor out->leaf_index). The slot comes from Resize'd, uninitialized device memory, so cuda_tree.cu:91 (split_feature_inner[new_node_index] = cuda_split_info->inner_feature_index) writes garbage into split_feature_inner_ for every forced node. It's latent — your tests pass because predict()/dump_model() use split_feature_ (the real index, set correctly) — but any in-session consumer of split_feature_inner() / Tree::PredictByMap on a CUDA forced-split tree reads garbage. One-line fix in the kernel: out->inner_feature_index = task->inner_feature_index; (and set out->leaf_index for parity).
path_smooth divergence (unguarded): the forced kernel hardcodes GetLeafGain<true,false> / CalculateSplittedLeafOutput<true,false> i.e. USE_SMOOTHING=false, while the main finder uses use_smoothing_. With path_smooth>0, forced-node outputs/gains are computed without smoothing while the rest of the tree uses it — a silent CPU divergence with no warning. Please either guard path_smooth>0 with a warning, or template smoothing like the main finder. (The hardcoded USE_L1=true is fine — identity when lambda_l1=0.)

Acceptable-as-scoped: categorical forced splits abort forcing on CUDA (vs applied on CPU) — it's warned and documented as numerical-only, so no objection.

Test gap worth closing while you're in here: a path_smooth>0 case (would catch #2) and something that reads split_feature_inner/PredictByMap (would catch #1).

And: ruff format + git merge master for the lint/conflict. macOS red is the dask socket flake.

P.S. — bit-identical accumulation order down to the __double2int_rn rounding, three self-found bugs in the writeup… and the linter caught you anyway. 😄 Friends don't let friends skip ruff format.

…rror # Conflicts: # src/treelearner/cuda/cuda_single_gpu_tree_learner.cpp # tests/python_package_test/test_dual.py

maxwbuckley and others added 30 commits May 10, 2026 01:44

Merge branch 'master' into cuda/max-depth-enforcement

f526da8

lint: ruff format

2ab1f9c

Merge branch 'master' into cuda/max-depth-enforcement

f784670

lint: ruff format

5224304

lint: ruff format

2ce3ffe

lint: ruff format

581312d

lint: ruff check --fix + hoist stdlib imports + tuple parametrize

37c28a9

Merge remote-tracking branch 'upstream/master' into sync-upstream-may18

618fb64

Merge remote-tracking branch 'belix/master' into manage/max-depth-fix

46ec2b8

# Conflicts: # tests/python_package_test/test_dual.py

BelixRogner and others added 3 commits May 19, 2026 12:42

Merge pull request lightgbm-org#7 from maxwbuckley/cuda/max-depth-enf…

7d7dd48

…orcement [cuda] enforce max_depth on CUDA tree learner

maxwbuckley changed the title ~~[cuda] reject forcedsplits_filename instead of silently ignoring it~~ [cuda] implement forced splits (forcedsplits_filename) on the CUDA tree learner Jun 1, 2026

BelixRogner force-pushed the master branch from 7d7dd48 to 6090f17 Compare June 2, 2026 22:17

maxwbuckley marked this pull request as ready for review June 3, 2026 20:29

maxwbuckley and others added 3 commits June 3, 2026 22:41

lint: ruff format

35720bb

Merge remote-tracking branch 'belix/master' into cuda/forced-splits-e…

5139ec7

…rror # Conflicts: # src/treelearner/cuda/cuda_single_gpu_tree_learner.cpp # tests/python_package_test/test_dual.py

Merge branch 'master' into cuda/forced-splits-error

3ba5dcb

BelixRogner merged commit b08c882 into BelixRogner:master Jun 4, 2026
52 of 55 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[cuda] implement forced splits (forcedsplits_filename) on the CUDA tree learner#17

[cuda] implement forced splits (forcedsplits_filename) on the CUDA tree learner#17
BelixRogner merged 36 commits into
BelixRogner:masterfrom
maxwbuckley:cuda/forced-splits-error

maxwbuckley commented May 31, 2026 •

edited

Loading

Uh oh!

BelixRogner commented Jun 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

maxwbuckley commented May 31, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem: the delta

Fix

Result: the delta after the fix

Bring-up notes for reviewers

Tests (in test_dual.py, gated on TASK=cuda)

Uh oh!

BelixRogner commented Jun 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

maxwbuckley commented May 31, 2026 •

edited

Loading

Tests (in `test_dual.py`, gated on `TASK=cuda`)