Skip to content

[cuda] implement basic monotone constraint enforcement on the CUDA tree learner#16

Open
maxwbuckley wants to merge 37 commits into
BelixRogner:masterfrom
maxwbuckley:cuda/monotone-constraints-error
Open

[cuda] implement basic monotone constraint enforcement on the CUDA tree learner#16
maxwbuckley wants to merge 37 commits into
BelixRogner:masterfrom
maxwbuckley:cuda/monotone-constraints-error

Conversation

@maxwbuckley

@maxwbuckley maxwbuckley commented May 31, 2026

Copy link
Copy Markdown
Collaborator

Problem: the delta

When training with device_type="cuda", monotone_constraints was silently ignored. The CUDA best-split finder and CUDALeafSplits had no constraint plumbing — unlike the CPU FeatureHistogram, which clamps child outputs into the leaf's [min, max] constraint range and rejects monotonicity-violating splits.

Measured delta before this PR (600×3 regression, monotone_constraints_method="basic", gpu_use_dp=true; violations counted on 4,500 grid points per config — 3 base points × 500-point sweeps × constrained features):

constraints rounds CPU violations CUDA violations CUDA worst violation max pred delta vs CPU
[1, −1, 0] 30 0 9 3.0e-02 0.52
[1, −1, 0] 100 0 203 1.1e-01 0.32
[1, 1, 1] 30 0 157 3.9e-01 2.76
[1, 1, 1] 100 0 456 3.9e-01 2.81
[−1, 0, 1] 30 0 157 7.0e-01 2.82
[−1, 0, 1] 100 0 389 6.9e-01 3.14

A user requesting monotone constraints on CUDA got a model that violates the requested relationship at hundreds of points — silently. For regulatory/safety uses of monotonicity this is a hard correctness failure.

Fix

Implements the basic monotone constraint method on the CUDA tree learner:

  • CalculateSplittedLeafOutputMC / GetSplitGainsMC device functions mirror the CPU USE_MC formulas: candidate child outputs are clamped into the leaf's [min, max] constraint range, and splits whose clamped outputs violate the monotone direction are rejected (gain 0).
  • Each SplitFindTask carries its feature's monotone type (config is real-feature-indexed; mapped to inner indices, matching the CPU FeatureMetainfo setup).
  • Per-leaf [min, max] constraint ranges are tracked on host — mirroring BasicLeafConstraints::Update midpoint propagation — and passed to the split kernels for the (smaller, larger) leaf pair each iteration.
  • Leaf values stored in CUDASplitInfo are the clamped outputs (used for the tree and constraint propagation); leaf gains keep their unconstrained values (they form the child's parent_gain baseline, matching CPU's BeforeNumerical gain-shift computation). Getting this distinction wrong was the prototype's main bug.
  • Unsupported monotone configurations fail fast with a clear error: monotone_constraints_method=intermediate/advanced, monotone_penalty>0, and use_quantized_grad + monotone.

This replaces the first commit's Log::Fatal guard: basic-method monotone constraints now work on CUDA.

Result: the delta after the fix

Same measurement, post-fix — across all constraint vectors × {1, 30, 100} rounds × 2 seeds:

constraints rounds CUDA violations CUDA training MSE CPU training MSE
[1, −1, 0] 30 / 100 0 0.0643 / 0.0122 0.0658 / 0.0129
[1, 1, 1] 30 / 100 0 1.953 / 1.947 1.957 / 1.952
[−1, 0, 1] 30 / 100 0 1.449 / 1.206 1.438 / 1.241
[0, 0, 0] (inactive) 30 / 100 0 bit-identical to CPU (≤ 8.9e-16)

Zero monotonicity violations in every configuration (24/24 test configs), with constrained training quality equivalent to CPU (MSE within 3%, often slightly better) and bit-exact parity when constraints are inactive.

Honest scope note: structure parity vs. constraint parity

CPU and CUDA can build different but equally valid constrained trees in multi-round training. The cause: CPU prunes candidate features through its FeatureHistogram::is_splittable_ cache (a feature found unsplittable on a parent is skipped for the whole subtree — a search-space heuristic, not part of the constraint definition). Replicating that cache on GPU requires per-(leaf, feature) state with extra host-device round trips; an experimental version made agreement worse because the cache semantics interact with histogram-subtraction timing. So this PR guarantees the property (zero violations + equivalent quality + bit-parity in the inactive case), not node-by-node tree equality. The PR description and tests state exactly that.

Tests (in test_dual.py, gated on TASK=cuda)

Test Cases What it pins down
test_cuda_monotone_constraints_are_enforced 12 zero violations on 4,500-point grids — fails on the old build (up to 456 violations)
test_cuda_monotone_constraints_match_cpu_quality 3 both backends enforce + CUDA MSE within 5% of CPU
test_cuda_monotone_noop_constraints_match_cpu_exactly 2 all-zero constraints → bit-exact CPU parity (atol=1e-10)
test_cuda_monotone_unsupported_configs_raise 1 intermediate/advanced/penalty/quantized configs rejected loudly

Full test_dual.py suite (72 tests) passes.

🤖 Generated with Claude Code

maxwbuckley and others added 30 commits May 10, 2026 01:44
Two related bugs caused CUDA to ignore the `max_depth` parameter:

1. CUDABestSplitFinder::FindBestSplitsForLeaf had no max_depth check.
   CPU's SerialTreeLearner::BeforeFindBestSplit invalidates a leaf's
   gain when its depth has reached config_->max_depth, but the CUDA
   path never did the equivalent.

2. CUDATree::Split / SplitCategorical updated the GPU-side
   cuda_leaf_depth_ via the launch kernel but never updated the
   host-side leaf_depth_ vector, so tree->leaf_depth(idx) always
   returned 0 on CUDA. Without (2), even adding the check at (1)
   would have done nothing.

Symptom (max_depth=2, varying num_leaves):

  num_leaves= 4: cpu depth=2 leaves=4 | cuda depth=2 leaves=4
  num_leaves= 7: cpu depth=2 leaves=4 | cuda depth=3 leaves=7
  num_leaves=15: cpu depth=2 leaves=4 | cuda depth=5 leaves=15
  num_leaves=31: cpu depth=2 leaves=4 | cuda depth=7 leaves=31

After fix, CUDA caps at the requested depth (2) for every num_leaves.

Fix:

* Mirror the host-side leaf_depth_ update in CUDATree::Split and
  CUDATree::SplitCategorical (matching CPU Tree::Split's behavior in
  include/LightGBM/tree.h).
* Plumb a `smaller_leaf_below_max_depth` / `larger_leaf_below_max_depth`
  flag pair into FindBestSplitsForLeaf and AND them into the
  is_*_leaf_valid checks. The caller in
  cuda_single_gpu_tree_learner.cpp computes them as
  `config_->max_depth <= 0 || tree->leaf_depth(idx) < config_->max_depth`.

Verified with the cpu/cuda parity sweep: reg_max_depth case (which used
max_depth=3 with num_leaves=7) now matches CPU at FP epsilon, down from
max|Δ|=0.25 raw_score.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Regression coverage for the prior commit (CUDA tree learner now
enforces max_depth). Two parametrized tests, gated on
LIGHTGBM_TEST_CUDA=1:

- test_cuda_respects_max_depth: across (max_depth, num_leaves)
  combinations from {1,2,3,5} x {2,4,7,31}, asserts CUDA tree depth
  is at most max_depth and matches CPU depth + leaf count exactly.
- test_cuda_max_depth_matches_cpu_predictions: end-to-end check
  that 5 boosting rounds with max_depth=3 produce CPU/CUDA
  predictions matching at FP epsilon. Without the fix, this
  diverged by max|Δ|=0.47.

Verified: with the prior commit reverted, 5 of 9 cases fail
(those where num_leaves > 2^max_depth, i.e. where the bug actually
triggered). With the fix applied, all 9 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CUDA 13.0 removed offline-compilation support for Maxwell (sm_50/52/53),
Pascal (sm_60/61/62), and Volta (sm_70/72). With nvcc 13.x, the
unconditional inclusion of sm_60/61/62/70 in CUDA_ARCHS causes the
build to fail with:

    nvcc fatal : Unsupported gpu architecture 'compute_60'

Gate those architectures behind a CUDAToolkit_VERSION VERSION_LESS
"13.0" check. With CUDA >= 13.0 the initial list starts at "75"
(Turing); the existing version-conditional appends below add 80, 86,
87, 89, 90, 100, 120 as appropriate.

Verified locally with CUDA 13.2 + RTX 5090 (sm_120): builds and
installs cleanly without any other changes.

Reference for the dropped capabilities:
https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cpplint's --root=.. workaround derived the expected header-guard prefix
from the parent directory name. After renaming the repo to ExaBoost,
that prefix changed from LIGHTGBM_INCLUDE_*_H_ to EXABOOST_INCLUDE_*_H_
and every header now fails build/header_guard.

We deliberately did not rename the C/C++ symbols (still LightGBM,
LGBM_*, import lightgbm) to keep ExaBoost binary-compatible. Disable
the header-guard check in the cpplint pre-commit hook to match the
existing setup in .ci/lint-cpp.sh.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Builds with -DCMAKE_CUDA_ARCHITECTURES (e.g. "120-real" for a single-GPU
local iteration on RTX 5090) currently get overwritten unconditionally
by the toolkit-version-driven CUDA_ARCHS list, producing a multi-arch
build that takes much longer to compile and isn't what the user asked
for.

Wrap the existing toolkit-version logic in a check that only applies it
when CMAKE_CUDA_ARCHITECTURES is unset or empty. When the user passes
it explicitly, use their value verbatim.

No behavior change for users who don't pass the flag.

Composes with lightgbm-org#5 (the toolkit-version gating for CUDA 13.x dropped
archs) — both branches together give a sane default that adapts to the
toolkit, plus an escape hatch for fast local iteration.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aligns with the existing convention used by test_engine.py's CUDA-only
tests. Addresses Felix's review note (same change going on lightgbm-org#6/lightgbm-org#8/lightgbm-org#9/lightgbm-org#10).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
sklearn>=1.9 dev routes check_classification_targets and LabelEncoder
through narwhals, which raises TypeError on a bare pyarrow Array /
ChunkedArray ("Please set `allow_series=True` or `series_only=True`")
because sklearn does not pass that flag. The Python - latest versions
(manylinux_2_28) CI job has been failing for 18 test variants of
test_classification_and_regression_minimally_work_with_all_accepted_data_types
on every PR for this reason.

We advertise pyarrow Array / ChunkedArray as accepted label types
(_LGBM_LabelType), so the user-facing contract should be preserved.
Convert eagerly to numpy at the top of LGBMClassifier.fit, before
calling into sklearn — _LGBMAssertAllFinite, _LGBMCheckClassificationTargets,
and _LGBMLabelEncoder all see a familiar 1-D array.

No behavior change for non-pyarrow y. Regression tests (LGBMRegressor)
don't hit this path because they don't call check_classification_targets;
they were already passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Compares matched configurations on CPU and CUDA at tight tolerance
(1e-5 raw_score, exact tree structure). Initial run on 19 tiny configs
finds 6 with real prediction divergence (reg_quantile, reg_categorical,
reg_l1, reg_bagging, reg_max_depth, multi_dense) and 13 where
predictions match at FP epsilon despite tree-dump threshold differences.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The CUDA categorical split-finder kernels accepted min_data_per_group as
a function parameter but never referenced it in the function body, so
the constraint had zero effect on CUDA training. CPU correctly enforces
it via FindBestThresholdCategoricalInner in feature_histogram.cpp.

Add the missing left/right count check to the candidate-acceptance
condition in both the shared-memory and global-memory variants of the
categorical kernel, in both the left-to-right and right-to-left scans.

Verified with scratch/probe_categorical3.py: across min_data_per_group
values from 1 to 1,000,000, CPU and CUDA now produce identical splits
or both correctly decline to split. Also closes the reg_categorical
case in the broader CPU/CUDA parity sweep.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Felix asked for a real CI-runnable regression test that locks in the
categorical-kernel fix. Mirrors the scratch/probe_categorical3.py probe:
on a 200-row, 5-category dataset (~40 rows per group), train one round
on CPU and CUDA at min_data_per_group in {10, 41, 100, 1000} and assert
both produce the same split decision.

Before the fix, CUDA accepted the split at mdpg in {100, 1000, 1_000_000}
while CPU correctly refused; the assertion (None, None) != (0, 44.910)
trips loudly.

Gated on TASK=cuda to match the existing CUDA-only test pattern in
test_engine.py.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The reg_categorical case now lives in test_engine.py as a real
regression test, so the dev-only parity script no longer needs
to ship in the production tree. Removing it also clears the lint
errors (T201 print, F841 unused var) that were blocking CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The CUDA PercentileDevice (used by L1 and quantile leaf-value renewal)
computed the percentile position against `len` instead of `len - 1`,
and indexed it as 0-based instead of CPU's 1-based-with-+1 offset.
For alpha=0.5 (median), this returned the upper-middle element on
even-length arrays and the average of the upper-middle and median on
odd-length arrays - i.e., systematically biased upward in the
descending-sort convention that PercentileDevice uses.

CPU PercentileFun (src/objective/regression_objective.hpp:28-29):

    const double float_pos = static_cast<double>(cnt_data - 1) * (1.0 - alpha);
    const data_size_t pos = static_cast<data_size_t>(float_pos) + 1;
    ...
    const double bias = float_pos - (pos - 1);

This matches the standard Type-7 interpolated quantile (numpy.median,
R's quantile() default).

Verified against numpy:
  reg_l1     leaf-value max delta vs np.median:    0.5 -> 0.0 (after fix)
  reg_quantile leaf-value max delta vs np.quantile: 0.6 -> 0.0 (after fix)

After this fix every leaf in the parity benchmark reproducer matches
its numpy counterpart to FP epsilon. There is a residual structural
divergence on reg_l1 (CPU and CUDA disagree on a few splits) which
will be investigated separately - this PR fixes only the leaf-value
calculation.

The weighted-percentile path uses different conventions on CPU and
CUDA (ascending vs descending sort, alpha vs 1-alpha threshold) and
is left untouched here. None of our parity tests exercise it.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Regression coverage for the unweighted PercentileDevice formula fix
(prior commit). Three parametrized tests, all gated on
LIGHTGBM_TEST_CUDA=1 so they only run on a CUDA-enabled build:

- test_cuda_l1_leaf_renewal_matches_numpy_median: across 3 random
  seeds, asserts every leaf value on both CPU and CUDA matches
  numpy.median over the leaf's data points.
- test_cuda_quantile_leaf_renewal_matches_numpy_quantile: same shape
  but parametrized over alpha = 0.1, 0.25, 0.5, 0.7, 0.9 to cover
  every even/odd leaf-size combination of the percentile bias.
- test_cuda_l1_median_handles_small_even_and_odd_leaves: targets the
  exact failure mode of the old formula (even-length leaves returned
  sorted[1] instead of avg(sorted[1], sorted[2])) by sweeping leaves
  of size 2, 3, 4, 5, 8, 9.

Tolerance is 1e-6 - well below the ~0.3 bias the old formula
produced, but loose enough to absorb label_t float32 quantization
inside the renewal kernel.

Verified: with the prior commit reverted, 13 of 14 cases fail with
bias > 1e-6; with the fix applied, all 14 pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aligns with the existing convention used by test_engine.py's CUDA-only
tests. Addresses Felix's review note (same change going on lightgbm-org#7/lightgbm-org#8/lightgbm-org#9/lightgbm-org#10).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Same bug as PR lightgbm-org#6 fixed for the in-block PercentileDevice, but in the
global-memory kernel used for init-score computation. The unweighted
branch of PercentileGlobalKernel computed the percentile position
against `len` instead of `len - 1`, biasing alpha=0.5 toward the
upper-middle element on descending-sort layouts.

Reproducer (with the Python wrapper's optimization that drops uniform
weights, this is the path actually executed by `objective=regression_l1`
or `quantile` when sample weights aren't supplied or are all 1):

  y = [1, 2, 3, 4, 5]
  init_score (numpy median): 3.0
  CPU init_score:            3.0  (correct)
  CUDA init_score (before):  3.5  (biased toward upper)
  CUDA init_score (after):   3.0  (correct)

This fix mirrors PR lightgbm-org#6 in PercentileDevice and uses the same Type-7
interpolated-quantile formula:

  float_pos = (1 - alpha) * (len - 1)
  pos       = floor(float_pos) + 1
  bias      = float_pos - (pos - 1)

Parity-sweep impact:

  reg_l1     max|Δ|: 0.25  -> 0.000e+00
  reg_quantile max|Δ|: 0.54  -> 0.000e+00

The weighted branch of PercentileGlobalKernel uses different
conventions and is not touched by this PR. There appears to be an
unrelated bug in the CPU `WeightedPercentileFun` macro (off-by-one in
which cdf delta is used in the interpolation), but that affects only
non-uniform-weight workloads and is out of scope here - the Python
wrapper drops uniform weights, so this PR's unweighted-formula fix
already covers the common path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Regression coverage for the prior commit. 24 parametrized cases
across (objective, alpha, n) verifying the init score logged by
'Start training from score' matches between CPU and CUDA at FP epsilon.

Without the fix, regression_l1 (alpha=0.5) and quantile failed for
small n where the formula bias landed on a different element.

Gated on LIGHTGBM_TEST_CUDA=1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aligns with the existing convention used by test_engine.py's CUDA-only
tests. Addresses Felix's review note (same change going on lightgbm-org#6/lightgbm-org#7/lightgbm-org#8/lightgbm-org#10).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…PT006 fix

Squashes four local iterations: drop the prediction-parity test and
num_leaves parity assertion (keep only the two depth assertions), drop
redundant objective=regression (default value), use tuple for
parametrize argnames (ruff PT006), and shrink fixture to n=64 / 4
features / min_data_in_leaf=1 — cuts runtime ~6x.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
# Conflicts:
#	tests/python_package_test/test_dual.py
The function had two related bugs:

1. shared_buffer is declared __shared__ REDUCE_VAL_T shared_buffer[WARPSIZE]
   (32 entries), but the line `const REDUCE_VAL_T thread_base =
   shared_buffer[threadIdx.x]` reads at threadIdx.x in [0, blockDim.x).
   When blockDim.x > WARPSIZE (e.g. 256 for the L1/quantile renewal
   kernels), threadIdx.x in [WARPSIZE, blockDim.x) reads out-of-bounds
   shared memory.

2. The loop body `out_values[index] = thread_base + in_values[...]`
   does not cumulate within the per-thread chunk. It is correct only
   when num_data_per_thread == 1.

Together these manifest as an "illegal memory access" crash on weighted
L1 / weighted quantile training with n >= ~100 samples. Symptom:

    [LightGBM] [Fatal] [CUDA] an illegal memory access was encountered
    .../cuda_regression_objective.cu 225 (SynchronizeCUDADevice after
    RenewTreeOutputCUDAKernel_RegressionL1<USE_WEIGHT=true>)

Fix: use the per-thread exclusive prefix sum already returned by
ShufflePrefixSumExclusive (matching the existing correct usage in
GlobalMemoryPrefixSum at line 183), and cumulate inclusively across
the chunk.

Verified: weighted L1 and weighted quantile now train successfully on
n in {100, 200, 500, 1000} on RTX 5090 / CUDA 13.2. Predictions match
CPU within the typical L1/quantile FP-precision range.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Verifies CUDA weighted L1 / weighted quantile training does not raise
"illegal memory access" for n in {100, 200, 500, 1000}. Without the
prior fix, these all crashed in ShuffleSortedPrefixSumDevice.

Gated on LIGHTGBM_TEST_CUDA=1.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Aligns with the existing convention used by test_engine.py's CUDA-only
tests (getenv("TASK", "") != "cuda"). Addresses Felix's review note on
PR lightgbm-org#8 (and the matching note on lightgbm-org#6, lightgbm-org#7, lightgbm-org#9, lightgbm-org#10).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BelixRogner and others added 3 commits May 19, 2026 12:42
…orcement

[cuda] enforce max_depth on CUDA tree learner
The CUDA tree learner (cuda_single_gpu) never consults monotone
constraints when selecting splits or computing leaf outputs: the
best-split finder and CUDALeafSplits have no constraint min/max
plumbing, unlike the CPU FeatureHistogram path. As a result a model
trained on CUDA with monotone_constraints silently violated the
requested monotonic relationship, which is a correctness/safety issue
for users who rely on monotonicity for regulatory reasons.

Until on-GPU enforcement is implemented, fail fast in
Config::CheckParamConflict (matching the existing house convention of
Log::Fatal for unsupported CUDA features) so users are not handed a
non-compliant model. CPU and GPU(OpenCL) behavior is unchanged.

Adds a parametrized regression test in test_dual.py asserting CUDA
raises while CPU still trains with the same constraints.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ee learner

Implements the "basic" monotone constraint method on CUDA so that
device_type="cuda" enforces monotone_constraints like device_type="cpu":

- CalculateSplittedLeafOutputMC / GetSplitGainsMC device functions mirror the
  CPU USE_MC formulas: candidate child outputs are clamped into the leaf's
  [min, max] constraint range and splits whose clamped outputs violate the
  requested monotone direction are rejected (gain 0)
- each SplitFindTask carries its feature's monotone type (real-feature-indexed
  config mapped to inner feature indices)
- per-leaf [min, max] constraint ranges are tracked on host, mirroring
  BasicLeafConstraints::Update midpoint propagation, and passed to the split
  kernels for the (smaller, larger) leaf pair each iteration
- the leaf VALUES stored in CUDASplitInfo are the clamped outputs (used for the
  tree and for constraint propagation), while the leaf GAINS keep their
  unconstrained values (they form the child's parent_gain baseline, matching
  the CPU BeforeNumerical gain_shift computation)
- unsupported monotone configurations on CUDA fail fast with a clear error:
  monotone_constraints_method=intermediate/advanced, monotone_penalty>0, and
  use_quantized_grad+monotone

This replaces the previous Log::Fatal guard: basic-method monotone constraints
now work on CUDA instead of being rejected.

Before this change CUDA silently ignored monotone constraints: on a 600x3
regression with constraints [1,1,1] over 100 rounds, the CUDA model violated
monotonicity at 456 of 4500 grid points checked (worst violation 0.39), with
predictions diverging from CPU by up to 2.8. After this change CUDA models have
ZERO monotonicity violations across all tested constraint vectors, round counts
and seeds, with training MSE matching CPU within 3%.

Note on tree-structure parity: CPU and CUDA can build different (both valid)
constrained trees because CPU additionally prunes candidate features through
its FeatureHistogram::is_splittable_ cache, a search-space heuristic whose
exact replication on GPU would require per-(leaf, feature) state tracking with
additional host-device round trips. The constraint guarantee itself - the
property users rely on - is enforced identically, predictions match bit-for-bit
when constraints are inactive ([0,0,0]), and constrained training quality is
equivalent (MSE within 3% of CPU, sometimes better).

Adds 18 regression tests to test_dual.py: 12 enforcement cases (zero violations
required), 3 CPU-quality-equivalence cases, 2 noop-constraint bit-parity cases,
and 1 unsupported-config rejection case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@maxwbuckley maxwbuckley changed the title [cuda] reject monotone_constraints instead of silently ignoring them [cuda] implement basic monotone constraint enforcement on the CUDA tree learner Jun 1, 2026
@BelixRogner

Copy link
Copy Markdown
Owner

Thank you, Max — and thank you, Claude Code (independently 🙂). This one's a clean bill of health.

I ran a deep correctness review of the device math against the CPU USE_MC path in feature_histogram.hpp and BasicLeafConstraints::Update. Verdict: SOLID, no blocking issues.

  • CalculateSplittedLeafOutputMC / GetSplitGainsMC match the CPU clamp + rejection predicate character-for-character ((mc>0 && left>right) || (mc<0 && left<right) → gain 0).
  • The delicate part — leaf VALUE clamped vs leaf GAIN unconstrained — is correct. Verified the stored left_gain/right_gain flow into cuda_data_partition.cu as the next level's parent_gain baseline, exactly like CPU's BeforeNumerical. Using clamped outputs there would have inflated child gains; you got it right.
  • Host UpdateLeafConstraints mirrors the midpoint propagation (UpdateMin/UpdateMax, parent/new-leaf roles) correctly.
  • All three scope guards (intermediate/advanced, monotone_penalty>0, use_quantized_grad) are unconditional Log::Fatal — no silent fall-through — and all tested.

Non-blocking note: path_smooth>0 and lambda_l1>0 combined with monotone aren't exercised by the new tests; the formulas are shared with the already-tested non-MC paths, so low risk.

Only thing standing between this and merge: it fails the required ruff format check, and since all five of your CUDA PRs touch the same files, it'll need a git merge master once the others land. The macOS CI red is the known dask socket flake — unrelated.

P.S. — you wrote a 4,500-point monotonicity grid harness with bit-exact CPU parity… and then didn't run ruff format? 😄 All five PRs tripped it. Exa-row scale, zero-row lint discipline.

@maxwbuckley maxwbuckley marked this pull request as ready for review June 3, 2026 20:29
maxwbuckley and others added 4 commits June 3, 2026 22:41
…aints-error

# Conflicts:
#	src/treelearner/cuda/cuda_best_split_finder.cpp
#	src/treelearner/cuda/cuda_best_split_finder.hpp
#	src/treelearner/cuda/cuda_single_gpu_tree_learner.cpp
#	src/treelearner/cuda/cuda_single_gpu_tree_learner.hpp
#	tests/python_package_test/test_dual.py
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants