Fix CUDA quantized gradient training parity with CPU (4-defect cascade) by maxwbuckley · Pull Request #23 · BelixRogner/ExaBoost

maxwbuckley · 2026-06-11T18:25:27Z

CUDA `use_quantized_grad` training parity with CPU

This PR fixes a cascade of four defects that left CUDA quantized gradient
training (device_type=cuda + use_quantized_grad=true) effectively broken,
bringing it to parity with the CPU implementation. Each defect was masking the
next, so the fixes are presented as one stack with atomic commits.

A full write-up with root causes, how each was found, and benchmarks is in
CUDA_QUANTIZED_PARITY_FIXES.md (added in this PR).

All benchmarks diff CPU-quantized vs CUDA-quantized predictions on a fixed
synthetic regression set, 20 rounds, num_leaves=31, gpu_use_dp=true,
deterministic, reporting max|CUDA − CPU|.

The four fixes (atomic commits)

#8 — discretized buffer under-allocated 2× (cuda_gradient_discretizer.hpp)
The discretized grad/hess buffer is int8_t sized num_data*2, but the kernel
writes an int16 gradient + int16 hessian per point (num_data*4 bytes). The
overflow corrupted the adjacent dequant-scale buffers, making the root
sum_hessians a denormal ~0, so the root never split and every tree was a single
leaf. Fix: size it num_data*4.
Before: 1 leaf/tree, max|Δ|=10.99. After: real trees, 6.81.

#9 + #10 — child-leaf packed sum not propagated (cuda_data_partition.cu,
cuda_split_info.hpp)
The int64 packed leaf total (sum_of_gradients_hessians, used by the discretized
finder) never reached child leaves: CUDASplitInfo::operator= dropped the packed
fields, and SplitTreeStructureKernel never refreshed them on split. Children
inherited the parent's total, scored phantom splits, and growth stalled at 2–4
leaves vs CPU's 31. Fix: copy the packed fields in operator= and propagate them
in the split kernel.
After: CUDA leaf counts match CPU (e.g. 54 vs 54, bit-identical predictions
at min_data_in_leaf=300).

#11 — histogram-slot collision (cuda_data_partition.cu)
SplitTreeStructureKernel's left-is-smaller branch assigned the discretized child
a histogram slot at 1× stride while every other path uses 2× (the buffer is
num_total_bin * 2 * num_leaves). A 1×-strided child could land on a live leaf's
slot; since ZeroHistForLeaf is a no-op, its histogram accumulated on top of the
other leaf's data (a 45-point leaf summed to 106 points' worth) → phantom splits,
0-data leaves, and outputs exploding to 1e12–1e20. Fix: use 2× in both branches.

Before → after max|CUDA − CPU|, 20 rounds:

n	min_data	Before	After	CUDA / CPU leaves
1000	20	~6e12	0.997	620 / 620
2000	20	~1e20	1.525	620 / 620
1000	200	exploded	0	76 / 76
2000	200	exploded	0	152 / 152

Tests

Regression tests added to tests/python_package_test/test_dual.py (run with
TASK=cuda), one per fix:
test_cuda_quantized_training_produces_splits (#8),
test_cuda_quantized_tree_structure_matches_cpu (#9/#10),
test_cuda_quantized_deep_trees_track_cpu (#11). All pass; ruff clean.

Remaining

The residual ~1–1.5 at small min_data_in_leaf is the separate cross-feature
gain-tie ordering difference (CPU left-to-right scan vs CUDA reduction), tracked
on a separate branch — not part of this PR.

🤖 Generated with Claude Code

… in quantized training CUDAGradientDiscretizer stores, per data point, an int16 discretized gradient and an int16 discretized hessian. DiscretizeGradientsKernel writes them through a reinterpret_cast<int16_t*> view at indices [2*i] and [2*i+1], and every consumer (the histogram constructor and CUDAInitValuesKernel3) reads the buffer back as int16_t. That layout needs 2 * sizeof(int16_t) = 4 bytes per data point. The backing buffer is int8_t and was sized num_data * 2 (the original 8-bit layout), i.e. only 2 bytes per point -- half of what the int16 path writes. DiscretizeGradientsKernel therefore overran the buffer by num_data * 2 bytes, corrupting the adjacent device allocations, which in practice are the gradient/hessian dequantization scale buffers (grad_max_block_buffer_ / hess_max_block_buffer_). The corrupted dequant scale (~1.4e-45, the int bit pattern of the bin count reinterpreted as float) made the root leaf's sum_hessians come out as a denormal ~0 instead of the true value, so the root failed the sum_hessians > min_sum_hessian_in_leaf validity check and never split. Result: under device_type=cuda + use_quantized_grad=true, every tree collapsed to a single leaf and the model did not learn. Sizing the buffer as num_data * 4 removes the overflow. The dequant scales stay correct, the root sum_hessians is accurate, and CUDA quantized training produces real splitting trees again. Before (this same minimal regression setup, 20 rounds): CUDA quantized: 1 leaf/tree, max|CUDA-CPU quantized pred| = 10.99 After: CUDA quantized: multi-leaf splitting trees, max|CUDA-CPU| = 6.81 (the residual gap is a separate, still-open quantized bug in the child-leaf split bookkeeping; tracked separately.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…uffer fix Asserts that device_type=cuda + use_quantized_grad=true produces real splitting trees (num_leaves > 1) with non-degenerate predictions, across n in {200,500,1000} and num_leaves in {7,31}. Without the buffer-size fix the discretize kernel overran the discretized gradient/hessian buffer, corrupted the dequant scale buffers, and every quantized tree collapsed to a single leaf -- which this test would catch (max leaf count == 1). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…zed trees grow Two matched defects that together stop CUDA use_quantized_grad trees from growing past a couple of leaves. Both concern sum_of_gradients_hessians -- the int64 packed (gradient<<32 | hessian) leaf total that the DISCRETIZED best-split finder uses to derive cnt_factor = num_data/sum_hessians and the left = total - right split sums. (Builds on the discretized-buffer-overflow fix, without which quantized never splits at all.) 1) cuda_split_info.hpp: CUDASplitInfo::operator= copied left_/right_sum_gradients, _sum_hessians, _count, _gain, _value but NOT left_/right_sum_of_gradients_hessians. FindBestFromAllSplits assigns the winning split through this operator, so the packed sums were dropped (read back as 0). 2) cuda_data_partition.cu (SplitTreeStructureKernel): the child leaf splits had their double sum_of_gradients/sum_of_hessians refreshed on split but never their packed sum_of_gradients_hessians, so each child inherited the parent's total. Combined effect before the fix: at the first sub-split the finder used the child's real per-bin histogram but the parent's stale (or zeroed) total, scoring phantom parent-remainder splits that partitioned to a 0-data leaf -> growth halted at 2-4 leaves while CPU grew the full 31. After: CUDA quantized trees grow to match CPU's structure. Leaf counts over 20 rounds (num_leaves=31), CPU-quant vs CUDA-quant: n=1000 min_data=300: 54 vs 54 (and bit-identical predictions, max|diff|=0) n=1000 min_data=200: 76 vs 75 n=2000 min_data=300: 100 vs 106 n=2000 min_data=200: 152 vs 157 versus the broken ~2 leaves/tree before. SCOPE / honest limitations -- this is a STRUCTURAL fix, not full prediction parity: - Predictions are exact in some configs (n=1000/min_data=300 -> 0) but still diverge in others (e.g. n=2000/min_data=200 -> ~29). Two separate, still-open causes: * leaves small enough for an 8-bit histogram (num_data*num_grad_quant_bins < 256) get a wrong near-zero dequantized sum_hessian and exploded leaf outputs -- a distinct bug in the adaptive 8-bit histogram path; * cross-feature gain-tie ordering differs from CPU (addressed on a separate branch). - The accompanying test asserts the structural property (CUDA grows to within 20% of CPU's leaf count), which is what this change robustly guarantees, not tight parity. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…aves SplitTreeStructureKernel hands each newly-created (smaller) child leaf a histogram slot in cuda_hist_. Slots are laid out at a 2 * num_total_bin stride per leaf -- the buffer is sized num_total_bin * 2 * num_leaves, and the right-is-smaller branch always assigns cuda_hist + 2 * right_leaf_index * num_total_bin. But the left-is-smaller branch used a 1x stride for the discretized (use_quantized_grad) path: cuda_hist_pool[left_leaf_index] = USE_GRAD_DISCRETIZED ? cuda_hist + right_leaf_index * num_total_bin : // 1x <-- bug cuda_hist + 2 * right_leaf_index * num_total_bin; // 2x A 1x-strided child could be handed a slot already owned by a live 2x-strided leaf. Since ZeroHistForLeaf is a no-op (the whole buffer is zeroed once per tree), the child's ConstructHistogramForLeaf then atomicAdded its data on top of the other leaf's histogram. Observed directly (n=300, 8 features, so a clean leaf's histogram hessian sums to num_data*8*4): a 45-data leaf landed on a 61-data leaf's slot (both slot_off=3264) and its histogram summed to 3392 = (61+45)*32 instead of 1440. The polluted histogram gave the finder phantom splits whose threshold the data partition could not reproduce (it put all data on one side -> 0-data leaves), and leaves ended with near-zero sum_hessian whose outputs exploded to 1e12-1e20 over 20 rounds at small min_data_in_leaf. Fix: use the same 2x stride in the left-is-smaller branch. Slots are then distinct and the allocation still fits exactly (2 * num_leaves * num_total_bin). Before -> after, max|CUDA-CPU quantized prediction| (20 rounds), with lightgbm-org#8+lightgbm-org#9+lightgbm-org#10 in place: n=1000 min_data=20: ~6e12 -> 0.997 (CUDA leaves 620 == CPU 620) n=2000 min_data=20: ~1e20 -> 1.525 (CUDA 620 == CPU 620) n=1000/2000 min_data=200: -> 0 (bit-identical) The residual ~1-1.5 at small min_data is the separate cross-feature gain-tie ordering (addressed on cuda/gain-tie-break), not this bug. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Summarize the four stacked defects (lightgbm-org#8 discretized buffer overflow, lightgbm-org#9/lightgbm-org#10 child-leaf packed-sum, lightgbm-org#11 histogram-slot collision) that left CUDA use_quantized_grad training broken, with root causes, fixes, how each was found, and before/after benchmarks. Records the disproven 8-bit-bit-width hypothesis and the remaining gain-tie residual. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…histogram read (#23, #26) Lands the CUDA quantized/discretized-gradient parity cascade (PR #23) plus the 32-bit discretized-histogram read fix (PR #26, a superset of #23), direct-merged because the branches could not be rebased onto the collapsed master through their forks. Five distinct defects: - Discretized grad/hess buffer was 2x under-allocated (int16 grad + int16 hess = 4 bytes/point, sized as num_data*2). Overran into the dequant scale buffers -> denormal root sum_hessians -> single-leaf collapse. - Child leaves never had their packed int64 grad/hess sum refreshed on split, so they inherited the parent total -> phantom parent-remainder splits. (Required adding the two packed-sum fields to CUDASplitInfo's operator=, which had silently dropped them.) - Histogram-slot stride was inconsistent (1x vs 2x) between the left-smaller and right-smaller branches under the discretized path, aliasing live leaf histograms; force 2x everywhere. - The best-split finder read the 32-bit discretized histogram as int32 where the constructor writes int64 (>=4096-row promotion), reading half the bytes and indexing in the wrong units -> garbage gain at scale. Adds four CPU/CUDA parity regression tests (TASK=cuda). Authored-by: Max Buckley <maxwbuckley@gmail.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

BelixRogner · 2026-06-14T15:24:27Z

Thank you, Max — and thank you, Claude Code (independently 🙂). Landed on master as d42da7f via #26 (which is a strict superset of this PR — your four parity fixes plus the 32-bit histogram read fix).

Deep-reviewed as SOLID: the four defects are genuinely distinct (not double-counting), each fixes the root data-flow issue rather than a symptom, and three of them are latent in upstream CUDA quantized training too. All three of this PR's parity tests came along in the merge, plus #26's extra 32-bit test.

Direct-merged (preserving your authorship) because the branch couldn't be rebased onto the collapsed master through the fork. Closing as landed.

maxwbuckley and others added 5 commits June 11, 2026 02:10

maxwbuckley mentioned this pull request Jun 11, 2026

[cuda] fix 32-bit discretized histogram read in best-split finder (quantized garbage at scale) #26

Closed

BelixRogner closed this Jun 14, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix CUDA quantized gradient training parity with CPU (4-defect cascade)#23

Fix CUDA quantized gradient training parity with CPU (4-defect cascade)#23
maxwbuckley wants to merge 5 commits into
BelixRogner:masterfrom
maxwbuckley:cuda/quantized-hist-slot-collision

maxwbuckley commented Jun 11, 2026

Uh oh!

BelixRogner commented Jun 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

maxwbuckley commented Jun 11, 2026

CUDA use_quantized_grad training parity with CPU

The four fixes (atomic commits)

Tests

Remaining

Uh oh!

BelixRogner commented Jun 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

CUDA `use_quantized_grad` training parity with CPU