Fix CUDA quantized gradient training parity with CPU (4-defect cascade)#23
Closed
maxwbuckley wants to merge 5 commits into
Closed
Fix CUDA quantized gradient training parity with CPU (4-defect cascade)#23maxwbuckley wants to merge 5 commits into
maxwbuckley wants to merge 5 commits into
Conversation
… in quantized training CUDAGradientDiscretizer stores, per data point, an int16 discretized gradient and an int16 discretized hessian. DiscretizeGradientsKernel writes them through a reinterpret_cast<int16_t*> view at indices [2*i] and [2*i+1], and every consumer (the histogram constructor and CUDAInitValuesKernel3) reads the buffer back as int16_t. That layout needs 2 * sizeof(int16_t) = 4 bytes per data point. The backing buffer is int8_t and was sized num_data * 2 (the original 8-bit layout), i.e. only 2 bytes per point -- half of what the int16 path writes. DiscretizeGradientsKernel therefore overran the buffer by num_data * 2 bytes, corrupting the adjacent device allocations, which in practice are the gradient/hessian dequantization scale buffers (grad_max_block_buffer_ / hess_max_block_buffer_). The corrupted dequant scale (~1.4e-45, the int bit pattern of the bin count reinterpreted as float) made the root leaf's sum_hessians come out as a denormal ~0 instead of the true value, so the root failed the sum_hessians > min_sum_hessian_in_leaf validity check and never split. Result: under device_type=cuda + use_quantized_grad=true, every tree collapsed to a single leaf and the model did not learn. Sizing the buffer as num_data * 4 removes the overflow. The dequant scales stay correct, the root sum_hessians is accurate, and CUDA quantized training produces real splitting trees again. Before (this same minimal regression setup, 20 rounds): CUDA quantized: 1 leaf/tree, max|CUDA-CPU quantized pred| = 10.99 After: CUDA quantized: multi-leaf splitting trees, max|CUDA-CPU| = 6.81 (the residual gap is a separate, still-open quantized bug in the child-leaf split bookkeeping; tracked separately.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…uffer fix
Asserts that device_type=cuda + use_quantized_grad=true produces real
splitting trees (num_leaves > 1) with non-degenerate predictions, across
n in {200,500,1000} and num_leaves in {7,31}. Without the buffer-size fix
the discretize kernel overran the discretized gradient/hessian buffer,
corrupted the dequant scale buffers, and every quantized tree collapsed
to a single leaf -- which this test would catch (max leaf count == 1).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…zed trees grow
Two matched defects that together stop CUDA use_quantized_grad trees from growing
past a couple of leaves. Both concern sum_of_gradients_hessians -- the int64 packed
(gradient<<32 | hessian) leaf total that the DISCRETIZED best-split finder uses to
derive cnt_factor = num_data/sum_hessians and the left = total - right split sums.
(Builds on the discretized-buffer-overflow fix, without which quantized never splits
at all.)
1) cuda_split_info.hpp: CUDASplitInfo::operator= copied left_/right_sum_gradients,
_sum_hessians, _count, _gain, _value but NOT left_/right_sum_of_gradients_hessians.
FindBestFromAllSplits assigns the winning split through this operator, so the
packed sums were dropped (read back as 0).
2) cuda_data_partition.cu (SplitTreeStructureKernel): the child leaf splits had their
double sum_of_gradients/sum_of_hessians refreshed on split but never their packed
sum_of_gradients_hessians, so each child inherited the parent's total.
Combined effect before the fix: at the first sub-split the finder used the child's
real per-bin histogram but the parent's stale (or zeroed) total, scoring phantom
parent-remainder splits that partitioned to a 0-data leaf -> growth halted at 2-4
leaves while CPU grew the full 31.
After: CUDA quantized trees grow to match CPU's structure. Leaf counts over 20 rounds
(num_leaves=31), CPU-quant vs CUDA-quant:
n=1000 min_data=300: 54 vs 54 (and bit-identical predictions, max|diff|=0)
n=1000 min_data=200: 76 vs 75
n=2000 min_data=300: 100 vs 106
n=2000 min_data=200: 152 vs 157
versus the broken ~2 leaves/tree before.
SCOPE / honest limitations -- this is a STRUCTURAL fix, not full prediction parity:
- Predictions are exact in some configs (n=1000/min_data=300 -> 0) but still diverge
in others (e.g. n=2000/min_data=200 -> ~29). Two separate, still-open causes:
* leaves small enough for an 8-bit histogram (num_data*num_grad_quant_bins < 256)
get a wrong near-zero dequantized sum_hessian and exploded leaf outputs -- a
distinct bug in the adaptive 8-bit histogram path;
* cross-feature gain-tie ordering differs from CPU (addressed on a separate branch).
- The accompanying test asserts the structural property (CUDA grows to within 20% of
CPU's leaf count), which is what this change robustly guarantees, not tight parity.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…aves
SplitTreeStructureKernel hands each newly-created (smaller) child leaf a histogram
slot in cuda_hist_. Slots are laid out at a 2 * num_total_bin stride per leaf -- the
buffer is sized num_total_bin * 2 * num_leaves, and the right-is-smaller branch always
assigns cuda_hist + 2 * right_leaf_index * num_total_bin. But the left-is-smaller branch
used a 1x stride for the discretized (use_quantized_grad) path:
cuda_hist_pool[left_leaf_index] = USE_GRAD_DISCRETIZED ?
cuda_hist + right_leaf_index * num_total_bin : // 1x <-- bug
cuda_hist + 2 * right_leaf_index * num_total_bin; // 2x
A 1x-strided child could be handed a slot already owned by a live 2x-strided leaf.
Since ZeroHistForLeaf is a no-op (the whole buffer is zeroed once per tree), the child's
ConstructHistogramForLeaf then atomicAdded its data on top of the other leaf's histogram.
Observed directly (n=300, 8 features, so a clean leaf's histogram hessian sums to
num_data*8*4): a 45-data leaf landed on a 61-data leaf's slot (both slot_off=3264) and
its histogram summed to 3392 = (61+45)*32 instead of 1440. The polluted histogram gave
the finder phantom splits whose threshold the data partition could not reproduce (it put
all data on one side -> 0-data leaves), and leaves ended with near-zero sum_hessian whose
outputs exploded to 1e12-1e20 over 20 rounds at small min_data_in_leaf.
Fix: use the same 2x stride in the left-is-smaller branch. Slots are then distinct and
the allocation still fits exactly (2 * num_leaves * num_total_bin).
Before -> after, max|CUDA-CPU quantized prediction| (20 rounds), with lightgbm-org#8+lightgbm-org#9+lightgbm-org#10 in place:
n=1000 min_data=20: ~6e12 -> 0.997 (CUDA leaves 620 == CPU 620)
n=2000 min_data=20: ~1e20 -> 1.525 (CUDA 620 == CPU 620)
n=1000/2000 min_data=200: -> 0 (bit-identical)
The residual ~1-1.5 at small min_data is the separate cross-feature gain-tie ordering
(addressed on cuda/gain-tie-break), not this bug.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summarize the four stacked defects (lightgbm-org#8 discretized buffer overflow, lightgbm-org#9/lightgbm-org#10 child-leaf packed-sum, lightgbm-org#11 histogram-slot collision) that left CUDA use_quantized_grad training broken, with root causes, fixes, how each was found, and before/after benchmarks. Records the disproven 8-bit-bit-width hypothesis and the remaining gain-tie residual. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
BelixRogner
pushed a commit
that referenced
this pull request
Jun 14, 2026
…histogram read (#23, #26) Lands the CUDA quantized/discretized-gradient parity cascade (PR #23) plus the 32-bit discretized-histogram read fix (PR #26, a superset of #23), direct-merged because the branches could not be rebased onto the collapsed master through their forks. Five distinct defects: - Discretized grad/hess buffer was 2x under-allocated (int16 grad + int16 hess = 4 bytes/point, sized as num_data*2). Overran into the dequant scale buffers -> denormal root sum_hessians -> single-leaf collapse. - Child leaves never had their packed int64 grad/hess sum refreshed on split, so they inherited the parent total -> phantom parent-remainder splits. (Required adding the two packed-sum fields to CUDASplitInfo's operator=, which had silently dropped them.) - Histogram-slot stride was inconsistent (1x vs 2x) between the left-smaller and right-smaller branches under the discretized path, aliasing live leaf histograms; force 2x everywhere. - The best-split finder read the 32-bit discretized histogram as int32 where the constructor writes int64 (>=4096-row promotion), reading half the bytes and indexing in the wrong units -> garbage gain at scale. Adds four CPU/CUDA parity regression tests (TASK=cuda). Authored-by: Max Buckley <maxwbuckley@gmail.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Owner
|
Thank you, Max — and thank you, Claude Code (independently 🙂). Landed on Deep-reviewed as SOLID: the four defects are genuinely distinct (not double-counting), each fixes the root data-flow issue rather than a symptom, and three of them are latent in upstream CUDA quantized training too. All three of this PR's parity tests came along in the merge, plus #26's extra 32-bit test. Direct-merged (preserving your authorship) because the branch couldn't be rebased onto the collapsed |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
CUDA
use_quantized_gradtraining parity with CPUThis PR fixes a cascade of four defects that left CUDA quantized gradient
training (
device_type=cuda+use_quantized_grad=true) effectively broken,bringing it to parity with the CPU implementation. Each defect was masking the
next, so the fixes are presented as one stack with atomic commits.
A full write-up with root causes, how each was found, and benchmarks is in
CUDA_QUANTIZED_PARITY_FIXES.md(added in this PR).All benchmarks diff CPU-quantized vs CUDA-quantized predictions on a fixed
synthetic regression set, 20 rounds,
num_leaves=31,gpu_use_dp=true,deterministic, reporting
max|CUDA − CPU|.The four fixes (atomic commits)
#8 — discretized buffer under-allocated 2× (
cuda_gradient_discretizer.hpp)The discretized grad/hess buffer is
int8_tsizednum_data*2, but the kernelwrites an int16 gradient + int16 hessian per point (
num_data*4bytes). Theoverflow corrupted the adjacent dequant-scale buffers, making the root
sum_hessiansa denormal ~0, so the root never split and every tree was a singleleaf. Fix: size it
num_data*4.Before: 1 leaf/tree,
max|Δ|=10.99. After: real trees,6.81.#9 + #10 — child-leaf packed sum not propagated (
cuda_data_partition.cu,cuda_split_info.hpp)The int64 packed leaf total (
sum_of_gradients_hessians, used by the discretizedfinder) never reached child leaves:
CUDASplitInfo::operator=dropped the packedfields, and
SplitTreeStructureKernelnever refreshed them on split. Childreninherited the parent's total, scored phantom splits, and growth stalled at 2–4
leaves vs CPU's 31. Fix: copy the packed fields in
operator=and propagate themin the split kernel.
After: CUDA leaf counts match CPU (e.g. 54 vs 54, bit-identical predictions
at
min_data_in_leaf=300).#11 — histogram-slot collision (
cuda_data_partition.cu)SplitTreeStructureKernel's left-is-smaller branch assigned the discretized childa histogram slot at 1× stride while every other path uses 2× (the buffer is
num_total_bin * 2 * num_leaves). A 1×-strided child could land on a live leaf'sslot; since
ZeroHistForLeafis a no-op, its histogram accumulated on top of theother leaf's data (a 45-point leaf summed to 106 points' worth) → phantom splits,
0-data leaves, and outputs exploding to 1e12–1e20. Fix: use 2× in both branches.
Before → after
max|CUDA − CPU|, 20 rounds:Tests
Regression tests added to
tests/python_package_test/test_dual.py(run withTASK=cuda), one per fix:test_cuda_quantized_training_produces_splits(#8),test_cuda_quantized_tree_structure_matches_cpu(#9/#10),test_cuda_quantized_deep_trees_track_cpu(#11). All pass; ruff clean.Remaining
The residual ~1–1.5 at small
min_data_in_leafis the separate cross-featuregain-tie ordering difference (CPU left-to-right scan vs CUDA reduction), tracked
on a separate branch — not part of this PR.
🤖 Generated with Claude Code