Skip to content

Fix CUDA quantized gradient training parity with CPU (4-defect cascade)#23

Closed
maxwbuckley wants to merge 5 commits into
BelixRogner:masterfrom
maxwbuckley:cuda/quantized-hist-slot-collision
Closed

Fix CUDA quantized gradient training parity with CPU (4-defect cascade)#23
maxwbuckley wants to merge 5 commits into
BelixRogner:masterfrom
maxwbuckley:cuda/quantized-hist-slot-collision

Conversation

@maxwbuckley

Copy link
Copy Markdown
Collaborator

CUDA use_quantized_grad training parity with CPU

This PR fixes a cascade of four defects that left CUDA quantized gradient
training (device_type=cuda + use_quantized_grad=true) effectively broken,
bringing it to parity with the CPU implementation. Each defect was masking the
next, so the fixes are presented as one stack with atomic commits.

A full write-up with root causes, how each was found, and benchmarks is in
CUDA_QUANTIZED_PARITY_FIXES.md (added in this PR).

All benchmarks diff CPU-quantized vs CUDA-quantized predictions on a fixed
synthetic regression set, 20 rounds, num_leaves=31, gpu_use_dp=true,
deterministic, reporting max|CUDA − CPU|.

The four fixes (atomic commits)

#8 — discretized buffer under-allocated 2× (cuda_gradient_discretizer.hpp)
The discretized grad/hess buffer is int8_t sized num_data*2, but the kernel
writes an int16 gradient + int16 hessian per point (num_data*4 bytes). The
overflow corrupted the adjacent dequant-scale buffers, making the root
sum_hessians a denormal ~0, so the root never split and every tree was a single
leaf. Fix: size it num_data*4.
Before: 1 leaf/tree, max|Δ|=10.99. After: real trees, 6.81.

#9 + #10 — child-leaf packed sum not propagated (cuda_data_partition.cu,
cuda_split_info.hpp)
The int64 packed leaf total (sum_of_gradients_hessians, used by the discretized
finder) never reached child leaves: CUDASplitInfo::operator= dropped the packed
fields, and SplitTreeStructureKernel never refreshed them on split. Children
inherited the parent's total, scored phantom splits, and growth stalled at 2–4
leaves vs CPU's 31. Fix: copy the packed fields in operator= and propagate them
in the split kernel.
After: CUDA leaf counts match CPU (e.g. 54 vs 54, bit-identical predictions
at min_data_in_leaf=300).

#11 — histogram-slot collision (cuda_data_partition.cu)
SplitTreeStructureKernel's left-is-smaller branch assigned the discretized child
a histogram slot at 1× stride while every other path uses 2× (the buffer is
num_total_bin * 2 * num_leaves). A 1×-strided child could land on a live leaf's
slot; since ZeroHistForLeaf is a no-op, its histogram accumulated on top of the
other leaf's data (a 45-point leaf summed to 106 points' worth) → phantom splits,
0-data leaves, and outputs exploding to 1e12–1e20. Fix: use 2× in both branches.

Before → after max|CUDA − CPU|, 20 rounds:

n min_data Before After CUDA / CPU leaves
1000 20 ~6e12 0.997 620 / 620
2000 20 ~1e20 1.525 620 / 620
1000 200 exploded 0 76 / 76
2000 200 exploded 0 152 / 152

Tests

Regression tests added to tests/python_package_test/test_dual.py (run with
TASK=cuda), one per fix:
test_cuda_quantized_training_produces_splits (#8),
test_cuda_quantized_tree_structure_matches_cpu (#9/#10),
test_cuda_quantized_deep_trees_track_cpu (#11). All pass; ruff clean.

Remaining

The residual ~1–1.5 at small min_data_in_leaf is the separate cross-feature
gain-tie ordering difference (CPU left-to-right scan vs CUDA reduction), tracked
on a separate branch — not part of this PR.

🤖 Generated with Claude Code

maxwbuckley and others added 5 commits June 11, 2026 02:10
… in quantized training

CUDAGradientDiscretizer stores, per data point, an int16 discretized
gradient and an int16 discretized hessian. DiscretizeGradientsKernel
writes them through a reinterpret_cast<int16_t*> view at indices
[2*i] and [2*i+1], and every consumer (the histogram constructor and
CUDAInitValuesKernel3) reads the buffer back as int16_t. That layout
needs 2 * sizeof(int16_t) = 4 bytes per data point.

The backing buffer is int8_t and was sized num_data * 2 (the original
8-bit layout), i.e. only 2 bytes per point -- half of what the int16
path writes. DiscretizeGradientsKernel therefore overran the buffer by
num_data * 2 bytes, corrupting the adjacent device allocations, which
in practice are the gradient/hessian dequantization scale buffers
(grad_max_block_buffer_ / hess_max_block_buffer_).

The corrupted dequant scale (~1.4e-45, the int bit pattern of the bin
count reinterpreted as float) made the root leaf's sum_hessians come
out as a denormal ~0 instead of the true value, so the root failed the
sum_hessians > min_sum_hessian_in_leaf validity check and never split.
Result: under device_type=cuda + use_quantized_grad=true, every tree
collapsed to a single leaf and the model did not learn.

Sizing the buffer as num_data * 4 removes the overflow. The dequant
scales stay correct, the root sum_hessians is accurate, and CUDA
quantized training produces real splitting trees again.

Before (this same minimal regression setup, 20 rounds):
  CUDA quantized: 1 leaf/tree, max|CUDA-CPU quantized pred| = 10.99
After:
  CUDA quantized: multi-leaf splitting trees, max|CUDA-CPU| = 6.81
  (the residual gap is a separate, still-open quantized bug in the
   child-leaf split bookkeeping; tracked separately.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…uffer fix

Asserts that device_type=cuda + use_quantized_grad=true produces real
splitting trees (num_leaves > 1) with non-degenerate predictions, across
n in {200,500,1000} and num_leaves in {7,31}. Without the buffer-size fix
the discretize kernel overran the discretized gradient/hessian buffer,
corrupted the dequant scale buffers, and every quantized tree collapsed
to a single leaf -- which this test would catch (max leaf count == 1).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…zed trees grow

Two matched defects that together stop CUDA use_quantized_grad trees from growing
past a couple of leaves. Both concern sum_of_gradients_hessians -- the int64 packed
(gradient<<32 | hessian) leaf total that the DISCRETIZED best-split finder uses to
derive cnt_factor = num_data/sum_hessians and the left = total - right split sums.
(Builds on the discretized-buffer-overflow fix, without which quantized never splits
at all.)

1) cuda_split_info.hpp: CUDASplitInfo::operator= copied left_/right_sum_gradients,
   _sum_hessians, _count, _gain, _value but NOT left_/right_sum_of_gradients_hessians.
   FindBestFromAllSplits assigns the winning split through this operator, so the
   packed sums were dropped (read back as 0).

2) cuda_data_partition.cu (SplitTreeStructureKernel): the child leaf splits had their
   double sum_of_gradients/sum_of_hessians refreshed on split but never their packed
   sum_of_gradients_hessians, so each child inherited the parent's total.

Combined effect before the fix: at the first sub-split the finder used the child's
real per-bin histogram but the parent's stale (or zeroed) total, scoring phantom
parent-remainder splits that partitioned to a 0-data leaf -> growth halted at 2-4
leaves while CPU grew the full 31.

After: CUDA quantized trees grow to match CPU's structure. Leaf counts over 20 rounds
(num_leaves=31), CPU-quant vs CUDA-quant:
  n=1000 min_data=300:  54 vs 54   (and bit-identical predictions, max|diff|=0)
  n=1000 min_data=200:  76 vs 75
  n=2000 min_data=300: 100 vs 106
  n=2000 min_data=200: 152 vs 157
versus the broken ~2 leaves/tree before.

SCOPE / honest limitations -- this is a STRUCTURAL fix, not full prediction parity:
- Predictions are exact in some configs (n=1000/min_data=300 -> 0) but still diverge
  in others (e.g. n=2000/min_data=200 -> ~29). Two separate, still-open causes:
  * leaves small enough for an 8-bit histogram (num_data*num_grad_quant_bins < 256)
    get a wrong near-zero dequantized sum_hessian and exploded leaf outputs -- a
    distinct bug in the adaptive 8-bit histogram path;
  * cross-feature gain-tie ordering differs from CPU (addressed on a separate branch).
- The accompanying test asserts the structural property (CUDA grows to within 20% of
  CPU's leaf count), which is what this change robustly guarantees, not tight parity.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…aves

SplitTreeStructureKernel hands each newly-created (smaller) child leaf a histogram
slot in cuda_hist_. Slots are laid out at a 2 * num_total_bin stride per leaf -- the
buffer is sized num_total_bin * 2 * num_leaves, and the right-is-smaller branch always
assigns cuda_hist + 2 * right_leaf_index * num_total_bin. But the left-is-smaller branch
used a 1x stride for the discretized (use_quantized_grad) path:

    cuda_hist_pool[left_leaf_index] = USE_GRAD_DISCRETIZED ?
        cuda_hist + right_leaf_index * num_total_bin :       // 1x  <-- bug
        cuda_hist + 2 * right_leaf_index * num_total_bin;    // 2x

A 1x-strided child could be handed a slot already owned by a live 2x-strided leaf.
Since ZeroHistForLeaf is a no-op (the whole buffer is zeroed once per tree), the child's
ConstructHistogramForLeaf then atomicAdded its data on top of the other leaf's histogram.

Observed directly (n=300, 8 features, so a clean leaf's histogram hessian sums to
num_data*8*4): a 45-data leaf landed on a 61-data leaf's slot (both slot_off=3264) and
its histogram summed to 3392 = (61+45)*32 instead of 1440. The polluted histogram gave
the finder phantom splits whose threshold the data partition could not reproduce (it put
all data on one side -> 0-data leaves), and leaves ended with near-zero sum_hessian whose
outputs exploded to 1e12-1e20 over 20 rounds at small min_data_in_leaf.

Fix: use the same 2x stride in the left-is-smaller branch. Slots are then distinct and
the allocation still fits exactly (2 * num_leaves * num_total_bin).

Before -> after, max|CUDA-CPU quantized prediction| (20 rounds), with lightgbm-org#8+lightgbm-org#9+lightgbm-org#10 in place:
  n=1000 min_data=20:  ~6e12   -> 0.997   (CUDA leaves 620 == CPU 620)
  n=2000 min_data=20:  ~1e20   -> 1.525   (CUDA 620 == CPU 620)
  n=1000/2000 min_data=200:       -> 0    (bit-identical)
The residual ~1-1.5 at small min_data is the separate cross-feature gain-tie ordering
(addressed on cuda/gain-tie-break), not this bug.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summarize the four stacked defects (lightgbm-org#8 discretized buffer overflow,
lightgbm-org#9/lightgbm-org#10 child-leaf packed-sum, lightgbm-org#11 histogram-slot collision) that left
CUDA use_quantized_grad training broken, with root causes, fixes, how
each was found, and before/after benchmarks. Records the disproven
8-bit-bit-width hypothesis and the remaining gain-tie residual.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
BelixRogner pushed a commit that referenced this pull request Jun 14, 2026
…histogram read (#23, #26)

Lands the CUDA quantized/discretized-gradient parity cascade (PR #23) plus
the 32-bit discretized-histogram read fix (PR #26, a superset of #23),
direct-merged because the branches could not be rebased onto the collapsed
master through their forks. Five distinct defects:

- Discretized grad/hess buffer was 2x under-allocated (int16 grad + int16
  hess = 4 bytes/point, sized as num_data*2). Overran into the dequant
  scale buffers -> denormal root sum_hessians -> single-leaf collapse.
- Child leaves never had their packed int64 grad/hess sum refreshed on
  split, so they inherited the parent total -> phantom parent-remainder
  splits. (Required adding the two packed-sum fields to CUDASplitInfo's
  operator=, which had silently dropped them.)
- Histogram-slot stride was inconsistent (1x vs 2x) between the
  left-smaller and right-smaller branches under the discretized path,
  aliasing live leaf histograms; force 2x everywhere.
- The best-split finder read the 32-bit discretized histogram as int32
  where the constructor writes int64 (>=4096-row promotion), reading half
  the bytes and indexing in the wrong units -> garbage gain at scale.

Adds four CPU/CUDA parity regression tests (TASK=cuda).

Authored-by: Max Buckley <maxwbuckley@gmail.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@BelixRogner

Copy link
Copy Markdown
Owner

Thank you, Max — and thank you, Claude Code (independently 🙂). Landed on master as d42da7f via #26 (which is a strict superset of this PR — your four parity fixes plus the 32-bit histogram read fix).

Deep-reviewed as SOLID: the four defects are genuinely distinct (not double-counting), each fixes the root data-flow issue rather than a symptom, and three of them are latent in upstream CUDA quantized training too. All three of this PR's parity tests came along in the merge, plus #26's extra 32-bit test.

Direct-merged (preserving your authorship) because the branch couldn't be rebased onto the collapsed master through the fork. Closing as landed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants