Skip to content

[cuda] fix 32-bit discretized histogram read in best-split finder (quantized garbage at scale)#26

Closed
maxwbuckley wants to merge 7 commits into
BelixRogner:masterfrom
maxwbuckley:cuda/quantized-32bit-histogram-read
Closed

[cuda] fix 32-bit discretized histogram read in best-split finder (quantized garbage at scale)#26
maxwbuckley wants to merge 7 commits into
BelixRogner:masterfrom
maxwbuckley:cuda/quantized-32bit-histogram-read

Conversation

@maxwbuckley

Copy link
Copy Markdown
Collaborator

Read the 32-bit discretized histogram as int64 in the best-split finder

Stacks on #23 (the quantized correctness cascade). Until #23 merges, this PR's
diff includes those commits; the new work here is the finder fix + its regression test.

The bug

When a leaf's max per-bin stat (num_data_in_leaf * num_grad_quant_bins) reaches
65536, the discretized histogram switches to 32-bit bins: the construct kernel
stores each bin as an int64 (grad32 << 32 | hess32) via int64 atomics, and the
leaf's packed total is likewise int64. But FindBestSplitsDiscretizedForLeafKernel
dispatched the non-16-bit case with BIN_HIST_TYPE = int32_t and read the histogram
through an int32_t* offset by task->hist_offset — reading 4-byte half-bins at
the wrong stride
(alternating grad/hess halves of consecutive int64 bins). The split
search therefore saw garbage for any leaf large enough to require 32-bit bins.

With num_grad_quant_bins = 16 that threshold is any leaf with ≥ 4096 rows, so in
practice any non-trivial dataset. The result was near-random models — root split
gain ~0.2 vs the correct ~25000, correlation ~0 with CPU.

Why it stayed hidden

The 8-bit and 16-bit histogram paths are correct, and those were the only paths the
small-data parity tests ever exercised. The bug only surfaces at scale, where leaves
are big enough to need 32-bit accumulators.

How it was found

While profiling CUDA training for performance, the quantized path looked ~3.5× faster
than double — but that "speedup" turned out to be a broken model: quantized
produced garbage (correlation ~0.05 with the double model) and terminated early on
zero-gain splits. Bisecting on dataset size pinned the break to n ≈ 4096 = 65536/16,
i.e. exactly the 16-bit → 32-bit histogram transition; sweeping num_grad_quant_bins
(which moves the threshold to 65536/bins) confirmed it.

The fix

Read the 32-bit histogram as int64 in both the forward and reverse split-search
branches: BIN_HIST_TYPE = int64_t, pointer cast to const int64_t*, offset in bin
units. (There is no global-memory discretized finder variant, and the remaining
int32_t* reads are the genuinely-16-bit cases.)

Results (CUDA quantized vs CPU quantized, num_grad_quant_bins=16)

n before after
2000 (16-bit leaves) corr 1.0 (already correct) corr 1.0, maxΔ 0
8000 (32-bit root) corr −0.004 corr 1.0, maxΔ 0
200000 corr 0.005 corr 1.0, maxΔ 0

Quantized model quality now matches double precision (train RMSE within 0.01%),
making large-scale quantized training correct — and it is the faster path on CUDA
(int histograms; ~15–20% faster than double for num_leaves ≥ 31 on large data).
The 8-bit / 16-bit paths and the double path are unchanged (verified bit-exact).

Adds test_cuda_quantized_32bit_histogram_matches_cpu (n ∈ {2000, 8000, 50000}).

🤖 Generated with Claude Code

maxwbuckley and others added 7 commits June 11, 2026 02:10
… in quantized training

CUDAGradientDiscretizer stores, per data point, an int16 discretized
gradient and an int16 discretized hessian. DiscretizeGradientsKernel
writes them through a reinterpret_cast<int16_t*> view at indices
[2*i] and [2*i+1], and every consumer (the histogram constructor and
CUDAInitValuesKernel3) reads the buffer back as int16_t. That layout
needs 2 * sizeof(int16_t) = 4 bytes per data point.

The backing buffer is int8_t and was sized num_data * 2 (the original
8-bit layout), i.e. only 2 bytes per point -- half of what the int16
path writes. DiscretizeGradientsKernel therefore overran the buffer by
num_data * 2 bytes, corrupting the adjacent device allocations, which
in practice are the gradient/hessian dequantization scale buffers
(grad_max_block_buffer_ / hess_max_block_buffer_).

The corrupted dequant scale (~1.4e-45, the int bit pattern of the bin
count reinterpreted as float) made the root leaf's sum_hessians come
out as a denormal ~0 instead of the true value, so the root failed the
sum_hessians > min_sum_hessian_in_leaf validity check and never split.
Result: under device_type=cuda + use_quantized_grad=true, every tree
collapsed to a single leaf and the model did not learn.

Sizing the buffer as num_data * 4 removes the overflow. The dequant
scales stay correct, the root sum_hessians is accurate, and CUDA
quantized training produces real splitting trees again.

Before (this same minimal regression setup, 20 rounds):
  CUDA quantized: 1 leaf/tree, max|CUDA-CPU quantized pred| = 10.99
After:
  CUDA quantized: multi-leaf splitting trees, max|CUDA-CPU| = 6.81
  (the residual gap is a separate, still-open quantized bug in the
   child-leaf split bookkeeping; tracked separately.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…uffer fix

Asserts that device_type=cuda + use_quantized_grad=true produces real
splitting trees (num_leaves > 1) with non-degenerate predictions, across
n in {200,500,1000} and num_leaves in {7,31}. Without the buffer-size fix
the discretize kernel overran the discretized gradient/hessian buffer,
corrupted the dequant scale buffers, and every quantized tree collapsed
to a single leaf -- which this test would catch (max leaf count == 1).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…zed trees grow

Two matched defects that together stop CUDA use_quantized_grad trees from growing
past a couple of leaves. Both concern sum_of_gradients_hessians -- the int64 packed
(gradient<<32 | hessian) leaf total that the DISCRETIZED best-split finder uses to
derive cnt_factor = num_data/sum_hessians and the left = total - right split sums.
(Builds on the discretized-buffer-overflow fix, without which quantized never splits
at all.)

1) cuda_split_info.hpp: CUDASplitInfo::operator= copied left_/right_sum_gradients,
   _sum_hessians, _count, _gain, _value but NOT left_/right_sum_of_gradients_hessians.
   FindBestFromAllSplits assigns the winning split through this operator, so the
   packed sums were dropped (read back as 0).

2) cuda_data_partition.cu (SplitTreeStructureKernel): the child leaf splits had their
   double sum_of_gradients/sum_of_hessians refreshed on split but never their packed
   sum_of_gradients_hessians, so each child inherited the parent's total.

Combined effect before the fix: at the first sub-split the finder used the child's
real per-bin histogram but the parent's stale (or zeroed) total, scoring phantom
parent-remainder splits that partitioned to a 0-data leaf -> growth halted at 2-4
leaves while CPU grew the full 31.

After: CUDA quantized trees grow to match CPU's structure. Leaf counts over 20 rounds
(num_leaves=31), CPU-quant vs CUDA-quant:
  n=1000 min_data=300:  54 vs 54   (and bit-identical predictions, max|diff|=0)
  n=1000 min_data=200:  76 vs 75
  n=2000 min_data=300: 100 vs 106
  n=2000 min_data=200: 152 vs 157
versus the broken ~2 leaves/tree before.

SCOPE / honest limitations -- this is a STRUCTURAL fix, not full prediction parity:
- Predictions are exact in some configs (n=1000/min_data=300 -> 0) but still diverge
  in others (e.g. n=2000/min_data=200 -> ~29). Two separate, still-open causes:
  * leaves small enough for an 8-bit histogram (num_data*num_grad_quant_bins < 256)
    get a wrong near-zero dequantized sum_hessian and exploded leaf outputs -- a
    distinct bug in the adaptive 8-bit histogram path;
  * cross-feature gain-tie ordering differs from CPU (addressed on a separate branch).
- The accompanying test asserts the structural property (CUDA grows to within 20% of
  CPU's leaf count), which is what this change robustly guarantees, not tight parity.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…aves

SplitTreeStructureKernel hands each newly-created (smaller) child leaf a histogram
slot in cuda_hist_. Slots are laid out at a 2 * num_total_bin stride per leaf -- the
buffer is sized num_total_bin * 2 * num_leaves, and the right-is-smaller branch always
assigns cuda_hist + 2 * right_leaf_index * num_total_bin. But the left-is-smaller branch
used a 1x stride for the discretized (use_quantized_grad) path:

    cuda_hist_pool[left_leaf_index] = USE_GRAD_DISCRETIZED ?
        cuda_hist + right_leaf_index * num_total_bin :       // 1x  <-- bug
        cuda_hist + 2 * right_leaf_index * num_total_bin;    // 2x

A 1x-strided child could be handed a slot already owned by a live 2x-strided leaf.
Since ZeroHistForLeaf is a no-op (the whole buffer is zeroed once per tree), the child's
ConstructHistogramForLeaf then atomicAdded its data on top of the other leaf's histogram.

Observed directly (n=300, 8 features, so a clean leaf's histogram hessian sums to
num_data*8*4): a 45-data leaf landed on a 61-data leaf's slot (both slot_off=3264) and
its histogram summed to 3392 = (61+45)*32 instead of 1440. The polluted histogram gave
the finder phantom splits whose threshold the data partition could not reproduce (it put
all data on one side -> 0-data leaves), and leaves ended with near-zero sum_hessian whose
outputs exploded to 1e12-1e20 over 20 rounds at small min_data_in_leaf.

Fix: use the same 2x stride in the left-is-smaller branch. Slots are then distinct and
the allocation still fits exactly (2 * num_leaves * num_total_bin).

Before -> after, max|CUDA-CPU quantized prediction| (20 rounds), with lightgbm-org#8+lightgbm-org#9+lightgbm-org#10 in place:
  n=1000 min_data=20:  ~6e12   -> 0.997   (CUDA leaves 620 == CPU 620)
  n=2000 min_data=20:  ~1e20   -> 1.525   (CUDA 620 == CPU 620)
  n=1000/2000 min_data=200:       -> 0    (bit-identical)
The residual ~1-1.5 at small min_data is the separate cross-feature gain-tie ordering
(addressed on cuda/gain-tie-break), not this bug.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summarize the four stacked defects (lightgbm-org#8 discretized buffer overflow,
lightgbm-org#9/lightgbm-org#10 child-leaf packed-sum, lightgbm-org#11 histogram-slot collision) that left
CUDA use_quantized_grad training broken, with root causes, fixes, how
each was found, and before/after benchmarks. Records the disproven
8-bit-bit-width hypothesis and the remaining gain-tie residual.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…inder

When a leaf's max per-bin stat (num_data_in_leaf * num_grad_quant_bins) reaches
65536, the discretized histogram uses 32-bit bins: the construct kernel stores
each bin as an int64 (grad32 << 32 | hess32) via int64 atomics, and the leaf's
packed total is likewise int64. But the best-split finder dispatched the 32-bit
(non-16-bit) case with BIN_HIST_TYPE = int32_t and read the histogram through an
`int32_t*` offset by task->hist_offset. That reads 4-byte half-bins at the wrong
stride (alternating grad/hess halves of consecutive int64 bins), so the finder
saw garbage for every leaf that needed 32-bit bins -- i.e. any leaf with enough
data. The result was near-random splits (root split gain ~0.2 vs the correct
~25000) and useless models whenever a dataset was large enough to require 32-bit
histograms (with num_grad_quant_bins=16 this is any leaf with >= 4096 rows).

The 8-bit and 16-bit paths were correct (and were the only paths exercised by the
small-data parity tests), which is why this stayed hidden until large-scale
quantized training. Fix: read the 32-bit histogram as int64 (BIN_HIST_TYPE =
int64_t, pointer cast to const int64_t*, offset in bin units), in both the
forward and reverse split-search branches.

Before -> after (CUDA quantized vs CPU quantized, num_grad_quant_bins=16):
  n=2000  (16-bit leaves): corr 1.0 (already correct)
  n=8000  (32-bit root):   corr -0.004 -> corr 1.0, maxΔ 0
  n=200000:                corr  0.005 -> corr 1.0, maxΔ 0
Quantized model quality now matches double-precision (RMSE within 0.01%), and
large-scale quantized training is usable (and modestly faster) on CUDA.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Trains quantized regression at n in {2000, 8000, 50000} with num_grad_quant_bins=16
and asserts CUDA matches CPU (correlation > 0.99). n=2000 stays in the 16-bit
histogram regime (already correct); n>=8000 forces a 32-bit root histogram, which
the finder previously misread (int32 stride over int64 bins) and produced
near-random models (correlation ~0). Guards the int64-read fix.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
BelixRogner pushed a commit that referenced this pull request Jun 14, 2026
…histogram read (#23, #26)

Lands the CUDA quantized/discretized-gradient parity cascade (PR #23) plus
the 32-bit discretized-histogram read fix (PR #26, a superset of #23),
direct-merged because the branches could not be rebased onto the collapsed
master through their forks. Five distinct defects:

- Discretized grad/hess buffer was 2x under-allocated (int16 grad + int16
  hess = 4 bytes/point, sized as num_data*2). Overran into the dequant
  scale buffers -> denormal root sum_hessians -> single-leaf collapse.
- Child leaves never had their packed int64 grad/hess sum refreshed on
  split, so they inherited the parent total -> phantom parent-remainder
  splits. (Required adding the two packed-sum fields to CUDASplitInfo's
  operator=, which had silently dropped them.)
- Histogram-slot stride was inconsistent (1x vs 2x) between the
  left-smaller and right-smaller branches under the discretized path,
  aliasing live leaf histograms; force 2x everywhere.
- The best-split finder read the 32-bit discretized histogram as int32
  where the constructor writes int64 (>=4096-row promotion), reading half
  the bytes and indexing in the wrong units -> garbage gain at scale.

Adds four CPU/CUDA parity regression tests (TASK=cuda).

Authored-by: Max Buckley <maxwbuckley@gmail.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@BelixRogner

Copy link
Copy Markdown
Owner

Thank you, Max — and thank you, Claude Code (separately 🙂). Landed on master as d42da7f.

Both this PR and #23 reviewed out as SOLID — the cleanest, most impactful correctness work in the batch. Verified all five defects against the CPU reference path (and confirmed three are latent upstream bugs worth escalating): the 2× discretized-buffer under-allocation, the unrefreshed child packed-sum (+ the CUDASplitInfo::operator= fields it depends on), the 1×/2× histogram-slot stride aliasing, and this PR's 32-bit int64 read fix for the garbage-at-scale promotion path.

Since the branch couldn't be rebased onto the collapsed master through the fork, I direct-merged the superset (your branch = #23's four fixes + this PR's finder fix + all four parity tests), preserving your authorship. The only thing dropped was the root-level CUDA_QUANTIZED_PARITY_FIXES.md design doc — that content lives in the commit message now. CUDA-build-verified and lint-clean (cpplint/ruff/typos all pass on the pinned versions).

Closing as landed. #23 closed for the same reason.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants