[cuda] fix 32-bit discretized histogram read in best-split finder (quantized garbage at scale)#26
Conversation
… in quantized training CUDAGradientDiscretizer stores, per data point, an int16 discretized gradient and an int16 discretized hessian. DiscretizeGradientsKernel writes them through a reinterpret_cast<int16_t*> view at indices [2*i] and [2*i+1], and every consumer (the histogram constructor and CUDAInitValuesKernel3) reads the buffer back as int16_t. That layout needs 2 * sizeof(int16_t) = 4 bytes per data point. The backing buffer is int8_t and was sized num_data * 2 (the original 8-bit layout), i.e. only 2 bytes per point -- half of what the int16 path writes. DiscretizeGradientsKernel therefore overran the buffer by num_data * 2 bytes, corrupting the adjacent device allocations, which in practice are the gradient/hessian dequantization scale buffers (grad_max_block_buffer_ / hess_max_block_buffer_). The corrupted dequant scale (~1.4e-45, the int bit pattern of the bin count reinterpreted as float) made the root leaf's sum_hessians come out as a denormal ~0 instead of the true value, so the root failed the sum_hessians > min_sum_hessian_in_leaf validity check and never split. Result: under device_type=cuda + use_quantized_grad=true, every tree collapsed to a single leaf and the model did not learn. Sizing the buffer as num_data * 4 removes the overflow. The dequant scales stay correct, the root sum_hessians is accurate, and CUDA quantized training produces real splitting trees again. Before (this same minimal regression setup, 20 rounds): CUDA quantized: 1 leaf/tree, max|CUDA-CPU quantized pred| = 10.99 After: CUDA quantized: multi-leaf splitting trees, max|CUDA-CPU| = 6.81 (the residual gap is a separate, still-open quantized bug in the child-leaf split bookkeeping; tracked separately.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…uffer fix
Asserts that device_type=cuda + use_quantized_grad=true produces real
splitting trees (num_leaves > 1) with non-degenerate predictions, across
n in {200,500,1000} and num_leaves in {7,31}. Without the buffer-size fix
the discretize kernel overran the discretized gradient/hessian buffer,
corrupted the dequant scale buffers, and every quantized tree collapsed
to a single leaf -- which this test would catch (max leaf count == 1).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…zed trees grow
Two matched defects that together stop CUDA use_quantized_grad trees from growing
past a couple of leaves. Both concern sum_of_gradients_hessians -- the int64 packed
(gradient<<32 | hessian) leaf total that the DISCRETIZED best-split finder uses to
derive cnt_factor = num_data/sum_hessians and the left = total - right split sums.
(Builds on the discretized-buffer-overflow fix, without which quantized never splits
at all.)
1) cuda_split_info.hpp: CUDASplitInfo::operator= copied left_/right_sum_gradients,
_sum_hessians, _count, _gain, _value but NOT left_/right_sum_of_gradients_hessians.
FindBestFromAllSplits assigns the winning split through this operator, so the
packed sums were dropped (read back as 0).
2) cuda_data_partition.cu (SplitTreeStructureKernel): the child leaf splits had their
double sum_of_gradients/sum_of_hessians refreshed on split but never their packed
sum_of_gradients_hessians, so each child inherited the parent's total.
Combined effect before the fix: at the first sub-split the finder used the child's
real per-bin histogram but the parent's stale (or zeroed) total, scoring phantom
parent-remainder splits that partitioned to a 0-data leaf -> growth halted at 2-4
leaves while CPU grew the full 31.
After: CUDA quantized trees grow to match CPU's structure. Leaf counts over 20 rounds
(num_leaves=31), CPU-quant vs CUDA-quant:
n=1000 min_data=300: 54 vs 54 (and bit-identical predictions, max|diff|=0)
n=1000 min_data=200: 76 vs 75
n=2000 min_data=300: 100 vs 106
n=2000 min_data=200: 152 vs 157
versus the broken ~2 leaves/tree before.
SCOPE / honest limitations -- this is a STRUCTURAL fix, not full prediction parity:
- Predictions are exact in some configs (n=1000/min_data=300 -> 0) but still diverge
in others (e.g. n=2000/min_data=200 -> ~29). Two separate, still-open causes:
* leaves small enough for an 8-bit histogram (num_data*num_grad_quant_bins < 256)
get a wrong near-zero dequantized sum_hessian and exploded leaf outputs -- a
distinct bug in the adaptive 8-bit histogram path;
* cross-feature gain-tie ordering differs from CPU (addressed on a separate branch).
- The accompanying test asserts the structural property (CUDA grows to within 20% of
CPU's leaf count), which is what this change robustly guarantees, not tight parity.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…aves
SplitTreeStructureKernel hands each newly-created (smaller) child leaf a histogram
slot in cuda_hist_. Slots are laid out at a 2 * num_total_bin stride per leaf -- the
buffer is sized num_total_bin * 2 * num_leaves, and the right-is-smaller branch always
assigns cuda_hist + 2 * right_leaf_index * num_total_bin. But the left-is-smaller branch
used a 1x stride for the discretized (use_quantized_grad) path:
cuda_hist_pool[left_leaf_index] = USE_GRAD_DISCRETIZED ?
cuda_hist + right_leaf_index * num_total_bin : // 1x <-- bug
cuda_hist + 2 * right_leaf_index * num_total_bin; // 2x
A 1x-strided child could be handed a slot already owned by a live 2x-strided leaf.
Since ZeroHistForLeaf is a no-op (the whole buffer is zeroed once per tree), the child's
ConstructHistogramForLeaf then atomicAdded its data on top of the other leaf's histogram.
Observed directly (n=300, 8 features, so a clean leaf's histogram hessian sums to
num_data*8*4): a 45-data leaf landed on a 61-data leaf's slot (both slot_off=3264) and
its histogram summed to 3392 = (61+45)*32 instead of 1440. The polluted histogram gave
the finder phantom splits whose threshold the data partition could not reproduce (it put
all data on one side -> 0-data leaves), and leaves ended with near-zero sum_hessian whose
outputs exploded to 1e12-1e20 over 20 rounds at small min_data_in_leaf.
Fix: use the same 2x stride in the left-is-smaller branch. Slots are then distinct and
the allocation still fits exactly (2 * num_leaves * num_total_bin).
Before -> after, max|CUDA-CPU quantized prediction| (20 rounds), with lightgbm-org#8+lightgbm-org#9+lightgbm-org#10 in place:
n=1000 min_data=20: ~6e12 -> 0.997 (CUDA leaves 620 == CPU 620)
n=2000 min_data=20: ~1e20 -> 1.525 (CUDA 620 == CPU 620)
n=1000/2000 min_data=200: -> 0 (bit-identical)
The residual ~1-1.5 at small min_data is the separate cross-feature gain-tie ordering
(addressed on cuda/gain-tie-break), not this bug.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Summarize the four stacked defects (lightgbm-org#8 discretized buffer overflow, lightgbm-org#9/lightgbm-org#10 child-leaf packed-sum, lightgbm-org#11 histogram-slot collision) that left CUDA use_quantized_grad training broken, with root causes, fixes, how each was found, and before/after benchmarks. Records the disproven 8-bit-bit-width hypothesis and the remaining gain-tie residual. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…inder When a leaf's max per-bin stat (num_data_in_leaf * num_grad_quant_bins) reaches 65536, the discretized histogram uses 32-bit bins: the construct kernel stores each bin as an int64 (grad32 << 32 | hess32) via int64 atomics, and the leaf's packed total is likewise int64. But the best-split finder dispatched the 32-bit (non-16-bit) case with BIN_HIST_TYPE = int32_t and read the histogram through an `int32_t*` offset by task->hist_offset. That reads 4-byte half-bins at the wrong stride (alternating grad/hess halves of consecutive int64 bins), so the finder saw garbage for every leaf that needed 32-bit bins -- i.e. any leaf with enough data. The result was near-random splits (root split gain ~0.2 vs the correct ~25000) and useless models whenever a dataset was large enough to require 32-bit histograms (with num_grad_quant_bins=16 this is any leaf with >= 4096 rows). The 8-bit and 16-bit paths were correct (and were the only paths exercised by the small-data parity tests), which is why this stayed hidden until large-scale quantized training. Fix: read the 32-bit histogram as int64 (BIN_HIST_TYPE = int64_t, pointer cast to const int64_t*, offset in bin units), in both the forward and reverse split-search branches. Before -> after (CUDA quantized vs CPU quantized, num_grad_quant_bins=16): n=2000 (16-bit leaves): corr 1.0 (already correct) n=8000 (32-bit root): corr -0.004 -> corr 1.0, maxΔ 0 n=200000: corr 0.005 -> corr 1.0, maxΔ 0 Quantized model quality now matches double-precision (RMSE within 0.01%), and large-scale quantized training is usable (and modestly faster) on CUDA. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Trains quantized regression at n in {2000, 8000, 50000} with num_grad_quant_bins=16
and asserts CUDA matches CPU (correlation > 0.99). n=2000 stays in the 16-bit
histogram regime (already correct); n>=8000 forces a 32-bit root histogram, which
the finder previously misread (int32 stride over int64 bins) and produced
near-random models (correlation ~0). Guards the int64-read fix.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…histogram read (#23, #26) Lands the CUDA quantized/discretized-gradient parity cascade (PR #23) plus the 32-bit discretized-histogram read fix (PR #26, a superset of #23), direct-merged because the branches could not be rebased onto the collapsed master through their forks. Five distinct defects: - Discretized grad/hess buffer was 2x under-allocated (int16 grad + int16 hess = 4 bytes/point, sized as num_data*2). Overran into the dequant scale buffers -> denormal root sum_hessians -> single-leaf collapse. - Child leaves never had their packed int64 grad/hess sum refreshed on split, so they inherited the parent total -> phantom parent-remainder splits. (Required adding the two packed-sum fields to CUDASplitInfo's operator=, which had silently dropped them.) - Histogram-slot stride was inconsistent (1x vs 2x) between the left-smaller and right-smaller branches under the discretized path, aliasing live leaf histograms; force 2x everywhere. - The best-split finder read the 32-bit discretized histogram as int32 where the constructor writes int64 (>=4096-row promotion), reading half the bytes and indexing in the wrong units -> garbage gain at scale. Adds four CPU/CUDA parity regression tests (TASK=cuda). Authored-by: Max Buckley <maxwbuckley@gmail.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Thank you, Max — and thank you, Claude Code (separately 🙂). Landed on Both this PR and #23 reviewed out as SOLID — the cleanest, most impactful correctness work in the batch. Verified all five defects against the CPU reference path (and confirmed three are latent upstream bugs worth escalating): the 2× discretized-buffer under-allocation, the unrefreshed child packed-sum (+ the Since the branch couldn't be rebased onto the collapsed Closing as landed. #23 closed for the same reason. |
Read the 32-bit discretized histogram as int64 in the best-split finder
The bug
When a leaf's max per-bin stat (
num_data_in_leaf * num_grad_quant_bins) reaches65536, the discretized histogram switches to 32-bit bins: the construct kernel
stores each bin as an
int64(grad32 << 32 | hess32) via int64 atomics, and theleaf's packed total is likewise
int64. ButFindBestSplitsDiscretizedForLeafKerneldispatched the non-16-bit case with
BIN_HIST_TYPE = int32_tand read the histogramthrough an
int32_t*offset bytask->hist_offset— reading 4-byte half-bins atthe wrong stride (alternating grad/hess halves of consecutive int64 bins). The split
search therefore saw garbage for any leaf large enough to require 32-bit bins.
With
num_grad_quant_bins = 16that threshold is any leaf with ≥ 4096 rows, so inpractice any non-trivial dataset. The result was near-random models — root split
gain ~0.2 vs the correct ~25000, correlation ~0 with CPU.
Why it stayed hidden
The 8-bit and 16-bit histogram paths are correct, and those were the only paths the
small-data parity tests ever exercised. The bug only surfaces at scale, where leaves
are big enough to need 32-bit accumulators.
How it was found
While profiling CUDA training for performance, the quantized path looked ~3.5× faster
than double — but that "speedup" turned out to be a broken model: quantized
produced garbage (correlation ~0.05 with the double model) and terminated early on
zero-gain splits. Bisecting on dataset size pinned the break to n ≈ 4096 = 65536/16,
i.e. exactly the 16-bit → 32-bit histogram transition; sweeping
num_grad_quant_bins(which moves the threshold to 65536/bins) confirmed it.
The fix
Read the 32-bit histogram as
int64in both the forward and reverse split-searchbranches:
BIN_HIST_TYPE = int64_t, pointer cast toconst int64_t*, offset in binunits. (There is no global-memory discretized finder variant, and the remaining
int32_t*reads are the genuinely-16-bit cases.)Results (CUDA quantized vs CPU quantized,
num_grad_quant_bins=16)Quantized model quality now matches double precision (train RMSE within 0.01%),
making large-scale quantized training correct — and it is the faster path on CUDA
(int histograms; ~15–20% faster than double for
num_leaves ≥ 31on large data).The 8-bit / 16-bit paths and the double path are unchanged (verified bit-exact).
Adds
test_cuda_quantized_32bit_histogram_matches_cpu(n ∈ {2000, 8000, 50000}).🤖 Generated with Claude Code