[cuda] fix 32-bit discretized histogram read in best-split finder (quantized garbage at scale) by maxwbuckley · Pull Request #26 · BelixRogner/ExaBoost

maxwbuckley · 2026-06-11T21:31:12Z

Read the 32-bit discretized histogram as int64 in the best-split finder

Stacks on #23 (the quantized correctness cascade). Until #23 merges, this PR's
diff includes those commits; the new work here is the finder fix + its regression test.

The bug

When a leaf's max per-bin stat (num_data_in_leaf * num_grad_quant_bins) reaches
65536, the discretized histogram switches to 32-bit bins: the construct kernel
stores each bin as an int64 (grad32 << 32 | hess32) via int64 atomics, and the
leaf's packed total is likewise int64. But FindBestSplitsDiscretizedForLeafKernel
dispatched the non-16-bit case with BIN_HIST_TYPE = int32_t and read the histogram
through an int32_t* offset by task->hist_offset — reading 4-byte half-bins at
the wrong stride (alternating grad/hess halves of consecutive int64 bins). The split
search therefore saw garbage for any leaf large enough to require 32-bit bins.

With num_grad_quant_bins = 16 that threshold is any leaf with ≥ 4096 rows, so in
practice any non-trivial dataset. The result was near-random models — root split
gain ~0.2 vs the correct ~25000, correlation ~0 with CPU.

Why it stayed hidden

The 8-bit and 16-bit histogram paths are correct, and those were the only paths the
small-data parity tests ever exercised. The bug only surfaces at scale, where leaves
are big enough to need 32-bit accumulators.

How it was found

While profiling CUDA training for performance, the quantized path looked ~3.5× faster
than double — but that "speedup" turned out to be a broken model: quantized
produced garbage (correlation ~0.05 with the double model) and terminated early on
zero-gain splits. Bisecting on dataset size pinned the break to n ≈ 4096 = 65536/16,
i.e. exactly the 16-bit → 32-bit histogram transition; sweeping num_grad_quant_bins
(which moves the threshold to 65536/bins) confirmed it.

The fix

Read the 32-bit histogram as int64 in both the forward and reverse split-search
branches: BIN_HIST_TYPE = int64_t, pointer cast to const int64_t*, offset in bin
units. (There is no global-memory discretized finder variant, and the remaining
int32_t* reads are the genuinely-16-bit cases.)

Results (CUDA quantized vs CPU quantized, `num_grad_quant_bins=16`)

n	before	after
2000 (16-bit leaves)	corr 1.0 (already correct)	corr 1.0, maxΔ 0
8000 (32-bit root)	corr −0.004	corr 1.0, maxΔ 0
200000	corr 0.005	corr 1.0, maxΔ 0

Quantized model quality now matches double precision (train RMSE within 0.01%),
making large-scale quantized training correct — and it is the faster path on CUDA
(int histograms; ~15–20% faster than double for num_leaves ≥ 31 on large data).
The 8-bit / 16-bit paths and the double path are unchanged (verified bit-exact).

Adds test_cuda_quantized_32bit_histogram_matches_cpu (n ∈ {2000, 8000, 50000}).

🤖 Generated with Claude Code

… in quantized training CUDAGradientDiscretizer stores, per data point, an int16 discretized gradient and an int16 discretized hessian. DiscretizeGradientsKernel writes them through a reinterpret_cast<int16_t*> view at indices [2*i] and [2*i+1], and every consumer (the histogram constructor and CUDAInitValuesKernel3) reads the buffer back as int16_t. That layout needs 2 * sizeof(int16_t) = 4 bytes per data point. The backing buffer is int8_t and was sized num_data * 2 (the original 8-bit layout), i.e. only 2 bytes per point -- half of what the int16 path writes. DiscretizeGradientsKernel therefore overran the buffer by num_data * 2 bytes, corrupting the adjacent device allocations, which in practice are the gradient/hessian dequantization scale buffers (grad_max_block_buffer_ / hess_max_block_buffer_). The corrupted dequant scale (~1.4e-45, the int bit pattern of the bin count reinterpreted as float) made the root leaf's sum_hessians come out as a denormal ~0 instead of the true value, so the root failed the sum_hessians > min_sum_hessian_in_leaf validity check and never split. Result: under device_type=cuda + use_quantized_grad=true, every tree collapsed to a single leaf and the model did not learn. Sizing the buffer as num_data * 4 removes the overflow. The dequant scales stay correct, the root sum_hessians is accurate, and CUDA quantized training produces real splitting trees again. Before (this same minimal regression setup, 20 rounds): CUDA quantized: 1 leaf/tree, max|CUDA-CPU quantized pred| = 10.99 After: CUDA quantized: multi-leaf splitting trees, max|CUDA-CPU| = 6.81 (the residual gap is a separate, still-open quantized bug in the child-leaf split bookkeeping; tracked separately.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…uffer fix Asserts that device_type=cuda + use_quantized_grad=true produces real splitting trees (num_leaves > 1) with non-degenerate predictions, across n in {200,500,1000} and num_leaves in {7,31}. Without the buffer-size fix the discretize kernel overran the discretized gradient/hessian buffer, corrupted the dequant scale buffers, and every quantized tree collapsed to a single leaf -- which this test would catch (max leaf count == 1). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…zed trees grow Two matched defects that together stop CUDA use_quantized_grad trees from growing past a couple of leaves. Both concern sum_of_gradients_hessians -- the int64 packed (gradient<<32 | hessian) leaf total that the DISCRETIZED best-split finder uses to derive cnt_factor = num_data/sum_hessians and the left = total - right split sums. (Builds on the discretized-buffer-overflow fix, without which quantized never splits at all.) 1) cuda_split_info.hpp: CUDASplitInfo::operator= copied left_/right_sum_gradients, _sum_hessians, _count, _gain, _value but NOT left_/right_sum_of_gradients_hessians. FindBestFromAllSplits assigns the winning split through this operator, so the packed sums were dropped (read back as 0). 2) cuda_data_partition.cu (SplitTreeStructureKernel): the child leaf splits had their double sum_of_gradients/sum_of_hessians refreshed on split but never their packed sum_of_gradients_hessians, so each child inherited the parent's total. Combined effect before the fix: at the first sub-split the finder used the child's real per-bin histogram but the parent's stale (or zeroed) total, scoring phantom parent-remainder splits that partitioned to a 0-data leaf -> growth halted at 2-4 leaves while CPU grew the full 31. After: CUDA quantized trees grow to match CPU's structure. Leaf counts over 20 rounds (num_leaves=31), CPU-quant vs CUDA-quant: n=1000 min_data=300: 54 vs 54 (and bit-identical predictions, max|diff|=0) n=1000 min_data=200: 76 vs 75 n=2000 min_data=300: 100 vs 106 n=2000 min_data=200: 152 vs 157 versus the broken ~2 leaves/tree before. SCOPE / honest limitations -- this is a STRUCTURAL fix, not full prediction parity: - Predictions are exact in some configs (n=1000/min_data=300 -> 0) but still diverge in others (e.g. n=2000/min_data=200 -> ~29). Two separate, still-open causes: * leaves small enough for an 8-bit histogram (num_data*num_grad_quant_bins < 256) get a wrong near-zero dequantized sum_hessian and exploded leaf outputs -- a distinct bug in the adaptive 8-bit histogram path; * cross-feature gain-tie ordering differs from CPU (addressed on a separate branch). - The accompanying test asserts the structural property (CUDA grows to within 20% of CPU's leaf count), which is what this change robustly guarantees, not tight parity. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…aves SplitTreeStructureKernel hands each newly-created (smaller) child leaf a histogram slot in cuda_hist_. Slots are laid out at a 2 * num_total_bin stride per leaf -- the buffer is sized num_total_bin * 2 * num_leaves, and the right-is-smaller branch always assigns cuda_hist + 2 * right_leaf_index * num_total_bin. But the left-is-smaller branch used a 1x stride for the discretized (use_quantized_grad) path: cuda_hist_pool[left_leaf_index] = USE_GRAD_DISCRETIZED ? cuda_hist + right_leaf_index * num_total_bin : // 1x <-- bug cuda_hist + 2 * right_leaf_index * num_total_bin; // 2x A 1x-strided child could be handed a slot already owned by a live 2x-strided leaf. Since ZeroHistForLeaf is a no-op (the whole buffer is zeroed once per tree), the child's ConstructHistogramForLeaf then atomicAdded its data on top of the other leaf's histogram. Observed directly (n=300, 8 features, so a clean leaf's histogram hessian sums to num_data*8*4): a 45-data leaf landed on a 61-data leaf's slot (both slot_off=3264) and its histogram summed to 3392 = (61+45)*32 instead of 1440. The polluted histogram gave the finder phantom splits whose threshold the data partition could not reproduce (it put all data on one side -> 0-data leaves), and leaves ended with near-zero sum_hessian whose outputs exploded to 1e12-1e20 over 20 rounds at small min_data_in_leaf. Fix: use the same 2x stride in the left-is-smaller branch. Slots are then distinct and the allocation still fits exactly (2 * num_leaves * num_total_bin). Before -> after, max|CUDA-CPU quantized prediction| (20 rounds), with lightgbm-org#8+lightgbm-org#9+lightgbm-org#10 in place: n=1000 min_data=20: ~6e12 -> 0.997 (CUDA leaves 620 == CPU 620) n=2000 min_data=20: ~1e20 -> 1.525 (CUDA 620 == CPU 620) n=1000/2000 min_data=200: -> 0 (bit-identical) The residual ~1-1.5 at small min_data is the separate cross-feature gain-tie ordering (addressed on cuda/gain-tie-break), not this bug. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Summarize the four stacked defects (lightgbm-org#8 discretized buffer overflow, lightgbm-org#9/lightgbm-org#10 child-leaf packed-sum, lightgbm-org#11 histogram-slot collision) that left CUDA use_quantized_grad training broken, with root causes, fixes, how each was found, and before/after benchmarks. Records the disproven 8-bit-bit-width hypothesis and the remaining gain-tie residual. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…inder When a leaf's max per-bin stat (num_data_in_leaf * num_grad_quant_bins) reaches 65536, the discretized histogram uses 32-bit bins: the construct kernel stores each bin as an int64 (grad32 << 32 | hess32) via int64 atomics, and the leaf's packed total is likewise int64. But the best-split finder dispatched the 32-bit (non-16-bit) case with BIN_HIST_TYPE = int32_t and read the histogram through an `int32_t*` offset by task->hist_offset. That reads 4-byte half-bins at the wrong stride (alternating grad/hess halves of consecutive int64 bins), so the finder saw garbage for every leaf that needed 32-bit bins -- i.e. any leaf with enough data. The result was near-random splits (root split gain ~0.2 vs the correct ~25000) and useless models whenever a dataset was large enough to require 32-bit histograms (with num_grad_quant_bins=16 this is any leaf with >= 4096 rows). The 8-bit and 16-bit paths were correct (and were the only paths exercised by the small-data parity tests), which is why this stayed hidden until large-scale quantized training. Fix: read the 32-bit histogram as int64 (BIN_HIST_TYPE = int64_t, pointer cast to const int64_t*, offset in bin units), in both the forward and reverse split-search branches. Before -> after (CUDA quantized vs CPU quantized, num_grad_quant_bins=16): n=2000 (16-bit leaves): corr 1.0 (already correct) n=8000 (32-bit root): corr -0.004 -> corr 1.0, maxΔ 0 n=200000: corr 0.005 -> corr 1.0, maxΔ 0 Quantized model quality now matches double-precision (RMSE within 0.01%), and large-scale quantized training is usable (and modestly faster) on CUDA. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Trains quantized regression at n in {2000, 8000, 50000} with num_grad_quant_bins=16 and asserts CUDA matches CPU (correlation > 0.99). n=2000 stays in the 16-bit histogram regime (already correct); n>=8000 forces a 32-bit root histogram, which the finder previously misread (int32 stride over int64 bins) and produced near-random models (correlation ~0). Guards the int64-read fix. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…histogram read (#23, #26) Lands the CUDA quantized/discretized-gradient parity cascade (PR #23) plus the 32-bit discretized-histogram read fix (PR #26, a superset of #23), direct-merged because the branches could not be rebased onto the collapsed master through their forks. Five distinct defects: - Discretized grad/hess buffer was 2x under-allocated (int16 grad + int16 hess = 4 bytes/point, sized as num_data*2). Overran into the dequant scale buffers -> denormal root sum_hessians -> single-leaf collapse. - Child leaves never had their packed int64 grad/hess sum refreshed on split, so they inherited the parent total -> phantom parent-remainder splits. (Required adding the two packed-sum fields to CUDASplitInfo's operator=, which had silently dropped them.) - Histogram-slot stride was inconsistent (1x vs 2x) between the left-smaller and right-smaller branches under the discretized path, aliasing live leaf histograms; force 2x everywhere. - The best-split finder read the 32-bit discretized histogram as int32 where the constructor writes int64 (>=4096-row promotion), reading half the bytes and indexing in the wrong units -> garbage gain at scale. Adds four CPU/CUDA parity regression tests (TASK=cuda). Authored-by: Max Buckley <maxwbuckley@gmail.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

BelixRogner · 2026-06-14T15:24:25Z

Thank you, Max — and thank you, Claude Code (separately 🙂). Landed on master as d42da7f.

Both this PR and #23 reviewed out as SOLID — the cleanest, most impactful correctness work in the batch. Verified all five defects against the CPU reference path (and confirmed three are latent upstream bugs worth escalating): the 2× discretized-buffer under-allocation, the unrefreshed child packed-sum (+ the CUDASplitInfo::operator= fields it depends on), the 1×/2× histogram-slot stride aliasing, and this PR's 32-bit int64 read fix for the garbage-at-scale promotion path.

Since the branch couldn't be rebased onto the collapsed master through the fork, I direct-merged the superset (your branch = #23's four fixes + this PR's finder fix + all four parity tests), preserving your authorship. The only thing dropped was the root-level CUDA_QUANTIZED_PARITY_FIXES.md design doc — that content lives in the commit message now. CUDA-build-verified and lint-clean (cpplint/ruff/typos all pass on the pinned versions).

Closing as landed. #23 closed for the same reason.

maxwbuckley and others added 7 commits June 11, 2026 02:10

BelixRogner closed this Jun 14, 2026

BelixRogner mentioned this pull request Jun 14, 2026

Fix CUDA quantized gradient training parity with CPU (4-defect cascade) #23

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[cuda] fix 32-bit discretized histogram read in best-split finder (quantized garbage at scale)#26

[cuda] fix 32-bit discretized histogram read in best-split finder (quantized garbage at scale)#26
maxwbuckley wants to merge 7 commits into
BelixRogner:masterfrom
maxwbuckley:cuda/quantized-32bit-histogram-read

maxwbuckley commented Jun 11, 2026

Uh oh!

BelixRogner commented Jun 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

maxwbuckley commented Jun 11, 2026

Read the 32-bit discretized histogram as int64 in the best-split finder

The bug

Why it stayed hidden

How it was found

The fix

Results (CUDA quantized vs CPU quantized, num_grad_quant_bins=16)

Uh oh!

BelixRogner commented Jun 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Results (CUDA quantized vs CPU quantized, `num_grad_quant_bins=16`)