[cuda] order FindBestSplits after histograms with events, not device syncs (~8.6% faster) by maxwbuckley · Pull Request #27 · BelixRogner/ExaBoost

maxwbuckley · 2026-06-13T12:29:19Z

Replace per-split histogram→FindBestSplits device syncs with event-based stream ordering

TL;DR

Per-split CUDA training is host-synchronization bound. nsys on an RTX 5090
(sm_120, CUDA 13.2), 500k×100 / 255-leaf regression fit: cudaDeviceSynchronize ≈
45% of wall time (~9 syncs per leaf split), while all GPU kernels together are ≈
21%. Two of those device syncs sit between histogram construction and the
FindBestSplits launches:

cuda_histogram_constructor: a SynchronizeCUDADevice after constructing the
smaller-leaf histogram, so the best split finder (which runs on its own streams)
sees it.
cuda_best_split_finder: a SynchronizeCUDADevice between the smaller-leaf and
larger-leaf FindBestSplits launches, so the larger leaf sees the subtracted
histogram.

Both are full device syncs (host↔GPU round trips) on every split.

Change

The histogram constructor records two timing-disabled CUDA events on its stream:
construct_done_event_ after the smaller-leaf histogram and subtract_done_event_
after the subtract (which also covers the in-place FixHistogram that precedes it).
The best split finder waits on these via cudaStreamWaitEvent — construct on
stream 0 before the smaller-leaf search, subtract on stream 1 before the larger-leaf
search — instead of the device syncs. The two leaves then run concurrently with no
host stall. Events are wired once in the tree learner after both objects init.

The global-memory FindBestSplits path keeps its device sync: it shares
cuda_feature_hist_grad/hess_buffer_ between the two leaves, so they must not run
concurrently. It gains only the construct/subtract visibility waits.

Result

Interleaved A/B, n=30, Welch's t, full-tree double-precision (500k×100, 255 leaves):
+8.6% mean end-to-end training (median +8.8%, t=4.64, significant), lower variance.

Correctness

CPU/CUDA parity test added in test_dual.py (multi-leaf regression, several
num_leaves): CUDA matches CPU to ≤4e-16 on the deterministic single-thread,
double-precision config.
On the non-deterministic large double path, predictions stay within the baseline's
own run-to-run noise floor (max|Δ| ≈ 2.5e-7, same order as base-vs-base).

Independent of (and composable with) #25 (SyncBestSplit overlap); they touch different
functions. Measured together they compound to ≈ +21% on the integration branch.

…syncs Per-split CUDA training is host-synchronization bound (nsys on a 500k x 100, 255-leaf regression fit: cudaDeviceSynchronize ~45% of wall time, ~9 syncs per split; all GPU kernels ~21%). Two of those device syncs sit between histogram construction and the FindBestSplits launches: - cuda_histogram_constructor: a SynchronizeCUDADevice after constructing the smaller-leaf histogram, so the best split finder (its own streams) sees it. - cuda_best_split_finder: a SynchronizeCUDADevice between the smaller-leaf and larger-leaf FindBestSplits launches, so the larger leaf sees the subtracted histogram. Replace them with GPU-side ordering: the histogram constructor records two timing-disabled events on its stream -- construct_done_event_ after the smaller-leaf histogram and subtract_done_event_ after the subtract (which also covers the in-place FixHistogram that precedes it). The best split finder waits on construct_done_event_ on stream 0 before the smaller-leaf FindBestSplits and on subtract_done_event_ on stream 1 before the larger-leaf FindBestSplits. The two leaves then run concurrently with no host stall and no device sync. Events are wired once in the tree learner. The global-memory FindBestSplits path keeps its device sync: it shares cuda_feature_hist_grad/hess_buffer_ between the two leaves, so they must not run concurrently. It gains only the construct/subtract visibility waits. Correctness: on the deterministic small-data config CUDA stays bit-for-bit aligned with CPU (see test_dual.py). On the non-deterministic large double path, predictions stay within the baseline's own run-to-run noise floor. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…ring Guards the event-based ordering that replaced the per-split histogram->FindBestSplits device syncs: trains multi-leaf regression trees on CPU and CUDA (deterministic, single-thread, double-precision) and asserts predictions match to 1e-9. A missing or incorrect cudaStreamWaitEvent would let FindBestSplits read a histogram before it is written and diverge far beyond that. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

BelixRogner · 2026-06-14T19:52:40Z

Thank you, Max — and thank you, Claude Code (independently 🙂). Verdict: NEEDS-FIX, but the core is solid.

The sync→event substitution is correct. Both removed device syncs are covered by targeted events: the smaller-leaf histogram via construct_done_event_ → cudaStreamWaitEvent(stream 0), and the subtracted larger-leaf histogram via subtract_done_event_ → cudaStreamWaitEvent(stream 1). Event lifecycle is clean (created cudaEventDisableTiming in Init, destroyed null-guarded in the constructor dtor, the finder holds non-owning copies), the global-memory path correctly keeps its device sync, and the terminal sync still fences the host consumer. It also composes cleanly with #25. Nice ~8.6% win.

To get it green/mergeable:

The typos hook is the blocker — cuda_best_split_finder.cu:1800 has "unstalled", which isn't a word. Reword (e.g. "keeps the host from stalling"). That's the whole lint failure. (Running joke continues: the kernels are bit-deterministic but the comments aren't spell-checked 😄 — pre-commit run -a!)
Rebase needed — [cuda] overlap both child leaves' SyncBestSplit on separate streams (deterministic) #25 (the SyncBestSplit stream overlap) just landed on master, and it edits the same best_split_finder.cu region and appends to test_dual.py:442 like this one does, so you'll get a conflict. They're logically complementary, just textually collide.
Minor, non-blocking: the "covers FixHistogram" claim holds for the larger leaf, but the smaller leaf reads its own in-place-fixed most-freq bin without ordering on subtract_done_event_. It's pre-existing in master (you're strictly more synchronized than before, not a regression), but worth a note — or have the smaller leaf wait on subtract_done_event_ too if you want it airtight.

You've got write access now, so once the typo's fixed and it's rebased you can merge it yourself. 🚀

maxwbuckley and others added 2 commits June 13, 2026 14:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[cuda] order FindBestSplits after histograms with events, not device syncs (~8.6% faster)#27

[cuda] order FindBestSplits after histograms with events, not device syncs (~8.6% faster)#27
maxwbuckley wants to merge 2 commits into
BelixRogner:masterfrom
maxwbuckley:cuda/perf-histogram-event-ordering

maxwbuckley commented Jun 13, 2026

Uh oh!

BelixRogner commented Jun 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

maxwbuckley commented Jun 13, 2026

Replace per-split histogram→FindBestSplits device syncs with event-based stream ordering

TL;DR

Change

Result

Correctness

Uh oh!

BelixRogner commented Jun 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants