Skip to content

[cuda] order FindBestSplits after histograms with events, not device syncs (~8.6% faster)#27

Open
maxwbuckley wants to merge 2 commits into
BelixRogner:masterfrom
maxwbuckley:cuda/perf-histogram-event-ordering
Open

[cuda] order FindBestSplits after histograms with events, not device syncs (~8.6% faster)#27
maxwbuckley wants to merge 2 commits into
BelixRogner:masterfrom
maxwbuckley:cuda/perf-histogram-event-ordering

Conversation

@maxwbuckley

Copy link
Copy Markdown
Collaborator

Replace per-split histogram→FindBestSplits device syncs with event-based stream ordering

TL;DR

Per-split CUDA training is host-synchronization bound. nsys on an RTX 5090
(sm_120, CUDA 13.2), 500k×100 / 255-leaf regression fit: cudaDeviceSynchronize
45% of wall time (~9 syncs per leaf split), while all GPU kernels together are ≈
21%. Two of those device syncs sit between histogram construction and the
FindBestSplits launches:

  • cuda_histogram_constructor: a SynchronizeCUDADevice after constructing the
    smaller-leaf histogram, so the best split finder (which runs on its own streams)
    sees it.
  • cuda_best_split_finder: a SynchronizeCUDADevice between the smaller-leaf and
    larger-leaf FindBestSplits launches, so the larger leaf sees the subtracted
    histogram.

Both are full device syncs (host↔GPU round trips) on every split.

Change

The histogram constructor records two timing-disabled CUDA events on its stream:
construct_done_event_ after the smaller-leaf histogram and subtract_done_event_
after the subtract (which also covers the in-place FixHistogram that precedes it).
The best split finder waits on these via cudaStreamWaitEventconstruct on
stream 0 before the smaller-leaf search, subtract on stream 1 before the larger-leaf
search — instead of the device syncs. The two leaves then run concurrently with no
host stall. Events are wired once in the tree learner after both objects init.

The global-memory FindBestSplits path keeps its device sync: it shares
cuda_feature_hist_grad/hess_buffer_ between the two leaves, so they must not run
concurrently. It gains only the construct/subtract visibility waits.

Result

Interleaved A/B, n=30, Welch's t, full-tree double-precision (500k×100, 255 leaves):
+8.6% mean end-to-end training (median +8.8%, t=4.64, significant), lower variance.

Correctness

  • CPU/CUDA parity test added in test_dual.py (multi-leaf regression, several
    num_leaves): CUDA matches CPU to ≤4e-16 on the deterministic single-thread,
    double-precision config.
  • On the non-deterministic large double path, predictions stay within the baseline's
    own run-to-run noise floor (max|Δ| ≈ 2.5e-7, same order as base-vs-base).

Independent of (and composable with) #25 (SyncBestSplit overlap); they touch different
functions. Measured together they compound to ≈ +21% on the integration branch.

maxwbuckley and others added 2 commits June 13, 2026 14:11
…syncs

Per-split CUDA training is host-synchronization bound (nsys on a 500k x 100,
255-leaf regression fit: cudaDeviceSynchronize ~45% of wall time, ~9 syncs per
split; all GPU kernels ~21%). Two of those device syncs sit between histogram
construction and the FindBestSplits launches:

  - cuda_histogram_constructor: a SynchronizeCUDADevice after constructing the
    smaller-leaf histogram, so the best split finder (its own streams) sees it.
  - cuda_best_split_finder: a SynchronizeCUDADevice between the smaller-leaf and
    larger-leaf FindBestSplits launches, so the larger leaf sees the subtracted
    histogram.

Replace them with GPU-side ordering: the histogram constructor records two
timing-disabled events on its stream -- construct_done_event_ after the smaller-leaf
histogram and subtract_done_event_ after the subtract (which also covers the in-place
FixHistogram that precedes it). The best split finder waits on construct_done_event_
on stream 0 before the smaller-leaf FindBestSplits and on subtract_done_event_ on
stream 1 before the larger-leaf FindBestSplits. The two leaves then run concurrently
with no host stall and no device sync. Events are wired once in the tree learner.

The global-memory FindBestSplits path keeps its device sync: it shares
cuda_feature_hist_grad/hess_buffer_ between the two leaves, so they must not run
concurrently. It gains only the construct/subtract visibility waits.

Correctness: on the deterministic small-data config CUDA stays bit-for-bit aligned
with CPU (see test_dual.py). On the non-deterministic large double path, predictions
stay within the baseline's own run-to-run noise floor.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ring

Guards the event-based ordering that replaced the per-split histogram->FindBestSplits
device syncs: trains multi-leaf regression trees on CPU and CUDA (deterministic,
single-thread, double-precision) and asserts predictions match to 1e-9. A missing or
incorrect cudaStreamWaitEvent would let FindBestSplits read a histogram before it is
written and diverge far beyond that.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@BelixRogner

Copy link
Copy Markdown
Owner

Thank you, Max — and thank you, Claude Code (independently 🙂). Verdict: NEEDS-FIX, but the core is solid.

The sync→event substitution is correct. Both removed device syncs are covered by targeted events: the smaller-leaf histogram via construct_done_event_cudaStreamWaitEvent(stream 0), and the subtracted larger-leaf histogram via subtract_done_event_cudaStreamWaitEvent(stream 1). Event lifecycle is clean (created cudaEventDisableTiming in Init, destroyed null-guarded in the constructor dtor, the finder holds non-owning copies), the global-memory path correctly keeps its device sync, and the terminal sync still fences the host consumer. It also composes cleanly with #25. Nice ~8.6% win.

To get it green/mergeable:

  1. The typos hook is the blockercuda_best_split_finder.cu:1800 has "unstalled", which isn't a word. Reword (e.g. "keeps the host from stalling"). That's the whole lint failure. (Running joke continues: the kernels are bit-deterministic but the comments aren't spell-checked 😄 — pre-commit run -a!)
  2. Rebase needed[cuda] overlap both child leaves' SyncBestSplit on separate streams (deterministic) #25 (the SyncBestSplit stream overlap) just landed on master, and it edits the same best_split_finder.cu region and appends to test_dual.py:442 like this one does, so you'll get a conflict. They're logically complementary, just textually collide.
  3. Minor, non-blocking: the "covers FixHistogram" claim holds for the larger leaf, but the smaller leaf reads its own in-place-fixed most-freq bin without ordering on subtract_done_event_. It's pre-existing in master (you're strictly more synchronized than before, not a regression), but worth a note — or have the smaller leaf wait on subtract_done_event_ too if you want it airtight.

You've got write access now, so once the typo's fixed and it's rebased you can merge it yourself. 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants