[cuda] order FindBestSplits after histograms with events, not device syncs (~8.6% faster) #27
…syncs
Per-split CUDA training is host-synchronization bound (nsys on a 500k x 100,
255-leaf regression fit: cudaDeviceSynchronize ~45% of wall time, ~9 syncs per
split; all GPU kernels ~21%). Two of those device syncs sit between histogram
construction and the FindBestSplits launches:
- cuda_histogram_constructor: a SynchronizeCUDADevice after constructing the
smaller-leaf histogram, so the best split finder (its own streams) sees it.
- cuda_best_split_finder: a SynchronizeCUDADevice between the smaller-leaf and
larger-leaf FindBestSplits launches, so the larger leaf sees the subtracted
histogram.
Replace them with GPU-side ordering: the histogram constructor records two
timing-disabled events on its stream -- construct_done_event_ after the smaller-leaf
histogram and subtract_done_event_ after the subtract (which also covers the in-place
FixHistogram that precedes it). The best split finder waits on construct_done_event_
on stream 0 before the smaller-leaf FindBestSplits and on subtract_done_event_ on
stream 1 before the larger-leaf FindBestSplits. The two leaves then run concurrently
with no host stall and no device sync. Events are wired once in the tree learner.
The global-memory FindBestSplits path keeps its device sync: it shares
cuda_feature_hist_grad/hess_buffer_ between the two leaves, so they must not run
concurrently. It gains only the construct/subtract visibility waits.
Correctness: on the deterministic small-data config CUDA stays bit-for-bit aligned
with CPU (see test_dual.py). On the non-deterministic large double path, predictions
stay within the baseline's own run-to-run noise floor.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ring
Guards the event-based ordering that replaced the per-split histogram->FindBestSplits
device syncs: trains multi-leaf regression trees on CPU and CUDA (deterministic,
single-thread, double-precision) and asserts predictions match to 1e-9. A missing or
incorrect cudaStreamWaitEvent would let FindBestSplits read a histogram before it is
written and diverge far beyond that.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Thank you, Max, and thank you, Claude Code (independently 🙂).
Verdict: NEEDS-FIX, but the core is solid. The sync→event substitution is correct. Both removed device syncs are covered by targeted events: the smaller-leaf histogram via …
To get it green/mergeable:
You've got write access now, so once the typo's fixed and it's rebased you can merge it yourself. 🚀
Replace per-split histogram→FindBestSplits device syncs with event-based stream ordering
TL;DR
Per-split CUDA training is host-synchronization bound. nsys on an RTX 5090
(sm_120, CUDA 13.2), 500k×100 / 255-leaf regression fit:
cudaDeviceSynchronize ≈ 45% of wall time (~9 syncs per leaf split), while all GPU
kernels together are ≈ 21%. Two of those device syncs sit between histogram
construction and the FindBestSplits launches:
- cuda_histogram_constructor: a SynchronizeCUDADevice after constructing the
smaller-leaf histogram, so the best split finder (which runs on its own streams)
sees it.
- cuda_best_split_finder: a SynchronizeCUDADevice between the smaller-leaf and
larger-leaf FindBestSplits launches, so the larger leaf sees the subtracted
histogram.
Both are full device syncs (host↔GPU round trips) on every split.
Change
The histogram constructor records two timing-disabled CUDA events on its stream:
construct_done_event_ after the smaller-leaf histogram and subtract_done_event_
after the subtract (which also covers the in-place FixHistogram that precedes it).
The best split finder waits on these via cudaStreamWaitEvent instead of the device
syncs: construct_done_event_ on stream 0 before the smaller-leaf search, and
subtract_done_event_ on stream 1 before the larger-leaf search. The two leaves then
run concurrently with no host stall. Events are wired once in the tree learner,
after both objects are initialized.
The global-memory FindBestSplits path keeps its device sync: it shares
cuda_feature_hist_grad/hess_buffer_ between the two leaves, so they must not run
concurrently. It gains only the construct/subtract visibility waits.
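The dependency graph above can be pictured with a host-thread analogy. This is a sketch only, not CUDA code and not the PR's implementation: each Python thread stands in for a stream, and threading.Event stands in for a CUDA event (set plays the role of cudaEventRecord, wait plays the role of cudaStreamWaitEvent). The function and step names are hypothetical.

```python
import threading


def run_split():
    """Simulate one split and return the order in which the steps ran."""
    construct_done = threading.Event()  # stands in for construct_done_event_
    subtract_done = threading.Event()   # stands in for subtract_done_event_
    order = []
    lock = threading.Lock()

    def step(name):
        # Append under a lock so the recorded order is well defined.
        with lock:
            order.append(name)

    def histogram_stream():
        step("construct_smaller_histogram")
        construct_done.set()            # analogous to recording the event
        step("subtract_histogram")
        subtract_done.set()

    def finder_stream_0():
        construct_done.wait()           # only this "stream" waits, nothing else
        step("find_best_splits_smaller")

    def finder_stream_1():
        subtract_done.wait()
        step("find_best_splits_larger")

    # Start the consumers first: correctness comes from the events,
    # not from the order in which work is launched.
    workers = [threading.Thread(target=f)
               for f in (finder_stream_1, finder_stream_0, histogram_stream)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return order
```

Calling run_split() always places each search after the histogram step it depends on, while the relative order of the smaller-leaf search and the subtract is left free. That freedom is the overlap a device-wide sync would rule out.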
Result
Interleaved A/B, n=30, Welch's t, full-tree double-precision (500k×100, 255 leaves):
mean end-to-end training is 8.6% faster (median 8.8%, t=4.64, significant), with lower variance.
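For readers who want to reproduce this kind of comparison, here is a minimal sketch of Welch's unequal-variance t statistic using only the standard library. It assumes two lists of per-run wall-clock times; the function name welch_t is hypothetical and this is not the script used for the numbers above.

```python
import math
import statistics


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance t-test on two samples."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # Variance of each sample mean: sample variance (n - 1 denominator) over n.
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (mean_a - mean_b) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

The sign of t depends on argument order, so pass baseline times first if a positive t should mean the patched build is faster. Interleaving the A and B runs, as done here, helps keep thermal and background-load drift from landing on one arm only.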
Correctness
- test_dual.py (multi-leaf regression, several num_leaves): CUDA matches CPU to
≤4e-16 on the deterministic single-thread, double-precision config.
- On the non-deterministic large double path, predictions stay within the baseline's
own run-to-run noise floor (max|Δ| ≈ 2.5e-7, same order as base-vs-base).
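The noise-floor comparison can be sketched as follows. This is an illustration under assumptions, not the contents of test_dual.py: the helper names are hypothetical, and the factor of 10 used to mean "same order of magnitude" is an arbitrary choice.

```python
def max_abs_diff(xs, ys):
    """Largest element-wise absolute difference between two prediction vectors."""
    if len(xs) != len(ys):
        raise ValueError("prediction vectors must have the same length")
    return max(abs(x - y) for x, y in zip(xs, ys))


def within_noise_floor(candidate, base_run_a, base_run_b, factor=10.0):
    """True if candidate-vs-baseline drift is comparable to base-vs-base drift."""
    # Noise floor: how much two runs of the unmodified baseline disagree.
    noise = max_abs_diff(base_run_a, base_run_b)
    # Drift: how much the patched build disagrees with one baseline run.
    drift = max_abs_diff(candidate, base_run_a)
    return drift <= factor * noise
```

A check like this only makes sense on the non-deterministic path. On the deterministic config a fixed tight tolerance is the stronger guard, since a missing wait would show up as a much larger divergence.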
Independent of (and composable with) #25 (SyncBestSplit overlap); they touch different
functions. Measured together they compound to ≈ +21% on the integration branch.