Improve mGPU Gaussian tile intersection by matthewdcong · Pull Request #664 · openvdb/fvdb-core

matthewdcong · 2026-05-30T15:20:30Z

Previous iterations of mGPU Gaussian tile intersection include:

1. Distribute every step of the single-GPU tile sort: compute the tile intersections for all Gaussians across all tiles, then run a parallel mGPU radix sort. This requires significantly more communication and synchronization during the parallel radix sort, as well as temporary output key and value arrays for cross-device merging. (previously in main)
2. Observe that the radix sort is independent for each camera, and run each camera's radix sort entirely on a single device. This works well when batch size = num GPUs, but performs poorly when batch size = 1 and num GPUs > 1, because only one GPU is used and all data must be gathered onto it. (currently in main)

This PR introduces a more performant strategy that pre-partitions the Gaussians: each GPU computes the intersections of all Gaussians with only the subset of tiles (the tile range) that it renders. Because the tile keys increase monotonically from one GPU's tile range to the next, the subsequent sort is decoupled, i.e. we can sort the per-GPU Gaussian tile intersection lists independently and the resulting flattened array is guaranteed to be sorted. This significantly reduces the communication and data transfer required during sorting. Moreover, switching from radix sort to merge sort lets us remove the temporary output buffers, which further reduces stalls due to prefetching and lowers peak memory utilization.
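The decoupling argument can be checked with a small CPU-side sketch. This is not the fvdb-core implementation: the helper names (`tile_ranges`, `intersect_and_sort`, `decoupled_sort`), the 1D tile span per Gaussian, and the `(tile, depth)` tuple key are all assumptions made for illustration, and the per-device work runs sequentially here rather than on separate GPUs.

```python
from typing import List, Tuple

# Hypothetical record: (gaussian_id, depth, first_tile, last_tile), with an
# inclusive 1D tile span standing in for the real 2D tile footprint.
Gaussian = Tuple[int, float, int, int]
# Hypothetical sort key (tile, depth) paired with the Gaussian id as the value.
Pair = Tuple[Tuple[int, float], int]


def tile_ranges(num_tiles: int, num_devices: int) -> List[Tuple[int, int]]:
    """Split [0, num_tiles) into contiguous, near-equal half-open ranges."""
    base, extra = divmod(num_tiles, num_devices)
    ranges, start = [], 0
    for device in range(num_devices):
        end = start + base + (1 if device < extra else 0)
        ranges.append((start, end))
        start = end
    return ranges


def intersect_and_sort(gaussians: List[Gaussian],
                       tile_range: Tuple[int, int]) -> List[Pair]:
    """Emit intersections restricted to one device's tile range, then sort them."""
    lo, hi = tile_range
    pairs: List[Pair] = []
    for gid, depth, first_tile, last_tile in gaussians:
        # Clip the Gaussian's tile span to the range owned by this device.
        for tile in range(max(first_tile, lo), min(last_tile + 1, hi)):
            pairs.append(((tile, depth), gid))
    pairs.sort(key=lambda kv: kv[0])  # local sort only, no cross-device merge
    return pairs


def decoupled_sort(gaussians: List[Gaussian],
                   num_tiles: int, num_devices: int) -> List[Pair]:
    """Concatenate per-device sorted lists in device order."""
    per_device = [intersect_and_sort(gaussians, r)
                  for r in tile_ranges(num_tiles, num_devices)]
    return [pair for chunk in per_device for pair in chunk]
```

Every key produced for one device's range has a tile index below every key produced for the next device's range, so concatenating the locally sorted lists in device order gives the same result as one global sort over all tiles.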

This is a performance improvement across the board, and it becomes more significant as the number of GPUs increases. On 8x A100s, it improves end-to-end reconstruction performance by about 15% with a batch size of 1 (on a relatively small problem).

Signed-off-by: Matthew Cong <mcong@nvidia.com>

matthewdcong added 2 commits May 30, 2026 08:07

Pre-partition intersections in mGPU Gaussian-tile sort

04415be

Signed-off-by: Matthew Cong <mcong@nvidia.com>

Store device totals in pinned memory

890a32d

Signed-off-by: Matthew Cong <mcong@nvidia.com>

matthewdcong requested a review from a team as a code owner May 30, 2026 15:20

matthewdcong requested review from blackencino and sifakis May 30, 2026 15:20

matthewdcong added 2 commits May 30, 2026 16:57

Fix template deduction

dd2d6ff

Signed-off-by: Matthew Cong <mcong@nvidia.com>

Fix sync call

94c5143

Signed-off-by: Matthew Cong <mcong@nvidia.com>

matthewdcong wants to merge 4 commits into openvdb:main from matthewdcong:decoupled_mgpu_sort
