Improve mGPU Gaussian tile intersection#664

Open
matthewdcong wants to merge 4 commits into
openvdb:mainfrom
matthewdcong:decoupled_mgpu_sort

Contributor

@matthewdcong matthewdcong commented May 30, 2026

Previous iterations of mGPU Gaussian tile intersection include:

  1. Distribute every step of a single mGPU tile sort: compute the tile intersections for all Gaussians across all tiles, then run a parallel mGPU radix sort. This requires significantly more communication and synchronization during the parallel radix sort, as well as temporary output key and value arrays for cross-device merging. (previously in main)
  2. Observe that the radix sort is independent for each camera, and run the radix sort for each camera entirely on a single device. This works well when batch size = num GPUs, but performs poorly when batch size = 1 and num GPUs > 1, because only one GPU is used and all data must be gathered onto that GPU. (currently in main)

This PR introduces a more performant strategy that pre-partitions the Gaussians: each GPU computes the intersections of all Gaussians with only the tile range that it renders. Since the tile keys are monotonically increasing, the subsequent sorting process is decoupled, i.e. we can sort the per-GPU Gaussian tile intersection lists independently, and the resulting flattened array is guaranteed to be sorted. This significantly reduces the communication and data transfer required during sorting. Moreover, switching from radix sort to merge sort lets us remove the temporary output buffers, which further reduces stalls due to prefetching and lowers peak memory utilization.
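The decoupling argument can be illustrated with a small CPU-only sketch. This is not the code in this PR: the function names, the `(tile, depth, gaussian_id)` key layout, and the even split of tiles across devices are all assumptions made for illustration. It only shows that when each device owns a contiguous tile range, sorting each device's intersection list independently and concatenating the lists in device order gives the same result as one global sort.

```python
import random


def tile_ranges(num_tiles, num_devices):
    """Split [0, num_tiles) into contiguous, near-equal ranges, one per device.

    Hypothetical partitioning; the real per-GPU tile assignment may differ.
    """
    base, extra = divmod(num_tiles, num_devices)
    ranges, start = [], 0
    for d in range(num_devices):
        stop = start + base + (1 if d < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def intersections_for_range(gaussians, tile_range):
    """Emit (tile, depth, gaussian_id) keys for tiles inside tile_range only.

    `gaussians` is a list of (depth, tiles) pairs, where `tiles` lists the
    tile ids a Gaussian overlaps (an assumed, simplified representation).
    """
    start, stop = tile_range
    return [
        (tile, depth, gid)
        for gid, (depth, tiles) in enumerate(gaussians)
        for tile in tiles
        if start <= tile < stop
    ]


def decoupled_sort(gaussians, num_tiles, num_devices):
    """Sort each device's list independently, then concatenate in device order."""
    flattened = []
    for tile_range in tile_ranges(num_tiles, num_devices):
        local = intersections_for_range(gaussians, tile_range)
        local.sort()  # stands in for the per-GPU sort; no cross-device merge
        flattened.extend(local)
    return flattened


def reference_sort(gaussians, num_tiles):
    """Single global sort over every intersection, for comparison."""
    return sorted(intersections_for_range(gaussians, (0, num_tiles)))


def make_gaussians(count, num_tiles, seed=0):
    """Random test data: each Gaussian gets a depth and 1 to 4 distinct tiles."""
    rng = random.Random(seed)
    return [
        (rng.random(), rng.sample(range(num_tiles), rng.randint(1, 4)))
        for _ in range(count)
    ]
```

Because every key produced on device `d` has a tile id smaller than every key on device `d + 1`, no cross-device merge step is needed in this sketch; the only shared input is the list of Gaussians itself.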

This is a performance improvement across the board, and it becomes more significant as the number of GPUs increases. On 8x A100s, it improves end-to-end reconstruction performance by about 15% with a batch size of 1 (on a relatively small problem).

Signed-off-by: Matthew Cong <mcong@nvidia.com>
@matthewdcong matthewdcong requested a review from a team as a code owner May 30, 2026 15:20