cuD-PDLP #1391

Open
Bubullzz wants to merge 148 commits into NVIDIA:main from Bubullzz:cuD-PDLP

Bubullzz commented Jun 4, 2026

Contributor

Implemented METIS-partitioned multi-GPU PDLP.

To run PDLP on multiple GPUs, run:
./cpp/build/cuopt_cli ../path/to/file.mps --method 1 --use-distributed-pdlp true --presolve 0
The exact number of GPUs used can be set with --distributed-pdlp-num-gpus n.

All benchmarking results against D-PDLP and single-GPU cuOpt can be found in this spreadsheet.

Here is the bottom line of the results, on 8 NVLink-connected B200 GPUs:

Against cuOpt:

  • Speedup: at least 2.5x and up to 7.08x (tsp-gaia-10m.mps)
  • Memory footprint: ~8x on most instances

Against D-PDLP:

  • Speedup: slower on most instances but faster on the larger ones (psr_100, tsp-gaia-10m, ELMOD_876_10_noVEname), reaching up to a 2x speedup on ELMOD_876_10_noVEname.
  • Memory footprint: D-PDLP consistently has a smaller memory footprint than we do, but on the larger instances our extra footprint does not exceed 20%.

Note: the speedups against D-PDLP were computed with NVLS_SHARP=0, which disables a feature that could give D-PDLP a speedup of 1.1x to 1.75x. I am working with the compute-lab team to get it running.

closes #891

Bubullzz added 30 commits May 7, 2026 15:07
{
for (auto& s : shards) {
raft::device_setter guard(s->device_id);
fn(*s);

Contributor

Very cool! What is the behavior when fn is asynchronous from a GPU perspective? Will the asynchronous kernels launched within fn be correctly run on the device set above even if the calling thread has left the scope?
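
As background for this question: in CUDA, the device a kernel runs on is fixed when the launch is enqueued (by the current device, or by the stream it is launched into), so restoring the previous device afterwards does not move work that is already queued. The sketch below models that in plain C++, with no CUDA or raft calls. It assumes the guard in the diff behaves like a conventional RAII device guard; `current_device`, `ScopedDevice`, `enqueue` and `launch_on_each` are hypothetical names for illustration, not cuOpt or raft APIs.

```cpp
#include <cassert>
#include <vector>

// Toy model of the per-thread "current device" (stands in for the CUDA runtime state).
inline int& current_device()
{
  thread_local int device = 0;
  return device;
}

// Hypothetical RAII guard: switch device on construction, restore it on destruction.
class ScopedDevice {
 public:
  explicit ScopedDevice(int device) : previous_(current_device()) { current_device() = device; }
  ~ScopedDevice() { current_device() = previous_; }
  ScopedDevice(const ScopedDevice&)            = delete;
  ScopedDevice& operator=(const ScopedDevice&) = delete;

 private:
  int previous_;
};

// Toy "asynchronous launch": records the device that was current at enqueue time.
inline void enqueue(std::vector<int>& launch_log) { launch_log.push_back(current_device()); }

// Visit every shard under its own device, mirroring the shape of the loop in the diff.
inline std::vector<int> launch_on_each(const std::vector<int>& shard_devices)
{
  std::vector<int> launch_log;
  for (int device : shard_devices) {
    assert(device >= 0);
    ScopedDevice guard(device);
    enqueue(launch_log);  // the target device is captured here, before the guard is destroyed
  }
  return launch_log;
}
```

In this model, each entry of the returned log matches the shard's device even though every guard has been destroyed by the time the log is read, and the caller's current device is unchanged after the loop. Whether host code later synchronizes with that work on the right device is a separate concern that the model does not cover.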

auto& sub = *shard.sub_pdlp;
// turns the Tuple of lambdas into a tuple of rmm::device_uvector
auto cub_inputs = std::apply(
[&sub](auto&... acc) { return cuda::std::make_tuple(acc(sub)...); }, in_accessors);

Contributor

Not sure I'm following, where are we turning things into rmm::device_uvector? Also can't we directly wrap things in a cuda::std::make_tuple?
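
For readers following this thread, the pattern in the snippet can be sketched with the standard library alone: `std::apply` expands a tuple of accessor lambdas into a parameter pack, each accessor is called on the same object, and the results are collected into a new tuple. This is only an illustration under assumptions: `SubProblem`, `gather_inputs` and `sum_of_first_entries` are hypothetical names, `std::make_tuple` stands in for `cuda::std::make_tuple`, and the accessors here return raw pointers rather than whatever the PR's accessors return.

```cpp
#include <cassert>
#include <tuple>
#include <vector>

// Hypothetical stand-in for the per-shard solver state (not a cuOpt type).
struct SubProblem {
  std::vector<double> primal;
  std::vector<double> dual;
};

// Call every accessor in `accessors` on `sub` and collect the results in a tuple.
// std::apply is what unpacks the tuple of lambdas into the pack `acc...`.
template <typename... Accessors>
auto gather_inputs(SubProblem& sub, const std::tuple<Accessors...>& accessors)
{
  return std::apply(
    [&sub](const auto&... acc) { return std::make_tuple(acc(sub)...); }, accessors);
}

// Usage helper: build the accessor tuple, gather the pointers, read through them.
inline double sum_of_first_entries(SubProblem& sub)
{
  assert(!sub.primal.empty() && !sub.dual.empty());
  auto accessors = std::make_tuple([](SubProblem& s) { return s.primal.data(); },
                                   [](SubProblem& s) { return s.dual.data(); });
  auto [p, d] = gather_inputs(sub, accessors);  // tuple<double*, double*>
  return p[0] + d[0];
}
```

Note that `std::make_tuple` stores decayed values, so in this sketch the resulting tuple holds pointers into the existing buffers and does not create any new containers.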

@Bubullzz

Contributor Author

/ok to test d8a1fa8

@bbozkaya

Member

The --use-distributed-pdlp argument is not recognized when I build cuOpt and run cuopt_cli.

@Bubullzz

Contributor Author

/ok to test 1a5b941

@Bubullzz

Contributor Author

/ok to test e267972

@bbozkaya

Member

All CLI arguments work now, as validated on B200 with up to 8 GPUs.

@Bubullzz

Contributor Author

/ok to test 368b3b3

@Bubullzz

Contributor Author

/ok to test 6948bc5

@Bubullzz

Contributor Author

/ok to test 1563cdc

@Bubullzz

Contributor Author

/ok to test d0de284

@Bubullzz

Contributor Author

/ok to test 7bb6945

@Bubullzz

Contributor Author

/ok to test 21cdccc

@github-actions


CI Test Summary

1 failed · 30 passed · 0 skipped

conda-cpp-tests / 13.3.0, 3.13, arm64, ubuntu26.04, l4, latest-driver, latest-deps — 1 failed test
  • mip_solve.incumbent_get_set_callback_test

Labels

feature request, non-breaking, pdlp

Development

Successfully merging this pull request may close these issues.

[FEA] Multi GPU PDLP

3 participants