cuD-PDLP #1391
…he cycle seems to be fixed, cuopt compiles
+ style too
compiles and runs
{
  for (auto& s : shards) {
    raft::device_setter guard(s->device_id);
    fn(*s);
Very cool! What is the behavior when fn is asynchronous from a GPU perspective? Will the asynchronous kernels launched within fn be correctly run on the device set above even if the calling thread has left the scope?
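To make the scoping question concrete, here is a minimal host-only sketch of the guard-per-shard loop. Everything in it is a hypothetical stand-in: `g_current_device` mocks the calling thread's current CUDA device, `mock_device_setter` assumes the guard is a plain RAII set-and-restore wrapper, and `shard_t`, `for_each_shard` and `run_demo` are illustrative names, not the PR's code. It only shows what the host thread sees; it does not launch any GPU work.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Hypothetical stand-in for the calling thread's current CUDA device.
// In real code this state lives in the CUDA runtime (cudaSetDevice/cudaGetDevice).
static thread_local int g_current_device = 0;

// Hypothetical RAII guard (assumption: set the device on construction,
// restore the previous device on destruction).
class mock_device_setter {
 public:
  explicit mock_device_setter(int device_id) : previous_(g_current_device)
  {
    g_current_device = device_id;
  }
  ~mock_device_setter() { g_current_device = previous_; }
  mock_device_setter(const mock_device_setter&)            = delete;
  mock_device_setter& operator=(const mock_device_setter&) = delete;

 private:
  int previous_;
};

// Hypothetical shard type; only the device id matters for this sketch.
struct shard_t {
  int device_id;
};

// Runs fn once per shard with that shard's device made current.
template <typename Fn>
void for_each_shard(std::vector<std::unique_ptr<shard_t>>& shards, Fn&& fn)
{
  for (auto& s : shards) {
    mock_device_setter guard(s->device_id);
    fn(*s);
  }  // guard is destroyed at the end of each iteration
}

// Builds n shards (device ids 0..n-1) and records the device that was
// current while fn ran for each of them.
std::vector<int> run_demo(int n)
{
  assert(n >= 0);
  std::vector<std::unique_ptr<shard_t>> shards;
  for (int i = 0; i < n; ++i) {
    shards.push_back(std::make_unique<shard_t>(shard_t{i}));
  }
  std::vector<int> seen;
  for_each_shard(shards, [&seen](shard_t&) { seen.push_back(g_current_device); });
  return seen;
}
```

Calling `run_demo(3)` records devices 0, 1 and 2 and leaves the mocked current device back at 0. In the CUDA runtime, the current device only affects subsequent calls made by the host thread; a kernel already enqueued on a stream runs on the device that stream belongs to, so restoring the previous device after `fn` returns should not move in-flight work. That relies on `fn` launching on streams created for the shard's device, which this sketch does not check.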
auto& sub = *shard.sub_pdlp;
// turns the Tuple of lambdas into a tuple of rmm::device_uvector
auto cub_inputs = std::apply(
  [&sub](auto&... acc) { return cuda::std::make_tuple(acc(sub)...); }, in_accessors);
Not sure I'm following, where are we turning things into rmm::device_uvector? Also can't we directly wrap things in a cuda::std::make_tuple?
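For reference, a small host-only sketch of the pattern under discussion: applying a tuple of accessor lambdas to one object and packing the results. It uses `std::make_tuple` rather than `cuda::std::make_tuple` so that it builds without the CUDA toolkit, and `sub_problem_t`, `gather_inputs` and `demo_sum` are hypothetical names, not the PR's types. The accessors here return raw pointers purely for illustration; what the real accessors return is not shown in the diff.

```cpp
#include <cassert>
#include <tuple>
#include <vector>

// Hypothetical stand-in for a per-shard solver state.
struct sub_problem_t {
  std::vector<double> primal;
  std::vector<double> dual;
};

// Calls every accessor in the tuple on the same sub-problem and packs the
// results. std::make_tuple stores decayed copies of whatever each accessor
// returns, so the element types are decided by the accessors, not by this helper.
template <typename Sub, typename... Accessors>
auto gather_inputs(Sub& sub, const std::tuple<Accessors...>& in_accessors)
{
  return std::apply(
    [&sub](const auto&... acc) { return std::make_tuple(acc(sub)...); }, in_accessors);
}

// Illustrative use: two accessors returning raw data pointers.
double demo_sum()
{
  sub_problem_t sub{{1.0, 2.0}, {3.0}};
  assert(sub.primal.size() == 2 && sub.dual.size() == 1);

  auto in_accessors = std::make_tuple(
    [](sub_problem_t& s) { return s.primal.data(); },
    [](sub_problem_t& s) { return s.dual.data(); });

  auto inputs = gather_inputs(sub, in_accessors);
  static_assert(std::tuple_size_v<decltype(inputs)> == 2, "one entry per accessor");

  // Second primal entry plus first dual entry.
  return std::get<0>(inputs)[1] + std::get<1>(inputs)[0];
}
```

The `std::apply` wrapper is only needed because the accessors arrive as a tuple and have to be expanded into a parameter pack; with the accessors available as separate arguments, calling `make_tuple(acc1(sub), acc2(sub))` directly would give the same result.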
/ok to test d8a1fa8
…tings rather than hyper_params
--use-distributed-pdlp argument not recognized when I build cuOpt and run cuopt_cli
/ok to test 1a5b941
/ok to test e267972
all CLI arguments work now, as validated on B200 with up to 8 GPUs.
/ok to test 368b3b3
/ok to test 6948bc5
/ok to test 1563cdc
/ok to test d0de284
/ok to test 7bb6945
/ok to test 21cdccc
CI Test Summary: 1 failed · 30 passed · 0 skipped
Implemented METIS-partitioned multi-GPU PDLP.
To run PDLP on multiple GPUs, run:
./cpp/build/cuopt_cli ../path/to/file.mps --method 1 --use-distributed-pdlp true --presolve 0
The exact number of GPUs used can be set with --distributed-pdlp-num-gpus n.
All benchmarking results against D-PDLP and single-GPU cuOpt can be found in this spreadsheet.
Here is the bottom line of the results
On 8 NVLink-connected B200 GPUs:
Against cuOpt:
Against D-PDLP:
Note: the speedups against D-PDLP are computed with NVLS_SHARP=0, which disables a feature that could give D-PDLP a speedup of 1.1x to 1.75x. I am working with the compute-lab team to make it work.
closes #891