failed to run some allreduce algorithm in msccl executor #42

@banjiaojuhao

I generated two allreduce algorithms for DGX1. One works (C48-S14-R14), but the other (C8-S4-R4) does not.
msccl failed to generate allreduce directly, so reducescatter and allgather have to be generated separately and then composed.

environment

Server: DGX1
OS: Ubuntu 22.04.3 LTS
Driver Version: 535.161.07
Docker Engine - Community, Version: 27.5.1
Docker image: nvcr.io/nvidia/cuda:12.2.2-devel-ubuntu22.04 (b56b435576e8)

steps to reproduce

compile runtime

# start container
docker run -dt --gpus all --hostname msccl-azure --name msccl-azure nvcr.io/nvidia/cuda:12.2.2-devel-ubuntu22.04
docker exec -it msccl-azure bash
apt update && apt install git libopenmpi-dev python3 python3-pip -y
adduser azure
su - azure

cd
git clone https://github.com/Azure/msccl.git --recurse-submodules

cd ~/msccl/executor/msccl-executor-nccl/
make -j src.build NVCC_GENCODE="-gencode=arch=compute_70,code=sm_70"

cd ~/msccl/tests/msccl-tests-nccl/
make MPI=1 MPI_HOME=/usr/include/x86_64-linux-gnu/mpi,/ NCCL_HOME=~/msccl/executor/msccl-executor-nccl/build/ -j

synthesize algorithms

I generated a latency-optimal C8-S4-R4 algorithm and a bandwidth-optimal C48-S14-R14 algorithm.
See attachments Allreduce.n8-DGX1-steps4.msccl.xml.txt and Allreduce.n8-DGX1-steps14.chunks6.msccl.xml.txt.

cd
pip install git+https://github.com/azure/msccl-tools.git

# failed to generate allreduce directly!
azure@msccl-azure:~$ msccl solve instance DGX1 Allreduce --chunks 8 --steps 4 --rounds 4
Solving instance steps=4,chunks=8... unsatisfiable. (94.0s)

# generate C8-S4-R4 Allreduce by composing ReduceScatter & Allgather
azure@msccl-azure:~$ msccl solve instance DGX1 ReduceScatter --chunks 1 --steps 2 --rounds 2
Solving instance steps=2... synthesized! (0.2s)
Wrote to ReduceScatter.n8-DGX1-steps2.msccl.json
azure@msccl-azure:~$ msccl solve instance DGX1 Allgather --chunks 1 --steps 2 --rounds 2
Solving instance steps=2... synthesized! (0.2s)
Wrote to Allgather.n8-DGX1-steps2.msccl.json
azure@msccl-azure:~$ msccl compose allreduce ReduceScatter.n8-DGX1-steps2.msccl.json Allgather.n8-DGX1-steps2.msccl.json
Wrote to Allreduce.n8-DGX1-steps4.msccl.json
azure@msccl-azure:~$ msccl ncclize Allreduce.n8-DGX1-steps4.msccl.json
Wrote to Allreduce.n8-DGX1-steps4.msccl.xml

# C48-S14-R14
azure@msccl-azure:~$ msccl solve instance DGX1 ReduceScatter --chunks 6 --steps 7 --rounds 7
Solving instance steps=7,chunks=6... synthesized! (16.6s)
Wrote to ReduceScatter.n8-DGX1-steps7.chunks6.msccl.json
azure@msccl-azure:~$ msccl solve instance DGX1 Allgather --chunks 6 --steps 7 --rounds 7
Solving instance steps=7,chunks=6... synthesized! (16.6s)
Wrote to Allgather.n8-DGX1-steps7.chunks6.msccl.json
azure@msccl-azure:~$ msccl compose allreduce ReduceScatter.n8-DGX1-steps7.chunks6.msccl.json Allgather.n8-DGX1-steps7.chunks6.msccl.json
Wrote to Allreduce.n8-DGX1-steps14.chunks6.msccl.json
azure@msccl-azure:~$ msccl ncclize Allreduce.n8-DGX1-steps14.chunks6.msccl.json
Wrote to Allreduce.n8-DGX1-steps14.chunks6.msccl.xml
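The C-S-R labels used in this issue appear to follow from the per-collective solver arguments: chunks multiplied by the number of ranks, and steps and rounds summed across the two composed halves. This rule is inferred from the two runs above and is not taken from the msccl-tools documentation. A minimal sketch (the helper name `composed_label` is hypothetical):

```python
# Hypothetical helper: derive the C-S-R label of an allreduce composed from a
# ReduceScatter and an Allgather. The rule is an assumption inferred from the
# two transcripts in this issue, not an msccl-tools API.
def composed_label(ranks, reduce_scatter, allgather):
    """Each collective is a (chunks, steps, rounds) tuple, as passed to
    `msccl solve instance ... --chunks C --steps S --rounds R`."""
    rs_chunks, rs_steps, rs_rounds = reduce_scatter
    ag_chunks, ag_steps, ag_rounds = allgather
    if rs_chunks != ag_chunks:
        raise ValueError("both halves are expected to use the same chunk count")
    chunks = ranks * rs_chunks        # per-rank chunks times number of GPUs
    steps = rs_steps + ag_steps       # the two halves run back to back
    rounds = rs_rounds + ag_rounds
    return f"C{chunks}-S{steps}-R{rounds}"


# The two configurations from this issue, on an 8-GPU DGX1.
print(composed_label(8, (1, 2, 2), (1, 2, 2)))  # C8-S4-R4
print(composed_label(8, (6, 7, 7), (6, 7, 7)))  # C48-S14-R14
```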

bench algorithm

bench C8-S4-R4

According to the first line of the XML file (inplace="0" outofplace="1"), the out-of-place column is executed by MSCCL and the in-place column by NCCL.

azure@msccl-azure:~$ head Allreduce.n8-DGX1-steps4.msccl.xml
<algo name="Allreduce(n=8)-DGX1-steps=4" proto="Simple" nchannels="2" ngpus="8" inplace="0" outofplace="1" minBytes="0" maxBytes="0" coll="allreduce" nchunksperloop="1">
  <gpu id="0" i_chunks="1" o_chunks="1" s_chunks="7">
    <tb id="0" send="-1" recv="1" chan="0">
      <step s="0" type="rrc" srcbuf="s" srcoff="3" dstbuf="s" dstoff="3" cnt="1" depid="2" deps="0" hasdep="1"/>
      <step s="1" type="rrc" srcbuf="i" srcoff="0" dstbuf="i" dstoff="0" cnt="1" depid="-1" deps="-1" hasdep="1"/>
      <step s="2" type="r" srcbuf="s" srcoff="4" dstbuf="s" dstoff="4" cnt="1" depid="4" deps="0" hasdep="0"/>
    </tb>
    <tb id="1" send="-1" recv="2" chan="0">
      <step s="0" type="r" srcbuf="s" srcoff="3" dstbuf="s" dstoff="3" cnt="1" depid="-1" deps="-1" hasdep="1"/>
      <step s="1" type="r" srcbuf="s" srcoff="1" dstbuf="s" dstoff="1" cnt="1" depid="5" deps="1" hasdep="0"/>
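To check which buffer mode an algorithm file claims, the attributes of the opening `<algo>` tag can be read directly. The sketch below uses a regular expression on the first line only, because the `head` output above is a truncated (not well-formed) XML fragment. The helper name `parse_open_tag` is hypothetical, and the reading of `inplace`/`outofplace` follows the statement in this issue rather than any documented schema.

```python
import re


def parse_open_tag(line):
    """Return the attributes of one XML opening tag as a dict of strings."""
    return dict(re.findall(r'(\w+)="([^"]*)"', line))


# First line of Allreduce.n8-DGX1-steps4.msccl.xml, copied from the head output.
header = (
    '<algo name="Allreduce(n=8)-DGX1-steps=4" proto="Simple" nchannels="2" '
    'ngpus="8" inplace="0" outofplace="1" minBytes="0" maxBytes="0" '
    'coll="allreduce" nchunksperloop="1">'
)

attrs = parse_open_tag(header)
# Assumption from the issue text: "1" means the MSCCL algorithm is used for
# that buffer mode, "0" means the run falls back to stock NCCL.
print("out-of-place uses MSCCL:", attrs["outofplace"] == "1")
print("in-place uses MSCCL:", attrs["inplace"] == "1")
```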

The all_reduce_perf output shows that the MSCCL result is wrong (nonzero #wrong values in the out-of-place column).

mpi_out_azure_4-rank.0-stdout.txt

cp Allreduce.n8-DGX1-steps4.msccl.xml msccl/executor/msccl-executor-nccl/build/lib/msccl-algorithms

azure@msccl-azure:~$ mpirun -np 8 --output-filename mpi_out_azure_4 --merge-stderr-to-stdout -x LD_LIBRARY_PATH=/home/azure/msccl/executor/msccl-executor-nccl/build/lib/:$LD_LIBRARY_PATH -x NCCL_ALGO=MSCCL,RING,TREE  /home/azure/msccl/tests/msccl-tests-nccl/build/all_reduce_perf -b 128 -e 1GB -f 2 -g 1 -c 1 -n 5 -w 3
# nThread 1 nGpus 1 minBytes 128 maxBytes 1073741824 step: 2(factor) warmup iters: 3 iters: 5 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid  14179 on msccl-azure device  0 [0x1a] Tesla V100-SXM2-32GB
#  Rank  1 Group  0 Pid  14180 on msccl-azure device  1 [0x1b] Tesla V100-SXM2-32GB
#  Rank  2 Group  0 Pid  14181 on msccl-azure device  2 [0x3d] Tesla V100-SXM2-32GB
#  Rank  3 Group  0 Pid  14182 on msccl-azure device  3 [0x3e] Tesla V100-SXM2-32GB
#  Rank  4 Group  0 Pid  14183 on msccl-azure device  4 [0x88] Tesla V100-SXM2-32GB
#  Rank  5 Group  0 Pid  14184 on msccl-azure device  5 [0x89] Tesla V100-SXM2-32GB
#  Rank  6 Group  0 Pid  14185 on msccl-azure device  6 [0xb1] Tesla V100-SXM2-32GB
#  Rank  7 Group  0 Pid  14187 on msccl-azure device  7 [0xb2] Tesla V100-SXM2-32GB
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
         128            32     float     sum      -1    51.31    0.00    0.00    220    16.50    0.01    0.01      0
         256            64     float     sum      -1    54.38    0.00    0.01    426    14.83    0.02    0.03      0
         512           128     float     sum      -1    51.79    0.01    0.02    857    14.84    0.03    0.06      0
        1024           256     float     sum      -1    52.50    0.02    0.03   1699    15.37    0.07    0.12      0
        2048           512     float     sum      -1    52.21    0.04    0.07   3411    17.07    0.12    0.21      0
        4096          1024     float     sum      -1    52.19    0.08    0.14   6844    17.67    0.23    0.41      0
        8192          2048     float     sum      -1    53.48    0.15    0.27  13689    18.92    0.43    0.76      0
       16384          4096     float     sum      -1    58.86    0.28    0.49  27356    19.31    0.85    1.48      0
       32768          8192     float     sum      -1    74.37    0.44    0.77  54774    23.33    1.40    2.46      0
       65536         16384     float     sum      -1    100.8    0.65    1.14  109458    24.07    2.72    4.76      0
      131072         32768     float     sum      -1    137.4    0.95    1.67  219235    24.76    5.29    9.26      0
      262144         65536     float     sum      -1    218.9    1.20    2.10  438106    26.49    9.90   17.32      0
      524288        131072     float     sum      -1    371.7    1.41    2.47  876530    33.24   15.77   27.61      0
     1048576        262144     float     sum      -1    669.3    1.57    2.74  1.75311e+06    67.05   15.64   27.37      0
     2097152        524288     float     sum      -1   1269.0    1.65    2.89  3.50607e+06    90.05   23.29   40.75      0
     4194304       1048576     float     sum      -1   2509.7    1.67    2.92  7.01167e+06    133.4   31.44   55.02      0
     8388608       2097152     float     sum      -1   4960.7    1.69    2.96  1.40248e+07    223.4   37.54   65.70      0
    16777216       4194304     float     sum      -1   9824.3    1.71    2.99  2.8049e+07    314.5   53.34   93.35      0
    33554432       8388608     float     sum      -1    19505    1.72    3.01  5.60985e+07    560.9   59.83  104.69      0
    67108864      16777216     float     sum      -1    38952    1.72    3.01  1.12195e+08   1007.5   66.61  116.56      0
   134217728      33554432     float     sum      -1    77785    1.73    3.02  2.24396e+08   1910.7   70.24  122.93      0
   268435456      67108864     float     sum      -1   155343    1.73    3.02  4.48791e+08   3773.8   71.13  124.48      0
   536870912     134217728     float     sum      -1   310660    1.73    3.02  8.97575e+08   7430.9   72.25  126.43      0
  1073741824     268435456     float     sum      -1   620955    1.73    3.03  1.79515e+09    14741   72.84  127.47      0
# Out of bounds values : 168 FAILED
# Avg bus bandwidth    : 23.1468
#

--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[61481,1],4]
  Exit code:    1
--------------------------------------------------------------------------
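The failing sizes can be pulled out of a saved log by reading the two #wrong columns of each data row. This is a minimal sketch that assumes the 13-column layout printed above (it may differ in other nccl-tests versions); the function name `wrong_counts` is hypothetical.

```python
def wrong_counts(output):
    """Return (size_bytes, out_of_place_wrong, in_place_wrong) per data row."""
    rows = []
    for line in output.splitlines():
        fields = line.split()
        # Skip comment/header lines and anything that is not a 13-column row.
        if line.lstrip().startswith("#") or len(fields) != 13:
            continue
        size = int(fields[0])
        # Large counts are printed in scientific notation (e.g. 1.75311e+06).
        out_of_place = int(float(fields[8]))
        in_place = int(float(fields[12]))
        rows.append((size, out_of_place, in_place))
    return rows


# Two rows copied from the C8-S4-R4 run above.
sample = """\
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
         128            32     float     sum      -1    51.31    0.00    0.00    220    16.50    0.01    0.01      0
     1048576        262144     float     sum      -1    669.3    1.57    2.74  1.75311e+06    67.05   15.64   27.37      0
"""

for size, oop, ip in wrong_counts(sample):
    if oop or ip:
        print(f"{size} B: out-of-place wrong={oop}, in-place wrong={ip}")
```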

debug output (added the -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,ENV options; see attachment mpi_out_azure_4_debug-rank.0-stdout.txt)

mpirun -np 8 --output-filename mpi_out_azure_4_debug --merge-stderr-to-stdout -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,ENV -x LD_LIBRARY_PATH=/home/azure/msccl/executor/msccl-executor-nccl/build/lib/:$LD_LIBRARY_PATH -x NCCL_ALGO=MSCCL,RING,TREE  /home/azure/msccl/tests/msccl-tests-nccl/build/all_reduce_perf -b 128 -e 1GB -f 2 -g 1 -c 1 -n 5 -w 3

bench C48-S14-R14: the MSCCL allreduce result is correct (#wrong in the out-of-place column is zero for every size)

mpi_out_azure_14-rank.0-stdout.txt

rm msccl/executor/msccl-executor-nccl/build/lib/msccl-algorithms/*
cp Allreduce.n8-DGX1-steps14.chunks6.msccl.xml msccl/executor/msccl-executor-nccl/build/lib/msccl-algorithms

azure@msccl-azure:~$ mpirun -np 8 --output-filename mpi_out_azure_14 --merge-stderr-to-stdout -x LD_LIBRARY_PATH=/home/azure/msccl/executor/msccl-executor-nccl/build/lib/:$LD_LIBRARY_PATH -x NCCL_ALGO=MSCCL,RING,TREE  /home/azure/msccl/tests/msccl-tests-nccl/build/all_reduce_perf -b 128 -e 1GB -f 2 -g 1 -c 1 -n 5 -w 3
# nThread 1 nGpus 1 minBytes 128 maxBytes 1073741824 step: 2(factor) warmup iters: 3 iters: 5 agg iters: 1 validation: 1 graph: 0
#
# Using devices
#  Rank  0 Group  0 Pid    123 on msccl-azure device  0 [0x1a] Tesla V100-SXM2-32GB
#  Rank  1 Group  0 Pid    124 on msccl-azure device  1 [0x1b] Tesla V100-SXM2-32GB
#  Rank  2 Group  0 Pid    125 on msccl-azure device  2 [0x3d] Tesla V100-SXM2-32GB
#  Rank  3 Group  0 Pid    126 on msccl-azure device  3 [0x3e] Tesla V100-SXM2-32GB
#  Rank  4 Group  0 Pid    127 on msccl-azure device  4 [0x88] Tesla V100-SXM2-32GB
#  Rank  5 Group  0 Pid    128 on msccl-azure device  5 [0x89] Tesla V100-SXM2-32GB
#  Rank  6 Group  0 Pid    129 on msccl-azure device  6 [0xb1] Tesla V100-SXM2-32GB
#  Rank  7 Group  0 Pid    130 on msccl-azure device  7 [0xb2] Tesla V100-SXM2-32GB
#
#                                                              out-of-place                       in-place
#       size         count      type   redop    root     time   algbw   busbw #wrong     time   algbw   busbw #wrong
#        (B)    (elements)                               (us)  (GB/s)  (GB/s)            (us)  (GB/s)  (GB/s)
         128            32     float     sum      -1    16.74    0.01    0.01      0    16.27    0.01    0.01      0
         256            64     float     sum      -1    17.04    0.02    0.03      0    16.34    0.02    0.03      0
         512           128     float     sum      -1    16.37    0.03    0.05      0    16.26    0.03    0.06      0
        1024           256     float     sum      -1    17.08    0.06    0.10      0    16.72    0.06    0.11      0
        2048           512     float     sum      -1    18.17    0.11    0.20      0    17.57    0.12    0.20      0
        4096          1024     float     sum      -1    19.77    0.21    0.36      0    19.11    0.21    0.38      0
        8192          2048     float     sum      -1    21.27    0.39    0.67      0    20.42    0.40    0.70      0
       16384          4096     float     sum      -1    23.28    0.70    1.23      0    22.16    0.74    1.29      0
       32768          8192     float     sum      -1    27.61    1.19    2.08      0    27.01    1.21    2.12      0
       65536         16384     float     sum      -1    28.70    2.28    4.00      0    27.54    2.38    4.16      0
      131072         32768     float     sum      -1    29.07    4.51    7.89      0    27.77    4.72    8.26      0
      262144         65536     float     sum      -1    29.84    8.79   15.37      0    28.77    9.11   15.95      0
      524288        131072     float     sum      -1    36.89   14.21   24.87      0    36.17   14.49   25.36      0
     1048576        262144     float     sum      -1    78.43   13.37   23.40      0    77.08   13.60   23.81      0
     2097152        524288     float     sum      -1    102.4   20.47   35.83      0    101.8   20.60   36.06      0
     4194304       1048576     float     sum      -1    153.4   27.35   47.86      0    137.9   30.41   53.22      0
     8388608       2097152     float     sum      -1    232.8   36.03   63.05      0    229.6   36.53   63.93      0
    16777216       4194304     float     sum      -1    318.5   52.67   92.18      0    318.8   52.63   92.10      0
    33554432       8388608     float     sum      -1    563.6   59.54  104.19      0    567.8   59.09  103.42      0
    67108864      16777216     float     sum      -1   1020.4   65.77  115.09      0   1008.9   66.51  116.40      0
   134217728      33554432     float     sum      -1   1904.4   70.48  123.34      0   1909.5   70.29  123.01      0
   268435456      67108864     float     sum      -1   3770.3   71.20  124.60      0   3763.7   71.32  124.81      0
   536870912     134217728     float     sum      -1   7428.9   72.27  126.47      0   7408.3   72.47  126.82      0
  1073741824     268435456     float     sum      -1    14755   72.77  127.35      0    14750   72.80  127.40      0
# Out of bounds values : 0 OK
# Avg bus bandwidth    : 43.5381
#
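For reading the busbw column: nccl-tests derives allreduce bus bandwidth from algorithm bandwidth by the factor 2(n-1)/n, which is 1.75 for 8 ranks. The sketch below checks that relation against values copied from the tables in this issue. The function name is hypothetical, and the tolerance is there because the printed algbw is already rounded to two decimals.

```python
def allreduce_busbw(algbw_gbps, ranks):
    """Bus bandwidth for allreduce: algbw scaled by 2 * (n - 1) / n."""
    return algbw_gbps * 2 * (ranks - 1) / ranks


# (algbw, reported busbw) pairs taken from the 1 GiB rows of both runs.
samples = [
    (72.77, 127.35),  # C48-S14-R14, out-of-place
    (72.80, 127.40),  # C48-S14-R14, in-place
    (1.73, 3.03),     # C8-S4-R4, out-of-place
]

for algbw, reported in samples:
    computed = allreduce_busbw(algbw, 8)
    print(f"algbw={algbw:6.2f}  computed busbw={computed:7.2f}  reported={reported:7.2f}")
```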

debug output (see attachment mpi_out_azure_14_debug-rank.0-stdout.txt)

mpirun -np 8 --output-filename mpi_out_azure_14_debug --merge-stderr-to-stdout -x NCCL_DEBUG=INFO -x NCCL_DEBUG_SUBSYS=INIT,ENV -x LD_LIBRARY_PATH=/home/azure/msccl/executor/msccl-executor-nccl/build/lib/:$LD_LIBRARY_PATH -x NCCL_ALGO=MSCCL,RING,TREE  /home/azure/msccl/tests/msccl-tests-nccl/build/all_reduce_perf -b 128 -e 1GB -f 2 -g 1 -c 1 -n 5 -w 3
