Skip to content

GB200 support: SendRecv DSL collective and per-channel executor connections#810

Open
Binyang2014 wants to merge 150 commits into
mainfrom
binyli/GB200
Open

GB200 support: SendRecv DSL collective and per-channel executor connections#810
Binyang2014 wants to merge 150 commits into
mainfrom
binyli/GB200

Conversation

@Binyang2014
Copy link
Copy Markdown
Contributor

@Binyang2014 Binyang2014 commented May 20, 2026

Summary

GB200 support work: introduces point-to-point send/receive in the MSCCL++ DSL
and extends the executor for split-NVL-domain topologies where some ranks are
NVL-connected within a node and other ranks must communicate across the network.

DSL

  • New SendRecv collective with separate input/output buffers
    (python/mscclpp/language/collectives.py).
  • New multi-node sendrecv DSL example
    (python/mscclpp/language/tests/multi_node/send_recv.py) with --split_mask
    (group size − 1) and --instances CLI options. Documents the channel-ordering
    trick that keeps signal tags cross-matched between paired peers when
    prev == next.
  • BaseBuffer.__getitem__ now accepts slices with None start/stop
    (e.g., buf[:]).

Executor

  • One connection (unique QP) per channel entry instead of one per peer.
    Required for HostNoAtomic IB mode where each QP can forward signals to a
    single semaphore. Uses per-peer tag counters so paired ranks agree on tag
    ordering regardless of the order peers appear in each rank's connected_to
    list.
  • MEMORY channels now unconditionally use Transport::CudaIpc; only PORT
    channels can use IB. Matches the invariant already enforced by
    getTransportFlags.
  • ExecutionContext::connections is now a vector<Connection> indexed by
    channel order (was unordered_map<int, Connection> keyed by peer). Removes
    redundant semaphore fields from ExecutionContext.
  • TODO: explicit NVL-domain check in useIB

Copilot AI and others added 30 commits February 11, 2026 00:12
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
- Modified test/mp_unit/mp_unit_tests.hpp to use ../framework.hpp instead of gtest/gtest.h
- Enhanced test/framework.hpp with GTest-compatible APIs:
  - Added Environment base class for global test setup/teardown
  - Added TestInfo and UnitTest classes for test metadata access
  - Added GTEST_SKIP macro support via SkipHelper class
  - Added namespace alias 'testing' for compatibility
  - Added InitGoogleTest and AddGlobalTestEnvironment helper functions
- Updated test/framework.cc with implementations for new classes
- All mp_unit test files now use framework.hpp through mp_unit_tests.hpp
- Formatting applied via lint.sh

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
- Remove duplicate static getMPIRank() and getMPISize() functions
- Add full namespace qualification to GTEST_SKIP macro

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
- Move test framework from test/perf/ to test/ for shared use
- Add GTest-compatible macros (TEST, TEST_F, EXPECT_*, ASSERT_*, etc.)
- Remove GTest dependency from CMakeLists.txt
- Add test_framework library for unit and mp_unit tests
- Add code coverage support with lcov (MSCCLPP_ENABLE_COVERAGE option)
- Update perf tests to use shared framework

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
Support --gtest_filter command line argument for test filtering,
compatible with Azure pipeline configurations.

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
- Move PerfTestResult struct definition outside vector declaration
- Move getCurrentTimestamp to anonymous namespace
- Add documentation for GTEST_SKIP macro explaining RAII pattern

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
The recent removal of GTest and introduction of custom test framework
requires MPI dependency which is not needed for CodeQL analysis.
Disable test building in CodeQL workflows to fix the build failures.

CodeQL only needs to analyze the core library code, not the tests.

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
- Make MPI REQUIRED for test builds (clearer error messages)
- Add project include directories to test_framework library
- Fix core_tests.cc to use custom framework correctly
- Fix mp_unit_tests.hpp to use mscclpp::test namespace
- Add FAIL() macro with streaming support for test messages
- Building tests now works in Docker environment with GPU bypass

Tests can now be built using:
docker run --rm -v $(pwd):/workspace -w /workspace \
  ghcr.io/microsoft/mscclpp/mscclpp:base-dev-cuda12.4 bash -c \
  "mkdir build && cd build && cmake -DMSCCLPP_BYPASS_GPU_CHECK=ON \
   -DMSCCLPP_USE_CUDA=ON .. && make -j"

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
- Move helper classes inside namespace for proper access
- Remove duplicate class definitions outside namespace
- Tests can now build in Docker with CUDA toolkit installed
- Remaining issues: ErrorCode and TransportFlags need operator<< for EXPECT_EQ

Successfully building with Docker:
docker run --rm -v $(pwd):/workspace -w /workspace \
  ghcr.io/microsoft/mscclpp/mscclpp:base-dev-cuda12.4 bash -c \
  "mkdir build && cd build && cmake -DMSCCLPP_BYPASS_GPU_CHECK=ON \
   -DMSCCLPP_USE_CUDA=ON .. && make -j4"

Note: Some unit tests (errors_tests.cc, core_tests.cc) need operator<<
defined for ErrorCode and TransportFlags to compile with custom framework.

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
- Remove build_test/ directory containing CMake cache and build files
- Update .gitignore to exclude build_*/ pattern to prevent future accidents

These CMake artifacts (CMakeCache.txt, CMakeFiles/, generated headers)
were accidentally committed and should never be in version control.

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
- Add nlohmann::ordered_json metrics field to TestResult struct
- Add nlohmann/json.hpp include to test/framework.hpp
- Link test_framework with nlohmann_json::nlohmann_json
- Replace PerfTestResult with TestResult in test/perf/framework.cc
- Move perf utility functions to utils namespace for consistency
- Remove duplicate PerfTestResult struct definition

This consolidates the two similar structs into one, reducing code
duplication while maintaining all necessary fields for both unit
tests (passed/failure_message) and performance tests (metrics).

Verified build succeeds with Docker:
docker run --rm -v $(pwd):/workspace -w /workspace \
  ghcr.io/microsoft/mscclpp/mscclpp:base-dev-cuda12.4 bash -c \
  "cd /workspace/build && make -j4 fifo_test"

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
- Add isPerfTest field to TestInfoInternal struct
- Add --exclude-perf-tests command line argument
- Add PERF_TEST and PERF_TEST_F macros for marking performance tests
- Update runAllTests to filter performance tests when requested
- Remove genhtml dependency and HTML report generation
- Keep only coverage.info file generation with lcov

Performance tests can now be excluded with:
  ./build/bin/unit_tests --exclude-perf-tests
  ./build/bin/mp_unit_tests --exclude-perf-tests

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
- Add unit_tests_main.cc with main() function for unit_tests executable
- Create fifo_perf_tests.cu as PERF_TEST for unit_tests
- Add fifo_perf_tests.cu to unit_tests sources
- Fix errors_tests.cc to use ASSERT_TRUE for ErrorCode comparisons
- Fix core_tests.cc to use ASSERT_TRUE for TransportFlags comparisons
- Add Azure pipeline step for Debug build with coverage
- Add step to run mp_unit_tests --exclude-perf-tests with coverage

The perf tests are now part of unit_tests and can be filtered out
for coverage reporting. CI now includes Debug build with coverage
collection for non-performance tests.

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
Another set of accidentally committed build artifacts in build2/ directory.
The .gitignore pattern build_*/ should prevent these in the future.

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
- Remove test/perf/ directory (fifo_test.cu, framework.{cc,hpp}, CMakeLists.txt)
- Remove add_subdirectory(perf) from test/CMakeLists.txt
- Performance tests now integrated into unit_tests as fifo_perf_tests.cu
- Fix mp_unit_tests.cc to use framework functions without ::testing:: namespace
- Fix bootstrap_tests.cc ErrorCode comparison to use ASSERT_TRUE
- Fix switch_channel_tests.cu to not use streaming with ASSERT_EQ
- Add missing #include <unistd.h> to executor_tests.cc

All perf test functionality is now in unit_tests and can be filtered
with --exclude-perf-tests flag. The standalone test/perf/ directory
is no longer needed.

Verified builds:
- unit_tests: ✅
- mp_unit_tests: ✅

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
1. Add missing includes to fifo_perf_tests.cu
   - Add #include <cassert>
   - Add #include <unordered_map>

2. Fix license header capitalization (4 files)
   - "license." → "License." in test/framework.{hpp,cc}
   - "license." → "License." in test/unit/{unit_tests_main.cc,fifo_perf_tests.cu}

3. Fix double MPI_Init issue
   - Check MPI_Initialized() before calling MPI_Init
   - Prevents double initialization when mp_unit_tests already inits MPI

4. Fix coverage flags for CUDA compilation
   - Use generator expressions to apply --coverage only to C++ language
   - Prevents breaking CUDA compilation with host-only flags

5. Fix environment memory leak
   - Delete environment objects after TearDown()
   - Clear environments_ vector

6. Implement proper GTEST_SKIP handling
   - Create SkipException class
   - Handle skipped tests separately from failures
   - Report skipped test count

7. Implement GTest-style filter pattern matching
   - Support wildcards (* and ?)
   - Support negative patterns (-Pattern)
   - Support colon-separated patterns (Foo:Bar)
   - Compatible with existing CI usage like --gtest_filter=-*Ib*

Verified builds successfully with Docker.

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
- Remove complex wildcard pattern matching (*, ?, negative patterns)
- Use simple substring matching with find()
- Simpler implementation, easier to understand and maintain
- Still supports --gtest_filter for basic test name filtering

Note: For advanced filtering like wildcards, users can use multiple
test runs with different substring filters.

Co-authored-by: chhwang <8018170+chhwang@users.noreply.github.com>
mahdiehghazim and others added 18 commits April 7, 2026 01:40
- Add TEST_DATA_SEND_RECV verifier kernel that replays fill_data PRNG
  with peer_rank seed to validate received data
- Add double-buffer support for sendrecv in executor_test.py:
  allocate 2 input/result/test buffers, alternate per iteration
- Create two executor funcs for sendrecv, one per buffer pair
- Update bench_correctness and bench_time to handle double-buffer
- Add bandwidth reporting to output

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@Binyang2014 Binyang2014 changed the title Binyli/gb200 GB200 support: SendRecv DSL collective and per-channel executor connections May 22, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds/extends support for a multi-node “sendrecv” communication pattern and adjusts executor connection construction to create distinct connections per channel entry, plus updates the Python executor test harness and verifier kernels to support split-group verification and double-buffering.

Changes:

  • Build one Connection (QP) per channel entry with deterministic per-peer tag assignment in the executor.
  • Extend Python executor tests/verifier kernels to support sendrecv, optional split_mask grouping, and double-buffer execution.
  • Add a new DSL SendRecv collective and a corresponding multi-node test program.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
src/core/include/execution_kernel.hpp Fixes putWithSignalAndFlush argument ordering/selection in device-side put handling.
src/core/executor/executor.cc Switches to per-channel connections (vector-based) and uses per-peer tag counters for stable matching.
python/test/executor_test.py Adds sendrecv/double-buffer and split-mask support to benchmarking and correctness harness.
python/test/executor_test_verifier.cu Updates fill/test kernels to accept split_mask and adds sendrecv verification kernels.
python/mscclpp/language/tests/multi_node/send_recv.py New multi-node DSL test program for sendrecv with split grouping.
python/mscclpp/language/rank.py Makes buffer slicing more robust by handling slice(None, None) and non-slice indexing.
python/mscclpp/language/collectives.py Introduces SendRecv collective and buffer initialization for it.

Comment thread python/test/executor_test.py Outdated
Comment thread python/test/executor_test.py
Comment thread python/test/executor_test.py Outdated
Comment thread python/mscclpp/language/rank.py Outdated
Comment thread python/test/executor_test_verifier.cu
Comment thread python/test/executor_test_verifier.cu
Comment thread python/mscclpp/language/tests/multi_node/send_recv.py Outdated
@Binyang2014 Binyang2014 marked this pull request as ready for review May 22, 2026 18:12
Binyang2014 and others added 2 commits May 22, 2026 18:27
build_bufs now always returns parallel lists of buffers (length 1 for
normal collectives, length 2 for sendrecv double-buffering), so
bench_time, bench_correctness, and main() no longer branch on
sendrecv_mode or double_buf. Iteration i uses funcs[i % len(funcs)]
uniformly.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants