Add collective benchmark and correctness check #814

Draft
Binyang2014 wants to merge 16 commits into main from binyli/benchmark

Conversation

Binyang2014 (Contributor) commented May 28, 2026

  • Add a unit test for the float8_e4m3b15 data type.
  • Add a tuner and benchmark for the allreduce/allgather algorithms to verify their correctness and performance.

Copilot AI (Contributor) left a comment

Pull request overview

This PR introduces a Python-based collective benchmark + offline tuning workflow (with correctness checks), adds a GPU FP8 conversion unit test, and tightens/adjusts a couple of collective/runtime integration points (CI/docs/deps and kernel argument validation).

Changes:

  • Added python/mscclpp_benchmark utilities for benchmarking, offline tuning, GPU runtime abstraction (CUDA/HIP), and correctness checking (including FP8 handling).
  • Added a new CUDA unit test covering fp8_e4m3b15 encode/decode conversions and wired it into unit_tests.
  • Updated dependency extras/docs/CI to support the benchmark flow (CUDA Python bindings / hip-python) and added an allreduce packet launch-parameter check.
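For readers unfamiliar with the format, the kind of encode/decode round trip such a test exercises can be sketched in plain Python. This is an illustration only: it ASSUMES fp8_e4m3b15 is laid out as 1 sign bit, 4 exponent bits, and 3 mantissa bits with exponent bias 15 and no Inf/NaN codes, which this PR page does not spell out. The helper names `decode_e4m3b15` and `encode_e4m3b15` are hypothetical and are not the functions in the PR. Under that assumption the byte can be widened into an IEEE half-precision pattern by shifting, because float16 uses the same bias of 15.

```python
import struct


def decode_e4m3b15(byte: int) -> float:
    """Decode one byte, ASSUMING 1 sign / 4 exponent / 3 mantissa bits, bias 15, no Inf/NaN."""
    # Move the sign to bit 15 and the 7 magnitude bits to bits 13..7 of a float16 pattern.
    half_bits = ((byte & 0x80) << 8) | ((byte & 0x7F) << 7)
    return struct.unpack("<e", struct.pack("<H", half_bits))[0]


# Lookup table of every (byte, value) pair under the assumed layout.
TABLE = [(b, decode_e4m3b15(b)) for b in range(256)]


def encode_e4m3b15(value: float) -> int:
    """Nearest-value encode by brute-force search; ties keep the lower byte code."""
    return min(TABLE, key=lambda entry: abs(entry[1] - value))[0]


# Round-trip every byte except 0x80 (negative zero, which ties with 0x00).
mismatches = [b for b in range(256) if b != 0x80 and encode_e4m3b15(decode_e4m3b15(b)) != b]
```

The brute-force search is only practical because the table has 256 entries; a real test would compare against the GPU conversion routines instead.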

Reviewed changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| `test/unit/gpu_data_types_tests.cu` | New unit test validating fp8_e4m3b15 conversion behavior. |
| `test/unit/CMakeLists.txt` | Adds the new GPU data type test to the unit test target. |
| `src/ext/collectives/allreduce/allreduce_packet.cu` | Adds validation that nBlocks is at least the local peer count. |
| `src/ext/collectives/allgather/allgather_fullmesh_2.cu` | Refactors worker participation/sync in the kernel (noted a correctness bug in review). |
| `python/requirements_rocm6.txt` | Fixes formatting and adds hip-python>=6,<7. |
| `python/requirements_cuda11.txt` | Adds CUDA Python bindings requirement for CUDA 11. |
| `python/requirements_cuda12.txt` | Adds CUDA Python bindings requirement for CUDA 12. |
| `python/requirements_cuda13.txt` | Adds CUDA Python bindings requirement for CUDA 13. |
| `python/mscclpp_benchmark/tuning_config.py` | New tuned-config parsing/selection/serialization utilities (noted a serialization bug in review). |
| `python/mscclpp_benchmark/tuner.py` | New offline tuner CLI driver to generate tuned configs. |
| `python/mscclpp_benchmark/gpu.py` | New CUDA/HIP runtime shim for graph capture/launch (cuda-bindings / hip-python). |
| `python/mscclpp_benchmark/correctness.py` | New correctness harness (including FP8 encoding/decoding and tolerances). |
| `python/mscclpp_benchmark/comm.py` | New raw-buffer Comm wrapper and default config resolution. |
| `python/mscclpp_benchmark/bench_collective.py` | New collective benchmark runner with optional autotuning + correctness gating. |
| `python/mscclpp_benchmark/__init__.py` | Switches to lazy attribute exports via `__getattr__`. |
| `pyproject.toml` | Updates platform extras to include cuda-bindings / hip-python and includes mscclpp_benchmark in wheel packages. |
| `include/mscclpp/gpu_data_types.hpp` | Adjusts HIP gfx942 f32→f16 packing path used by fp8_e4m3b15 conversion routing. |
| `docs/quickstart.md` | Documents new benchmark/tuning usage and updated extras behavior. |
| `.azure-pipelines/templates/rccl-test.yml` | Adds a Python benchmark step (noted robustness issue in review). |
| `.azure-pipelines/templates/nccl-test.yml` | Adds a Python benchmark step (noted robustness issue in review). |
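
The "lazy attribute exports via `__getattr__`" entry refers to the module-level `__getattr__` hook from PEP 562. A minimal sketch of that pattern follows. It is not the PR's `__init__.py`: the module name `demo_pkg`, the `_LAZY_EXPORTS` mapping, and the standard-library targets are hypothetical stand-ins chosen so the example runs anywhere.

```python
import importlib
import types

# Hypothetical export map: public name -> (module to import, attribute to fetch).
_LAZY_EXPORTS = {
    "sqrt": ("math", "sqrt"),
    "dumps": ("json", "dumps"),
}

# Stand-in for a package module; a real package would define __getattr__ in __init__.py.
demo_pkg = types.ModuleType("demo_pkg")


def _lazy_getattr(name: str):
    """Resolve an export on first access, then cache it on the module."""
    try:
        module_name, attr = _LAZY_EXPORTS[name]
    except KeyError:
        raise AttributeError(f"module 'demo_pkg' has no attribute {name!r}") from None
    value = getattr(importlib.import_module(module_name), attr)
    setattr(demo_pkg, name, value)  # later lookups bypass __getattr__
    return value


# PEP 562: a callable named __getattr__ in the module namespace handles missing attributes.
demo_pkg.__getattr__ = _lazy_getattr

root = demo_pkg.sqrt(9.0)  # triggers the import of math on first use
```

The usual motivation is that importing the package stays cheap and does not fail when an optional dependency behind one export is missing.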

Comment thread src/ext/collectives/allgather/allgather_fullmesh_2.cu
Comment on lines +89 to +108
    def write_path(self, path: str | Path) -> None:
        profiles_payload: list[dict[str, Any]] = []
        for profile, configs_by_collective in sorted(
            ((profile, configs) for profile, configs in self._profiles.items() if profile is not None),
            key=lambda item: (item[0].sku is None, item[0].sku or "", item[0].scale is None, item[0].scale or 0),
        ):
            collectives: dict[str, list[dict[str, Any]]] = {}
            for collective, configs in sorted(configs_by_collective.items()):
                collectives[collective] = [_config_entry_payload(item) for item in sorted(configs)]
            profile_payload: dict[str, Any] = {}
            if profile.sku is not None:
                profile_payload["sku"] = profile.sku
            if profile.scale is not None:
                profile_payload["scale"] = profile.scale
            profile_payload["collectives"] = collectives
            profiles_payload.append(profile_payload)

        with Path(path).open("w", encoding="utf-8") as handle:
            handle.write(_format_tuned_config_json({"version": 1, "profiles": profiles_payload}))
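
The sort key in the snippet above pairs each optional field with an `is None` flag so that profiles with a missing `sku` or `scale` sort after concrete ones, and so that Python never compares `None` with a `str` or `int` (which raises `TypeError`). A small sketch of that ordering follows. The `Profile` dataclass and the SKU strings are hypothetical stand-ins, not the PR's types.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Profile:
    """Hypothetical stand-in for the PR's profile key."""
    sku: Optional[str]
    scale: Optional[int]


def profile_sort_key(profile: Profile):
    # Same shape as the lambda in write_path: the None flags sort False before True,
    # and the `or` fallbacks keep every tuple element comparable.
    return (profile.sku is None, profile.sku or "", profile.scale is None, profile.scale or 0)


profiles = [
    Profile(None, None),
    Profile("sku-b", 16),
    Profile("sku-a", None),
    Profile("sku-a", 8),
    Profile("sku-b", 8),
]

ordered = sorted(profiles, key=profile_sort_key)
# ordered: sku-a/8, sku-a/None, sku-b/8, sku-b/16, None/None
```

Note that this only describes the ordering. The generator in `write_path` also filters out any key that is itself `None`, which is a separate behavior.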

Comment thread python/mscclpp_benchmark/correctness.py Outdated
Comment on lines +265 to +276
_FP8_TABLES: dict[str, list[tuple[int, float]]] = {}
_FP8_SPACING_CACHE: dict[tuple[str, float], float] = {}


def _encode_fp8_values(fp8_format: str, values):
    values = values.astype(cp.float32)
    if fp8_format == "e4m3b15":
        return _encode_e4m3b15_values(values)

    table = _FP8_TABLES.setdefault(fp8_format, _build_fp8_table(fp8_format))
    table_bytes = cp.asarray([byte for byte, _ in table], dtype=cp.uint8)
    table_values = cp.asarray([value for _, value in table], dtype=cp.float32)

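
One general Python detail is relevant to the `setdefault` call in the snippet above: the default argument is evaluated before `setdefault` runs, so the builder is called on every invocation even when the key is already cached. Whether that is the concern raised in this thread is not shown on the page. The sketch below only demonstrates the language behavior, using a hypothetical `build_table` stand-in and a made-up format name.

```python
calls = {"n": 0}


def build_table(fmt: str):
    """Hypothetical stand-in for an expensive table builder; counts its invocations."""
    calls["n"] += 1
    return [(0, 0.0)]


# Eager: the argument expression runs each time, even on cache hits.
cache = {}
for _ in range(3):
    cache.setdefault("fmt_a", build_table("fmt_a"))
eager_calls = calls["n"]

# Lazy: only build when the key is actually missing.
calls["n"] = 0
cache = {}


def get_table(fmt: str):
    table = cache.get(fmt)
    if table is None:
        table = cache[fmt] = build_table(fmt)
    return table


for _ in range(3):
    get_table("fmt_a")
lazy_calls = calls["n"]
```

After running this, `eager_calls` is 3 and `lazy_calls` is 1. `functools.lru_cache` on the builder is another common way to get the lazy behavior.
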
Comment thread .azure-pipelines/templates/nccl-test.yml Outdated
Comment thread .azure-pipelines/templates/rccl-test.yml