Add collective benchmark and correctness check #814
Draft
Binyang2014 wants to merge 16 commits into
Pull request overview
This PR introduces a Python-based collective benchmark + offline tuning workflow (with correctness checks), adds a GPU FP8 conversion unit test, and tightens/adjusts a couple of collective/runtime integration points (CI/docs/deps and kernel argument validation).
Changes:
- Added `python/mscclpp_benchmark` utilities for benchmarking, offline tuning, GPU runtime abstraction (CUDA/HIP), and correctness checking (including FP8 handling).
- Added a new CUDA unit test covering `fp8_e4m3b15` encode/decode conversions and wired it into `unit_tests`.
- Updated dependency extras/docs/CI to support the benchmark flow (CUDA Python bindings / hip-python) and added an allreduce packet launch-parameter check.
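To make the `fp8_e4m3b15` encode/decode item concrete, here is a minimal pure-Python sketch. It assumes the conventional reading of the format name (1 sign bit, 4 exponent bits, 3 mantissa bits, exponent bias 15) and that no codes are reserved for Inf/NaN; the PR's CUDA conversion and its unit test may treat special values, rounding, or saturation differently. `decode_e4m3b15` and `encode_e4m3b15` are hypothetical names, not functions from the PR.

```python
BIAS = 15  # assumed exponent bias implied by "b15"


def decode_e4m3b15(byte: int) -> float:
    """Decode one byte assuming 1 sign / 4 exponent / 3 mantissa bits, bias 15,
    and no reserved Inf/NaN encodings (an assumption, not the PR's spec)."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("expected a single byte")
    sign = -1.0 if byte & 0x80 else 1.0
    exponent = (byte >> 3) & 0xF
    mantissa = byte & 0x7
    if exponent == 0:
        # Subnormal: no implicit leading 1, exponent fixed at 1 - BIAS.
        return sign * (mantissa / 8.0) * 2.0 ** (1 - BIAS)
    return sign * (1.0 + mantissa / 8.0) * 2.0 ** (exponent - BIAS)


def encode_e4m3b15(value: float) -> int:
    """Round to the nearest representable value by brute force over all 256 codes.
    Out-of-range inputs clamp to the closest code (a simplification)."""
    return min(range(256), key=lambda code: abs(decode_e4m3b15(code) - value))


print(decode_e4m3b15(0x78))  # 1.0
print(decode_e4m3b15(0x7F))  # 1.875
print(hex(encode_e4m3b15(1.9)))  # 0x7f
```

Under these assumptions the largest finite magnitude is 1.875, which is why a bias-15 layout suits small-magnitude data such as normalized activations.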
Reviewed changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `test/unit/gpu_data_types_tests.cu` | New unit test validating `fp8_e4m3b15` conversion behavior. |
| `test/unit/CMakeLists.txt` | Adds the new GPU data type test to the unit test target. |
| `src/ext/collectives/allreduce/allreduce_packet.cu` | Adds validation that `nBlocks` is at least the local peer count. |
| `src/ext/collectives/allgather/allgather_fullmesh_2.cu` | Refactors worker participation/sync in the kernel (noted a correctness bug in review). |
| `python/requirements_rocm6.txt` | Fixes formatting and adds `hip-python>=6,<7`. |
| `python/requirements_cuda11.txt` | Adds CUDA Python bindings requirement for CUDA 11. |
| `python/requirements_cuda12.txt` | Adds CUDA Python bindings requirement for CUDA 12. |
| `python/requirements_cuda13.txt` | Adds CUDA Python bindings requirement for CUDA 13. |
| `python/mscclpp_benchmark/tuning_config.py` | New tuned-config parsing/selection/serialization utilities (noted a serialization bug in review). |
| `python/mscclpp_benchmark/tuner.py` | New offline tuner CLI driver to generate tuned configs. |
| `python/mscclpp_benchmark/gpu.py` | New CUDA/HIP runtime shim for graph capture/launch (cuda-bindings / hip-python). |
| `python/mscclpp_benchmark/correctness.py` | New correctness harness (including FP8 encoding/decoding and tolerances). |
| `python/mscclpp_benchmark/comm.py` | New raw-buffer `Comm` wrapper and default config resolution. |
| `python/mscclpp_benchmark/bench_collective.py` | New collective benchmark runner with optional autotuning + correctness gating. |
| `python/mscclpp_benchmark/__init__.py` | Switches to lazy attribute exports via `__getattr__`. |
| `pyproject.toml` | Updates platform extras to include cuda-bindings / hip-python and includes `mscclpp_benchmark` in wheel packages. |
| `include/mscclpp/gpu_data_types.hpp` | Adjusts HIP gfx942 f32→f16 packing path used by `fp8_e4m3b15` conversion routing. |
| `docs/quickstart.md` | Documents new benchmark/tuning usage and updated extras behavior. |
| `.azure-pipelines/templates/rccl-test.yml` | Adds a Python benchmark step (noted robustness issue in review). |
| `.azure-pipelines/templates/nccl-test.yml` | Adds a Python benchmark step (noted robustness issue in review). |
Comment on lines +89 to +108
```python
def write_path(self, path: str | Path) -> None:
    profiles_payload: list[dict[str, Any]] = []
    # Profiles keyed by None are excluded; the rest are ordered so that profiles
    # with a sku (then a scale) come before profiles that leave them unset.
    for profile, configs_by_collective in sorted(
        ((profile, configs) for profile, configs in self._profiles.items() if profile is not None),
        key=lambda item: (item[0].sku is None, item[0].sku or "", item[0].scale is None, item[0].scale or 0),
    ):
        collectives: dict[str, list[dict[str, Any]]] = {}
        for collective, configs in sorted(configs_by_collective.items()):
            collectives[collective] = [_config_entry_payload(item) for item in sorted(configs)]
        # Only emit the sku/scale keys that are actually set on this profile.
        profile_payload: dict[str, Any] = {}
        if profile.sku is not None:
            profile_payload["sku"] = profile.sku
        if profile.scale is not None:
            profile_payload["scale"] = profile.scale
        profile_payload["collectives"] = collectives
        profiles_payload.append(profile_payload)

    with Path(path).open("w", encoding="utf-8") as handle:
        handle.write(_format_tuned_config_json({"version": 1, "profiles": profiles_payload}))
```
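The sort key in `write_path` is easy to misread, so here is a small standalone sketch of the ordering it produces. `Profile` is a hypothetical stand-in holding only the `sku` and `scale` fields the snippet touches, and the SKU names and scales are illustrative, not values from the PR. The `None` key mirrors the `profile is not None` filter: anything stored under it never reaches the serialized payload.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Profile:
    """Hypothetical stand-in for the PR's profile key (only sku and scale)."""
    sku: Optional[str]
    scale: Optional[int]


def profile_sort_key(profile: Profile) -> tuple[bool, str, bool, int]:
    # Same tuple as the key lambda in write_path. False sorts before True,
    # so profiles that set a field come before profiles that leave it None.
    return (profile.sku is None, profile.sku or "", profile.scale is None, profile.scale or 0)


# Illustrative SKU names and scales, not values taken from the PR.
profiles = {
    Profile("MI300X", 8): "mi300x-8",
    Profile(None, None): "generic",
    Profile("H100", None): "h100-any",
    Profile("H100", 16): "h100-16",
    None: "default",  # dropped by the same `is not None` filter as write_path
}

ordered = sorted((p for p in profiles if p is not None), key=profile_sort_key)
print([profiles[p] for p in ordered])
# ['h100-16', 'h100-any', 'mi300x-8', 'generic']
```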
|
|
Comment on lines +265 to +276
```python
_FP8_TABLES: dict[str, list[tuple[int, float]]] = {}
_FP8_SPACING_CACHE: dict[tuple[str, float], float] = {}


def _encode_fp8_values(fp8_format: str, values):
    values = values.astype(cp.float32)
    # e4m3b15 has a dedicated encoder; other formats go through a lookup table.
    if fp8_format == "e4m3b15":
        return _encode_e4m3b15_values(values)

    # Per-format table of (encoded byte, decoded float32 value) pairs.
    table = _FP8_TABLES.setdefault(fp8_format, _build_fp8_table(fp8_format))
    table_bytes = cp.asarray([byte for byte, _ in table], dtype=cp.uint8)
    table_values = cp.asarray([value for _, value in table], dtype=cp.float32)
```