Primus v26.3 #159
The Primus v26.3 release introduces the following new models and upgrades previously supported models:
- Qwen 3 30B BF16/FP8
- Qwen 3 235B BF16/FP8
- GPT OSS 20B BF16/FP8
- GPT OSS 120B BF16/FP8
Pull request overview
Updates MAD’s Primus integration to align with the Primus v26.3 release, adding new Megatron-LM training targets (Qwen3 30B/235B, GPT-OSS 20B/120B) and bumping Docker base images/documentation to the newer container stack.
Changes:
- Add new Primus Megatron-LM model repos + datatype support logic (BF16/FP8 where applicable), and extend benchmark parsing to recognize these models.
- Introduce a setup-time patch step to add training-log metrics summarization to Primus’ `primus-cli-direct.sh`.
- Bump training Docker base images to `rocm/primus:v26.3` and refresh benchmark README component versions.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| scripts/pytorch_train/run.sh | Ensure certain post-train model repos explicitly run with BF16. |
| scripts/primus/pytorch_train/primus_pytorch_benchmark_report.sh | Adjust device→config mapping logic for pretrain benchmarks. |
| scripts/primus/megatron-lm/run.sh | Add new model repo selectors + datatype support matrix updates. |
| scripts/primus/megatron-lm/primus_megatron-lm_benchmark_setup.sh | Apply an inline patch to Primus to parse/append training metrics summaries. |
| scripts/primus/megatron-lm/primus_megatron-lm_benchmark_report.sh | Add new models and refine benchmark execution behaviors. |
| scripts/primus/megatron-lm/primus_megatron-lm_benchmark_report.py | Extend log parsing eligibility list for new models. |
| models.json | Register new model repos for MAD runs (including skip-arch for GPT-OSS-120B). |
| docker/pytorch_train.ubuntu.amd.Dockerfile | Bump base image to rocm/primus:v26.3. |
| docker/primus_pytorch_train.ubuntu.amd.Dockerfile | Bump base image to rocm/primus:v26.3. |
| docker/primus_megatron_train.ubuntu.amd.Dockerfile | Bump base image to rocm/primus:v26.3. |
| benchmark/pytorch_train/README.md | Update component versions listed for the training container. |
| benchmark/megatron_lm/README.md | Update component versions and document some new supported models + examples. |
added examples for multi-node training of mixtral 8x22B and llama3.1-405B. Also made some other changes.
added multi-node training examples
accept Copilot suggestions
amd-fuyuajin left a comment:
Made Copilot suggested changes. Added multi-node training examples. Ready to merge.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
| The docker container hosts verified coomit `e16b27b` from [Primus repository](https://github.com/AMD-AGI/Primus/tree/e16b27bf6c1b2798f38848fc574fee60d9a9b902). | ||
| The docker container hosts verified commit `e16b27b` from [Primus repository](https://github.com/AMD-AGI/Primus/tree/e16b27bf6c1b2798f38848fc574fee60d9a9b902). |
| The docker container hosts verified commit `e16b27b` from [Primus repository](https://github.com/AMD-AGI/Primus/tree/e16b27bf6c1b2798f38848fc574fee60d9a9b902). | |
| The docker container hosts verified commit `43a6e0` from [Primus repository](https://github.com/AMD-AGI/Primus/tree/release/v26.3). |
Should this be 43a6e006c419697208295c5523b99070e8198ad9? That's the head of the release branch https://github.com/AMD-AGI/Primus/commits/release/v26.3/
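One way to settle the question above is to compare the abbreviated hash in the README with the full hash at the tip of the release branch. The sketch below is illustrative only: `hash_matches` and `branch_head` are hypothetical helper names, not part of MAD or Primus, and the full hash is the one quoted in this thread rather than a freshly fetched value. `branch_head` relies on `git ls-remote`, which needs network access, so it is defined but not called here.

```shell
#!/bin/sh
# Hypothetical helper: succeed when the abbreviated hash $1 is a
# non-empty prefix of the full hash $2.
hash_matches() {
  case "$2" in
    "$1"*) [ -n "$1" ] ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: print the commit at the tip of a remote branch.
# Requires network access, so it is only defined in this sketch.
branch_head() {
  git ls-remote "$1" "refs/heads/$2" | cut -f1
}

# Full hash as quoted in this review thread (assumed to still be the branch head).
full_head="43a6e006c419697208295c5523b99070e8198ad9"

if hash_matches "43a6e0" "$full_head"; then
  echo "README short hash matches the quoted branch head"
else
  echo "README short hash does not match the quoted branch head" >&2
fi
```

With network access, `branch_head https://github.com/AMD-AGI/Primus release/v26.3` could supply `full_head` instead of the hard-coded value.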
| ## 2. Configurations in Yaml Script (`examples/megatron/configs/`) | ||
| ## 2. Configurations in yaml files (`examples/megatron/configs/`) | ||
| Primus defines training yaml for each model inside [examples/megatron/configs/](https://github.com/AMD-AGI/Primus/tree/e16b27bf6c1b2798f38848fc574fee60d9a9b902/examples/megatron/configs) repository. For example, use `examples/megatron/configs/llama3.1_8B-pretrain.yaml` for updating llama3.1_8B training parameters. Other yaml for the supported model can be found with `examples/megatron/configs/${MODEL_NAME}-pretrain.yaml` naming convention in this repository. |
| Primus defines training yaml for each model inside [examples/megatron/configs/](https://github.com/AMD-AGI/Primus/tree/e16b27bf6c1b2798f38848fc574fee60d9a9b902/examples/megatron/configs) repository. For example, use `examples/megatron/configs/llama3.1_8B-pretrain.yaml` for updating llama3.1_8B training parameters. Other yaml for the supported model can be found with `examples/megatron/configs/${MODEL_NAME}-pretrain.yaml` naming convention in this repository. | |
| Primus defines training yaml for each model inside [examples/megatron/configs/](https://github.com/AMD-AGI/Primus/tree/release/v26.3) repository. For example, use `examples/megatron/configs/llama3.1_8B-pretrain.yaml` for updating llama3.1_8B training parameters. Other yaml for the supported model can be found with `examples/megatron/configs/${MODEL_NAME}-pretrain.yaml` naming convention in this repository. |
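The naming convention described in this README line can be sketched as a small shell helper. This is a minimal illustration, assuming the `${MODEL_NAME}-pretrain.yaml` pattern holds for the model in question; `primus_config_path` is a hypothetical name, not a Primus or MAD function, and only the `llama3.1_8B` example comes from the README text.

```shell
#!/bin/sh
# Hypothetical helper: build the config path for a model name using the
# examples/megatron/configs/${MODEL_NAME}-pretrain.yaml convention.
primus_config_path() {
  printf 'examples/megatron/configs/%s-pretrain.yaml\n' "$1"
}

MODEL_NAME="llama3.1_8B"
config="$(primus_config_path "$MODEL_NAME")"
echo "$config"

# Inside a Primus checkout, one could confirm the file exists before editing it:
# [ -f "$config" ] || echo "no config found for $MODEL_NAME" >&2
```

Checking that the file exists before editing is worthwhile because the convention is only a naming pattern; a model name that does not follow it will produce a path with no file behind it.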