[bugfix] avoid swanlab.get_run() on non-main ranks in profiling_context by HaozheZhang6 · Pull Request #9546 · modelscope/ms-swift

HaozheZhang6 · 2026-06-12T08:14:48Z

PR type

Bug Fix

PR information

In distributed GRPO with --report_to swanlab, swanlab is only initialized on the main process. On non-main ranks swanlab.get_run() raises RuntimeError: No active Run. Call swanlab.init() first. instead of returning None, so profiling_context in swift/rlhf_trainers/utils.py crashes during rollout profiling:

File ".../swift/rlhf_trainers/rollout_mixin.py", line 943, in _fast_infer
    self._move_model_to_vllm()
File ".../swift/rlhf_trainers/utils.py", in profiling_context
    if 'swanlab' in trainer.args.report_to and swanlab.get_run() is not None and is_main_process:
RuntimeError: No active Run. Call swanlab.init() first.

is_main_process was evaluated last, after swanlab.get_run(). Moving it first makes the check short-circuit on non-main ranks before get_run() is ever called:

if is_main_process and 'swanlab' in trainer.args.report_to and swanlab.get_run() is not None:

The wandb branch just above is already safe — wandb.run is an attribute that returns None, not a call that raises. The other swanlab.get_run() use (reward_trainer.py:112) already sits inside an if self.accelerator.process_index == 0: guard, so it only runs on the main rank and isn't affected.

Experiment results

Reproducing the original crash needs a multi-rank distributed run. Minimal mechanism demo of the short-circuit order on a non-main rank (where swanlab.get_run() raises):

OLD (current main): CRASH -> No active Run. Call swanlab.init() first.
NEW (fix):          result=False

In distributed GRPO with `--report_to swanlab`, swanlab is only initialized on the main process. On non-main ranks `swanlab.get_run()` raises `RuntimeError: No active Run. Call swanlab.init() first.` instead of returning None, so the condition crashes during rollout profiling (modelscope#9545). `is_main_process` was evaluated last, after `swanlab.get_run()`. Move it first so the check short-circuits on non-main ranks before calling `get_run()`. The wandb branch above is already safe because `wandb.run` is an attribute, not a call that raises.

gemini-code-assist

Code Review

This pull request reorders the conditional checks in profiling_context to evaluate is_main_process first, preventing unnecessary calls to swanlab.get_run() on non-main processes. The review feedback recommends further improving robustness by checking if swanlab is available and wrapping the initialization check in a try-except block to handle potential exceptions.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

gemini-code-assist · 2026-06-12T08:15:31Z

+    if is_main_process and 'swanlab' in trainer.args.report_to and swanlab.get_run() is not None:
        swanlab.log(profiling_metrics)


While moving is_main_process to the front of the condition prevents calling swanlab.get_run() on non-main ranks, there are two additional potential issues:

If swanlab is not installed (i.e., is_swanlab_available() is False), referencing swanlab will raise a NameError on the main process if 'swanlab' is in report_to.

If swanlab.get_run() is called before the run is initialized (even on the main process), it raises a RuntimeError instead of returning None.

To make this check fully robust, we should check is_swanlab_available() and wrap the get_run() call in a try-except block.

Suggested change

if is_main_process and 'swanlab' in trainer.args.report_to and swanlab.get_run() is not None:

swanlab.log(profiling_metrics)

if is_main_process and 'swanlab' in trainer.args.report_to and is_swanlab_available():

try:

if swanlab.get_run() is not None:

swanlab.log(profiling_metrics)

except Exception:

pass

hjh0119 · 2026-06-12T10:31:52Z

thanks for your contribution

the relevant issue will be fixed in #9547

HaozheZhang6 · 2026-06-12T10:52:21Z

Makes sense — the centralized swanlab_get_run() helper is cleaner and covers the other call sites (gkd/grpo/reward) that my one-line reorder in profiling_context didn't. Closing in favor of #9547.

gemini-code-assist Bot reviewed Jun 12, 2026

View reviewed changes

HaozheZhang6 closed this Jun 12, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[bugfix] avoid swanlab.get_run() on non-main ranks in profiling_context#9546

[bugfix] avoid swanlab.get_run() on non-main ranks in profiling_context#9546
HaozheZhang6 wants to merge 1 commit into
modelscope:mainfrom
HaozheZhang6:fix/swanlab-nonmain-rank-profiling

HaozheZhang6 commented Jun 12, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Uh oh!

hjh0119 commented Jun 12, 2026

Uh oh!

HaozheZhang6 commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		if is_main_process and 'swanlab' in trainer.args.report_to and swanlab.get_run() is not None:
		swanlab.log(profiling_metrics)

Conversation

HaozheZhang6 commented Jun 12, 2026

PR type

PR information

Experiment results

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot Jun 12, 2026

Choose a reason for hiding this comment

Uh oh!

hjh0119 commented Jun 12, 2026

Uh oh!

HaozheZhang6 commented Jun 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants