Skip to content

[bugfix] avoid swanlab.get_run() on non-main ranks in profiling_context#9546

Closed
HaozheZhang6 wants to merge 1 commit into
modelscope:mainfrom
HaozheZhang6:fix/swanlab-nonmain-rank-profiling
Closed

[bugfix] avoid swanlab.get_run() on non-main ranks in profiling_context#9546
HaozheZhang6 wants to merge 1 commit into
modelscope:mainfrom
HaozheZhang6:fix/swanlab-nonmain-rank-profiling

Conversation

@HaozheZhang6

Copy link
Copy Markdown

PR type

  • Bug Fix

PR information

Fixes #9545.

In distributed GRPO with --report_to swanlab, swanlab is only initialized on the main process. On non-main ranks swanlab.get_run() raises RuntimeError: No active Run. Call swanlab.init() first. instead of returning None, so profiling_context in swift/rlhf_trainers/utils.py crashes during rollout profiling:

File ".../swift/rlhf_trainers/rollout_mixin.py", line 943, in _fast_infer
    self._move_model_to_vllm()
File ".../swift/rlhf_trainers/utils.py", in profiling_context
    if 'swanlab' in trainer.args.report_to and swanlab.get_run() is not None and is_main_process:
RuntimeError: No active Run. Call swanlab.init() first.

is_main_process was evaluated last, after swanlab.get_run(). Moving it first makes the check short-circuit on non-main ranks before get_run() is ever called:

if is_main_process and 'swanlab' in trainer.args.report_to and swanlab.get_run() is not None:

The wandb branch just above is already safe — wandb.run is an attribute that returns None, not a call that raises. The other swanlab.get_run() use (reward_trainer.py:112) already sits inside an if self.accelerator.process_index == 0: guard, so it only runs on the main rank and isn't affected.

Experiment results

Reproducing the original crash needs a multi-rank distributed run. Minimal mechanism demo of the short-circuit order on a non-main rank (where swanlab.get_run() raises):

OLD (current main): CRASH -> No active Run. Call swanlab.init() first.
NEW (fix):          result=False

In distributed GRPO with `--report_to swanlab`, swanlab is only initialized on
the main process. On non-main ranks `swanlab.get_run()` raises
`RuntimeError: No active Run. Call swanlab.init() first.` instead of returning
None, so the condition crashes during rollout profiling (modelscope#9545).

`is_main_process` was evaluated last, after `swanlab.get_run()`. Move it first so
the check short-circuits on non-main ranks before calling `get_run()`. The wandb
branch above is already safe because `wandb.run` is an attribute, not a call that
raises.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request reorders the conditional checks in profiling_context to evaluate is_main_process first, preventing unnecessary calls to swanlab.get_run() on non-main processes. The review feedback recommends further improving robustness by checking if swanlab is available and wrapping the initialization check in a try-except block to handle potential exceptions.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

Comment on lines +610 to 611
if is_main_process and 'swanlab' in trainer.args.report_to and swanlab.get_run() is not None:
swanlab.log(profiling_metrics)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

While moving is_main_process to the front of the condition prevents calling swanlab.get_run() on non-main ranks, there are two additional potential issues:

  1. If swanlab is not installed (i.e., is_swanlab_available() is False), referencing swanlab will raise a NameError on the main process if 'swanlab' is in report_to.
  2. If swanlab.get_run() is called before the run is initialized (even on the main process), it raises a RuntimeError instead of returning None.

To make this check fully robust, we should check is_swanlab_available() and wrap the get_run() call in a try-except block.

Suggested change
if is_main_process and 'swanlab' in trainer.args.report_to and swanlab.get_run() is not None:
swanlab.log(profiling_metrics)
if is_main_process and 'swanlab' in trainer.args.report_to and is_swanlab_available():
try:
if swanlab.get_run() is not None:
swanlab.log(profiling_metrics)
except Exception:
pass

@hjh0119

hjh0119 commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

thanks for your contribution

the relevant issue will be fixed in #9547

@HaozheZhang6

Copy link
Copy Markdown
Author

Makes sense — the centralized swanlab_get_run() helper is cleaner and covers the other call sites (gkd/grpo/reward) that my one-line reorder in profiling_context didn't. Closing in favor of #9547.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

swanlab crashes on non-main GRPO ranks

2 participants