[bugfix] avoid swanlab.get_run() on non-main ranks in profiling_context#9546
[bugfix] avoid swanlab.get_run() on non-main ranks in profiling_context#9546HaozheZhang6 wants to merge 1 commit into
Conversation
In distributed GRPO with `--report_to swanlab`, swanlab is only initialized on the main process. On non-main ranks `swanlab.get_run()` raises `RuntimeError: No active Run. Call swanlab.init() first.` instead of returning None, so the condition crashes during rollout profiling (modelscope#9545). `is_main_process` was evaluated last, after `swanlab.get_run()`. Move it first so the check short-circuits on non-main ranks before calling `get_run()`. The wandb branch above is already safe because `wandb.run` is an attribute, not a call that raises.
There was a problem hiding this comment.
Code Review
This pull request reorders the conditional checks in profiling_context to evaluate is_main_process first, preventing unnecessary calls to swanlab.get_run() on non-main processes. The review feedback recommends further improving robustness by checking if swanlab is available and wrapping the initialization check in a try-except block to handle potential exceptions.
Important
The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.
| if is_main_process and 'swanlab' in trainer.args.report_to and swanlab.get_run() is not None: | ||
| swanlab.log(profiling_metrics) |
There was a problem hiding this comment.
While moving is_main_process to the front of the condition prevents calling swanlab.get_run() on non-main ranks, there are two additional potential issues:
- If
swanlabis not installed (i.e.,is_swanlab_available()isFalse), referencingswanlabwill raise aNameErroron the main process if'swanlab'is inreport_to. - If
swanlab.get_run()is called before the run is initialized (even on the main process), it raises aRuntimeErrorinstead of returningNone.
To make this check fully robust, we should check is_swanlab_available() and wrap the get_run() call in a try-except block.
| if is_main_process and 'swanlab' in trainer.args.report_to and swanlab.get_run() is not None: | |
| swanlab.log(profiling_metrics) | |
| if is_main_process and 'swanlab' in trainer.args.report_to and is_swanlab_available(): | |
| try: | |
| if swanlab.get_run() is not None: | |
| swanlab.log(profiling_metrics) | |
| except Exception: | |
| pass |
|
thanks for your contribution the relevant issue will be fixed in #9547 |
|
Makes sense — the centralized |
PR type
PR information
Fixes #9545.
In distributed GRPO with
--report_to swanlab, swanlab is only initialized on the main process. On non-main ranksswanlab.get_run()raisesRuntimeError: No active Run. Call swanlab.init() first.instead of returningNone, soprofiling_contextinswift/rlhf_trainers/utils.pycrashes during rollout profiling:is_main_processwas evaluated last, afterswanlab.get_run(). Moving it first makes the check short-circuit on non-main ranks beforeget_run()is ever called:The
wandbbranch just above is already safe —wandb.runis an attribute that returnsNone, not a call that raises. The otherswanlab.get_run()use (reward_trainer.py:112) already sits inside anif self.accelerator.process_index == 0:guard, so it only runs on the main rank and isn't affected.Experiment results
Reproducing the original crash needs a multi-rank distributed run. Minimal mechanism demo of the short-circuit order on a non-main rank (where
swanlab.get_run()raises):