swanlab crashes on non-main GRPO ranks #9545

@rxqy

Checklist

  • I have searched existing issues, and this is a new bug report.

Bug Description

When running GRPO with distributed training and --report_to swanlab, non-main ranks can crash during rollout/vLLM profiling with:

RuntimeError: No active Run. Call swanlab.init() first.

The crash occurs during rollout, around _move_model_to_vllm():

File ".../swift/rlhf_trainers/rollout_mixin.py", line 943, in _fast_infer
    self._move_model_to_vllm()
File ".../swift/rlhf_trainers/utils.py", line 626, in profiling_context
    if 'swanlab' in trainer.args.report_to and swanlab.get_run() is not None and is_main_process:
RuntimeError: No active Run. Call swanlab.init() first.
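
The traceback suggests that `swanlab.get_run()` raises instead of returning `None` when no run is active, and that it is evaluated before `is_main_process` in the condition. The following is a minimal sketch of that evaluation order, not the project's actual fix. It assumes non-main ranks never call `swanlab.init()`; `get_run` here is a hypothetical stand-in that only mimics the error from the traceback, and both helper function names are made up for illustration.

```python
def get_run():
    """Hypothetical stand-in for swanlab.get_run() on a rank with no active run.

    It only reproduces the error message seen in the traceback above.
    """
    raise RuntimeError("No active Run. Call swanlab.init() first.")


def should_log_original(report_to, is_main_process):
    # Same order as the line in the traceback: get_run() is evaluated
    # before the rank check, so it raises on ranks without a run.
    return 'swanlab' in report_to and get_run() is not None and is_main_process


def should_log_reordered(report_to, is_main_process):
    # Rank check first: `and` short-circuits, so get_run() is never
    # called on non-main ranks.
    return 'swanlab' in report_to and is_main_process and get_run() is not None


report_to = ['swanlab', 'tensorboard']

try:
    should_log_original(report_to, is_main_process=False)
    original_raised = False
except RuntimeError:
    original_raised = True

reordered_result = should_log_reordered(report_to, is_main_process=False)
```

With the rank check moved forward, a non-main rank skips the swanlab branch without touching the run state. The reordered condition would still raise on the main rank if no run were active there, so wrapping the call in a `try`/`except RuntimeError` may be a more defensive alternative.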

How to Reproduce

My training script:

swift rlhf \
    --rlhf_type grpo \
    --model $MODEL \
    --use_vllm true \
    --vllm_mode colocate \
    --vllm_gpu_memory_utilization 0.4 \
    --vllm_tensor_parallel_size 8 \
    --vllm_max_model_len 20480 \
    --beta 0. \
    --num_generations $NUM_GENERATIONS \
    --external_plugins reward_plugin.py \
    --reward_funcs some_random_reward \
    --output_dir $OUTPUT_DIR \
    --tuner_type full \
    --dataset some_random_jsonl.files \
    --dataloader_drop_last true \
    --dataloader_persistent_workers true \
    --dataloader_num_workers 8 \
    --load_from_cache_file false \
    --warmup_ratio 0.05 \
    --num_train_epochs 4 \
    --per_device_train_batch_size 1 \
    --gradient_checkpointing true \
    --gradient_accumulation_steps 8 \
    --learning_rate 5e-7 \
    --max_length 18000 \
    --max_completion_length 2048 \
    --save_steps 100 \
    --save_total_limit 999 \
    --save_only_model true \
    --logging_steps 5 \
    --report_to swanlab tensorboard \
    --eval_strategy no \
    --eval_steps 100 \
    --deepspeed zero3

Additional Information

swift version:
[INFO:swift] Start time of running main: 2026-06-12 15:54:39.049291
[INFO:swift] swift.version: 4.4.0.dev0

This should be the latest code from main.

Labels: bug
