
Problem with packed_data #208

@WhyDwelledOnAi


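When training on the packed_data produced by generate_packed_dataset.py, training hangs at metric_logger.synchronize_between_processes() in accessory/engine_pretrain.py, and DDP then terminates with a timeout.
Training with the *.parquet files directly does not have this problem.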

[E ProcessGroupNCCL.cpp:828] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=495104, OpType=_ALLGATHER_BASE, Timeout(ms)=1800000) ran for 1805715 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:455] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:460] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'

The environment strictly follows the requirement.txt from the documentation.
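A timeout inside a collective such as _ALLGATHER_BASE is often seen when ranks run a different number of iterations, so that one rank reaches the synchronization call while the others are still training. Whether that is what happens with packed_data here is not confirmed. As a minimal diagnostic sketch under that assumption, the per-rank step counts could be compared offline. The function names and the `samples_per_rank` input are hypothetical and not part of the repository; the local sample counts would have to be collected separately, for example by printing the dataset length on each rank.

```python
import math


def steps_per_rank(samples_per_rank, batch_size, drop_last=False):
    """Return how many iterations each rank would run for its local sample count.

    samples_per_rank: list of local dataset sizes, indexed by rank (assumed input).
    """
    if drop_last:
        return [n // batch_size for n in samples_per_rank]
    return [math.ceil(n / batch_size) for n in samples_per_rank]


def find_mismatched_ranks(samples_per_rank, batch_size, drop_last=False):
    """Return {rank: steps} for ranks whose step count differs from the minimum.

    An empty dict means every rank runs the same number of iterations, so
    uneven step counts would not explain a hang in a collective call.
    """
    steps = steps_per_rank(samples_per_rank, batch_size, drop_last)
    expected = min(steps)
    return {rank: s for rank, s in enumerate(steps) if s != expected}


if __name__ == "__main__":
    # Hypothetical numbers: rank 0 holds 1000 samples, rank 1 holds 992.
    print(find_mismatched_ranks([1000, 992], batch_size=8))
```

If this reports a non-empty result for the packed dataset but an empty one for the *.parquet data, uneven sharding would be a plausible place to look first.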
