Agent: add hard timeout for snapshot_download to prevent indefinite hangs #718

Merged
luotao1 merged 1 commit into PaddlePaddle:develop from luotao1:timeout
May 21, 2026

Conversation

Collaborator

@luotao1 luotao1 commented May 20, 2026

PR Category

Feature Enhancement

Description

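Problem

When HFFetcher.download() calls snapshot_download to download model files, it can block indefinitely because of network problems.

Typical case: the download of merges.txt from Frank290350/ku-typhoon-v1-merged stalled for 13 hours, blocking the entire worker process so the pipeline could not continue.

Root cause analysis:

  • HF_HUB_DOWNLOAD_TIMEOUT defaults to only 10s and controls only the timeout of a single HTTP request
  • When a TCP connection stays alive but transfers no data, the retry/resume logic inside huggingface_hub can loop indefinitely without raising an exception
  • There is no process-level timeout protection, so a single file download can hang forever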

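Changes

  1. Raise the per-request HTTP timeout to 30s

    • HF_HUB_DOWNLOAD_TIMEOUT=30 (default 10s)
    • Accommodates slow networks while still preventing indefinite hangs
  2. Add a 120s hard overall timeout (signal.alarm)

    • Wraps the entire snapshot_download call in a process-level wall-clock timeout
    • 120 seconds is enough to download small files such as config.json and tokenizer.json
    • On timeout, _DownloadTimeoutError is raised and converted to ModelFetchError
    • The caller records it as "download failed" and moves on to the next model instead of blocking the whole pipeline
  3. Custom timeout exception

    • _DownloadTimeoutError exists specifically to distinguish timeout failures from other network errors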

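Effect

  • A single file download waits at most 120 seconds, then fails gracefully on timeout
  • The worker no longer blocks the whole pipeline because one model's download is stuck
  • Model download failures change from "waiting forever" to "controlled failure"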
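
The "controlled failure" behaviour on the caller side might look like the sketch below. It is illustrative only: `fetch_all` and its `download` parameter are hypothetical names, and `ModelFetchError` is a local stand-in for the project's exception; the actual worker loop is not shown in this PR.

```python
class ModelFetchError(Exception):
    """Stand-in for the fetcher's public error type (name taken from the PR text)."""


def fetch_all(model_ids, download):
    """Hypothetical worker loop: record failed downloads and keep going."""
    succeeded = []
    failed = {}
    for model_id in model_ids:
        try:
            download(model_id)
        except ModelFetchError as exc:
            # A timeout surfaces here as an ordinary fetch failure,
            # so the loop moves on to the next model.
            failed[model_id] = str(exc)
            continue
        succeeded.append(model_id)
    return succeeded, failed
```

The point of converting the timeout into `ModelFetchError` is that a loop like this needs no special case for hangs: one slow model costs a bounded amount of time instead of stalling everything behind it.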

…angs

- Set HF_HUB_DOWNLOAD_TIMEOUT=30 (default was 10s) to prevent slow HTTP hangs
- Add signal.alarm(120) wall-clock timeout around snapshot_download call
  to catch TCP stalls that don't trigger HTTP-level timeouts
- Add _DownloadTimeoutError for clean timeout handling
@paddle-bot

paddle-bot Bot commented May 20, 2026

Thanks for your contribution!

)

old_handler = signal.signal(signal.SIGALRM, _alarm_handler)
signal.alarm(120)
Collaborator


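To add timeout control, subprocess_graph_extractor.py uses subprocess.Popen; with a subprocess, the child process can actually be killed. This task is lighter-weight, though, so this approach should be fine.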
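
For comparison, the subprocess-based approach the reviewer mentions could be sketched like this. It is not the code from subprocess_graph_extractor.py, which is not shown here; `run_in_subprocess` is a hypothetical helper that only illustrates why a child process can be terminated for certain, whereas `signal.alarm` relies on the blocked call being interruptible.

```python
import subprocess
import sys


def run_in_subprocess(code, timeout_s):
    """Hypothetical helper: run Python `code` in a child process with a hard timeout.

    Returns the child's exit code, or None if it had to be killed.
    """
    proc = subprocess.Popen([sys.executable, "-c", code])
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # the OS terminates the child regardless of what it is doing
        proc.wait()  # reap the process so it does not linger as a zombie
        return None
```

The trade-off is the cost of starting a new interpreter and passing results back across the process boundary, which is presumably why the in-process alarm was considered acceptable for this lighter task.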

@luotao1 luotao1 merged commit 0e58051 into PaddlePaddle:develop May 21, 2026
3 checks passed
@luotao1 luotao1 deleted the timeout branch May 21, 2026 01:53
2 participants