Agent: add hard timeout for snapshot_download to prevent indefinite hangs #718
Merged
…angs
- Set HF_HUB_DOWNLOAD_TIMEOUT=30 (default was 10s) to prevent slow HTTP hangs
- Add signal.alarm(120) wall-clock timeout around snapshot_download call to catch TCP stalls that don't trigger HTTP-level timeouts
- Add _DownloadTimeoutError for clean timeout handling
Thanks for your contribution!
Xreki
approved these changes
May 21, 2026
)

old_handler = signal.signal(signal.SIGALRM, _alarm_handler)
signal.alarm(120)
Collaborator
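subprocess_graph_extractor.py uses subprocess.Popen for its timeout control, because a subprocess lets you actually kill the child process. This task is more lightweight, though, so the approach here should be fine.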
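For comparison, the subprocess-based alternative the reviewer mentions could look roughly like the sketch below. It is not taken from subprocess_graph_extractor.py, whose contents are not shown here; `run_with_kill_timeout` is a hypothetical name, and the sketch only illustrates the general idea that a child process can be forcibly killed once a deadline passes.

```python
import subprocess
import sys


def run_with_kill_timeout(argv, timeout_s):
    """Run argv in a child process and kill it if it exceeds timeout_s seconds.

    Hypothetical helper. Returns the child's exit code, or None if the
    child had to be killed because it overran the deadline.
    """
    proc = subprocess.Popen(argv)
    try:
        return proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX, so the child cannot ignore it
        proc.wait()  # reap the child so no zombie process is left behind
        return None
```

A call such as `run_with_kill_timeout([sys.executable, "-c", "import time; time.sleep(30)"], 1)` returns `None` after about one second. Unlike `signal.alarm`, this works off the main thread and terminates the work even if it is stuck in code that never returns to the Python interpreter, at the cost of starting a separate process.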
PR Category
Feature Enhancement
Description
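Problem
When HFFetcher.download() calls snapshot_download to download model files, it can block indefinitely because of network problems. A typical case: the merges.txt download for Frank290350/ku-typhoon-v1-merged hung for 13 hours, blocking the entire worker process so the pipeline could not continue.
Root cause analysis:
- HF_HUB_DOWNLOAD_TIMEOUT defaults to only 10s, and it only controls the timeout of a single HTTP request
- The retry/resume logic inside huggingface_hub can loop indefinitely without raising an exception when the TCP connection stays alive but no data is transferred
Changes
Raise the per-request HTTP timeout to 30s
- HF_HUB_DOWNLOAD_TIMEOUT=30 (default 10s)
Add a 120s hard overall timeout (signal.alarm)
- Wrap the snapshot_download call in a process-level wall-clock timeout
- Small files such as config.json / tokenizer.json
- _DownloadTimeoutError, converted to ModelFetchError
Custom timeout exception
- _DownloadTimeoutError is used specifically to distinguish timeout failures from other network errors
Effect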