Agent: add extraction helper scripts and dedup tool #713

Merged
luotao1 merged 1 commit into
PaddlePaddle:develop from
luotao1:agent-scripts
May 19, 2026
luotao1 (Collaborator) commented May 19, 2026

PR Category

Feature Enhancement

Description

Agent: add extraction helper scripts and a dedup tool

Adds the graph_net/agent/scripts/ directory with three helper scripts:

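  1. analyze_extraction_log.sh: log analysis script

    • Analyzes the extraction logs of completed batches and counts successful, failed, and unprocessed models
    • Summarizes failures by cause (Script execution failed / Failed to analyze model /
      Failed to download / timeout / 401 / ducc, etc.)
    • Supports both CPU and GPU modes and is compatible with the log formats of different batches
    • Supports binary log files (grep -a)
    • Fixes the counting logic: total attempts, successes, and failures are counted accurately, with no double counting or omissions
    • Writes the lists of processed and successful models to /tmp/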
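  2. check_extraction_progress.sh: progress check script

    • Shows the live status of the current extraction task with a single command
    • Reports process status (PID, CPU/memory, worker count), log progress, summary statistics
      (processed/total, successes/failures, success rate, speed estimate, estimated time remaining),
      disk space, and the number of files in the samples directory
    • Finds the latest log automatically, or accepts a log file path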
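  3. gen_hash_and_dedup.py: subgraph dedup script

    • Walks the model.py files under the extraction output directory and computes a SHA256 hash for each, writing graph_hash.txt
    • Groups subgraphs by hash to find those with identical content and writes a dedup report, dedup_report.txt
    • Supports a --remove flag that deletes duplicate directories in one step (keeping the first in each group)
    • Defaults to the current directory as the workspace, avoiding hardcoded personal paths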
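
The hash-and-group flow described for gen_hash_and_dedup.py can be sketched roughly as follows. This is an illustration under stated assumptions, not the script from this PR: the function names, the sorted traversal order used to decide which directory counts as "first", and the report line format are all hypothetical.

```python
import hashlib
import shutil
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def sha256_of_file(path: Path) -> str:
    """Return the hex SHA256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def group_by_hash(root: Path) -> Dict[str, List[Path]]:
    """Hash every model.py under root, write graph_hash.txt next to it,
    and group the containing directories by digest."""
    groups: Dict[str, List[Path]] = defaultdict(list)
    # Sorted traversal is an assumption; it makes "first per group" deterministic.
    for model_file in sorted(root.rglob("model.py")):
        digest = sha256_of_file(model_file)
        (model_file.parent / "graph_hash.txt").write_text(digest)
        groups[digest].append(model_file.parent)
    return dict(groups)


def find_duplicates(groups: Dict[str, List[Path]]) -> List[Path]:
    """Every directory except the first one in each hash group."""
    return [d for dirs in groups.values() for d in dirs[1:]]


def write_report(groups: Dict[str, List[Path]], report_path: Path) -> None:
    """Write one line per duplicated group (hypothetical report format)."""
    lines = []
    for digest, dirs in groups.items():
        if len(dirs) > 1:
            dups = ", ".join(str(d) for d in dirs[1:])
            lines.append(f"{digest} keep={dirs[0]} duplicates={dups}")
    report_path.write_text("\n".join(lines) + ("\n" if lines else ""))


def remove_duplicates(groups: Dict[str, List[Path]]) -> None:
    """Delete duplicate directories, keeping the first in each group."""
    for directory in find_duplicates(groups):
        if directory.exists():
            shutil.rmtree(directory)
```

Calling group_by_hash(Path(".")) followed by write_report(...) corresponds to the default report-only run; remove_duplicates(...) corresponds to what the description says --remove does.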

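Updates graph_net/agent/README.md:

  • Adds a "Helper scripts" section describing the usage and output of the three scripts
  • Documents the environment variable overrides (GRAPHNET_LOG_DIR / GRAPHNET_SUCCESS_DIR /
    GRAPHNET_SAMPLES_DIR) so the scripts can be reused across environments
  • Explains why the dedup rate is high: fine-tuned variants of the same base model share exactly the same
    computation graph, so a measured run shrank by over 90% (85K subgraphs -> 1.5K unique, 2.3 GB -> 172 MB)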
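
The environment variable overrides could be resolved along these lines. This is a minimal sketch: the three variable names come from the description above, but the default paths and the resolve_dirs helper are placeholders invented for illustration, since the scripts' real defaults are not shown here.

```python
import os
from pathlib import Path
from typing import Dict, Mapping, Optional

# Placeholder defaults (assumption); the real scripts may use different paths.
DEFAULT_DIRS = {
    "GRAPHNET_LOG_DIR": "./logs",
    "GRAPHNET_SUCCESS_DIR": "./success",
    "GRAPHNET_SAMPLES_DIR": "./samples",
}


def resolve_dirs(env: Optional[Mapping[str, str]] = None) -> Dict[str, Path]:
    """Return each directory, preferring a non-empty environment value
    over the built-in default."""
    env = os.environ if env is None else env
    return {
        name: Path(env.get(name) or default)
        for name, default in DEFAULT_DIRS.items()
    }


if __name__ == "__main__":
    for name, path in resolve_dirs().items():
        print(f"{name} = {path}")
```

Treating an empty value the same as an unset one is a design choice in this sketch, so that an accidental `GRAPHNET_LOG_DIR=` does not silently point a script at the current directory.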

paddle-bot (Bot) commented May 19, 2026

Thanks for your contribution!

Add graph_net/agent/scripts/ with three utilities:

1. analyze_extraction_log.sh
   - Analyze batch extraction logs for success/failure stats
   - Categorize failures (script error, model too large, download failure, etc.)
   - Support CPU/GPU modes and binary logs (grep -a)
   - Generate processed/success model lists to /tmp/

2. check_extraction_progress.sh
   - Check real-time status of running extraction tasks
   - Show PID, CPU/memory, worker count, progress, speed estimate,
     disk space, and sample directory counts
   - Auto-detect latest log or accept custom log path

3. gen_hash_and_dedup.py
   - Walk extracted subgraphs, compute SHA256 of model.py files
   - Generate graph_hash.txt per subgraph and dedup_report.txt
   - Support --remove to delete duplicate subgraphs (keep first per group)
   - Default workspace is current directory (.), no hardcoded paths
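
For illustration, the failure categorization performed by analyze_extraction_log.sh could look roughly like this in Python. The marker strings are the ones listed in the PR description; the log line layout, the first-match rule, and the function name are assumptions, and the actual script is a shell script built around grep -a.

```python
from collections import Counter

# Failure markers named in the PR description. Order matters because the
# first matching marker wins (an assumption of this sketch).
FAILURE_MARKERS = [
    "Script execution failed",
    "Failed to analyze model",
    "Failed to download",
    "timeout",
    "401",
    "ducc",
]


def categorize_failures(log_bytes: bytes) -> Counter:
    """Count log lines per failure marker, tolerating non-text bytes."""
    counts: Counter = Counter()
    # errors="replace" keeps decoding past invalid bytes, similar in spirit
    # to grep -a treating a binary file as text.
    text = log_bytes.decode("utf-8", errors="replace")
    for line in text.splitlines():
        for marker in FAILURE_MARKERS:
            if marker in line:
                counts[marker] += 1
                break  # count each line once to avoid double counting
    return counts


if __name__ == "__main__":
    # Hypothetical log excerpt; real log lines may be formatted differently.
    sample = b"model-a: Failed to download\n\xff\xfe model-b: timeout\n"
    print(categorize_failures(sample))
```

Counting each line under at most one marker is one way to keep the per-category totals consistent with the overall failure count.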

Update graph_net/agent/README.md with usage docs and environment variable
overrides (GRAPHNET_LOG_DIR, GRAPHNET_SUCCESS_DIR, GRAPHNET_SAMPLES_DIR).
luotao1 merged commit b0a7d12 into PaddlePaddle:develop on May 19, 2026
3 checks passed
luotao1 deleted the agent-scripts branch on May 19, 2026 12:44
luotao1 (Collaborator, Author) commented May 20, 2026

TODO:

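  1. The subgraph dedup script gen_hash_and_dedup.py currently implements the SHA256 hash computation inline. A follow-up on the agent-scripts branch should change it to reuse the library's shared graph_net.hash_util.get_sha256_hash, keeping it consistent with the Agent extraction flow and other modules and avoiding scattered maintenance of the hashing logic.
  2. It should also ensure that every subgraph directory has a graph_hash.txt, generating it automatically when missing, and then run the dedup analysis on the complete set of hashes.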
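
The second TODO item might be sketched as below. Because the signature of graph_net.hash_util.get_sha256_hash is not shown in this thread, the sketch takes the hash function as a parameter and falls back to a hashlib stand-in; backfill_graph_hashes and default_hash are hypothetical names.

```python
import hashlib
from pathlib import Path
from typing import Callable, List


def default_hash(model_file: Path) -> str:
    """Stand-in hash (assumption); the TODO proposes using the shared
    library helper here instead."""
    return hashlib.sha256(model_file.read_bytes()).hexdigest()


def backfill_graph_hashes(
    root: Path, hash_fn: Callable[[Path], str] = default_hash
) -> List[Path]:
    """Write graph_hash.txt for every subgraph directory that lacks one.

    Existing hash files are left untouched. Returns the files written.
    """
    written: List[Path] = []
    for model_file in sorted(root.rglob("model.py")):
        hash_file = model_file.parent / "graph_hash.txt"
        if not hash_file.exists():
            hash_file.write_text(hash_fn(model_file))
            written.append(hash_file)
    return written
```

Running the backfill before grouping means the dedup analysis can read graph_hash.txt from every directory rather than mixing stored and freshly computed hashes.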
