Agent: fix graph hash generation for multi-subgraph models and unify hash utilities #716

Merged
luotao1 merged 3 commits into PaddlePaddle:develop from luotao1:hash
May 20, 2026

luotao1 (Collaborator) commented May 20, 2026

PR Category

Feature Enhancement

Description

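Background

_generate_graph_hash only handled single-graph models (those with model.py in the sample's root directory) and skipped multi-subgraph models (directory layout subgraph_0/, subgraph_1/, ...) outright, so graph_hash.txt was never generated for them. Concrete impact:

  • Multi-subgraph models have no graph_hash.txt
  • is_duplicate_sample() relies on graph_hash.txt for deduplication and returns False when the file is missing, so samples with identical computation-graph structure are extracted again
  • For fine-tuned variants (which share the same base architecture), this wastes a large amount of compute

In addition, the subgraph deduplication script gen_hash_and_dedup.py implemented its SHA256 computation inline instead of using the library's shared graph_net.hash_util.get_sha256_hash, which makes the code harder to maintain.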

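Changes

1. Fix hash generation and deduplication for multi-subgraph models (graph_net_agent.py)

  • Unified handling of single-graph and multi-subgraph models: add the _get_subgraph_dirs() helper method

    • Single-graph model: returns [sample_dir]
    • Multi-subgraph model: returns the sorted list [subgraph_0, subgraph_1, ...]
    • Removes every hardcoded subgraph_xxx glob
  • _generate_graph_hash: loops uniformly over the result of _get_subgraph_dirs

    • Computes the sha256 of model.py independently for each subgraph directory and writes graph_hash.txt into that same directory
    • For a single-graph model the loop runs exactly once, so the existing behavior is covered without a special case
  • is_duplicate_sample: collects hashes uniformly via _get_subgraph_dirs

    • Defines an inner _collect_hashes() function that walks all subgraph directories and collects their graph_hash.txt contents
    • Single-graph and multi-subgraph models are both compared as a frozenset of hashes, removing the duplicated branches
  • _fix_model_name: also reuses _get_subgraph_dirs, removing its hardcoded glob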
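
The flow described above can be sketched roughly as follows. This is a minimal illustration, not the code from graph_net_agent.py: the function signatures, the known_dirs parameter, the make_sample demo helper, and the use of plain hashlib over the raw bytes of model.py (in place of graph_net.hash_util.get_sha256_hash) are all assumptions made for the example.

```python
from __future__ import annotations

import hashlib
import tempfile
from pathlib import Path


def get_subgraph_dirs(sample_dir: Path) -> list[Path]:
    """Single-graph model -> [sample_dir]; multi-subgraph -> sorted subgraph_* dirs."""
    if (sample_dir / "model.py").is_file():
        return [sample_dir]
    return sorted(
        d for d in sample_dir.iterdir()
        if d.is_dir() and d.name.startswith("subgraph_") and (d / "model.py").is_file()
    )


def generate_graph_hash(sample_dir: Path) -> None:
    """Write one graph_hash.txt next to every model.py (loops once for single-graph)."""
    for d in get_subgraph_dirs(sample_dir):
        # Assumption: hash the raw file bytes; the real helper may normalize first.
        digest = hashlib.sha256((d / "model.py").read_bytes()).hexdigest()
        (d / "graph_hash.txt").write_text(digest)


def collect_hashes(sample_dir: Path) -> frozenset[str]:
    """Gather every available graph_hash.txt value for a sample."""
    hashes = set()
    for d in get_subgraph_dirs(sample_dir):
        hash_file = d / "graph_hash.txt"
        if hash_file.is_file():
            hashes.add(hash_file.read_text().strip())
    return frozenset(hashes)


def is_duplicate_sample(sample_dir: Path, known_dirs: list[Path]) -> bool:
    """True if some known sample has exactly the same set of subgraph hashes."""
    current = collect_hashes(sample_dir)
    if not current:
        return False  # no hashes available, so nothing to compare against
    return any(collect_hashes(k) == current for k in known_dirs)


def make_sample(sample_dir: Path, sources: list[str]) -> Path:
    """Demo helper (hypothetical): one source -> single-graph layout, else subgraph_N/."""
    if len(sources) == 1:
        targets = [sample_dir]
    else:
        targets = [sample_dir / f"subgraph_{i}" for i in range(len(sources))]
    for target, src in zip(targets, sources):
        target.mkdir(parents=True, exist_ok=True)
        (target / "model.py").write_text(src)
    return sample_dir


if __name__ == "__main__":
    root = Path(tempfile.mkdtemp())
    base = make_sample(root / "base", ["x = 1", "y = 2"])
    variant = make_sample(root / "variant", ["y = 2", "x = 1"])
    for sample in (base, variant):
        generate_graph_hash(sample)
    print(is_duplicate_sample(variant, [base]))
```

Note that a frozenset comparison ignores both subgraph order and repeated hashes within one sample, so two samples match whenever they contain the same set of distinct subgraph structures.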

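2. Unify the hash utility (gen_hash_and_dedup.py)

  • Remove the inline get_sha256_hash implementation
  • Reuse graph_net.hash_util.get_sha256_hash, keeping the script consistent with the Agent extraction pipeline and other modules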

Verification

Ran the deduplication analysis on the historical successful samples in /home/luotao02/workspace/success_backup_20260519:

python graph_net/agent/scripts/gen_hash_and_dedup.py /path/to/workspace

Output:

Found 24430 model.py files under /path/to/workspace
Step 1 - Generate graph_hash.txt:
  Total model.py: 24430
  Generated/Updated: 24430
  Failed: 0
Step 2 - Deduplication analysis:
  Total subgraphs: 24430
  Unique graphs: 503
  Duplicate groups: 344
  Subgraphs involved in duplication: 24271
  Can be removed (keeping one per group): 23927
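
The reported figures are mutually consistent, which can be checked with a short sketch. It assumes the script defines the fields the obvious way (a duplicate group is a hash shared by two or more subgraphs); the function names here are illustrative and are not taken from gen_hash_and_dedup.py.

```python
from collections import Counter


def dedup_stats(hashes: list) -> dict:
    """Summarize duplication in a list of per-subgraph hashes."""
    counts = Counter(hashes)
    group_sizes = [n for n in counts.values() if n > 1]
    involved = sum(group_sizes)
    return {
        "total": len(hashes),
        "unique": len(counts),
        "duplicate_groups": len(group_sizes),
        "involved": involved,
        # Keeping one subgraph per duplicate group removes the rest.
        "removable": involved - len(group_sizes),
    }


def report_is_consistent(total, unique, groups, involved, removable) -> bool:
    """Check the identities that must hold between the reported numbers."""
    singletons = unique - groups  # hashes that occur exactly once
    return (
        involved == total - singletons
        and removable == involved - groups
        and removable == total - unique
    )


if __name__ == "__main__":
    # Figures from the output above.
    print(report_is_consistent(24430, 503, 344, 24271, 23927))
```

Under these definitions, 503 - 344 = 159 hashes occur exactly once, 24430 - 159 = 24271 subgraphs sit in duplicate groups, and 24271 - 344 = 23927 can be removed, leaving the 503 unique graphs.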

luotao1 added 2 commits May 20, 2026 14:43
- _generate_graph_hash: generate per-subgraph hashes (subgraph_N/graph_hash.txt)
  instead of a single top-level hash, avoiding false collisions
- is_duplicate_sample: use frozenset of subgraph hashes for multi-subgraph models,
  preventing rglob false matches on per-subgraph hash files
- Single-graph model logic unchanged (root graph_hash.txt)

Replace inline SHA256 implementation with the canonical get_sha256_hash
from graph_net.hash_util, consistent with the agent extraction pipeline
and other modules.

paddle-bot (Bot) commented May 20, 2026

Thanks for your contribution!

- Add _get_subgraph_dirs() helper: returns [sample_dir] for single-graph
  or [subgraph_0, subgraph_1, ...] for multi-subgraph models
- Refactor _fix_model_name, _generate_graph_hash, is_duplicate_sample
  to use the helper, eliminating hardcoded subgraph_xxx globs
- is_duplicate_sample now collects hashes from all subgraphs uniformly
  via frozenset comparison, regardless of model layout
@luotao1 luotao1 merged commit ce9453d into PaddlePaddle:develop May 20, 2026
3 checks passed
@luotao1 luotao1 deleted the hash branch May 20, 2026 08:36