Issue #6: [Eval] Define skill evaluation metrics #19
Open
alijiujiu123 wants to merge 1 commit into main
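- Create agent/skills/base.ts, defining the core types (Skill, SkillMetadata, SkillMetrics, etc.)
- Create agent/evaluation/metrics.ts, implementing the metric collection and scoring logic
- Create the metrics.yaml configuration file with default thresholds and weights
- Define 5 core metrics: success_rate, avg_cost, avg_latency, rollback_rate, stability_score
- Implement the metric collection logic (MetricsCollector)
- Implement the metric scoring logic (MetricsScorer)
- Support data-driven promotion/retirement decisions

Related issue: #6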
Fixes #6
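Overview
This PR completes all tasks in Issue #6: it defines a Skill evaluation metrics system that supports data-driven promotion/retirement decisions.
Completed tasks
1. Create core type definitions (agent/skills/base.ts)
The following core types are defined:
- SkillMetadata: Skill metadata interface
- SkillContext: execution context
- SkillResult: execution result
- ExecutionMetrics: execution metrics
- SkillMetrics: aggregate Skill statistics
- Skill: Skill interface
- ScoreWeights: scoring weights
- Thresholds: configured thresholds
- MetricsConfig: metrics configuration
2. Implement metric collection and scoring logic (agent/evaluation/metrics.ts)
MetricsCollector (metrics collector)
- recordExecution(): record an execution result
- calculateMetrics(): compute a Skill's metrics
- clearRecords(): clear the stored records
MetricsScorer (metrics scorer)
- calculateScore(): compute the composite score
- canPromote(): decide whether a Skill can be promoted
- shouldRetire(): decide whether a Skill should be retired
3. Create the configuration file (metrics.yaml)
It contains the following configuration: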
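The individual configuration entries are not listed here. Purely as an illustration of how weights and thresholds of this kind might feed calculateScore(), canPromote(), and shouldRetire(), here is a minimal TypeScript sketch. Every field name, the normalization of cost and latency against a maximum, and all numeric values are assumptions made for the example; they are not the contents of metrics.yaml or the code in agent/evaluation/metrics.ts.

```typescript
// Hypothetical shapes; the real SkillMetrics, ScoreWeights, and Thresholds may differ.
interface MetricsSnapshot {
  executions: number;
  success_rate: number;
  avg_cost: number;
  avg_latency: number; // assumed to be in milliseconds
  rollback_rate: number;
  stability_score: number;
}

interface WeightsSketch {
  success_rate: number;
  avg_cost: number;
  avg_latency: number;
  rollback_rate: number;
  stability_score: number;
}

interface ThresholdsSketch {
  promoteMinScore: number; // score at or above which promotion is allowed
  retireMaxScore: number;  // score at or below which retirement is suggested
  minExecutions: number;   // minimum sample size before any decision
  maxCost: number;         // cost that maps to a cost sub-score of 0
  maxLatencyMs: number;    // latency that maps to a latency sub-score of 0
}

const clamp01 = (x: number): number => Math.min(Math.max(x, 0), 1);

// Weighted average of five sub-scores, each mapped to [0, 1] where higher is better.
function calculateScoreSketch(m: MetricsSnapshot, w: WeightsSketch, t: ThresholdsSketch): number {
  const parts: Array<[number, number]> = [
    [w.success_rate, clamp01(m.success_rate)],
    [w.avg_cost, 1 - clamp01(m.avg_cost / t.maxCost)],
    [w.avg_latency, 1 - clamp01(m.avg_latency / t.maxLatencyMs)],
    [w.rollback_rate, 1 - clamp01(m.rollback_rate)],
    [w.stability_score, clamp01(m.stability_score)],
  ];
  const totalWeight = parts.reduce((sum, [weight]) => sum + weight, 0);
  const weighted = parts.reduce((sum, [weight, value]) => sum + weight * value, 0);
  return totalWeight > 0 ? weighted / totalWeight : 0;
}

function canPromoteSketch(m: MetricsSnapshot, w: WeightsSketch, t: ThresholdsSketch): boolean {
  return m.executions >= t.minExecutions && calculateScoreSketch(m, w, t) >= t.promoteMinScore;
}

function shouldRetireSketch(m: MetricsSnapshot, w: WeightsSketch, t: ThresholdsSketch): boolean {
  return m.executions >= t.minExecutions && calculateScoreSketch(m, w, t) <= t.retireMaxScore;
}

// Illustrative values only.
const weights: WeightsSketch = {
  success_rate: 0.4, avg_cost: 0.15, avg_latency: 0.15, rollback_rate: 0.15, stability_score: 0.15,
};
const thresholds: ThresholdsSketch = {
  promoteMinScore: 0.8, retireMaxScore: 0.4, minExecutions: 20, maxCost: 10, maxLatencyMs: 1000,
};

const healthy: MetricsSnapshot = {
  executions: 50, success_rate: 0.95, avg_cost: 2, avg_latency: 200, rollback_rate: 0.02, stability_score: 0.9,
};
const failing: MetricsSnapshot = {
  executions: 50, success_rate: 0.3, avg_cost: 9, avg_latency: 900, rollback_rate: 0.5, stability_score: 0.4,
};
const tooNew: MetricsSnapshot = { ...healthy, executions: 5 };

const healthyScore = calculateScoreSketch(healthy, weights, thresholds); // about 0.902
const failingScore = calculateScoreSketch(failing, weights, thresholds); // about 0.285
```

In this sketch a Skill with too few executions is neither promoted nor retired, which keeps a handful of lucky or unlucky runs from driving the decision. Whether the PR applies such a guard is not stated here.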
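4. Core metric definitions
Five core metrics are implemented, each with an explicit formula:
success_rate (success rate)
success_rate = number of successful executions / total number of executions
avg_cost (average cost)
avg_cost = sum of all execution costs / number of executions
avg_latency (average latency)
avg_latency = sum of all execution latencies / number of executions
rollback_rate (rollback rate)
rollback_rate = number of rollbacks / total number of executions
stability_score (stability score)
stability_score = 1 - standard deviation (normalized)
Acceptance criteria check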
Tests
Complete unit tests were added in agent/evaluation/metrics.test.ts, covering:
A validation script was also used to verify that all functionality works as expected.
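To make the five formulas from section 4 concrete, the following is a minimal TypeScript sketch that computes them on a small sample whose results are easy to verify by hand. The record shape and the function name are hypothetical. The PR also does not say which quantity the standard deviation is taken over or how it is normalized, so the sketch assumes the population standard deviation of latency divided by the mean latency and capped at 1.

```typescript
// Hypothetical record shape; the real ExecutionMetrics in agent/skills/base.ts may differ.
interface ExecutionRecord {
  success: boolean;
  cost: number;
  latencyMs: number;
  rolledBack: boolean;
}

interface SkillMetricsSketch {
  success_rate: number;
  avg_cost: number;
  avg_latency: number;
  rollback_rate: number;
  stability_score: number;
}

function mean(xs: number[]): number {
  return xs.reduce((a, b) => a + b, 0) / xs.length;
}

function calculateMetricsSketch(records: ExecutionRecord[]): SkillMetricsSketch {
  if (records.length === 0) {
    throw new Error("no executions recorded");
  }
  const n = records.length;
  const latencies = records.map((r) => r.latencyMs);
  const avgLatency = mean(latencies);

  // Population standard deviation of latency.
  const variance = mean(latencies.map((x) => (x - avgLatency) * (x - avgLatency)));
  const std = Math.sqrt(variance);

  // Assumption: normalize by the mean (coefficient of variation) and cap at 1,
  // so that stability_score stays within [0, 1].
  const normalizedStd = avgLatency > 0 ? Math.min(std / avgLatency, 1) : 0;

  return {
    success_rate: records.filter((r) => r.success).length / n,
    avg_cost: mean(records.map((r) => r.cost)),
    avg_latency: avgLatency,
    rollback_rate: records.filter((r) => r.rolledBack).length / n,
    stability_score: 1 - normalizedStd,
  };
}

// Four executions: three successes, one rollback, identical latency.
const sample: ExecutionRecord[] = [
  { success: true, cost: 2, latencyMs: 100, rolledBack: false },
  { success: true, cost: 4, latencyMs: 100, rolledBack: false },
  { success: false, cost: 6, latencyMs: 100, rolledBack: true },
  { success: true, cost: 8, latencyMs: 100, rolledBack: false },
];

const sampleMetrics = calculateMetricsSketch(sample);
// success_rate = 3 / 4 = 0.75, avg_cost = 20 / 4 = 5, avg_latency = 400 / 4 = 100,
// rollback_rate = 1 / 4 = 0.25, stability_score = 1 - 0 = 1 (no latency spread)
```

Throwing on an empty record set is one way to avoid dividing by zero. Returning zeroed metrics would be an equally plausible choice, and the PR does not say which one it makes.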
File changes
- agent/skills/base.ts: core type definitions
- agent/evaluation/metrics.ts: metric collection and scoring logic
- agent/evaluation/metrics.test.ts: unit tests
- metrics.yaml: configuration file
Related documentation
docs/v2/architecture.md