feat: implement skill-evaluator loop and benchmark aggregation framework #5

Open
trszhang wants to merge 1 commit into
ustc-table-mining:main from
trszhang:main

@trszhang

Dear TabClaw authors,

I wanted to sync on a recent contribution I completed for TabClaw, which primarily focuses on establishing a closed evaluation loop after a user uploads a custom skill.

PR Summary
This change introduces the skill-evaluator. Once a user uploads a ZIP skill, they can now use this tool directly to run structured evaluations and generate iterative feedback, improving the verifiability and maintainability of custom skills.

Motivation
In the current workflow, users can import skills but lack a clear, reusable "evaluation entry point." This leads to several issues:

  • Quality verification post-upload relies on manual, ad-hoc checks.
  • There is no unified review path for testing skill trigger-ability, stability, and output quality.
  • The "import -> try out -> improve" loop for skills is fragmented.

The goal of skill-evaluator is to bridge this gap, allowing users to directly evaluate their uploaded skills instead of stopping at just "can import, can invoke."

What Changed

  • Package loading path support improved: Adjusted the skill registry loading strategy so that any skills/*/SKILL.md within the repository is recognized as a usable package skill (the registry no longer relies solely on data/skills).
  • Coexistence & Protection: Maintained support for user-imported directories, ensuring compatibility between uploaded ZIP skills and built-in skills. Handled naming conflict priority and read-only protection to prevent built-in packages from being accidentally deleted or overwritten during upgrades.

Benchmark Design & Methodology

  1. Single Run Scoring (The Baseline)
    Every evaluation run generates a grading.json file. The core logic evaluates each expectation item by item to determine a passed: true/false status, which is then aggregated:
  • Metrics: passed, failed, total
  • Core Accuracy Metric: pass_rate = passed / total
  • Source: This accuracy framework is defined in skills/skill-evaluator/agents/grader.md and skills/skill-evaluator/references/schemas.md.
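
The single-run scoring described above can be sketched as follows. This is a minimal illustration, not the code in the PR: the function name `summarize_expectations` is hypothetical, it assumes the graded expectations are available as a list of objects with a boolean `passed` field, and returning a pass rate of 0.0 for an empty list is an assumption made here to avoid division by zero.

```python
from typing import Any


def summarize_expectations(expectations: list[dict[str, Any]]) -> dict[str, Any]:
    """Aggregate item-by-item grading results into passed/failed/total/pass_rate.

    Assumption: each expectation is a dict carrying a boolean "passed" key.
    """
    total = len(expectations)
    # Count only items explicitly graded as passed; anything else counts as failed.
    passed = sum(1 for item in expectations if item.get("passed") is True)
    failed = total - passed
    # Assumption: an empty expectation list yields a pass rate of 0.0.
    pass_rate = passed / total if total else 0.0
    return {"passed": passed, "failed": failed, "total": total, "pass_rate": pass_rate}
```

For example, three passing expectations out of four give `pass_rate = 3 / 4 = 0.75`.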
  2. Single Run Efficiency Metrics
    In addition to accuracy, each individual run logs quantitative efficiency data to track cost and stability:
  • time_seconds: Extracted from timing.json.total_duration_seconds.
  • tokens: Primarily uses timing.json.total_tokens (falls back to execution_metrics.output_chars as a proxy if unavailable).
  • tool_calls: Total count of tools invoked.
  • errors: Total error count.
  • Purpose: These act as cost/stability guardrails. They help weigh trade-offs, for example whether a higher pass rate justifies an increase in latency and token consumption.
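
The token fallback described above might look roughly like this. It is a sketch under stated assumptions: the function name `extract_efficiency` and the `tokens_is_proxy` flag are hypothetical, and the two inputs are assumed to be the already-parsed contents of timing.json and an `execution_metrics` object. Only the two fields whose sources are stated above are covered; `tool_calls` and `errors` are omitted because their source fields are not given here.

```python
from typing import Any


def extract_efficiency(timing: dict[str, Any], execution_metrics: dict[str, Any]) -> dict[str, Any]:
    """Collect time and token usage for one run, with the output_chars fallback.

    Assumption: `timing` is parsed timing.json and `execution_metrics` holds
    an "output_chars" count that can stand in for tokens when needed.
    """
    tokens = timing.get("total_tokens")
    tokens_is_proxy = tokens is None
    if tokens_is_proxy:
        # Fall back to the character count as a rough proxy for token usage.
        tokens = execution_metrics.get("output_chars", 0)
    return {
        "time_seconds": timing["total_duration_seconds"],
        "tokens": tokens,
        # Hypothetical flag: records that the value is a proxy, not a true token count.
        "tokens_is_proxy": tokens_is_proxy,
    }
```

Flagging proxy values is a design choice of this sketch: character counts and token counts are on different scales, so mixing them silently in one mean could distort the comparison.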
  3. Multi-Run Aggregation into Benchmarks (The Core Logic)
    The aggregation script located at skills/skill-evaluator/scripts/aggregate_benchmark.py compiles statistics across all runs for each specific configuration (e.g., with_skill vs. without_skill):
  • Statistical Metrics: Mean, Standard Deviation (StdDev), Min, Max.
  • Note: Standard deviation uses sample standard deviation (denominator is n-1), not population standard deviation.
  • Formulas used for aggregation:
    • pass_rate_mean = sum(pass_rate_i) / n
    • time_mean = sum(time_i) / n
    • tokens_mean = sum(tokens_i) / n
    • stddev = sqrt(sum((x_i - mean)^2) / (n - 1))
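
The formulas above map directly onto Python's standard `statistics` module, whose `stdev` function uses the n-1 denominator. The sketch below is illustrative and is not taken from aggregate_benchmark.py: the function name `describe` is hypothetical, and reporting a standard deviation of 0.0 for a single run is an assumption, since the sample formula is undefined when n = 1.

```python
import statistics


def describe(values: list[float]) -> dict[str, float]:
    """Return mean, sample standard deviation, min, and max for one metric."""
    if not values:
        raise ValueError("at least one run is required")
    return {
        "mean": statistics.fmean(values),
        # statistics.stdev divides by (n - 1); it needs at least two data points.
        # Assumption: a single run is reported with stddev 0.0.
        "stddev": statistics.stdev(values) if len(values) > 1 else 0.0,
        "min": min(values),
        "max": max(values),
    }
```

For pass rates of 0.5, 0.75, and 1.0 the mean is 0.75 and the sample standard deviation is sqrt((0.0625 + 0 + 0.0625) / 2) = 0.25.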
  4. A/B Testing Comparison (With vs. Without Skill)
    The final run_summary automatically calculates the performance delta between configurations:
  • delta.pass_rate = mean(with_skill.pass_rate) - mean(without_skill.pass_rate)
  • delta.time_seconds = mean(with_skill.time) - mean(without_skill.time)
  • delta.tokens = mean(with_skill.tokens) - mean(without_skill.tokens)
  • Impact: This directly answers three critical questions for any skill:
    1. Does it improve quality? (Look at delta.pass_rate)
    2. What is the added cost? (Look at delta.time_seconds / delta.tokens)
    3. Is it reliable? (Look at the size of the stddev)
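
The delta calculation above can be sketched as a difference of per-configuration means. The function name `compute_delta` and the input layout (a mapping from metric name to the list of per-run values) are assumptions for illustration; the actual run_summary structure may differ.

```python
from statistics import fmean

# Metric names follow the delta fields listed above.
METRICS = ("pass_rate", "time_seconds", "tokens")


def compute_delta(
    with_skill: dict[str, list[float]],
    without_skill: dict[str, list[float]],
) -> dict[str, float]:
    """Return mean(with_skill) - mean(without_skill) for each metric.

    Assumption: each argument maps a metric name to its per-run values.
    """
    return {
        metric: fmean(with_skill[metric]) - fmean(without_skill[metric])
        for metric in METRICS
    }
```

A positive `pass_rate` delta indicates a quality gain from the skill, while positive `time_seconds` and `tokens` deltas indicate the extra cost paid for it.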
  5. Why This Quantitatively Works
    Instead of relying on a single, isolated run, this system comprehensively evaluates Quality (pass_rate) + Cost (time/tokens) + Stability (stddev) against a strict control group (with vs. without). This isolates whether a custom skill introduces a true net benefit or if a good run was simply a fluke.

Result / User-Facing Impact
After uploading a ZIP skill, users can now immediately trigger an evaluation process via skill-evaluator to receive:

  • Guided evaluation steps (testing, feedback, iteration).
  • Actionable quality check recommendations (including benchmark/eval directions).
  • A path for continuous improvement (rather than a one-off trial run).

Please let me know your thoughts or if you'd like to do a quick walkthrough of the aggregation script!

Best regards,
Fengyi Zhang
