feat: implement skill-evaluator loop and benchmark aggregation framework by trszhang · Pull Request #5 · ustc-table-mining/TabClaw

trszhang · 2026-05-22T09:07:30Z

Dear TabClaw authors,

I wanted to sync on a recent contribution I completed for TabClaw, which primarily focuses on establishing a closed evaluation loop after a user uploads a custom skill.

PR Summary
This change introduces the skill-evaluator. Once a user uploads a ZIP skill, they can now use this tool directly to run structured evaluations and generate iterative feedback, improving the verifiability and maintainability of custom skills.

Motivation
In the current workflow, users can import skills but lack a clear, reusable "evaluation entry point." This leads to several issues:

- Quality verification post-upload relies on manual, ad-hoc checks.
- There is no unified review path for testing skill trigger-ability, stability, and output quality.
- The "import -> try out -> improve" loop for skills is fragmented.

The goal of skill-evaluator is to bridge this gap, allowing users to directly evaluate their uploaded skills instead of stopping at just "can import, can invoke."

What Changed

- Package loading path support improved: Adjusted the skill registry loading strategy so that any skills/*/SKILL.md within the repository is recognized as a usable package skill, rather than relying solely on data/skills.
- Coexistence & Protection: Maintained support for user-imported directories, so uploaded ZIP skills and built-in skills remain compatible. Added naming-conflict priority handling and read-only protection to prevent built-in packages from being accidentally deleted or overwritten during upgrades.
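
As a rough illustration of the loading behavior described above, here is a minimal sketch. It is not the code in this PR: the `SkillEntry` type, the function names, the assumption that user-imported skills also carry a SKILL.md, and the rule that a built-in skill wins a name conflict are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class SkillEntry:
    """Hypothetical registry record; not the project's actual type."""
    name: str
    path: Path
    read_only: bool  # True for built-in package skills


def discover_skills(package_root: Path, user_root: Path) -> dict[str, SkillEntry]:
    """Scan both roots for */SKILL.md and build a name -> entry mapping."""
    registry: dict[str, SkillEntry] = {}

    # User-imported skills (e.g. data/skills) are registered first.
    if user_root.is_dir():
        for skill_md in sorted(user_root.glob("*/SKILL.md")):
            name = skill_md.parent.name
            registry[name] = SkillEntry(name, skill_md.parent, read_only=False)

    # Built-in package skills (e.g. skills/) are registered last, so in this
    # sketch they take priority on a name conflict (an assumption).
    if package_root.is_dir():
        for skill_md in sorted(package_root.glob("*/SKILL.md")):
            name = skill_md.parent.name
            registry[name] = SkillEntry(name, skill_md.parent, read_only=True)

    return registry


def remove_skill(registry: dict[str, SkillEntry], name: str) -> None:
    """Drop a skill from the registry, refusing to touch read-only entries."""
    entry = registry[name]
    if entry.read_only:
        raise PermissionError(f"built-in skill {name!r} is read-only")
    del registry[name]
```

Calling `discover_skills(Path("skills"), Path("data/skills"))` would then return one entry per skill directory, and `remove_skill` would raise `PermissionError` for any built-in entry.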

Benchmark Design & Methodology

Single Run Scoring (The Baseline)
Every evaluation run generates a grading.json file. The core logic evaluates each expectation item by item to determine a passed: true/false status, which is then aggregated:

- Metrics: passed, failed, total
- Core Accuracy Metric: pass_rate = passed / total

Source: This accuracy framework is defined in skills/skill-evaluator/agents/grader.md and skills/skill-evaluator/references/schemas.md.
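
To make the single-run scoring concrete, here is a minimal sketch of the aggregation step. The shape of the expectation records (a list of dicts with a boolean `passed` key) and the choice to report a pass rate of 0.0 for an empty list are assumptions for illustration; the authoritative schema is the one in schemas.md.

```python
def summarize_grading(expectations: list[dict]) -> dict:
    """Aggregate per-expectation results into passed/failed/total/pass_rate."""
    total = len(expectations)
    # Only an explicit True counts as a pass; missing or other values fail.
    passed = sum(1 for item in expectations if item.get("passed") is True)
    failed = total - passed
    # Assumption: an empty expectation list yields 0.0 rather than an error.
    pass_rate = passed / total if total else 0.0
    return {
        "passed": passed,
        "failed": failed,
        "total": total,
        "pass_rate": pass_rate,
    }


# Illustrative input: three expectations, two of which passed.
example = [
    {"text": "output is a table", "passed": True},
    {"text": "header row present", "passed": True},
    {"text": "no empty cells", "passed": False},
]
summary = summarize_grading(example)
```

With this input, `summary` reports 2 passed, 1 failed, 3 total, and a pass rate of 2/3.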

Single Run Efficiency Metrics
In addition to accuracy, each individual run logs quantitative efficiency data to track cost and stability:

- time_seconds: Extracted from timing.json.total_duration_seconds.
- tokens: Taken from timing.json.total_tokens; if that field is unavailable, execution_metrics.output_chars is used as a proxy.
- tool_calls: Total count of tools invoked.
- errors: Total error count.
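
The token fallback is the only non-obvious rule here, so a small sketch may help. It assumes both inputs have already been parsed into dicts and that a missing duration defaults to 0.0; the function name and these defaults are hypothetical, not taken from the script.

```python
def extract_time_and_tokens(timing: dict, execution_metrics: dict) -> dict:
    """Read time and tokens for one run, applying the output_chars fallback."""
    tokens = timing.get("total_tokens")
    if tokens is None:
        # Fallback described above: use output character count as a proxy.
        tokens = execution_metrics.get("output_chars", 0)
    return {
        "time_seconds": timing.get("total_duration_seconds", 0.0),
        "tokens": tokens,
    }


# A run with token data, and one where only output_chars is available.
with_tokens = extract_time_and_tokens(
    {"total_duration_seconds": 12.5, "total_tokens": 900},
    {"output_chars": 4000},
)
without_tokens = extract_time_and_tokens(
    {"total_duration_seconds": 8.0},
    {"output_chars": 4000},
)
```

Note that the proxy is in characters, not tokens, so runs that used the fallback are not directly comparable with runs that reported real token counts.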

Purpose: These act as cost/stability guardrails. They help weigh trade-offs, for example whether a higher pass rate justifies an increase in latency and token consumption.

Multi-Run Aggregation into Benchmarks (The Core Logic)
The aggregation script located at skills/skill-evaluator/scripts/aggregate_benchmark.py compiles statistics across all runs for each specific configuration (e.g., with_skill vs. without_skill):

Statistical Metrics: Mean, Standard Deviation (StdDev), Min, Max.

Note: StdDev is the sample standard deviation (denominator n - 1), not the population standard deviation, so it is only defined when a configuration has at least two runs.

Formulas used for aggregation:
- pass_rate_mean = sum(pass_rate_i) / n
- time_mean = sum(time_i) / n
- tokens_mean = sum(tokens_i) / n
- stddev = sqrt(sum((x_i - mean)^2) / (n - 1))
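
The formulas above map directly onto Python's standard `statistics` module, whose `stdev` function uses the n - 1 denominator. The sketch below is illustrative rather than a copy of aggregate_benchmark.py; in particular, returning 0.0 for a single run is an assumption of this example.

```python
import statistics


def aggregate(values: list[float]) -> dict:
    """Compute mean, sample stddev, min, and max for one metric across runs."""
    if not values:
        raise ValueError("need at least one run to aggregate")
    return {
        "mean": statistics.fmean(values),
        # statistics.stdev is the sample standard deviation (n - 1) and
        # requires two or more points; fall back to 0.0 for a single run.
        "stddev": statistics.stdev(values) if len(values) >= 2 else 0.0,
        "min": min(values),
        "max": max(values),
    }


# Pass rates from three hypothetical runs of one configuration.
stats = aggregate([0.5, 1.0, 0.75])
```

For these three values the mean is 0.75 and the sample standard deviation is sqrt((0.0625 + 0.0625 + 0) / 2) = 0.25.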

A/B Testing Comparison (With vs. Without Skill)
The final run_summary automatically calculates the performance delta between configurations:

- delta.pass_rate = mean(with_skill.pass_rate) - mean(without_skill.pass_rate)
- delta.time_seconds = mean(with_skill.time) - mean(without_skill.time)
- delta.tokens = mean(with_skill.tokens) - mean(without_skill.tokens)
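
A minimal sketch of the delta computation follows. It assumes each run is a dict with `pass_rate`, `time_seconds`, and `tokens` keys; the function name and the sample numbers are made up for illustration and are not benchmark results.

```python
from statistics import fmean

METRICS = ("pass_rate", "time_seconds", "tokens")


def compute_delta(with_skill: list[dict], without_skill: list[dict]) -> dict:
    """Return mean(with_skill) - mean(without_skill) for each metric."""
    return {
        metric: fmean([run[metric] for run in with_skill])
        - fmean([run[metric] for run in without_skill])
        for metric in METRICS
    }


# Made-up runs, two per configuration.
with_runs = [
    {"pass_rate": 1.0, "time_seconds": 12.0, "tokens": 1200},
    {"pass_rate": 0.5, "time_seconds": 8.0, "tokens": 800},
]
without_runs = [
    {"pass_rate": 0.5, "time_seconds": 6.0, "tokens": 500},
    {"pass_rate": 0.5, "time_seconds": 4.0, "tokens": 700},
]
delta = compute_delta(with_runs, without_runs)
```

Here the skill raises the mean pass rate by 0.25 at a cost of 5 extra seconds and 400 extra tokens per run. A positive delta.pass_rate is good, while positive time and token deltas represent added cost.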

Impact: This directly answers three critical questions for any skill:
1. Does it improve quality? (Look at delta.pass_rate)
2. What is the added cost? (Look at delta.time_seconds / delta.tokens)
3. Is it reliable? (Look at the size of the stddev)

Why This Quantitatively Works
Rather than relying on a single run, the benchmark reports Quality (pass_rate), Cost (time/tokens), and Stability (stddev) for the skill against a control configuration (with_skill vs. without_skill). Averaging over multiple runs helps distinguish a genuine net benefit from a single lucky run.

Result / User-Facing Impact
After uploading a ZIP skill, users can now immediately trigger an evaluation process via skill-evaluator to receive:

- Guided evaluation steps (testing, feedback, iteration).
- Actionable quality check recommendations (including benchmark/eval directions).
- A path for continuous improvement (rather than a one-off trial run).

Please let me know your thoughts or if you'd like to do a quick walkthrough of the aggregation script!

Best regards,
Fengyi Zhang

feat: implement skill-evaluator loop and benchmark aggregation framework

7666da2

trszhang wants to merge 1 commit into ustc-table-mining:main from trszhang:main
