This repository was archived by the owner on Jun 8, 2026. It is now read-only.
Distribution from the v1 prompt in evals/scripts/classify-cc-stops.mjs and claude/lib/judge.mjs:
| Category | Count | % |
| --- | --- | --- |
| working | 374 | 41% |
| complete | 261 | 29% |
| waiting_for_user_legitimate | 210 | 23% |
| summary_drift_stop | 35 | 4% |
| genuinely_stuck | 27 | 3% |
| tool_available_punt | 0 | 0% |
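As a quick sanity check on the distribution, the sketch below recomputes the total and the rounded percentage share from the raw counts. The counts are copied from the issue; the `distribution` helper is a hypothetical name introduced here for illustration and is not part of `classify-cc-stops.mjs`.

```javascript
// Counts copied from the baseline distribution in this issue.
const counts = {
  working: 374,
  complete: 261,
  waiting_for_user_legitimate: 210,
  summary_drift_stop: 35,
  genuinely_stuck: 27,
  tool_available_punt: 0,
};

// Hypothetical helper: total record count plus each category's share,
// rounded to a whole percent.
function distribution(categoryCounts) {
  const total = Object.values(categoryCounts).reduce((sum, n) => sum + n, 0);
  const shares = Object.fromEntries(
    Object.entries(categoryCounts).map(([label, n]) => [
      label,
      Math.round((n / total) * 100),
    ])
  );
  return { total, shares };
}

const { total, shares } = distribution(counts);
console.log(total, shares);
```

The total should match the n=907 corpus size quoted for the baseline.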
Problems
working is over-assigned. At Stop time the agent is by definition not working. The classifier appears to interpret just-finished-action summaries ("I've created the file. Now run tests.") as "still working." This is exactly the content that should be labeled summary_drift_stop.
tool_available_punt never assigned. Heuristic filter surfaced 26 candidates with hint:punt (PUNT_PATTERNS matched). Classifier reassigned every one. Either the pattern is rare in this user's data, or the prompt fails to discriminate it from waiting_for_user_legitimate.
Proposed fixes
Add 2–4 few-shot examples per category from the redacted gold file (evals/datasets/cc-stop-labeled-gold-redacted.jsonl).
Add an explicit anti-pattern to the prompt: "AT STOP, 'working' is almost never correct. If the agent appears to still be working, prefer summary_drift_stop (claimed a next step but stopped) or genuinely_stuck (no closure)."
Add a discriminator clause for tool_available_punt vs waiting_for_user_legitimate: "If the user's question could be answered by any tool in TOOLS THE ASSISTANT HAD, prefer tool_available_punt. Use waiting_for_user_legitimate only when no tool could give the answer."
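The three fixes above could be combined roughly as follows. This is a minimal sketch, not the actual v1 prompt code: the clause text is quoted from this issue, but `pickFewShots`, `buildPrompt`, the `{ text, label }` record shape, and the `MESSAGE:`/`LABEL:` layout are assumptions made for illustration. The real schema of the gold JSONL file and the real prompt structure may differ.

```javascript
// Clause text quoted from the proposed fixes in this issue.
const ANTI_PATTERN =
  "AT STOP, 'working' is almost never correct. If the agent appears to still " +
  "be working, prefer summary_drift_stop (claimed a next step but stopped) or " +
  "genuinely_stuck (no closure).";

const DISCRIMINATOR =
  "If the user's question could be answered by any tool in TOOLS THE ASSISTANT " +
  "HAD, prefer tool_available_punt. Use waiting_for_user_legitimate only when " +
  "no tool could give the answer.";

// Hypothetical: keep at most `perCategory` examples for each label.
// Assumes each gold record looks like { text, label }.
function pickFewShots(records, perCategory = 3) {
  const byLabel = new Map();
  for (const record of records) {
    const bucket = byLabel.get(record.label) ?? [];
    if (bucket.length < perCategory) bucket.push(record);
    byLabel.set(record.label, bucket);
  }
  return [...byLabel.values()].flat();
}

// Hypothetical: assemble clauses, few-shot examples, and the message to classify.
function buildPrompt(goldRecords, stopMessage, perCategory = 3) {
  const examples = pickFewShots(goldRecords, perCategory)
    .map((r) => `MESSAGE: ${r.text}\nLABEL: ${r.label}`)
    .join("\n\n");
  return [
    ANTI_PATTERN,
    DISCRIMINATOR,
    "EXAMPLES:",
    examples,
    `MESSAGE: ${stopMessage}`,
    "LABEL:",
  ].join("\n\n");
}
```

Capping examples per label keeps the prompt size bounded while still giving rare categories such as tool_available_punt the same representation as common ones.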
Acceptance
F1 ≥ 0.75 on summary_drift_stop and tool_available_punt against an expanded gold set (≥ 60 records, ≥ 8 per category).
working share drops below 5% on the same 907-record corpus.
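The acceptance criteria can be checked mechanically. The sketch below assumes classifier output is available as `{ gold, predicted }` label pairs for the gold set and as a flat array of predicted labels for the 907-record corpus; those shapes and the function names are hypothetical, chosen only to illustrate the thresholds stated above.

```javascript
// Per-label F1 from { gold, predicted } pairs: 2*TP / (2*TP + FP + FN).
// Returns 0 when the label never appears in gold or predictions.
function f1ForLabel(pairs, label) {
  let tp = 0;
  let fp = 0;
  let fn = 0;
  for (const { gold, predicted } of pairs) {
    if (predicted === label && gold === label) tp++;
    else if (predicted === label) fp++;
    else if (gold === label) fn++;
  }
  const denom = 2 * tp + fp + fn;
  return denom === 0 ? 0 : (2 * tp) / denom;
}

// Fraction of predicted labels equal to `label`.
function shareOf(labels, label) {
  if (labels.length === 0) return 0;
  return labels.filter((l) => l === label).length / labels.length;
}

// Gold set size requirement: at least 60 records, at least 8 per category.
function goldSetLargeEnough(pairs, categories) {
  const perCategoryOk = categories.every(
    (c) => pairs.filter((p) => p.gold === c).length >= 8
  );
  return pairs.length >= 60 && perCategoryOk;
}

// Thresholds taken from the Acceptance section of this issue.
function meetsAcceptance(goldPairs, corpusLabels) {
  const f1Ok = ["summary_drift_stop", "tool_available_punt"].every(
    (label) => f1ForLabel(goldPairs, label) >= 0.75
  );
  return f1Ok && shareOf(corpusLabels, "working") < 0.05;
}
```

Returning 0 for a label with no true positives, false positives, or false negatives is a deliberate choice here: a category that is never predicted, as tool_available_punt currently is, should fail the check rather than pass vacuously.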
Notes
Reproduce baseline with node evals/scripts/classify-cc-stops.mjs (uses OAuth Bearer from ~/.claude/.credentials.json against Anthropic API directly).
Add tool_available_punt few-shots from the user's earlier session examples (browser MCP available but agent punted, Bash available but agent asked).
Follow-up to #137.
The baseline distribution above (n=907) was observed on 2026-05-25.