This repository was archived by the owner on Jun 8, 2026. It is now read-only.
What
Improve the reflection plugin's task-completion judge prompt (evals/prompts/task-verification.txt, the eval mirror of the production reflection-3.ts self-assessment prompt) by adding patterns and antipatterns mined from real local agent sessions, validated with an automated eval loop.
Why
The reflection judge decides whether an agent's turn is actually complete or whether it stopped/idled prematurely. To ground that decision in reality (not guesswork), we mined 155 real OpenCode + Claude Code sessions from local stores to learn why agents actually stop or go idle.
Findings (155 sessions, classified by cheap haiku agents):
43/155 (28%) ended on a stop/idle antipattern.
Top antipatterns: stopped-with-todos-listed (14), false-complete (12), permission-seeking (9), analysis-no-implementation (4), plus silent tool-only loops.
9 sessions asked the user a question whose answer was obvious enough that the agent should have just proceeded; only 7 genuinely needed a human (reCAPTCHA, OAuth consent, login token).
These map directly onto judge guidance: recognize premature stops, false completes, permission-seeking, and analysis-without-action — while NOT penalizing legitimate human-blocked stops.
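The headline counts above could be reproduced from the per-session labels with a small tally. This is a minimal sketch, not the project's actual script: only the `antipattern_tags` field name comes from the classification schema, while the `session` key, the sample records, and the `tally` helper are hypothetical.

```python
from collections import Counter

# Hypothetical classification records. Only the `antipattern_tags` field name
# is taken from the label schema; the sessions and values are illustrative.
classifications = [
    {"session": "a", "antipattern_tags": ["stopped-with-todos-listed"]},
    {"session": "b", "antipattern_tags": ["false-complete", "permission-seeking"]},
    {"session": "c", "antipattern_tags": []},
    {"session": "d", "antipattern_tags": ["false-complete"]},
]


def tally(records):
    """Return (per-tag counts, number of sessions carrying at least one tag)."""
    counts = Counter(tag for r in records for tag in r["antipattern_tags"])
    flagged = sum(1 for r in records if r["antipattern_tags"])
    return counts, flagged


counts, flagged = tally(classifications)
share = flagged / len(classifications)  # fraction of sessions ending on an antipattern
```

A session can carry more than one tag, so per-tag counts need not sum to the number of flagged sessions. Applied to the real data, the same ratio is 43/155, which rounds to 28%.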
How
Extract (scripts/extract_sessions.py) — read OpenCode SQLite (~/.local/share/opencode/*.db) + Claude jsonl (~/.claude/projects) directly off disk into compact per-session digests. No LLM.
Classify (session-stop-analysis workflow) — one cheap haiku agent per session → structured labels (ended_by, antipattern_tags, question_was_answerable, user_followup, stop_reasoning, evidence_quote).
Dataset (scripts/build_dataset.py) — join classifications with raw messages → .dataset/{session.id}.xml (user_input, ai_output, user_followup, classification). Git-ignored (personal data). 155 well-formed entries. No LLM.
Eval loop:
scripts/eval_render.py renders the 34 promptfoo judge cases (evals/promptfooconfig.yaml) into per-case prompts.
scripts/judge-eval.wf.js runs each case through cheap judge agents 3× with majority-vote to denoise (single-run haiku verdicts proved too noisy).
scripts/eval_score.mjs evaluates the original promptfoo JS assertions against the consensus verdicts → pass rate + failures + vote splits.
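The denoising step can be illustrated with a plain majority vote. This is a Python sketch of the idea only, not the contents of scripts/judge-eval.wf.js; the `consensus` helper and the verdict labels `complete` and `incomplete` are assumptions made for illustration.

```python
from collections import Counter


def consensus(votes):
    """Return (majority verdict, unanimous flag) for one case.

    Returns (None, False) when no verdict has a strict majority,
    which can happen if the judge emits more than two distinct labels.
    """
    top, n = Counter(votes).most_common(1)[0]
    if n * 2 <= len(votes):
        return None, False
    return top, n == len(votes)


# Three runs of the same case; labels are hypothetical.
split_case = consensus(["complete", "incomplete", "complete"])
unanimous_case = consensus(["incomplete", "incomplete", "incomplete"])
```

Keeping the unanimous flag alongside the verdict is what makes a figure like "24/34 unanimous" reportable: a case can pass on consensus while still showing a 2-1 vote split.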
Prompt changes (evals/prompts/task-verification.txt)
BLOCKER severity clarified: reflects the nature of the defect, not keywords in file/task names (a stuck test on auth.integration.test.ts is HIGH, not BLOCKER).
Human-action clarified: blocked-now (OAuth consent, paste token, 2FA) → requires_human_action: true; future/optional approval while real work remains → false.
New "Observed Stop/Idle Antipatterns" section (grounded in real sessions, with actual trigger phrases): stopped-with-todos, permission-seeking/verification-deferral, false-complete, analysis-no-implementation, stuck/retry/silent-tool-loop.
New "Legitimate stops" section so genuine human-blocked or evidence-backed completions aren't penalized.
Evidence requirements refined: a write/edit tool call is evidence of file creation; a reported test/build result summary counts as runtime evidence (no demand to re-dump raw stdout).
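The severity and human-action rules above could be exercised by checks of roughly the following shape. This is a Python stand-in for the promptfoo JS assertions, which are not shown here; the verdict dictionary layout and the `check_verdict` helper are assumptions, and only the `requires_human_action` field and the BLOCKER/HIGH labels come from the prompt changes themselves.

```python
def check_verdict(verdict, expect_human_action, forbidden_severities=()):
    """Return a list of failure messages; an empty list means the case passes.

    `verdict` is a hypothetical dict parsed from the judge's output.
    """
    failures = []
    if verdict.get("requires_human_action") != expect_human_action:
        failures.append(
            f"requires_human_action should be {expect_human_action}"
        )
    if verdict.get("severity") in forbidden_severities:
        failures.append(f"severity must not be {verdict['severity']}")
    return failures


# A stuck test on auth.integration.test.ts: HIGH, not BLOCKER, no human needed.
stuck_test = {"severity": "HIGH", "requires_human_action": False}
stuck_ok = check_verdict(stuck_test, False, forbidden_severities=("BLOCKER",))

# An agent blocked right now on OAuth consent must flag human action.
oauth_blocked = {"requires_human_action": False}
oauth_failures = check_verdict(oauth_blocked, True)
```

Returning messages rather than a bare boolean keeps the failure reason visible next to the vote split when a case regresses.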
Results (34-case judge eval, 3× majority-vote)
Baseline: 32/34 (94%, single-run, noisy).
After patterns + denoising: 34/34 (100%), 24/34 unanimous.
Infra notes
Local disk was 100% full; ~750 MB of regenerable cache had to be freed before anything could run.
All external eval providers were unavailable (.env Azure key expired, azure-dev resource has no deployments, xAI out of credits, Gemini free-tier quota 0), so the judge runs through harness agents instead of promptfoo's external provider call.
Token discipline
Dataset build is ~0 tokens (pure local Python, re-buildable for free). Cost is in the haiku classification (~11M tokens, one-time, cached) and eval re-runs (~7M tokens per full 3× round). Going forward: only re-judge cases touched by a change, not all 34×3.
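The saving from selective re-judging can be estimated with simple arithmetic. The sketch below assumes tokens are spread evenly across cases, which the figures above do not confirm; `rejudge_cost` is a hypothetical helper and the constants are the approximate numbers quoted in this section.

```python
TOTAL_CASES = 34
RUNS_PER_CASE = 3
FULL_ROUND_TOKENS = 7_000_000  # approximate cost of one full 3x round


def rejudge_cost(touched_cases):
    """Estimate tokens for re-judging only `touched_cases` cases, 3 runs each.

    Assumes every case costs the same, so cost scales linearly with the
    number of cases touched.
    """
    return round(FULL_ROUND_TOKENS * touched_cases / TOTAL_CASES)


# Average tokens per individual judge call under the same even-spread assumption.
per_call = FULL_ROUND_TOKENS / (TOTAL_CASES * RUNS_PER_CASE)
```

Under that assumption a change touching 5 cases costs roughly 1M tokens instead of ~7M, at about 69k tokens per judge call.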