This repository was archived by the owner on Jun 8, 2026. It is now read-only.

Improve reflection judge prompt with antipatterns mined from real sessions + eval loop #140

@OpenCodeEngineer

What

Improve the reflection plugin's task-completion judge prompt (evals/prompts/task-verification.txt, the eval mirror of the production reflection-3.ts self-assessment prompt) by adding patterns and antipatterns mined from real local agent sessions, validated with an automated eval loop.

Why

The reflection judge decides whether an agent's turn is actually complete or whether it stopped/idled prematurely. To ground that decision in reality (not guesswork), we mined 155 real OpenCode + Claude Code sessions from local stores to learn why agents actually stop or go idle.

Findings (155 sessions, classified by cheap haiku agents):

  • 43/155 (28%) ended on a stop/idle antipattern.
  • Top antipatterns: stopped-with-todos-listed (14), false-complete (12), permission-seeking (9), analysis-no-implementation (4), plus silent tool-only loops.
  • 9 sessions asked the user a question whose answer was obvious enough that the agent should have just proceeded; only 7 genuinely needed a human (reCAPTCHA, OAuth consent, login token).

These map directly onto judge guidance: recognize premature stops, false completes, permission-seeking, and analysis-without-action — while NOT penalizing legitimate human-blocked stops.
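The headline numbers above are simple tallies over the per-session labels. A minimal sketch of that tally, assuming each classification is a dict with an `antipattern_tags` list (the field name comes from the classifier output described in this issue; the sample rows and the "non-empty tag list means the session ended on an antipattern" rule are illustrative assumptions):

```python
from collections import Counter

# Hypothetical labels: made-up rows, not taken from the real dataset.
labels = [
    {"session": "a", "antipattern_tags": ["stopped-with-todos-listed"]},
    {"session": "b", "antipattern_tags": []},
    {"session": "c", "antipattern_tags": ["false-complete", "permission-seeking"]},
    {"session": "d", "antipattern_tags": []},
]


def tally(rows):
    """Return (number of sessions with at least one tag, per-tag counts)."""
    flagged = sum(1 for r in rows if r["antipattern_tags"])
    counts = Counter(tag for r in rows for tag in r["antipattern_tags"])
    return flagged, counts


flagged, counts = tally(labels)
share = round(100 * flagged / len(labels))
print(f"{flagged}/{len(labels)} ({share}%) ended on an antipattern")
for tag, n in counts.most_common():
    print(f"  {tag}: {n}")
```

Because a session can carry more than one tag, the per-tag counts need not sum to the number of flagged sessions.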

How

  1. Extract (scripts/extract_sessions.py) — read OpenCode SQLite (~/.local/share/opencode/*.db) + Claude jsonl (~/.claude/projects) directly off disk into compact per-session digests. No LLM.
  2. Classify (session-stop-analysis workflow) — one cheap haiku agent per session → structured labels (ended_by, antipattern_tags, question_was_answerable, user_followup, stop_reasoning, evidence_quote).
  3. Dataset (scripts/build_dataset.py) — join classifications with raw messages → .dataset/{session.id}.xml (user_input, ai_output, user_followup, classification). Git-ignored (personal data). 155 well-formed entries. No LLM.
  4. Eval loop:
    • scripts/eval_render.py renders the 34 promptfoo judge cases (evals/promptfooconfig.yaml) into per-case prompts.
    • scripts/judge-eval.wf.js runs each case through cheap judge agents 3× with majority-vote to denoise (single-run haiku verdicts proved too noisy).
    • scripts/eval_score.mjs evaluates the original promptfoo JS assertions against the consensus verdicts → pass rate + failures + vote splits.
    • Iterate: add/refine pattern → re-judge → score → fix regressions. Repeat until stable.
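The denoising step in the eval loop can be sketched as a majority vote over the three runs per case. This is an illustration only, not the code in scripts/judge-eval.wf.js or scripts/eval_score.mjs: the function names, the case ids, and the two-valued "complete"/"incomplete" verdict are assumptions, and the real scorer evaluates promptfoo JS assertions rather than a plain equality check.

```python
from collections import Counter


def consensus(votes):
    """Majority verdict over an odd number of runs, plus a unanimity flag.

    Assumes a two-valued verdict, so three runs always yield a strict majority.
    """
    winner, n = Counter(votes).most_common(1)[0]
    return winner, n == len(votes)


def score(runs, expected):
    """Compare consensus verdicts with expected ones.

    runs:     {case_id: [verdict, verdict, verdict]}
    expected: {case_id: verdict}
    Returns (passed count, unanimous count, failures with their vote splits).
    """
    passed, unanimous, failures = 0, 0, []
    for case_id, votes in runs.items():
        verdict, all_agree = consensus(votes)
        unanimous += all_agree
        if verdict == expected[case_id]:
            passed += 1
        else:
            failures.append((case_id, votes))
    return passed, unanimous, failures


# Hypothetical verdicts for three cases.
runs = {
    "case-00": ["complete", "complete", "complete"],
    "case-01": ["incomplete", "complete", "incomplete"],
    "case-02": ["complete", "complete", "incomplete"],
}
expected = {"case-00": "complete", "case-01": "incomplete", "case-02": "incomplete"}

passed, unanimous, failures = score(runs, expected)
print(f"{passed}/{len(runs)} passed, {unanimous}/{len(runs)} unanimous")
for case_id, votes in failures:
    print(f"  FAIL {case_id}: votes={votes}")
```

Keeping the vote split alongside each failure makes it easy to tell a consistently wrong verdict from a noisy 2:1 split.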

Prompt changes (evals/prompts/task-verification.txt)

  • BLOCKER severity clarified: reflects the nature of the defect, not keywords in file/task names (a stuck test on auth.integration.test.ts is HIGH, not BLOCKER).
  • Human-action clarified: blocked-now (OAuth consent, paste token, 2FA) → requires_human_action: true; future/optional approval while real work remains → false.
  • New "Observed Stop/Idle Antipatterns" section (grounded in real sessions, with actual trigger phrases): stopped-with-todos, permission-seeking/verification-deferral, false-complete, analysis-no-implementation, stuck/retry/silent-tool-loop.
  • New "Legitimate stops" section so genuine human-blocked or evidence-backed completions aren't penalized.
  • Evidence requirements refined: a write/edit tool call is evidence of file creation; a reported test/build result summary counts as runtime evidence (no demand to re-dump raw stdout).
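The human-action rule above can be read as a small decision table. The sketch below is one interpretation of that rule for illustration; the judge itself is an LLM prompt, and the `Blocker` enum, `StopContext` fields, and `premature` output are hypothetical names, not part of the prompt or its output schema (only `requires_human_action` appears in the source).

```python
from dataclasses import dataclass
from enum import Enum


class Blocker(Enum):
    NONE = "none"
    NEEDS_HUMAN_NOW = "needs_human_now"      # OAuth consent, pasting a token, 2FA
    OPTIONAL_APPROVAL = "optional_approval"  # future or optional sign-off


@dataclass
class StopContext:
    blocker: Blocker
    work_remaining: bool  # the agent still has work it could do on its own


def judge_stop(ctx: StopContext) -> dict:
    """Apply the blocked-now vs. future-approval distinction to one stop."""
    if ctx.blocker is Blocker.NEEDS_HUMAN_NOW:
        # Legitimate stop: nothing more can happen without the user.
        return {"requires_human_action": True, "premature": False}
    # Optional approval (or no blocker) is not a human block; stopping is
    # premature whenever real work remains.
    return {"requires_human_action": False, "premature": ctx.work_remaining}


print(judge_stop(StopContext(Blocker.NEEDS_HUMAN_NOW, work_remaining=True)))
print(judge_stop(StopContext(Blocker.OPTIONAL_APPROVAL, work_remaining=True)))
```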

Results (34-case judge eval, 3× majority-vote)

  • Baseline: 32/34 (94%, single-run, noisy).
  • After patterns + denoising: 34/34 (100%), 24/34 unanimous.
  • Tuning the evidence rule surfaced a fixable over/under-strictness tradeoff between the trivial file-creation case (#00) and the test-summary case (#1); converging.

Infra notes

  • Local disk was 100% full; freed ~750MB of regenerable cache to run anything.
  • All external eval providers were unavailable (.env Azure key expired, azure-dev resource has no deployments, xAI out of credits, Gemini free-tier quota 0), so the judge runs through harness agents instead of promptfoo's external provider call.

Token discipline

Dataset build is ~0 tokens (pure local Python, re-buildable for free). Cost is in the haiku classification (~11M tokens, one-time, cached) and eval re-runs (~7M tokens per full 3× round). Going forward: only re-judge cases touched by a change, not all 34×3.
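A rough way to see what selective re-judging saves, assuming token cost is spread evenly across the 34 cases (an assumption; real cases differ in prompt length). The ~7M figure is the approximate full-round cost quoted above, and `rejudge_tokens` is a hypothetical helper:

```python
CASES = 34
FULL_ROUND_TOKENS = 7_000_000  # approximate cost of one full 3x round, from this issue


def rejudge_tokens(touched_cases, full_round_tokens=FULL_ROUND_TOKENS, total_cases=CASES):
    """Estimated tokens to re-judge only the touched cases (all three runs each).

    Assumes every case costs the same, which is only a first approximation.
    """
    if not 0 <= touched_cases <= total_cases:
        raise ValueError("touched_cases must be between 0 and total_cases")
    return full_round_tokens * touched_cases / total_cases


for touched in (1, 5, 34):
    print(f"{touched:>2} cases: ~{rejudge_tokens(touched) / 1e6:.2f}M tokens")
```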
