Improve reflection judge prompt with antipatterns mined from real sessions + eval loop

## What

Improve the reflection plugin's task-completion **judge prompt** (`evals/prompts/task-verification.txt`, the eval mirror of the production `reflection-3.ts` self-assessment prompt) by adding **patterns and antipatterns mined from real local agent sessions**, validated with an automated eval loop.

## Why

The reflection judge decides whether an agent's turn is *actually* complete or whether it stopped/idled prematurely. To ground that decision in reality (not guesswork), we mined **155 real OpenCode + Claude Code sessions** from local stores to learn *why agents actually stop or go idle*.

Findings (155 sessions, classified by cheap haiku agents):
- **43/155 (28%)** ended on a stop/idle **antipattern**.
- Top antipatterns: `stopped-with-todos-listed` (14), `false-complete` (12), `permission-seeking` (9), `analysis-no-implementation` (4), plus silent tool-only loops.
- **9** sessions asked the user a question whose answer was *obvious enough that the agent should have just proceeded*; only **7** genuinely needed a human (reCAPTCHA, OAuth consent, login token).

These map directly onto judge guidance: recognize premature stops, false completes, permission-seeking, and analysis-without-action — while NOT penalizing legitimate human-blocked stops.
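As a rough illustration of how the tallies above could be reproduced from the classifier output, here is a minimal sketch. It assumes each classification is a dict carrying the `antipattern_tags` list from the classify step; the function name and the toy data are hypothetical, not the real dataset or the actual script.

```python
from collections import Counter

def tally_antipatterns(classifications):
    """Return (number of sessions with any antipattern tag, per-tag counts)."""
    tag_counts = Counter()
    flagged = 0
    for label in classifications:
        tags = label.get("antipattern_tags") or []
        if tags:
            flagged += 1
        # Count each tag at most once per session.
        tag_counts.update(set(tags))
    return flagged, tag_counts

# Toy labels for illustration only (hypothetical, not mined data).
sample = [
    {"antipattern_tags": ["false-complete"]},
    {"antipattern_tags": ["permission-seeking", "false-complete"]},
    {"antipattern_tags": []},
    {"antipattern_tags": ["stopped-with-todos-listed"]},
]
flagged, counts = tally_antipatterns(sample)
share = round(100 * flagged / len(sample))  # percentage of flagged sessions
```

Applied to the real labels, the same share calculation is what turns 43 flagged sessions out of 155 into the 28% figure.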

## How

1. **Extract** (`scripts/extract_sessions.py`) — read OpenCode SQLite (`~/.local/share/opencode/*.db`) + Claude jsonl (`~/.claude/projects`) directly off disk into compact per-session digests. No LLM.
2. **Classify** (`session-stop-analysis` workflow) — one cheap haiku agent per session → structured labels (ended_by, antipattern_tags, question_was_answerable, user_followup, stop_reasoning, evidence_quote).
3. **Dataset** (`scripts/build_dataset.py`) — join classifications with raw messages → `.dataset/{session.id}.xml` (user_input, ai_output, user_followup, classification). Git-ignored (personal data). 155 well-formed entries. No LLM.
4. **Eval loop**:
   - `scripts/eval_render.py` renders the 34 promptfoo judge cases (`evals/promptfooconfig.yaml`) into per-case prompts.
   - `scripts/judge-eval.wf.js` runs each case through cheap judge agents **3×, taking the majority vote** to denoise (single-run haiku verdicts proved too noisy).
   - `scripts/eval_score.mjs` evaluates the **original promptfoo JS assertions** against the consensus verdicts → pass rate + failures + vote splits.
   - Iterate: add/refine pattern → re-judge → score → fix regressions. Repeat until stable.
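The majority-vote denoising in step 4 can be sketched as follows. This is an illustration in Python, not the code in `scripts/judge-eval.wf.js`; the function name and the verdict labels (`"complete"`, `"incomplete"`) are assumptions made for the example.

```python
from collections import Counter

def consensus(votes):
    """Return (winning verdict, is_unanimous, vote split) for one eval case."""
    if not votes:
        raise ValueError("need at least one vote")
    split = Counter(votes)
    winner, top = split.most_common(1)[0]
    # An odd number of binary votes always has a strict majority; guard
    # anyway in case the judge emits more than two distinct verdicts.
    if top * 2 <= len(votes):
        raise ValueError(f"no majority in {dict(split)}")
    return winner, top == len(votes), dict(split)

# Three judge runs for one case (hypothetical verdict labels).
verdict, unanimous, split = consensus(["complete", "incomplete", "complete"])
```

Keeping the vote split next to the consensus verdict is what makes it possible to report both the pass rate and how many cases were unanimous.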

### Prompt changes (`evals/prompts/task-verification.txt`)
- **BLOCKER severity** clarified: reflects the *nature* of the defect, not keywords in file/task names (a stuck test on `auth.integration.test.ts` is HIGH, not BLOCKER).
- **Human-action** clarified: *blocked-now* (OAuth consent, paste token, 2FA) → `requires_human_action: true`; *future/optional approval* while real work remains → false.
- New **"Observed Stop/Idle Antipatterns"** section (grounded in real sessions, with actual trigger phrases): stopped-with-todos, permission-seeking/verification-deferral, false-complete, analysis-no-implementation, stuck/retry/silent-tool-loop.
- New **"Legitimate stops"** section so genuine human-blocked or evidence-backed completions aren't penalized.
- **Evidence requirements** refined: a write/edit tool call is evidence of file creation; a reported test/build *result summary* counts as runtime evidence (no demand to re-dump raw stdout).

### Results (34-case judge eval, 3× majority-vote)
- Baseline: 32/34 (94%, single-run, noisy).
- After patterns + denoising: **34/34 (100%)**, 24/34 unanimous.
- Tuning the evidence rule surfaced a fixable over/under-strictness tradeoff between the trivial file-creation case (#00) and the test-summary case (#01); converging.

## Infra notes
- Local disk was 100% full; ~750 MB of regenerable cache had to be freed before anything could run.
- All external eval providers were unavailable (`.env` Azure key expired, azure-dev resource has no deployments, xAI out of credits, Gemini free-tier quota 0), so the judge runs through harness agents instead of promptfoo's external provider call.

## Token discipline
Dataset build is ~0 tokens (pure local Python, re-buildable for free). The cost is in the haiku classification (~11M tokens, one-time, cached) and eval re-runs (~7M tokens per full 3× round). Going forward: only re-judge cases touched by a change, not all 34×3.
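A back-of-the-envelope estimate of what selective re-judging saves, using only the figures in this issue. It assumes, as a simplification, that every judge run costs about the same; the function name is hypothetical.

```python
CASES = 34
VOTES_PER_CASE = 3
FULL_ROUND_TOKENS = 7_000_000  # approximate figure quoted above

def estimated_tokens(cases_to_rejudge: int) -> int:
    """Estimate tokens for re-judging a subset, assuming uniform cost per run."""
    per_run = FULL_ROUND_TOKENS / (CASES * VOTES_PER_CASE)
    return round(per_run * cases_to_rejudge * VOTES_PER_CASE)

full_round = estimated_tokens(CASES)  # all 34 cases, 3 votes each
two_cases = estimated_tokens(2)       # e.g. only cases #00 and #01
```

Under that assumption, re-judging two cases costs roughly 0.41M tokens instead of ~7M for the full round.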

