From bbccedead3b2e239536f064ef0d7d5270d73f827 Mon Sep 17 00:00:00 2001 From: engineer Date: Sun, 31 May 2026 19:17:41 -0700 Subject: [PATCH 1/4] feat(reflection): add premature-stop antipatterns to self-assessment prompt Port patterns/antipatterns mined from 227 real agent stops (78% premature: permission-seeking ~40%, stopped-with-todos ~30%) into both production judge prompts in reflection-3.ts (buildSelfAssessmentPrompt + analyzeSelfAssessmentWithLLM), each mapped to its own JSON schema. Mirror into reflection-3.test-helpers.ts (the unit-test source, which had drifted) and add a unit assertion that the antipattern guidance is present. Includes a legitimate-stop carve-out so genuine human-only blocks route to waiting_for_user instead of nagging. Validated via the judge eval loop: 33/34, genuine cases (#00/#01) complete-unanimous, premature-stop cases correctly incomplete. Refs #140 Co-Authored-By: Claude Opus 4.8 --- reflection-3.test-helpers.ts | 8 +++++++- reflection-3.ts | 13 ++++++++++++- test/reflection-3.unit.test.ts | 26 ++++++++++++++++++++++++++ 3 files changed, 45 insertions(+), 2 deletions(-) diff --git a/reflection-3.test-helpers.ts b/reflection-3.test-helpers.ts index 7732626..bcbf265 100644 --- a/reflection-3.test-helpers.ts +++ b/reflection-3.test-helpers.ts @@ -155,7 +155,13 @@ Rules: - If stuck, propose an alternate approach. - If you need user action (auth, 2FA, credentials), list it in needs_user_action. - If you are repeating the same actions (deploy, test, build) without making progress, set "stuck": true. -- Do not retry the same failing approach more than twice — try something different or report stuck.` +- Do not retry the same failing approach more than twice — try something different or report stuck. + +PREMATURE-STOP ANTIPATTERNS (mined from 227 real agent stops where the user replied; 78% were premature — the user said "go"/"continue"/"yes do it" or corrected the agent). 
If the agent's last response matches one of these AND executable work remains, the task is NOT complete — set status "in_progress", and put the concrete next action in remaining_work and next_steps: +- PERMISSION-SEEKING (most common, ~40%): the response ends by asking to do work it can already do — "Want me to…?", "Would you like me to…?", "Should I…?", "Shall I proceed?", or "Try running it now"/"Please run X and confirm" (deferring a check it could run itself). DECISIVE TEST: if the final turn is a yes/no or "want me to X?" question AND X is something the agent can do with its own tools AND X carries no irreversible risk, the stop is premature — it should have just done X. Asking is only legitimate before a destructive/irreversible action (delete prod data, force-push, send an irreversible external message). +- STOPPED-WITH-TODOS (~30%): the response lists "Remaining Tasks"/"Next steps"/"Still TODO"/"What I did NOT do" or names a verify/run/check/create-PR step as "next" — then stops without doing it. Listing remaining work does not complete it; a self-contained named step must be DONE before stopping. Set status "in_progress" with that work in remaining_work. +- FALSE-COMPLETE: claims "done"/"complete"/"ready"/"all tasks complete" but the CORE requested action never happened, a required check was skipped, or there is no evidence. An empty/no-text response, or a response with no write/tool evidence on an action task, is NEVER complete. +- LEGITIMATE STOP (do NOT flag): genuine human-only block (OAuth consent, 2FA code, credential/API-key retrieval, captcha) → status "waiting_for_user" with the item in needs_user_action. 
Genuine completion WITH evidence (commands+output, tests passing, PR/CI verified) → status "complete"; do not invent missing work.` } export function buildToolReflectionGuidanceSection(toolReflectionPrompt: string | null): string { diff --git a/reflection-3.ts b/reflection-3.ts index 55db5b5..8cd2e21 100644 --- a/reflection-3.ts +++ b/reflection-3.ts @@ -1135,7 +1135,13 @@ Rules: - If you need user action (auth, 2FA, credentials, access requests, uploads, approvals), list it in needs_user_action. - PLANNING LOOP CHECK: If the task requires code changes (fix, implement, add, create, build, refactor, update) but the "Tool Commands Run" section shows ONLY read operations (read, glob, grep, git log, git status, git diff, webfetch, task/explore) and NO write operations (edit, write, bash with build/test/commit, github_create_pull_request, etc.), then the task is NOT complete. Set status to "in_progress", set stuck to true, and list "Implement the actual code changes" in remaining_work. Analyzing and recommending changes is not the same as making them. - If you are repeating the same actions (deploy, test, build) without making progress, set "stuck": true. -- Do not retry the same failing approach more than twice — try something different or report stuck.` +- Do not retry the same failing approach more than twice — try something different or report stuck. + +PREMATURE-STOP ANTIPATTERNS (mined from 227 real agent stops where the user replied; 78% were premature — the user said "go"/"continue"/"yes do it" or corrected the agent). 
If the agent's last response matches one of these AND executable work remains, the task is NOT complete — set status "in_progress", and put the concrete next action in remaining_work and next_steps: +- PERMISSION-SEEKING (most common, ~40%): the response ends by asking to do work it can already do — "Want me to…?", "Would you like me to…?", "Should I…?", "Shall I proceed?", or "Try running it now"/"Please run X and confirm" (deferring a check it could run itself). DECISIVE TEST: if the final turn is a yes/no or "want me to X?" question AND X is something the agent can do with its own tools AND X carries no irreversible risk, the stop is premature — it should have just done X. Asking is only legitimate before a destructive/irreversible action (delete prod data, force-push, send an irreversible external message). +- STOPPED-WITH-TODOS (~30%): the response lists "Remaining Tasks"/"Next steps"/"Still TODO"/"What I did NOT do" or names a verify/run/check/create-PR step as "next" — then stops without doing it. Listing remaining work does not complete it; a self-contained named step must be DONE before stopping. Set status "in_progress" with that work in remaining_work. +- FALSE-COMPLETE: claims "done"/"complete"/"ready"/"all tasks complete" but the CORE requested action never happened, a required check was skipped, or there is no evidence. An empty/no-text response, or a response with no write/tool evidence on an action task, is NEVER complete. +- LEGITIMATE STOP (do NOT flag): genuine human-only block (OAuth consent, 2FA code, credential/API-key retrieval, captcha) → status "waiting_for_user" with the item in needs_user_action. Genuine completion WITH evidence (commands+output, tests passing, PR/CI verified) → status "complete"; do not invent missing work.` } function parseSelfAssessmentJson(text: string | null | undefined): SelfAssessment | null { @@ -1390,6 +1396,11 @@ Rules: - If user action is required (auth/2FA/credentials), set requires_human_action true. 
- If agent is stuck, require alternate approach and continued work. - PLANNING LOOP: If the task requires code changes (fix, implement, add, create, build, refactor) but the Tool Signals show ONLY read operations (read, glob, grep, git log/status/diff, webfetch) and NO write operations (edit, write, bash with build/test/commit, PR creation), set complete to false and add "Implement actual code changes" to missing. Analysis alone does not fulfill an implementation task. +- PREMATURE-STOP ANTIPATTERNS (78% of real agent stops were premature). Set complete false, requires_human_action false, and put the concrete work in next_actions when the agent's response matches: + - PERMISSION-SEEKING: ends asking to do work it can already do ("Want me to…?", "Should I…?", "Try running it now", "Please run X and confirm"). DECISIVE TEST: final-turn yes/no question about something the agent can do with its own tools and no irreversible risk = premature; it should have done it. Asking is legitimate only before destructive/irreversible actions. + - STOPPED-WITH-TODOS: lists "Remaining Tasks"/"Next steps"/"What I did NOT do" or names a verify/run/check step as next, then stops. Listing ≠ doing. + - FALSE-COMPLETE: claims done/ready/"all tasks complete" but the core action never happened, a required check was skipped, or no evidence. Empty/no-tool response on an action task is never complete. +- LEGITIMATE STOP (do NOT penalize): genuine human-only block (OAuth consent, 2FA, credential/API-key retrieval, captcha) → complete false, requires_human_action true. Genuine completion WITH evidence → complete true; do not invent missing work. 
Return JSON only: { diff --git a/test/reflection-3.unit.test.ts b/test/reflection-3.unit.test.ts index 4d700ab..d81d92b 100644 --- a/test/reflection-3.unit.test.ts +++ b/test/reflection-3.unit.test.ts @@ -59,6 +59,32 @@ describe("reflection-3 unit", () => { assert.ok(prompt.includes("Provide a PR URL")) }) + it("self-assessment prompt includes premature-stop antipatterns", () => { + const prompt = buildSelfAssessmentPrompt({ + taskSummary: "Implement feature", + taskType: "coding", + agentMode: "build", + humanMessages: ["Implement feature"], + toolsSummary: "(none)", + detectedSignals: [], + recentCommands: [], + pushedToDefaultBranch: false, + requiresTests: false, + requiresBuild: false, + requiresPR: false, + requiresCI: false, + requiresLocalTests: false, + requiresLocalTestsEvidence: false + }, "") + // mined-antipattern guidance must be present so the runtime judge catches premature stops + assert.ok(prompt.includes("PERMISSION-SEEKING"), "permission-seeking antipattern missing") + assert.ok(prompt.includes("STOPPED-WITH-TODOS"), "stopped-with-todos antipattern missing") + assert.ok(prompt.includes("FALSE-COMPLETE"), "false-complete antipattern missing") + assert.ok(prompt.includes("LEGITIMATE STOP"), "legitimate-stop carve-out missing") + // carve-out must route genuine human blocks to waiting_for_user, not nag + assert.ok(prompt.includes("waiting_for_user")) + }) + it("includes tool reflection guidance in resolved reflection prompt", () => { const basePrompt = buildSelfAssessmentPrompt({ taskSummary: "Implement feature", From 3a452db9cb2af72553d70abdfb8e88d8022f39cb Mon Sep 17 00:00:00 2001 From: engineer Date: Sun, 31 May 2026 19:17:41 -0700 Subject: [PATCH 2/4] test(reflection): session-mining + judge eval loop for prompt tuning MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add the harness used to mine and validate the antipatterns: - scripts/extract_sessions.py, build_dataset.py, export_stop_candidates.py, 
validate_dataset.py — read local opencode/claude sessions into a turn-level labeled dataset (.dataset/, git-ignored; may contain personal data). - scripts/eval_render.py + eval_score.mjs + judge-eval.wf.js workflow — replay the 34 promptfoo judge cases through the prompt with 3x majority-vote and score against the original promptfoo assertions (external providers are dead, so the judge runs through harness agents). - evals/prompts/task-verification.txt — tuned eval-mirror with the same antipatterns + evidence-rule refinements. Refs #140 Co-Authored-By: Claude Opus 4.8 --- .gitignore | 4 + evals/prompts/task-verification.txt | 92 ++++++++++++-- scripts/build_dataset.py | 180 ++++++++++++++++++++++++++++ scripts/eval_render.py | 49 ++++++++ scripts/eval_score.mjs | 49 ++++++++ scripts/export_stop_candidates.py | 60 ++++++++++ scripts/extract_sessions.py | 158 ++++++++++++++++++++++++ scripts/validate_dataset.py | 22 ++++ 8 files changed, 605 insertions(+), 9 deletions(-) create mode 100644 scripts/build_dataset.py create mode 100644 scripts/eval_render.py create mode 100644 scripts/eval_score.mjs create mode 100644 scripts/export_stop_candidates.py create mode 100644 scripts/extract_sessions.py create mode 100644 scripts/validate_dataset.py diff --git a/.gitignore b/.gitignore index 5bed249..b1d31d6 100644 --- a/.gitignore +++ b/.gitignore @@ -43,5 +43,9 @@ evals/datasets/cc-stop-classified.jsonl # Working notes plan.md + # Added by code-review-graph .code-review-graph/ + +# Mined session dataset (may contain personal data) +.dataset/ diff --git a/evals/prompts/task-verification.txt b/evals/prompts/task-verification.txt index 06988b1..98051af 100644 --- a/evals/prompts/task-verification.txt +++ b/evals/prompts/task-verification.txt @@ -19,12 +19,17 @@ Evaluate whether the agent completed what the user asked for. 
This is a CODING/ACTION task (unless the task explicitly mentions "research only" or "do not write code") ### Severity Levels -- BLOCKER: security, auth, billing/subscription, data loss, E2E broken, prod health broken → complete MUST be false -- HIGH: major functionality degraded, CI red without approved waiver +- BLOCKER: a CONFIRMED security vulnerability, auth/authorization bypass, billing/subscription break, data loss, E2E broken, or prod health broken → complete MUST be false +- HIGH: major functionality degraded, CI red without approved waiver, OR the agent is stuck in a repeated retry loop with no progress - MEDIUM: partial degradation or uncertain coverage - LOW: cosmetic / non-impacting - NONE: no issues +Severity reflects the NATURE of the actual problem, not keywords in file or task names. +A task that merely TOUCHES auth/billing/payments code (e.g. a file named `auth.integration.test.ts`) +is NOT automatically a BLOCKER. Reserve BLOCKER for an actual security/auth defect that is present +in the work. A failing or stuck test on an auth-named file is at most HIGH, not BLOCKER. + ### Security Severity Override If the agent identifies ANY security vulnerability (e.g., plaintext passwords, auth bypass, injection risk, data exposure), severity MUST be "BLOCKER" and complete MUST be false. @@ -36,10 +41,21 @@ severity MUST be "BLOCKER" and complete MUST be false. 4. No unhandled errors in output ### Evidence Requirements -Every claim needs evidence. Reject claims like "ready", "verified", "working", "fixed" without: -- Actual command output showing success -- Test name + result -- File changes made +Claims about RUNTIME BEHAVIOR need execution evidence. Reject a BARE claim ("ready"/"verified"/ +"working"/"fixed"/"tests pass"/"build succeeds") only when NO concrete result accompanies it. +Any of the following counts as sufficient evidence — do NOT demand more once one is present: +- Command output or a results summary (e.g. 
"Test Suites: 3 passed, 3 total; Tests: 94 passed"), or +- Test/suite names with pass/fail status, or +- Build/typecheck output (e.g. "Found 0 errors"). +A reported pass/fail summary IS evidence — do not insist the agent re-dump full raw stdout when counts +or a typecheck result are already shown. Reject only when the agent asserts success with no result at all. + +But do NOT manufacture missing work for deterministic file operations: a successful `write` or `edit` +tool call IS sufficient evidence that the file was created/changed. For a plain "create file X with +content Y" task, the write tool call completing is enough — do not demand the agent additionally run or +cat the file unless the task asked to verify runtime behavior. Distinguish: +- "create/add/write a file" → the write/edit tool call is the evidence → complete. +- "make it work / pass tests / build / fix the bug" → requires execution output as evidence. ### Flaky Test Protocol If a test is called "flaky" or "unrelated", require at least ONE of: @@ -106,9 +122,15 @@ When human action is truly required (OAuth consent, 2FA, API key retrieval, manu - Add specific description of what user needs to do in feedback - Add "User must provide [token/key/code]" to missing array -If the agent still has executable steps to do (tests, builds, edits, publish) and only -mentions a future human approval step, treat requires_human_action as false because the -agent is not blocked yet. +**Blocked-NOW vs future-approval distinction:** +- If the agent is CURRENTLY blocked and cannot make any further progress until the human acts + (e.g. "Click Allow in the browser", "paste the OAuth code here", "enter the 2FA code"), then + requires_human_action: true — even if the agent describes follow-up steps it will run AFTER the + human acts ("...so I can then exchange it for tokens"). The follow-up work does not unblock the present. 
+- Only treat requires_human_action: false when the human step is OPTIONAL or FUTURE and the agent + still has real work it can do RIGHT NOW (tests, builds, edits, publish) before reaching that step. +- Litmus test: has the agent run out of actions it can take on its own this turn? If yes and the next + action requires a browser/credential/physical step, it is blocked now → requires_human_action: true. When agent CAN do the work but chose to give instructions instead: - Set complete: false @@ -154,6 +176,58 @@ If the project has documented testing requirements (in AGENTS.md or similar): - Common mandatory steps: `npm test`, `npm run build`, E2E tests - If build/test scripts don't exist, agent should report that, not skip verification +### Observed Stop/Idle Antipatterns (mined from 227 real agent stops, user follow-up = ground truth) +In a study of real sessions where the agent stopped AND the user replied, **78% of stops were +PREMATURE** — the user reacted with "go", "continue", "yes do it", "why did you stop", or by +correcting the agent. The two dominant failure modes were PERMISSION-SEEKING (40%) and +STOPPED-WITH-TODOS (30%). Bias accordingly: when the agent stops with executable work remaining and +no genuine human-only blocker, treat it as premature — set complete: false, requires_human_action: +false, and put the concrete work in next_actions. The agent should have kept going. + +1. STOPPED-WITH-TODOS-LISTED: the agent does part of the work, then lists "Remaining Tasks" / + "Next steps:" / "Still TODO" / "What I did NOT do:" and stops — or presents partial/truncated work + as done. Real examples: "Done. Patched skill doc... No PR created. No deploy done."; "Next steps: + verify the edit now" — then it stops without verifying; an email/code artifact cut off mid-sentence. + A self-contained step the agent named as "next" (verify, run, check, create the PR) must be DONE + before stopping, not deferred. Listing remaining work does not complete it. +2. 
PERMISSION-SEEKING / VERIFICATION-DEFERRAL (the single most common): the agent ends by asking to do + work it can already do. Trigger phrases: "Want me to …?", "Would you like me to …?", "Should I …?", + "Shall I proceed?", and especially "Try running it now: " / "Please run `npm test` and confirm" + (deferring a check it could run itself). DECISIVE TEST: if the agent's final turn is a yes/no or + "want me to X?" question AND X is something the agent can do with its own tools (bash/edit/write/ + webfetch) AND X carries no irreversible risk, the stop is PREMATURE — it should have just done X. + Asking permission is only legitimate before a destructive/irreversible action (delete prod data, + force-push, send an irreversible external message) or a genuine either/or it cannot resolve. +3. FALSE-COMPLETE: claims "done"/"complete"/"ready"/"All Tasks Complete" but the CORE requested action + never happened, there is no evidence, OR a required verification was skipped. Real examples: 44 + browser turns that never navigated to the target site; an empty/no-text response to a task; "you now + have the full stack" without wiring the final piece; a confident "All Tasks Complete" summary that + silently omits a required check (e.g. ran voice E2E but not the browser E2E the task also needs). A + blanket completion claim does NOT override the Multi-Verification rule above — if ANY requested + verification is unproven, complete is false regardless of how confident the summary sounds. An empty + response, or a response with no tool evidence on an action task, is NEVER complete. + NOTE: the "bias toward premature" guidance is about agents that STOP and ask/defer — it never makes a + confident "complete" claim true. Over-confidence with missing verification is false-complete, not a + legitimate stop. +4. ANALYSIS-NO-IMPLEMENTATION: the agent reads/greps/diagnoses and recommends changes, but makes none. 
+ If the task asks to fix/implement/add/build and the tools show only read/grep/glob/git-status/ + webfetch with no edit/write/bash-build/commit, it is NOT complete. +5. STUCK / RETRY-LOOP / SILENT TOOL-LOOP: the agent repeats a failing approach (or makes many silent + tool calls after hitting an error) without changing strategy or reporting. Real example: "the + browser keeps getting stolen by other tabs" followed by 44 silent tool calls. After ~2 failed + attempts at the same approach, this is severity HIGH — it should report the blocker and try a + different approach, not loop. + +### Legitimate stops (do NOT penalize — these are correct) +- Genuinely blocked on a human-only action the agent cannot perform: reCAPTCHA solve, OAuth "Allow" + consent, 2FA code entry, manual browser login to copy a session token, API-key retrieval from a + dashboard. Here complete: false, requires_human_action: true is correct. +- Genuine task completion WITH evidence (commands + output, tests passing, PR/CI verified, or the + reported deliverable for a read-only task). Here complete: true is correct — do not invent missing work. +- A real ambiguity the agent cannot resolve from the task or codebase (e.g. two valid interpretations + that materially change the result). A narrow clarifying question is acceptable; a vague "what next?" + after finishing nothing is not. + ════════════════════════════════════════ Reply with JSON only (no other text): diff --git a/scripts/build_dataset.py b/scripts/build_dataset.py new file mode 100644 index 0000000..b7f4f60 --- /dev/null +++ b/scripts/build_dataset.py @@ -0,0 +1,180 @@ +#!/usr/bin/env python3 +"""Build .dataset/{session.id}.xml at TURN granularity from real top-level sessions. + +Each session yields MANY examples — one per (assistant turn -> next user turn) pair, i.e. every +point where the agent produced output and the user reacted. 
Each example records: + the user task + preceding user turn (what the agent was responding to) + the assistant turn + the user's reply to it (how they followed up) + heuristic labels (free, no LLM): stop_type, followup_reaction, antipattern + +Scope: top-level sessions only (no subagents). Sources: OpenCode SQLite (parent_id IS NULL) + +Claude jsonl (~/.claude/projects/*/*.jsonl — deep subagent dirs excluded by the glob depth). + +NOTE: sessions can contain personal data — .dataset is git-ignored. ~0 tokens (pure local Python). +""" +import json, os, sqlite3, glob, re +from xml.sax.saxutils import escape + +OUT = os.path.join(os.path.dirname(__file__), "..", ".dataset") +FIELD_CAP = 4000 + +def _xml_ok(ch): + o = ord(ch) + return o in (0x9, 0xA, 0xD) or 0x20 <= o <= 0xD7FF or 0xE000 <= o <= 0xFFFD or 0x10000 <= o <= 0x10FFFF + +def cdata(s): + s = (s or "")[:FIELD_CAP] + s = "".join(c for c in s if _xml_ok(c)) + return "<![CDATA[" + s.replace("]]>", "]]]]><![CDATA[>") + "]]>" + +# ---------- source readers: return ordered [(role, text)] for top-level sessions ---------- +def opencode_sessions(db_path): + con = sqlite3.connect(db_path); con.row_factory = sqlite3.Row + cols = {r[1] for r in con.execute("PRAGMA table_info(session)")} + has_parent = "parent_id" in cols + q = "SELECT * FROM session" + (" WHERE parent_id IS NULL" if has_parent else "") + for s in con.execute(q).fetchall(): + sid = s["id"]; seq = [] + for m in con.execute("SELECT * FROM message WHERE session_id=? ORDER BY time_created ASC", (sid,)): + md = json.loads(m["data"]); role = md.get("role") + texts = [] + for p in con.execute("SELECT data FROM part WHERE message_id=? 
ORDER BY time_created ASC", (m["id"],)): + pd = json.loads(p["data"]) + if pd.get("type") == "text" and pd.get("text"): + texts.append(pd["text"]) + seq.append((role, "\n".join(texts).strip())) + if seq: + yield sid, seq + con.close() + +def claude_session(path): + seq = [] + for l in open(path): + if not l.strip(): continue + d = json.loads(l) + if d.get("type") not in ("user", "assistant"): continue + m = d.get("message", {}); role = m.get("role"); c = m.get("content") + if isinstance(c, str): + text = c + elif isinstance(c, list): + text = "\n".join(p.get("text", "") for p in c + if isinstance(p, dict) and p.get("type") == "text").strip() + else: + text = "" + seq.append((role, text)) + return seq + +# ---------- free heuristic classifier per (assistant, followup) pair ---------- +QUESTION_RE = re.compile(r"(would you like|want me to|should i\b|shall i\b|let me know|do you want|" + r"which (option|one|approach)|how would you like|\?\s*$)", re.I) +TODO_RE = re.compile(r"(remaining (task|work|step)|next step|to-?do|still need to|i'll also|" + r"i will also|left to do|outstanding)", re.I) +DEFER_RE = re.compile(r"(try running it|please run|you (can|could|should) run|" + r"go ahead and run|run (it|the|npm|the command) (yourself|now))", re.I) +CONTINUE_RE = re.compile(r"^\s*(continue|yes|y|go|proceed|keep going|do it|ok|okay|go ahead|" + r"next|carry on|sure|please continue)\b", re.I) +CORRECT_RE = re.compile(r"^\s*(no\b|not\b|nope|wrong|stop|don'?t|actually|instead|that'?s (not|wrong)|" + r"you (didn'?t|missed|forgot))", re.I) + +def stop_type(asst): + a = asst.strip() + if not a: return "empty_output" + if QUESTION_RE.search(a[-300:]) or a.rstrip().endswith("?"): asked = True + else: asked = False + todo = bool(TODO_RE.search(a)) + defer = bool(DEFER_RE.search(a)) + if defer: return "verification_deferral" + if todo and asked: return "stopped_with_todos_and_question" + if todo: return "stopped_with_todos" + if asked: return "asked_question" + return 
"stated_progress" + +def followup_reaction(fu): + f = fu.strip() + if not f: return "none" + if CONTINUE_RE.search(f): return "told_to_continue" + if CORRECT_RE.search(f): return "corrected_or_redirected" + if len(f) < 120: return "short_reply" + return "answered_or_new_input" + +def is_antipattern(st, reac): + # agent stopped/asked AND the user just told it to continue or corrected it => it shouldn't have stopped + stopped = st in ("asked_question", "stopped_with_todos", "stopped_with_todos_and_question", + "verification_deferral", "empty_output") + return stopped and reac in ("told_to_continue", "corrected_or_redirected") + +def compact(seq): + """Drop empty-text turns (tool-only round-trips), then merge consecutive same-role text turns. + Yields the real conversational exchange: alternating assistant/user TEXT turns.""" + nonempty = [(r, t) for (r, t) in seq if t and t.strip()] + merged = [] + for r, t in nonempty: + if merged and merged[-1][0] == r: + merged[-1] = (r, merged[-1][1] + "\n\n" + t) + else: + merged.append((r, t)) + return merged + +def build(sid, source, task, seq): + cseq = compact(seq) + examples = [] + for i in range(len(cseq) - 1): + if cseq[i][0] != "assistant" or cseq[i + 1][0] != "user": + continue + asst = cseq[i][1]; fu = cseq[i + 1][1] + # context = nearest preceding user text turn + ctx = "" + for j in range(i - 1, -1, -1): + if cseq[j][0] == "user": + ctx = cseq[j][1]; break + st = stop_type(asst); reac = followup_reaction(fu) + ap = is_antipattern(st, reac) + examples.append((i, ctx, asst, fu, st, reac, ap)) + if not examples: + return None, 0, 0 + n_ap = sum(1 for e in examples if e[6]) + parts = [f'', + f'', + f' {cdata(task)}'] + for (i, ctx, asst, fu, st, reac, ap) in examples: + parts.append(f' ') + parts.append(f' {cdata(ctx)}') + parts.append(f' {cdata(asst)}') + parts.append(f' {cdata(fu)}') + parts.append(f' ') + parts.append(f' ') + parts.append('\n') + return "\n".join(parts), len(examples), n_ap + +def main(): + 
os.makedirs(OUT, exist_ok=True) + for f in glob.glob(os.path.join(OUT, "*.xml")): + os.remove(f) + home = os.path.expanduser("~") + sessions = {} # sid -> (source, seq) + for db, tag in [("opencode-local.db", "opencode"), ("opencode.db", "opencodeOld"), + ("opencode-main.db", "opencodeMain")]: + p = f"{home}/.local/share/opencode/{db}" + if os.path.exists(p): + for sid, seq in opencode_sessions(p): + sessions.setdefault(sid, (tag, seq)) + for path in glob.glob(f"{home}/.claude/projects/*/*.jsonl"): # top-level only + sid = os.path.splitext(os.path.basename(path))[0] + sessions.setdefault(sid, ("claude", claude_session(path))) + + files = tot_ex = tot_ap = 0 + for sid, (source, seq) in sessions.items(): + task = next((t for r, t in seq if r == "user" and t), "") + xml, n, nap = build(sid, source, task, seq) + if not xml: + continue + safe = re.sub(r"[^A-Za-z0-9_.-]", "_", sid) + open(os.path.join(OUT, f"{safe}.xml"), "w").write(xml) + files += 1; tot_ex += n; tot_ap += nap + print(f"sessions written: {files}") + print(f"total examples (turn-pairs): {tot_ex}") + print(f"heuristic antipattern examples: {tot_ap} ({100*tot_ap/max(tot_ex,1):.1f}%)") + +if __name__ == "__main__": + main() diff --git a/scripts/eval_render.py b/scripts/eval_render.py new file mode 100644 index 0000000..be06a32 --- /dev/null +++ b/scripts/eval_render.py @@ -0,0 +1,49 @@ +#!/usr/bin/env python3 +"""Render promptfoo judge eval cases into a flat JSON the harness loop can use. + +For each test: substitute {{vars}} into the prompt template and capture the raw +JS assertions verbatim so scoring stays identical to promptfoo. 
+""" +import yaml, json, re, sys, os + +EVALS = os.path.join(os.path.dirname(__file__), "..", "evals") + +def render(template, vars): + out = template + for k, v in vars.items(): + out = out.replace("{{" + k + "}}", str(v)) + out = out.replace("{{ " + k + " }}", str(v)) + return out + +def main(cfg_name="promptfooconfig.yaml", prompt_file="prompts/task-verification.txt"): + cfg = yaml.safe_load(open(os.path.join(EVALS, cfg_name))) + # prompt file may be overridden by cfg + pf = cfg["prompts"][0].replace("file://", "") + template = open(os.path.join(EVALS, pf)).read() + cases = [] + for i, t in enumerate(cfg["tests"]): + vars = t.get("vars", {}) + asserts = [a.get("value", "") for a in t.get("assert", [])] + cases.append({ + "id": i, + "description": t.get("description", f"case-{i}"), + "prompt": render(template, vars), + "asserts": asserts, + }) + out = {"cfg": cfg_name, "prompt_file": pf, "cases": cases} + dest = "/tmp/eval-cases.json" + json.dump(out, open(dest, "w"), indent=1) + # also emit one prompt file per case for cheap per-agent reads + cdir = "/tmp/eval-cases" + os.makedirs(cdir, exist_ok=True) + for f in os.listdir(cdir): + if f.endswith(".txt"): + os.remove(os.path.join(cdir, f)) + for c in cases: + with open(os.path.join(cdir, f"case-{c['id']:02d}.txt"), "w") as fh: + fh.write(c["prompt"]) + print(f"rendered {len(cases)} cases from {cfg_name} -> {dest} (+ per-case files in {cdir})") + print(f"prompt template: {pf} ({len(template)} chars)") + +if __name__ == "__main__": + main(*(sys.argv[1:] or [])) diff --git a/scripts/eval_score.mjs b/scripts/eval_score.mjs new file mode 100644 index 0000000..e54cf7f --- /dev/null +++ b/scripts/eval_score.mjs @@ -0,0 +1,49 @@ +// Score judge verdicts against the original promptfoo JS assertions. 
+// Usage: node eval_score.mjs <cases.json> <verdicts.json> +// cases.json : { cases: [{id, description, asserts:[jsString]}] } +// verdicts.json: [{ id, output }] where output is the judge's raw text (must contain JSON) +import fs from "node:fs" + +const cases = JSON.parse(fs.readFileSync(process.argv[2], "utf8")).cases +const verdicts = JSON.parse(fs.readFileSync(process.argv[3], "utf8")) +const byId = new Map(verdicts.map(v => [v.id, v.output])) + +function runAssert(jsBody, output) { + // promptfoo asserts are function bodies referencing `output` and returning bool + try { + const fn = new Function("output", jsBody.includes("return") ? jsBody : `return (${jsBody})`) + return fn(output) === true + } catch (e) { + return false + } +} + +let passed = 0 +const failures = [] +for (const c of cases) { + const output = byId.get(c.id) + if (output == null) { + failures.push({ id: c.id, description: c.description, reason: "no verdict produced" }) + continue + } + const results = c.asserts.map(a => runAssert(a, output)) + const allPass = results.every(Boolean) + if (allPass) passed++ + else { + let verdict = null + const m = output.match(/\{[\s\S]*\}/) + if (m) { try { verdict = JSON.parse(m[0]) } catch {} } + failures.push({ + id: c.id, + description: c.description, + failedAsserts: c.asserts.filter((_, i) => !results[i]), + verdict: verdict ? { complete: verdict.complete, severity: verdict.severity, requires_human_action: verdict.requires_human_action } : "UNPARSEABLE", + }) + } +} + +const total = cases.length +console.log(JSON.stringify({ + passed, total, pct: Math.round((passed / total) * 1000) / 10, + failures, +}, null, 1)) diff --git a/scripts/export_stop_candidates.py b/scripts/export_stop_candidates.py new file mode 100644 index 0000000..0d62ada --- /dev/null +++ b/scripts/export_stop_candidates.py @@ -0,0 +1,60 @@ +#!/usr/bin/env python3 +"""Export high-signal 'did the agent stop too early?' examples for LLM classification. 
+ +Candidate = an example where the agent STOPPED/ASKED/DEFERRED and the user actually replied. +The user's follow-up is the ground-truth signal: if they said 'continue' or corrected the agent, +the stop was likely premature — exactly what the reflection plugin should have caught. + +Writes one small prompt file per candidate to /tmp/stop-candidates/ + an index json. +Pure local, no LLM. +""" +import glob, os, json +import xml.etree.ElementTree as ET + +SRC = os.path.join(os.path.dirname(__file__), "..", ".dataset") +OUT = "/tmp/stop-candidates" +STOP_TYPES = {"asked_question", "stopped_with_todos", "stopped_with_todos_and_question", + "verification_deferral", "empty_output"} +CAP = 2200 + +def main(): + os.makedirs(OUT, exist_ok=True) + for f in glob.glob(os.path.join(OUT, "*")): + os.remove(f) + index = [] + n = 0 + for p in sorted(glob.glob(os.path.join(SRC, "*.xml"))): + root = ET.parse(p).getroot() + sid = root.get("id"); source = root.get("source") + task = (root.findtext("task") or "")[:600] + for ex in root.findall("example"): + c = ex.find("classification") + st = c.get("stop_type") + reac = c.get("followup_reaction") + if st not in STOP_TYPES: + continue + if reac == "none": + continue # no follow-up = no ground truth + ai = (ex.findtext("ai_output") or "")[:CAP] + fu = (ex.findtext("user_followup") or "")[:800] + ctx = (ex.findtext("context") or "")[:600] + cid = f"{n:03d}" + body = ( + f"SESSION TASK (what the user originally wanted):\n{task}\n\n" + f"IMMEDIATE CONTEXT (the user turn the agent was responding to):\n{ctx}\n\n" + f"AGENT OUTPUT (the turn where it stopped/handed back):\n{ai}\n\n" + f"USER'S ACTUAL FOLLOW-UP (ground truth for whether the stop was right):\n{fu}\n\n" + f"HEURISTIC LABELS: stop_type={st}, followup_reaction={reac}\n" + ) + open(os.path.join(OUT, f"cand-{cid}.txt"), "w").write(body) + index.append({"cid": cid, "session": sid, "source": source, + "turn": ex.get("turn"), "stop_type": st, "followup_reaction": reac}) + n += 1 + 
json.dump(index, open(os.path.join(OUT, "index.json"), "w"), indent=1) + print(f"exported {n} stop-candidate examples -> {OUT}") + from collections import Counter + print("by stop_type:", dict(Counter(x['stop_type'] for x in index).most_common())) + print("by followup:", dict(Counter(x['followup_reaction'] for x in index).most_common())) + +if __name__ == "__main__": + main() diff --git a/scripts/extract_sessions.py b/scripts/extract_sessions.py new file mode 100644 index 0000000..2f505a3 --- /dev/null +++ b/scripts/extract_sessions.py @@ -0,0 +1,158 @@ +#!/usr/bin/env python3 +"""Extract compact per-session digests from OpenCode SQLite + Claude Code jsonl. + +Output: one .txt digest per session into OUT_DIR, focused on the signals needed +to reason about WHY an agent stopped/idled: the user task, follow-ups, the +assistant's final message, questions it asked, and tool activity per turn. +""" +import json, os, sqlite3, glob, sys, re + +OUT = "/tmp/session-digests" +os.makedirs(OUT, exist_ok=True) + +TURN_CAP = 500 # chars per intermediate turn +FINAL_CAP = 1800 # chars for the final assistant turn (stop/idle reasoning) +MIN_USER_TURNS = 1 + +def clean(s): + if not isinstance(s, str): s = str(s) + s = re.sub(r"\s+\n", "\n", s) + return s.strip() + +def trunc(s, n): + s = clean(s) + return s if len(s) <= n else s[:n] + " …[truncated]" + +def write_digest(source, sid, title, directory, turns, last_finish, body, n_user, n_asst): + # Only keep sessions with real interaction + if n_user < MIN_USER_TURNS or n_asst < 1: + return False + safe = re.sub(r"[^A-Za-z0-9_.-]", "_", sid)[:60] + path = os.path.join(OUT, f"{source}__{safe}.txt") + header = ( + f"SOURCE: {source}\nSESSION_ID: {sid}\nTITLE/PROJECT: {title}\n" + f"DIRECTORY: {directory}\nUSER_TURNS: {n_user} ASSISTANT_TURNS: {n_asst}\n" + f"LAST_FINISH_REASON: {last_finish}\n" + + "=" * 60 + "\nCONVERSATION (compact, oldest→newest)\n" + "=" * 60 + "\n" + ) + with open(path, "w") as f: + f.write(header + body) + return 
True + +# ---------------- OpenCode ---------------- +def extract_opencode(db_path, source_tag): + con = sqlite3.connect(db_path) + con.row_factory = sqlite3.Row + sessions = con.execute("SELECT * FROM session ORDER BY time_updated DESC").fetchall() + count = 0 + for s in sessions: + sid = s["id"] + msgs = con.execute( + "SELECT * FROM message WHERE session_id=? ORDER BY time_created ASC", (sid,) + ).fetchall() + if not msgs: + continue + lines = [] + n_user = n_asst = 0 + last_finish = "" + for i, m in enumerate(msgs): + md = json.loads(m["data"]) + role = md.get("role") + parts = con.execute( + "SELECT data FROM part WHERE message_id=? ORDER BY time_created ASC", (m["id"],) + ).fetchall() + texts, tools = [], [] + for p in parts: + pd = json.loads(p["data"]) + t = pd.get("type") + if t == "text" and pd.get("text"): + texts.append(pd["text"]) + elif t == "tool": + tools.append(pd.get("tool", "?")) + text = "\n".join(texts).strip() + is_final = (i == len(msgs) - 1) + if role == "user": + n_user += 1 + lines.append(f"\n[USER {n_user}] {trunc(text, TURN_CAP)}") + elif role == "assistant": + n_asst += 1 + last_finish = md.get("finish", "") or "" + toolstr = f" (tools: {','.join(tools)})" if tools else " (no tools)" + cap = FINAL_CAP if is_final else TURN_CAP + fin = f" [finish={last_finish}]" if is_final else "" + lines.append(f"\n[ASSISTANT {n_asst}]{toolstr}{fin} {trunc(text, cap)}") + title = s["title"] if "title" in s.keys() else "" + directory = s["directory"] if "directory" in s.keys() else "" + if write_digest(source_tag, sid, title, directory, len(msgs), last_finish, + "\n".join(lines), n_user, n_asst): + count += 1 + con.close() + return count + +# ---------------- Claude Code ---------------- +def claude_text(content): + if isinstance(content, str): + return content, [] + texts, tools = [], [] + if isinstance(content, list): + for p in content: + if not isinstance(p, dict): + continue + t = p.get("type") + if t == "text": + texts.append(p.get("text", "")) + 
elif t == "tool_use": + tools.append(p.get("name", "?")) + elif t == "tool_result": + pass + return "\n".join(texts).strip(), tools + +def extract_claude(projects_glob): + count = 0 + for f in glob.glob(projects_glob): + try: + raw = [json.loads(l) for l in open(f) if l.strip()] + except Exception: + continue + msgs = [d for d in raw if d.get("type") in ("user", "assistant")] + if not msgs: + continue + # project name from path + proj = os.path.basename(os.path.dirname(f)) + sid = os.path.splitext(os.path.basename(f))[0] + lines = [] + n_user = n_asst = 0 + last_finish = "" + for i, d in enumerate(msgs): + m = d.get("message", {}) + role = m.get("role") + text, tools = claude_text(m.get("content")) + # skip pure tool_result user turns (no human text) + is_final = (i == len(msgs) - 1) + if role == "user": + # tool_result-only user turns have empty text + if not text: + continue + # skip command stdout / system reminders noise heuristically kept + n_user += 1 + lines.append(f"\n[USER {n_user}] {trunc(text, TURN_CAP)}") + elif role == "assistant": + n_asst += 1 + last_finish = m.get("stop_reason", "") or "" + toolstr = f" (tools: {','.join(tools)})" if tools else " (no tools)" + cap = FINAL_CAP if is_final else TURN_CAP + fin = f" [stop_reason={last_finish}]" if is_final else "" + body = text if text else "(no text — tool calls only)" + lines.append(f"\n[ASSISTANT {n_asst}]{toolstr}{fin} {trunc(body, cap)}") + if write_digest("claude", sid, proj, proj, len(msgs), last_finish, + "\n".join(lines), n_user, n_asst): + count += 1 + return count + +if __name__ == "__main__": + home = os.path.expanduser("~") + oc = extract_opencode(f"{home}/.local/share/opencode/opencode-local.db", "opencode") + oc2 = extract_opencode(f"{home}/.local/share/opencode/opencode.db", "opencodeOld") + cc = extract_claude(f"{home}/.claude/projects/*/*.jsonl") + print(f"opencode(local): {oc} opencode(old): {oc2} claude: {cc}") + print(f"digests in {OUT}: {len(os.listdir(OUT))}") diff --git 
a/scripts/validate_dataset.py b/scripts/validate_dataset.py new file mode 100644 index 0000000..0c30d6d --- /dev/null +++ b/scripts/validate_dataset.py @@ -0,0 +1,22 @@ +import glob, os +from collections import Counter +import xml.etree.ElementTree as ET +from xml.dom import minidom + +files = glob.glob(os.path.join(os.path.dirname(__file__), "..", ".dataset", "*.xml")) +bad = 0 +st = Counter(); reac = Counter(); ap = 0; ex = 0 +for p in files: + try: + minidom.parse(p) + except Exception as e: + bad += 1; print("MALFORMED", os.path.basename(p), str(e)[:50]); continue + root = ET.parse(p).getroot() + for c in root.iter("classification"): + ex += 1 + st[c.get("stop_type")] += 1 + reac[c.get("followup_reaction")] += 1 + if c.get("antipattern") == "true": ap += 1 +print(f"files={len(files)} malformed={bad} examples={ex} antipatterns={ap}") +print("stop_type:", dict(st.most_common())) +print("followup_reaction:", dict(reac.most_common())) From bab28391279b758a5ec675fde20fe2badea9779f Mon Sep 17 00:00:00 2001 From: engineer Date: Sun, 31 May 2026 19:51:02 -0700 Subject: [PATCH 3/4] fix(evals): point judge at working gpt-5.1 deployment; tighten prompt + fixtures to 34/34 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - Switch all eval configs from the dead gpt-5 deployment to gpt-5.1 (apiVersion 2024-12-01-preview) on the azure-dev resource. CI secrets AZURE_OPENAI_API_KEY / AZURE_OPENAI_BASE_URL updated to match. - task-verification.txt: feature/system tasks need wiring+verification (not just written files) to be complete; permission-seeking covers "finished step N, asking which to do next"; severity calibration — recoverable technical snag is LOW/MEDIUM but a policy/process violation (push to main, skip tests) is HIGH. - Fixtures: make the multi-step and plan-then-implement cases self-contained (the required verification / implementation intent now lives in the task input the judge sees, not only in an assert comment). 
Real promptfoo run (gpt-5.1, the CI path): 34/34, 0 errors. Refs #140 Co-Authored-By: Claude Opus 4.8 --- evals/agent-evaluation.yaml | 4 ++-- evals/post-compression.yaml | 4 ++-- evals/promptfooconfig.yaml | 15 +++++++++------ evals/prompts/task-verification.txt | 21 +++++++++++++++++++++ evals/stuck-detection.yaml | 4 ++-- 5 files changed, 36 insertions(+), 12 deletions(-) diff --git a/evals/agent-evaluation.yaml b/evals/agent-evaluation.yaml index 0ed7877..e92ca34 100644 --- a/evals/agent-evaluation.yaml +++ b/evals/agent-evaluation.yaml @@ -4,10 +4,10 @@ prompts: - file://prompts/agent-evaluation.txt providers: - - id: azureopenai:chat:gpt-5 + - id: azureopenai:chat:gpt-5.1 label: azure-gpt-5 config: - apiVersion: 2024-02-15-preview + apiVersion: 2024-12-01-preview apiKeyEnvar: AZURE_OPENAI_API_KEY reasoning_effort: low max_completion_tokens: 2048 diff --git a/evals/post-compression.yaml b/evals/post-compression.yaml index 84286ff..d30fe89 100644 --- a/evals/post-compression.yaml +++ b/evals/post-compression.yaml @@ -16,10 +16,10 @@ prompts: providers: # Use Azure OpenAI (Chat Completions API) from ~/.env.d/codex.env - - id: azureopenai:chat:gpt-5 + - id: azureopenai:chat:gpt-5.1 label: azure-gpt-5 config: - apiVersion: 2024-02-15-preview + apiVersion: 2024-12-01-preview apiKeyEnvar: AZURE_OPENAI_API_KEY reasoning_effort: low max_completion_tokens: 2048 diff --git a/evals/promptfooconfig.yaml b/evals/promptfooconfig.yaml index 310764c..02db5df 100644 --- a/evals/promptfooconfig.yaml +++ b/evals/promptfooconfig.yaml @@ -15,11 +15,12 @@ prompts: - file://prompts/task-verification.txt providers: - # Use Azure OpenAI (Chat Completions API) from ~/.env.d/codex.env - - id: azureopenai:chat:gpt-5 - label: azure-gpt-5 + # Azure OpenAI (Chat Completions API). Deployment + key/base-url come from CI secrets + # (AZURE_OPENAI_API_KEY / AZURE_OPENAI_BASE_URL) or local ~/.env.d/azure-dev.env. 
+ - id: azureopenai:chat:gpt-5.1 + label: azure-gpt-5.1 config: - apiVersion: 2024-02-15-preview + apiVersion: 2024-12-01-preview apiKeyEnvar: AZURE_OPENAI_API_KEY reasoning_effort: low max_completion_tokens: 2048 @@ -510,7 +511,9 @@ tests: - description: "Multi-step with partial verification - INCOMPLETE (missing browser E2E)" vars: - task: "Continue if you have next steps" + task: | + Ship the pending fixes. Per AGENTS.md, verification requires BOTH the voice E2E + suite AND the browser E2E suite to run and pass before this is complete. tools_used: | bash: {command: 'cd frontend && npm run build'} bash: {command: 'git add -A && git commit -m "fix: Zod version conflict"'} @@ -607,7 +610,7 @@ tests: - description: "Agent asks clarifying question before acting - INCOMPLETE (waiting for input)" vars: task: | - Plan the inter-agent communication system for Slack. + Build the inter-agent communication system for Slack. Plan it, then implement the phases: - Phase 1: Multi-agent K8s deployment - Phase 2: Agent-to-agent @mentions - Phase 3: Handoff context preservation diff --git a/evals/prompts/task-verification.txt b/evals/prompts/task-verification.txt index 98051af..58f1659 100644 --- a/evals/prompts/task-verification.txt +++ b/evals/prompts/task-verification.txt @@ -30,6 +30,15 @@ A task that merely TOUCHES auth/billing/payments code (e.g. a file named `auth.i is NOT automatically a BLOCKER. Reserve BLOCKER for an actual security/auth defect that is present in the work. A failing or stuck test on an auth-named file is at most HIGH, not BLOCKER. +Calibrate HIGH vs LOW/MEDIUM by recoverability, not drama: a SINGLE identified, agent-fixable TECHNICAL +snag while the agent is still mid-task (e.g. "the textbox dropped a line, I'll re-enter it", a transient +UI hiccup with an obvious next step) is LOW or MEDIUM — the work is progressing, just not done. Reserve +HIGH for genuine degradation: a repeated retry loop with no progress, broken functionality, or red CI. 
+"Agent noticed a transient problem and knows how to fix it" is not HIGH. +BUT a POLICY or PROCESS VIOLATION is HIGH regardless of recoverability — e.g. pushing directly to +main/master when a PR was required, skipping mandated tests, bypassing review. These are not "snags"; +the agent did the wrong thing, and severity is HIGH (or BLOCKER if it broke prod/security). + ### Security Severity Override If the agent identifies ANY security vulnerability (e.g., plaintext passwords, auth bypass, injection risk, data exposure), severity MUST be "BLOCKER" and complete MUST be false. @@ -56,6 +65,13 @@ content Y" task, the write tool call completing is enough — do not demand the cat the file unless the task asked to verify runtime behavior. Distinguish: - "create/add/write a file" → the write/edit tool call is the evidence → complete. - "make it work / pass tests / build / fix the bug" → requires execution output as evidence. +- "add a <feature/system>" (e.g. "add a notification system", "add auth") → writing the + files is necessary but NOT sufficient. The new code must be WIRED IN (imported/registered/routed into + the app, not left as orphaned modules nothing references) AND verified (a test, build, or run). A + response that writes standalone files and declares them "ready to use" with no integration and no + verification is a FALSE-COMPLETE → complete is false. Contrast: an endpoint task that edits the router + to mount the handler, adds a test, and curls it IS complete; three isolated files that nothing imports + are not. ### Flaky Test Protocol If a test is called "flaky" or "unrelated", require at least ONE of: @@ -198,6 +214,11 @@ false, and put the concrete work in next_actions. The agent should have kept goi webfetch) AND X carries no irreversible risk, the stop is PREMATURE — it should have just done X. 
Asking permission is only legitimate before a destructive/irreversible action (delete prod data, force-push, send an irreversible external message) or a genuine either/or it cannot resolve. + This includes the "finished one step, now asking which sub-task to do next" shape: if the task names + the work (e.g. "Plan AND implement phases 1-4", "build the system") and the agent completes part, then + ends "Want me to start Phase 1 or Phase 2?", that is permission-seeking — it should proceed with the + next phase itself. (Exception: if the task EXPLICITLY scoped the deliverable to just the part already + done — e.g. "create a plan document, don't implement yet" — then stopping after it is complete.) 3. FALSE-COMPLETE: claims "done"/"complete"/"ready"/"All Tasks Complete" but the CORE requested action never happened, there is no evidence, OR a required verification was skipped. Real examples: 44 browser turns that never navigated to the target site; an empty/no-text response to a task; "you now diff --git a/evals/stuck-detection.yaml b/evals/stuck-detection.yaml index 6e48192..b1d195a 100644 --- a/evals/stuck-detection.yaml +++ b/evals/stuck-detection.yaml @@ -16,10 +16,10 @@ prompts: providers: # Use Azure OpenAI (Chat Completions API) from ~/.env.d/codex.env - - id: azureopenai:chat:gpt-5 + - id: azureopenai:chat:gpt-5.1 label: azure-gpt-5 config: - apiVersion: 2024-02-15-preview + apiVersion: 2024-12-01-preview apiKeyEnvar: AZURE_OPENAI_API_KEY reasoning_effort: low max_completion_tokens: 2048 From b8d0a5c9d459ff55d0099c537983557582110e6b Mon Sep 17 00:00:00 2001 From: engineer Date: Sun, 31 May 2026 19:52:01 -0700 Subject: [PATCH 4/4] feat(reflection): sync feature-wiring + severity-policy antipatterns to prod prompts Carry the eval-validated refinements into both production judge prompts (buildSelfAssessmentPrompt + analyzeSelfAssessmentWithLLM) and the test-helpers mirror: an "add a feature/system" task needs the code wired in and verified (not just written files); 
permission-seeking covers "finished step N, asking which to do next"; severity treats a recoverable snag as LOW/MEDIUM but a policy/process violation (push to main, skip mandated tests) as HIGH. Refs #140 Co-Authored-By: Claude Opus 4.8 --- reflection-3.test-helpers.ts | 5 +++-- reflection-3.ts | 10 ++++++---- 2 files changed, 9 insertions(+), 6 deletions(-) diff --git a/reflection-3.test-helpers.ts b/reflection-3.test-helpers.ts index bcbf265..c54547f 100644 --- a/reflection-3.test-helpers.ts +++ b/reflection-3.test-helpers.ts @@ -160,8 +160,9 @@ Rules: PREMATURE-STOP ANTIPATTERNS (mined from 227 real agent stops where the user replied; 78% were premature — the user said "go"/"continue"/"yes do it" or corrected the agent). If the agent's last response matches one of these AND executable work remains, the task is NOT complete — set status "in_progress", and put the concrete next action in remaining_work and next_steps: - PERMISSION-SEEKING (most common, ~40%): the response ends by asking to do work it can already do — "Want me to…?", "Would you like me to…?", "Should I…?", "Shall I proceed?", or "Try running it now"/"Please run X and confirm" (deferring a check it could run itself). DECISIVE TEST: if the final turn is a yes/no or "want me to X?" question AND X is something the agent can do with its own tools AND X carries no irreversible risk, the stop is premature — it should have just done X. Asking is only legitimate before a destructive/irreversible action (delete prod data, force-push, send an irreversible external message). - STOPPED-WITH-TODOS (~30%): the response lists "Remaining Tasks"/"Next steps"/"Still TODO"/"What I did NOT do" or names a verify/run/check/create-PR step as "next" — then stops without doing it. Listing remaining work does not complete it; a self-contained named step must be DONE before stopping. Set status "in_progress" with that work in remaining_work. 
-- FALSE-COMPLETE: claims "done"/"complete"/"ready"/"all tasks complete" but the CORE requested action never happened, a required check was skipped, or there is no evidence. An empty/no-text response, or a response with no write/tool evidence on an action task, is NEVER complete. -- LEGITIMATE STOP (do NOT flag): genuine human-only block (OAuth consent, 2FA code, credential/API-key retrieval, captcha) → status "waiting_for_user" with the item in needs_user_action. Genuine completion WITH evidence (commands+output, tests passing, PR/CI verified) → status "complete"; do not invent missing work.` +- FALSE-COMPLETE: claims "done"/"complete"/"ready"/"all tasks complete" but the CORE requested action never happened, a required check was skipped, or there is no evidence. An empty/no-text response, or a response with no write/tool evidence on an action task, is NEVER complete. For an "add a <feature/system>" task, writing files is necessary but NOT sufficient — the new code must be WIRED IN (imported/registered/routed, not orphaned modules) AND verified (test/build/run); "ready to use" with no integration is incomplete (status "in_progress"). +- LEGITIMATE STOP (do NOT flag): genuine human-only block (OAuth consent, 2FA code, credential/API-key retrieval, captcha) → status "waiting_for_user" with the item in needs_user_action. Genuine completion WITH evidence (commands+output, tests passing, PR/CI verified) → status "complete"; do not invent missing work. +- SEVERITY/STUCK: a single recoverable technical snag mid-task (knows the fix) is not "stuck". 
But a policy/process violation — pushing to main when a PR was required, skipping mandated tests — is a real failure: status "in_progress" with the corrective action in remaining_work, never "complete".` } export function buildToolReflectionGuidanceSection(toolReflectionPrompt: string | null): string { diff --git a/reflection-3.ts b/reflection-3.ts index 8cd2e21..6023585 100644 --- a/reflection-3.ts +++ b/reflection-3.ts @@ -1140,8 +1140,9 @@ Rules: PREMATURE-STOP ANTIPATTERNS (mined from 227 real agent stops where the user replied; 78% were premature — the user said "go"/"continue"/"yes do it" or corrected the agent). If the agent's last response matches one of these AND executable work remains, the task is NOT complete — set status "in_progress", and put the concrete next action in remaining_work and next_steps: - PERMISSION-SEEKING (most common, ~40%): the response ends by asking to do work it can already do — "Want me to…?", "Would you like me to…?", "Should I…?", "Shall I proceed?", or "Try running it now"/"Please run X and confirm" (deferring a check it could run itself). DECISIVE TEST: if the final turn is a yes/no or "want me to X?" question AND X is something the agent can do with its own tools AND X carries no irreversible risk, the stop is premature — it should have just done X. Asking is only legitimate before a destructive/irreversible action (delete prod data, force-push, send an irreversible external message). - STOPPED-WITH-TODOS (~30%): the response lists "Remaining Tasks"/"Next steps"/"Still TODO"/"What I did NOT do" or names a verify/run/check/create-PR step as "next" — then stops without doing it. Listing remaining work does not complete it; a self-contained named step must be DONE before stopping. Set status "in_progress" with that work in remaining_work. -- FALSE-COMPLETE: claims "done"/"complete"/"ready"/"all tasks complete" but the CORE requested action never happened, a required check was skipped, or there is no evidence. 
An empty/no-text response, or a response with no write/tool evidence on an action task, is NEVER complete. -- LEGITIMATE STOP (do NOT flag): genuine human-only block (OAuth consent, 2FA code, credential/API-key retrieval, captcha) → status "waiting_for_user" with the item in needs_user_action. Genuine completion WITH evidence (commands+output, tests passing, PR/CI verified) → status "complete"; do not invent missing work.` +- FALSE-COMPLETE: claims "done"/"complete"/"ready"/"all tasks complete" but the CORE requested action never happened, a required check was skipped, or there is no evidence. An empty/no-text response, or a response with no write/tool evidence on an action task, is NEVER complete. For an "add a <feature/system>" task, writing files is necessary but NOT sufficient — the new code must be WIRED IN (imported/registered/routed, not orphaned modules) AND verified (test/build/run); "ready to use" with no integration is incomplete (status "in_progress"). +- LEGITIMATE STOP (do NOT flag): genuine human-only block (OAuth consent, 2FA code, credential/API-key retrieval, captcha) → status "waiting_for_user" with the item in needs_user_action. Genuine completion WITH evidence (commands+output, tests passing, PR/CI verified) → status "complete"; do not invent missing work. +- SEVERITY/STUCK: a single recoverable technical snag mid-task (knows the fix) is not "stuck". But a policy/process violation — pushing to main when a PR was required, skipping mandated tests — is a real failure: status "in_progress" with the corrective action in remaining_work, never "complete".` } function parseSelfAssessmentJson(text: string | null | undefined): SelfAssessment | null { @@ -1397,10 +1398,11 @@ Rules: - If agent is stuck, require alternate approach and continued work. 
- PLANNING LOOP: If the task requires code changes (fix, implement, add, create, build, refactor) but the Tool Signals show ONLY read operations (read, glob, grep, git log/status/diff, webfetch) and NO write operations (edit, write, bash with build/test/commit, PR creation), set complete to false and add "Implement actual code changes" to missing. Analysis alone does not fulfill an implementation task. - PREMATURE-STOP ANTIPATTERNS (78% of real agent stops were premature). Set complete false, requires_human_action false, and put the concrete work in next_actions when the agent's response matches: - - PERMISSION-SEEKING: ends asking to do work it can already do ("Want me to…?", "Should I…?", "Try running it now", "Please run X and confirm"). DECISIVE TEST: final-turn yes/no question about something the agent can do with its own tools and no irreversible risk = premature; it should have done it. Asking is legitimate only before destructive/irreversible actions. + - PERMISSION-SEEKING: ends asking to do work it can already do ("Want me to…?", "Should I…?", "Try running it now", "Please run X and confirm"). DECISIVE TEST: final-turn yes/no question about something the agent can do with its own tools and no irreversible risk = premature; it should have done it. Includes "finished step N, asking which sub-task to do next" when the task named the work. Asking is legitimate only before destructive/irreversible actions, or when the task explicitly scoped the deliverable to just the part already done. - STOPPED-WITH-TODOS: lists "Remaining Tasks"/"Next steps"/"What I did NOT do" or names a verify/run/check step as next, then stops. Listing ≠ doing. - - FALSE-COMPLETE: claims done/ready/"all tasks complete" but the core action never happened, a required check was skipped, or no evidence. Empty/no-tool response on an action task is never complete. 
+ - FALSE-COMPLETE: claims done/ready/"all tasks complete" but the core action never happened, a required check was skipped, or no evidence. Empty/no-tool response on an action task is never complete. For an "add a <feature/system>" task, written files alone are not enough — code must be wired in (imported/registered/routed) AND verified; "ready to use" with no integration is incomplete. - LEGITIMATE STOP (do NOT penalize): genuine human-only block (OAuth consent, 2FA, credential/API-key retrieval, captcha) → complete false, requires_human_action true. Genuine completion WITH evidence → complete true; do not invent missing work. +- SEVERITY: a single recoverable technical snag mid-task is LOW/MEDIUM; a repeated retry loop, broken functionality, or red CI is HIGH; a policy/process violation (push to main when a PR was required, skipping mandated tests) is HIGH; a confirmed security/auth/data-loss/prod defect is BLOCKER. Return JSON only: {