This repository was archived by the owner on Jun 8, 2026. It is now read-only.
Reflection prompt: premature-stop antipatterns mined from real sessions #141
Merged
…prompt

Port patterns/antipatterns mined from 227 real agent stops (78% premature: permission-seeking ~40%, stopped-with-todos ~30%) into both production judge prompts in reflection-3.ts (buildSelfAssessmentPrompt + analyzeSelfAssessmentWithLLM), each mapped to its own JSON schema. Mirror them into reflection-3.test-helpers.ts (the unit-test source, which had drifted) and add a unit assertion that the antipattern guidance is present. Includes a legitimate-stop carve-out so genuine human-only blocks route to waiting_for_user instead of nagging.

Validated via the judge eval loop: 33/34; genuine cases (#00/#1) unanimously complete, premature-stop cases correctly incomplete.

Refs #140
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add the harness used to mine and validate the antipatterns:

- scripts/extract_sessions.py, build_dataset.py, export_stop_candidates.py, validate_dataset.py: read local opencode/claude sessions into a turn-level labeled dataset (.dataset/, git-ignored; may contain personal data).
- scripts/eval_render.py + eval_score.mjs + judge-eval.wf.js workflow: replay the 34 promptfoo judge cases through the prompt with 3x majority-vote and score against the original promptfoo assertions (external providers are dead, so the judge runs through harness agents).
- evals/prompts/task-verification.txt: tuned eval-mirror with the same antipatterns + evidence-rule refinements.

Refs #140
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
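The 3x majority-vote scoring described in this commit can be sketched as follows. This is a minimal illustration, not the code in eval_score.mjs: the verdict labels, the JudgeCase shape, and the function names majorityVerdict and scoreCases are all assumptions made for the example.

```typescript
// Hypothetical verdict labels; the judge's real output schema is not shown in this PR.
type Verdict = "complete" | "incomplete" | "waiting_for_user";

// Hypothetical shape for one eval case: the expected verdict plus the
// verdicts returned by repeated judge runs (3 in the harness described above).
interface JudgeCase {
  id: string;
  expected: Verdict;
  votes: Verdict[];
}

// Return the verdict that strictly more than half of the runs agree on,
// or null when no verdict reaches a strict majority.
function majorityVerdict(votes: Verdict[]): Verdict | null {
  const candidates: Verdict[] = ["complete", "incomplete", "waiting_for_user"];
  for (const candidate of candidates) {
    const count = votes.filter((v) => v === candidate).length;
    if (count * 2 > votes.length) return candidate;
  }
  return null;
}

// A case passes when the majority verdict equals the expected verdict.
// A case with no majority counts as a failure.
function scoreCases(cases: JudgeCase[]): { passed: number; total: number } {
  let passed = 0;
  for (const c of cases) {
    if (majorityVerdict(c.votes) === c.expected) passed++;
  }
  return { passed, total: cases.length };
}
```

With three runs per case, a single noisy judge response cannot flip the result, which is presumably why the harness votes rather than trusting one run.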
… + fixtures to 34/34

- Switch all eval configs from the dead gpt-5 deployment to gpt-5.1 (apiVersion 2024-12-01-preview) on the azure-dev resource. CI secrets AZURE_OPENAI_API_KEY / AZURE_OPENAI_BASE_URL updated to match.
- task-verification.txt: feature/system tasks need wiring+verification (not just written files) to be complete; permission-seeking covers "finished step N, asking which to do next"; severity calibration: a recoverable technical snag is LOW/MEDIUM but a policy/process violation (push to main, skip tests) is HIGH.
- Fixtures: make the multi-step and plan-then-implement cases self-contained (the required verification / implementation intent now lives in the task input the judge sees, not only in an assert comment).

Real promptfoo run (gpt-5.1, the CI path): 34/34, 0 errors.

Refs #140
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…to prod prompts

Carry the eval-validated refinements into both production judge prompts (buildSelfAssessmentPrompt + analyzeSelfAssessmentWithLLM) and the test-helpers mirror:

- an "add a feature/system" task needs the code wired in and verified (not just written files);
- permission-seeking covers "finished step N, asking which to do next";
- severity treats a recoverable snag as LOW/MEDIUM but a policy/process violation (push to main, skip mandated tests) as HIGH.

Refs #140
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
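The severity calibration is a rule written into the judge prompts and applied by the LLM judge, not a function in the codebase. Purely as an illustration, the same rule could be expressed deterministically like this. The IssueKind labels, the Issue shape, the calibrateSeverity name, and the use of a blocksTask flag to split LOW from MEDIUM are all assumptions for the sketch; the source only says a recoverable snag is LOW/MEDIUM and a policy/process violation is HIGH.

```typescript
type Severity = "LOW" | "MEDIUM" | "HIGH";

// Hypothetical issue categories, named here only for illustration.
type IssueKind = "recoverable_snag" | "policy_violation";

interface Issue {
  kind: IssueKind;
  // Assumption: whether the snag currently blocks progress on the task.
  // The source does not say how LOW and MEDIUM are told apart.
  blocksTask: boolean;
}

// Policy/process violations (e.g. push to main, skip mandated tests) are
// always HIGH; recoverable technical snags never exceed MEDIUM.
function calibrateSeverity(issue: Issue): Severity {
  if (issue.kind === "policy_violation") return "HIGH";
  return issue.blocksTask ? "MEDIUM" : "LOW";
}
```

The point of the calibration is the ceiling: a technical snag the agent can recover from should not be escalated to HIGH, while a process violation should never be downgraded.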
Summary
The judge prompts in reflection-3.ts (buildSelfAssessmentPrompt + analyzeSelfAssessmentWithLLM) now carry stop/idle patterns + antipatterns mined from 227 real agent stops where the user replied; 78% were premature (permission-seeking ~40%, stopped-with-todos ~30%).
Closes
Closes #140
Test plan
- npx jest test/reflection-3.unit.test.ts -> 106/106 (incl. new premature-stop assertion)
- npx jest test/reflection.test.ts -> 80/80
- npx tsc --noEmit -> clean
Notes
🤖 Generated with Claude Code