Reflection prompt: premature-stop antipatterns mined from real sessions by dzianisv · Pull Request #141 · dzianisv/opencode-plugins

dzianisv · 2026-06-01T02:19:07Z

Summary

The production reflection judge prompts in reflection-3.ts (buildSelfAssessmentPrompt and analyzeSelfAssessmentWithLLM) now carry stop/idle patterns and antipatterns mined from 227 real agent stops where the user replied. Of those stops, 78% were premature (permission-seeking ~40%, stopped-with-todos ~30%).
Each antipattern is mapped to that prompt's own JSON schema (status/stuck vs complete/severity), with a legitimate-stop carve-out (OAuth/2FA/credential/captcha → waiting_for_user) so that stops genuinely blocked on a human aren't nagged.
Adds the session-mining + judge eval harness used to derive and validate the change.
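
The carve-out in the summary can be pictured as a small routing step. In the PR this rule lives in prompt text that the judge model follows, so the following is only a minimal sketch of the same idea in code. The `StopKind` labels, the `Verdict` shape, and `routeStop` are hypothetical names chosen for illustration; only the `waiting_for_user` status and the four human-only blockers (OAuth, 2FA, credential, captcha) come from the description above.

```typescript
// Hypothetical labels for why an agent stopped; not taken from reflection-3.ts.
type StopKind =
  | "permission_seeking"
  | "stopped_with_todos"
  | "oauth"
  | "two_factor"
  | "credential"
  | "captcha"
  | "finished";

// Hypothetical verdict shape; only "waiting_for_user" is named in the PR.
interface Verdict {
  complete: boolean;
  status: "complete" | "incomplete" | "waiting_for_user";
}

// Blockers that only a human can clear (the legitimate-stop carve-out).
const HUMAN_ONLY: ReadonlySet<StopKind> = new Set<StopKind>([
  "oauth",
  "two_factor",
  "credential",
  "captcha",
]);

function routeStop(kind: StopKind): Verdict {
  if (kind === "finished") {
    return { complete: true, status: "complete" };
  }
  if (HUMAN_ONLY.has(kind)) {
    // Genuine human block: wait instead of nagging the agent to continue.
    return { complete: false, status: "waiting_for_user" };
  }
  // Every other stop is treated as premature.
  return { complete: false, status: "incomplete" };
}
```

Checking the carve-out before the premature-stop fallthrough is the point of the design: a captcha stop and a permission-seeking stop both leave work unfinished, but only the second should trigger a nudge to keep going.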

Closes

Closes #140

Test plan

npx jest test/reflection-3.unit.test.ts -> 106/106 (incl. new premature-stop assertion)
npx jest test/reflection.test.ts -> 80/80
npx tsc --noEmit clean
Judge eval loop (3x majority, 34 cases) -> 33/34; genuine cases #00/#1 complete-unanimous, premature-stop cases correctly incomplete. The one miss (#21) is a documented underspecified-test defect (required browser-E2E not present in the judge-visible input), not a prompt issue.
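
The "3x majority" scoring in the last item can be sketched as below. This is an illustrative helper, not the code in eval_score.mjs: the `Vote` type and both function names are assumptions, and how the actual harness handles ties or missing runs is not stated in the PR.

```typescript
// Hypothetical per-run judge outcome.
type Vote = "complete" | "incomplete";

// Return the label chosen by more than half of the runs, or null on a tie
// (including the empty case).
function majorityVote(votes: readonly Vote[]): Vote | null {
  const completeCount = votes.filter((v) => v === "complete").length;
  const incompleteCount = votes.length - completeCount;
  if (completeCount === incompleteCount) return null;
  return completeCount > incompleteCount ? "complete" : "incomplete";
}

// True only when every run agreed, as in the "complete-unanimous" cases.
function isUnanimous(votes: readonly Vote[]): boolean {
  return votes.length > 0 && votes.every((v) => v === votes[0]);
}
```

With an odd number of runs such as three, a tie cannot occur, which is one reason to prefer an odd repeat count for a binary verdict.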

Notes

reflection-3.test-helpers.ts (the unit-test source) had drifted from reflection-3.ts -- mirrored the antipatterns there too.
External eval providers are dead, so the judge eval runs through harness agents rather than promptfoo's external call.
.dataset/ (mined sessions, may contain personal data) is git-ignored.
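
The drift between reflection-3.ts and its test-helpers mirror is the kind of problem a presence check can catch. The sketch below is an assumption about how such a guard might look, not the assertion added in test/reflection-3.unit.test.ts: `missingGuidance`, the `REQUIRED` phrases, and both sample prompt strings are made up for illustration.

```typescript
// Return the required phrases that a prompt does not contain (case-insensitive).
function missingGuidance(prompt: string, required: readonly string[]): string[] {
  const haystack = prompt.toLowerCase();
  return required.filter((phrase) => !haystack.includes(phrase.toLowerCase()));
}

// Illustrative phrases only; the real assertion's wording is not shown in the PR.
const REQUIRED: readonly string[] = ["permission-seeking", "waiting_for_user"];

// Made-up stand-ins for the production prompt and a drifted mirror of it.
const productionPrompt =
  "Flag permission-seeking stops; OAuth blocks map to waiting_for_user.";
const driftedMirror = "Flag permission-seeking stops.";

// The production stand-in has nothing missing; the drifted mirror lacks one phrase.
console.log(missingGuidance(productionPrompt, REQUIRED));
console.log(missingGuidance(driftedMirror, REQUIRED));
```

Running the same check against both the production prompt and the mirror would make the two fail together if either one loses the guidance.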

🤖 Generated with Claude Code

…prompt

Port patterns/antipatterns mined from 227 real agent stops (78% premature: permission-seeking ~40%, stopped-with-todos ~30%) into both production judge prompts in reflection-3.ts (buildSelfAssessmentPrompt + analyzeSelfAssessmentWithLLM), each mapped to its own JSON schema. Mirror into reflection-3.test-helpers.ts (the unit-test source, which had drifted) and add a unit assertion that the antipattern guidance is present. Includes a legitimate-stop carve-out so genuine human-only blocks route to waiting_for_user instead of nagging. Validated via the judge eval loop: 33/34, genuine cases (#00/#1) complete-unanimous, premature-stop cases correctly incomplete.

Refs #140
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Add the harness used to mine and validate the antipatterns:
- scripts/extract_sessions.py, build_dataset.py, export_stop_candidates.py, validate_dataset.py: read local opencode/claude sessions into a turn-level labeled dataset (.dataset/, git-ignored; may contain personal data).
- scripts/eval_render.py + eval_score.mjs + judge-eval.wf.js workflow: replay the 34 promptfoo judge cases through the prompt with 3x majority-vote and score against the original promptfoo assertions (external providers are dead, so the judge runs through harness agents).
- evals/prompts/task-verification.txt: tuned eval-mirror with the same antipatterns + evidence-rule refinements.

Refs #140
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… + fixtures to 34/34

- Switch all eval configs from the dead gpt-5 deployment to gpt-5.1 (apiVersion 2024-12-01-preview) on the azure-dev resource. CI secrets AZURE_OPENAI_API_KEY / AZURE_OPENAI_BASE_URL updated to match.
- task-verification.txt: feature/system tasks need wiring+verification (not just written files) to be complete; permission-seeking covers "finished step N, asking which to do next"; severity calibration: a recoverable technical snag is LOW/MEDIUM, but a policy/process violation (push to main, skip tests) is HIGH.
- Fixtures: make the multi-step and plan-then-implement cases self-contained (the required verification / implementation intent now lives in the task input the judge sees, not only in an assert comment).

Real promptfoo run (gpt-5.1, the CI path): 34/34, 0 errors.

Refs #140
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…to prod prompts

Carry the eval-validated refinements into both production judge prompts (buildSelfAssessmentPrompt + analyzeSelfAssessmentWithLLM) and the test-helpers mirror: an "add a feature/system" task needs the code wired in and verified (not just written files); permission-seeking covers "finished step N, asking which to do next"; severity treats a recoverable snag as LOW/MEDIUM but a policy/process violation (push to main, skip mandated tests) as HIGH.

Refs #140
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
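
The severity calibration above is expressed in the prompts as guidance for the judge model. As a rough illustration of the rule only, it could be written as the function below. `IssueKind`, `calibrateSeverity`, and the `blocksProgress` flag are hypothetical; in particular, how the prompts choose between LOW and MEDIUM for a recoverable snag is not stated in the PR, so that split is an assumption.

```typescript
type Severity = "LOW" | "MEDIUM" | "HIGH";

// Hypothetical issue kinds; the commit message names these two classes only in prose.
type IssueKind = "recoverable_snag" | "policy_violation";

function calibrateSeverity(kind: IssueKind, blocksProgress: boolean): Severity {
  // Policy/process violations (push to main, skip mandated tests) are always HIGH.
  if (kind === "policy_violation") return "HIGH";
  // Assumption: a recoverable snag is MEDIUM only when it currently blocks progress.
  return blocksProgress ? "MEDIUM" : "LOW";
}
```

The asymmetry is deliberate in the source rule: a technical snag can be retried, while a process violation cannot be undone by retrying.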

engineer and others added 4 commits May 31, 2026 19:17

dzianisv merged commit c4dd7bc into main Jun 1, 2026
2 checks passed

dzianisv deleted the own/140-reflection-antipatterns branch June 1, 2026 02:53

dzianisv mentioned this pull request Jun 1, 2026

Improve reflection judge prompt with antipatterns mined from real sessions + eval loop #140

Closed
