This repository was archived by the owner on Jun 8, 2026. It is now read-only.

Reflection prompt: premature-stop antipatterns mined from real sessions #141

Merged
dzianisv merged 4 commits into main from own/140-reflection-antipatterns
Jun 1, 2026

@dzianisv dzianisv commented Jun 1, 2026

Owner

Summary

  • Production reflection judge prompts in reflection-3.ts (buildSelfAssessmentPrompt + analyzeSelfAssessmentWithLLM) now carry stop/idle patterns + antipatterns mined from 227 real agent stops where the user replied — 78% were premature (permission-seeking ~40%, stopped-with-todos ~30%).
  • Each antipattern is mapped to that prompt's own JSON schema (status/stuck vs complete/severity), with a legitimate-stop carve-out (OAuth/2FA/credential/captcha → waiting_for_user) so genuine human blocks aren't nagged.
  • Adds the session-mining + judge eval harness used to derive and validate the change.
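The mapping in the second bullet above can be pictured as a small decision table. This is only a sketch: in the PR the logic lives in prompt text, not in code, and the `StopKind`, `Verdict`, and `classifyStop` names, along with the `"complete"` and `"incomplete"` status values, are hypothetical. Only `waiting_for_user` and the antipattern names come from the description.

```typescript
// Hypothetical sketch: none of these types exist in reflection-3.ts.
type StopKind =
  | "permission_seeking" // asks whether to continue while work remains
  | "stopped_with_todos" // lists remaining TODOs and then stops
  | "human_only_block"   // OAuth / 2FA / credential / captcha
  | "finished";          // task done and verified

interface Verdict {
  complete: boolean;
  status: "complete" | "incomplete" | "waiting_for_user"; // first two are assumed values
  shouldNudge: boolean; // whether the agent should be pushed to keep going
}

function classifyStop(kind: StopKind): Verdict {
  switch (kind) {
    case "human_only_block":
      // Legitimate-stop carve-out: only a human can unblock this, so do not nag.
      return { complete: false, status: "waiting_for_user", shouldNudge: false };
    case "permission_seeking":
    case "stopped_with_todos":
      // Premature stops: the task is not done and the agent could continue alone.
      return { complete: false, status: "incomplete", shouldNudge: true };
    case "finished":
      return { complete: true, status: "complete", shouldNudge: false };
  }
}
```

The point of keeping the carve-out as its own branch is that a human-only block is incomplete without being premature, so it must not fall into the same bucket as permission-seeking.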

Closes

Closes #140

Test plan

Notes

  • reflection-3.test-helpers.ts (the unit-test source) had drifted from reflection-3.ts -- mirrored the antipatterns there too.
  • External eval providers are dead, so the judge eval runs through harness agents rather than promptfoo's external call.
  • .dataset/ (mined sessions, may contain personal data) is git-ignored.
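One way to catch the kind of drift mentioned in the first note is a check that every prompt copy still contains the expected guidance. The sketch below is an assumption, not the assertion added in this PR: the marker strings and the `missingMarkers` helper are hypothetical, and the two prompt strings are stand-ins for the real prompt text, which is not shown here.

```typescript
// Hypothetical marker strings; the real prompt wording is not shown in this PR.
const REQUIRED_MARKERS: string[] = ["permission-seeking", "waiting_for_user"];

// Returns the markers that do NOT appear in the prompt (case-insensitive).
function missingMarkers(prompt: string, markers: string[] = REQUIRED_MARKERS): string[] {
  const lower = prompt.toLowerCase();
  return markers.filter((m) => lower.indexOf(m.toLowerCase()) === -1);
}

// Stand-ins for the production prompt and its test-helpers mirror.
const prodPrompt = "Antipatterns: permission-seeking. Human-only blocks route to waiting_for_user.";
const driftedMirror = "Antipatterns: permission-seeking.";

const prodMissing = missingMarkers(prodPrompt);       // nothing missing
const mirrorMissing = missingMarkers(driftedMirror);  // the carve-out marker is missing
```

Running the same check against both the production source and the mirror makes a one-sided edit fail loudly instead of drifting silently.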

🤖 Generated with Claude Code

engineer and others added 4 commits May 31, 2026 19:17
…prompt

Port patterns/antipatterns mined from 227 real agent stops (78% premature:
permission-seeking ~40%, stopped-with-todos ~30%) into both production judge
prompts in reflection-3.ts (buildSelfAssessmentPrompt +
analyzeSelfAssessmentWithLLM), each mapped to its own JSON schema. Mirror into
reflection-3.test-helpers.ts (the unit-test source, which had drifted) and add
a unit assertion that the antipattern guidance is present. Includes a
legitimate-stop carve-out so genuine human-only blocks route to
waiting_for_user instead of nagging.

Validated via the judge eval loop: 33/34; genuine cases (#00/#1) unanimously
judged complete, premature-stop cases correctly judged incomplete.

Refs #140

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Add the harness used to mine and validate the antipatterns:
- scripts/extract_sessions.py, build_dataset.py, export_stop_candidates.py,
  validate_dataset.py — read local opencode/claude sessions into a turn-level
  labeled dataset (.dataset/, git-ignored; may contain personal data).
- scripts/eval_render.py + eval_score.mjs + judge-eval.wf.js workflow — replay
  the 34 promptfoo judge cases through the prompt with 3x majority-vote and
  score against the original promptfoo assertions (external providers are dead,
  so the judge runs through harness agents).
- evals/prompts/task-verification.txt — tuned eval-mirror with the same
  antipatterns + evidence-rule refinements.
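The 3x majority vote used in the judge replay can be sketched as follows. This is an illustration under stated assumptions, not the code in judge-eval.wf.js or eval_score.mjs: the `JudgeVerdict` type and both helper names are hypothetical, and resolving a tie toward `"incomplete"` is a choice made here that only matters for an even number of votes.

```typescript
// Hypothetical two-label verdict; the real judge output schema is richer.
type JudgeVerdict = "complete" | "incomplete";

// Returns the verdict held by a strict majority of votes.
// A tie (possible only with an even vote count) falls back to "incomplete".
function majorityVote(votes: JudgeVerdict[]): JudgeVerdict {
  if (votes.length === 0) {
    throw new Error("majorityVote needs at least one vote");
  }
  let completeCount = 0;
  for (const v of votes) {
    if (v === "complete") completeCount++;
  }
  return completeCount * 2 > votes.length ? "complete" : "incomplete";
}

// True when every judge run returned the same verdict.
function isUnanimous(votes: JudgeVerdict[]): boolean {
  return votes.length > 0 && votes.every((v) => v === votes[0]);
}
```

Three runs per case keep a single noisy judge response from flipping the score, while the unanimity check separates cases the judge finds easy from ones it is split on.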

Refs #140

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
… + fixtures to 34/34

- Switch all eval configs from the dead gpt-5 deployment to gpt-5.1
  (apiVersion 2024-12-01-preview) on the azure-dev resource. CI secrets
  AZURE_OPENAI_API_KEY / AZURE_OPENAI_BASE_URL updated to match.
- task-verification.txt: feature/system tasks need wiring+verification (not just
  written files) to be complete; permission-seeking covers "finished step N, asking
  which to do next"; severity calibration — recoverable technical snag is LOW/MEDIUM
  but a policy/process violation (push to main, skip tests) is HIGH.
- Fixtures: make the multi-step and plan-then-implement cases self-contained (the
  required verification / implementation intent now lives in the task input the judge
  sees, not only in an assert comment).
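The severity calibration in the task-verification.txt bullet above amounts to a rule that the kind of problem, not how much it disrupts the run, sets the ceiling. A minimal sketch, assuming hypothetical `Issue` and `calibrateSeverity` names; the split between LOW and MEDIUM on `blocksProgress` is invented here for illustration, since the commit only states that recoverable snags are LOW/MEDIUM and policy/process violations are HIGH.

```typescript
type Severity = "LOW" | "MEDIUM" | "HIGH";

// Hypothetical issue shape; the judge reasons over free text, not a struct.
interface Issue {
  kind: "recoverable_snag" | "policy_violation";
  blocksProgress?: boolean; // assumed field, used only to split LOW from MEDIUM
}

function calibrateSeverity(issue: Issue): Severity {
  // Policy/process violations (push to main, skipped tests) are always HIGH.
  if (issue.kind === "policy_violation") {
    return "HIGH";
  }
  // Recoverable technical snags never exceed MEDIUM.
  return issue.blocksProgress ? "MEDIUM" : "LOW";
}
```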

Real promptfoo run (gpt-5.1, the CI path): 34/34, 0 errors.

Refs #140

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…to prod prompts

Carry the eval-validated refinements into both production judge prompts
(buildSelfAssessmentPrompt + analyzeSelfAssessmentWithLLM) and the test-helpers
mirror: an "add a feature/system" task needs the code wired in and verified (not
just written files); permission-seeking covers "finished step N, asking which to
do next"; severity treats a recoverable snag as LOW/MEDIUM but a policy/process
violation (push to main, skip mandated tests) as HIGH.

Refs #140

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@dzianisv dzianisv merged commit c4dd7bc into main Jun 1, 2026
2 checks passed
@dzianisv dzianisv deleted the own/140-reflection-antipatterns branch June 1, 2026 02:53

Development

Successfully merging this pull request may close these issues.

Improve reflection judge prompt with antipatterns mined from real sessions + eval loop