reduce ceremony and make verification/review gates stricter #7

@regenrek

Context

A real Codex /goal dogfood run used Planr to build and deploy a polished Phaser browser game with generated assets, live browser verification, follow-up
fixes, and a Cloudflare deploy.
The retrospective verdict was mixed:
Planr helped the original broad run by providing durable task state, dependency ordering, evidence logs, reviews, and an auditable goal contract. However, it
also added ceremony during smaller polish/debug follow-ups and missed important product-level bugs.
This issue tracks the concrete product gaps found during that run.

What Worked

  • Planr made the original broad goal more reliable than plain Codex /goal.
  • The map turned the broad task into ordered slices.
  • planr pick, planr done, logs, and reviews preserved handoff evidence.
  • planr plan audit <plan-id> --json prevented premature completion for the original goal because the stored goal contract required live browser
    verification.
  • If the session had crashed, Planr would have preserved item state, dependency links, logs, reviews, approvals, context, and screenshot/artifact paths.

Problems Found

1. Verification was not consistently binding across follow-up plans

The original goal audit required live browser evidence because a goal contract existed.
A later follow-up plan could report holds: true even when:

"verification_logged": {
  "pass": false,
  "required": false
}

This creates an unsafe distinction:

  • broad /goal plans can require verification,
  • follow-up plans can appear done without verification unless a new contract is created.

For autonomous work, this is too easy to miss.

2. Browser capability context was not used

The run used Codex Browser, but not because Planr selected it from capability state.
Evidence from the retrospective:

planr context list --tag capability

returned:

{"contexts":[]}

Browser use happened because the user prompt and plan/contract said to use Browser, not because a pinned Planr capability guided the worker or reviewer.
Planr should make capability capture more automatic when the user says things like:

  • use Codex browser plugin
  • use browser-harness
  • use Playwright
  • use iOS simulator
  • use hatch-pet / image generation

3. Reviews could become ceremonial

Some review artifacts had:

Source content included: false

and no findings, yet review gates completed.
That weakens the value of review gates. If a review does not inspect source, changed files, logs, and verification evidence, it should either:

  • close as unclear, or
  • be skipped as low-signal ceremony.

4. Small follow-up polish created too much overhead

For small visual polish/debug loops, Planr sometimes produced too many items and repeated done --review / review close cycles.
This made Planr feel slower than a plain Codex /goal loop for minor UI polish.
Planr needs a lighter follow-up mode for small bug/polish/deploy tasks.

5. Semantic product bugs were not caught

A real product bug escaped Planr review and verification:
The leaderboard submitted bestScore instead of the current failed-run score. The user caught it manually during live use.
Planr’s smoke tests, browser verification, reviews, and audits did not catch this semantic acceptance bug.
The issue is not that Planr should magically know every product bug, but that verification synthesis was too generic. The browser verifier should derive
specific assertions from acceptance criteria.

Proposed Fixes

A. Make autonomous follow-up plans require a goal contract or verification gate

When $planr-loop runs against a plan without a stored goal-contract, it should not silently treat verification as optional.
Possible behavior:

  • If no contract exists, create/store one in iteration 1 using the plan’s verification section.
  • Or fail with a clear instruction to run $planr-goal / create a contract first.
  • Or mark verification_logged.required = true for all plans run by $planr-loop.

Acceptance criteria:

  • A plan executed by $planr-loop cannot audit as complete without required verification evidence.
  • plan audit clearly explains when verification is skipped because no contract exists.
  • Follow-up bug/deploy plans have binding verification by default when run autonomously.
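
The third option above can be sketched as a small audit rule. This is only an illustration under stated assumptions, not Planr's actual implementation: VerificationStatus, audit_holds, run_by_loop, and has_contract are hypothetical names, and only the required/pass fields mirror the verification_logged excerpt quoted earlier.

```python
from dataclasses import dataclass


@dataclass
class VerificationStatus:
    """Mirrors the verification_logged fields; the class itself is hypothetical."""
    required: bool  # did the plan demand verification evidence?
    passed: bool    # was passing evidence logged?


def audit_holds(status: VerificationStatus, run_by_loop: bool,
                has_contract: bool) -> tuple[bool, str]:
    """Hypothetical rule: plans run autonomously always require verification."""
    required = status.required or run_by_loop
    if not required:
        # Say out loud why verification was skipped instead of passing silently.
        return True, "verification skipped: no goal contract and not run autonomously"
    if status.passed:
        return True, "verification evidence logged"
    reason = "verification required but no passing evidence logged"
    if not has_contract:
        reason += " (no goal contract stored; required because the plan ran autonomously)"
    return False, reason


# The follow-up plan from Problem 1: required=false, pass=false, run by the loop.
holds, reason = audit_holds(VerificationStatus(required=False, passed=False),
                            run_by_loop=True, has_contract=False)
print(holds, reason)
```

Under this rule the follow-up plan from Problem 1 no longer reports holds: true, and the returned reason tells the operator why.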

B. Capture requested capabilities during goal prep

$planr-goal should detect user-requested tooling and record it as capability context.
Examples:

planr context add "web verification, Codex browser plugin, invoke via Codex browser tools" --tag capability
planr context add "asset generation, use /hatch-pet for player character graphics" --tag capability

Acceptance criteria:

  • If the user says “use Codex browser plugin”, $planr-goal records a capability context.
  • $planr-verify-web reads planr context list --tag capability before discovery.
  • Reviewers can see which capability was intended and whether it was actually used.
  • Capability contexts are included in summary/audit output where relevant.
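
One possible way to detect requested tooling is plain phrase matching on the user prompt. The sketch below assumes a hypothetical CAPABILITY_HINTS table and capability_commands helper; only the planr context add ... --tag capability command shape and the two Codex browser and hatch-pet descriptions come from the examples above, and the remaining descriptions are invented for illustration.

```python
import shlex

# Hypothetical phrase -> capability description table. Only the first and last
# descriptions come from the issue text; the others are illustrative placeholders.
CAPABILITY_HINTS = {
    "codex browser": "web verification, Codex browser plugin, invoke via Codex browser tools",
    "browser-harness": "web verification, browser-harness",
    "playwright": "web verification, Playwright",
    "ios simulator": "mobile verification, iOS simulator",
    "hatch-pet": "asset generation, use /hatch-pet for player character graphics",
}


def capability_commands(prompt: str) -> list[str]:
    """Return one `planr context add` command per capability phrase in the prompt."""
    lowered = prompt.lower()
    return [
        f"planr context add {shlex.quote(description)} --tag capability"
        for phrase, description in CAPABILITY_HINTS.items()
        if phrase in lowered
    ]


for command in capability_commands("Build the game, use Codex browser plugin and /hatch-pet"):
    print(command)
```

Substring matching is deliberately naive here; a real implementation would likely need to handle synonyms and negations ("do not use Playwright").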

C. Synthesize acceptance-specific verification steps

For web goals, $planr-goal or $planr-verify-web should derive concrete browser assertions from acceptance criteria.
Example for a game:

  • start the game,
  • jump,
  • observe score increasing,
  • collide,
  • verify game over,
  • submit or inspect leaderboard score,
  • restart,
  • verify generated assets are visible.

Acceptance criteria:

  • Browser verification logs include actions and assertions, not only “opened app and checked screenshot.”
  • Verification explicitly covers user-visible acceptance criteria.
  • The verification summary should be strong enough for a reviewer to replay or challenge it.
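
To illustrate why acceptance-specific assertions matter, here is a minimal sketch that replays part of the game flow against a stand-in object. FakeGame, verify_leaderboard, and the buggy flag are all hypothetical; a real verifier would drive the browser rather than an in-memory class. The buggy path reproduces the bestScore submission described in Problem 5.

```python
from dataclasses import dataclass, field


@dataclass
class FakeGame:
    """In-memory stand-in for the game; hypothetical, for illustration only."""
    best_score: int = 120  # best score left over from an earlier run
    score: int = 0
    game_over: bool = False
    submitted: list = field(default_factory=list)

    def jump(self) -> None:
        self.score += 10

    def collide(self) -> None:
        self.game_over = True
        self.best_score = max(self.best_score, self.score)

    def submit_score(self, buggy: bool = False) -> None:
        # The buggy path submits best_score, as in the escaped leaderboard bug.
        self.submitted.append(self.best_score if buggy else self.score)


def verify_leaderboard(game: FakeGame, buggy: bool = False) -> list[tuple[str, bool]]:
    """Run actions and record one named assertion per acceptance criterion."""
    log = []
    game.jump()
    game.jump()
    log.append(("score increases after jumping", game.score > 0))
    game.collide()
    log.append(("collision ends the run", game.game_over))
    game.submit_score(buggy=buggy)
    log.append(("leaderboard receives the current run score",
                game.submitted[-1] == game.score))
    return log


for name, ok in verify_leaderboard(FakeGame(), buggy=True):
    print("PASS" if ok else "FAIL", name)
```

A screenshot-only check passes in both cases; the third assertion fails only on the buggy path, which is the kind of evidence a reviewer can replay or challenge.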

D. Make review gates source-aware or skip them

Improve review discipline so complete means the reviewer actually inspected evidence.
Possible behavior:

  • planr review close --verdict complete warns or fails when the review has no source/evidence inspection.
  • Review artifacts should record:
    • changed files inspected,
    • commands rerun,
    • verification evidence checked,
    • whether source content was included.
  • Low-signal review gates should be avoided by default for trivial polish.

Acceptance criteria:

  • A complete review cannot be indistinguishable from a ceremonial approval.
  • Reviews without source/evidence inspection close as unclear or produce a warning.
  • $planr-loop requests fewer review gates for small low-risk follow-ups.
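
The downgrade behavior could look roughly like the following. ReviewArtifact and effective_verdict are hypothetical names, and the fields simply restate the list of things a review artifact should record; this is a sketch of the idea, not how planr review close works today.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewArtifact:
    """Hypothetical record of what a reviewer actually inspected."""
    changed_files_inspected: list = field(default_factory=list)
    commands_rerun: list = field(default_factory=list)
    verification_evidence_checked: bool = False
    source_content_included: bool = False


def effective_verdict(requested: str, review: ReviewArtifact) -> tuple[str, list[str]]:
    """Downgrade a 'complete' verdict to 'unclear' when inspection evidence is missing."""
    warnings = []
    if not review.source_content_included:
        warnings.append("source content was not included")
    if not review.changed_files_inspected:
        warnings.append("no changed files were inspected")
    if not review.verification_evidence_checked:
        warnings.append("verification evidence was not checked")
    if requested == "complete" and warnings:
        return "unclear", warnings
    return requested, warnings


# A ceremonial review: nothing inspected, yet 'complete' was requested.
print(effective_verdict("complete", ReviewArtifact()))
```

Whether missing evidence should fail the close or only warn is a policy choice; the sketch returns both the downgraded verdict and the warnings so either behavior can be built on top.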

E. Add lightweight follow-up mode

Planr should support small post-goal work without forcing a full heavy planning cycle.
Example prompt:

Use $planr. Add follow-up tasks for two bugs and a deploy, then continue.

Expected behavior:

  • Reuse the existing project.
  • Create a small follow-up plan or parent gate.
  • Add 2-4 map items.
  • Link deploy after fixes and verification.
  • Request approval for deploy.
  • Avoid excessive review gates unless risk warrants them.

Acceptance criteria:

  • Follow-up bug/polish/deploy work is easy to add from a single prompt.
  • The old completed goal remains intact.
  • The live map shows old and new scope together.
  • Follow-up work still has evidence and verification, but less ceremony.
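
The "link deploy after fixes and verification" behavior amounts to a small dependency graph. The sketch below uses Python's standard graphlib to show the ordering; the item names, the NEEDS_APPROVAL set, and execution_order are hypothetical and do not reflect Planr's storage format.

```python
from graphlib import TopologicalSorter

# Hypothetical follow-up map: each item maps to the items it depends on.
follow_up = {
    "fix-bug-1": set(),
    "fix-bug-2": set(),
    "verify": {"fix-bug-1", "fix-bug-2"},
    "deploy": {"verify"},
}

# Items that should pause for human approval before running (assumption).
NEEDS_APPROVAL = {"deploy"}


def execution_order(items: dict) -> list:
    """Return the items in an order that respects every dependency link."""
    return list(TopologicalSorter(items).static_order())


for item in execution_order(follow_up):
    gate = " (approval required)" if item in NEEDS_APPROVAL else ""
    print(item + gate)
```

Four items with one approval gate keep the evidence trail while avoiding a review cycle per item.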

Non-Goals

  • Do not replace Codex /goal.
  • Do not make Planr another loop engine.
  • Do not require Planr to ship browser tooling.
  • Do not require all work to have heavy independent reviews.
  • Do not make every tiny polish change a full product plan.

Core Positioning

Codex /goal is the loop engine.
Planr should be the durable state, evidence, verification, review, approval, and recovery layer around that loop.
This dogfood run supports that positioning, but also shows Planr must reduce ceremony and make its gates more meaningful.
