feat(reviewer): INJ Phase 1 root fix — code-computed typed verdict (D-INJ-1) #349

Open
ProtocolWarden wants to merge 1 commit into main from goal/inj-typed-verdict

@ProtocolWarden

Owner

Summary

First PR of Harness Trust-Hardening Phase 1 (INJ) — operator-implemented (the fleet must not author the controls that constrain it).

The reviewer emitted a free-text {"result": "LGTM"} that the model authored, so any prompt injection in the diff / campaign spec / Custodian findings contended directly for the merge decision (suppress a CONCERNS, forge an LGTM). Per spec §2.2 / D-INJ-1, the capability itself is removed.

Change

  • New pr_review_watcher/verdict.py: enumerated REVIEW_CHECKS, pure compute_verdict(checks) -> (result, failing), and verdict_schema_prompt().
  • The model now fills a typed {check_id, status, evidence_span} per check. _run_direct_review (the trust boundary) runs compute_verdict and returns a code-computed result, ignoring any model-authored result.
  • Fail-safe: missing / unknown / malformed → CONCERNS, never auto-LGTM (also satisfies D-INJ-2 degrade-to-stricter).
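
As an illustration of the bullets above, here is a minimal sketch of what a pure `compute_verdict` could look like. It is not the code in this PR: the check ids in `REVIEW_CHECKS`, the `"pass"` status value, and the exact handling of unknown ids are assumptions made for the example. Only the `(result, failing)` return shape and the fail-safe rule come from the description.

```python
# Hypothetical check ids; the PR does not list the real REVIEW_CHECKS.
REVIEW_CHECKS = ("tests_cover_change", "no_secret_leak", "scope_matches_spec")

# Assumed status vocabulary: only this exact value counts as a pass.
PASS = "pass"


def compute_verdict(checks: object) -> tuple[str, list[str]]:
    """Return (result, failing) computed only from typed per-check entries."""
    passed: dict[str, bool] = {}
    # Anything that is not a list of entries is treated as malformed input.
    malformed = not isinstance(checks, list)
    for item in checks if isinstance(checks, list) else []:
        if not isinstance(item, dict):
            malformed = True
            continue
        check_id = item.get("check_id")
        if not isinstance(check_id, str) or check_id not in REVIEW_CHECKS:
            # Unknown ids degrade to the stricter outcome.
            malformed = True
            continue
        ok = item.get("status") == PASS
        # A repeated check passes only if every occurrence passes.
        passed[check_id] = passed.get(check_id, True) and ok
    # A check that was never reported counts as failing.
    failing = [c for c in REVIEW_CHECKS if not passed.get(c, False)]
    result = "LGTM" if not failing and not malformed else "CONCERNS"
    return result, failing
```

In this sketch LGTM is reachable only when every enumerated check is present with a passing status and nothing in the input is malformed. Note that `failing` can be empty while the result is still CONCERNS when the only problem is a malformed or unknown entry.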

Acceptance (§2.4)

A forged {"result":"LGTM"} with no real checks computes to CONCERNS (unit + trust-boundary tests). 11 verdict-unit + 2 boundary tests; 237 reviewer tests pass; ruff/ty/audit clean.
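
The acceptance criterion can be sketched as a small trust-boundary function plus a forged payload. This is an assumption-laden illustration rather than the PR's `_run_direct_review`: the function name `review_at_boundary`, the `"checks"` key in the model output, the check ids, and the compact `_verdict` stand-in are all hypothetical.

```python
import json

REVIEW_CHECKS = ("tests_cover_change", "no_secret_leak")  # hypothetical ids


def _verdict(checks: object) -> tuple[str, list[str]]:
    """Compact stand-in for compute_verdict: every enumerated check must pass."""
    statuses: dict[str, bool] = {}
    if isinstance(checks, list):
        for item in checks:
            if isinstance(item, dict) and isinstance(item.get("check_id"), str):
                cid = item["check_id"]
                ok = item.get("status") == "pass"
                statuses[cid] = statuses.get(cid, True) and ok
    failing = [c for c in REVIEW_CHECKS if not statuses.get(c, False)]
    return ("CONCERNS" if failing else "LGTM"), failing


def review_at_boundary(raw_model_output: str) -> dict:
    """Parse model output and return a verdict that code, not the model, decides."""
    try:
        payload = json.loads(raw_model_output)
    except (TypeError, ValueError):
        payload = None  # unparseable output falls through to CONCERNS
    # Only the typed checks are read; a model-authored "result" is never consulted.
    checks = payload.get("checks") if isinstance(payload, dict) else None
    result, failing = _verdict(checks)
    return {"result": result, "failing": failing}


forged = '{"result": "LGTM"}'
print(review_at_boundary(forged)["result"])  # prints CONCERNS
```

Because the boundary never reads the `result` key, an injected instruction that persuades the model to write `"LGTM"` has nothing to act on. The remaining attack surface is the per-check statuses, which this sketch does not try to validate against evidence.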

Remaining Phase-1 PRs: typed circular hand-off (D-INJ-4), {detector_id,count} findings (D-INJ-3), output sanitization, nonce-fenced envelope, Custodian INJ1 detector.

…-INJ-1)

The reviewer emitted a free-text {"result": "LGTM"} the MODEL authored, so any
prompt injection in the diff/spec/Custodian findings contended directly for the
merge decision. Per HARNESS_TRUST_HARDENING.md §2.2/D-INJ-1 the capability is
removed: the model fills a typed {check_id, status, evidence_span} per enumerated
review check and CODE computes LGTM/CONCERNS.

- New pr_review_watcher/verdict.py: REVIEW_CHECKS, compute_verdict() (pure),
  verdict_schema_prompt().
- _run_direct_review (the trust boundary) computes the verdict from the model's
  typed checks and ignores any model-authored "result".
- Fail-safe: missing/unknown/malformed -> CONCERNS, never auto-LGTM (D-INJ-2).

Acceptance (§2.4): a forged {"result":"LGTM"} with no real checks computes to
CONCERNS. 11 verdict-unit + 2 boundary tests; 237 reviewer tests pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@ProtocolWarden

ProtocolWarden commented Jun 20, 2026

Owner Author

Resolved: CI green on unchanged head — test suite validates implementation; automated review resumed

Needs human attention (reason=ci_misconfigured_check). Left open — not merged (unresolved) and not closed (work preserved).

CI has not gone green after 20 checks (1 failing: License headers: failure). Not merged (red CI) and not closed (work preserved) — needs a human to fix CI.

@ProtocolWarden

Owner Author

Needs human attention (reason=ci_misconfigured_check). Left open — not merged (unresolved) and not closed (work preserved).

CI has not gone green after 21 checks (1 failing: License headers: failure). Not merged (red CI) and not closed (work preserved) — needs a human to fix CI.
