feat(reviewer): INJ Phase 1 — fenced findings (D-INJ-3) + nonce envelope + output sanitization by ProtocolWarden · Pull Request #350 · ProtocolWarden/OperationsCenter

ProtocolWarden · 2026-06-20T03:17:50Z

Summary

Completes the OC-side of Phase 1 (INJ) (stacked on #349's typed verdict).

D-INJ-3 (fenced findings): _custodian_findings now emits only {detector_id×count} — never the raw path/line/message, which are attacker-authored repo content laundered through a trusted channel (the spec's single strongest attack). The reviewer re-derives which lines from the diff (itself fenced).
Nonce envelope (outer, §2.2.5): new pr_review_watcher/inj.py wraps every untrusted span (PR title, diff, campaign spec) in a per-run randomized fence + system preamble. Attacker text can't close the fence; a literal copy of the live nonce is redacted from the payload.
Output sanitization (§2.2.4): sanitize_for_comment defangs @-mentions + zero-width/bidi chars before any model/untrusted text is reflected to GitHub.
D-INJ-4 (typed hand-off) was already closed by feat(reviewer): INJ Phase 1 root fix — code-computed typed verdict (D-INJ-1) #349 — the fix-pass goal is built from the code-derived failing-check IDs, not model free-text.

Verification

Acceptance §2.4: no raw custodian sample reaches the goal (grep + test). 10 new inj tests; 247 reviewer tests pass; ruff/ty/audit clean.

Note: builds on #349; once that merges this diff reduces to just the fence/findings changes.
Remaining Phase-1 item: the Custodian INJ1 deterministic detector (separate repo).

…-INJ-1) The reviewer emitted a free-text {"result": "LGTM"} the MODEL authored, so any prompt injection in the diff/spec/Custodian findings contended directly for the merge decision. Per HARNESS_TRUST_HARDENING.md §2.2/D-INJ-1 the capability is removed: the model fills a typed {check_id, status, evidence_span} per enumerated review check and CODE computes LGTM/CONCERNS. - New pr_review_watcher/verdict.py: REVIEW_CHECKS, compute_verdict() (pure), verdict_schema_prompt(). - _run_direct_review (the trust boundary) computes the verdict from the model's typed checks and ignores any model-authored "result". - Fail-safe: missing/unknown/malformed -> CONCERNS, never auto-LGTM (D-INJ-2). Acceptance (§2.4): a forged {"result":"LGTM"} with no real checks computes to CONCERNS. 11 verdict-unit + 2 boundary tests; 237 reviewer tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

… sanitization D-INJ-3: _custodian_findings emits only {detector_id×count}, never the raw path/line/message (attacker-authored repo content laundered through a trusted channel — the spec's strongest attack). New pr_review_watcher/inj.py: per-run nonce fence wrapping every untrusted span (title/diff/spec) with a system preamble (outer defense; live nonce redacted from payload so a close marker can't be forged), and sanitize_for_comment defanging @-mentions + zero-width/ bidi chars before reflecting any text to GitHub (D-INJ output sanitization). 10 new inj tests; 247 reviewer tests pass. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

ProtocolWarden · 2026-06-20T03:47:03Z

Needs human attention (reason=ci_misconfigured_check). Left open — not merged (unresolved) and not closed (work preserved).

CI has not gone green after 20 checks (1 failing: License headers: failure). Not merged (red CI) and not closed (work preserved) — needs a human to fix CI.

ProtocolWarden and others added 2 commits June 19, 2026 23:02

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(reviewer): INJ Phase 1 — fenced findings (D-INJ-3) + nonce envelope + output sanitization#350

feat(reviewer): INJ Phase 1 — fenced findings (D-INJ-3) + nonce envelope + output sanitization#350
ProtocolWarden wants to merge 2 commits into
mainfrom
goal/inj-fence-findings

ProtocolWarden commented Jun 20, 2026

Uh oh!

ProtocolWarden commented Jun 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ProtocolWarden commented Jun 20, 2026

Summary

Verification

Uh oh!

ProtocolWarden commented Jun 20, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant