Research-backed AI self-audit protocol for Claude Code, Codex, and OpenClaw. Verifies high-stakes answers via independent re-solve + cross-method probe — not by critique.
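Summary (translated from Chinese): an AI self-audit skill that works across Claude Code, Codex, and OpenClaw. Before giving a high-stakes answer (an algorithm, mechanism, number, or correctness claim), the main agent spawns a subagent that re-solves the same problem independently without seeing the draft, then spawns a cross-method probe, and mechanically compares the three results. Every design decision is backed by a paper, and the protocol is honest about its limitations.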
When an AI agent produces a high-stakes answer — algorithm correctness, mechanism design, a numeric estimate, a "this is safe / optimal" claim — and you can't simply run a test, the conventional wisdom is "just ask the AI to check itself."
Empirical research consistently shows this fails structurally, not just occasionally:
- Refinement-aware bias — same content scored higher when labeled "revised"
- CoT trust — judges believe shown reasoning traces as ground truth (false-positive rate up to 90%)
- Sycophancy — multi-turn pushback flips answers ~3× more than direct questioning
- Self-preference / perplexity bias — models systematically under-flag errors typical of their own training distribution
- Answer wavering — multi-round critique echo-chambers instead of converging
- Intrinsic self-correction degrades reasoning accuracy on average (Huang et al., ICLR 2024)
Critique-of-draft is structurally broken. The structural fix that survives the literature: independent re-solve, then mechanically compare.
This mirrors what works in mature human audit domains: reperformance > inquiry in financial audit; replication > peer review; kernel-check > read-the-proof. Step-checking inherits the auditee's blind spots.
audit-loop (default, budget-balanced — handles ~95% of cases):
- Triage — empirically testable? Run the test instead. Trivial? Skip. Otherwise continue.
- Characterize — internally name the CLAIM and its FALSIFICATION SHAPE (what would prove it wrong).
- Spawn 1: independent re-solve. A subagent solves the original problem from scratch with no view of the draft, its reasoning, or any audit framing. The prompt is just "solve."
- Spawn 2: cross-method probe. A different subagent attacks the falsification shape directly: it traces edge inputs, searches for a counterexample, or recomputes via an alternative method.
- Mechanical comparison. Compare draft, re-solve, and probe via documented equivalence rules. Default to "disagreement" when unsure.
- Report honestly. Disagreement surfaces in the audit line, never silently picked.
Hard cap: 2 spawns. This is a budget ceiling, not an accuracy optimum; the literature would support 6-14 verifiers for an accuracy-optimal setup.
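The default loop above can be sketched in a few lines of Python. This is a minimal illustration, not the skill's implementation: `Spawn`, `AuditResult`, `normalize`, and `audit_loop` are hypothetical names, `spawn` stands in for whatever subagent call the platform provides, and `normalize` is a toy equivalence rule (the documented rules live in each SKILL.md and are not reproduced here).

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for a platform's subagent call (prompt in, answer out).
Spawn = Callable[[str], str]


@dataclass
class AuditResult:
    verdict: str     # "skipped", "agreement", or "disagreement"
    audit_line: str  # reported alongside the final answer


def normalize(answer: str) -> str:
    """Toy equivalence rule: ignore case and whitespace differences."""
    return " ".join(answer.lower().split())


def audit_loop(problem: str, draft: str, falsification_shape: str,
               spawn: Spawn, testable: bool = False,
               trivial: bool = False) -> AuditResult:
    # Triage: a real test beats an audit; trivial claims are not audited.
    if testable:
        return AuditResult("skipped", "audit: skipped, run the test instead")
    if trivial:
        return AuditResult("skipped", "audit: skipped, trivial claim")

    # Spawn 1: the prompt is the bare problem. No draft, no reasoning,
    # no mention that an audit is happening.
    resolve = spawn(problem)

    # Spawn 2: a second subagent goes after the falsification shape and
    # reports the answer it ends up with.
    probe = spawn(
        f"{problem}\n\n"
        "Use a different method than the obvious one. Check edge inputs and "
        f"look for a counterexample of this form: {falsification_shape}\n"
        "Then state your final answer."
    )

    # Mechanical comparison: any empty or mismatching answer counts as
    # disagreement, so uncertainty never resolves to "agreement".
    answers = {"draft": draft, "re-solve": resolve, "probe": probe}
    keys = {normalize(text) for text in answers.values()}
    if "" not in keys and len(keys) == 1:
        return AuditResult("agreement", "audit: re-solve and probe match the draft")

    detail = "; ".join(f"{name}={text!r}" for name, text in answers.items())
    return AuditResult("disagreement", f"audit: DISAGREEMENT ({detail})")
```

The one property worth preserving in any real version is that the re-solve prompt is built from the problem alone, so the draft cannot leak into it by construction.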
audit-loop-max (accuracy-optimal — for security-critical / irreversible / material-consequence decisions):
- 3-5 parallel independent re-solves (varied prompt approaches, cross-family if available)
- 2-3 parallel cross-method probes (different falsification angles)
- Du-et-al multi-agent debate on persistent disagreement
- No spawn cap (typical pool 5-8, up to 14)
- Cross-family verification required where available
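A rough sketch of the max-accuracy variant, under the same caveats: `Solver`, `run_pool`, `unanimous`, `debate`, and `audit_max` are hypothetical names, each solver is a stand-in for a subagent call, and the debate step is a simplified reading of Du et al. (each agent sees the other answers and re-answers for a bounded number of rounds). The cross-method probes and cross-family selection are left out to keep the example short.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

Solver = Callable[[str], str]  # hypothetical subagent call: prompt in, answer out


def normalize(answer: str) -> str:
    """Toy equivalence rule: ignore case and whitespace differences."""
    return " ".join(answer.lower().split())


def run_pool(problem: str, solvers: Sequence[Solver],
             framings: Sequence[str]) -> List[str]:
    """Run one independent re-solve per (solver, framing) pair in parallel."""
    prompts = [f"{framing}\n\n{problem}" for framing in framings]
    with ThreadPoolExecutor(max_workers=max(1, len(solvers))) as pool:
        futures = [pool.submit(s, p) for s, p in zip(solvers, prompts)]
        return [f.result() for f in futures]


def unanimous(answers: Sequence[str]) -> bool:
    """True only if every answer is non-empty and equivalent to the others."""
    keys = {normalize(a) for a in answers}
    return len(keys) == 1 and "" not in keys


def debate(problem: str, solvers: Sequence[Solver],
           answers: Sequence[str], max_rounds: int = 2) -> List[str]:
    """Show each solver the other answers and let it re-answer."""
    answers = list(answers)
    for _ in range(max_rounds):
        if unanimous(answers):
            break
        updated = []
        for i, solve in enumerate(solvers):
            others = "\n".join(a for j, a in enumerate(answers) if j != i)
            updated.append(solve(
                f"{problem}\n\nOther agents answered:\n{others}\n\n"
                "Reconsider and give your final answer."
            ))
        answers = updated
    return answers


def audit_max(problem: str, draft: str, solvers: Sequence[Solver],
              framings: Sequence[str]) -> str:
    answers = run_pool(problem, solvers, framings)
    if not unanimous(answers):
        # Escalate to debate only when the pool stays split.
        answers = debate(problem, solvers, answers)
    if unanimous(answers) and normalize(answers[0]) == normalize(draft):
        return "agreement"
    return "disagreement"
```

Note that the draft never enters the pool or the debate; it is only compared against the pool's final answers.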
| Platform | Default skill | Max-accuracy skill |
|---|---|---|
| Claude Code | `~/.claude/skills/audit-loop/SKILL.md` | `~/.claude/skills/audit-loop-max/SKILL.md` |
| Codex CLI | `~/.agents/skills/audit-loop/SKILL.md` | `~/.agents/skills/audit-loop-max/SKILL.md` |
| OpenClaw | `~/.openclaw/skills/audit-loop/SKILL.md` | `~/.openclaw/skills/audit-loop-max/SKILL.md` |
All three platforms implement the open agent skills standard (frontmatter + markdown body), with platform-specific subagent invocation:
- Claude Code: `Agent` tool with `subagent_type=general-purpose`
- Codex: explicit subagent spawn (optionally via custom `auditor.toml` agent)
- OpenClaw: `sessions_spawn` + `sessions_yield`, `context: "isolated"`
git clone https://github.com/guoyurui138-hue/audit-loop.git
cd audit-loop
# Claude Code
mkdir -p ~/.claude/skills/audit-loop ~/.claude/skills/audit-loop-max
cp platforms/claude-code/audit-loop/SKILL.md ~/.claude/skills/audit-loop/
cp platforms/claude-code/audit-loop-max/SKILL.md ~/.claude/skills/audit-loop-max/
# Codex CLI
mkdir -p ~/.agents/skills/audit-loop ~/.agents/skills/audit-loop-max
cp platforms/codex/audit-loop/SKILL.md ~/.agents/skills/audit-loop/
cp platforms/codex/audit-loop-max/SKILL.md ~/.agents/skills/audit-loop-max/
# OpenClaw
mkdir -p ~/.openclaw/skills/audit-loop ~/.openclaw/skills/audit-loop-max
cp platforms/openclaw/audit-loop/SKILL.md ~/.openclaw/skills/audit-loop/
cp platforms/openclaw/audit-loop-max/SKILL.md ~/.openclaw/skills/audit-loop-max/

Skills auto-trigger when the agent is about to make a claim matching the description (algorithm correctness, mechanism design, non-empirical numeric estimate, safety/correctness assertion). Or invoke manually with `/audit-loop` or `/audit-loop-max`.
This protocol is deliberately honest about its limits. Most "I built an AI agent that improves X by 80%" claims are uncited folklore. This one specifies what it cannot do:
- Reduces error rate; does not eliminate it. Same-family verifiers share weights, share training data, share blind spots no protocol can fully escape.
- Mathematical floor on correlated-verifier error. For pairwise correlation ρ > 0, ensemble error converges to a positive constant `Φ(Φ⁻¹(1−α)/√ρ)`; adding verifiers cannot drive error to zero (Don't Always Pick, arXiv:2602.08003).
- Cross-family is bounded. Eliminates judge bias (preference leakage drops from 28-37% to ~±1.5%) but only halves error correlation (same-family ρ ~0.7-0.8 → cross-family ~0.4-0.5). Capability is a bigger driver of correlation than vendor: two strong models from different vendors can agree on errors at 0.99+ (Correlated Errors, ICML 2025).
- For empirically testable claims, this is inferior to running the test. The triage gate exists so you don't substitute theory for measurement.
- Design-type problems are the most degraded mode. Failure-mode enumeration shares the same-family blind spots in the worst way — missing modes are the actually-dangerous ones, and same-family agents miss the same ones the main agent missed.
- Frontier-novel claims, self-consistent fabrications, and aesthetic judgments are explicit bypass cases — the protocol reports degraded value in those.
Full limits are documented in each SKILL.md.
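To get a feel for the error-floor expression in the list above, it can be evaluated directly with the standard library. This is an illustration under assumptions, not a reproduction of the cited paper: α is read here as the accuracy of a single verifier (this README does not define it, so check the paper), α = 0.8 is an arbitrary example value, and the ρ values are midpoints of the same-family and cross-family ranges quoted above. `error_floor` is a hypothetical helper name.

```python
from math import sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def error_floor(alpha: float, rho: float) -> float:
    """Evaluate Phi(Phi^-1(1 - alpha) / sqrt(rho)).

    alpha is ASSUMED to be single-verifier accuracy; rho is the pairwise
    error correlation between verifiers, with 0 < rho <= 1.
    """
    if not (0.0 < alpha < 1.0 and 0.0 < rho <= 1.0):
        raise ValueError("need 0 < alpha < 1 and 0 < rho <= 1")
    z = _STD_NORMAL.inv_cdf(1.0 - alpha) / sqrt(rho)
    return _STD_NORMAL.cdf(z)


# Cross-family midpoint, same-family midpoint, and fully correlated verifiers.
for rho in (0.45, 0.75, 1.0):
    print(f"rho={rho:.2f}  floor={error_floor(0.8, rho):.3f}")
```

Under this reading, ρ = 1 collapses to the single-verifier error 1 − α, and lowering ρ shrinks the floor without ever reaching zero, which is the point the bullet makes.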
Every design choice has a paper citation in the SKILL.md. Headlines:
Why re-solve, not critique:
- McAleese et al., 2024 — LLM Critics Help Catch LLM Bugs (CriticGPT) — https://arxiv.org/abs/2407.00215
- Huang et al., ICLR 2024 — Large Language Models Cannot Self-Correct Reasoning Yet — https://arxiv.org/abs/2310.01798
- Ye et al., 2024 — Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge — https://arxiv.org/html/2410.02736v1
- SycEval — Evaluating LLM Sycophancy — https://arxiv.org/html/2502.08177v4
Cross-model error correlation & cross-family limits:
- Kim et al., ICML 2025 — Correlated Errors in Large Language Models — https://arxiv.org/abs/2506.07962
- Li et al., ICLR 2026 — Preference Leakage in LLM-as-a-judge — https://arxiv.org/abs/2502.01534
- Don't Always Pick the Highest-Performing Model (ensemble error floor) — https://arxiv.org/abs/2602.08003
Method diversity > sample diversity:
- Lifshitz et al., 2025 — BoN-MAV: Multi-Agent Verification — https://arxiv.org/abs/2502.20379
- Naik et al., 2023 — Diversity of Thought — https://arxiv.org/abs/2310.07088
- Wang et al., 2022 — Self-Consistency — https://arxiv.org/abs/2203.11171
- Du et al., 2023 — Multi-Agent Debate — https://arxiv.org/abs/2305.14325
Negative-prompting / priming failure:
- Rana, 2026 — Semantic Gravity Wells — https://arxiv.org/pdf/2601.08070
Cross-domain audit principles (reperformance > inquiry, pre-registration, de Bruijn criterion): PCAOB AS 2315; Cochrane Handbook; ICAO Annex 13; Bazerman et al. 2002 on auditor capture; replication-crisis literature on Registered Reports.
In the interest of not overclaiming, these choices are marked in the SKILL.md as defensible-but-not-proven, awaiting head-to-head studies:
- Probe-on-agreement vs probe-on-disagreement allocation (we do probe regardless, but the comparative value is speculative).
- The 2-spawn hard cap as an accuracy claim (it's defensible as a budget claim; the literature would support 6+ verifiers for an accuracy-optimal setup).
If you have empirical data that addresses these, please open an issue.
MIT
PRs and issues welcome. Especially valuable: empirical comparisons that would move the "hypothesis" flags, additional bypass-case documentation, and platform adapters beyond the three currently supported.