audit-loop

Research-backed AI self-audit protocol for Claude Code, Codex, and OpenClaw. Verifies high-stakes answers via independent re-solve + cross-method probe — not by critique.

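Summary: an AI self-audit skill for Claude Code, Codex, and OpenClaw. Before the main agent gives a high-stakes answer (an algorithm, a mechanism, a number, or a correctness claim), it spawns a subagent that independently re-solves the same problem without seeing the draft, then spawns a cross-method probe, and mechanically compares the three results. Every design decision is backed by a paper, and the protocol's limitations are acknowledged honestly.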

The problem

When an AI agent produces a high-stakes answer — algorithm correctness, mechanism design, a numeric estimate, a "this is safe / optimal" claim — and you can't simply run a test, the conventional wisdom is "just ask the AI to check itself."

Empirical research consistently shows this fails structurally, not just occasionally:

Refinement-aware bias: the same content is scored higher when labeled "revised"
CoT trust: judges treat displayed reasoning traces as ground truth (false-positive rate up to 90%)
Sycophancy: multi-turn pushback flips answers ~3× more often than direct questioning
Self-preference / perplexity bias: models systematically under-flag errors typical of their own training distribution
Answer wavering: multi-round critique becomes an echo chamber instead of converging
Intrinsic self-correction: degrades reasoning accuracy on average (Huang et al., ICLR 2024)

Critique-of-draft is structurally broken. The structural fix that survives the literature: independent re-solve, then mechanically compare.

This mirrors what works in mature human audit domains: reperformance > inquiry in financial audit; replication > peer review; kernel-check > read-the-proof. Step-checking inherits the auditee's blind spots.

The protocol

audit-loop (default, budget-balanced — handles ~95% of cases):

Triage — empirically testable? Run the test instead. Trivial? Skip. Otherwise continue.
Characterize — internally name the CLAIM and its FALSIFICATION SHAPE (what would prove it wrong).
Spawn 1 — Independent re-solve. A subagent solves the original problem from scratch with no view of the draft, no reasoning, no audit framing. Just "solve."
Spawn 2 — Cross-method probe. A different subagent attacks the falsification shape directly: trace on edge inputs, search for counterexample, recompute via alternative method.
Mechanical comparison. Compare draft, re-solve, and probe via documented equivalence rules. Default to "disagreement" when unsure.
Report honestly. Any disagreement is surfaced in the audit line; the agent never silently picks a winner.

Hard cap: 2 spawns. This is a budget ceiling, not an accuracy optimum; the literature would support 6-14 verifiers for accuracy-optimal operation.
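The default flow above can be sketched as a small Python function. This is a minimal illustration, not the skill's actual implementation: `spawn` and `equivalent` are hypothetical callables standing in for the platform's subagent call and the equivalence rules documented in SKILL.md, and the probe prompt wording is an assumption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional

MAX_SPAWNS = 2  # budget ceiling from the protocol, not an accuracy optimum


@dataclass
class AuditResult:
    verdict: str             # "run-the-test", "skipped", "agreement", or "disagreement"
    answers: Dict[str, str]  # every answer is kept so disagreement can be reported


def audit_loop(
    problem: str,
    draft: str,
    falsification_shape: str,
    spawn: Callable[[str], str],                        # hypothetical isolated subagent call
    equivalent: Callable[[str, str], Optional[bool]],   # hypothetical rule; None means "unsure"
    testable: bool = False,
    trivial: bool = False,
) -> AuditResult:
    # Triage: measurement beats theory, and trivial claims are not audited.
    if testable:
        return AuditResult("run-the-test", {"draft": draft})
    if trivial:
        return AuditResult("skipped", {"draft": draft})

    spawns_used = 0

    def spawn_once(prompt: str) -> str:
        nonlocal spawns_used
        if spawns_used >= MAX_SPAWNS:
            raise RuntimeError("spawn budget exceeded")
        spawns_used += 1
        return spawn(prompt)

    # Spawn 1: the prompt is the bare problem, with no draft and no audit framing.
    resolve = spawn_once(problem)
    # Spawn 2: the probe targets the falsification shape (prompt wording is assumed).
    probe = spawn_once(f"{problem}\n\nTry to falsify: {falsification_shape}")

    # Mechanical comparison: anything other than a definite True counts as disagreement.
    answers = {"draft": draft, "re-solve": resolve, "probe": probe}
    pairs = [(draft, resolve), (draft, probe), (resolve, probe)]
    agreed = all(equivalent(a, b) is True for a, b in pairs)
    return AuditResult("agreement" if agreed else "disagreement", answers)
```

Returning all three answers, rather than a single chosen one, is what lets the caller report a disagreement instead of hiding it.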

audit-loop-max (accuracy-optimal — for security-critical / irreversible / material-consequence decisions):

3-5 parallel independent re-solves (varied prompt approaches, cross-family if available)
2-3 parallel cross-method probes (different falsification angles)
Du-et-al multi-agent debate on persistent disagreement
No spawn cap (typical pool 5-8, up to 14)
Cross-family verification required where available
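A rough sketch of how the larger pool might be run and tallied is shown below. The aggregation rule (unanimity or escalate) is an assumption made for illustration; the README does not specify it, and the debate step itself is not implemented here. `spawn` is again a hypothetical stand-in for the platform's subagent call.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Sequence


def audit_loop_max(
    draft: str,
    resolve_prompts: Sequence[str],  # 3-5 varied phrasings of the original problem
    probe_prompts: Sequence[str],    # 2-3 different falsification angles
    spawn: Callable[[str], str],     # hypothetical: runs one isolated subagent
    normalize: Callable[[str], str] = lambda s: s.strip().lower(),
) -> Dict[str, object]:
    with ThreadPoolExecutor() as pool:
        # Executor.map submits every task immediately, so both batches run in parallel.
        resolve_iter = pool.map(spawn, resolve_prompts)
        probe_iter = pool.map(spawn, probe_prompts)
        resolves, probes = list(resolve_iter), list(probe_iter)

    # Tally normalized answers; more than one distinct answer is a disagreement.
    tally = Counter(normalize(a) for a in [draft, *resolves, *probes])
    status = "agreement" if len(tally) == 1 else "escalate-to-debate"
    return {
        "status": status,
        "tally": dict(tally),  # reported in full so minority answers stay visible
        "pool_size": len(resolves) + len(probes),
    }
```

The tally is returned in full rather than reduced to a majority vote, since a lone dissenting probe may be the one that found the counterexample.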

Platforms

| Platform | Default skill | Max-accuracy skill |
| --- | --- | --- |
| Claude Code | `~/.claude/skills/audit-loop/SKILL.md` | `~/.claude/skills/audit-loop-max/SKILL.md` |
| Codex CLI | `~/.agents/skills/audit-loop/SKILL.md` | `~/.agents/skills/audit-loop-max/SKILL.md` |
| OpenClaw | `~/.openclaw/skills/audit-loop/SKILL.md` | `~/.openclaw/skills/audit-loop-max/SKILL.md` |

All three platforms implement the open agent skills standard (frontmatter + markdown body), with platform-specific subagent invocation:

Claude Code: Agent tool with subagent_type=general-purpose
Codex: explicit subagent spawn (optionally via custom auditor.toml agent)
OpenClaw: sessions_spawn + sessions_yield, context: "isolated"

Installation

git clone https://github.com/guoyurui138-hue/audit-loop.git
cd audit-loop

# Claude Code
mkdir -p ~/.claude/skills/audit-loop ~/.claude/skills/audit-loop-max
cp platforms/claude-code/audit-loop/SKILL.md ~/.claude/skills/audit-loop/
cp platforms/claude-code/audit-loop-max/SKILL.md ~/.claude/skills/audit-loop-max/

# Codex CLI
mkdir -p ~/.agents/skills/audit-loop ~/.agents/skills/audit-loop-max
cp platforms/codex/audit-loop/SKILL.md ~/.agents/skills/audit-loop/
cp platforms/codex/audit-loop-max/SKILL.md ~/.agents/skills/audit-loop-max/

# OpenClaw
mkdir -p ~/.openclaw/skills/audit-loop ~/.openclaw/skills/audit-loop-max
cp platforms/openclaw/audit-loop/SKILL.md ~/.openclaw/skills/audit-loop/
cp platforms/openclaw/audit-loop-max/SKILL.md ~/.openclaw/skills/audit-loop-max/

Skills auto-trigger when the agent is about to make a claim matching the description (algorithm correctness, mechanism design, non-empirical numeric estimate, safety/correctness assertion). They can also be invoked manually with /audit-loop or /audit-loop-max.

What this does NOT promise

This protocol is deliberately honest about its limits. Most "I built an AI agent that improves X by 80%" claims are uncited folklore. This one specifies what it cannot do:

Reduces error rate; does not eliminate it. Same-family verifiers share weights, share training data, share blind spots no protocol can fully escape.
Mathematical floor on correlated-verifier accuracy. For pairwise correlation ρ > 0, ensemble error converges to a positive constant Φ(Φ⁻¹(1−α)/√ρ) — adding verifiers cannot drive error to zero (Don't Always Pick, arXiv:2602.08003).
Cross-family is bounded. Eliminates judge bias (preference leakage drops from 28-37% to ~±1.5%) but only halves error correlation (same-family ρ ~0.7-0.8 → cross-family ~0.4-0.5). Capability is a bigger driver of correlation than vendor — two strong models from different vendors can agree on errors at 0.99+ (Correlated Errors, ICML 2025).
For empirically testable claims, this is inferior to running the test. The triage gate exists so you don't substitute theory for measurement.
Design-type problems are the most degraded mode. Failure-mode enumeration shares the same-family blind spots in the worst way — missing modes are the actually-dangerous ones, and same-family agents miss the same ones the main agent missed.
Frontier-novel claims, self-consistent fabrications, and aesthetic judgments are explicit bypass cases — the protocol reports degraded value in those.
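The error-floor expression quoted in the list above can be evaluated directly. The sketch below assumes that α denotes a single verifier's accuracy, so that 1−α is its error rate; the README does not define α, and `ensemble_error_floor` is a name introduced here for illustration only.

```python
from math import sqrt
from statistics import NormalDist

_STD = NormalDist()  # standard normal: mean 0, standard deviation 1


def ensemble_error_floor(alpha: float, rho: float) -> float:
    """Evaluate Phi(Phi^-1(1 - alpha) / sqrt(rho)).

    alpha: assumed to be single-verifier accuracy, 0 < alpha < 1.
    rho:   pairwise error correlation, 0 < rho <= 1.
    """
    if not (0.0 < alpha < 1.0 and 0.0 < rho <= 1.0):
        raise ValueError("need 0 < alpha < 1 and 0 < rho <= 1")
    return _STD.cdf(_STD.inv_cdf(1.0 - alpha) / sqrt(rho))


if __name__ == "__main__":
    # Illustrative correlations picked from the same-family and cross-family ranges.
    for label, rho in [("same-family", 0.75), ("cross-family", 0.45)]:
        print(label, round(ensemble_error_floor(0.7, rho), 3))
```

Under this reading, lowering ρ reduces the floor when individual verifiers are better than chance, but the floor stays above zero for any ρ > 0, which matches the bounded benefit of cross-family verification described above.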

Full limits are documented in each SKILL.md.

Empirical grounding

Every design choice has a paper citation in the SKILL.md. Headlines:

Why re-solve, not critique:

McAleese et al., 2024 — LLM Critics Help Catch LLM Bugs (CriticGPT) — https://arxiv.org/abs/2407.00215
Huang et al., ICLR 2024 — Large Language Models Cannot Self-Correct Reasoning Yet — https://arxiv.org/abs/2310.01798
Ye et al., 2024 — Justice or Prejudice? Quantifying Biases in LLM-as-a-Judge — https://arxiv.org/html/2410.02736v1
SycEval — Evaluating LLM Sycophancy — https://arxiv.org/html/2502.08177v4

Cross-model error correlation & cross-family limits:

Kim et al., ICML 2025 — Correlated Errors in Large Language Models — https://arxiv.org/abs/2506.07962
Li et al., ICLR 2026 — Preference Leakage in LLM-as-a-judge — https://arxiv.org/abs/2502.01534
Don't Always Pick the Highest-Performing Model (ensemble error floor) — https://arxiv.org/abs/2602.08003

Method diversity > sample diversity:

Lifshitz et al., 2025 — BoN-MAV: Multi-Agent Verification — https://arxiv.org/abs/2502.20379
Naik et al., 2023 — Diversity of Thought — https://arxiv.org/abs/2310.07088
Wang et al., 2022 — Self-Consistency — https://arxiv.org/abs/2203.11171
Du et al., 2023 — Multi-Agent Debate — https://arxiv.org/abs/2305.14325

Negative-prompting / priming failure:

Rana, 2026 — Semantic Gravity Wells — https://arxiv.org/pdf/2601.08070

Cross-domain audit principles (reperformance > inquiry, pre-registration, de Bruijn criterion): PCAOB AS 2315; Cochrane Handbook; NTSB Annex 13; Bazerman et al. 2002 on auditor capture; replication-crisis literature on Registered Reports.

Design choices flagged as "hypothesis, not yet empirically tested"

In the interest of not overclaiming, these choices are marked in the SKILL.md as defensible-but-not-proven, awaiting head-to-head studies:

Probe-on-agreement vs probe-on-disagreement allocation (we do probe regardless, but the comparative value is speculative).
The 2-spawn hard cap as an accuracy claim (it is defensible as a budget claim; the literature would support 6+ verifiers for accuracy-optimal operation).

If you have empirical data that addresses these, please open an issue.

License

MIT

Contributing

PRs and issues welcome. Especially valuable: empirical comparisons that would move the "hypothesis" flags, additional bypass-case documentation, and platform adapters beyond the three currently supported.
