Mercer8964/audit-loop

audit-loop

Research-backed AI self-audit protocol for Claude Code, Codex, and OpenClaw. Verifies high-stakes answers via independent re-solve + cross-method probe — not by critique.

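In brief: an AI self-audit skill that works across Claude Code, Codex, and OpenClaw. Before giving a high-stakes answer (an algorithm, a mechanism, a number, or a correctness claim), the main agent spawns a subagent that independently re-solves the same problem without seeing the draft, then spawns a cross-method probe, and mechanically compares the three results. Every design decision is backed by a paper, and the protocol's limitations are acknowledged honestly.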


The problem

When an AI agent produces a high-stakes answer — algorithm correctness, mechanism design, a numeric estimate, a "this is safe / optimal" claim — and you can't simply run a test, the conventional wisdom is "just ask the AI to check itself."

Empirical research consistently shows this fails structurally, not just occasionally:

  • Refinement-aware bias — same content scored higher when labeled "revised"
  • CoT trust — judges believe shown reasoning traces as ground truth (false-positive rate up to 90%)
  • Sycophancy — multi-turn pushback flips answers ~3× more than direct questioning
  • Self-preference / perplexity bias — models systematically under-flag errors typical of their own training distribution
  • Answer wavering — multi-round critique echo-chambers instead of converging
  • Intrinsic self-correction degrades reasoning accuracy on average (Huang et al., ICLR 2024)

Critique-of-draft is structurally broken. The structural fix that survives the literature: independent re-solve, then mechanically compare.

This mirrors what works in mature human audit domains: reperformance > inquiry in financial audit; replication > peer review; kernel-check > read-the-proof. Step-checking inherits the auditee's blind spots.

The protocol

audit-loop (default, budget-balanced — handles ~95% of cases):

  1. Triage — empirically testable? Run the test instead. Trivial? Skip. Otherwise continue.
  2. Characterize — internally name the CLAIM and its FALSIFICATION SHAPE (what would prove it wrong).
  3. Spawn 1 — Independent re-solve. A subagent solves the original problem from scratch with no view of the draft, no reasoning, no audit framing. Just "solve."
  4. Spawn 2 — Cross-method probe. A different subagent attacks the falsification shape directly: trace on edge inputs, search for counterexample, recompute via alternative method.
  5. Mechanical comparison. Compare draft, re-solve, and probe via documented equivalence rules. Default to "disagreement" when unsure.
  6. Report honestly. Disagreement surfaces in the audit line, never silently picked.

Hard cap: 2 spawns (a budget ceiling, not an accuracy optimum; the literature would support 6-14 verifiers for an accuracy-optimal configuration).
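
The six steps above can be sketched in a few lines of Python. This is a minimal illustration, not the logic in SKILL.md: `audit_loop`, `canonical`, and `AuditReport` are hypothetical names, the `resolve` and `probe` callables stand in for whatever subagent mechanism the platform provides, and the single equivalence rule (case- and whitespace-insensitive match) is an assumption. The point it shows is structural: the re-solver receives only the problem, never the draft, and anything short of a clean three-way match is reported as a disagreement.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class AuditReport:
    verdict: str              # "agree", "disagree", or "skipped"
    answers: Dict[str, str]   # raw answers keyed by source, for the audit line


def canonical(answer: Optional[str]) -> Optional[str]:
    """Hypothetical equivalence rule: ignore case and whitespace.

    Returns None when there is nothing to compare, which the caller
    treats as "unsure" and therefore as a disagreement.
    """
    if answer is None:
        return None
    cleaned = " ".join(answer.lower().split())
    return cleaned or None


def audit_loop(
    problem: str,
    draft: str,
    resolve: Callable[[str], str],
    probe: Callable[[str, str], str],
    falsification_shape: str,
    empirically_testable: bool = False,
) -> AuditReport:
    # Step 1, triage: a runnable test beats any amount of auditing.
    if empirically_testable:
        return AuditReport("skipped", {"draft": draft})

    answers = {
        "draft": draft,
        # Spawn 1: sees the problem only, with no draft and no audit framing.
        "resolve": resolve(problem),
        # Spawn 2: attacks the falsification shape by a different method.
        "probe": probe(problem, falsification_shape),
    }

    # Step 5: mechanical comparison, defaulting to "disagree" when unsure.
    keys = {canonical(a) for a in answers.values()}
    agreed = None not in keys and len(keys) == 1
    return AuditReport("agree" if agreed else "disagree", answers)
```

Returning every raw answer alongside the verdict keeps step 6 honest: a disagreement is surfaced with its evidence instead of being resolved silently in favor of one side.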

audit-loop-max (accuracy-optimal — for security-critical / irreversible / material-consequence decisions):

  • 3-5 parallel independent re-solves (varied prompt approaches, cross-family if available)
  • 2-3 parallel cross-method probes (different falsification angles)
  • Du-et-al multi-agent debate on persistent disagreement
  • No spawn cap (typical pool 5-8, up to 14)
  • Cross-family verification required where available
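
As a rough sketch of the pooled variant, the snippet below runs several independent re-solvers in parallel and tallies their answers. `pooled_resolve` and the `solvers` callables are hypothetical stand-ins for real subagent spawns, the normalization rule is an assumption, and the debate step is reduced to a flag; none of this is taken from the audit-loop-max SKILL.md.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Sequence


def pooled_resolve(problem: str, solvers: Sequence[Callable[[str], str]]) -> Dict:
    """Run every solver on the bare problem in parallel and tally the answers."""
    if not solvers:
        raise ValueError("at least one solver is required")

    # Each solver gets the problem only; none of them sees another's output.
    with ThreadPoolExecutor(max_workers=len(solvers)) as pool:
        answers = list(pool.map(lambda solve: solve(problem), solvers))

    # Assumed equivalence rule: case- and whitespace-insensitive match.
    tally = Counter(" ".join(a.lower().split()) for a in answers)
    leading, votes = tally.most_common(1)[0]

    return {
        "answers": answers,
        "tally": dict(tally),
        "leading": leading,
        # Any split is reported, and would hand off to the debate step.
        "needs_debate": votes != len(answers),
    }
```

A majority is deliberately not treated as agreement here: a 4-to-1 split still sets `needs_debate`, so the minority answer reaches the report instead of being outvoted silently.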

Platforms

| Platform | Default skill | Max-accuracy skill |
| --- | --- | --- |
| Claude Code | ~/.claude/skills/audit-loop/SKILL.md | ~/.claude/skills/audit-loop-max/SKILL.md |
| Codex CLI | ~/.agents/skills/audit-loop/SKILL.md | ~/.agents/skills/audit-loop-max/SKILL.md |
| OpenClaw | ~/.openclaw/skills/audit-loop/SKILL.md | ~/.openclaw/skills/audit-loop-max/SKILL.md |

All three platforms implement the open agent skills standard (frontmatter + markdown body), with platform-specific subagent invocation:

  • Claude Code: Agent tool with subagent_type=general-purpose
  • Codex: explicit subagent spawn (optionally via custom auditor.toml agent)
  • OpenClaw: sessions_spawn + sessions_yield, context: "isolated"

Installation

git clone https://github.com/guoyurui138-hue/audit-loop.git
cd audit-loop

# Claude Code
mkdir -p ~/.claude/skills/audit-loop ~/.claude/skills/audit-loop-max
cp platforms/claude-code/audit-loop/SKILL.md ~/.claude/skills/audit-loop/
cp platforms/claude-code/audit-loop-max/SKILL.md ~/.claude/skills/audit-loop-max/

# Codex CLI
mkdir -p ~/.agents/skills/audit-loop ~/.agents/skills/audit-loop-max
cp platforms/codex/audit-loop/SKILL.md ~/.agents/skills/audit-loop/
cp platforms/codex/audit-loop-max/SKILL.md ~/.agents/skills/audit-loop-max/

# OpenClaw
mkdir -p ~/.openclaw/skills/audit-loop ~/.openclaw/skills/audit-loop-max
cp platforms/openclaw/audit-loop/SKILL.md ~/.openclaw/skills/audit-loop/
cp platforms/openclaw/audit-loop-max/SKILL.md ~/.openclaw/skills/audit-loop-max/

Skills auto-trigger when the agent is about to make a claim matching the description (algorithm correctness, mechanism design, non-empirical numeric estimate, safety/correctness assertion). They can also be invoked manually with /audit-loop or /audit-loop-max.

What this does NOT promise

This protocol is deliberately honest about its limits. Most "I built an AI agent that improves X by 80%" claims are uncited folklore. This one specifies what it cannot do:

  • Reduces error rate; does not eliminate it. Same-family verifiers share weights, share training data, share blind spots no protocol can fully escape.
  • Mathematical floor on correlated-verifier accuracy. For pairwise correlation ρ > 0, ensemble error converges to a positive constant Φ(Φ⁻¹(1−α)/√ρ) — adding verifiers cannot drive error to zero (Don't Always Pick, arXiv:2602.08003).
  • Cross-family is bounded. Eliminates judge bias (preference leakage drops from 28-37% to ~±1.5%) but only halves error correlation (same-family ρ ~0.7-0.8 → cross-family ~0.4-0.5). Capability is a bigger driver of correlation than vendor — two strong models from different vendors can agree on errors at 0.99+ (Correlated Errors, ICML 2025).
  • For empirically testable claims, this is inferior to running the test. The triage gate exists so you don't substitute theory for measurement.
  • Design-type problems are the most degraded mode. Failure-mode enumeration shares the same-family blind spots in the worst way — missing modes are the actually-dangerous ones, and same-family agents miss the same ones the main agent missed.
  • Frontier-novel claims, self-consistent fabrications, and aesthetic judgments are explicit bypass cases — the protocol reports degraded value in those.
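
The error floor in the second bullet can be reproduced numerically under one common modelling assumption. The sketch below uses a one-factor Gaussian model in which every verifier shares a latent factor with weight √ρ and errs when its combined score falls below Φ⁻¹(p), where p is the per-verifier error rate. Reading p as 1−α (that is, α as per-verifier accuracy) and choosing this particular model are assumptions of the sketch, not details taken from the cited paper; the function names are illustrative.

```python
import math
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def error_floor(p_err: float, rho: float) -> float:
    """Limit of majority-vote error as the pool grows: Phi(Phi^-1(p) / sqrt(rho))."""
    return STD_NORMAL.cdf(STD_NORMAL.inv_cdf(p_err) / math.sqrt(rho))


def simulated_majority_error(
    p_err: float, rho: float, n_verifiers: int, trials: int, seed: int = 0
) -> float:
    """Monte Carlo estimate of how often a majority of correlated verifiers is wrong."""
    rng = random.Random(seed)
    threshold = STD_NORMAL.inv_cdf(p_err)
    wrong_majorities = 0
    for _ in range(trials):
        shared = rng.gauss(0.0, 1.0)  # blind spot common to every verifier
        wrong = sum(
            math.sqrt(rho) * shared + math.sqrt(1 - rho) * rng.gauss(0.0, 1.0) < threshold
            for _ in range(n_verifiers)
        )
        if 2 * wrong > n_verifiers:
            wrong_majorities += 1
    return wrong_majorities / trials


if __name__ == "__main__":
    p_err, rho = 0.2, 0.5
    print("floor:", error_floor(p_err, rho))
    for n in (1, 5, 25, 101):
        print(n, simulated_majority_error(p_err, rho, n, trials=4000))
```

Under these assumptions, growing `n_verifiers` moves the simulated rate toward `error_floor` rather than toward zero, and a larger ρ raises the floor, which is why method diversity and cross-family verification matter more than pool size alone.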

Full limits are documented in each SKILL.md.

Empirical grounding

Every design choice has a paper citation in the SKILL.md. Headlines:

Why re-solve, not critique:

Cross-model error correlation & cross-family limits:

Method diversity > sample diversity:

Negative-prompting / priming failure:

Cross-domain audit principles (reperformance > inquiry, pre-registration, de Bruijn criterion): PCAOB AS 2315; Cochrane Handbook; ICAO Annex 13; Bazerman et al. 2002 on auditor capture; replication-crisis literature on Registered Reports.

Design choices flagged as "hypothesis, not yet empirically tested"

In the interest of not overclaiming, these choices are marked in the SKILL.md as defensible-but-not-proven, awaiting head-to-head studies:

  • Probe-on-agreement vs probe-on-disagreement allocation (we do probe regardless, but the comparative value is speculative).
  • The 2-spawn hard cap as an accuracy claim (it's defensible as a budget claim; the literature would support 6+ verifiers for an accuracy-optimal configuration).

If you have empirical data that addresses these, please open an issue.

License

MIT

Contributing

PRs and issues welcome. Especially valuable: empirical comparisons that would move the "hypothesis" flags, additional bypass-case documentation, and platform adapters beyond the three currently supported.
