diff --git a/docs/results/autodata-live.md b/docs/results/autodata-live.md
index 0847d25..80df2e8 100644
--- a/docs/results/autodata-live.md
+++ b/docs/results/autodata-live.md
@@ -1,17 +1,16 @@
-# Autodata live result: the causal challenger widens the gap (reproduced) — but clearing the accept bar is noisy at this n/tier (NOT robust)
+# Autodata live result: the causal-challenger loop reliably discriminates at power — 38% accept-rate, CI [23%, 55%] (NOT a coin-flip)
 
-Running the agentic data-creation loop (`src/autodata/`) on a real arXiv doc with real two-tier
+Running the agentic data-creation loop (`src/autodata/`) on real arXiv docs with real two-tier
 solvers, to manufacture training examples that separate a strong solver from a weak one (the
 discriminative reward of the Autodata / Agentic-Self-Instruct method).
 
-**Honest headline (two independent runs):** the non-extractive causal challenger + the refine fold
-**reliably widen the strong/weak gap by ~+0.20 vs plain generation** (reproduced in both runs — the
-method's Table-1 *direction* holds). BUT **clearing the hard accept bar** (weak < 0.5 ∧ strong ≥ 0.65
-∧ gap ≥ 0.2) is **noisy and marginal**: one run accepted 1–2 of 3, an **independent re-run accepted
-0 of 3**. The reason is in the answers — `llama-3.1-8b` on these MoE questions sometimes flails
-(0.24) and sometimes answers *competently* (0.75), straddling the 0.5 "weak must struggle" line. So:
-**directionally confirmed, not a robust positive at n=3 / this tier.** This is the same small-n
-mirage that bit the earlier two-agent A/B (positive at n=1, washes at power) — flagged, not buried.
+**Powered headline (32 independent slots, 2 docs, samples=4):** the loop **reliably manufactures
+discriminating examples — accept-rate 38%, Wilson 95% CI [23%, 55%]** (12 of 32 slots cleared the
+hard accept bar: weak < 0.5 ∧ strong ≥ 0.65 ∧ gap ≥ 0.2). The CI lower bound (23%) excludes ~0, so
+this is a **real, repeatable rate, not the n=1–2 luck** that made it look like a coin-flip at n=3.
+Acceptance is **doc-dependent** (mixtral 19%, deepseek-v3 56%) and gated by **whether the weak model
+struggles** (it does on only 39% of attempts), but it is decisively above zero on both docs. This
+**replaces** the earlier n=3 result, which was too noisy to tell "real rate" from "coin-flip ~0".
 
 ## The two levers that turned the null into a positive
 
@@ -29,22 +28,30 @@ both fixed here:
 
 2. **The grounding doc was memorized.** The default was "Attention Is All You Need" — the most
    canonical paper in ML, which an 8B has memorized, so even reasoning questions are answerable from
-   pretraining and capability cannot separate. Fix — **ground on a doc the weak solver has not
-   memorized**: the new default is the Mixtral-of-Experts paper (arXiv 2401.04088, Jan 2024), which
-   post-dates `llama-3.1-8b`'s knowledge cutoff, forcing it to reason from the context.
+   pretraining and capability cannot separate. Fix — **ground on docs the weak solver has not
+   memorized**: the Mixtral-of-Experts paper (arXiv 2401.04088, Jan 2024) and the DeepSeek-V3 paper
+   (arXiv 2412.19437, Dec 2024), both post-dating `llama-3.1-8b`'s knowledge cutoff, forcing it to
+   reason from the context.
 
 ## Setup (all env-overridable)
 
 | role | model | why |
 |---|---|---|
-| weak solver | `groq/llama-3.1-8b-instant` | small; cutoff predates the 2024 doc → must reason, can't recall |
+| weak solver | `groq/llama-3.1-8b-instant` | small; cutoff predates the 2024 docs → must reason, can't recall |
 | strong solver | `gemini-2.5-pro` | frontier reasoner; a real wide capability gap |
 | challenger + judge | `deepseek-v4-flash` | capable, fast, reliable, a DIFFERENT family from both solvers (no judge-bias) |
-| grounding doc | Mixtral-of-Experts (2401.04088) | non-memorized, reasoning-rich (MoE routing / gating) |
+| grounding doc A | Mixtral-of-Experts (2401.04088) | non-memorized; MoE expert routing / gating (`focus=expert`) |
+| grounding doc B | DeepSeek-V3 (2412.19437) | non-memorized; auxiliary-loss-free load balancing / expert specialization (`focus=auxiliary`) |
 
-Accept thresholds (the paper's): strong >= 0.65, weak < 0.50, gap >= 0.20. (`glm-5.2`, the brief's
-challenger/judge, was returning upstream-capacity 503s during this run; `deepseek-v4-flash` is the
-live, neutral substitute. `routerChat` now retries transient 503/429/timeout with bounded backoff.)
+Accept thresholds (the paper's): strong ≥ 0.65, weak < 0.50, gap ≥ 0.20. (`glm-5.2`, the brief's
+challenger/judge, was returning upstream-capacity 503s; `deepseek-v4-flash` is the live, neutral
+substitute. `routerChat` retries transient 503/429/timeout with bounded backoff.)
+
+The grounding chunk must be **prose, not equations**: an equation-dense chunk (e.g. DeepSeek-V3's MLA
+section) breaks the challenger's strict-JSON output (LaTeX backslashes), so both `focus` terms select
+the prose description of an MoE-expert mechanism. Even so, 5 of 32 slots (~16%) still hit a
+LaTeX-in-JSON failure and produced no example — those count as rejects in the headline (the
+conservative floor); see below.
 
 ## The judge is reliable (checked before trusting any gap)
 
@@ -54,70 +61,94 @@ each: `deepseek-v4-flash` returned strong `[1.00, 1.00, 1.00]` (mean 1.00) vs we
 measured gap reflects answer quality, not judge noise. (`gemini-2.5-flash` as judge threw parse
 errors — `deepseek` is the better grader here.)
 
-## The result — the gap opens, examples are accepted
-
-**Memorized doc (Transformer paper), recall challenger — reproduces the null:** mean gap **0.117**,
-**0 accepted**; the weak solver scored 0.68–0.78 (it has the content memorized — reading beats
-reasoning).
-
-**Non-memorized doc (Mixtral), non-extractive causal challenger — three runs, NOT consistent:**
-
-| run | accepted | gap widening (plain → refined) | note |
-|---|---|---|---|
-| target=3, samples=2, maxRetries=3 | **1 / 3** | 0.306 → 0.508 (Δ +0.202) | fold steered a too-easy draft (weak 0.78) to an accepted one (weak 0.24) |
-| target=1, samples=3, maxRetries=4 | **1 / 1** | — | first causal draft already separated |
-| **target=3 — independent re-run** | **0 / 3** | 0.052 → 0.246 (Δ +0.194) | gap widened the same, but **no slot cleared the bar**; weak scored **0.75** on a near-miss — a competent, correct answer, not a struggle |
-
-**What reproduces:** the +0.19–0.20 gap-widening from the fold (both runs). **What does not:** the
-accepted count (0 to 2 of 3). The accept bar requires the weak model to *struggle* (< 0.5), and on
-these MoE-reasoning questions `llama-3.1-8b` is too often competent (0.75) to fall below it — so
-acceptance is close to a coin-flip at n=3. Total live spend ≈ **$0.25** across all runs.
+## The powered result — a real ~38% accept-rate
 
-## An autopsied accepted example (real discrimination, both answers read)
+**Design (fixed-slots, not until-N-accepted):** run a fixed K = 32 independent slots (each slot = one
+full challenger → refine → accept cycle), split 16 / 16 across the two docs, samples = 4 per solver
+(stabilise the weak mean), maxRetries = 2 (3 challenger attempts per slot). Each slot's outcome
+(accept / reject) and best gap are recorded, so cost is bounded and the rate is unbiased. Runnable:
+`src/autodata/powered.ts`; per-attempt autopsy JSONL per doc; the CIs are agent-eval's published
+estimators (`wilson` for the binomial accept-rate, `pairedBootstrap` for the paired widening).
 
-> **Q:** Walk through how the MoE layer processes a single token. If the router's gating network were
-> broken and always output uniform weights (G(x)_i = 1/8 for all 8 experts), how would the layer's
-> output differ from the intended behavior, and why is this failure mode problematic?
-
-- **strong (`gemini-2.5-pro`): [1.00, 1.00, 1.00]** — walks through top-2 routing, then derives that
-  uniform weights make the layer average ALL 8 experts (dense, no specialization/sparsity), losing
-  the point of the MoE. Correct.
-- **weak (`llama-3.1-8b`): [0.21, 0.27], mean 0.24** — restates the routing steps but does NOT derive
-  the failure consequence; it never reaches "all experts averaged → specialization lost."
-
-When the gap *does* open, it is real discrimination — not a judge artifact (judge verified above) or
-leakage (the answer is not in the context). **But it does not open reliably.** In the independent
-re-run, the analogous near-miss question drew a *competent* weak answer (0.75): `llama-3.1-8b`
-correctly explained that high positional locality routes consecutive tokens to the same expert →
-over-subscription, and that uniform routing would balance the load. On that draw the 8B reasoned
-fine, so weak ≮ 0.5 and nothing was accepted. The weak model's competence on these questions is the
-variance that makes acceptance a coin-flip.
+| metric | value | read |
+|---|---|---|
+| **accept-rate (headline)** | **38%  CI [23%, 55%]** (12 / 32) | excludes ~0 → **reliable, not a coin-flip** |
+| accept-rate (producing slots) | 44%  CI [28%, 63%] (12 / 27) | excludes the 5 challenger-stage (LaTeX) failures |
+| — mixtral | 19%  CI [7%, 43%]  (3 / 16) | the harder doc; still excludes 0 |
+| — deepseek-v3 | 56%  CI [33%, 77%] (9 / 16) | the easier-to-discriminate doc |
+| best gap / slot (n=27) | min −0.23 · median **0.42** · p90 0.80 · max 0.95 | how far each slot separated the tiers |
+| plain (first-draft) gap / slot | min −0.23 · median 0.19 · p90 0.61 · max 0.95 | the un-refined baseline |
+| **gap-widening Δ (plain → best-refined)** | mean **+0.103**  CI [+0.029, +0.193] (paired bootstrap, n=27) | the fold's lift; **excludes 0** (median Δ 0 — it helps a minority) |
+| weak score / attempt (n=33) | min 0.05 · median **0.55** · max 1.00 | the variance source: competent (≥ 0.50) on ~61% of attempts |
+| strong score / attempt (n=33) | min 0.21 · median **0.99** · max 1.00 | the strong solver almost always derives |
+
+**Accept-rule decomposition (33 quality-clean attempts):** strong ≥ 0.65 = **88%**, weak < 0.50 =
+**39%** ← the binding gate, gap ≥ 0.20 = 52%, all-three (= accept) = 36%. The strong solver derives
+almost everything; the bottleneck is the weak model failing — which happens on only ~39% of
+attempts, so the per-slot accept-rate is set by **how often `llama-3.1-8b` actually struggles**, not
+by the challenger or judge. **Total live spend: $0.57** for the 32-slot run (~$1.0 including pilots).
+
+## Two autopsied accepted examples (real discrimination, both answers read)
+
+**deepseek-v3 — gap 0.93 (weak 0.07, strong 1.00):**
+> **Q:** Why does using a *sequence-wise* auxiliary loss lead to a higher validation loss than a
+> *batch-wise* auxiliary loss or the auxiliary-loss-free method in MoE models?
+
+- **strong (`gemini-2.5-pro`): 1.00** — derives that the sequence-wise loss imposes a *stricter,
+  less flexible* per-sequence balance constraint that *hinders the emergence of expert
+  specialisation*. Correct, matches the reference.
+- **weak (`llama-3.1-8b`): [0.10, 0.03, 0.10, 0.03]** — *restates the question* and never derives the
+  reason. A recall-shaped non-answer; the judge's `reasoning` criterion floors it.
+
+**mixtral — gap 0.95 (weak 0.05, strong 1.00):**
+> **Q:** The text says each input is routed to 2 of 8 experts, yet the output sums `G(x)_i · E_i(x)`
+> over all `n` experts. Are these consistent? If not, which should be revised?
+
+- **strong: 1.00** — derives YES, consistent: the gating vector `G(x)` is *sparse* (nonzero only for
+  the 2 selected experts), so the full-`n` sum effectively includes only those 2. Correct.
+- **weak: [0.03, 0.07, 0.03, 0.07]** — concludes the statements are *inconsistent*; it never grasps
+  the sparse-gating equivalence. A genuine reasoning error, not a judge artifact or leakage (the
+  answer is derived, not in the context).
+
+These are real weak-fails-strong-derives examples on both docs — the loop is manufacturing genuine
+discrimination, not gaming the gap.
 
 ## The finding
 
-The two levers are **directionally confirmed and necessary**: a non-extractive causal challenger
-(no leakage) AND a grounding doc the weak solver hasn't memorized — drop either and it nulls hard
-(recall challenger leaks; the memorized Transformer paper lets the 8B recall). With both, the fold
-**reliably widens the strong/weak gap by ~+0.20** (reproduced in both runs).
-
-But "the discriminative reward works" is **NOT** established. Clearing the accept bar (weak must
-*struggle*, < 0.5) is noisy: 0–2 accepted of 3 across runs, because `llama-3.1-8b` answers these
-MoE-reasoning questions competently (0.75) about as often as it flails (0.24). At n=3 that is a
-coin-flip, not a result. Honest verdict: **promising, directionally right, under-powered** — the
-exact small-n shape that has repeatedly looked positive here and washed out at power.
+The question "does the causal-challenger loop reliably manufacture discriminating examples, or is
+acceptance a coin-flip ~0?" is now **settled at power: it reliably works.** Accept-rate **38%, CI
+[23%, 55%]** over 32 slots — the lower bound excludes ~0, and even the harder of the two docs
+(mixtral, 19% [7%, 43%]) excludes 0. The fold also **reliably widens the gap** (mean +0.103, CI
+[+0.029, +0.193]), reproducing the n=3 direction at power, though most of the discrimination comes
+from the first causal draft already separating (median widening 0 — the refine helps a minority of
+slots).
+
+Two honest caveats, both quantified, neither overturns the verdict:
+
+1. **Doc-dependence.** The rate ranges 19% (mixtral) → 56% (deepseek-v3). The pooled 38% is a real
+   average across two non-memorized MoE papers, not a single lucky doc — but expect the rate to move
+   with the source material's difficulty for the 8B.
+2. **The binding constraint is the weak model's competence, not the method.** `llama-3.1-8b` answers
+   these MoE-reasoning questions competently (weak median 0.55) more often than it flails, so only
+   ~39% of attempts clear the "weak must struggle" gate. A weaker weak model (or harder docs) would
+   raise the rate; a stronger one would lower it. The loop's discriminative reward works as designed —
+   the rate is a property of the **tier gap**, which is exactly what it should measure.
 
 ## Status
 
-Mechanism + observability: solid (gap-widening reproduced, judge reliability checked, every attempt
-dumped to a JSONL autopsy trail via `AUTODATA_ATTEMPTS` — which is how the over-claim was caught).
-Empirical positive: **not yet** — acceptance is too noisy at n=3. To actually settle it: raise
-`samples` (stabilize the weak mean per question), raise the slot count to n≥24, and report the
-*accepted-rate* with a confidence interval — not a single lucky run. Until then this is a confirmed
-direction, not a confirmed win.
+Mechanism + observability + **power**: solid. The accept-rate is measured at n=32 with a Wilson CI
+that excludes ~0, the gap-widening with a paired-bootstrap CI that excludes 0, every attempt dumped
+to a JSONL autopsy trail, and the two headline accepted examples read end-to-end (real
+discrimination). The n=3 "coin-flip ~0?" worry is **resolved: ~38% accept-rate, not zero.**
 
 ## Reproduce
 
 ```
-dotenvx run -f <secrets>.env -- pnpm tsx src/autodata/run.ts        # causal, default Mixtral doc
-dotenvx run -f <secrets>.env -- pnpm tsx src/autodata/calibrate.ts  # recall-vs-causal A/B, same doc
+# Powered accept-rate + CIs (32 slots, 2 docs, samples=4) — the headline result:
+dotenvx run -f <secrets>.env -- pnpm tsx src/autodata/powered.ts
+#   knobs: AUTODATA_SLOTS_PER_DOC=16  AUTODATA_SAMPLES=4  AUTODATA_MAXRETRIES=2
+
+# Single-doc builder + recall-vs-causal calibration (the lever's A/B):
+dotenvx run -f <secrets>.env -- pnpm tsx src/autodata/run.ts
+dotenvx run -f <secrets>.env -- pnpm tsx src/autodata/calibrate.ts
 ```
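
Review note on the accept-rate intervals quoted in `autodata-live.md`: the harness takes its interval from agent-eval's `wilson`, whose internals are not part of this diff. The sketch below is a hand-rolled Wilson score interval, written only to show how the quoted bounds follow from the raw counts. The function name `wilsonInterval` and the choice `z = 1.96` are assumptions for illustration, not the library's API.

```typescript
// Hand-rolled Wilson score interval (illustration only; the harness itself
// calls agent-eval's `wilson`, which is not reproduced here).
interface Interval {
  estimate: number
  lower: number
  upper: number
}

// z = 1.96 is the usual two-sided 95% normal quantile (an assumption here).
function wilsonInterval(successes: number, trials: number, z = 1.96): Interval {
  if (trials <= 0) return { estimate: Number.NaN, lower: Number.NaN, upper: Number.NaN }
  const p = successes / trials
  const z2 = z * z
  const denom = 1 + z2 / trials
  // The interval is centred on a shrunken estimate, not on p itself.
  const center = (p + z2 / (2 * trials)) / denom
  const margin = (z * Math.sqrt((p * (1 - p)) / trials + z2 / (4 * trials * trials))) / denom
  return { estimate: p, lower: center - margin, upper: center + margin }
}

const pct = (x: number): number => Math.round(x * 100)

// Counts taken from the results table: pooled, mixtral, deepseek-v3.
const rows: Array<[string, number, number]> = [
  ['pooled', 12, 32],
  ['mixtral', 3, 16],
  ['deepseek-v3', 9, 16],
]
for (const [tag, accepted, slots] of rows) {
  const ci = wilsonInterval(accepted, slots)
  console.log(`${tag}: ${pct(ci.estimate)}% [${pct(ci.lower)}%, ${pct(ci.upper)}%]`)
}
```

Run against the three count pairs from the table, this should round to the same percentages the document reports, which is a cheap cross-check that the headline and per-doc rows were computed over the stated denominators.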
diff --git a/src/autodata/index.ts b/src/autodata/index.ts
index c1372a7..deb45ca 100644
--- a/src/autodata/index.ts
+++ b/src/autodata/index.ts
@@ -36,6 +36,7 @@ export {
   type GroundedDoc,
   groundDoc,
 } from './grounding'
+export { analyzeTrails, type DocTrail, type PoweredStats } from './powered'
 export {
   type AutodataRoles,
   buildAutodataRoles,
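
Review note on the newly exported `analyzeTrails`: its `conditions` block decomposes the accept rule (strong ≥ 0.65, weak < 0.50, gap ≥ 0.20) per attempt. A minimal, self-contained sketch of that rule follows. The names `checkAcceptRule` and `RuleCheck` are hypothetical and exist only for this illustration; the thresholds are the ones stated in the results doc.

```typescript
// Thresholds as stated in the results doc; constant names are illustrative.
const MIN_STRONG = 0.65
const MAX_WEAK = 0.5
const MIN_GAP = 0.2

interface RuleCheck {
  strongHi: boolean // strong solver answered well enough
  weakLo: boolean // weak solver struggled
  gapWide: boolean // the two tiers are separated
  accept: boolean // all three sub-conditions hold
}

// Evaluate each sub-condition separately so a reject can be attributed to a gate.
function checkAcceptRule(weakMean: number, strongMean: number): RuleCheck {
  const gap = strongMean - weakMean
  const strongHi = strongMean >= MIN_STRONG
  const weakLo = weakMean < MAX_WEAK
  const gapWide = gap >= MIN_GAP
  return { strongHi, weakLo, gapWide, accept: strongHi && weakLo && gapWide }
}

// The autopsied deepseek-v3 example (weak 0.07, strong 1.00) clears every gate.
console.log(checkAcceptRule(0.07, 1.0))
// At the reported medians (weak 0.55, strong 0.99) only the weak gate fails.
console.log(checkAcceptRule(0.55, 0.99))
```

The second call shows why the document names weak < 0.50 as the binding gate: at the median scores the gap is wide and the strong solver passes, yet the attempt is still rejected.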
diff --git a/src/autodata/powered.test.ts b/src/autodata/powered.test.ts
new file mode 100644
index 0000000..861e127
--- /dev/null
+++ b/src/autodata/powered.test.ts
@@ -0,0 +1,130 @@
+import { describe, expect, it } from 'vitest'
+import type { AttemptRecord, SolverEval } from './data-creation-loop'
+import { analyzeTrails, type DocTrail } from './powered'
+
+// A minimal attempt-row factory: only the fields `analyzeTrails` reads matter.
+function attempt(args: {
+  slot: number
+  iteration: number
+  weak: number
+  strong: number
+  accept: boolean
+  qualityOk?: boolean
+}): AttemptRecord {
+  const weak: SolverEval = { mean: args.weak, samples: [] }
+  const strong: SolverEval = { mean: args.strong, samples: [] }
+  const gap = args.strong - args.weak
+  return {
+    slotIndex: args.slot,
+    iteration: args.iteration,
+    example: { context: 'c', question: 'q', reference: 'r', rubric: ['a', 'b'] },
+    weak,
+    strong,
+    gap,
+    decision: { accept: args.accept, reason: args.accept ? 'discriminates' : 'rejected' },
+    qualityOk: args.qualityOk ?? true,
+  }
+}
+
+describe('analyzeTrails (powered aggregation)', () => {
+  it('counts accept-rate per slot over the requested target, not the trailed slots', () => {
+    // 3 slots requested, only 2 left a trail; slot 0 accepted, slot 1 rejected, slot 2 errored (no rows).
+    const trail: DocTrail = {
+      tag: 'd',
+      url: 'u',
+      target: 3,
+      rows: [
+        attempt({ slot: 0, iteration: 0, weak: 0.7, strong: 0.8, accept: false }),
+        attempt({ slot: 0, iteration: 1, weak: 0.3, strong: 0.9, accept: true }),
+        attempt({ slot: 1, iteration: 0, weak: 0.6, strong: 0.7, accept: false }),
+      ],
+    }
+    const s = analyzeTrails([trail], { bootstrapSeed: 1 })
+    expect(s.totalSlots).toBe(3) // denominator = requested target, not 2 trailed slots
+    expect(s.acceptedSlots).toBe(1)
+    expect(s.acceptRate.estimate).toBeCloseTo(1 / 3, 6)
+    // Wilson lower bound is strictly above 0 and the point estimate sits inside the interval.
+    expect(s.acceptRate.lower).toBeGreaterThan(0)
+    expect(s.acceptRate.lower).toBeLessThan(s.acceptRate.estimate)
+    expect(s.acceptRate.upper).toBeGreaterThan(s.acceptRate.estimate)
+    // Slot 2 left no trail → counted as a challenger-stage failure, still in the denominator.
+    expect(s.slotsWithAttempts).toBe(2)
+    expect(s.challengerFailedSlots).toBe(1)
+    // Among only producing slots the denominator is 2, so the rate is higher.
+    expect(s.acceptRateAmongProducing.estimate).toBeCloseTo(0.5, 6)
+  })
+
+  it('takes the slot best-gap from quality-clean attempts and pairs plain→refined', () => {
+    const trail: DocTrail = {
+      tag: 'd',
+      url: 'u',
+      target: 1,
+      rows: [
+        attempt({ slot: 0, iteration: 0, weak: 0.7, strong: 0.8, accept: false }), // plain gap 0.10
+        attempt({ slot: 0, iteration: 1, weak: 0.3, strong: 0.9, accept: true }), // refined gap 0.60
+      ],
+    }
+    const s = analyzeTrails([trail], { bootstrapSeed: 1 })
+    expect(s.bestGapPerSlot.max).toBeCloseTo(0.6, 6)
+    expect(s.plainGap.median).toBeCloseTo(0.1, 6)
+    // The fold widened the gap from 0.10 to 0.60 → a positive paired delta.
+    expect(s.widening.meanDelta).toBeCloseTo(0.5, 6)
+  })
+
+  it('a leaky (quality-failed) draft contributes a 0 best-gap but is excluded from the condition decomposition', () => {
+    const trail: DocTrail = {
+      tag: 'd',
+      url: 'u',
+      target: 1,
+      rows: [
+        attempt({ slot: 0, iteration: 0, weak: 0, strong: 0, accept: false, qualityOk: false }),
+      ],
+    }
+    const s = analyzeTrails([trail], { bootstrapSeed: 1 })
+    expect(s.bestGapPerSlot.max).toBe(0)
+    expect(s.conditions.nAttempts).toBe(0) // quality-failed attempt excluded
+    expect(s.acceptedSlots).toBe(0)
+  })
+
+  it('decomposes the accept rule: the binding gate is weak < 0.5', () => {
+    // All attempts have strong high + gap wide, but weak only struggles half the time.
+    const trail: DocTrail = {
+      tag: 'd',
+      url: 'u',
+      target: 4,
+      rows: [
+        attempt({ slot: 0, iteration: 0, weak: 0.2, strong: 0.9, accept: true }),
+        attempt({ slot: 1, iteration: 0, weak: 0.7, strong: 0.95, accept: false }), // weak too competent
+        attempt({ slot: 2, iteration: 0, weak: 0.1, strong: 0.85, accept: true }),
+        attempt({ slot: 3, iteration: 0, weak: 0.75, strong: 0.98, accept: false }), // weak too competent
+      ],
+    }
+    const s = analyzeTrails([trail], { bootstrapSeed: 1 })
+    expect(s.conditions.strongHi).toBeCloseTo(1, 6) // strong always high
+    expect(s.conditions.gapWide).toBeCloseTo(1, 6) // gap always wide
+    expect(s.conditions.weakLo).toBeCloseTo(0.5, 6) // weak struggles only half → the binding gate
+    expect(s.conditions.all).toBeCloseTo(0.5, 6)
+    expect(s.acceptRate.estimate).toBeCloseTo(0.5, 6)
+  })
+
+  it('aggregates across multiple docs (denominators add)', () => {
+    const a: DocTrail = {
+      tag: 'a',
+      url: 'u1',
+      target: 2,
+      rows: [attempt({ slot: 0, iteration: 0, weak: 0.2, strong: 0.9, accept: true })],
+    }
+    const b: DocTrail = {
+      tag: 'b',
+      url: 'u2',
+      target: 2,
+      rows: [attempt({ slot: 0, iteration: 0, weak: 0.6, strong: 0.7, accept: false })],
+    }
+    const s = analyzeTrails([a, b], { bootstrapSeed: 1 })
+    expect(s.totalSlots).toBe(4)
+    expect(s.acceptedSlots).toBe(1)
+    expect(s.perDoc).toHaveLength(2)
+    expect(s.perDoc[0]?.accepted).toBe(1)
+    expect(s.perDoc[1]?.accepted).toBe(0)
+  })
+})
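
Review note on the `widening` assertions in the tests above: they pin the mean delta but not the interval, which comes from agent-eval's `pairedBootstrap`. As a rough mental model, the sketch below resamples the per-slot deltas (refined minus plain) with replacement and reads off percentile bounds. It is a plain percentile bootstrap written for illustration; the library may use a different interval method, and `pairedBootstrapMean` and `mulberry32` are names introduced here, not part of agent-eval.

```typescript
// Small seeded PRNG (mulberry32) so the resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0
  return () => {
    a = (a + 0x6d2b79f5) >>> 0
    let t = a
    t = Math.imul(t ^ (t >>> 15), t | 1)
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61)
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296
  }
}

interface BootstrapResult {
  n: number
  meanDelta: number
  lower: number
  upper: number
}

// Percentile bootstrap on the mean of paired differences (after - before).
function pairedBootstrapMean(
  before: number[],
  after: number[],
  resamples = 2000,
  seed = 1,
): BootstrapResult {
  if (before.length !== after.length) throw new Error('paired samples must have equal length')
  const deltas = after.map((x, i) => x - (before[i] as number))
  const n = deltas.length
  if (n === 0) throw new Error('need at least one pair')
  const rand = mulberry32(seed)
  const means: number[] = []
  for (let r = 0; r < resamples; r++) {
    let sum = 0
    // Resample whole pairs: pick n deltas with replacement.
    for (let i = 0; i < n; i++) sum += deltas[Math.floor(rand() * n)] as number
    means.push(sum / n)
  }
  means.sort((x, y) => x - y)
  // 2.5th and 97.5th percentiles of the resampled means.
  const at = (q: number): number =>
    means[Math.min(resamples - 1, Math.floor(q * resamples))] as number
  const meanDelta = deltas.reduce((s, x) => s + x, 0) / n
  return { n, meanDelta, lower: at(0.025), upper: at(0.975) }
}

// Plain vs best-refined gaps for four hypothetical slots; two slots see no lift.
console.log(pairedBootstrapMean([0.1, 0.3, 0.2, 0.4], [0.6, 0.3, 0.5, 0.4]))
```

Resampling the deltas rather than the two columns independently is what keeps the pairing: each slot's plain gap is only ever compared with that same slot's refined gap.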
diff --git a/src/autodata/powered.ts b/src/autodata/powered.ts
new file mode 100644
index 0000000..3140e8e
--- /dev/null
+++ b/src/autodata/powered.ts
@@ -0,0 +1,450 @@
+/**
+ * Autodata — POWERED accept-rate measurement.
+ *
+ * Settles the question n=3 was too noisy to answer: does the causal-challenger loop RELIABLY
+ * manufacture discriminating examples, or is acceptance a coin-flip? It runs a FIXED number of
+ * independent slots (each slot = one full challenger→refine→accept cycle) split across >= 2
+ * non-memorized grounding docs, and reports the ACCEPTED-RATE with a Wilson 95% CI, the per-slot
+ * best-gap distribution, and the plain-vs-refined gap-widening with a paired-bootstrap CI.
+ *
+ * This is a FIXED-SLOTS harness, NOT until-N-accepted: it runs K slots and records each slot's
+ * outcome (accept/reject) + best gap, so cost is bounded and the rate is unbiased. It reuses
+ * `buildAutodataDataset` — which already runs exactly `target` independent slots — so nothing in the
+ * loop is rebuilt; only the cross-slot aggregation + the two confidence intervals are added here.
+ * The CIs are agent-eval's published estimators (`wilson` for the binomial accept-rate, `pairedBootstrap`
+ * for the paired plain-vs-refined widening), never hand-rolled.
+ *
+ * The source of truth is the per-attempt autopsy JSONL each doc writes incrementally: the final
+ * statistics are recomputed by re-reading those trails from disk (`analyzeTrails`), so an interrupted
+ * run loses no data — re-run the analysis over the JSONL.
+ *
+ * Run (key never printed):
+ *   dotenvx run -f <secrets>.env -- \
+ *     pnpm tsx src/autodata/powered.ts
+ *
+ * Env knobs: AUTODATA_SLOTS_PER_DOC (default 14), AUTODATA_SAMPLES (default 4),
+ *            AUTODATA_MAXRETRIES (default 3), AUTODATA_DOCS ("url|focus|tag,url|focus|tag" override),
+ *            AUTODATA_{WEAK,STRONG,CHALLENGER,JUDGE}_MODEL, TANGLE_API_KEY (or TANGLE_ROUTER_KEY).
+ */
+
+import { readFile } from 'node:fs/promises'
+import { pairedBootstrap, wilson } from '@tangle-network/agent-eval'
+import { buildAutodataDataset } from './build-dataset'
+import type { AttemptRecord } from './data-creation-loop'
+import { groundDoc } from './grounding'
+import {
+  CHALLENGER_MODEL,
+  JUDGE_MODEL,
+  STRONG_SOLVER_MODEL,
+  smokeTestModels,
+  WEAK_SOLVER_MODEL,
+} from './router-roles'
+
+/** One grounding document to split slots across. */
+interface DocSpec {
+  url: string
+  focus: string
+  tag: string
+}
+
+/**
+ * Two non-memorized, reasoning-rich MoE papers (both post-date `llama-3.1-8b`'s knowledge cutoff, so
+ * the weak solver must REASON from the context, not recall) — the precondition for any gap to open.
+ * Each `focus` selects a PROSE mechanism chunk in the same "MoE-expert reasoning" band: a chunk dense
+ * with LaTeX equations (e.g. DeepSeek-V3's MLA section) breaks the challenger's strict-JSON output, so
+ * both focuses target the prose description of an expert-routing mechanism, not the equations.
+ *   • Mixtral-of-Experts (2401.04088, Jan 2024) — sparse MoE expert routing / gating.
+ *   • DeepSeek-V3 (2412.19437, Dec 2024) — auxiliary-loss-free load balancing / expert specialization.
+ */
+const defaultDocs: DocSpec[] = [
+  { url: 'https://ar5iv.labs.arxiv.org/html/2401.04088', focus: 'expert', tag: 'mixtral' },
+  { url: 'https://ar5iv.labs.arxiv.org/html/2412.19437', focus: 'auxiliary', tag: 'deepseek-v3' },
+]
+
+function envInt(name: string, fallback: number): number {
+  const raw = process.env[name]
+  if (!raw) return fallback
+  const n = Number.parseInt(raw, 10)
+  if (!Number.isFinite(n) || n <= 0) throw new Error(`${name}='${raw}' is not a positive integer`)
+  return n
+}
+
+function parseDocsEnv(): DocSpec[] {
+  const raw = process.env.AUTODATA_DOCS
+  if (!raw) return defaultDocs
+  return raw.split(',').map((entry) => {
+    const [url, focus, tag] = entry.split('|').map((s) => s.trim())
+    if (!url || !focus || !tag)
+      throw new Error(`AUTODATA_DOCS entry '${entry}' is not url|focus|tag`)
+    return { url, focus, tag }
+  })
+}
+
+// ── Descriptive distribution helpers (NOT inferential — those reuse agent-eval) ────────────────
+
+/** Linearly interpolated quantile; sorts a copy, so the input may be unsorted. */
+function quantile(xs: number[], q: number): number {
+  if (xs.length === 0) return Number.NaN
+  const s = [...xs].sort((a, b) => a - b)
+  if (s.length === 1) return s[0] as number
+  const pos = (s.length - 1) * q
+  const lo = Math.floor(pos)
+  const hi = Math.ceil(pos)
+  const frac = pos - lo
+  return (s[lo] as number) * (1 - frac) + (s[hi] as number) * frac
+}
+
+function mean(xs: number[]): number {
+  return xs.length === 0 ? Number.NaN : xs.reduce((a, b) => a + b, 0) / xs.length
+}
+
+interface Distribution {
+  n: number
+  min: number
+  median: number
+  p90: number
+  max: number
+  mean: number
+}
+
+function describe(xs: number[]): Distribution {
+  return {
+    n: xs.length,
+    min: xs.length ? Math.min(...xs) : Number.NaN,
+    median: quantile(xs, 0.5),
+    p90: quantile(xs, 0.9),
+    max: xs.length ? Math.max(...xs) : Number.NaN,
+    mean: mean(xs),
+  }
+}
+
+// ── The aggregation: per-attempt JSONL trail → per-slot outcomes → CIs ─────────────────────────
+
+/** One JSONL row from a per-attempt trail: an `AttemptRecord` plus the challenger style tag. */
+type TrailRow = AttemptRecord & { style?: string }
+
+/** One doc's trail + the number of slots it was asked to run (the accept-rate denominator). */
+export interface DocTrail {
+  tag: string
+  url: string
+  /** Slots requested for this doc — the denominator (a slot whose drafts all errored counts as a
+   *  reject, so the denominator is the requested count, not the number of slots that left a trail). */
+  target: number
+  rows: TrailRow[]
+}
+
+export interface PoweredStats {
+  totalSlots: number
+  acceptedSlots: number
+  /** Slots that produced >= 1 attempt (the challenger authored a parseable example at least once). */
+  slotsWithAttempts: number
+  /** Slots where the challenger threw on every draft (0 attempts): an infra/parse failure, NOT a
+   *  discrimination reject. Surfaced so it can be flagged as a threat to validity. */
+  challengerFailedSlots: number
+  /** Wilson 95% CI on the accept-rate (binomial: a slot accepts or it does not). */
+  acceptRate: { estimate: number; lower: number; upper: number }
+  /** Wilson 95% CI on the accept-rate among only the slots that produced >= 1 attempt (excludes the
+   *  challenger-stage failures) — the rate if every slot had at least authored an example. */
+  acceptRateAmongProducing: { estimate: number; lower: number; upper: number }
+  /** Per-slot BEST gap (max gap over the slot's quality-clean attempts) — the discriminating power
+   *  each slot reached, accepted or not. */
+  bestGapPerSlot: Distribution
+  /** Plain (first-draft) gap per slot. */
+  plainGap: Distribution
+  /** Paired plain→best-refined gap-widening with a bootstrap CI on the mean delta. */
+  widening: { n: number; meanDelta: number; medianDelta: number; lower: number; upper: number }
+  /** Weak solver mean-score distribution over quality-clean attempts (the coin-flip's source). */
+  weakScore: Distribution
+  /** Strong solver mean-score distribution over quality-clean attempts. */
+  strongScore: Distribution
+  /** Per-attempt pass fractions for each sub-condition of the accept rule (decomposes the rate). */
+  conditions: {
+    nAttempts: number
+    strongHi: number // fraction with strong >= 0.65
+    weakLo: number // fraction with weak < 0.50  (the "weak must struggle" gate)
+    gapWide: number // fraction with gap >= 0.20
+    all: number // fraction passing all three (== accept)
+  }
+  perDoc: { tag: string; target: number; accepted: number; meanBestGap: number }[]
+}
+
+const minStrong = 0.65
+const maxWeak = 0.5
+const minGap = 0.2
+
+/**
+ * Compute the powered statistics from the per-doc attempt trails. Pure + deterministic (seeded
+ * bootstrap), so it can be re-run standalone over the on-disk JSONL after an interrupted run.
+ */
+export function analyzeTrails(trails: DocTrail[], opts?: { bootstrapSeed?: number }): PoweredStats {
+  const seed = opts?.bootstrapSeed ?? 0xc0ffee
+
+  let totalSlots = 0
+  let acceptedSlots = 0
+  let slotsWithAttempts = 0
+  const bestGapPerSlotAll: number[] = []
+  const plainGapsPaired: number[] = []
+  const refinedGapsPaired: number[] = []
+  const weakScores: number[] = []
+  const strongScores: number[] = []
+  let nAttempts = 0
+  let strongHi = 0
+  let weakLo = 0
+  let gapWide = 0
+  let acceptAll = 0
+  const perDoc: PoweredStats['perDoc'] = []
+
+  for (const trail of trails) {
+    totalSlots += trail.target
+
+    // Group this doc's attempts by slot index.
+    const bySlot = new Map<number, TrailRow[]>()
+    for (const row of trail.rows) {
+      const arr = bySlot.get(row.slotIndex) ?? []
+      arr.push(row)
+      bySlot.set(row.slotIndex, arr)
+    }
+
+    slotsWithAttempts += bySlot.size
+    let docAccepted = 0
+    const docBestGaps: number[] = []
+    for (const [, rows] of bySlot) {
+      // A slot is ACCEPTED iff any of its attempts cleared the accept rule.
+      const accepted = rows.some((r) => r.decision.accept)
+      if (accepted) {
+        acceptedSlots += 1
+        docAccepted += 1
+      }
+      // Best gap the slot reached over its quality-clean attempts (a leaky/thin draft has gap 0).
+      const cleanGaps = rows.filter((r) => r.qualityOk).map((r) => r.gap)
+      const bestGap = cleanGaps.length ? Math.max(...cleanGaps) : 0
+      bestGapPerSlotAll.push(bestGap)
+      docBestGaps.push(bestGap)
+
+      // Paired plain→refined: plain = the earliest iteration's gap, refined = the slot's best gap.
+      const earliest = [...rows].sort((a, b) => a.iteration - b.iteration)[0]
+      if (earliest) {
+        plainGapsPaired.push(earliest.gap)
+        refinedGapsPaired.push(bestGap)
+      }
+
+      // Per-attempt condition decomposition (quality-clean attempts only: a leaky draft never
+      // reached the solvers, so its 0/0 scores would depress strongHi/gapWide and inflate weakLo).
+      for (const r of rows) {
+        if (!r.qualityOk) continue
+        nAttempts += 1
+        weakScores.push(r.weak.mean)
+        strongScores.push(r.strong.mean)
+        if (r.strong.mean >= minStrong) strongHi += 1
+        if (r.weak.mean < maxWeak) weakLo += 1
+        if (r.gap >= minGap) gapWide += 1
+        if (r.strong.mean >= minStrong && r.weak.mean < maxWeak && r.gap >= minGap) acceptAll += 1
+      }
+    }
+    perDoc.push({
+      tag: trail.tag,
+      target: trail.target,
+      accepted: docAccepted,
+      meanBestGap: mean(docBestGaps),
+    })
+  }
+
+  const accept = wilson(acceptedSlots, totalSlots, 0.95)
+  const acceptProducing = wilson(acceptedSlots, slotsWithAttempts, 0.95)
+  const boot = pairedBootstrap(plainGapsPaired, refinedGapsPaired, {
+    confidence: 0.95,
+    statistic: 'mean',
+    seed,
+    resamples: 5000,
+  })
+
+  return {
+    totalSlots,
+    acceptedSlots,
+    slotsWithAttempts,
+    challengerFailedSlots: totalSlots - slotsWithAttempts,
+    acceptRate: { estimate: accept.estimate, lower: accept.lower, upper: accept.upper },
+    acceptRateAmongProducing: {
+      estimate: acceptProducing.estimate,
+      lower: acceptProducing.lower,
+      upper: acceptProducing.upper,
+    },
+    bestGapPerSlot: describe(bestGapPerSlotAll),
+    plainGap: describe(plainGapsPaired),
+    widening: {
+      n: boot.n,
+      meanDelta: boot.mean,
+      medianDelta: boot.median,
+      lower: boot.low,
+      upper: boot.high,
+    },
+    weakScore: describe(weakScores),
+    strongScore: describe(strongScores),
+    conditions: {
+      nAttempts,
+      strongHi: nAttempts ? strongHi / nAttempts : Number.NaN,
+      weakLo: nAttempts ? weakLo / nAttempts : Number.NaN,
+      gapWide: nAttempts ? gapWide / nAttempts : Number.NaN,
+      all: nAttempts ? acceptAll / nAttempts : Number.NaN,
+    },
+    perDoc,
+  }
+}
+
+async function readTrail(path: string): Promise<TrailRow[]> {
+  // A slot whose every challenger draft errored produces no attempt → no trail file is written.
+  // That is a legitimate all-reject outcome (the loop failed to manufacture an example), not a
+  // crash: return an empty trail so those slots still count against the accept-rate denominator.
+  let text: string
+  try {
+    text = await readFile(path, 'utf8')
+  } catch (err) {
+    if ((err as NodeJS.ErrnoException).code === 'ENOENT') return []
+    throw err
+  }
+  return text
+    .split('\n')
+    .filter((l) => l.trim().length > 0)
+    .map((l) => JSON.parse(l) as TrailRow)
+}
+
+function pctCI(x: { estimate: number; lower: number; upper: number }): string {
+  return `${(x.estimate * 100).toFixed(0)}%  CI[${(x.lower * 100).toFixed(0)}%, ${(x.upper * 100).toFixed(0)}%]`
+}
+
+function dist(d: Distribution): string {
+  return `n=${d.n}  min=${d.min.toFixed(2)} median=${d.median.toFixed(2)} p90=${d.p90.toFixed(2)} max=${d.max.toFixed(2)} (mean=${d.mean.toFixed(2)})`
+}
+
+function printReport(stats: PoweredStats, spendUsd: number): void {
+  console.log('\n══════════════════════════════════════════════════════════════════════════')
+  console.log(' POWERED ACCEPT-RATE — does the causal-challenger loop reliably discriminate?')
+  console.log('══════════════════════════════════════════════════════════════════════════\n')
+  console.log(`  slots run            : ${stats.totalSlots}  (${stats.acceptedSlots} accepted)`)
+  console.log(`  ACCEPT-RATE          : ${pctCI(stats.acceptRate)}   ← the headline (Wilson 95%)`)
+  if (stats.challengerFailedSlots > 0) {
+    console.log(
+      `  challenger-failed    : ${stats.challengerFailedSlots} slot(s) produced 0 attempts (infra/parse, not a discrimination reject)`,
+    )
+    console.log(
+      `  accept-rate (producing): ${pctCI(stats.acceptRateAmongProducing)}  (excl. challenger-failed slots)`,
+    )
+  }
+  console.log('')
+  console.log(`  best gap / slot      : ${dist(stats.bestGapPerSlot)}`)
+  console.log(`  plain gap / slot     : ${dist(stats.plainGap)}`)
+  console.log(
+    `  gap-widening Δ       : mean ${stats.widening.meanDelta >= 0 ? '+' : ''}${stats.widening.meanDelta.toFixed(
+      3,
+    )}  median ${stats.widening.medianDelta >= 0 ? '+' : ''}${stats.widening.medianDelta.toFixed(3)}  ` +
+      `CI[${stats.widening.lower.toFixed(3)}, ${stats.widening.upper.toFixed(3)}]  (paired bootstrap, n=${stats.widening.n})`,
+  )
+  console.log('')
+  console.log(`  weak score / attempt : ${dist(stats.weakScore)}`)
+  console.log(`  strong score / attempt: ${dist(stats.strongScore)}`)
+  console.log('')
+  const c = stats.conditions
+  console.log(`  accept-rule decomposition over ${c.nAttempts} quality-clean attempts:`)
+  console.log(`    strong >= ${minStrong.toFixed(2)}  : ${(c.strongHi * 100).toFixed(0)}%`)
+  console.log(
+    `    weak   <  ${maxWeak.toFixed(2)}  : ${(c.weakLo * 100).toFixed(0)}%   ← the binding gate (weak must struggle)`,
+  )
+  console.log(`    gap    >= ${minGap.toFixed(2)}  : ${(c.gapWide * 100).toFixed(0)}%`)
+  console.log(`    all three (accept): ${(c.all * 100).toFixed(0)}%`)
+  console.log('')
+  console.log('  per-doc:')
+  for (const d of stats.perDoc) {
+    console.log(
+      `    ${d.tag.padEnd(14)} accepted ${d.accepted}/${d.target}  mean best-gap ${d.meanBestGap.toFixed(3)}`,
+    )
+  }
+  console.log(`\n  total live spend     : $${spendUsd.toFixed(4)}`)
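+  // Verdict rule: a Wilson lower bound above 15% counts as a repeatable rate; an upper bound
+  // below 20% means the rate sits near zero; anything in between is not settled at this n.
+  const lo = stats.acceptRate.lower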
+  const verdict =
+    lo > 0.15
+      ? `WORKS AT POWER — accept-rate CI lower bound ${(lo * 100).toFixed(0)}% excludes ~0`
+      : stats.acceptRate.upper < 0.2
+        ? 'COIN-FLIP / DOES NOT RELIABLY WORK — accept-rate CI sits near 0'
+        : 'UNDER-POWERED / MARGINAL — accept-rate CI still includes near-0; not settled'
+  console.log(`\n  VERDICT: ${verdict}\n`)
+}
+
+async function main(): Promise<void> {
+  const apiKey = process.env.TANGLE_API_KEY ?? process.env.TANGLE_ROUTER_KEY
+  if (!apiKey) {
+    throw new Error('no TANGLE_API_KEY or TANGLE_ROUTER_KEY in env; run under dotenvx so the key is set')
+  }
+
+  const docs = parseDocsEnv()
+  const slotsPerDoc = envInt('AUTODATA_SLOTS_PER_DOC', 14)
+  const samples = envInt('AUTODATA_SAMPLES', 4)
+  const maxRetries = envInt('AUTODATA_MAXRETRIES', 3)
+
+  console.log('Autodata · POWERED accept-rate measurement')
+  console.log(
+    `  challenger/judge=${CHALLENGER_MODEL}/${JUDGE_MODEL}  weak=${WEAK_SOLVER_MODEL}  strong=${STRONG_SOLVER_MODEL}`,
+  )
+  console.log(
+    `  ${docs.length} docs × ${slotsPerDoc} slots = ${docs.length * slotsPerDoc} slots · samples=${samples} maxRetries=${maxRetries}\n`,
+  )
+
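+  // ── COST GATE: one cheap call per distinct model, all must return non-empty content before the burn ──
+  const smoke = await smokeTestModels({
+    apiKey,
+    // Include the judge too; Array.from(new Set(...)) avoids a duplicate call when it shares a model.
+    models: Array.from(new Set([CHALLENGER_MODEL, JUDGE_MODEL, WEAK_SOLVER_MODEL, STRONG_SOLVER_MODEL])),
+  })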
+  for (const s of smoke) {
+    console.log(
+      `  ${s.ok ? 'ok ' : 'DEAD'} ${s.model.padEnd(28)} chars=${String(s.contentChars).padStart(4)}  ` +
+        `finish=${s.finishReason ?? '?'}  cost=$${s.costUsd.toFixed(5)} (${s.costSource})`,
+    )
+  }
+  const dead = smoke.filter((s) => !s.ok)
+  if (dead.length > 0) {
+    throw new Error(`cost gate failed — empty content from: ${dead.map((d) => d.model).join(', ')}`)
+  }
+
+  // ── Run K slots per doc (each doc writes its own incremental autopsy trail). ──
+  const trails: DocTrail[] = []
+  let spendUsd = 0
+  for (const doc of docs) {
+    const grounded = await groundDoc({ url: doc.url, focus: doc.focus })
+    console.log(
+      `\n[${doc.tag}] grounded ${grounded.url}  chunk=${grounded.chunkIndex}/${grounded.totalChunks} (${grounded.doc.length} chars)`,
+    )
+    const attemptsPath = `data/powered-${doc.tag}-attempts.jsonl`
+    const result = await buildAutodataDataset({
+      apiKey,
+      source: grounded,
+      outPath: `data/powered-${doc.tag}.jsonl`,
+      attemptsPath,
+      target: slotsPerDoc,
+      samples,
+      maxRetries,
+    })
+    const docSpend = result.cost.summary().totalCostUsd
+    spendUsd += docSpend
+    console.log(
+      `[${doc.tag}] done · ${result.accepted.length}/${slotsPerDoc} accepted · ${result.attempts.length} attempts · $${docSpend.toFixed(4)}`,
+    )
+    trails.push({
+      tag: doc.tag,
+      url: doc.url,
+      target: slotsPerDoc,
+      rows: await readTrail(attemptsPath),
+    })
+  }
+
+  // ── Recompute everything from the on-disk trails (durable source of truth). ──
+  const stats = analyzeTrails(trails)
+  printReport(stats, spendUsd)
+
+  // Emit the machine-readable result alongside the prose, for the doc + any re-analysis.
+  console.log('RESULT_JSON ' + JSON.stringify({ ...stats, spendUsd }))
+}
+
+// Only auto-run when invoked directly (keeps `analyzeTrails` importable + unit-testable).
+if (process.argv[1] && process.argv[1].endsWith('powered.ts')) {
+  main().catch((err) => {
+    console.error(err)
+    process.exit(1)
+  })
+}