diff --git a/docs/results/autodata-live.md b/docs/results/autodata-live.md index 0847d25..80df2e8 100644 --- a/docs/results/autodata-live.md +++ b/docs/results/autodata-live.md @@ -1,17 +1,16 @@ -# Autodata live result: the causal challenger widens the gap (reproduced) — but clearing the accept bar is noisy at this n/tier (NOT robust) +# Autodata live result: the causal-challenger loop reliably discriminates at power — 38% accept-rate, CI [23%, 55%] (NOT a coin-flip) -Running the agentic data-creation loop (`src/autodata/`) on a real arXiv doc with real two-tier +Running the agentic data-creation loop (`src/autodata/`) on real arXiv docs with real two-tier solvers, to manufacture training examples that separate a strong solver from a weak one (the discriminative reward of the Autodata / Agentic-Self-Instruct method). -**Honest headline (two independent runs):** the non-extractive causal challenger + the refine fold -**reliably widen the strong/weak gap by ~+0.20 vs plain generation** (reproduced in both runs — the -method's Table-1 *direction* holds). BUT **clearing the hard accept bar** (weak < 0.5 ∧ strong ≥ 0.65 -∧ gap ≥ 0.2) is **noisy and marginal**: one run accepted 1–2 of 3, an **independent re-run accepted -0 of 3**. The reason is in the answers — `llama-3.1-8b` on these MoE questions sometimes flails -(0.24) and sometimes answers *competently* (0.75), straddling the 0.5 "weak must struggle" line. So: -**directionally confirmed, not a robust positive at n=3 / this tier.** This is the same small-n -mirage that bit the earlier two-agent A/B (positive at n=1, washes at power) — flagged, not buried. +**Powered headline (32 independent slots, 2 docs, samples=4):** the loop **reliably manufactures +discriminating examples — accept-rate 38%, Wilson 95% CI [23%, 55%]** (12 of 32 slots cleared the +hard accept bar: weak < 0.5 ∧ strong ≥ 0.65 ∧ gap ≥ 0.2). 
The CI lower bound (23%) excludes ~0, so +this is a **real, repeatable rate, not the n=1–2 luck** that made it look like a coin-flip at n=3. +Acceptance is **doc-dependent** (mixtral 19%, deepseek-v3 56%) and gated by **whether the weak model +struggles** (it does on only 39% of attempts), but it is decisively above zero on both docs. This +**replaces** the earlier n=3 result, which was too noisy to tell "real rate" from "coin-flip ~0". ## The two levers that turned the null into a positive @@ -29,22 +28,30 @@ both fixed here: 2. **The grounding doc was memorized.** The default was "Attention Is All You Need" — the most canonical paper in ML, which an 8B has memorized, so even reasoning questions are answerable from - pretraining and capability cannot separate. Fix — **ground on a doc the weak solver has not - memorized**: the new default is the Mixtral-of-Experts paper (arXiv 2401.04088, Jan 2024), which - post-dates `llama-3.1-8b`'s knowledge cutoff, forcing it to reason from the context. + pretraining and capability cannot separate. Fix — **ground on docs the weak solver has not + memorized**: the Mixtral-of-Experts paper (arXiv 2401.04088, Jan 2024) and the DeepSeek-V3 paper + (arXiv 2412.19437, Dec 2024), both post-dating `llama-3.1-8b`'s knowledge cutoff, forcing it to + reason from the context. 
## Setup (all env-overridable) | role | model | why | |---|---|---| -| weak solver | `groq/llama-3.1-8b-instant` | small; cutoff predates the 2024 doc → must reason, can't recall | +| weak solver | `groq/llama-3.1-8b-instant` | small; cutoff predates the 2024 docs → must reason, can't recall | | strong solver | `gemini-2.5-pro` | frontier reasoner; a real wide capability gap | | challenger + judge | `deepseek-v4-flash` | capable, fast, reliable, a DIFFERENT family from both solvers (no judge-bias) | -| grounding doc | Mixtral-of-Experts (2401.04088) | non-memorized, reasoning-rich (MoE routing / gating) | +| grounding doc A | Mixtral-of-Experts (2401.04088) | non-memorized; MoE expert routing / gating (`focus=expert`) | +| grounding doc B | DeepSeek-V3 (2412.19437) | non-memorized; auxiliary-loss-free load balancing / expert specialization (`focus=auxiliary`) | -Accept thresholds (the paper's): strong >= 0.65, weak < 0.50, gap >= 0.20. (`glm-5.2`, the brief's -challenger/judge, was returning upstream-capacity 503s during this run; `deepseek-v4-flash` is the -live, neutral substitute. `routerChat` now retries transient 503/429/timeout with bounded backoff.) +Accept thresholds (the paper's): strong ≥ 0.65, weak < 0.50, gap ≥ 0.20. (`glm-5.2`, the brief's +challenger/judge, was returning upstream-capacity 503s; `deepseek-v4-flash` is the live, neutral +substitute. `routerChat` retries transient 503/429/timeout with bounded backoff.) + +The grounding chunk must be **prose, not equations**: an equation-dense chunk (e.g. DeepSeek-V3's MLA +section) breaks the challenger's strict-JSON output (LaTeX backslashes), so both `focus` terms select +the prose description of an MoE-expert mechanism. Even so, 5 of 32 slots (~16%) still hit a +LaTeX-in-JSON failure and produced no example — those count as rejects in the headline (the +conservative floor); see below. 
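As a rough cross-check of the headline arithmetic, the Wilson 95% interval can be recomputed by hand. The block below is a minimal sketch, not agent-eval's `wilson` implementation: the helper name `wilsonInterval`, the `Interval` shape, and the fixed `z = 1.96` are assumptions made here for illustration. It compares the conservative denominator (all 32 slots, with the 5 challenger-stage failures counted as rejects) against the producing-slots denominator (27).

```typescript
// Hypothetical helper for illustration only; this is NOT agent-eval's `wilson` estimator.
interface Interval {
  estimate: number
  lower: number
  upper: number
}

// Wilson score interval for a binomial proportion (z = 1.96 is roughly 95% two-sided).
function wilsonInterval(successes: number, trials: number, z = 1.96): Interval {
  if (!Number.isInteger(trials) || trials <= 0) {
    throw new Error('trials must be a positive integer')
  }
  if (successes < 0 || successes > trials) {
    throw new Error('successes must lie in [0, trials]')
  }
  const p = successes / trials
  const z2 = z * z
  const denom = 1 + z2 / trials
  // The interval is centred on a shrunken estimate, not on p itself.
  const center = (p + z2 / (2 * trials)) / denom
  const half = (z * Math.sqrt((p * (1 - p)) / trials + z2 / (4 * trials * trials))) / denom
  return {
    estimate: p,
    lower: Math.max(0, center - half),
    upper: Math.min(1, center + half),
  }
}

const pct = (x: number): string => `${Math.round(x * 100)}%`

// Conservative floor: challenger-stage failures stay in the denominator.
const headline = wilsonInterval(12, 32)
// Producing slots only: the 5 failed slots are dropped from the denominator.
const producing = wilsonInterval(12, 27)

console.log(`headline  ${pct(headline.estimate)} CI [${pct(headline.lower)}, ${pct(headline.upper)}]`)
console.log(`producing ${pct(producing.estimate)} CI [${pct(producing.lower)}, ${pct(producing.upper)}]`)
```

Keeping the failed slots in the denominator can only lower the point estimate, which is why the 32-slot figure is the conservative one to report.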
## The judge is reliable (checked before trusting any gap) @@ -54,70 +61,94 @@ each: `deepseek-v4-flash` returned strong `[1.00, 1.00, 1.00]` (mean 1.00) vs we measured gap reflects answer quality, not judge noise. (`gemini-2.5-flash` as judge threw parse errors — `deepseek` is the better grader here.) -## The result — the gap opens, examples are accepted - -**Memorized doc (Transformer paper), recall challenger — reproduces the null:** mean gap **0.117**, -**0 accepted**; the weak solver scored 0.68–0.78 (it has the content memorized — reading beats -reasoning). - -**Non-memorized doc (Mixtral), non-extractive causal challenger — three runs, NOT consistent:** - -| run | accepted | gap widening (plain → refined) | note | -|---|---|---|---| -| target=3, samples=2, maxRetries=3 | **1 / 3** | 0.306 → 0.508 (Δ +0.202) | fold steered a too-easy draft (weak 0.78) to an accepted one (weak 0.24) | -| target=1, samples=3, maxRetries=4 | **1 / 1** | — | first causal draft already separated | -| **target=3 — independent re-run** | **0 / 3** | 0.052 → 0.246 (Δ +0.194) | gap widened the same, but **no slot cleared the bar**; weak scored **0.75** on a near-miss — a competent, correct answer, not a struggle | - -**What reproduces:** the +0.19–0.20 gap-widening from the fold (both runs). **What does not:** the -accepted count (0 to 2 of 3). The accept bar requires the weak model to *struggle* (< 0.5), and on -these MoE-reasoning questions `llama-3.1-8b` is too often competent (0.75) to fall below it — so -acceptance is close to a coin-flip at n=3. Total live spend ≈ **$0.25** across all runs. 
+## The powered result — a real ~38% accept-rate -## An autopsied accepted example (real discrimination, both answers read) +**Design (fixed-slots, not until-N-accepted):** run a fixed K = 32 independent slots (each slot = one +full challenger → refine → accept cycle), split 16 / 16 across the two docs, samples = 4 per solver +(stabilise the weak mean), maxRetries = 2 (3 challenger attempts per slot). Record each slot's +outcome (accept / reject) + best gap, so the rate is bounded-cost and unbiased. Runnable: +`src/autodata/powered.ts`; per-attempt autopsy JSONL per doc; the CIs are agent-eval's published +estimators (`wilson` for the binomial accept-rate, `pairedBootstrap` for the paired widening). -> **Q:** Walk through how the MoE layer processes a single token. If the router's gating network were -> broken and always output uniform weights (G(x)_i = 1/8 for all 8 experts), how would the layer's -> output differ from the intended behavior, and why is this failure mode problematic? - -- **strong (`gemini-2.5-pro`): [1.00, 1.00, 1.00]** — walks through top-2 routing, then derives that - uniform weights make the layer average ALL 8 experts (dense, no specialization/sparsity), losing - the point of the MoE. Correct. -- **weak (`llama-3.1-8b`): [0.21, 0.27], mean 0.24** — restates the routing steps but does NOT derive - the failure consequence; it never reaches "all experts averaged → specialization lost." - -When the gap *does* open, it is real discrimination — not a judge artifact (judge verified above) or -leakage (the answer is not in the context). **But it does not open reliably.** In the independent -re-run, the analogous near-miss question drew a *competent* weak answer (0.75): `llama-3.1-8b` -correctly explained that high positional locality routes consecutive tokens to the same expert → -over-subscription, and that uniform routing would balance the load. On that draw the 8B reasoned -fine, so weak ≮ 0.5 and nothing was accepted. 
The weak model's competence on these questions is the -variance that makes acceptance a coin-flip. +| metric | value | read | +|---|---|---| +| **accept-rate (headline)** | **38% CI [23%, 55%]** (12 / 32) | excludes ~0 → **reliable, not a coin-flip** | +| accept-rate (producing slots) | 44% CI [28%, 63%] (12 / 27) | excludes the 5 challenger-stage (LaTeX) failures | +| — mixtral | 19% CI [7%, 43%] (3 / 16) | the harder doc; still excludes 0 | +| — deepseek-v3 | 56% CI [33%, 77%] (9 / 16) | the easier-to-discriminate doc | +| best gap / slot (n=27) | min −0.23 · median **0.42** · p90 0.80 · max 0.95 | how far each slot separated the tiers | +| plain (first-draft) gap / slot | min −0.23 · median 0.19 · p90 0.61 · max 0.95 | the un-refined baseline | +| **gap-widening Δ (plain → best-refined)** | mean **+0.103** CI [+0.029, +0.193] (paired bootstrap, n=27) | the fold's lift; **excludes 0** (median Δ 0 — it helps a minority) | +| weak score / attempt (n=33) | min 0.05 · median **0.55** · max 1.00 | the variance source — competent ~half the time | +| strong score / attempt (n=33) | min 0.21 · median **0.99** · max 1.00 | the strong solver almost always derives | + +**Accept-rule decomposition (33 quality-clean attempts):** strong ≥ 0.65 = **88%**, weak < 0.50 = +**39%** ← the binding gate, gap ≥ 0.20 = 52%, all-three (= accept) = 36%. The strong solver derives +almost everything; the bottleneck is the weak model failing — which happens on only ~39% of +attempts, so the per-slot accept-rate is set by **how often `llama-3.1-8b` actually struggles**, not +by the challenger or judge. **Total live spend: $0.57** for the 32-slot run (~$1.0 including pilots). + +## Two autopsied accepted examples (real discrimination, both answers read) + +**deepseek-v3 — gap 0.93 (weak 0.07, strong 1.00):** +> **Q:** Why does using a *sequence-wise* auxiliary loss lead to a higher validation loss than a +> *batch-wise* auxiliary loss or the auxiliary-loss-free method in MoE models? 
+ +- **strong (`gemini-2.5-pro`): 1.00** — derives that the sequence-wise loss imposes a *stricter, + less flexible* per-sequence balance constraint that *hinders the emergence of expert + specialisation*. Correct, matches the reference. +- **weak (`llama-3.1-8b`): [0.10, 0.03, 0.10, 0.03]** — *restates the question* and never derives the + reason. A recall-shaped non-answer; the judge's `reasoning` criterion floors it. + +**mixtral — gap 0.95 (weak 0.05, strong 1.00):** +> **Q:** The text says each input is routed to 2 of 8 experts, yet the output sums `G(x)_i · E_i(x)` +> over all `n` experts. Are these consistent? If not, which should be revised? + +- **strong: 1.00** — derives YES, consistent: the gating vector `G(x)` is *sparse* (nonzero only for + the 2 selected experts), so the full-`n` sum effectively includes only those 2. Correct. +- **weak: [0.03, 0.07, 0.03, 0.07]** — concludes the statements are *inconsistent*; it never grasps + the sparse-gating equivalence. A genuine reasoning error, not a judge artifact or leakage (the + answer is derived, not in the context). + +These are real weak-fails-strong-derives examples on both docs — the loop is manufacturing genuine +discrimination, not gaming the gap. ## The finding -The two levers are **directionally confirmed and necessary**: a non-extractive causal challenger -(no leakage) AND a grounding doc the weak solver hasn't memorized — drop either and it nulls hard -(recall challenger leaks; the memorized Transformer paper lets the 8B recall). With both, the fold -**reliably widens the strong/weak gap by ~+0.20** (reproduced in both runs). - -But "the discriminative reward works" is **NOT** established. Clearing the accept bar (weak must -*struggle*, < 0.5) is noisy: 0–2 accepted of 3 across runs, because `llama-3.1-8b` answers these -MoE-reasoning questions competently (0.75) about as often as it flails (0.24). At n=3 that is a -coin-flip, not a result. 
Honest verdict: **promising, directionally right, under-powered** — the -exact small-n shape that has repeatedly looked positive here and washed out at power. +The question "does the causal-challenger loop reliably manufacture discriminating examples, or is +acceptance a coin-flip ~0?" is now **settled at power: it reliably works.** Accept-rate **38%, CI +[23%, 55%]** over 32 slots — the lower bound excludes ~0, and even the harder of the two docs +(mixtral, 19% [7%, 43%]) excludes 0. The fold also **reliably widens the gap** (mean +0.103, CI +[+0.029, +0.193]), reproducing the n=3 direction at power, though most of the discrimination comes +from the first causal draft already separating (median widening 0 — the refine helps a minority of +slots). + +Two honest caveats, both quantified, neither overturns the verdict: + +1. **Doc-dependence.** The rate ranges 19% (mixtral) → 56% (deepseek-v3). The pooled 38% is a real + average across two non-memorized MoE papers, not a single lucky doc — but expect the rate to move + with the source material's difficulty for the 8B. +2. **The binding constraint is the weak model's competence, not the method.** `llama-3.1-8b` answers + these MoE-reasoning questions competently (weak median 0.55) about as often as it flails, so + ~39% of attempts clear the "weak must struggle" gate. A weaker weak model (or harder docs) would + raise the rate; a stronger one would lower it. The loop's discriminative reward works as designed — + the rate is a property of the **tier gap**, which is exactly what it should measure. ## Status -Mechanism + observability: solid (gap-widening reproduced, judge reliability checked, every attempt -dumped to a JSONL autopsy trail via `AUTODATA_ATTEMPTS` — which is how the over-claim was caught). -Empirical positive: **not yet** — acceptance is too noisy at n=3. 
To actually settle it: raise -`samples` (stabilize the weak mean per question), raise the slot count to n≥24, and report the -*accepted-rate* with a confidence interval — not a single lucky run. Until then this is a confirmed -direction, not a confirmed win. +Mechanism + observability + **power**: solid. The accept-rate is measured at n=32 with a Wilson CI +that excludes ~0, the gap-widening with a paired-bootstrap CI that excludes 0, every attempt dumped +to a JSONL autopsy trail, and the two headline accepted examples read end-to-end (real +discrimination). The n=3 "coin-flip ~0?" worry is **resolved: ~38% accept-rate, not zero.** ## Reproduce ``` -dotenvx run -f .env -- pnpm tsx src/autodata/run.ts # causal, default Mixtral doc -dotenvx run -f .env -- pnpm tsx src/autodata/calibrate.ts # recall-vs-causal A/B, same doc +# Powered accept-rate + CIs (32 slots, 2 docs, samples=4) — the headline result: +dotenvx run -f .env -- pnpm tsx src/autodata/powered.ts +# knobs: AUTODATA_SLOTS_PER_DOC=16 AUTODATA_SAMPLES=4 AUTODATA_MAXRETRIES=2 + +# Single-doc builder + recall-vs-causal calibration (the lever's A/B): +dotenvx run -f .env -- pnpm tsx src/autodata/run.ts +dotenvx run -f .env -- pnpm tsx src/autodata/calibrate.ts ``` diff --git a/src/autodata/index.ts b/src/autodata/index.ts index c1372a7..deb45ca 100644 --- a/src/autodata/index.ts +++ b/src/autodata/index.ts @@ -36,6 +36,7 @@ export { type GroundedDoc, groundDoc, } from './grounding' +export { analyzeTrails, type DocTrail, type PoweredStats } from './powered' export { type AutodataRoles, buildAutodataRoles, diff --git a/src/autodata/powered.test.ts b/src/autodata/powered.test.ts new file mode 100644 index 0000000..861e127 --- /dev/null +++ b/src/autodata/powered.test.ts @@ -0,0 +1,130 @@ +import { describe, expect, it } from 'vitest' +import type { AttemptRecord, SolverEval } from './data-creation-loop' +import { analyzeTrails, type DocTrail } from './powered' + +// A minimal attempt-row factory: only the fields 
`analyzeTrails` reads matter. +function attempt(args: { + slot: number + iteration: number + weak: number + strong: number + accept: boolean + qualityOk?: boolean +}): AttemptRecord { + const weak: SolverEval = { mean: args.weak, samples: [] } + const strong: SolverEval = { mean: args.strong, samples: [] } + const gap = args.strong - args.weak + return { + slotIndex: args.slot, + iteration: args.iteration, + example: { context: 'c', question: 'q', reference: 'r', rubric: ['a', 'b'] }, + weak, + strong, + gap, + decision: { accept: args.accept, reason: args.accept ? 'discriminates' : 'rejected' }, + qualityOk: args.qualityOk ?? true, + } +} + +describe('analyzeTrails (powered aggregation)', () => { + it('counts accept-rate per slot over the requested target, not the trailed slots', () => { + // 3 slots requested, only 2 left a trail; slot 0 accepted, slot 1 rejected, slot 2 errored (no rows). + const trail: DocTrail = { + tag: 'd', + url: 'u', + target: 3, + rows: [ + attempt({ slot: 0, iteration: 0, weak: 0.7, strong: 0.8, accept: false }), + attempt({ slot: 0, iteration: 1, weak: 0.3, strong: 0.9, accept: true }), + attempt({ slot: 1, iteration: 0, weak: 0.6, strong: 0.7, accept: false }), + ], + } + const s = analyzeTrails([trail], { bootstrapSeed: 1 }) + expect(s.totalSlots).toBe(3) // denominator = requested target, not 2 trailed slots + expect(s.acceptedSlots).toBe(1) + expect(s.acceptRate.estimate).toBeCloseTo(1 / 3, 6) + // Wilson lower bound is strictly above 0 and the point estimate sits inside the interval. + expect(s.acceptRate.lower).toBeGreaterThan(0) + expect(s.acceptRate.lower).toBeLessThan(s.acceptRate.estimate) + expect(s.acceptRate.upper).toBeGreaterThan(s.acceptRate.estimate) + // Slot 2 left no trail → counted as a challenger-stage failure, still in the denominator. + expect(s.slotsWithAttempts).toBe(2) + expect(s.challengerFailedSlots).toBe(1) + // Among only producing slots the denominator is 2, so the rate is higher. 
+ expect(s.acceptRateAmongProducing.estimate).toBeCloseTo(0.5, 6) + }) + + it('takes the slot best-gap from quality-clean attempts and pairs plain→refined', () => { + const trail: DocTrail = { + tag: 'd', + url: 'u', + target: 1, + rows: [ + attempt({ slot: 0, iteration: 0, weak: 0.7, strong: 0.8, accept: false }), // plain gap 0.10 + attempt({ slot: 0, iteration: 1, weak: 0.3, strong: 0.9, accept: true }), // refined gap 0.60 + ], + } + const s = analyzeTrails([trail], { bootstrapSeed: 1 }) + expect(s.bestGapPerSlot.max).toBeCloseTo(0.6, 6) + expect(s.plainGap.median).toBeCloseTo(0.1, 6) + // The fold widened the gap from 0.10 to 0.60 → a positive paired delta. + expect(s.widening.meanDelta).toBeCloseTo(0.5, 6) + }) + + it('a leaky (quality-failed) draft contributes a 0 best-gap but is excluded from the condition decomposition', () => { + const trail: DocTrail = { + tag: 'd', + url: 'u', + target: 1, + rows: [ + attempt({ slot: 0, iteration: 0, weak: 0, strong: 0, accept: false, qualityOk: false }), + ], + } + const s = analyzeTrails([trail], { bootstrapSeed: 1 }) + expect(s.bestGapPerSlot.max).toBe(0) + expect(s.conditions.nAttempts).toBe(0) // quality-failed attempt excluded + expect(s.acceptedSlots).toBe(0) + }) + + it('decomposes the accept rule: the binding gate is weak < 0.5', () => { + // All attempts have strong high + gap wide, but weak only struggles half the time. 
+ const trail: DocTrail = { + tag: 'd', + url: 'u', + target: 4, + rows: [ + attempt({ slot: 0, iteration: 0, weak: 0.2, strong: 0.9, accept: true }), + attempt({ slot: 1, iteration: 0, weak: 0.7, strong: 0.95, accept: false }), // weak too competent + attempt({ slot: 2, iteration: 0, weak: 0.1, strong: 0.85, accept: true }), + attempt({ slot: 3, iteration: 0, weak: 0.75, strong: 0.98, accept: false }), // weak too competent + ], + } + const s = analyzeTrails([trail], { bootstrapSeed: 1 }) + expect(s.conditions.strongHi).toBeCloseTo(1, 6) // strong always high + expect(s.conditions.gapWide).toBeCloseTo(1, 6) // gap always wide + expect(s.conditions.weakLo).toBeCloseTo(0.5, 6) // weak struggles only half → the binding gate + expect(s.conditions.all).toBeCloseTo(0.5, 6) + expect(s.acceptRate.estimate).toBeCloseTo(0.5, 6) + }) + + it('aggregates across multiple docs (denominators add)', () => { + const a: DocTrail = { + tag: 'a', + url: 'u1', + target: 2, + rows: [attempt({ slot: 0, iteration: 0, weak: 0.2, strong: 0.9, accept: true })], + } + const b: DocTrail = { + tag: 'b', + url: 'u2', + target: 2, + rows: [attempt({ slot: 0, iteration: 0, weak: 0.6, strong: 0.7, accept: false })], + } + const s = analyzeTrails([a, b], { bootstrapSeed: 1 }) + expect(s.totalSlots).toBe(4) + expect(s.acceptedSlots).toBe(1) + expect(s.perDoc).toHaveLength(2) + expect(s.perDoc[0]?.accepted).toBe(1) + expect(s.perDoc[1]?.accepted).toBe(0) + }) +}) diff --git a/src/autodata/powered.ts b/src/autodata/powered.ts new file mode 100644 index 0000000..3140e8e --- /dev/null +++ b/src/autodata/powered.ts @@ -0,0 +1,450 @@ +/** + * Autodata — POWERED accept-rate measurement. + * + * Settles the question n=3 was too noisy to answer: does the causal-challenger loop RELIABLY + * manufacture discriminating examples, or is acceptance a coin-flip? 
It runs a FIXED number of + * independent slots (each slot = one full challenger→refine→accept cycle) split across >= 2 + * non-memorized grounding docs, and reports the ACCEPTED-RATE with a Wilson 95% CI, the per-slot + * best-gap distribution, and the plain-vs-refined gap-widening with a paired-bootstrap CI. + * + * This is a FIXED-SLOTS harness, NOT until-N-accepted: it runs K slots and records each slot's + * outcome (accept/reject) + best gap, so the rate is bounded-cost and unbiased. It reuses + * `buildAutodataDataset` — which already runs exactly `target` independent slots — so nothing in the + * loop is rebuilt; only the cross-slot aggregation + the two confidence intervals are added here. + * The CIs are agent-eval's published estimators (`wilson` for the binomial accept-rate, `pairedBootstrap` + * for the paired plain-vs-refined widening), never hand-rolled. + * + * The source of truth is the per-attempt autopsy JSONL each doc writes incrementally: the final + * statistics are recomputed by re-reading those trails from disk (`analyzeTrails`), so an interrupted + * run loses no data — re-run the analysis over the JSONL. + * + * Run (key never printed): + * dotenvx run -f /home/drew/company/devops/secrets/agent-state.env -- \ + * pnpm tsx src/autodata/powered.ts + * + * Env knobs: AUTODATA_SLOTS_PER_DOC (default 14), AUTODATA_SAMPLES (default 4), + * AUTODATA_MAXRETRIES (default 3), AUTODATA_DOCS ("url|focus|tag,url|focus|tag" override), + * AUTODATA_{WEAK,STRONG,CHALLENGER,JUDGE}_MODEL, TANGLE_API_KEY (or TANGLE_ROUTER_KEY). 
+ */ + +import { readFile } from 'node:fs/promises' +import { pairedBootstrap, wilson } from '@tangle-network/agent-eval' +import { buildAutodataDataset } from './build-dataset' +import type { AttemptRecord } from './data-creation-loop' +import { groundDoc } from './grounding' +import { + CHALLENGER_MODEL, + JUDGE_MODEL, + STRONG_SOLVER_MODEL, + smokeTestModels, + WEAK_SOLVER_MODEL, +} from './router-roles' + +/** One grounding document to split slots across. */ +interface DocSpec { + url: string + focus: string + tag: string +} + +/** + * Two non-memorized, reasoning-rich MoE papers (both post-date `llama-3.1-8b`'s knowledge cutoff, so + * the weak solver must REASON from the context, not recall) — the precondition for any gap to open. + * Each `focus` selects a PROSE mechanism chunk in the same "MoE-expert reasoning" band: a chunk dense + * with LaTeX equations (e.g. DeepSeek-V3's MLA section) breaks the challenger's strict-JSON output, so + * both focuses target the prose description of an expert-routing mechanism, not the equations. + * • Mixtral-of-Experts (2401.04088, Jan 2024) — sparse MoE expert routing / gating. + * • DeepSeek-V3 (2412.19437, Dec 2024) — auxiliary-loss-free load balancing / expert specialization. 
+ */ +const defaultDocs: DocSpec[] = [ + { url: 'https://ar5iv.labs.arxiv.org/html/2401.04088', focus: 'expert', tag: 'mixtral' }, + { url: 'https://ar5iv.labs.arxiv.org/html/2412.19437', focus: 'auxiliary', tag: 'deepseek-v3' }, +] + +function envInt(name: string, fallback: number): number { + const raw = process.env[name] + if (!raw) return fallback + const n = Number.parseInt(raw, 10) + if (!Number.isFinite(n) || n <= 0) throw new Error(`${name}='${raw}' is not a positive integer`) + return n +} + +function parseDocsEnv(): DocSpec[] { + const raw = process.env.AUTODATA_DOCS + if (!raw) return defaultDocs + return raw.split(',').map((entry) => { + const [url, focus, tag] = entry.split('|').map((s) => s.trim()) + if (!url || !focus || !tag) + throw new Error(`AUTODATA_DOCS entry '${entry}' is not url|focus|tag`) + return { url, focus, tag } + }) +} + +// ── Descriptive distribution helpers (NOT inferential — those reuse agent-eval) ──────────────── + +/** Linear-interpolated quantile of a sorted-or-unsorted numeric sample. */ +function quantile(xs: number[], q: number): number { + if (xs.length === 0) return Number.NaN + const s = [...xs].sort((a, b) => a - b) + if (s.length === 1) return s[0] as number + const pos = (s.length - 1) * q + const lo = Math.floor(pos) + const hi = Math.ceil(pos) + const frac = pos - lo + return (s[lo] as number) * (1 - frac) + (s[hi] as number) * frac +} + +function mean(xs: number[]): number { + return xs.length === 0 ? Number.NaN : xs.reduce((a, b) => a + b, 0) / xs.length +} + +interface Distribution { + n: number + min: number + median: number + p90: number + max: number + mean: number +} + +function describe(xs: number[]): Distribution { + return { + n: xs.length, + min: xs.length ? Math.min(...xs) : Number.NaN, + median: quantile(xs, 0.5), + p90: quantile(xs, 0.9), + max: xs.length ? 
Math.max(...xs) : Number.NaN, + mean: mean(xs), + } +} + +// ── The aggregation: per-attempt JSONL trail → per-slot outcomes → CIs ───────────────────────── + +/** One JSONL row from a per-attempt trail: an `AttemptRecord` plus the challenger style tag. */ +type TrailRow = AttemptRecord & { style?: string } + +/** One doc's trail + the number of slots it was asked to run (the accept-rate denominator). */ +export interface DocTrail { + tag: string + url: string + /** Slots requested for this doc — the denominator (a slot whose drafts all errored counts as a + * reject, so the denominator is the requested count, not the number of slots that left a trail). */ + target: number + rows: TrailRow[] +} + +export interface PoweredStats { + totalSlots: number + acceptedSlots: number + /** Slots that produced >= 1 attempt (the challenger authored a parseable example at least once). */ + slotsWithAttempts: number + /** Slots where the challenger threw on every refine (0 attempts) — an infra/parse failure, NOT a + * discrimination reject. Surfaced so it can be flagged as a threat to validity. */ + challengerFailedSlots: number + /** Wilson 95% CI on the accept-rate (binomial: a slot accepts or it does not). */ + acceptRate: { estimate: number; lower: number; upper: number } + /** Wilson 95% CI on the accept-rate among only the slots that produced >= 1 attempt (excludes the + * challenger-stage failures) — the rate if every slot had at least authored an example. */ + acceptRateAmongProducing: { estimate: number; lower: number; upper: number } + /** Per-slot BEST gap (max gap over the slot's quality-clean attempts) — the discriminating power + * each slot reached, accepted or not. */ + bestGapPerSlot: Distribution + /** Plain (first-draft) gap per slot. */ + plainGap: Distribution + /** Paired plain→best-refined gap-widening with a bootstrap CI on the mean delta. 
*/ + widening: { n: number; meanDelta: number; medianDelta: number; lower: number; upper: number } + /** Weak solver mean-score distribution over quality-clean attempts (the coin-flip's source). */ + weakScore: Distribution + /** Strong solver mean-score distribution over quality-clean attempts. */ + strongScore: Distribution + /** Per-attempt pass fractions for each sub-condition of the accept rule (decomposes the rate). */ + conditions: { + nAttempts: number + strongHi: number // fraction with strong >= 0.65 + weakLo: number // fraction with weak < 0.50 (the "weak must struggle" gate) + gapWide: number // fraction with gap >= 0.20 + all: number // fraction passing all three (== accept) + } + perDoc: { tag: string; target: number; accepted: number; meanBestGap: number }[] +} + +const minStrong = 0.65 +const maxWeak = 0.5 +const minGap = 0.2 + +/** + * Compute the powered statistics from the per-doc attempt trails. Pure + deterministic (seeded + * bootstrap), so it can be re-run standalone over the on-disk JSONL after an interrupted run. + */ +export function analyzeTrails(trails: DocTrail[], opts?: { bootstrapSeed?: number }): PoweredStats { + const seed = opts?.bootstrapSeed ?? 0xc0ffee + + let totalSlots = 0 + let acceptedSlots = 0 + let slotsWithAttempts = 0 + const bestGapPerSlotAll: number[] = [] + const plainGapsPaired: number[] = [] + const refinedGapsPaired: number[] = [] + const weakScores: number[] = [] + const strongScores: number[] = [] + let nAttempts = 0 + let strongHi = 0 + let weakLo = 0 + let gapWide = 0 + let acceptAll = 0 + const perDoc: PoweredStats['perDoc'] = [] + + for (const trail of trails) { + totalSlots += trail.target + + // Group this doc's attempts by slot index. + const bySlot = new Map<number, TrailRow[]>() + for (const row of trail.rows) { + const arr = bySlot.get(row.slotIndex) ?? 
[] + arr.push(row) + bySlot.set(row.slotIndex, arr) + } + + slotsWithAttempts += bySlot.size + let docAccepted = 0 + const docBestGaps: number[] = [] + for (const [, rows] of bySlot) { + // A slot is ACCEPTED iff any of its attempts cleared the accept rule. + const accepted = rows.some((r) => r.decision.accept) + if (accepted) { + acceptedSlots += 1 + docAccepted += 1 + } + // Best gap the slot reached over its quality-clean attempts (a leaky/thin draft has gap 0). + const cleanGaps = rows.filter((r) => r.qualityOk).map((r) => r.gap) + const bestGap = cleanGaps.length ? Math.max(...cleanGaps) : 0 + bestGapPerSlotAll.push(bestGap) + docBestGaps.push(bestGap) + + // Paired plain→refined: plain = the earliest iteration's gap, refined = the slot's best gap. + const earliest = [...rows].sort((a, b) => a.iteration - b.iteration)[0] + if (earliest) { + plainGapsPaired.push(earliest.gap) + refinedGapsPaired.push(bestGap) + } + + // Per-attempt condition decomposition (quality-clean attempts only — a leaky draft never + // reached the solvers, so its 0/0 scores would falsely depress every sub-condition). 
+ for (const r of rows) { + if (!r.qualityOk) continue + nAttempts += 1 + weakScores.push(r.weak.mean) + strongScores.push(r.strong.mean) + if (r.strong.mean >= minStrong) strongHi += 1 + if (r.weak.mean < maxWeak) weakLo += 1 + if (r.gap >= minGap) gapWide += 1 + if (r.strong.mean >= minStrong && r.weak.mean < maxWeak && r.gap >= minGap) acceptAll += 1 + } + } + perDoc.push({ + tag: trail.tag, + target: trail.target, + accepted: docAccepted, + meanBestGap: mean(docBestGaps), + }) + } + + const accept = wilson(acceptedSlots, totalSlots, 0.95) + const acceptProducing = wilson(acceptedSlots, slotsWithAttempts, 0.95) + const boot = pairedBootstrap(plainGapsPaired, refinedGapsPaired, { + confidence: 0.95, + statistic: 'mean', + seed, + resamples: 5000, + }) + + return { + totalSlots, + acceptedSlots, + slotsWithAttempts, + challengerFailedSlots: totalSlots - slotsWithAttempts, + acceptRate: { estimate: accept.estimate, lower: accept.lower, upper: accept.upper }, + acceptRateAmongProducing: { + estimate: acceptProducing.estimate, + lower: acceptProducing.lower, + upper: acceptProducing.upper, + }, + bestGapPerSlot: describe(bestGapPerSlotAll), + plainGap: describe(plainGapsPaired), + widening: { + n: boot.n, + meanDelta: boot.mean, + medianDelta: boot.median, + lower: boot.low, + upper: boot.high, + }, + weakScore: describe(weakScores), + strongScore: describe(strongScores), + conditions: { + nAttempts, + strongHi: nAttempts ? strongHi / nAttempts : Number.NaN, + weakLo: nAttempts ? weakLo / nAttempts : Number.NaN, + gapWide: nAttempts ? gapWide / nAttempts : Number.NaN, + all: nAttempts ? acceptAll / nAttempts : Number.NaN, + }, + perDoc, + } +} + +async function readTrail(path: string): Promise<TrailRow[]> { + // A slot whose every challenger draft errored produces no attempt → no trail file is written. 
+  // That is a legitimate all-reject outcome (the loop failed to manufacture an example), not a
+  // crash: return an empty trail so those slots still count against the accept-rate denominator.
+  let text: string
+  try {
+    text = await readFile(path, 'utf8')
+  } catch (err) {
+    if ((err as NodeJS.ErrnoException).code === 'ENOENT') return []
+    throw err
+  }
+  return text
+    .split('\n')
+    .filter((l) => l.trim().length > 0)
+    .map((l) => JSON.parse(l) as TrailRow)
+}
+
+function pctCI(x: { estimate: number; lower: number; upper: number }): string {
+  return `${(x.estimate * 100).toFixed(0)}% CI[${(x.lower * 100).toFixed(0)}%, ${(x.upper * 100).toFixed(0)}%]`
+}
+
+function dist(d: Distribution): string {
+  return `n=${d.n} min=${d.min.toFixed(2)} median=${d.median.toFixed(2)} p90=${d.p90.toFixed(2)} max=${d.max.toFixed(2)} (mean=${d.mean.toFixed(2)})`
+}
+
+function printReport(stats: PoweredStats, spendUsd: number): void {
+  console.log('\n══════════════════════════════════════════════════════════════════════════')
+  console.log(' POWERED ACCEPT-RATE — does the causal-challenger loop reliably discriminate?')
+  console.log('══════════════════════════════════════════════════════════════════════════\n')
+  console.log(` slots run : ${stats.totalSlots} (${stats.acceptedSlots} accepted)`)
+  console.log(` ACCEPTED-RATE : ${pctCI(stats.acceptRate)} ← the headline (Wilson 95%)`)
+  if (stats.challengerFailedSlots > 0) {
+    console.log(
+      ` challenger-failed : ${stats.challengerFailedSlots} slot(s) produced 0 attempts (infra/parse, not a discrimination reject)`,
+    )
+    console.log(
+      ` accept-rate (producing): ${pctCI(stats.acceptRateAmongProducing)} (excl. challenger-failed slots)`,
+    )
+  }
+  console.log('')
+  console.log(` best gap / slot : ${dist(stats.bestGapPerSlot)}`)
+  console.log(` plain gap / slot : ${dist(stats.plainGap)}`)
+  console.log(
+    ` gap-widening Δ : mean ${stats.widening.meanDelta >= 0 ?
'+' : ''}${stats.widening.meanDelta.toFixed(
+      3,
+    )} median ${stats.widening.medianDelta >= 0 ? '+' : ''}${stats.widening.medianDelta.toFixed(3)} ` +
+      `CI[${stats.widening.lower.toFixed(3)}, ${stats.widening.upper.toFixed(3)}] (paired bootstrap, n=${stats.widening.n})`,
+  )
+  console.log('')
+  console.log(` weak score / attempt : ${dist(stats.weakScore)}`)
+  console.log(` strong score / attempt: ${dist(stats.strongScore)}`)
+  console.log('')
+  const c = stats.conditions
+  console.log(` accept-rule decomposition over ${c.nAttempts} quality-clean attempts:`)
+  console.log(` strong >= 0.65 : ${(c.strongHi * 100).toFixed(0)}%`)
+  console.log(
+    ` weak < 0.50 : ${(c.weakLo * 100).toFixed(0)}% ← the binding gate (weak must struggle)`,
+  )
+  console.log(` gap >= 0.20 : ${(c.gapWide * 100).toFixed(0)}%`)
+  console.log(` all three (accept): ${(c.all * 100).toFixed(0)}%`)
+  console.log('')
+  console.log(' per-doc:')
+  for (const d of stats.perDoc) {
+    console.log(
+      ` ${d.tag.padEnd(14)} accepted ${d.accepted}/${d.target} mean best-gap ${d.meanBestGap.toFixed(3)}`,
+    )
+  }
+  console.log(`\n total live spend : $${spendUsd.toFixed(4)}`)
+  const lo = stats.acceptRate.lower
+  const verdict =
+    lo > 0.15
+      ? `WORKS AT POWER — accept-rate CI lower bound ${(lo * 100).toFixed(0)}% excludes ~0`
+      : stats.acceptRate.upper < 0.2
+        ? 'COIN-FLIP / DOES NOT RELIABLY WORK — accept-rate CI sits near 0'
+        : 'UNDER-POWERED / MARGINAL — accept-rate CI still includes near-0; not settled'
+  console.log(`\n VERDICT: ${verdict}\n`)
+}
+
+async function main(): Promise<void> {
+  const apiKey = process.env.TANGLE_API_KEY ??
process.env.TANGLE_ROUTER_KEY
+  if (!apiKey) throw new Error('no TANGLE_API_KEY in env — run under dotenvx so the key is set')
+
+  const docs = parseDocsEnv()
+  const slotsPerDoc = envInt('AUTODATA_SLOTS_PER_DOC', 14)
+  const samples = envInt('AUTODATA_SAMPLES', 4)
+  const maxRetries = envInt('AUTODATA_MAXRETRIES', 3)
+
+  console.log('Autodata · POWERED accept-rate measurement')
+  console.log(
+    ` challenger/judge=${CHALLENGER_MODEL}/${JUDGE_MODEL} weak=${WEAK_SOLVER_MODEL} strong=${STRONG_SOLVER_MODEL}`,
+  )
+  console.log(
+    ` ${docs.length} docs × ${slotsPerDoc} slots = ${docs.length * slotsPerDoc} slots · samples=${samples} maxRetries=${maxRetries}\n`,
+  )
+
+  // ── COST GATE: one cheap call per model, all must return non-empty content before the burn ──
+  const smoke = await smokeTestModels({
+    apiKey,
+    models: [CHALLENGER_MODEL, WEAK_SOLVER_MODEL, STRONG_SOLVER_MODEL],
+  })
+  for (const s of smoke) {
+    console.log(
+      ` ${s.ok ? 'ok ' : 'DEAD'} ${s.model.padEnd(28)} chars=${String(s.contentChars).padStart(4)} ` +
+        `finish=${s.finishReason ?? '?'} cost=$${s.costUsd.toFixed(5)} (${s.costSource})`,
+    )
+  }
+  const dead = smoke.filter((s) => !s.ok)
+  if (dead.length > 0) {
+    throw new Error(`cost gate failed — empty content from: ${dead.map((d) => d.model).join(', ')}`)
+  }
+
+  // ── Run K slots per doc (each doc writes its own incremental autopsy trail).
──
+  const trails: DocTrail[] = []
+  let spendUsd = 0
+  for (const doc of docs) {
+    const grounded = await groundDoc({ url: doc.url, focus: doc.focus })
+    console.log(
+      `\n[${doc.tag}] grounded ${grounded.url} chunk=${grounded.chunkIndex}/${grounded.totalChunks} (${grounded.doc.length} chars)`,
+    )
+    const attemptsPath = `data/powered-${doc.tag}-attempts.jsonl`
+    const result = await buildAutodataDataset({
+      apiKey,
+      source: grounded,
+      outPath: `data/powered-${doc.tag}.jsonl`,
+      attemptsPath,
+      target: slotsPerDoc,
+      samples,
+      maxRetries,
+    })
+    const docSpend = result.cost.summary().totalCostUsd
+    spendUsd += docSpend
+    console.log(
+      `[${doc.tag}] done · ${result.accepted.length}/${slotsPerDoc} accepted · ${result.attempts.length} attempts · $${docSpend.toFixed(4)}`,
+    )
+    trails.push({
+      tag: doc.tag,
+      url: doc.url,
+      target: slotsPerDoc,
+      rows: await readTrail(attemptsPath),
+    })
+  }
+
+  // ── Recompute everything from the on-disk trails (durable source of truth). ──
+  const stats = analyzeTrails(trails)
+  printReport(stats, spendUsd)
+
+  // Emit the machine-readable result alongside the prose, for the doc + any re-analysis.
+  console.log('RESULT_JSON ' + JSON.stringify({ ...stats, spendUsd }))
+}
+
+// Only auto-run when invoked directly (keeps `analyzeTrails` importable + unit-testable).
+if (process.argv[1] && process.argv[1].endsWith('powered.ts')) {
+  main().catch((err) => {
+    console.error(err)
+    process.exit(1)
+  })
+}