feat(autodata): powered accept-rate with CI — settle whether the loop reliably discriminates#44
… reliably discriminates
Run a fixed K=32 independent slots (16 each over two non-memorized MoE papers, samples=4) through the causal-challenger -> refine -> accept loop and report the accepted-rate with a Wilson 95% CI, the per-slot gap distribution, and the plain-vs-refined gap-widening with a paired-bootstrap CI. This settles the n=3 noise (one run 2/3, a re-run 0/3): acceptance is a real 38% rate, CI [23%, 55%] (12/32), not a coin-flip ~0.
- powered.ts: a fixed-slots harness over the existing buildAutodataDataset (which already runs exactly `target` independent slots); only the cross-slot aggregation + the two CIs are added. CIs are agent-eval's published `wilson` / `pairedBootstrap`, never hand-rolled. Stats are recomputed from the on-disk per-attempt JSONL so an interrupted run loses no data. Surfaces challenger-stage (LaTeX-in-JSON) failures separately so they are not silently miscounted as discrimination rejects.
- powered.test.ts: offline unit coverage for analyzeTrails (denominator = requested target, per-slot best-gap pairing, the accept-rule decomposition, multi-doc aggregation).
- docs/results/autodata-live.md: replace the noisy n=3 section with the powered rate + CI, the gap distribution, and two autopsied accepted examples (real weak-fails-strong-derives on both docs).
tangletools
left a comment
✅ Auto-approved PR — 07ee88dd
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-26T14:49:39Z
tangletools
left a comment
🟢 Value Audit — sound
| Verdict | sound |
| Concerns | 1 (1 low) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 137.8s (2 bridge agents) |
| Total | 137.8s |
💰 Value — sound
Adds a fixed-slots powered measurement harness over the existing autodata loop, reporting accept-rate with Wilson CIs and gap-widening with paired-bootstrap CIs so the n=3 coin-flip question can be settled at power.
- What it does: Introduces `src/autodata/powered.ts` (and tests) that runs a configured number of independent autodata slots across two non-memorized grounding docs, writes per-attempt JSONL trails, and recomputes durable statistics from those trails: accept-rate with a Wilson 95% CI, per-slot best-gap distribution, plain→refined gap-widening with a paired-bootstrap CI, accept-rule decomposition, and per-doc brea…
- Goals it achieves: Settles whether the causal-challenger autodata loop reliably manufactures discriminating examples (accept-rate bounded away from zero) after the prior n=3 runs were too noisy to distinguish a real rate from ~0. It also makes the measurement interruption-safe and bounded-cost, with explicit confidence intervals and a cost gate.
- Assessment: Good change, coherent and well-scoped. It layers on top of the existing loop without rewriting it, follows the repo's patterns (env knobs, `smokeTestModels` cost gate, JSONL trails, agent-eval stats, fail-loud), and adds offline unit tests for the aggregation edge cases (denominator = requested target, multi-doc pooling, condition decomposition).
- Better / existing approach: none; this is the right approach. I checked: agent-eval exports `wilson` and `pairedBootstrap` but does not export a generic `quantile`/`describe` helper (its `quantile` is internal to matrix aggregation), so the local descriptive helpers are not duplication. The existing `run.ts` and `calibrate.ts` runners are single-doc / until-N-accepted and do no cross-slot CI aggregation, so there is no exis…
- Model: opencode/kimi-for-coding/k2p7
- Bridge attempts: 1
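For readers unfamiliar with the paired-bootstrap CI mentioned above, the following is a minimal illustrative sketch of a seeded percentile bootstrap over per-slot deltas. It is not agent-eval's `pairedBootstrap` (whose signature and resampling details are not shown in this PR); the names `mulberry32` and `pairedBootstrapSketch`, the resample count, and the percentile method are all assumptions made for illustration.

```typescript
// Small seeded PRNG (mulberry32) so that resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical helper: 95% percentile-bootstrap CI on the mean of (refined - plain).
function pairedBootstrapSketch(
  plain: number[],
  refined: number[],
  resamples = 2000,
  seed = 1,
): { meanDelta: number; lo: number; hi: number } {
  if (plain.length !== refined.length) {
    throw new Error('paired samples must have equal length');
  }
  const deltas = refined.map((r, i) => r - plain[i]);
  const n = deltas.length;
  if (n === 0) return { meanDelta: NaN, lo: NaN, hi: NaN };

  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let b = 0; b < resamples; b++) {
    // Resample whole pairs (i.e. deltas) with replacement.
    let sum = 0;
    for (let i = 0; i < n; i++) sum += deltas[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const at = (q: number) => means[Math.min(means.length - 1, Math.floor(q * means.length))];
  const meanDelta = deltas.reduce((s, d) => s + d, 0) / n;
  return { meanDelta, lo: at(0.025), hi: at(0.975) };
}
```

Resampling the deltas rather than the two arrays independently is what keeps the comparison paired: each slot's plain and refined gaps stay together in every resample.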
🎯 Usefulness — sound
A well-fit powered-measurement runnable (third sibling of run.ts/calibrate.ts) that reuses the existing dataset loop + agent-eval's published CIs to settle whether the autodata loop reliably discriminates — reachable, on-pattern, interruption-safe.
- Assessment: Net positive, not dead-end. It settles the actual question the prior n=3 run left open (accept-rate with a CI that excludes ~0) and emits a machine-readable RESULT_JSON (powered.ts:441) plus feeds the cited docs/results/autodata-live.md. The per-doc and per-condition decomposition is genuinely informative (identifies weak<0.5 as the binding gate), and `analyzeTrails` is reusable for any future re-…
- Integration: Reachable and wired correctly. `powered.ts` is a standalone runnable guarded at powered.ts:445 (`process.argv[1].endsWith('powered.ts')`), invoked exactly like the sibling `pnpm tsx src/autodata/run.ts` / `calibrate.ts`. `analyzeTrails`, `DocTrail`, `PoweredStats` are re-exported from src/autodata/index.ts:39 for programmatic re-use. It composes only existing pieces: `buildAutodataDataset` (build-…
- Fit with existing patterns: Excellent: fits the established grain precisely. It is a third sibling of run.ts and calibrate.ts: identical shape (env knobs via a local `envInt`, cost-gate smoke test, ground doc, call `buildAutodataDataset`, print a prose report). The inferential stats reuse agent-eval's published `wilson`/`pairedBootstrap` rather than hand-rolling, which matches the AGENTS.md layering rule that agent-knowledg…
- Real-world viability: Holds up off the happy path. Interruption-safe by design: final stats are recomputed from the on-disk JSONL via `readTrail`, which treats a missing trail file as an empty trail (powered.ts:302, ENOENT → []) so all-challenger-failed slots still count against the denominator; this is explicitly tested at powered.test.ts:42-52. `analyzeTrails` is pure + seeded-deterministic (powered.ts:180), so re-ru…
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 1
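The "missing trail file counts as an empty trail" behaviour described above could look roughly like the sketch below. This is an assumption about the shape of `readTrail`, not its actual source: the name `readTrailSketch`, the synchronous read, and the `unknown[]` return type are illustrative only.

```typescript
import { readFileSync } from 'node:fs';

// Hypothetical sketch: read a per-attempt JSONL trail, treating ENOENT as "no attempts".
function readTrailSketch(path: string): unknown[] {
  let text: string;
  try {
    text = readFileSync(path, 'utf8');
  } catch (err) {
    // A slot whose challenger failed every time never creates the file;
    // returning [] lets it still count against the denominator.
    if ((err as { code?: string }).code === 'ENOENT') return [];
    throw err; // any other I/O error should fail loudly
  }
  return text
    .split('\n')
    .filter((line) => line.trim() !== '')
    .map((line) => JSON.parse(line) as unknown);
}
```

Note that this sketch throws on a malformed line; whether the real reader tolerates a half-written final line after an interruption is not stated in the review.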
🔎 Heuristic Signals
🟡 Cruft: console debug added src/autodata/powered.ts
- console.log('\n══════════════════════════════════════════════════════════════════════════')
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
✅ No Blockers —
| | opencode-kimi | glm | deepseek | aggregate |
|---|---|---|---|---|
| Readiness | 80 | 83 | 85 | 80 |
| Confidence | 70 | 70 | 70 | 70 |
| Correctness | 80 | 83 | 85 | 80 |
| Security | 80 | 83 | 85 | 80 |
| Testing | 80 | 83 | 85 | 80 |
| Architecture | 80 | 83 | 85 | 80 |
Full multi-shot audit completed 2/2 planned shots over 4 changed files. Global verifier still owns final merge decision.
🟠 MEDIUM Condition decomposition thresholds hardcoded, diverge from configurable accept rule — src/autodata/powered.ts
Lines 171-173 hardcode minStrong=0.65, maxWeak=0.5, minGap=0.2 for the per-attempt condition decomposition (lines 238-241). But buildAutodataDataset accepts DiscriminativeThresholds (build-dataset.ts:22-26) that can override these thresholds. If a caller runs the loop with custom thresholds and then imports analyzeTrails to decompose the resulting trails, the strongHi/weakLo/gapWide/all fractions will be computed against the wrong gates. The per-slot accept-rate is unaffected (it uses decision.accept). Fix: accept optional thr
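One way to address this finding is to let the decomposition take optional threshold overrides that default to the hardcoded values. The sketch below is hypothetical: the `Thresholds` and `Attempt` shapes, the `decomposeConditions` name, and the assumption that gap = strong − weak are illustrative and not taken from `powered.ts` or `DiscriminativeThresholds`.

```typescript
interface Thresholds {
  minStrong: number;
  maxWeak: number;
  minGap: number;
}

// Defaults mirror the values quoted in the finding.
const DEFAULT_THRESHOLDS: Thresholds = { minStrong: 0.65, maxWeak: 0.5, minGap: 0.2 };

// Hypothetical attempt shape: solver scores in [0, 1].
interface Attempt {
  weak: number;
  strong: number;
}

function decomposeConditions(attempts: Attempt[], overrides: Partial<Thresholds> = {}) {
  const t: Thresholds = { ...DEFAULT_THRESHOLDS, ...overrides };
  const strongHi = (a: Attempt) => a.strong >= t.minStrong;
  const weakLo = (a: Attempt) => a.weak < t.maxWeak;
  const gapWide = (a: Attempt) => a.strong - a.weak >= t.minGap; // assumed gap definition
  const frac = (pred: (a: Attempt) => boolean) =>
    attempts.length === 0 ? 0 : attempts.filter(pred).length / attempts.length;
  return {
    strongHi: frac(strongHi),
    weakLo: frac(weakLo),
    gapWide: frac(gapWide),
    all: frac((a) => strongHi(a) && weakLo(a) && gapWide(a)),
  };
}
```

A caller that ran the loop with custom gates would pass the same overrides here, so the reported fractions are computed against the gates that actually drove acceptance.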
🟡 LOW Reproduce command does not export the env vars needed to match the headline config — docs/results/autodata-live.md
The Design section (lines 66-69) and the reproduce comment (line 149) state the headline run used AUTODATA_SLOTS_PER_DOC=16 and AUTODATA_MAXRETRIES=2. But the reproduce command at line 148 (`pnpm tsx src/autodata/powered.ts`) runs with code defaults, which are slotsPerDoc=14 and maxRetries=3 (powered.ts:377-379). The knobs are listed only as an inert comment (`# knobs: ...`), not exported, so
🟡 LOW Unexplained n=33 attempt denominator — docs/results/autodata-live.md
Lines 82-89 report the weak/strong score distributions and accept-rule decomposition over '33 quality-clean attempts', while the experimental design just above describes 32 total slots (16 per doc) with 5 challenger-stage failures leaving 27 producing slots. The relationship between 32 slots, 27 producing slots, and 33 attempts is not stated, so a reader has to infer that a slot may emit multiple quality-clean candidates across the refine fold. Concrete: the condition table says 'weak < 0.50 = 39%' (≈13/33) and 'all-three = 36%' (≈12/33), while the headline accept-rate is 12/32 slots. Impact: minor friction reconciling denominators; not a numerical error.
🟡 LOW AUTODATA_DOCS tag interpolated into paths unsanitized — src/autodata/powered.ts
Lines 413, 417, and 432 build paths `data/powered-${doc.tag}.jsonl` and `data/powered-${doc.tag}-attempts.jsonl` from `AUTODATA_DOCS` segments without rejecting path separators or `..`. A tag like `../../tmp/x` would cause writes/reads/deletes outside `data/`. Fix by validating tags against `/^[a-zA-Z0-9_-]+$/` or sanitizing the path segment.
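The suggested allow-list check can be sketched as follows. The helper name `trailPaths` and its return shape are hypothetical; only the regular expression and the two path templates come from the finding.

```typescript
// Allow-list from the finding: letters, digits, underscore, hyphen only.
const SAFE_TAG = /^[a-zA-Z0-9_-]+$/;

// Hypothetical helper: build both trail paths only after the tag is validated.
function trailPaths(tag: string): { dataset: string; attempts: string } {
  if (!SAFE_TAG.test(tag)) {
    throw new Error(`unsafe doc tag: ${JSON.stringify(tag)}`);
  }
  return {
    dataset: `data/powered-${tag}.jsonl`,
    attempts: `data/powered-${tag}-attempts.jsonl`,
  };
}
```

Rejecting rather than rewriting the tag keeps the failure loud, which matches the fail-loud pattern the review attributes to the repo.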
🟡 LOW Biome lint warnings in new code — src/autodata/powered.ts
Lines 441 (`console.log('RESULT_JSON ' + ...)`) and 445 (`if (process.argv[1] && process.argv[1].endsWith(...))`) trigger `lint/style/useTemplate` and `lint/complexity/useOptionalChain`. Both are auto-fixable. Leaving them creates noise in the project's lint output. Fix by running `biome check --write src/autodata/powered.ts` for these two lines.
🟡 LOW Cost gate omits the judge model — src/autodata/powered.ts
Line 392 calls `smokeTestModels` with only `[CHALLENGER_MODEL, WEAK_SOLVER_MODEL, STRONG_SOLVER_MODEL]`. The loop still invokes the judge via `buildAutodataDataset`, so a dead or misconfigured `AUTODATA_JUDGE_MODEL` is not caught before the burn. Add `JUDGE_MODEL` to the smoke-test list so all four live roles are gated.
🟡 LOW Quality-failed plain drafts contaminate the paired widening CI — src/autodata/powered.ts
In analyzeTrails (lines ~270-275), `earliest` is selected from ALL rows via `[...rows].sort((a,b) => a.iteration - b.iteration)[0]`, without filtering on `qualityOk`. Meanwhile `bestGap` (the refined side of the pair) is computed from quality-clean attempts only. So if the first (plain) draft fails the quality gate (leak/rubric) but a later refined draft is clean, the paired delta = cleanBestGap - failedPlainGap pairs an invalid baseline with a valid refined gap. The `conditions` decomposition correctly excludes quality-failed attempts (line ~283: `if (!r.qualityOk) continue`), but the widening pairing does not, which is an inconsistency. A quality-failed plain draft could have an artificially low gap (scores from a leaky example), inflating `widening.meanDelta` upward. Fix: filter `earliest` to q
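A sketch of the consistent pairing, assuming the intended fix is to restrict the baseline to quality-clean rows as well. The `Row` shape and the `pairForWidening` name are hypothetical; only the `iteration` and `qualityOk` field names appear in the finding, and `gap` is assumed.

```typescript
// Hypothetical per-attempt row; `gap` is an assumed field name.
interface Row {
  iteration: number;
  qualityOk: boolean;
  gap: number;
}

// Pair the earliest quality-clean gap with the best quality-clean gap.
// Returns null when the slot has no clean attempt, so it drops out of the paired CI.
function pairForWidening(rows: Row[]): { plainGap: number; refinedGap: number } | null {
  const clean = rows.filter((r) => r.qualityOk);
  if (clean.length === 0) return null;
  const earliest = [...clean].sort((a, b) => a.iteration - b.iteration)[0];
  const bestGap = Math.max(...clean.map((r) => r.gap));
  return { plainGap: earliest.gap, refinedGap: bestGap };
}
```

One design question this leaves open: if the earliest clean row is already a refined draft, it is no longer a true "plain" baseline. A stricter variant would return null whenever the first iteration failed the quality gate.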
🟡 LOW Two biome lint warnings in powered.ts — src/autodata/powered.ts
Line 441: `console.log('RESULT_JSON ' + JSON.stringify(...))` triggers lint/style/useTemplate (prefer template literal). Line 445: `process.argv[1] && process.argv[1].endsWith(...)` triggers lint/complexity/useOptionalChain (prefer `?.`). Both are auto-fixable nits, non-blocking.
🟡 LOW describe() emits NaN fields when every slot's challenger fails — src/autodata/powered.ts
If every slot produces zero attempts (all challenger failures), `bestGapPerSlotAll` and `plainGapsPaired`/`refinedGapsPaired` are empty. `describe([])` returns `{n:0, min:NaN, median:NaN, p90:NaN, max:NaN, mean:NaN}`. The `dist()` formatter then prints 'NaN' values in the report. Not a crash (`wilson(0,0)` and `pairedBootstrap([],[])` are safe per verified source), and unlikely given the cost gate, but the report degrades silently. Consider short-circuiting `printReport` with a 'no slots produced attempts' message when `slotsWithAttempts === 0`.
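A possible shape for the guard, sketched with stand-in helpers. `describeSafe`, `distSafe`, and `quantileSorted` are hypothetical names, and the linear-interpolation quantile is an assumption; the PR's own `describe` may define its quantiles differently.

```typescript
interface Summary {
  n: number;
  min: number;
  median: number;
  p90: number;
  max: number;
  mean: number;
}

// Linear-interpolated quantile over an ascending-sorted, non-empty array.
function quantileSorted(sorted: number[], q: number): number {
  const pos = (sorted.length - 1) * q;
  const lo = Math.floor(pos);
  const hi = Math.ceil(pos);
  return sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo);
}

// Returns null instead of a NaN-filled summary when there is nothing to describe.
function describeSafe(xs: number[]): Summary | null {
  if (xs.length === 0) return null;
  const sorted = [...xs].sort((a, b) => a - b);
  return {
    n: sorted.length,
    min: sorted[0],
    median: quantileSorted(sorted, 0.5),
    p90: quantileSorted(sorted, 0.9),
    max: sorted[sorted.length - 1],
    mean: sorted.reduce((s, x) => s + x, 0) / sorted.length,
  };
}

// Formatter that prints an explicit message rather than 'NaN' fields.
function distSafe(xs: number[]): string {
  const d = describeSafe(xs);
  if (d === null) return 'n=0 (no slots produced attempts)';
  const f = (v: number) => v.toFixed(2);
  return `n=${d.n} min=${f(d.min)} median=${f(d.median)} p90=${f(d.p90)} max=${f(d.max)} mean=${f(d.mean)}`;
}
```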
🟡 LOW envInt accepts non-integer string prefixes — src/autodata/powered.ts
Lines 64-70 use `Number.parseInt(raw, 10)`, which silently accepts inputs like `'14abc'` as `14` or `'0x10'` as `0` for slot/sample/retry counts. This is a silent misconfiguration risk. Fix by validating with `/^\d+$/` before parsing and throwing for non-integer strings.
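A strict variant along the lines of the suggested fix might look like this. `envIntStrict` and its injectable `env` parameter are hypothetical, and treating an unset or empty variable as "use the default" is an assumption about the existing `envInt`.

```typescript
// Hypothetical strict replacement for a local envInt helper.
function envIntStrict(
  name: string,
  fallback: number,
  env: Record<string, string | undefined> = process.env,
): number {
  const raw = env[name];
  if (raw === undefined || raw === '') return fallback; // unset: use the code default
  // Number.parseInt stops at the first non-digit, so '14abc' would parse to 14
  // and, with radix 10, '0x10' would parse to 0. Reject both up front.
  if (!/^\d+$/.test(raw)) {
    throw new Error(`${name} must be a non-negative integer, got ${JSON.stringify(raw)}`);
  }
  return Number.parseInt(raw, 10);
}
```

Passing the environment as a parameter keeps the helper testable offline, for example `envIntStrict('AUTODATA_SLOTS_PER_DOC', 14, { AUTODATA_SLOTS_PER_DOC: '16' })`.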
tangletools · 2026-06-26T15:47:13Z · trace
tangletools
left a comment
✅ Approved — 10 non-blocking findings — 07ee88dd
Full multi-shot audit completed 2/2 planned shots over 4 changed files. Global verifier still owns final merge decision.
Full immutable report for this review: trace
Summary comment for this run: full summary
tangletools · 2026-06-26T15:47:13Z · immutable trace
What this settles
At n=3 the Autodata accept result was noise (one run 2/3, an independent re-run 0/3). This runs it at power and reports the accepted-rate with a confidence interval so n=1–2 luck can't fake it.
Verdict: WORKS AT POWER — not a coin-flip. Accept-rate 38%, Wilson 95% CI [23%, 55%] (12 of 32 slots cleared weak<0.5 ∧ strong≥0.65 ∧ gap≥0.2). The CI lower bound excludes ~0, and even the harder of the two docs (mixtral, 19% [7%, 43%]) excludes 0.
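The reported intervals can be checked by hand with the standard Wilson score formula. The sketch below is a stand-alone illustration, not agent-eval's `wilson` (whose signature is not shown here); the function name, the z = 1.96 default, and the `[0, 1]` return for zero trials are assumptions. The 3-of-16 count for mixtral is inferred from the reported 19% over 16 slots.

```typescript
// Illustrative Wilson score interval for a binomial proportion.
function wilsonInterval(successes: number, trials: number, z = 1.96): { lo: number; hi: number } {
  if (trials === 0) return { lo: 0, hi: 1 }; // assumed convention for "no data"
  const p = successes / trials;
  const z2 = z * z;
  const denom = 1 + z2 / trials;
  const center = (p + z2 / (2 * trials)) / denom;
  const half = (z * Math.sqrt((p * (1 - p)) / trials + z2 / (4 * trials * trials))) / denom;
  return { lo: Math.max(0, center - half), hi: Math.min(1, center + half) };
}

const pct = (x: number) => `${Math.round(x * 100)}%`;

const pooled = wilsonInterval(12, 32); // headline: 12 of 32 slots
const mixtral = wilsonInterval(3, 16); // inferred count for the harder doc

console.log(`pooled  [${pct(pooled.lo)}, ${pct(pooled.hi)}]`); // [23%, 55%]
console.log(`mixtral [${pct(mixtral.lo)}, ${pct(mixtral.hi)}]`); // [7%, 43%]
```

Unlike the normal-approximation interval, the Wilson lower bound stays above zero whenever at least one success is observed, which is why it is a reasonable choice for small per-doc counts like 3 of 16.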
Design (powered, fixed-slots)
- … llama-3.1-8b-cutoff MoE papers, prose chunks (LaTeX breaks the challenger's strict-JSON).
- … `wilson` / `pairedBootstrap`, not hand-rolled. Stats recomputed from the on-disk per-attempt JSONL (interruption-safe).
- `buildAutodataDataset` loop unchanged: it already runs exactly `target` independent slots; only cross-slot aggregation + the two CIs are added.
Numbers
Total live spend: $0.57 (32-slot run; ~$1.0 incl. pilots).
Two autopsied accepted examples (in `docs/results/autodata-live.md`) confirm real weak-fails-strong-derives discrimination on both docs, not a judge artifact or leakage.
The honest caveats (quantified, neither overturns the verdict)
- … `llama-3.1-8b` struggles on only ~39% of attempts (weak median 0.55). The rate is a property of the tier gap, exactly what the discriminative reward should measure.
Verify
`pnpm typecheck && pnpm lint && pnpm build && pnpm test` all green (203 tests pass, 12 live-gated skipped). New offline coverage in `src/autodata/powered.test.ts`.
Reproduce:
dotenvx run -f <secrets>.env -- pnpm tsx src/autodata/powered.ts