feat: training intelligence — checker agent, checkpoint cadence, experiment checklist (Batch B+C)#95

Merged: SamPlvs merged 1 commit into main from claude/training-intelligence on May 30, 2026

SamPlvs (Owner) commented May 30, 2026

Batch B+C — Training intelligence

First of the process-hardening batches: turn training-time rules that previously had to be re-supplied every run into enforced platform behaviour (PR-005 applied to the autonomous run — aspirational agent prose → code paths that always fire).

What's in it

  • training-checker agent (Sonnet, phase-in) — a per-model-run live monitor the Lead spawns as training-{modelname}-checker. Tails the active experiment's metrics.jsonl/training_status.json, alerts Model Builder + Lead on NaN/Inf, divergence, gradient blow-up, overfit, dead LR, or stall (kill a broken run early), and writes a mechanistic diagnosis.md + feeds next.md. Enforced via an always-fires Phase-4 instruction in Orchestrator._prompt_experiment_context (not the plan's active-agent roster, which _agents_for_phase filters), backed by the agent file + AGENT_PHASE_MAP.
  • Checkpoint cadence + disaster recovery — importable should_checkpoint(epoch, total_epochs, every=10, is_best=) (replaces the contract's previously-undefined pseudocode) + a "Checkpointing and Disaster Recovery (REQUIRED)" section in model-builder.md: DL every-10-epochs + best + last with fully-resumable state (optimizer/scheduler/AMP-scaler/epoch/RNG); ML per-fold + best + persisted HPO study state.
  • Research-scout general-AI track — Phase-4-iteration survey of the broad ML literature (time-series/sequence modelling, optimization, regularization), method-first, pairing with the checker on each failure mode (additive to its domain problem-class survey).
  • Auto-maintained experiment checklist: render_checklist/write_checklist regenerate .zo/experiments/CHECKLIST.md on every registry mutation (exp → hypothesis → metric → Δ vs parent → tier → top shortfall, + "Next planned").
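
The checkpoint cadence described in the second bullet (every 10 epochs, plus best, plus last) could be sketched roughly as below. This is only an illustration, not the helper shipped in this PR: the `is_best=False` default, the 1-based epoch numbering, and the `ValueError` guard are all assumptions, since the description gives only the parameter names.

```python
def should_checkpoint(epoch: int, total_epochs: int, every: int = 10,
                      is_best: bool = False) -> bool:
    """Return True if a checkpoint should be written at the end of `epoch`.

    Assumptions (not confirmed by the PR text): epochs are 1-based and
    `is_best` defaults to False.
    """
    if every <= 0:
        raise ValueError("every must be a positive integer")
    is_last = epoch == total_epochs   # always keep the final state
    on_cadence = epoch % every == 0   # periodic disaster-recovery snapshot
    return is_best or is_last or on_cadence


# Epochs that would be checkpointed in a 25-epoch run with no new best model.
saved = [e for e in range(1, 26) if should_checkpoint(e, total_epochs=25)]
print(saved)  # [10, 20, 25]
```

Under these assumptions a caller would pass `is_best=True` whenever the validation metric improves, so the best model is saved regardless of where the epoch falls on the cadence.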

Cascade + verification

  • Agent roster 20 → 21 (setup.sh, README, lead-orchestrator, specs/agents.md, plans, PRD).
  • +20 tests → 780 passed / 7 skipped on Python 3.11 and 3.12.
  • ruff src/ clean; validate-docs 0 failures (2 warnings: client-blocklist skip + the known grep-vs-pytest test-badge parameterization gap; README badge updated 743 → 780).
  • PR-040 added (enforce-not-aspirate; cross-refs PR-005/009/035).

Batches A (per-project self-evolution), D (optimization audit + software-engineer agent), E (idle-agent shutdown + swarm reinforcement) queued next.

🤖 Generated with Claude Code

…klist

Batch B+C of the process-hardening work — enforce training-time
behaviour the user kept re-instructing every run (PR-005 applied to
the autonomous run):

- New training-checker agent (Sonnet, phase-in): per-model-run live
  monitor spawned as training-{modelname}-checker; alerts on NaN/
  divergence/overfit/stall, writes diagnosis.md + next.md. Enforced via
  the always-fires Phase-4 instruction in _prompt_experiment_context +
  AGENT_PHASE_MAP, independent of the plan's active-agent list.
- should_checkpoint() helper + model-builder "Checkpointing and
  Disaster Recovery (REQUIRED)" section: DL every-10-epochs + best +
  last (fully resumable); ML per-fold + best + persisted HPO state.
- research-scout general-AI-research track for Phase-4 iteration
  (time-series/sequence modelling, method-first), pairs with checker.
- Auto-maintained .zo/experiments/CHECKLIST.md via render_checklist/
  write_checklist, baked into every registry mutation.
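
A few of the checker's alert conditions listed above (NaN/Inf, divergence, overfit) could be approximated by a scan over metrics.jsonl lines, as in the rough sketch below. The record fields (`epoch`, `train_loss`, `val_loss`), the thresholds, and the function name `scan_metrics` are hypothetical; the actual agent is prompt-driven and its heuristics are not specified in this PR.

```python
import json
import math


def scan_metrics(lines, divergence_factor=3.0, patience=3):
    """Return (epoch, alert) tuples for a sequence of JSONL metric lines.

    Hypothetical schema: each line has "epoch", "train_loss", "val_loss".
    """
    alerts = []
    best_train = math.inf
    overfit_streak = 0
    prev = None  # (train_loss, val_loss) of the previous record
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        epoch, train, val = rec["epoch"], rec["train_loss"], rec["val_loss"]
        # NaN/Inf: the run is broken, so stop scanning after alerting.
        if not (math.isfinite(train) and math.isfinite(val)):
            alerts.append((epoch, "non_finite"))
            break
        # Divergence: training loss far above the best value seen so far.
        if train > divergence_factor * best_train:
            alerts.append((epoch, "divergence"))
        best_train = min(best_train, train)
        # Overfit: validation loss rises while training loss keeps falling.
        if prev is not None and val > prev[1] and train < prev[0]:
            overfit_streak += 1
        else:
            overfit_streak = 0
        if overfit_streak == patience:
            alerts.append((epoch, "overfit"))
        prev = (train, val)
    return alerts


sample = [
    '{"epoch": 1, "train_loss": 1.0, "val_loss": 1.10}',
    '{"epoch": 2, "train_loss": 0.8, "val_loss": 0.90}',
    '{"epoch": 3, "train_loss": 0.7, "val_loss": 0.95}',
    '{"epoch": 4, "train_loss": 0.6, "val_loss": 1.00}',
    '{"epoch": 5, "train_loss": 0.5, "val_loss": 1.10}',
    '{"epoch": 6, "train_loss": NaN, "val_loss": 1.20}',
]
print(scan_metrics(sample))  # [(5, 'overfit'), (6, 'non_finite')]
```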

Agent roster 20 -> 21 with full doc cascade. +20 tests (760 -> 780 on
Python 3.11 & 3.12). ruff src/ clean, validate-docs 0 failures.
PR-040 captures the enforce-not-aspirate lesson.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

cloudflare-workers-and-pages Bot commented May 30, 2026

Deploying zero-operators with Cloudflare Pages

Latest commit: b162b38
Status: ✅  Deploy successful!
Preview URL: https://bdce04af.zero-operators.pages.dev
Branch Preview URL: https://claude-training-intelligence.zero-operators.pages.dev

SamPlvs merged commit f0a6abd into main on May 30, 2026 (5 checks passed)
SamPlvs deleted the claude/training-intelligence branch on May 30, 2026 08:46