From 11b4a4f81f431947bd252daa348623cb576f4f01 Mon Sep 17 00:00:00 2001 From: SamPlvs Date: Wed, 27 May 2026 19:20:54 +0100 Subject: [PATCH 1/3] =?UTF-8?q?copy:=20launch-prep=20cascade=20=E2=80=94?= =?UTF-8?q?=20contributors,=20origin,=20stale=20token=20+=20test=20counts?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit External-review pass before broad launch. Three categories: 1. Stale "~70-80% reduction" claims in docs/quickstart.mdx:116, docs/cli/build.mdx:155, docs/COMMANDS.md:28 updated to match the measured-honesty framing already in cost-benchmark.mdx, low-token-mode.mdx and the README: ~30% measured ($7.75 vs ~$11 on MNIST), 50-60% targeted with Haiku routing + per-phase trims (pending second bench), 70-80% on the roadmap via SDK refactor. The session-025 honesty cascade missed these three surfaces. 2. Stale test counts: docs/installation.mdx:130 (675 → 743), README.md tests badge + line 508 (735 → 743). Verified via `pytest -q`: 743 passed, 7 skipped. 3. Website origin story at website/src/pages/index.html:951 stripped of industry detail ("high-stakes client / power plant" → "eight-week production ML project"). The prior phrasing brushed the public-repo confidentiality rules in CLAUDE.md; safer to drop the domain entirely before broader launch. Also: footer heading website/src/pages/index.html:1046 "The people behind it." → "Contributors." per user direction. Memory updated per auto-protocol (session 029 hand-off in STATE.md + new DECISION_LOG entry). validate-docs 9/11 (0 failures; 2 pre-existing warnings). Not in this PR (live on the repo via `gh`): description, homepage, 9 topics, Discussions enabled, 3 seed threads posted (#80 / #81 / #82). Co-Authored-By: Claude Opus 4.7 (1M context) --- README.md | 4 ++-- docs/COMMANDS.md | 2 +- docs/cli/build.mdx | 2 +- docs/installation.mdx | 2 +- docs/quickstart.mdx | 2 +- memory/zo-platform/DECISION_LOG.md | 33 ++++++++++++++++++++++++++++++ memory/zo-platform/STATE.md | 4 +++- website/src/pages/index.html | 10 ++++----- 8 files changed, 47 insertions(+), 12 deletions(-) diff --git a/README.md b/README.md index 1fe4ca0..4a36792 100644 --- a/README.md +++ b/README.md @@ -10,7 +10,7 @@
[![Status](https://img.shields.io/badge/status-validated-D87A57?style=flat-square&labelColor=12110F)](#status) -[![Tests](https://img.shields.io/badge/tests-735_passing-D87A57?style=flat-square&labelColor=12110F)](#status) +[![Tests](https://img.shields.io/badge/tests-743_passing-D87A57?style=flat-square&labelColor=12110F)](#status) [![Agents](https://img.shields.io/badge/agents-20_defined-D87A57?style=flat-square&labelColor=12110F)](#agent-teams) [![Docs](https://img.shields.io/badge/docs-zerooperators.com-D87A57?style=flat-square&labelColor=12110F)](https://docs.zerooperators.com) @@ -505,7 +505,7 @@ delivery-repo/ | 1.0.2 | Platform-aware Docker scaffold + reference-project end-to-end demos | Done | | 1.0.2-post | `--low-token` cost-saving preset (two-tier model routing, per-phase trims) + `ZOTrainingCallback` hard gate enforcement | Done | -735 platform tests. ruff clean (`src/zo/`). 20 agents. 24 slash commands. Measured benchmarks tracked in [docs/reference/cost-benchmark.mdx](docs/reference/cost-benchmark.mdx). +743 platform tests. ruff clean (`src/zo/`). 20 agents. 24 slash commands. Measured benchmarks tracked in [docs/reference/cost-benchmark.mdx](docs/reference/cost-benchmark.mdx). --- diff --git a/docs/COMMANDS.md b/docs/COMMANDS.md index ca2b2f6..9f0f696 100644 --- a/docs/COMMANDS.md +++ b/docs/COMMANDS.md @@ -25,7 +25,7 @@ zo build plans/project.md [--gate-mode supervised|auto|full-auto] [--no-tmux] ``` **Cost-saving options:** -- `--low-token`: activates the cost-saving preset (Sonnet lead, 2 max iterations, full-auto gates, no headlines, earlier auto-compaction). Equivalent to `low_token: true` in plan frontmatter. ~70-80% reduction. See `docs/concepts/low-token-mode.mdx`. +- `--low-token`: activates the cost-saving preset (Sonnet lead, Haiku for code-reviewer / test-engineer / oracle-qa, 2 max iterations, full-auto gates, no headlines, earlier auto-compaction). Equivalent to `low_token: true` in plan frontmatter. **~30% measured savings** on MNIST ($7.75 vs ~$11 default); 50-60% targeted, 70-80% on the roadmap (prompt caching + Batch API + Files API via SDK refactor). See `docs/concepts/low-token-mode.mdx`. - `--lead-model`: override lead orchestrator model. Composes with `--low-token`. - `--max-iterations N`: hard cap on Phase-4 experiment iterations. Wins over plan and preset. - `--no-headlines`: disable the Haiku headline ticker (saves ~60 small calls/hour). diff --git a/docs/cli/build.mdx b/docs/cli/build.mdx index 4e484ce..faf0f72 100644 --- a/docs/cli/build.mdx +++ b/docs/cli/build.mdx @@ -152,7 +152,7 @@ While `zo build` is running, you have several windows into what the team is doin ```bash zo build plans/my-project.md --low-token ``` - Sonnet lead, 2 max iterations, no headlines, full-auto gates, earlier auto-compaction. ~70-80% token reduction. See [low-token mode](/concepts/low-token-mode) for the full preset and trade-offs. + Sonnet lead, Haiku for code-reviewer / test-engineer / oracle-qa, 2 max iterations, no headlines, full-auto gates, earlier auto-compaction. **~30% measured savings** on MNIST (\$7.75 vs ~\$11 default); 50-60% targeted, 70-80% on the [roadmap](/roadmap). See [low-token mode](/concepts/low-token-mode) for the full preset and trade-offs. ```bash diff --git a/docs/installation.mdx b/docs/installation.mdx index 769e13d..bad7be1 100644 --- a/docs/installation.mdx +++ b/docs/installation.mdx @@ -127,7 +127,7 @@ After installation, run the test suite to confirm everything wires up: ```bash PYTHONPATH=src pytest tests/ -q -# 675 passed, 7 skipped in 7s +# 743 passed, 7 skipped in 5s ``` And run preflight on the bundled MNIST plan: diff --git a/docs/quickstart.mdx b/docs/quickstart.mdx index 6dfab6d..b896427 100644 --- a/docs/quickstart.mdx +++ b/docs/quickstart.mdx @@ -113,7 +113,7 @@ The **Lead Orchestrator** reads the plan, decomposes it into phases, builds cont zo build plans/mnist-demo.md --low-token ``` - Sonnet lead instead of Opus, 2 Phase-4 iterations instead of 10, full-auto gates, no headlines, earlier auto-compaction. ~70-80% cost reduction (MNIST: ~\$11 → ~\$2-3). Slight quality trade at the lead step. See [low-token mode](/concepts/low-token-mode) for the full preset and trade-offs. + Sonnet lead instead of Opus, Haiku for code-reviewer / test-engineer / oracle-qa, 2 Phase-4 iterations instead of 10, full-auto gates, no headlines, earlier auto-compaction. **~30% measured savings** on the MNIST bench (\$7.75 vs ~\$11 default); 50-60% targeted with the Haiku routing + per-phase trims (pending second bench), 70-80% on the [roadmap](/roadmap) via SDK refactor (prompt caching, Batch API, Files API). Slight quality trade at the lead step. See [low-token mode](/concepts/low-token-mode) for the full preset and trade-offs. ## 6. Verify diff --git a/memory/zo-platform/DECISION_LOG.md b/memory/zo-platform/DECISION_LOG.md index 9d3d21d..e8dc2d7 100644 --- a/memory/zo-platform/DECISION_LOG.md +++ b/memory/zo-platform/DECISION_LOG.md @@ -998,3 +998,36 @@ Recommended sequencing within Tier 2: **architecture-first** (cross-phase backed - **Visibility-first sequencing for Tier 2 (VS Code extension before backedge primitive)** — defensible if researcher onboarding is the bigger growth lever; lean is architecture-first because F3 is the architectural differentiator. **Outcome:** `docs/roadmap.mdx` rewritten with new tier structure + concrete next actions + caveman/VS-code-extension/cross-phase-backedge framing (replaces the prior 3-card "top priorities" page). Plan file at `~/.claude/plans/so-before-i-begin-peaceful-valley.md` holds the full internal detail (effort estimates, file paths, TODO checklists for F3 Phase A/Phase B). Open questions still pending: time horizon (Q3 vs H2), Tier 2 sequencing confirmation, audience weighting. **No new PRIOR added** — this is a strategic-design decision, not a failure-driven self-evolution; design rationale captured in this DECISION_LOG entry is the auditable record. **No PR opened yet** — roadmap doc and memory updates land in this commit; Tier 1 implementation kickoff (caveman ablation, onboarding hardening) deferred to user direction. + +--- + +## Decision: 2026-05-27T00:00:00Z +**Type:** DOCUMENTATION + LAUNCH-PREP +**Title:** Launch-prep cascade — contributors rename, origin story, stale token + test-count claims, GitHub repo metadata, Discussions seeded + +**Decision:** Ship a single-PR launch-prep cascade fixing three categories of stale/risky public surface area surfaced by external review, plus apply GitHub-side metadata and seed Discussions before the broad launch push. + +Cascade scope (file edits, single PR to `main`): +- **`website/src/pages/index.html:1046`** — footer heading "The people behind it." → "Contributors." per user direction. Simpler, less precious framing. +- **`website/src/pages/index.html:951`** — origin paragraph rewritten to strip industry detail ("a high-stakes client needed a model to optimise their power plant" → "an eight-week production ML project"). Even anonymised, the prior phrasing brushed against the repo's confidentiality rules; safer to drop the domain context entirely before launch. +- **`docs/quickstart.mdx:116`, `docs/cli/build.mdx:155`, `docs/COMMANDS.md:28`** — three remaining surfaces still claiming "~70-80% reduction" updated to match the measured-honesty framing already in place at `docs/reference/cost-benchmark.mdx`, `docs/concepts/low-token-mode.mdx`, and the README: "**~30% measured savings** on MNIST ($7.75 vs ~$11); 50-60% targeted via Haiku routing + per-phase trims (pending second bench), 70-80% on the roadmap via SDK refactor". The session-025 honesty cascade missed these three. +- **`docs/installation.mdx:130`** — example test output `# 675 passed, 7 skipped in 7s` → `# 743 passed, 7 skipped in 5s`. Current `pytest -q` confirms 743 passed + 7 skipped. +- **`README.md`** — tests badge `735_passing` → `743_passing`; line 508 `735 platform tests` → `743 platform tests`. + +GitHub-side changes (applied via `gh`, not in the PR): +- **Repo description** set: "Autonomous AI research team: write a plan.md, spawn agents, train, verify, and deliver a clean ML repo." +- **Homepage** set: `https://zerooperators.com`. +- **9 topics** applied: `ai-agents`, `claude-code`, `machine-learning`, `mlops`, `autonomous-agents`, `agent-orchestration`, `research-engineering`, `pytorch`, `devtools`. Lifts surface area for the relevant GitHub topic indexes pre-launch. +- **Discussions enabled** via `gh repo edit --enable-discussions=true`. +- **3 seed threads posted** by SamPlvs via GraphQL `createDiscussion`: #80 "Show us your use case" → Show and tell, #81 "Roadmap: VS Code extension vs cost work" → Ideas (asks community to weigh in on the open Tier 2 sequencing question from session 028), #82 "Help wanted: try the MNIST/CIFAR demos" → General. + +**Rationale:** Pre-launch surface area should match the corrective narrative the project has already committed to (PR-005, "enforcement > aspiration"). The session-025 honesty cascade caught the high-traffic surfaces (cost-benchmark, low-token-mode concept page, README's `--low-token` paragraph) but missed three quickstart/CLI/COMMANDS references — they were lower-traffic at the time, but quickstart is the page a new user lands on after the README. Test counts had drifted 8 (735 listed, 743 actual) which is small but visible and erodes trust on close reading. The website origin story's industry detail is the highest-risk item: the public repo's CLAUDE.md treats client/industry naming as a legal-level non-negotiable, and the phrasing was close enough to identify the project even if no name was given. Discussions seeded with substantive prompts (not "hello world" posts) so the first external visitor sees activity worth engaging with, and the roadmap-sequencing thread doubles as a real signal-gathering instrument for session 028's open question. + +**Alternatives considered:** +- **Defer the GitHub metadata + Discussions to a post-launch follow-up** — rejected. The repo currently has empty description/homepage/topics, which means search and discovery surfaces are dead at the moment of launch. Cheap to fix now. +- **Pin the roadmap discussion thread (#81) at the top of Discussions** — user chose "post all 3 as drafted" over "post but pin Thread 2"; pinning preserved as a follow-up option if external signal warrants it. +- **Seed Discussions without posting (let community start)** — rejected. Three substantive prompts give external visitors something to react to and lower the activation energy for first replies. +- **Bundle the launch-prep cascade into the next feature PR (e.g. caveman ablation)** — rejected. Launch is imminent; mixing copy + metadata fixes with a feature change muddles the diff and slows the merge. +- **Keep the website's industry-detail origin story for "credibility"** — rejected. The eight-week constraint and budget framing carry the credibility without naming a domain; the named industry adds risk without proportionate signal. + +**Outcome:** Single conventional commit on a feature branch, PR opened against `main`. validate-docs 9/11 (2 pre-existing warnings, 0 failures — test badge warning re grep vs pytest count is structural, client-blocklist warning is intentional skip). No code or tests touched; no rebench needed. **No new PRIOR added** — this is a maintenance cascade closing out an earlier evolution-driven cleanup (session 025's honesty pass), not a novel failure mode. The generalisation worth recording for future cascades: when a documentation honesty pass ships, sweep all `grep -rn ""` hits across `docs/`, `README.md`, and `website/` in the same PR rather than trusting that the high-traffic surfaces are the only ones that matter. Closed under PR-005's enforcement spirit applied to documentation cleanup. diff --git a/memory/zo-platform/STATE.md b/memory/zo-platform/STATE.md index f364572..e2c71d3 100644 --- a/memory/zo-platform/STATE.md +++ b/memory/zo-platform/STATE.md @@ -8,7 +8,9 @@ status: complete ## Current Position -**Session 028 hand-off — pick up here.** No code shipped — strategic roadmap pass synthesising three pieces of external user feedback into a tiered Q3/H2 roadmap. `docs/roadmap.mdx` rewritten (replaces prior 3-card "top priorities"); full internal plan at `~/.claude/plans/so-before-i-begin-peaceful-valley.md`. **Three workstreams:** (1) **cost** — caveman as scoped Claude Code skill in `--low-token` preset (target +10-20pp on the measured ~30% baseline); (2) **cross-phase autonomy (F3, DAG→weighted graph)** — Phase A = `RETURN_TO_PHASE` `GateDecision` backedge primitive (validates F3 hypothesis cheaply), Phase B = full graph refactor with `completeness` per node + revisit ranking + `zo nodes freeze`; (3) **accessibility** — **VS Code extension** chosen over fork (Cursor/Windsurf) and over from-scratch Tauri/Electron, with **headless tmux as engine + xterm.js webview as renderer** (zero changes to `wrapper.py`, Claude Code keeps real TTY, sub-agent panes render naturally, crash-recoverable); "ZO Studio" bundled installer (VS Code + extension + Claude CLI + uv) as optional follow-on for non-technical researchers. **Multi-provider** (CodeX/OpenCode) reframed as **worker-pool model** (Claude Code lead + single-agent workers via filesystem JSONL), NOT full peer-to-peer parity (would require rebuilding `TeamCreate` / `Agent` / `SendMessage` infrastructure on bare APIs). **Recommended next 2 PRs (Tier 1):** caveman ablation (~1 week) → onboarding hardening (~1 week, one-line installer + Homebrew formula + `zo doctor` command). **Recommended Tier 2 sequencing:** architecture-first (backedge primitive → SDK refactor → VS Code extension); visibility-first defensible if onboarding is the bigger growth lever. **Open questions still pending:** time horizon (Q3 vs H2), Tier 2 sequencing confirmation, audience weighting (research peers / paying customers / internal). Captured in DECISION_LOG entry 2026-05-05T12:00Z (ROADMAP + ARCHITECTURE). No new PR opened yet — roadmap doc + memory updates land in this commit; Tier 1 implementation deferred to user direction. +**Session 029 hand-off — pick up here.** Launch-prep cascade (no code, copy + metadata only). Three external-review fixes shipped: (1) stale 70-80% token claims in `docs/quickstart.mdx:116`, `docs/cli/build.mdx:155`, `docs/COMMANDS.md:28` replaced with measured ~30% + 50-60% targeted + 70-80% roadmap framing (matches the cost-benchmark page already shipped in session 025); (2) stale test counts updated — `docs/installation.mdx:130` 675→743, `README.md` tests badge + line 508 735→743 (verified via `pytest -q`: 743 passed, 7 skipped); (3) website origin story at `website/src/pages/index.html:951` stripped of industry detail ("high-stakes client / power plant" → "eight-week production ML project") to remove confidentiality risk before broad launch push; (4) footer heading at `website/src/pages/index.html:1046` "The people behind it." → "Contributors." per user direction. **GitHub repo metadata applied** via `gh repo edit`: description set, homepage `https://zerooperators.com`, 9 topics (ai-agents, claude-code, machine-learning, mlops, autonomous-agents, agent-orchestration, research-engineering, pytorch, devtools), Discussions enabled. **3 seed discussion threads posted** by SamPlvs: #80 "Show us your use case" (Show and tell), #81 "Roadmap: VS Code extension vs cost work" (Ideas, soliciting Tier 2 sequencing input), #82 "Help wanted: try the MNIST/CIFAR demos" (General). validate-docs 9/11 (2 pre-existing warnings, 0 failures). File changes shipped via PR (this session). Memory + DECISION_LOG updated per auto-protocol. **Next action when picking up:** monitor Discussions for early external-user signal on Tier 2 sequencing (extension vs cost), then resume the Tier 1 recommendation from session 028 (caveman ablation → onboarding hardening). + +**Session 028 hand-off (prior).** No code shipped — strategic roadmap pass synthesising three pieces of external user feedback into a tiered Q3/H2 roadmap. `docs/roadmap.mdx` rewritten (replaces prior 3-card "top priorities"); full internal plan at `~/.claude/plans/so-before-i-begin-peaceful-valley.md`. **Three workstreams:** (1) **cost** — caveman as scoped Claude Code skill in `--low-token` preset (target +10-20pp on the measured ~30% baseline); (2) **cross-phase autonomy (F3, DAG→weighted graph)** — Phase A = `RETURN_TO_PHASE` `GateDecision` backedge primitive (validates F3 hypothesis cheaply), Phase B = full graph refactor with `completeness` per node + revisit ranking + `zo nodes freeze`; (3) **accessibility** — **VS Code extension** chosen over fork (Cursor/Windsurf) and over from-scratch Tauri/Electron, with **headless tmux as engine + xterm.js webview as renderer** (zero changes to `wrapper.py`, Claude Code keeps real TTY, sub-agent panes render naturally, crash-recoverable); "ZO Studio" bundled installer (VS Code + extension + Claude CLI + uv) as optional follow-on for non-technical researchers. **Multi-provider** (CodeX/OpenCode) reframed as **worker-pool model** (Claude Code lead + single-agent workers via filesystem JSONL), NOT full peer-to-peer parity (would require rebuilding `TeamCreate` / `Agent` / `SendMessage` infrastructure on bare APIs). **Recommended next 2 PRs (Tier 1):** caveman ablation (~1 week) → onboarding hardening (~1 week, one-line installer + Homebrew formula + `zo doctor` command). **Recommended Tier 2 sequencing:** architecture-first (backedge primitive → SDK refactor → VS Code extension); visibility-first defensible if onboarding is the bigger growth lever. **Open questions still pending:** time horizon (Q3 vs H2), Tier 2 sequencing confirmation, audience weighting (research peers / paying customers / internal). Captured in DECISION_LOG entry 2026-05-05T12:00Z (ROADMAP + ARCHITECTURE). No new PR opened yet — roadmap doc + memory updates land in this commit; Tier 1 implementation deferred to user direction. **Session 025 hand-off (prior).** PR #58 merged (commit 45b2fdc), PR #59 merged (commit f290311 — watch-training contract enforcement). First `--low-token` MNIST bench completed end-to-end: lead + 3 sub-agents (data-engineer, code-reviewer, test-engineer) all confirmed running on `claude-sonnet-4-6` via `ps aux` — **the override mechanism in `_prompt_low_token_overrides()` ships as designed.** Claude Code 2.1.107 honours `model="claude-sonnet-4-6"` on `Agent()` spawns. SDK refactor previously flagged in STATE no longer needed for the override piece. The bench surfaced two material findings, both addressed: diff --git a/website/src/pages/index.html b/website/src/pages/index.html index e76f4a3..a589028 100644 --- a/website/src/pages/index.html +++ b/website/src/pages/index.html @@ -948,10 +948,10 @@

Where this came from.

Creator note

- I had an 8-week project: a high-stakes client needed a model to optimise - their power plant. Eight weeks to do all of it (data understanding, - cleaning, scaffolding, training, iteration, packaging) and a model - delivered to production. The math didn't work. + I had an eight-week production ML project. Eight weeks to do + all of it (data understanding, cleaning, scaffolding, training, + iteration, packaging) and a model delivered to production. The math + didn't work.

So I asked the obvious question: what if I had a digital research @@ -1043,7 +1043,7 @@