Step 2.75: Replay harness v0 + CI gate #24

@heavygee

Goal

Build the fleet replay harness — captured-stream loader, run-once promotion/prioritization entry point, golden-scenario assertions, the one-boss invariant test stub — and gate it in CI so Overseer logic changes can be told apart from regressions.

Spec

  • docs/plans/2026-06-03-overseer-build-sequence.md Step 2.75 (primary)
  • docs/plans/2026-06-03-overseer-prioritization.md §6 (replay / evaluation harness, golden test cases, KPIs)
  • docs/adr/0001-worker-facing-attribution-one-boss.md §"Invariant test" (the one-boss invariant stub)
  • docs/plans/2026-06-03-overseer-contracts.md §7 (transcript retention — fixtures must not be production transcripts)

Acceptance

  • Captured-event-stream loader: reads events + event_links + inbox_items from a snapshot file, replays into a sandbox DB.
  • Promotion + prioritization run-once entry point invokable against a snapshot without touching the production DB.
  • Golden-scenario assertions for the starter set (prioritization §6 table): 30 routine progress events surface nothing; events sharing a dedupe_key collapse into one item; a root-cause blocked_by chain surfaces the upstream cause, not its symptoms; stale-item aging; etc. Initial target: at least 10 of the listed scenarios.
  • One-boss invariant test stub (ADR-001 §"Invariant test"): for every dispatched event, the corresponding worker-facing messages row carries no Overseer-attribution metadata and the rendered instruction contains no generated attribution boilerplate. It passes vacuously for now (no dispatches yet), but the assertion shape is wired so that Step 4 (Disagreement-capable Overseer + voice dispatch with confirm, #26) activates real coverage automatically.
  • CI gate: harness runs on every PR touching Overseer logic, inbox scoring, event taxonomy, or worker-emission contract. Failure blocks merge.
  • Captured fixtures live under test/fixtures/overseer-replay/ and are NOT production transcripts.
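
To make the loader and run-once bullets concrete, here is a minimal sketch, not the intended implementation. It assumes a snapshot is a JSON object keyed by the three table names above, stores each row as one JSON blob (a hypothetical schema chosen only to keep the example short), and uses a toy `run_once` whose two rules stand in for the real promotion and prioritization logic. The function names and the `kind` field are assumptions; `dedupe_key` comes from the scenario list.

```python
from __future__ import annotations

import json
import sqlite3

TABLES = ("events", "event_links", "inbox_items")  # table names from the acceptance list


def load_snapshot(snapshot: dict) -> sqlite3.Connection:
    """Replay a snapshot dict into a fresh in-memory sandbox DB."""
    db = sqlite3.connect(":memory:")  # sandbox only, never the production DB
    for table in TABLES:
        # Hypothetical schema: one JSON blob per row.
        db.execute(f"CREATE TABLE {table} (body TEXT NOT NULL)")
        rows = [(json.dumps(row),) for row in snapshot.get(table, [])]
        db.executemany(f"INSERT INTO {table} (body) VALUES (?)", rows)
    db.commit()
    return db


def load_snapshot_file(path: str) -> sqlite3.Connection:
    """Same as load_snapshot, but reads the snapshot from a JSON file."""
    with open(path, encoding="utf-8") as fh:
        return load_snapshot(json.load(fh))


def run_once(db: sqlite3.Connection) -> list[dict]:
    """Toy promotion pass: drop routine progress, collapse by dedupe_key."""
    surfaced: dict[str, dict] = {}
    for (body,) in db.execute("SELECT body FROM events ORDER BY rowid"):
        event = json.loads(body)
        if event["kind"] == "progress":
            continue  # assumption: routine progress never surfaces
        surfaced.setdefault(event["dedupe_key"], event)  # first event per key wins
    return list(surfaced.values())


# Golden scenario: 30 routine progress events surface nothing.
routine = {"events": [{"kind": "progress", "dedupe_key": f"p{i}"} for i in range(30)]}
assert run_once(load_snapshot(routine)) == []

# Golden scenario: events with the same dedupe_key collapse into one item.
dupes = {"events": [{"kind": "blocked", "dedupe_key": "build-red"}] * 3}
assert len(run_once(load_snapshot(dupes))) == 1
```

Keeping each scenario as a small snapshot plus one assertion means a failing golden case points at a single fixture rather than a shared setup.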

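The one-boss invariant stub from the acceptance list could be shaped roughly as follows. This is a sketch under stated assumptions: the `WorkerMessage` fields, the forbidden metadata keys, and the forbidden phrases are all hypothetical placeholders, since the real markers would come from ADR-001. Treating a missing messages row as a violation is also an assumption.

```python
from dataclasses import dataclass, field

# Hypothetical markers; the real list would be derived from ADR-001.
FORBIDDEN_METADATA_KEYS = {"overseer_attribution", "dispatched_by_overseer"}
FORBIDDEN_PHRASES = ("overseer says", "on behalf of the overseer")


@dataclass
class WorkerMessage:
    event_id: str
    rendered_instruction: str
    metadata: dict = field(default_factory=dict)


def one_boss_violations(dispatched_event_ids, messages):
    """Return readable violations; an empty list means the invariant holds."""
    by_event = {m.event_id: m for m in messages}
    violations = []
    for event_id in dispatched_event_ids:
        msg = by_event.get(event_id)
        if msg is None:
            # Assumption: a dispatch without a messages row counts as a failure.
            violations.append(f"{event_id}: no worker-facing message row")
            continue
        leaked = FORBIDDEN_METADATA_KEYS.intersection(msg.metadata)
        if leaked:
            violations.append(f"{event_id}: attribution metadata {sorted(leaked)}")
        text = msg.rendered_instruction.lower()
        if any(phrase in text for phrase in FORBIDDEN_PHRASES):
            violations.append(f"{event_id}: attribution boilerplate in instruction")
    return violations


# Vacuous pass today: there are no dispatched events yet.
assert one_boss_violations([], []) == []
```

Because the check iterates over dispatched events, it needs no change once dispatches start appearing in replayed snapshots; it simply stops being vacuous.
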
Out of scope

Dependencies

Suggested PR breakdown

1 PR: replay harness v0; golden scenarios; one-boss invariant test stub; CI gate.

Risks

  • Skipping or under-investing in this step is the single highest-leverage way to fail the whole project. Without harness-backed assertions, Steps 3-4 ship behavior changes that nobody can classify as improving or regressing the persona, and every prompt edit becomes a hand-eval. Build at least the 10-scenario starter set and the one-boss invariant stub.
  • Captured fixtures must NOT be production transcripts (contracts §7). Use sanitized synthetic fixtures or explicitly captured non-production sessions only; otherwise the harness ships operator transcript data into CI logs, artifact storage, and public PR diffs.
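
One way to back the fixture rule with a cheap guard is to require a provenance label on every fixture and fail the harness when one is missing. The sketch below assumes fixtures are JSON objects with a top-level `provenance` field and two allowed labels; both the field name and the labels are hypothetical. A label only records intent and does not prove a fixture was sanitized, so this complements review rather than replacing it.

```python
import json
from pathlib import Path

# Hypothetical labels; the issue only requires "not production transcripts".
ALLOWED_PROVENANCE = ("synthetic", "non-production-capture")


def unlabelled_fixtures(fixture_dir):
    """Return names of JSON fixtures lacking an allowed provenance label."""
    bad = []
    for path in sorted(Path(fixture_dir).glob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        if not isinstance(data, dict) or data.get("provenance") not in ALLOWED_PROVENANCE:
            bad.append(path.name)
    return bad
```

A harness test could then assert `unlabelled_fixtures("test/fixtures/overseer-replay") == []` before any replay runs.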

Metadata

Assignees

No one assigned

    Labels

    architecture: Architectural / substrate work
    fleet-overseer: Fleet attention-arbitration architecture
    mvp: Part of the Overseer MVP acceptance bar (Steps 1-4)

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
