Goal
Measure, in a one-evening experiment, how reliably real worker agents comply with the prompted event-emission contract (contracts §1), so that Step 2's scope and risk are calibrated against real data rather than assumption.
Spec
docs/plans/2026-06-03-overseer-contracts.md §1 (worker event taxonomy, wire format, hub-observed fallback)
docs/plans/2026-06-03-overseer-build-sequence.md Step 2 (the events substrate this informs)
Experiment
- Spawn one worker per flavor (Cursor / Claude / Codex) with the §1 wire-format prompt baked into the system instruction (sentinel-delimited JSON block, schema_version, event objects with event_type + summary).
- Give each a small bounded task (run tests, open a PR, fix a small bug).
- Measure: emission rate (did the worker emit at the moments the taxonomy expects?) and shape conformance (valid JSON inside sentinels, required fields present, sane event_type values).
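The shape-conformance check could be scripted roughly as follows. This is a minimal sketch, not the contract's definition: the sentinel strings, the top-level `events` key, and the set of allowed event_type values are placeholders assumed for illustration (the real ones live in contracts §1), and `check_block` / `check_transcript` are hypothetical helper names.

```python
import json
import re

# ASSUMPTION: placeholder sentinels; the real markers are defined in contracts §1.
BEGIN = "<<<EVENTS"
END = "EVENTS>>>"
# ASSUMPTION: only the event types named in this issue; the full taxonomy is in §1.
KNOWN_EVENT_TYPES = {"completed", "blocked", "stale", "failed"}

BLOCK_RE = re.compile(re.escape(BEGIN) + r"(.*?)" + re.escape(END), re.DOTALL)


def check_block(raw: str) -> list[str]:
    """Return the conformance problems found in one sentinel-delimited block."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return ["invalid JSON"]
    if not isinstance(payload, dict):
        return ["top level is not an object"]

    problems = []
    if "schema_version" not in payload:
        problems.append("missing schema_version")

    # ASSUMPTION: event objects sit in a list under an "events" key.
    events = payload.get("events")
    if not isinstance(events, list):
        problems.append("missing events list")
        return problems

    for i, event in enumerate(events):
        if not isinstance(event, dict):
            problems.append(f"event {i}: not an object")
            continue
        for field in ("event_type", "summary"):
            if field not in event:
                problems.append(f"event {i}: missing {field}")
        if "event_type" in event:
            event_type = event["event_type"]
            if not isinstance(event_type, str) or event_type not in KNOWN_EVENT_TYPES:
                problems.append(f"event {i}: unknown event_type {event_type!r}")
    return problems


def check_transcript(transcript: str) -> list[list[str]]:
    """Check every sentinel-delimited block in a worker transcript."""
    return [check_block(m.group(1)) for m in BLOCK_RE.finditer(transcript)]
```

An empty inner list means the block conformed. An empty outer list means the worker emitted no block at all, which counts against emission rate rather than shape conformance.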
Acceptance
- A short written finding: "compliance is X%, malformed in Y way, hub-observed fallback needs to cover Z."
- The finding names which event types workers reliably emit vs. routinely miss (e.g. completed, blocked, stale).
- The finding recalibrates Step 2's hub-observed synthesis scope (which gaps the fallback must cover).
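One way to produce the "reliably emit vs. routinely miss" breakdown is a per-event-type emission rate over hand-labelled runs. The sketch below assumes each run has been annotated with the (run_id, event_type) pairs the taxonomy expected and the pairs actually observed; `emission_rates` and the run ids in the usage note are hypothetical, not part of the contract.

```python
from collections import Counter


def emission_rates(expected, observed):
    """Map each event_type to the fraction of expected emissions that were observed.

    Both arguments are iterables of (run_id, event_type) pairs.
    ASSUMPTION: at most one expected emission of a given type per run.
    """
    observed_set = set(observed)
    expected = list(expected)
    expected_count = Counter(event_type for _, event_type in expected)
    hit_count = Counter(
        event_type for run_id, event_type in expected
        if (run_id, event_type) in observed_set
    )
    return {event_type: hit_count[event_type] / n
            for event_type, n in expected_count.items()}
```

For example, with expected pairs `("cursor-1", "completed")`, `("claude-1", "completed")`, `("codex-1", "blocked")`, `("codex-1", "stale")` and only the first and third observed, the result is 0.5 for completed, 1.0 for blocked, and 0.0 for stale. Event types with low rates are the gaps the hub-observed fallback would need to cover.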
Out of scope
Dependencies
Kill-criterion
If compliance is under ~40% even with prompt iteration, the prompted-emission contract is the wrong primitive and #22-#26 need a code-level emission API rewrite (see framing doc "things to push back on" #11). Surfacing this early is the entire point of the pre-flight.
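To make the threshold concrete, the check can be written as a small function. This sketch assumes "compliance" means conforming emissions divided by expected emissions across all runs; the issue does not pin that definition down, and the 40% figure is approximate by design, so treat a result near the boundary as a judgment call rather than a hard verdict.

```python
# The "~40%" from the kill-criterion, treated here as a soft threshold.
KILL_THRESHOLD = 0.40


def overall_compliance(expected_emissions: int, conforming_emissions: int) -> float:
    """ASSUMED definition: conforming emissions / expected emissions."""
    if expected_emissions <= 0:
        raise ValueError("need at least one expected emission")
    return conforming_emissions / expected_emissions


def kill_criterion_met(expected_emissions: int, conforming_emissions: int) -> bool:
    """True when compliance falls below the threshold."""
    return overall_compliance(expected_emissions, conforming_emissions) < KILL_THRESHOLD
```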
Risks
- A too-narrow task set could overstate compliance (workers emit well on the happy path, poorly under failure). Include at least one task that fails or stalls so failed / blocked / stale emission is exercised.