28 changes: 28 additions & 0 deletions .agents/plans/live-use-hardening/GOAL.md
@@ -0,0 +1,28 @@
# Live-use hardening - pasteable goal

```text
/goal Work in the dispatch repo root. Implement the live-use hardening plan in .agents/plans/live-use-hardening/PLAN.md.

Context: a real Trails delegation attempt exposed trust failures that unit/parity tests did not catch. Dispatch accepted work that had not proven alive, hid model/system failures behind raw watch events, let slash-command goal text look like a native goal, left daemon lifecycle commands outside the JSON/scriptability contract, and allowed CLI projection hand-wiring to grow beyond the no-drift doctrine.

Objective: make dispatch trustworthy for live agent coordination. Document the incident, add regression tests and guardrails, tighten derived surface boundaries, fix launch/error/status semantics, make cleanup/lifecycle commands agent-safe, update docs/skills, and run local review until no P0/P1/P2 issues remain.

Required outcomes:
- A durable plan/retro records decisions, checks, review findings, and deferred work.
- Public CLI/MCP projections are governed by explicit projection metadata or an allowlisted control-surface contract; ungoverned hand-wired per-op routes are test failures.
- `new`/`send` outputs distinguish accepted delivery from proof of execution.
- `get`/list-like status surfaces expose latest turn/error state well enough that raw `watch` is not required to discover obvious model/system failures.
- `/goal ...` as message text is either rejected/warned or replaced by a first-class `new --goal` path that calls the native goal API.
- Destroy operations have explicit non-interactive confirmation support, and `up`/`down` expose JSON output.
- Registry schema recovery is boring: doctor/up explain or expose a safe migrate/repair path without manual DB surgery.
- Docs, README, skills, plugin docs, schemas/help, tests, and ADR/rules are updated where behavior or doctrine changes.
- Checks pass, including focused tests and `just check`; run local review loops and fix P2+ findings.

Constraints:
- Preserve contract-first/no-drift architecture; if a surface needs special ergonomics, make the override explicit and tested.
- Do not touch live user Codex state in tests. Use isolated `DISPATCH_HOME`/`CODEX_HOME` for any smoke.
- Do not merge, publish, or mutate release state unless explicitly asked.
- If model preflight cannot be made reliable from the current App Server contract, surface the first failure clearly and record the limitation.

Done only when all required outcomes are implemented or explicitly deferred with evidence, local checks pass, review P2+ is clear, and RETRO.md contains final proof.
```
123 changes: 123 additions & 0 deletions .agents/plans/live-use-hardening/PLAN.md
@@ -0,0 +1,123 @@
# Live-use Hardening - implementation plan

One-branch hardening packet for the real-use failures found during a Trails
delegation attempt. Goal loop: [`GOAL.md`](./GOAL.md). References:
[`REFS.md`](./REFS.md). Execution ledger: [`RETRO.md`](./RETRO.md).

## Objective

Make dispatch trustworthy when an agent uses it for real coordination:

- keep surface projection honest and guarded;
- make launch results distinguish "accepted" from "alive/responded";
- surface latest turn/model/system failures in normal status surfaces;
- make native goals first-class instead of relying on slash-command text;
- make daemon lifecycle and destructive cleanup scriptable;
- make registry migration/recovery safe and obvious;
- update docs, skills, and tests so this class of failure does not sneak
through again.

## Incident facts

The real Trails use case found these product failures:

- A stale registry with schema v1 and missing tables required manual DB backup
and recreation because `dispatch up` no-oped while a daemon answered.
- `dispatch up --json` failed even though most agent-operated commands are
JSON-shaped.
- `dispatch new --model gpt-5.5-codex --text "$goal_prompt"` used a stale,
guessed explicit model id and returned
`sent: true` and `status: idle`, but no assistant work happened.
- `/goal ...` sent as initial text did not create native goal state.
- The unsupported model failure was only obvious through `dispatch watch`, not
`dispatch get`.
- `trigger rm --json` and `archive --json` still required interactive stdin.
- The existing parity/handler tests stayed green.

## Root causes to address

1. Projection doctrine is written down, but CLI has bespoke route functions and
control commands without an enforceable manifest/allowlist.
2. Tests prove routing and accepted calls, not live coordination trust.
3. Normal state models do not persist latest turn failures or suspicious
no-assistant completions.
4. Goal text and native App Server goals are separate, but docs/skills do not
make the boundary loud enough.
5. Integration tests are intentionally out of the default gate, so real
semantics need cheap fake-level regression tests plus release smoke guidance.
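Root causes 2, 3, and 5 can be illustrated with a cheap fake-level regression test. The sketch below is a minimal illustration, not dispatch's code: `LaneState`, `FakeAppServer`, `apply_events`, and the event names are all hypothetical stand-ins for the real registry model, App Server client, and reactor.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical event shape: (kind, payload). Not the real App Server wire format.
Event = Tuple[str, Optional[str]]


@dataclass
class LaneState:
    """Hypothetical lane record; the real registry model may differ."""
    status: str = "idle"
    last_turn_status: Optional[str] = None
    last_error: Optional[str] = None


class FakeAppServer:
    """Accepts every turn request, but fails turns for models it does not know."""

    def __init__(self, supported_models):
        self.supported_models = set(supported_models)

    def start_turn(self, model: str) -> List[Event]:
        if model not in self.supported_models:
            return [("turn_failed", f"unsupported model: {model}")]
        return [("assistant_message", "done"), ("turn_completed", None)]


def apply_events(lane: LaneState, events: List[Event]) -> LaneState:
    """Fold one turn's events into persisted lane state."""
    saw_assistant = False
    for kind, payload in events:
        if kind == "assistant_message":
            saw_assistant = True
        elif kind == "turn_failed":
            lane.last_turn_status = "failed"
            lane.last_error = payload
        elif kind == "turn_completed":
            # A completion with no assistant output is recorded as suspicious.
            lane.last_turn_status = (
                "completed" if saw_assistant else "completed_without_assistant"
            )
            lane.last_error = None
    return lane


def test_unsupported_model_is_visible_without_watch():
    server = FakeAppServer(supported_models={"known-model"})
    lane = apply_events(LaneState(), server.start_turn("guessed-model"))
    # The lane still looks idle, so status alone would hide the failure.
    assert lane.status == "idle"
    assert lane.last_turn_status == "failed"
    assert lane.last_error == "unsupported model: guessed-model"
```

The point of the fake is that the failure is asserted on persisted state, which is what a normal status surface would read, rather than on a raw event stream.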

## Implementation chunks

### Chunk 1 - regression tests and projection guardrails

- Add failing tests for:
  - destroy commands supporting an explicit non-interactive confirmation flag;
  - `up --json` / `down --json`;
  - `/goal` text guard or first-class `new --goal`;
  - `TurnFailed.message` being persisted and exposed by `get`;
  - `new` not overclaiming that a turn produced work.
- Introduce CLI projection metadata/manifest or a strict allowlist that
  classifies public commands as:
  - op projection;
  - composed op projection;
  - surface control.
- Add tests that fail for ungoverned public commands and mismatched schema/help
  routes.
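One way the classification and the "ungoverned command" guard could look is sketched below. Everything here is an assumption for illustration: `RouteKind`, `MANIFEST`, and `ungoverned` are hypothetical names, and the command list is a small sample, not the real CLI tree.

```python
from enum import Enum
from typing import Dict, Iterable, List


class RouteKind(Enum):
    OP = "op projection"
    COMPOSED = "composed op projection"
    CONTROL = "surface control"


# Hypothetical manifest: every public command must appear here exactly once.
MANIFEST: Dict[str, RouteKind] = {
    "new": RouteKind.OP,
    "get": RouteKind.OP,
    "list --unmanaged": RouteKind.COMPOSED,  # composes the `discover` op
    "doctor": RouteKind.CONTROL,
    "up": RouteKind.CONTROL,
    "down": RouteKind.CONTROL,
}


def ungoverned(public_commands: Iterable[str], manifest: Dict[str, RouteKind]) -> List[str]:
    """Return public commands that have no manifest entry, sorted for stable output."""
    return sorted(set(public_commands) - set(manifest))


def test_every_public_command_is_governed():
    # In a real test this list would be read from the CLI itself, not written by hand.
    public = ["new", "get", "list --unmanaged", "doctor", "up", "down"]
    assert ungoverned(public, MANIFEST) == []
```

A new command added to the CLI without a manifest entry would make `ungoverned` non-empty, so the test fails until someone classifies the route on purpose.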

### Chunk 2 - launch, goal, and status semantics

- Replace or supplement `NewLane.sent` with explicit launch fields such as
`message_accepted`, `goal_set`, `first_turn`, and/or a structured launch
result. Maintain honest naming in docs/schemas.
- Add `NewInput.goal` or equivalent. If text starts with `/goal` and no native
goal field is used, fail or warn clearly.
- Persist latest turn/error state in the registry and expose it in `get`,
relevant list outputs, and MCP schemas.
- Ensure model/system failures show in normal status without raw `watch`.
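A minimal sketch of the launch result and the `/goal` guard described in this chunk follows. The field names come from the bullet above; the `LaunchResult` class, `GoalTextError`, and `check_initial_text` are hypothetical and only illustrate the "fail clearly" option.

```python
from dataclasses import dataclass
from typing import Optional


class GoalTextError(ValueError):
    """Raised when slash-command goal text is sent as a plain message."""


@dataclass(frozen=True)
class LaunchResult:
    """Hypothetical launch payload: acceptance is reported separately from work."""
    message_accepted: bool     # the App Server accepted the first message
    goal_set: bool             # a native goal was created
    first_turn: Optional[str]  # an identifier for the first turn, if one started


def check_initial_text(text: str, goal: Optional[str]) -> None:
    """Reject `/goal ...` text unless a native goal was supplied."""
    tokens = text.split(maxsplit=1)
    if goal is None and tokens and tokens[0] == "/goal":
        raise GoalTextError(
            "'/goal' in message text does not create a native goal; "
            "pass the goal through the native goal field instead"
        )
```

Matching on the first whitespace-delimited token keeps ordinary text such as `/goalpost` or a mid-sentence `/goal` from being rejected.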

### Chunk 3 - scriptable surfaces and registry recovery

- Add `--yes`/`--no-interactive` support for destroy-intent CLI commands from
projection rules, not one-off commands.
- Add JSON output to `up` and `down`.
- Improve doctor recovery for versioned missing tables.
- Add a safe registry migrate/repair command or lifecycle helper if it can be
done without broad architecture churn. At minimum, make `up`/doctor refuse
misleading no-op recovery and provide exact safe commands.
- Add tests for older schema v1/v2 cases with existing lanes/triggers.
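The missing-table diagnosis can be sketched with the standard `sqlite3` module. This assumes the registry is a SQLite database, which the plan implies but does not state here; the two table names come from the field report, and `REQUIRED_TABLES`, `missing_tables`, and `backup_to` are hypothetical helpers rather than dispatch functions.

```python
import sqlite3
from typing import List

# Table names taken from the field report; the full required set is an assumption.
REQUIRED_TABLES = {"lane_snapshots", "lane_sync_sources"}


def missing_tables(conn: sqlite3.Connection) -> List[str]:
    """List required tables that are absent from the connected database."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    present = {row[0] for row in rows}
    return sorted(REQUIRED_TABLES - present)


def backup_to(conn: sqlite3.Connection, dest_path: str) -> None:
    """Copy the database to dest_path before any repair touches it."""
    dest = sqlite3.connect(dest_path)
    try:
        conn.backup(dest)  # online backup API from the standard library
    finally:
        dest.close()
```

A doctor-style check could call `missing_tables` and, when the result is non-empty, print the exact safe command instead of letting `up` no-op against a stale registry.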

### Chunk 4 - docs, skills, and release smoke

- Update README, docs/usage, skills/dispatch, skills/dm if affected, plugin docs,
AGENTS/rules/ADRs where behavior or doctrine changed.
- Add a documented pre-release/live-dogfood smoke that uses isolated state and
proves lane liveness.
- Update examples/schema expectations.

### Chunk 5 - local review and finalization

- Run focused tests after each chunk.
- Run `just check`.
- Run a local review pass focused on P0/P1/P2:
  - surface derivation drift;
  - live-use trust;
  - destructive/scripted safety;
  - registry migration safety;
  - docs/skill truthfulness.
- Fix P2+; fix cheap P3s; record deferred P3s in RETRO.

## Deferral policy

Acceptable deferrals only if recorded in RETRO with evidence:

- account-specific model preflight if `model/list`/verification does not expose
reliable support in the current App Server;
- optional Graphite/worktree ownership, which is useful but not the root control
plane trust failure;
- long-lived streaming subscriptions beyond bounded `watch`.

## Done

Done only when tests and docs prove the full objective, `just check` passes, a
review loop has no unresolved P0/P1/P2, and RETRO contains exact verification
commands, final git state, remaining risks, and PR state if submitted.
59 changes: 59 additions & 0 deletions .agents/plans/live-use-hardening/REFS.md
@@ -0,0 +1,59 @@
# Live-use Hardening - references

## Field report

- `/tmp/trails-dispatch-real-use-feedback-2026-06-07.md`
  - Registry schema v1 missing `lane_snapshots` and `lane_sync_sources`.
  - `dispatch up --json` unsupported; `dispatch up` no-oped against running
    daemon.
  - `dispatch new --model gpt-5.5-codex --text "$goal_prompt"` used a stale,
    guessed explicit model id and returned `sent: true`, `status: idle`, but no
    assistant response and no goal state.
  - `watch` surfaced unsupported model error; `get` did not.
  - Destroy cleanup required `printf 'y\n' | ... --json`.

## Architecture docs

- `docs/adrs/0000-contract-first-surface-derived.md`
  - Every surface is a pure projection of one op registry.
  - Parity tests must check behavior, not only names.
- `docs/adrs/0010-surface-projections-are-ergonomic-not-isomorphic.md`
  - Surfaces may group/rename/compose, but may not restate schemas, examples,
    safety intent, error behavior, or capability policy.
- `.claude/rules/contracts.md`
  - Overrides must be visible escape hatches, not default hand wiring.
- `.claude/rules/surfaces.md`
  - Surface modules contain projection wiring only.

## Code hot spots

- `src/outfitter/dispatch/contracts/derive_cli.py`
  - CLI projection, custom route functions, schema route table, destroy prompt.
- `src/outfitter/dispatch/surfaces/cli.py`
  - `doctor`, `up`, `down`, and `mcp` hand-written control commands.
- `src/outfitter/dispatch/core/handlers.py`
  - `new_lane`, `show`, send/goal handlers.
- `src/outfitter/dispatch/core/reactor.py`
  - `TurnFailed` currently updates status but does not persist message.
- `src/outfitter/dispatch/registry/store.py`
  - Schema migrations and registry state.
- `src/outfitter/dispatch/doctor.py`
  - Registry diagnostics and recovery hints.

## Existing tests

- `tests/surfaces/test_parity.py`
- `tests/surfaces/test_derive_cli.py`
- `tests/core/test_handlers.py`
- `tests/test_doctor.py`
- `tests/integration/test_daemon_e2e.py`
- `tests/integration/test_app_server.py`

## Verification commands

```bash
uv run pytest tests/surfaces/test_parity.py tests/surfaces/test_derive_cli.py tests/test_doctor.py tests/core/test_handlers.py -q
just check
```

Optional live smoke must use isolated runtime paths.
64 changes: 64 additions & 0 deletions .agents/plans/live-use-hardening/RETRO.md
@@ -0,0 +1,64 @@
# Live-use Hardening - execution ledger

Durable execution ledger for the live-use trust hardening goal.

## Timeline

- 2026-06-07: Created packet on `feat/live-use-hardening` after the Trails
real-use report showed green tests missing operator trust failures.
- 2026-06-07: Implemented runtime turn-state persistence, native `new --goal`,
honest `message_accepted` launch output, scriptable daemon lifecycle output,
destroy-command confirmation flags, registry migration recovery, and explicit
CLI projection/control manifests.
- 2026-06-07: Updated README, usage docs, dispatch skill, plugin README,
development design notes, agent rules, and ADR-0020.

## Checks

- `uv run pytest tests/test_doctor.py tests/surfaces/test_parity.py tests/surfaces/test_derive_cli.py tests/core/test_handlers.py tests/registry/test_store.py -q`
  - `103 passed`
- `uv run pytest -q`
  - `210 passed, 9 deselected`
- `just check`
  - `ruff check`: passed
  - `ruff format --check`: passed
  - `mypy src tests`: passed
  - `pytest`: `210 passed, 9 deselected`
  - `uv build`: built `outfitter_dispatch-0.4.0` sdist/wheel
  - `scripts/check_package_contents.py`: passed
- CLI smoke:
  - `uv run dispatch schema new | jq -r ...`
    - verified goal and `message_accepted` schema descriptions.
  - `uv run dispatch schema 'list --unmanaged' | jq -r .op`
    - returned `discover`.
  - `uv run dispatch schema 'tail --follow'`
    - exited `2` with a clean unknown-command error, matching current docs.
  - `uv run dispatch registry migrate --help`
    - showed JSON/text, backup, and controlled-running options.

## Review

- P0/P1/P2 review pass:
  - Verified projection guardrails cover op-backed CLI routes, schema spellings,
    and full CLI surface-control allowlist.
  - Verified synchronous `turn/start` failures no longer leave registered lanes
    looking idle; `new`/`send` now persist latest error state before re-raising.
  - Verified `TurnFailed.message` projects through reactor -> registry -> `get`.
  - Verified `/goal ...` initial text is rejected unless callers use native
    `--goal`, and native goal set happens before the initial turn.
  - Verified old registry recovery has doctor guidance plus `registry migrate`
    tests, including daemon-running refusal.
  - Verified docs/skill/plugin/rules/ADR describe the changed behavior and
    current limitations.
- Unresolved P0/P1/P2: none found in local review.

## Deferred

- Account/model preflight remains deferred. The current App Server client
accepts model strings on thread/turn options but does not expose a cheap,
reliable account-specific model support check in dispatch's verified contract.
The implemented mitigation is to persist and expose App Server failures through
ordinary status surfaces instead of requiring raw `watch`.
- Infinite streaming remains deferred. `watch` is still a bounded live event
sample over a request/response control socket; a subscription-capable control
socket remains future work.
3 changes: 2 additions & 1 deletion .claude/rules/client.md
@@ -19,12 +19,13 @@ Demux the single stream: responses by request `id`, notifications by `threadId`,
- `thread/start.sandbox` is a **string** enum (`read-only`/`workspace-write`/`danger-full-access`); `turn/start.sandboxPolicy` is an **object** (`{type:"readOnly", ...}`). Different encodings — model both.
- `turn/steer` requires `expectedTurnId` (from `turn/started`).
- `thread/list` results are under `result.data` (not `result.threads`); `useStateDbOnly:true` reads the persisted store.
- Current `thread/list` supports native `archived`, `cwd`, `searchTerm`, `sourceKinds`, and sort filters; use them when they match dispatch semantics, then keep registry/authority filters in core.
- `thread/search` is experimental; enable the experimental API capability before using it and keep the wrapper thin.
- `thread/resume` of a *persisted* thread yields live event fan-out; pre-persistence it errors `no rollout found`.
- Approvals are server→client requests: lane emits `thread/status/changed` `activeFlags:["waitingOnApproval"]`; reply `{id, result:{decision}}` (`accept`/`acceptForSession`/`decline`/`cancel`); server emits `serverRequest/resolved`. File-change approvals carry **no diff** — correlate by `itemId` to the `fileChange` item.
- Threads persist by default (`ephemeral:false`). Pass `ephemeral:true` for throwaway/test lanes.
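The approval reply shape in the bullets above can be sketched as a small typed helper. `approval_reply` and `APPROVAL_DECISIONS` are hypothetical names; the four decision strings and the `{id, result:{decision}}` shape are taken from the rule text, and any additional envelope fields the wire protocol may require are deliberately left out.

```python
import json
from typing import Any, Dict, Union

# Decision values as listed in the rule above.
APPROVAL_DECISIONS = frozenset({"accept", "acceptForSession", "decline", "cancel"})


def approval_reply(request_id: Union[int, str], decision: str) -> Dict[str, Any]:
    """Build the reply to a server-to-client approval request.

    The reply echoes the request id so the server can correlate it.
    """
    if decision not in APPROVAL_DECISIONS:
        raise ValueError(f"unknown approval decision: {decision!r}")
    return {"id": request_id, "result": {"decision": decision}}


# Serialize for the transport layer; framing is out of scope for this sketch.
payload = json.dumps(approval_reply(7, "decline"))
```

Validating the decision before serializing keeps a typo such as `accepted` from reaching the server as a malformed reply.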

## Discipline

-- Pin the binary; regenerate wire models from `codex app-server generate-json-schema` for that version. Do NOT depend on the `openai-codex` Python SDK (it pins an older CLI).
+- Pin/record the binary; regenerate wire models from `codex app-server generate-json-schema` for that version. Do not assume the `openai-codex` Python SDK matches the installed CLI; it has lagged before.
- No business logic here — this layer is transport + typed primitives only. Orchestration lives in `core/`.
8 changes: 8 additions & 0 deletions .claude/rules/contracts.md
@@ -20,6 +20,11 @@ compose ops (for example `list --unmanaged` → `discover`, `goal status` →
derived from the registry; never hand-implement the same behavior separately in a
surface.

If the CLI needs custom shell grammar, declare it in the CLI projection manifest
(`CliRoute`, `cli_public_routes`, and when needed `cli_schema_routes`). Do not add
or special-case a command path without a parity test proving the path reaches the
canonical op and that `dispatch schema <command>` reports the canonical op schema.

## Derivation (never hand-write a surface per op)

Surfaces are pure projections of the registry, mirroring Trails' `derive* → create* → surface`:
@@ -44,6 +49,9 @@ One `DispatchError` hierarchy in `errors.py` (e.g. `NotFoundError`, `LaneBusyErr
- Adding capability = adding an op, registering it, and ensuring the derived
projections route it intentionally. If a route is missing, the parity tests
should fail.
- If a route is intentionally a surface control rather than an op (`doctor`,
`up`, `down`, `registry migrate`, `schema`, `mcp`), document why and keep it out
of per-op business logic.
- Every op exposed on MCP/remote must define `output`.
- Keep handlers pure-ish: input in, output out (or raise). Side effects go through injected dependencies (the App Server client, the registry) passed via `ctx`, never imported ad hoc.
- A parity test must stay green — and it checks **behavior/reachability, not
5 changes: 5 additions & 0 deletions .claude/rules/surfaces.md
@@ -9,6 +9,9 @@ Path: `src/outfitter/dispatch/surfaces/`. Each surface is a thin, generated proj
tree may group/alias ops for shell ergonomics, but each command marshals
contract input → calls the daemon control socket → renders the result with
Rich. The CLI is a **sync** client; it does not import `core/` or `client/`.
Process/control commands such as `doctor`, `up`, `down`, `registry migrate`,
`schema`, and `mcp` are the allowed exceptions: they manage or inspect the
surface/runtime itself and must not duplicate op behavior.
- **MCP** (`mcp.py`): a stdio MCP server (via the `mcp` SDK) from
`derive_mcp(registry)`; grouped tool handlers route to the daemon control
socket, same as the CLI. Spawned by the MCP client (Claude/Codex), not hosted
@@ -18,6 +21,8 @@ Path: `src/outfitter/dispatch/surfaces/`. Each surface is a thin, generated proj
- Keep the **parity test** green: every registered op must be reachable through
each surface's derived projection with matching schemas, annotations, and error
projection. Surface names do not need to equal op ids.
- Destroy-intent CLI routes must preserve the derived confirmation behavior:
prompt interactively, and require `--yes` when paired with `--no-interactive`.
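The confirmation rule for destroy-intent routes can be sketched as one shared function, so that no command implements its own prompt. `confirm_destroy` and `ConfirmationRequired` are hypothetical names, and the `yes`/`interactive` parameters stand in for the `--yes` and `--no-interactive` flags.

```python
from typing import Callable


class ConfirmationRequired(RuntimeError):
    """Raised when a destroy command cannot prompt and was not pre-confirmed."""


def confirm_destroy(
    *,
    yes: bool,
    interactive: bool,
    what: str = "this resource",
    ask: Callable[[str], str] = input,
) -> bool:
    """Decide whether a destroy-intent command may proceed.

    - `yes=True` (the `--yes` flag) confirms without prompting.
    - `interactive=False` (the `--no-interactive` flag) without `yes` is an
      error, never a silent prompt on stdin.
    - Otherwise the user is asked, and only an explicit y/yes proceeds.
    """
    if yes:
        return True
    if not interactive:
        raise ConfirmationRequired(
            f"refusing to destroy {what}: pass --yes when using --no-interactive"
        )
    answer = ask(f"Destroy {what}? [y/N] ")
    return answer.strip().lower() in {"y", "yes"}
```

Injecting `ask` keeps the function testable without touching real stdin, which matters for scripted and agent-driven callers.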

## Why

18 changes: 14 additions & 4 deletions README.md
@@ -12,6 +12,8 @@ uv tool install outfitter-dispatch
dispatch --help
dispatchd --help
dispatch doctor
dispatch up --json
dispatch down --json
```

From a source checkout:
@@ -20,7 +22,7 @@ From a source checkout:
uv sync
uv run dispatch --help
uv run dispatch doctor --no-app-server
-uv run dispatch up
+uv run dispatch up --json
uv run dispatch daemon status
```

@@ -30,12 +32,13 @@ Create an owned managed thread, send it work, and inspect the daemon:
uv run dispatch new \
--name docs \
--cwd /path/to/dispatch \
--goal "Finish the docs review." \
--text "Please summarize the current stack state."
uv run dispatch list
uv run dispatch get <dispatch-ref>
uv run dispatch tail <dispatch-ref> --limit 20
uv run dispatch goal set <dispatch-ref> "Finish the docs review."
uv run dispatch daemon log --limit 10
-uv run dispatch down
+uv run dispatch down --json
```

Use owned managed threads for turn-writing work. Existing desktop Codex threads can be attached as
@@ -48,12 +51,19 @@ unmanaged Codex thread ids, and `search` can span both. Attach is metadata-only
default; use `dispatch sync <selector>` when you want dispatch to refresh its local
indexed view of an attached thread.

`new` reports whether the first message was accepted by the App Server, not whether
assistant work completed. Use `get` to inspect the latest turn state and persisted
App Server errors, or `watch` for a bounded live event sample. Slash commands in
`--text` are plain text; use `--goal` when creating a native App Server goal.

For the operator guide, CLI/MCP examples, triggers, and plugin setup, start at
[`docs/usage/README.md`](docs/usage/README.md).

Start troubleshooting with `dispatch doctor`. It checks PATH visibility, the Codex CLI
and auth footprint, daemon socket/pidfile state, registry schema/integrity, packaged
-skills/plugin assets, and a low-risk Codex App Server initialize smoke.
+skills/plugin assets, and a low-risk Codex App Server initialize smoke. If doctor reports
+an old registry schema, stop the daemon and run `dispatch registry migrate` before
+starting it again.

## Agent And Plugin Support
