fix: replace orchestrators through safe canonical branch handoff by nikhilachale · Pull Request #2338 · AgentWrapper/agent-orchestrator

nikhilachale · 2026-07-01T18:26:44Z

Summary

Replace clean=true orchestrator restarts with a replacement-specific retire path instead of interactive Kill
Preserve old orchestrator work before force-releasing the canonical worktree branch
Clear stale restore markers so retired orchestrators do not come back on daemon boot
Add shared frontend pending/error state and replacement UI for orchestrator restart flows

Details

Backend:

Sends a best-effort retire notice before replacing an active orchestrator
Adds RetireForReplacement, which:
- captures uncommitted work with StashUncommitted
- clears session_worktrees restore markers
- marks the old orchestrator terminated
- destroys the old runtime best-effort
- force-removes the old workspace so ao/<prefix>-orchestrator can be claimed
Verifies the replacement session is live, is an orchestrator, uses the configured harness when set, and remains on the canonical orchestrator branch

Frontend:

Tracks pending orchestrator replacement state by project
Disables new task / orchestrator actions while replacement is pending
Surfaces orchestrator health states and replacement failures
Adds retry/open-current actions for failed replacement attempts

Test Plan

cd backend && GOCACHE=/private/tmp/go-build-ao-orch-replace go test ./internal/session_manager ./internal/service/session
cd frontend && npm run typecheck

closes #2310

…sesWorkspace

nikhilachale · 2026-07-01T18:32:14Z

Exact flow for switching/replacing an orchestrator now:

User triggers orchestrator restart/replacement from the UI.
Frontend calls:

spawnOrchestrator(projectId, true)

which sends:

POST /api/v1/orchestrators
{ "projectId": "...", "clean": true }

Backend SpawnOrchestrator takes a per-project lock so two replacements cannot race.
Backend loads the project config.
Backend lists active orchestrators for the project.
For each active old orchestrator:
- sends a retire notice:
  
  AO is replacing this project orchestrator. Stop coordinating new work now; a fresh orchestrator will take over on the canonical branch.
- retire notice failure is ignored
- calls RetireForReplacement
RetireForReplacement:
- loads the old session
- if already gone or terminated, no-ops
- if workspace metadata is incomplete:
  - clears any restore marker
  - marks session terminated
  - destroys runtime best-effort
- if workspace exists:
  - captures uncommitted work with StashUncommitted
  - clears session_worktrees restore markers
  - marks old session terminated
  - destroys old runtime best-effort
  - force-removes old worktree, which frees ao/-orchestrator
Backend spawns a fresh orchestrator normally.
Workspace creation claims the canonical branch:

ao/-orchestrator

Backend verifies replacement:

new session is not terminated
kind is orchestrator
harness matches configured orchestrator agent when configured
branch is canonical when present

Backend returns the new orchestrator session.
Frontend invalidates workspace data and navigates to the new orchestrator.

During steps 7 to 8 there is a short no-orchestrator gap by design. After success, there is one active orchestrator on the canonical branch.

@illegalcall

illegalcall

I found a few issues that should be fixed before merging:

P2: Retry can get stuck after a partial replacement retirement failure

RetireForReplacement clears restore markers, marks the old orchestrator terminated, destroys the runtime best-effort, and only then calls ForceDestroy on the old worktree. If ForceDestroy fails, the API returns an error, but the old session is already terminated.

Example: the old canonical orchestrator worktree fails to force-remove because the path cannot be deleted or Git still has it registered. The user sees replacement failed and clicks Retry. On retry, SpawnOrchestrator(clean=true) lists only active orchestrators, so it no longer calls RetireForReplacement for the old session. It then tries to spawn a new orchestrator onto the same canonical branch/path and can keep failing because the old worktree still blocks it. The project is left with no active orchestrator and a retry button that cannot repair the stuck state.

Please keep the old session retirable until the branch/worktree release succeeds, or make the retry path detect and finish incomplete replacement retirements.

P2: The configured orchestrator agent is not hydrated into the workspace model

WorkspaceSummary now has orchestratorAgent, and orchestratorHealth depends on it to show restart_needed. But the normal workspace query is built from GET /api/v1/projects, whose Summary does not include project config, and fetchWorkspaces never maps project.config.orchestrator.agent.

Example: a project is configured to use codex as orchestrator, but the currently running orchestrator is claude-code. After app load or after Project Settings saves and invalidates the workspace query, workspace.orchestratorAgent is undefined, so orchestratorNeedsRestart returns false and the board never shows the restart-needed CTA.

Please either expose the configured orchestrator harness in the project summary endpoint and map it in useWorkspaceQuery, or hydrate the workspace list from project detail/config data.

P2: Duplicate-orchestrator UI still opens the first active orchestrator, not the newest

The new health message says “The newest one is used,” and the backend clean=false path now returns newestSession. The frontend does not match that behavior: findProjectOrchestrator uses .find(...), and SessionsBoard/Sidebar also pick the first active orchestrator from the session list.

Example: sessions are ordered by spawn number as proj-1 then proj-2, and both are active after a replacement race. The duplicate warning says the newest is used, but the Orchestrator button opens proj-1, the older/stale session.

Please make the frontend select the newest active orchestrator using the same CreatedAt/UpdatedAt/id fallback logic as the service, and reuse that helper everywhere the current orchestrator is opened.

…kspace components

nikhilachale · 2026-07-01T21:59:55Z

Retry stuck after partial retirement failure: fixed.
RetireForReplacement now only marks the old orchestrator terminated after ForceDestroy succeeds. If worktree release fails, the old session remains
active/retirable for Retry.
Configured orchestrator agent hydration: fixed.
GET /api/v1/projects summaries now include config, OpenAPI/schema were regenerated, and useWorkspaceQuery maps project.config.orchestrator.agent into
workspace.orchestratorAgent.
Duplicate orchestrator opens first instead of newest: fixed.
Added shared newestActiveOrchestrator(...) with createdAt, updatedAt, id fallback ordering, and reused it in findProjectOrchestrator, SessionsBoard,
Sidebar, and health logic.
@illegalcall

illegalcall

Remaining issue on latest head 3983d37:

P2: Project summaries expose full project config just to hydrate orchestrator agent

The previous replacement-flow issues look fixed, but this fix adds Config *domain.ProjectConfig to the GET /api/v1/projects summary shape and useWorkspaceQuery reads project.config?.orchestrator?.agent from it. That exposes much more than the workspace list needs: env, postCreate, symlinks, reviewer config, and agent config now travel through the project-list endpoint and the polling workspace query.

Example: if a project config contains env: { GITHUB_TOKEN: "..." } or other local runtime variables, opening the normal workspace list now fetches and caches that data even though the UI only needs the configured orchestrator harness.

Please expose a narrow field on the summary, e.g. orchestratorAgent, or a small role summary, instead of returning full ProjectConfig from the list endpoint. Keep full config on GET /api/v1/projects/{id} / settings flows.

Refs: backend/internal/service/project/types.go, backend/internal/service/project/service.go, backend/internal/domain/projectconfig.go, frontend/src/renderer/hooks/useWorkspaceQuery.ts

…related tests

nikhilachale · 2026-07-02T20:24:57Z

GET /api/v1/projects now returns each project summary like:

{
"id": "ao",
"name": "Agent Orchestrator",
"path": "/path/to/repo",
"kind": "single-repo",
"sessionPrefix": "ao",
"orchestratorAgent": "codex"
}
@illegalcall

nikhilachale · 2026-07-02T20:42:36Z

Unknown installed agents: fixed in frontend/src/renderer/components/CreateProjectAgentSheet.tsx
- Installed agents with authStatus: "unknown" are selectable.
- They show Auth unknown with a warning icon.
- Explicit unauthorized still stays disabled as Needs auth.
- Fresh daemon empty catalog: fixed in frontend/src/renderer/routes/_shell.tsx
  - First ready daemon load now uses refreshAgents under the existing agents query key, so /agents/refresh runs
    instead of only cached GET /agents.
- False auth parser: fixed in backend/internal/adapters/agent/authprobe/authprobe.go
  - Explicit false keys like "authenticated": false, "authorized": false, logged_in=false, etc. are checked before
    broad positive substring matches.
  - Covered by new tests in authprobe_test.go.
- Broken docs links: fixed in README.md and docs/README.md
  - Removed stale links to missing docs/agent/README.md and docs/agent/switching.md.

nikhilachale added 2 commits July 1, 2026 23:48

fix: replace orchestrators through canonical handoff

5029ae5

fix(tests): correct logic in TestRetireForReplacementCapturesAndRelea…

82c32c6

…sesWorkspace

illegalcall reviewed Jul 1, 2026

View reviewed changes

fix: enhance tests and add project config handling in service and wor…

3983d37

…kspace components

nikhilachale requested a review from illegalcall July 1, 2026 22:00

illegalcall reviewed Jul 2, 2026

View reviewed changes

fix: update project summary to include orchestrator agent and adjust …

b300c6c

…related tests

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: replace orchestrators through safe canonical branch handoff#2338

fix: replace orchestrators through safe canonical branch handoff#2338
nikhilachale wants to merge 4 commits into
AgentWrapper:mainfrom
nikhilachale:main

nikhilachale commented Jul 1, 2026

Uh oh!

nikhilachale commented Jul 1, 2026

Uh oh!

illegalcall left a comment

Uh oh!

nikhilachale commented Jul 1, 2026 •

edited

Loading

Uh oh!

illegalcall left a comment

Uh oh!

nikhilachale commented Jul 2, 2026

Uh oh!

nikhilachale commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

nikhilachale commented Jul 1, 2026

Summary

Details

Test Plan

Uh oh!

nikhilachale commented Jul 1, 2026

Uh oh!

illegalcall left a comment

Choose a reason for hiding this comment

Uh oh!

nikhilachale commented Jul 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

illegalcall left a comment

Choose a reason for hiding this comment

Uh oh!

nikhilachale commented Jul 2, 2026

Uh oh!

nikhilachale commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

nikhilachale commented Jul 1, 2026 •

edited

Loading