Skip to content

fix: replace orchestrators through safe canonical branch handoff#2338

Open
nikhilachale wants to merge 4 commits into
AgentWrapper:mainfrom
nikhilachale:main
Open

fix: replace orchestrators through safe canonical branch handoff#2338
nikhilachale wants to merge 4 commits into
AgentWrapper:mainfrom
nikhilachale:main

Conversation

@nikhilachale

Copy link
Copy Markdown
Collaborator

Summary

  • Replace clean=true orchestrator restarts with a replacement-specific retire path instead of interactive Kill
  • Preserve old orchestrator work before force-releasing the canonical worktree branch
  • Clear stale restore markers so retired orchestrators do not come back on daemon boot
  • Add shared frontend pending/error state and replacement UI for orchestrator restart flows

Details

Backend:

  • Sends a best-effort retire notice before replacing an active orchestrator
  • Adds RetireForReplacement, which:
    • captures uncommitted work with StashUncommitted
    • clears session_worktrees restore markers
    • marks the old orchestrator terminated
    • destroys the old runtime best-effort
    • force-removes the old workspace so ao/<prefix>-orchestrator can be claimed
  • Verifies the replacement session is live, is an orchestrator, uses the configured harness when set, and remains on the canonical orchestrator branch

Frontend:

  • Tracks pending orchestrator replacement state by project
  • Disables new task / orchestrator actions while replacement is pending
  • Surfaces orchestrator health states and replacement failures
  • Adds retry/open-current actions for failed replacement attempts

Test Plan

  • cd backend && GOCACHE=/private/tmp/go-build-ao-orch-replace go test ./internal/session_manager ./internal/service/session
  • cd frontend && npm run typecheck

closes #2310

@nikhilachale

Copy link
Copy Markdown
Collaborator Author

Exact flow for switching/replacing an orchestrator now:

  1. User triggers orchestrator restart/replacement from the UI.
  2. Frontend calls:

spawnOrchestrator(projectId, true)

which sends:

POST /api/v1/orchestrators
{ "projectId": "...", "clean": true }

  1. Backend SpawnOrchestrator takes a per-project lock so two replacements cannot race.

  2. Backend loads the project config.

  3. Backend lists active orchestrators for the project.

  4. For each active old orchestrator:

    • sends a retire notice:

      AO is replacing this project orchestrator. Stop coordinating new work now; a fresh orchestrator will take over on the canonical branch.

    • retire notice failure is ignored

    • calls RetireForReplacement

  5. RetireForReplacement:

    • loads the old session

    • if already gone or terminated, no-ops

    • if workspace metadata is incomplete:

      • clears any restore marker
      • marks session terminated
      • destroys runtime best-effort
    • if workspace exists:

      • captures uncommitted work with StashUncommitted
      • clears session_worktrees restore markers
      • marks old session terminated
      • destroys old runtime best-effort
      • force-removes old worktree, which frees ao/-orchestrator
  6. Backend spawns a fresh orchestrator normally.

  7. Workspace creation claims the canonical branch:

ao/-orchestrator

  1. Backend verifies replacement:
  • new session is not terminated
  • kind is orchestrator
  • harness matches configured orchestrator agent when configured
  • branch is canonical when present
  1. Backend returns the new orchestrator session.
  2. Frontend invalidates workspace data and navigates to the new orchestrator.

During steps 7 to 8 there is a short no-orchestrator gap by design. After success, there is one active orchestrator on the canonical branch.

@illegalcall

@illegalcall illegalcall left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found a few issues that should be fixed before merging:

  1. P2: Retry can get stuck after a partial replacement retirement failure

RetireForReplacement clears restore markers, marks the old orchestrator terminated, destroys the runtime best-effort, and only then calls ForceDestroy on the old worktree. If ForceDestroy fails, the API returns an error, but the old session is already terminated.

Example: the old canonical orchestrator worktree fails to force-remove because the path cannot be deleted or Git still has it registered. The user sees replacement failed and clicks Retry. On retry, SpawnOrchestrator(clean=true) lists only active orchestrators, so it no longer calls RetireForReplacement for the old session. It then tries to spawn a new orchestrator onto the same canonical branch/path and can keep failing because the old worktree still blocks it. The project is left with no active orchestrator and a retry button that cannot repair the stuck state.

Please keep the old session retirable until the branch/worktree release succeeds, or make the retry path detect and finish incomplete replacement retirements.

  1. P2: The configured orchestrator agent is not hydrated into the workspace model

WorkspaceSummary now has orchestratorAgent, and orchestratorHealth depends on it to show restart_needed. But the normal workspace query is built from GET /api/v1/projects, whose Summary does not include project config, and fetchWorkspaces never maps project.config.orchestrator.agent.

Example: a project is configured to use codex as orchestrator, but the currently running orchestrator is claude-code. After app load or after Project Settings saves and invalidates the workspace query, workspace.orchestratorAgent is undefined, so orchestratorNeedsRestart returns false and the board never shows the restart-needed CTA.

Please either expose the configured orchestrator harness in the project summary endpoint and map it in useWorkspaceQuery, or hydrate the workspace list from project detail/config data.

  1. P2: Duplicate-orchestrator UI still opens the first active orchestrator, not the newest

The new health message says “The newest one is used,” and the backend clean=false path now returns newestSession. The frontend does not match that behavior: findProjectOrchestrator uses .find(...), and SessionsBoard/Sidebar also pick the first active orchestrator from the session list.

Example: sessions are ordered by spawn number as proj-1 then proj-2, and both are active after a replacement race. The duplicate warning says the newest is used, but the Orchestrator button opens proj-1, the older/stale session.

Please make the frontend select the newest active orchestrator using the same CreatedAt/UpdatedAt/id fallback logic as the service, and reuse that helper everywhere the current orchestrator is opened.

@nikhilachale

nikhilachale commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator Author
  1. Retry stuck after partial retirement failure: fixed.
    RetireForReplacement now only marks the old orchestrator terminated after ForceDestroy succeeds. If worktree release fails, the old session remains
    active/retirable for Retry.

  2. Configured orchestrator agent hydration: fixed.
    GET /api/v1/projects summaries now include config, OpenAPI/schema were regenerated, and useWorkspaceQuery maps project.config.orchestrator.agent into
    workspace.orchestratorAgent.

  3. Duplicate orchestrator opens first instead of newest: fixed.
    Added shared newestActiveOrchestrator(...) with createdAt, updatedAt, id fallback ordering, and reused it in findProjectOrchestrator, SessionsBoard,
    Sidebar, and health logic.
    @illegalcall

@nikhilachale nikhilachale requested a review from illegalcall July 1, 2026 22:00

@illegalcall illegalcall left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remaining issue on latest head 3983d37:

  1. P2: Project summaries expose full project config just to hydrate orchestrator agent

The previous replacement-flow issues look fixed, but this fix adds Config *domain.ProjectConfig to the GET /api/v1/projects summary shape and useWorkspaceQuery reads project.config?.orchestrator?.agent from it. That exposes much more than the workspace list needs: env, postCreate, symlinks, reviewer config, and agent config now travel through the project-list endpoint and the polling workspace query.

Example: if a project config contains env: { GITHUB_TOKEN: "..." } or other local runtime variables, opening the normal workspace list now fetches and caches that data even though the UI only needs the configured orchestrator harness.

Please expose a narrow field on the summary, e.g. orchestratorAgent, or a small role summary, instead of returning full ProjectConfig from the list endpoint. Keep full config on GET /api/v1/projects/{id} / settings flows.

Refs: backend/internal/service/project/types.go, backend/internal/service/project/service.go, backend/internal/domain/projectconfig.go, frontend/src/renderer/hooks/useWorkspaceQuery.ts

@nikhilachale

Copy link
Copy Markdown
Collaborator Author

GET /api/v1/projects now returns each project summary like:

{
"id": "ao",
"name": "Agent Orchestrator",
"path": "/path/to/repo",
"kind": "single-repo",
"sessionPrefix": "ao",
"orchestratorAgent": "codex"
}
@illegalcall

@nikhilachale

Copy link
Copy Markdown
Collaborator Author
  • Unknown installed agents: fixed in frontend/src/renderer/components/CreateProjectAgentSheet.tsx
    - Installed agents with authStatus: "unknown" are selectable.
    - They show Auth unknown with a warning icon.
    - Explicit unauthorized still stays disabled as Needs auth.

    • Fresh daemon empty catalog: fixed in frontend/src/renderer/routes/_shell.tsx

      • First ready daemon load now uses refreshAgents under the existing agents query key, so /agents/refresh runs
        instead of only cached GET /agents.
    • False auth parser: fixed in backend/internal/adapters/agent/authprobe/authprobe.go

      • Explicit false keys like "authenticated": false, "authorized": false, logged_in=false, etc. are checked before
        broad positive substring matches.

      • Covered by new tests in authprobe_test.go.

    • Broken docs links: fixed in README.md and docs/README.md

      • Removed stale links to missing docs/agent/README.md and docs/agent/switching.md.
Screenshot 2026-07-03 at 02 02 48

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Project Settings allows switching to unsupported orchestrator agents without preflight install/auth checks

2 participants