fix(terminal): prevent multi-tab disconnect from killing shared control-mode process#107
Conversation
…pture for OneShot
Two improvements to OneShot (-p) session control:
1. **Fix initial prompt not submitting** (\r instead of \n): The session driver
was sending initialPrompt + "\n" (LF) to the PTY. Claude Code's interactive
readline needs \r (CR) to submit — the same signal a physical Enter key sends.
Sessions received the typed text but never executed it, causing inactivity
timeouts. Confirmed by log: "sent initial prompt" followed by 10-min idle.
The startup dialog answers ("1\n") work because those menus handle both,
but Claude's readline interface only responds to \r.
2. **Capture claude session_id from --output-format json output**: OneShot
sessions now launch with -p --output-format json. When the session exits,
the driver parses the JSON output for "session_id" and stores it as
ConversationUUID on the instance. Subsequent restarts automatically use
--resume <uuid>, sending Claude back into the same conversation with full
context instead of re-running the task from scratch.
Supporting infrastructure:
- SetClaudeConversationUUID() — thread-safe setter that fires a save callback
- SetClaudeSessionIDSavedCallback() — wired in service layer to flush to DB
- wireClaudeSessionIDCallback() — registered on all creation paths including
loadInstancesWithWiring (startup) and CreateDirectorySession (backlog)
- Prompt is now appended even when claudeSessionID is set for OneShot
sessions so continuation prompts are delivered after --resume
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
New JSON columns (programs, subcommands, etc.) were NOT NULL with no
SQL-level DEFAULT, causing the SQLite copy-table migration to fail for
existing rows. Adding Optional() makes the columns nullable so old rows
can be copied; new rows still get []string{} from the Go-level Default.
Also adds build backup + auto-rollback to install-service:
- make install-service saves stapler-squad.prev before building
- health check polls /health for 15s after service start
- on failure, auto-restores .prev and restarts the service
- make rollback provides a manual escape hatch
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ol-mode process When multiple browser windows or split panes subscribe to the same tmux session's control-mode stream, disconnecting any one WebSocket was calling StopControlMode() which killed the shared tmux process and closed all subscriber channels — leaving the remaining windows with a broken terminal. Add refcount-based lifecycle to StartControlMode/StopControlMode: the process stays alive until every subscriber has stopped. A serializing mutex prevents the TOCTOU race where two concurrent first-time callers could fork two tmux processes. Upgrade broadcastControlModeUpdate from RLock to Lock to prevent a send-on-closed-channel panic when a subscriber unsubscribes concurrently. Also guard the pre-sizing getBoundingClientRect() call behind the ResizeObserver's containerSize, ensuring split-pane terminals send their actual pane width (not the full browser window width) in the PTY handshake. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
✅ Registry ValidationTest Coverage: 93/131 features have
|
Go Benchmarks (Tier 1) |
There was a problem hiding this comment.
Pull request overview
This PR addresses a production terminal reliability issue where disconnecting one browser tab could tear down a shared tmux control-mode process and break other active tabs. It introduces a refcounted lifecycle for tmux control mode, improves frontend initial sizing behavior in split-pane layouts, and includes some additional operational + DB schema adjustments.
Changes:
- Add refcount + serialized start/stop to tmux control-mode so the underlying process stays alive until the last subscriber disconnects, and harden broadcast against unsubscribe races.
- Update
TerminalOutputpre-sizing to wait for real pane dimensions (ResizeObserver) before usinggetBoundingClientRect(), and add a JestResizeObserverstub for tests. - Make several
ApprovalRuleJSON criteria fields nullable in ent schema/migrations and add generated clear/reset plumbing; add install-service health check/rollback support plus Makefile targets.
Reviewed changes
Copilot reviewed 10 out of 15 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| web-app/src/components/sessions/TerminalOutput.tsx | Defer pre-sizing until pane layout is resolved; optionally initiate connect once real size is known. |
| web-app/jest.setup.js | Add jsdom ResizeObserver stub to enable deterministic sizing tests. |
| session/tmux/tmux.go | Extend TmuxSession with control-mode start mutex + refcount fields. |
| session/tmux/control_mode.go | Refcount-based Start/Stop lifecycle; broadcast uses write lock to avoid close/send races. |
| session/tmux/control_mode_refcount_test.go | Add tests for refcount invariants and broadcast/unsubscribe race regression. |
| session/session_driver.go | Remove a trailing blank line (formatting-only). |
| session/instance_claude.go | Remove a trailing blank line (formatting-only). |
| session/ent/schema/approvalrule.go | Make JSON criteria fields nullable (Optional) with defaults for migration compatibility. |
| session/ent/mutation.go | Generated mutation updates: add Clear*/Cleared* support for nullable JSON fields. |
| session/ent/migrate/schema.go | Generated schema updates: mark JSON columns Nullable. |
| session/ent/approvalrule/where.go | Generated predicates: add IsNil/NotNil helpers for nullable JSON fields. |
| session/ent/approvalrule_update.go | Generated updates: add Clear* methods and SQL clear handling for nullable JSON fields. |
| session/ent/approvalrule_create.go | Generated create/upsert updates aligning with nullable JSON fields. |
| scripts/install-service.sh | Add post-install health check with auto-rollback to .prev binary. |
| Makefile | Add backup-binary + rollback targets and wire backup into install-service. |
Files not reviewed (5)
- session/ent/approvalrule/where.go: Generated file
- session/ent/approvalrule_create.go: Generated file
- session/ent/approvalrule_update.go: Generated file
- session/ent/migrate/schema.go: Generated file
- session/ent/mutation.go: Generated file
Comments suppressed due to low confidence (1)
Makefile:280
- The PR description doesn’t mention the new Make targets (backup-binary/rollback) and the changed install-service dependency chain. Since this affects developer/service install flows, it should be documented in the PR description (and ideally in any install docs) or split into a separate PR.
backup-binary: ## Snapshot the current binary to stapler-squad.prev before a new build (called by install-service)
@if [ -f ./stapler-squad ]; then \
cp -f ./stapler-squad ./stapler-squad.prev; \
echo "==> Saved current binary to ./stapler-squad.prev"; \
fi
install-service: backup-binary build ## Install stapler-squad as a system service (systemd on Linux, LaunchAgent on macOS)
ifeq ($(UNAME_S),Darwin)
@$(MAKE) _codesign-binary
endif
@STAPLER_SQUAD_BIN="$(CURDIR)/stapler-squad" ./scripts/install-service.sh $(if $(NO_PROFILE),--no-profile) $(if $(PROFILE_PORT),--profile-port $(PROFILE_PORT))
rollback: ## Restore the previous build (stapler-squad.prev) and restart the service
@if [ ! -f ./stapler-squad.prev ]; then \
echo "✗ No previous build found (./stapler-squad.prev does not exist)"; \
exit 1; \
fi
@echo "==> Restoring previous build..."
@cp -f ./stapler-squad.prev ./stapler-squad
@echo "✓ Binary restored from stapler-squad.prev"
ifeq ($(UNAME_S),Darwin)
@$(MAKE) _codesign-binary
endif
@STAPLER_SQUAD_BIN="$(CURDIR)/stapler-squad" ./scripts/install-service.sh $(if $(NO_PROFILE),--no-profile) $(if $(PROFILE_PORT),--profile-port $(PROFILE_PORT))
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| url="http://localhost:8543/health" | ||
| printf "==> Waiting for service to be healthy" |
There was a problem hiding this comment.
This file is outside the scope of this PR (introduced in a prior commit). Deferring the curl guard to a follow-up.
| // TestRefcount_LastStopActuallyTearesDown verifies that when refcount reaches 0, | ||
| // StopControlMode closes controlModeDone and nils controlModeCmd. | ||
| func TestRefcount_LastStopActuallyTearesDown(t *testing.T) { | ||
| sess := newRefcountTestSession(t) | ||
| doneCh := sess.controlModeDone | ||
|
|
||
| // Simulate teardown path: we can't call real StopControlMode on a fake cmd | ||
| // (it would try to kill a nil process), so we validate the refcount logic inline | ||
| // by checking what happens when refcount reaches 0 via the explicit decrement path. | ||
|
|
||
| // Manually decrement to 0 and check that the guard passes. | ||
| sess.controlModeSubMu.Lock() | ||
| sess.controlModeRefCount = 1 | ||
| sess.controlModeRefCount-- | ||
| remaining := sess.controlModeRefCount | ||
| if sess.controlModeDone != nil { | ||
| close(sess.controlModeDone) | ||
| sess.controlModeDone = nil | ||
| } | ||
| sess.controlModeSubMu.Unlock() | ||
|
|
||
| if remaining != 0 { | ||
| t.Errorf("remaining after final decrement = %d, want 0", remaining) | ||
| } | ||
| select { | ||
| case <-doneCh: | ||
| // Expected: channel was closed. | ||
| case <-time.After(100 * time.Millisecond): | ||
| t.Error("controlModeDone was not closed after final Stop") | ||
| } | ||
| } |
There was a problem hiding this comment.
Fixed in the follow-up commit 9050c0d: TestRefcount_LastStopActuallyTearsDown now calls the real StopControlMode() and uses a real already-exited process, verifying controlModeCmd is nil and controlModeRefCount is 0 after teardown.
| // TestRefcount_LastStopActuallyTearesDown verifies that when refcount reaches 0, | ||
| // StopControlMode closes controlModeDone and nils controlModeCmd. | ||
| func TestRefcount_LastStopActuallyTearesDown(t *testing.T) { |
There was a problem hiding this comment.
Fixed in commit 9050c0d — renamed to TestRefcount_LastStopActuallyTearsDown.
| // Optional() makes the DB column nullable so SQLite migrations can copy old rows | ||
| // (which lack these columns) without hitting a NOT NULL constraint. New rows still | ||
| // get []string{} via Default(). Application code treats nil slice as empty. | ||
| field.JSON("programs", []string{}). | ||
| Optional(). | ||
| Default([]string{}), | ||
| field.JSON("subcommands", []string{}). | ||
| Optional(). | ||
| Default([]string{}), | ||
| field.JSON("blocked_subcommands", []string{}). | ||
| Optional(). | ||
| Default([]string{}), | ||
| field.JSON("required_flags", []string{}). | ||
| Optional(). | ||
| Default([]string{}), | ||
| field.JSON("forbidden_flags", []string{}). | ||
| Optional(). | ||
| Default([]string{}), | ||
| field.JSON("required_flag_prefixes", []string{}). | ||
| Optional(). | ||
| Default([]string{}), | ||
| field.JSON("python_modes", []string{}). | ||
| Optional(). | ||
| Default([]string{}), |
There was a problem hiding this comment.
The ApprovalRule ent schema changes are from a prior commit on this branch (fix(db): make approval_rule JSON fields Optional to fix SQLite migration), not from the control-mode work. Noted — will split into separate PRs going forward for unrelated scope changes.
E2E RPC Latency |
UX Analysis
|
Frontend Terminal Throughput |
🎬 E2E Feature Demos2 shard(s) recorded feature flows for this PR. recordings shard 1 Demo preview opens directly in browser (single-file HTML). Raw WebM recordings in ZIP. Expires after 30 days. |
…t, channel race, ResetExitOnce - Reset controlModeRefCount and controlModeCmd when tmux dies unilaterally (readControlModeOutput EOF path) so StartControlMode() can fork a fresh process rather than fast-returning against a dead cmd (ARCH-1 BLOCKER) - Reset controlModeRefCount/Cmd in ResetExitOnce() for clean session restart - Fix data race: read highPriSendCh/normPriSendCh under controlModeSubMu.RLock in send paths; nil them inside controlModeSubMu in StopControlMode (CQ-2) - Close stderr pipe on cmd.Start() failure to fix fd leak - Update tests: real StopControlMode teardown path, concurrent Start+Stop race, underflow guard coverage, broadcast post-condition assertions - Fix typo TearesDown → TearsDown; fix misleading comment Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
✅ Registry ValidationTest Coverage: 93/131 features have
|
…elper Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
✅ Registry ValidationTest Coverage: 93/131 features have
|
…iles Resolved 5 modify/delete conflicts in session/ent/ generated files (approvalrule/where.go, approvalrule_create.go, approvalrule_update.go, migrate/schema.go, mutation.go). Main untracked these via c568e00 "chore: untrack generated files". Accepted main's deletion for all five. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
✅ Registry ValidationTest Coverage: 93/131 features have
|
Summary
StopControlMode()killed the shared tmux control-mode process and closed all subscriber channels, leaving every other open browser window/tab with a broken terminal showing "open terminal failed: terminal does not support clear"controlModeRefCount) so the process stays alive until the last subscriber stops; a serializingcontrolModeStartMumutex prevents the TOCTOU race where two concurrent first-time callers could fork two tmux processesbroadcastControlModeUpdatefromRLock→Lockto prevent a send-on-closed-channel panic when a subscriber unsubscribes concurrently (RACE feat: Enhance session management and CLI #7)getBoundingClientRect()call behindcontainerSize.width > 0(from ResizeObserver) so split-pane terminals send their actual pane width rather than the full browser window width in the PTY handshakeChanges
session/tmux/tmux.gocontrolModeStartMu sync.Mutex(serializes Start/Stop) andcontrolModeRefCount intfields toTmuxSessionsession/tmux/control_mode.goStartControlMode(): if process already running, increment refcount and return (no new fork); first caller acquires startMu for the full startup sequenceStopControlMode(): decrement refcount; only tear down when refcount reaches 0broadcastControlModeUpdate(): upgraded fromRLocktoLockto prevent concurrent close/send racessession/tmux/control_mode_refcount_test.go(new)-raceweb-app/src/components/sessions/TerminalOutput.tsxcontainerSize.width > 0) before readinggetBoundingClientRect(), preventing the wrong full-window-width being sent as the initial PTY size in split-pane modeconnect()directly if not yet initiated (session-switch effect won't re-run sincesessionIdunchanged)web-app/jest.setup.jsResizeObserverstub that fires the callback synchronously withgetBoundingClientRect()whenobserve()is called, enabling tests that mockgetBoundingClientRectto control the reported container sizeTest plan
go test ./session/tmux/... -count=1 -race— 206 tests pass (including 6 new refcount tests)npx jest --no-coverage --testPathPatterns="TerminalOutput"— all 74 tests pass (were 74 failures before ResizeObserver stub fix)window.matchMedia, canvas, scrollIntoView)🤖 Generated with Claude Code