feat: v0.6 daemon lifecycle + observability#17
Conversation
Add CLI→daemon communication via Unix domain socket at {poolDir}/daemon.sock.
The socket uses NDJSON protocol with method dispatch (stop, status, subscribe).
- New socketServer in internal/daemon/socket.go with accept loop and per-connection handling
- daemon.Run() creates child context shared with socket — stop RPC cancels it
- `agent-pool stop` CLI command connects to socket and sends shutdown request
- `agent-pool status` CLI command queries live daemon state (pool, experts, task counts)
- Auto-fallback to /tmp/ hashed path when poolDir exceeds macOS 104-byte socket limit
- 5 integration tests covering stop, unknown method, stale cleanup, shutdown cleanup, status
- Includes v0.6 implementation plan in docs/plans/
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Track all goroutine dispatch sites with sync.WaitGroup so the daemon waits for in-flight work before exiting. 5 dispatch sites wrapped: handleApprovalRequest, handleInbox (event loop), wake experts, drainAllInboxes (architect + per-expert). - 30s drain timeout with configurable WithDrainTimeout option - Double-signal handler: first SIGTERM/SIGINT triggers graceful drain, second forces immediate exit via os.Exit(1) - Integration test with slowSpawner verifies daemon waits for blocked spawn before completing shutdown Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Enhance the Status() method to include active task details (ID, expert, duration). The CLI status command renders pool name, state, uptime, expert list, task count table, and active task list in a human-readable format instead of raw JSON. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add channel-based event bus for daemon state transitions. Events are emitted at 6 existing log points: task.routed, expert.spawning, expert.completed, expert.failed, task.cancelled, task.unblocked. - eventBus with fan-out to subscribers via buffered channels (cap 64) - Non-blocking emit prevents slow subscribers from blocking the daemon - Socket subscribe method streams events as NDJSON - `agent-pool watch` CLI renders events with ANSI colors - 4 unit tests for event bus, 1 integration test for subscribe streaming Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove the default 10m session timeout. When session_timeout is empty or unset, expert sessions run to completion without a time limit. - ParseSessionTimeout returns (0, nil) for empty strings - Daemon uses context.WithCancel instead of context.WithTimeout when 0 - Set session_timeout in pool.toml to impose a limit Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Clean break: all tool names simplified from pool_X to X. The prefix only existed for ~1 day in production (bosun dogfood). Updated across code, tests, plugin skills, and bosun concierge identity. Tool renames: pool_read_state -> read_state, pool_update_state -> update_state, pool_append_error -> append_error, pool_send_response -> send_response, pool_recall -> recall, pool_search_index -> search_index, pool_define_contract -> define_contract, pool_send_task -> send_task, pool_verify_result -> verify_result, pool_amend_contract -> amend_contract, pool_ask_expert -> ask_expert, pool_dispatch -> dispatch, pool_collect -> collect, pool_submit_plan -> submit_plan, pool_check_status -> check_status, pool_list_experts -> list_experts Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ship a macOS launchd plist template with placeholder markers for binary path and pool directory. Includes install/stop/unload documentation and notes on KeepAlive and double-signal behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Apply action-oriented logging pattern to socket infrastructure: - Info: stop command received - Warn: invalid request, unknown method, accept failures - Debug: subscriber connect/disconnect Also align daemon drain messages with the narrative pattern (Preparing to / Successfully / Skipping. Reason:). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Full-scope audit after v0.6 implementation. 69.9% coverage, 25 test files, no P0 issues. Key gaps: DiscoverPoolDir (high risk), connectAndSend (medium risk). Main quality concern is time.Sleep-based synchronization in integration tests (30+ occurrences). [skip-ci] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds a Unix-domain socket control plane and CLI subcommands ( Changes
Sequence Diagram(s)sequenceDiagram
participant CLI as agent-pool CLI
participant Socket as Unix Socket Server
participant Daemon as Daemon.Run
participant Events as EventBus
CLI->>Socket: connect to daemon.sock
Socket->>Daemon: accept & dispatch {method}
alt method == "subscribe"
Socket->>Daemon: subscribe request
Daemon->>Events: register subscriber
Socket->>CLI: send ack (NDJSON)
loop events
Daemon->>Events: emit(event)
Events->>Socket: deliver to subscriber channel
Socket->>CLI: write NDJSON event
end
else method == "status"
Socket->>Daemon: request Status()
Daemon->>Socket: return status JSON
Socket->>CLI: send response then close
else method == "stop"
Socket->>Daemon: request stop -> cancel childCtx
Daemon->>Daemon: wait for wg (drainTimeout)
Daemon->>Socket: respond ok then exit
end
sequenceDiagram
participant User
participant CLI as agent-pool CLI
participant Socket as Socket Server
participant Daemon as Daemon
participant WG as WaitGroup
User->>CLI: agent-pool stop
CLI->>Socket: connect & send {method:"stop"}
Socket->>Daemon: invoke stop handler (calls cancel)
Daemon->>WG: wait for in-flight goroutines (up to drainTimeout)
alt drained within timeout
WG-->>Daemon: done
Daemon->>Socket: close listener & remove socket file
Daemon->>CLI: exit 0
else timeout expired
Daemon->>Daemon: log warning, exit
end
Estimated code review effort🎯 4 (Complex) | ⏱️ ~75 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Apply CSO (Claude Search Optimization) to skill descriptions:
- Change from workflow summaries to triggering conditions ("Use when...")
- This prevents Claude from shortcutting the skill body
Content improvements:
- pool-ask: Use dispatch+collect (non-blocking) instead of ask_expert
- pool-build: Same dispatch+collect pattern for expert input gathering
- concierge-identity: Add dispatch and collect to tools table, use
table format for tool reference
[skip-ci]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 5
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
internal/config/config.go (1)
65-73:⚠️ Potential issue | 🟠 MajorReject negative
session_timeoutvalues.
time.ParseDurationaccepts negative durations (e.g.,"-1s"), and the daemon'stimeout > 0check means a negative value would silently fall through tocontext.WithCanceland run unbounded instead of timing out. Add validation to fail fast on negative values.Suggested fix
func (d DefaultsSection) ParseSessionTimeout() (time.Duration, error) { if d.SessionTimeout == "" { return 0, nil } dur, err := time.ParseDuration(d.SessionTimeout) if err != nil { return 0, fmt.Errorf("parsing defaults.session_timeout %q: %w", d.SessionTimeout, err) } + if dur < 0 { + return 0, fmt.Errorf("parsing defaults.session_timeout %q: must be >= 0", d.SessionTimeout) + } return dur, nil }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/config/config.go` around lines 65 - 73, In ParseSessionTimeout, time.ParseDuration can return negative durations which the daemon currently treats as no timeout; after parsing the duration (dur, err := time.ParseDuration(d.SessionTimeout)), add a validation that rejects negative values (if dur < 0) and return a descriptive error (e.g., fmt.Errorf("defaults.session_timeout %q must not be negative", d.SessionTimeout)); leave empty string handling as-is so only explicit negative durations fail fast.internal/mcp/config.go (1)
10-20:⚠️ Potential issue | 🟠 MajorArchitect and concierge tools are missing from headless pre-approval.
ExpertToolNamesonly includes the 6 expert-scope tools. However, the same spawning flow applies to architect and concierge roles—they also go throughprocessInboxMessage()and receive only ExpertToolNames pre-approval, plus defaults from config.Architect role registers
define_contract,send_task,verify_result, andamend_contractviaRegisterArchitectTools(). Concierge role registersask_expert,submit_plan,check_status, andlist_expertsviaRegisterConciergeTools(). These role-specific tools are not in the pre-approved list and would be blocked in Claude's headless mode.Add
ArchitectToolNamesandConciergeToolNamesvariables with the appropriate tool names in the formatmcp__agent-pool__<tool-name>, and append them when spawning architect/concierge roles:if mail.IsBuiltinRole(expertName) && expertName == "architect" { allTools = append(allTools, agentmcp.ArchitectToolNames...) } else if mail.IsBuiltinRole(expertName) && expertName == "concierge" { allTools = append(allTools, agentmcp.ConciergeToolNames...) }Or consolidate pre-approval logic by checking
cfg.Rolein the MCP server registration to determine which tools to pre-approve.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@internal/mcp/config.go` around lines 10 - 20, ExpertToolNames currently only lists expert-scoped MCP tool names, so architect and concierge role tools registered by RegisterArchitectTools() and RegisterConciergeTools() are not pre-approved in headless mode; add two new slices ArchitectToolNames and ConciergeToolNames containing the mcp__agent-pool__<tool-name> entries for define_contract, send_task, verify_result, amend_contract and ask_expert, submit_plan, check_status, list_experts respectively, and update the spawning/pre-approval logic in processInboxMessage (where allTools is assembled) to append agentmcp.ArchitectToolNames when mail.IsBuiltinRole(expertName) && expertName == "architect" and append agentmcp.ConciergeToolNames when mail.IsBuiltinRole(expertName) && expertName == "concierge" (alternatively consolidate by branching on cfg.Role in the MCP server registration path) so architect/concierge tools are included alongside ExpertToolNames.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/agent-pool/main.go`:
- Around line 118-142: The CLI commands (cmdStop, cmdStatus, cmdWatch) currently
use filepath.Join(poolDir, "daemon.sock") which mismatches the daemon's
resolveSockPath() fallback for long paths; add a shared helper (e.g.,
ResolveSockPath(poolDir string)) that implements the same logic as
resolveSockPath (use the candidate path if <=100 chars, otherwise hash poolDir
with fnv.New32a and return os.TempDir()/ap-{hash}.sock), put it in a shared
package (e.g., internal/config/socket.go), and replace direct filepath.Join
calls in cmdStop/cmdStatus/cmdWatch and the daemon's resolveSockPath to call
ResolveSockPath so both sides compute the same socket path.
In `@internal/daemon/events_test.go`:
- Around line 79-100: The test hardcodes the subscriber buffer size (64) causing
brittle duplication; define an exported constant EventBusBufferCapacity = 64 in
the events package (where the bus/channel is created) and update
internal/daemon/events_test.go to use that constant (e.g., emit
EventBusBufferCapacity + 16 events and assert count == EventBusBufferCapacity)
instead of literal 64 so the test stays correct if the capacity changes;
reference the bus creation symbol (e.g., EventBus or the event subscription
code) to ensure the constant is used where the channel buffer is allocated and
in the test assertion.
In `@internal/daemon/socket.go`:
- Around line 171-181: writeResponse currently ignores encoding errors; modify
socketServer.writeResponse to capture the error returned by
json.NewEncoder(conn).Encode(resp) and, if non-nil, emit a debug-level log
including the error and minimal context (e.g. the resp type/id or a short
string) so client disconnects are diagnosable. Keep the existing
SetWriteDeadline call and ensure you call the socketServer logger (e.g.
s.logger.Debugf or s.log.Debugf depending on the struct field used on
socketServer) rather than panicking or returning an error.
In `@internal/mcp/concierge_tools.go`:
- Around line 512-515: The change replaced the public response key
"pool_experts" with "experts", which breaks callers; restore backwards
compatibility by including the old key as an alias while adding the new key
(i.e., ensure the map returned by the code that builds result contains both
"pool_experts": poolExperts and "experts": poolExperts, and similarly keep
"shared_experts" as-is), or alternatively add an explicit version field to the
response; update the construction that references poolExperts and
poolCfg.Shared.Include so both keys are present for one release.
In `@scripts/com.agent-pool.daemon.plist`:
- Around line 15-29: The plist currently lacks an ExitTimeOut setting so launchd
may SIGKILL the daemon after its 20s default; add an ExitTimeOut key with value
35 (e.g., <key>ExitTimeOut</key> with integer 35) to the existing plist
alongside the other keys (near RunAtLoad/KeepAlive/WorkingDirectory) so the
daemon has 35 seconds to complete its graceful drain before being force-killed.
---
Outside diff comments:
In `@internal/config/config.go`:
- Around line 65-73: In ParseSessionTimeout, time.ParseDuration can return
negative durations which the daemon currently treats as no timeout; after
parsing the duration (dur, err := time.ParseDuration(d.SessionTimeout)), add a
validation that rejects negative values (if dur < 0) and return a descriptive
error (e.g., fmt.Errorf("defaults.session_timeout %q must not be negative",
d.SessionTimeout)); leave empty string handling as-is so only explicit negative
durations fail fast.
In `@internal/mcp/config.go`:
- Around line 10-20: ExpertToolNames currently only lists expert-scoped MCP tool
names, so architect and concierge role tools registered by
RegisterArchitectTools() and RegisterConciergeTools() are not pre-approved in
headless mode; add two new slices ArchitectToolNames and ConciergeToolNames
containing the mcp__agent-pool__<tool-name> entries for define_contract,
send_task, verify_result, amend_contract and ask_expert, submit_plan,
check_status, list_experts respectively, and update the spawning/pre-approval
logic in processInboxMessage (where allTools is assembled) to append
agentmcp.ArchitectToolNames when mail.IsBuiltinRole(expertName) && expertName ==
"architect" and append agentmcp.ConciergeToolNames when
mail.IsBuiltinRole(expertName) && expertName == "concierge" (alternatively
consolidate by branching on cfg.Role in the MCP server registration path) so
architect/concierge tools are included alongside ExpertToolNames.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 68b1662f-c860-4f6f-97bf-263176df54ae
📒 Files selected for processing (24)
cmd/agent-pool/main.godocs/launchd.mddocs/plans/2026-04-05-daemon-lifecycle.mddocs/test-audit-report.mdinternal/config/config.gointernal/config/config_test.gointernal/daemon/daemon.gointernal/daemon/daemon_test.gointernal/daemon/events.gointernal/daemon/events_test.gointernal/daemon/socket.gointernal/daemon/socket_test.gointernal/mcp/architect_tools.gointernal/mcp/architect_tools_test.gointernal/mcp/concierge_tools.gointernal/mcp/concierge_tools_test.gointernal/mcp/config.gointernal/mcp/tools.gointernal/mcp/tools_test.goplugin/concierge-identity.mdplugin/skills/pool-ask.mdplugin/skills/pool-build.mdplugin/skills/pool-status.mdscripts/com.agent-pool.daemon.plist
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@plugin/skills/pool-build.md`:
- Around line 34-35: The docs say to call submit_plan with only the plan body
but the implementation still accepts an optional contracts parameter
(comma-separated contract IDs) in the submit_plan tool and its handler; either
document contracts as an optional parameter (describe format, example, and
behavior) or remove/deprecate the parameter from the submit_plan handler and
tool registration so behavior and docs match; update the submit_plan tool docs
to explicitly mention the optional contracts parameter and its format (or add a
deprecation notice) and ensure the submit_plan handler that parses contracts is
kept in sync with the documentation.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 19f59619-c829-4bc7-a2f5-563515c8a966
📒 Files selected for processing (4)
plugin/concierge-identity.mdplugin/skills/pool-ask.mdplugin/skills/pool-build.mdplugin/skills/pool-status.md
CodeRabbit findings: 1. [Critical] Socket path mismatch — CLI used hardcoded path while daemon had fnv fallback for long paths. Extracted shared config.ResolveSockPath used by both sides. 2. [Nitpick] Magic number 64 in event bus test — exported EventBufSize constant, test uses it instead of hardcoded value. 3. [Nitpick] writeResponse silently ignored errors — added debug log. 4. [Major] list_experts response renamed pool_experts to experts without compatibility — added pool_experts as deprecated alias alongside new experts key. 5. [Major] launchd plist missing ExitTimeOut — added 35s to exceed the 30s drain timeout, preventing SIGKILL before graceful shutdown completes. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/agent-pool/main.go`:
- Around line 174-180: The printStatusField helper currently ignores the error
returned by json.Unmarshal which can lead to printing incorrect or empty values;
update printStatusField to capture the error (e.g., err := json.Unmarshal(v,
&s)) and if err != nil simply skip printing the field (return) so only
successfully unmarshalled string values are printed; reference the
printStatusField closure and its use of label, key, data, json.Unmarshal and
variable s when making the change.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: d1b51232-15ac-4ac6-8f67-a7021f493c75
📒 Files selected for processing (8)
cmd/agent-pool/main.gointernal/config/socket.gointernal/daemon/daemon.gointernal/daemon/events.gointernal/daemon/events_test.gointernal/daemon/socket.gointernal/mcp/concierge_tools.goscripts/com.agent-pool.daemon.plist
| printStatusField := func(label, key string) { | ||
| if v, ok := data[key]; ok { | ||
| var s string | ||
| json.Unmarshal(v, &s) | ||
| fmt.Printf("%-10s %s\n", label+":", s) | ||
| } | ||
| } |
There was a problem hiding this comment.
🧹 Nitpick | 🔵 Trivial
Consider handling unmarshal errors in printStatusField.
The helper ignores json.Unmarshal errors, which is acceptable for string fields since they'll default to empty. However, for robustness, you might want to skip printing the field if unmarshal fails for non-string types.
💡 Optional: Handle unmarshal errors
printStatusField := func(label, key string) {
if v, ok := data[key]; ok {
var s string
- json.Unmarshal(v, &s)
- fmt.Printf("%-10s %s\n", label+":", s)
+ if err := json.Unmarshal(v, &s); err == nil {
+ fmt.Printf("%-10s %s\n", label+":", s)
+ }
}
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cmd/agent-pool/main.go` around lines 174 - 180, The printStatusField helper
currently ignores the error returned by json.Unmarshal which can lead to
printing incorrect or empty values; update printStatusField to capture the error
(e.g., err := json.Unmarshal(v, &s)) and if err != nil simply skip printing the
field (return) so only successfully unmarshalled string values are printed;
reference the printStatusField closure and its use of label, key, data,
json.Unmarshal and variable s when making the change.
1. [Minor] pool-build skill omitted contracts parameter from submit_plan call — restored parameter list matching implementation. 2. [Nitpick] printStatusField ignored unmarshal errors — added fallback to print raw JSON when field isn't a string. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/agent-pool/main.go`:
- Around line 277-374: The current watcher only prints events (the scanner loop
handling event struct and switch on e.Type) but never loads or maintains task
state for a live table; update the logic to fetch an initial task snapshot (via
the same backend API used for events) into an in-memory map keyed by task ID,
implement a redraw function that clears the terminal and prints a compact task
table (showing status, expert, model, duration, summary) and call it initially
and whenever events arrive, and in the scanner loop update the in-memory task
entries based on event types ("task.routed", "expert.spawning",
"expert.completed", "expert.failed", "task.cancelled", "task.unblocked") before
calling redraw; keep the existing event log output but use fmt.Printf and the
ANSI constants (reset, green, red, yellow, cyan) to colorize both the table rows
and the appended event line.
- Around line 124-130: The discovered poolDir returned by config.DiscoverPoolDir
must be canonicalized before being passed to config.ResolveSockPath (or
alternately make ResolveSockPath canonicalize internally); update the client
commands that call config.ResolveSockPath (e.g., the stop/status/watch command
handling in main.go where connectAndSend(config.ResolveSockPath(poolDir),
"stop") is invoked) to first run filepath.Abs (or an equivalent normalization)
on poolDir and handle the error, then pass the normalized path into
config.ResolveSockPath (or modify ResolveSockPath to call filepath.Abs at its
start and use the normalized string for both path joining and hashing).
- Around line 255-264: The initial ack read using bufio.NewScanner(conn) and the
NDJSON read loop both ignore scanner.Err() when Scan() returns false; update the
ack block (where `scanner := bufio.NewScanner(conn)` and `var ack
socketResponse` are used) to check `if !scanner.Scan() { if err :=
scanner.Err(); err != nil { fmt.Fprintf(os.Stderr, "error: reading ack: %v\n",
err); os.Exit(1) } fmt.Fprintf(os.Stderr, "error: no ack from daemon\n");
os.Exit(1) }` and update the loop that uses `for scanner.Scan()` to, after the
loop exits, check `if err := scanner.Err(); err != nil { /* surface error
(log/exit/return) */ }` following the same pattern used in `connectAndSend`
(lines 399–409) so frame-size or I/O errors are reported instead of silently
falling through.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: cd701a48-cc25-4588-8faa-623c2371977c
📒 Files selected for processing (2)
cmd/agent-pool/main.goplugin/skills/pool-build.md
| fmt.Println("Watching daemon events (Ctrl-C to stop)...") | ||
| fmt.Println() | ||
|
|
||
| // ANSI colors | ||
| const ( | ||
| reset = "\033[0m" | ||
| green = "\033[32m" | ||
| red = "\033[31m" | ||
| yellow = "\033[33m" | ||
| cyan = "\033[36m" | ||
| ) | ||
|
|
||
| type event struct { | ||
| Type string `json:"type"` | ||
| Timestamp time.Time `json:"timestamp"` | ||
| Data json.RawMessage `json:"data"` | ||
| } | ||
|
|
||
| for scanner.Scan() { | ||
| var e event | ||
| if err := json.Unmarshal(scanner.Bytes(), &e); err != nil { | ||
| continue | ||
| } | ||
|
|
||
| ts := e.Timestamp.Format("15:04:05") | ||
| var color, detail string | ||
|
|
||
| switch e.Type { | ||
| case "task.routed": | ||
| color = cyan | ||
| var d struct { | ||
| ID string `json:"id"` | ||
| From string `json:"from"` | ||
| To string `json:"to"` | ||
| } | ||
| json.Unmarshal(e.Data, &d) | ||
| detail = fmt.Sprintf("%s -> %s (%s)", d.From, d.To, d.ID) | ||
|
|
||
| case "expert.spawning": | ||
| color = yellow | ||
| var d struct { | ||
| Expert string `json:"expert"` | ||
| TaskID string `json:"task_id"` | ||
| Model string `json:"model"` | ||
| } | ||
| json.Unmarshal(e.Data, &d) | ||
| detail = fmt.Sprintf("%s task=%s model=%s", d.Expert, d.TaskID, d.Model) | ||
|
|
||
| case "expert.completed": | ||
| color = green | ||
| var d struct { | ||
| Expert string `json:"expert"` | ||
| TaskID string `json:"task_id"` | ||
| Duration string `json:"duration"` | ||
| Summary string `json:"summary"` | ||
| } | ||
| json.Unmarshal(e.Data, &d) | ||
| detail = fmt.Sprintf("%s task=%s %s", d.Expert, d.TaskID, d.Duration) | ||
| if d.Summary != "" { | ||
| if len(d.Summary) > 60 { | ||
| d.Summary = d.Summary[:60] + "..." | ||
| } | ||
| detail += " " + d.Summary | ||
| } | ||
|
|
||
| case "expert.failed": | ||
| color = red | ||
| var d struct { | ||
| Expert string `json:"expert"` | ||
| TaskID string `json:"task_id"` | ||
| ExitCode int `json:"exit_code"` | ||
| } | ||
| json.Unmarshal(e.Data, &d) | ||
| detail = fmt.Sprintf("%s task=%s exit=%d", d.Expert, d.TaskID, d.ExitCode) | ||
|
|
||
| case "task.cancelled": | ||
| color = red | ||
| var d struct { | ||
| TaskID string `json:"task_id"` | ||
| } | ||
| json.Unmarshal(e.Data, &d) | ||
| detail = d.TaskID | ||
|
|
||
| case "task.unblocked": | ||
| color = green | ||
| var d struct { | ||
| TaskID string `json:"task_id"` | ||
| Expert string `json:"expert"` | ||
| } | ||
| json.Unmarshal(e.Data, &d) | ||
| detail = fmt.Sprintf("%s -> %s", d.TaskID, d.Expert) | ||
|
|
||
| default: | ||
| detail = string(e.Data) | ||
| } | ||
|
|
||
| fmt.Printf("[%s] %s%-18s%s %s\n", ts, color, e.Type, reset, detail) | ||
| } |
There was a problem hiding this comment.
watch is still event-only.
Issue #10 calls for a live task table plus the event stream, but this implementation never fetches an initial status snapshot or maintains task state for redraws. Operators only get an append-only log.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cmd/agent-pool/main.go` around lines 277 - 374, The current watcher only
prints events (the scanner loop handling event struct and switch on e.Type) but
never loads or maintains task state for a live table; update the logic to fetch
an initial task snapshot (via the same backend API used for events) into an
in-memory map keyed by task ID, implement a redraw function that clears the
terminal and prints a compact task table (showing status, expert, model,
duration, summary) and call it initially and whenever events arrive, and in the
scanner loop update the in-memory task entries based on event types
("task.routed", "expert.spawning", "expert.completed", "expert.failed",
"task.cancelled", "task.unblocked") before calling redraw; keep the existing
event log output but use fmt.Printf and the ANSI constants (reset, green, red,
yellow, cyan) to colorize both the table rows and the appended event line.
1. [Critical] ResolveSockPath now canonicalizes poolDir with filepath.Abs internally, ensuring daemon and CLI always compute the same socket path regardless of relative vs absolute input. 2. [Major] cmdWatch now checks scanner.Err() after ack read and after the event stream loop, surfacing frame-size and I/O errors instead of silently dropping out. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cmd/agent-pool/main.go (1)
96-109:⚠️ Potential issue | 🟠 MajorA race condition exists in the double-signal handling: the second signal can be absorbed by the original NotifyContext handler before stop() executes.
The current implementation waits on
ctx.Done(), then callsstop(), then registers a new signal handler. If a second signal arrives during the window between<-ctx.Done()returning andstop()completing, the originalNotifyContexthandler (which is still active) will absorb it into its internal buffer, leaving the process waiting for a third signal or the drain timeout.Using a single buffered channel registered once avoids this race:
🐛 Suggested pattern
- ctx, stop := signal.NotifyContext(context.Background(), - syscall.SIGTERM, syscall.SIGINT) - defer stop() - - // Double-signal: first signal triggers graceful drain, second forces exit. - go func() { - <-ctx.Done() - stop() // reset signal handling to default - sigCh := make(chan os.Signal, 1) - signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT) - <-sigCh - logger.Warn("Received second signal, forcing immediate exit") - os.Exit(1) - }() + ctx, cancel := context.WithCancel(context.Background()) + sigCh := make(chan os.Signal, 2) + signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT) + defer signal.Stop(sigCh) + defer cancel() + + // Double-signal: first signal triggers graceful drain, second forces exit. + go func() { + <-sigCh + cancel() + <-sigCh + logger.Warn("Received second signal, forcing immediate exit") + os.Exit(1) + }()🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/agent-pool/main.go` around lines 96 - 109, The double-signal goroutine currently races because it calls stop() after ctx.Done() and re-registers a new signal handler; replace this with a single buffered channel registered once: create sigCh := make(chan os.Signal, 1) and call signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT) at startup, then drive graceful shutdown from ctx (use context.WithCancel or the existing ctx) by reading the first signal from sigCh to trigger the existing shutdown (cancel/stop), and then block on the same sigCh for a second signal to log via logger.Warn and os.Exit(1); remove the signal.NotifyContext usage and the re-registration inside the goroutine so no signals can be lost.
♻️ Duplicate comments (1)
cmd/agent-pool/main.go (1)
247-377:⚠️ Potential issue | 🟠 Major
watchis still append-only.Issue
#10calls for a live task table plus the event stream, but this implementation never fetches an initial snapshot or keeps per-task state for redraws. Operators only get a colored log feed right now. The existingstatusRPC should be enough to seed the first render before the subscribe loop.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cmd/agent-pool/main.go` around lines 247 - 377, The watch command is append-only; call the existing status RPC to fetch an initial snapshot, seed an in-memory tasks map, and switch from printing appended lines to redrawing a live table on each event. Before sending the subscribe request (around the conn.Encode("subscribe") call), send a JSON status request (method:"status"), decode the response into a tasks structure (use the same socketResponse/status shape used elsewhere or define statusResponse), populate a map[string]Task keyed by task ID, and render it via a new renderTasks function that prints a table (or clears the terminal and reprints). Inside the event loop (the event struct, switch on e.Type and the various d structs), update the tasks map according to events (e.g., task.routed, expert.spawning/completed/failed, task.cancelled/unblocked) instead of printing detail, then call renderTasks after each update; keep the subscribe/ack handling, deadline, and signal handling as-is.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cmd/agent-pool/main.go`:
- Around line 273-279: The signal handler currently closes conn which makes
scanner.Scan() return an error that is then logged as "error: stream
interrupted" even on intentional Ctrl-C; add an intentional-shutdown flag (e.g.,
a package-level uint32 used with sync/atomic or an atomic.Bool) and set it
inside the goroutine before calling conn.Close() (where sigCh, conn.Close() are
used), then when checking scanner.Err() (and the similar check around lines
380-383), suppress logging/exit if the intentional-shutdown flag is set; apply
the same pattern to the other signal handler block so intentional closes don’t
produce error messages.
---
Outside diff comments:
In `@cmd/agent-pool/main.go`:
- Around line 96-109: The double-signal goroutine currently races because it
calls stop() after ctx.Done() and re-registers a new signal handler; replace
this with a single buffered channel registered once: create sigCh := make(chan
os.Signal, 1) and call signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT) at
startup, then drive graceful shutdown from ctx (use context.WithCancel or the
existing ctx) by reading the first signal from sigCh to trigger the existing
shutdown (cancel/stop), and then block on the same sigCh for a second signal to
log via logger.Warn and os.Exit(1); remove the signal.NotifyContext usage and
the re-registration inside the goroutine so no signals can be lost.
---
Duplicate comments:
In `@cmd/agent-pool/main.go`:
- Around line 247-377: The watch command is append-only; call the existing
status RPC to fetch an initial snapshot, seed an in-memory tasks map, and switch
from printing appended lines to redrawing a live table on each event. Before
sending the subscribe request (around the conn.Encode("subscribe") call), send a
JSON status request (method:"status"), decode the response into a tasks
structure (use the same socketResponse/status shape used elsewhere or define
statusResponse), populate a map[string]Task keyed by task ID, and render it via
a new renderTasks function that prints a table (or clears the terminal and
reprints). Inside the event loop (the event struct, switch on e.Type and the
various d structs), update the tasks map according to events (e.g., task.routed,
expert.spawning/completed/failed, task.cancelled/unblocked) instead of printing
detail, then call renderTasks after each update; keep the subscribe/ack
handling, deadline, and signal handling as-is.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 5b534128-da91-4027-8969-74fef671dc6f
📒 Files selected for processing (2)
cmd/agent-pool/main.gointernal/config/socket.go
| // Handle Ctrl-C cleanly | ||
| sigCh := make(chan os.Signal, 1) | ||
| signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT) | ||
| go func() { | ||
| <-sigCh | ||
| conn.Close() | ||
| }() |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
cd /repo && wc -l cmd/agent-pool/main.goRepository: cameronsjo/agent-pool
Length of output: 120
🏁 Script executed:
cd /repo && sed -n '273,279p' cmd/agent-pool/main.goRepository: cameronsjo/agent-pool
Length of output: 120
🏁 Script executed:
cd /repo && sed -n '380,383p' cmd/agent-pool/main.goRepository: cameronsjo/agent-pool
Length of output: 120
🏁 Script executed:
cd /repo && sed -n '260,300p' cmd/agent-pool/main.goRepository: cameronsjo/agent-pool
Length of output: 120
🏁 Script executed:
cd /repo && sed -n '370,400p' cmd/agent-pool/main.goRepository: cameronsjo/agent-pool
Length of output: 120
🏁 Script executed:
find . -name "main.go" -path "*/agent-pool/*" -type fRepository: cameronsjo/agent-pool
Length of output: 89
🏁 Script executed:
wc -l cmd/agent-pool/main.goRepository: cameronsjo/agent-pool
Length of output: 91
🏁 Script executed:
sed -n '273,279p' cmd/agent-pool/main.goRepository: cameronsjo/agent-pool
Length of output: 222
🏁 Script executed:
sed -n '380,383p' cmd/agent-pool/main.goRepository: cameronsjo/agent-pool
Length of output: 184
🏁 Script executed:
sed -n '260,300p' cmd/agent-pool/main.goRepository: cameronsjo/agent-pool
Length of output: 967
🏁 Script executed:
sed -n '370,410p' cmd/agent-pool/main.goRepository: cameronsjo/agent-pool
Length of output: 1294
🏁 Script executed:
sed -n '240,280p' cmd/agent-pool/main.goRepository: cameronsjo/agent-pool
Length of output: 1179
🏁 Script executed:
sed -n '330,410p' cmd/agent-pool/main.goRepository: cameronsjo/agent-pool
Length of output: 2292
🏁 Script executed:
rg "scanner.Err\(\)" cmd/agent-pool/main.go -A 2 -B 2Repository: cameronsjo/agent-pool
Length of output: 514
🏁 Script executed:
rg "net.ErrClosed" . -t goRepository: cameronsjo/agent-pool
Length of output: 47
Track intentional shutdown to suppress error message on Ctrl-C.
The signal handler closes conn when Ctrl-C is pressed, causing scanner.Scan() to fail. The subsequent scanner.Err() check will return that error and print "error: stream interrupted" before exiting 1, even though this is a normal, graceful shutdown. Set a flag when the signal handler triggers and suppress the error message for intentional closes.
Also applies to: 380-383
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cmd/agent-pool/main.go` around lines 273 - 279, The signal handler currently
closes conn which makes scanner.Scan() return an error that is then logged as
"error: stream interrupted" even on intentional Ctrl-C; add an
intentional-shutdown flag (e.g., a package-level uint32 used with sync/atomic or
an atomic.Bool) and set it inside the goroutine before calling conn.Close()
(where sigCh, conn.Close() are used), then when checking scanner.Err() (and the
similar check around lines 380-383), suppress logging/exit if the
intentional-shutdown flag is set; apply the same pattern to the other signal
handler block so intentional closes don’t produce error messages.
Summary
{poolDir}/daemon.sockfor CLI→daemon communication (NDJSON protocol)agent-pool stop— graceful shutdown via socket RPCagent-pool status— live pool state with pretty-print (experts, task counts, active tasks)agent-pool watch— stream daemon events with ANSI-colored terminal outputsync.WaitGroup(30s timeout, double-signal for immediate exit)pool_prefix from all 16 MCP tool names (clean break)Issues: closes #8, closes #9, closes #10, closes #11, closes #13
Changes
socket.go, child context pattern,cmdStopStatus()method,cmdStatuswith pretty-printevents.goevent bus, socket subscribe,cmdWatchParseSessionTimeoutreturns 0 for empty,WithCancelfallbackpool_X→Xacross code, tests, plugin, bosun+1,804 / -202 lines across 24 files
Test plan
agent-pool start+agent-pool status+agent-pool watch+agent-pool stop🤖 Generated with Claude Code
Summary by CodeRabbit
New Features
Refactor
Documentation
Tests