Skip to content

feat: v0.6 daemon lifecycle + observability#17

Merged
cameronsjo merged 13 commits into
mainfrom
feat/v0.6-daemon-lifecycle
Apr 5, 2026
Merged

feat: v0.6 daemon lifecycle + observability#17
cameronsjo merged 13 commits into
mainfrom
feat/v0.6-daemon-lifecycle

Conversation

@cameronsjo

@cameronsjo cameronsjo commented Apr 5, 2026

Copy link
Copy Markdown
Owner

Summary

  • Unix domain socket at {poolDir}/daemon.sock for CLI→daemon communication (NDJSON protocol)
  • agent-pool stop — graceful shutdown via socket RPC
  • agent-pool status — live pool state with pretty-print (experts, task counts, active tasks)
  • agent-pool watch — stream daemon events with ANSI-colored terminal output
  • Graceful drain with sync.WaitGroup (30s timeout, double-signal for immediate exit)
  • Event bus with fan-out to subscribers (6 event types at existing state transitions)
  • Session timeout now optional (empty = sessions run to completion)
  • Dropped pool_ prefix from all 16 MCP tool names (clean break)
  • launchd plist template for macOS backgrounding
  • Narrative logging on socket server

Issues: closes #8, closes #9, closes #10, closes #11, closes #13

Changes

Phase Commit What
1 Socket + Stop socket.go, child context pattern, cmdStop
2 Graceful Drain WaitGroup on 5 goroutine sites, double-signal handler
3 Status Status() method, cmdStatus with pretty-print
4 Event Streaming events.go event bus, socket subscribe, cmdWatch
5a Timeout ParseSessionTimeout returns 0 for empty, WithCancel fallback
5b Tool Rename pool_XX across code, tests, plugin, bosun
6 launchd Plist template + install docs

+1,804 / -202 lines across 24 files

Test plan

  • 5 socket lifecycle tests (stop, unknown method, stale cleanup, shutdown cleanup, status)
  • 4 event bus unit tests (subscribe, multi-subscriber, unsubscribe, slow drop)
  • 1 drain integration test (slowSpawner verifies daemon waits for blocked spawn)
  • 1 subscribe integration test (task.routed event streams over socket)
  • All existing tests pass (69.9% coverage)
  • Manual: agent-pool start + agent-pool status + agent-pool watch + agent-pool stop
  • Manual: double-signal (Ctrl-C twice) forces immediate exit

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • CLI: added stop, status, and watch (live colored event stream); status reports pool/state/uptime plus experts, task counts, and active tasks. Unix-socket control plane, daemon subscribe/streaming, configurable graceful drain with double-signal behavior. Version bumped to v0.6.0-dev. Session timeouts may be disabled.
  • Refactor

    • Renamed MCP/expert tools to drop the pool_ prefix; list_experts adds an experts key while preserving a compatibility alias.
  • Documentation

    • Added macOS launchd guide and updated audit/report and plugin/skill docs.
  • Tests

    • Expanded socket, event bus, and drain lifecycle tests.

cameronsjo and others added 9 commits April 5, 2026 13:50
Add CLI→daemon communication via Unix domain socket at {poolDir}/daemon.sock.
The socket uses NDJSON protocol with method dispatch (stop, status, subscribe).

- New socketServer in internal/daemon/socket.go with accept loop and per-connection handling
- daemon.Run() creates child context shared with socket — stop RPC cancels it
- `agent-pool stop` CLI command connects to socket and sends shutdown request
- `agent-pool status` CLI command queries live daemon state (pool, experts, task counts)
- Auto-fallback to /tmp/ hashed path when poolDir exceeds macOS 104-byte socket limit
- 5 integration tests covering stop, unknown method, stale cleanup, shutdown cleanup, status
- Includes v0.6 implementation plan in docs/plans/

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Track all goroutine dispatch sites with sync.WaitGroup so the daemon
waits for in-flight work before exiting. 5 dispatch sites wrapped:
handleApprovalRequest, handleInbox (event loop), wake experts,
drainAllInboxes (architect + per-expert).

- 30s drain timeout with configurable WithDrainTimeout option
- Double-signal handler: first SIGTERM/SIGINT triggers graceful drain,
  second forces immediate exit via os.Exit(1)
- Integration test with slowSpawner verifies daemon waits for blocked
  spawn before completing shutdown

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Enhance the Status() method to include active task details (ID, expert,
duration). The CLI status command renders pool name, state, uptime,
expert list, task count table, and active task list in a human-readable
format instead of raw JSON.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add channel-based event bus for daemon state transitions. Events are
emitted at 6 existing log points: task.routed, expert.spawning,
expert.completed, expert.failed, task.cancelled, task.unblocked.

- eventBus with fan-out to subscribers via buffered channels (cap 64)
- Non-blocking emit prevents slow subscribers from blocking the daemon
- Socket subscribe method streams events as NDJSON
- `agent-pool watch` CLI renders events with ANSI colors
- 4 unit tests for event bus, 1 integration test for subscribe streaming

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Remove the default 10m session timeout. When session_timeout is empty
or unset, expert sessions run to completion without a time limit.

- ParseSessionTimeout returns (0, nil) for empty strings
- Daemon uses context.WithCancel instead of context.WithTimeout when 0
- Set session_timeout in pool.toml to impose a limit

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Clean break: all tool names simplified from pool_X to X. The prefix
only existed for ~1 day in production (bosun dogfood). Updated across
code, tests, plugin skills, and bosun concierge identity.

Tool renames:
  pool_read_state -> read_state, pool_update_state -> update_state,
  pool_append_error -> append_error, pool_send_response -> send_response,
  pool_recall -> recall, pool_search_index -> search_index,
  pool_define_contract -> define_contract, pool_send_task -> send_task,
  pool_verify_result -> verify_result, pool_amend_contract -> amend_contract,
  pool_ask_expert -> ask_expert, pool_dispatch -> dispatch,
  pool_collect -> collect, pool_submit_plan -> submit_plan,
  pool_check_status -> check_status, pool_list_experts -> list_experts

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Ship a macOS launchd plist template with placeholder markers for binary
path and pool directory. Includes install/stop/unload documentation
and notes on KeepAlive and double-signal behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Apply action-oriented logging pattern to socket infrastructure:
- Info: stop command received
- Warn: invalid request, unknown method, accept failures
- Debug: subscriber connect/disconnect

Also align daemon drain messages with the narrative pattern
(Preparing to / Successfully / Skipping. Reason:).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Full-scope audit after v0.6 implementation. 69.9% coverage, 25 test
files, no P0 issues. Key gaps: DiscoverPoolDir (high risk),
connectAndSend (medium risk). Main quality concern is time.Sleep-based
synchronization in integration tests (30+ occurrences).

[skip-ci]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@coderabbitai

coderabbitai Bot commented Apr 5, 2026

Copy link
Copy Markdown

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a Unix-domain socket control plane and CLI subcommands (stop, status, watch); integrates socket lifecycle into the daemon Run loop; implements graceful drain with WaitGroup and configurable drain timeout; introduces an in-process event bus for NDJSON streaming; makes session timeouts optional; and removes the pool_ prefix from MCP tool names across code and docs.

Changes

Cohort / File(s) Summary
Daemon core & lifecycle
internal/daemon/daemon.go, internal/daemon/daemon_test.go
Integrates socket server lifecycle into Run(), adds wg drain tracking, drainTimeout and WithDrainTimeout, WithSocketPath, startedAt/sockPath fields, and Status(); shutdown cancels childCtx → wait (with timeout) → close socket; test helper accepts variadic daemon options.
Socket server & tests
internal/daemon/socket.go, internal/daemon/socket_test.go
New Unix-domain socket server (NDJSON) supporting stop, status, subscribe; stale-socket cleanup, per-conn deadlines, subscription streaming; tests for stop/status/subscribe, stale socket handling, and drain semantics.
Event bus & tests
internal/daemon/events.go, internal/daemon/events_test.go
Adds typed Event/EventType payloads, eventBus with buffered subscriber channels (EventBufSize=64), non-blocking Emit (drops for slow subscribers), subscribe/unsubscribe APIs, and unit tests for delivery, unsubscribe, and buffer-drop behavior.
CLI: commands & signal handling
cmd/agent-pool/main.go
Adds stop, status, watch commands and helpers (connectAndSend, socketResponse); implements watch streaming with colored, event-specific output; updates cmdStart for double-signal shutdown; updates usage and bumps version to agent-pool v0.6.0-dev.
Config: socket path & session timeout
internal/config/socket.go, internal/config/config.go, internal/config/config_test.go
Adds ResolveSockPath(poolDir) deterministic fallback path for long paths; DefaultsSection.ParseSessionTimeout treats empty as zero (no timeout); removes the prior "10m" default and updates tests to expect empty string.
MCP tool renames
internal/mcp/tools.go, internal/mcp/architect_tools.go, internal/mcp/concierge_tools.go, internal/mcp/config.go, internal/mcp/*_test.go
Mechanical rename removing the pool_ prefix from registered MCP tool names (architect, concierge, expert tools); tests and plugin docs updated accordingly. list_experts now adds experts key while retaining pool_experts alias.
Docs, plugins & launchd
docs/launchd.md, docs/plans/..., docs/test-audit-report.md, plugin/*, scripts/com.agent-pool.daemon.plist
Adds macOS launchd plist and documentation for socket lifecycle, watch/stop/status usage, session-timeout change, event streaming, and updates plugin/skill docs and audit report to reflect tool-name and timeout changes.
CLI watch formatting & helpers
cmd/agent-pool/main.go (same file)
watch streams newline-delimited JSON events and prints colorized, event-specific output; connectAndSend applies request/response deadlines and decodes single-line JSON responses.

Sequence Diagram(s)

sequenceDiagram
    participant CLI as agent-pool CLI
    participant Socket as Unix Socket Server
    participant Daemon as Daemon.Run
    participant Events as EventBus

    CLI->>Socket: connect to daemon.sock
    Socket->>Daemon: accept & dispatch {method}
    alt method == "subscribe"
        Socket->>Daemon: subscribe request
        Daemon->>Events: register subscriber
        Socket->>CLI: send ack (NDJSON)
        loop events
            Daemon->>Events: emit(event)
            Events->>Socket: deliver to subscriber channel
            Socket->>CLI: write NDJSON event
        end
    else method == "status"
        Socket->>Daemon: request Status()
        Daemon->>Socket: return status JSON
        Socket->>CLI: send response then close
    else method == "stop"
        Socket->>Daemon: request stop -> cancel childCtx
        Daemon->>Daemon: wait for wg (drainTimeout)
        Daemon->>Socket: respond ok then exit
    end
Loading
sequenceDiagram
    participant User
    participant CLI as agent-pool CLI
    participant Socket as Socket Server
    participant Daemon as Daemon
    participant WG as WaitGroup

    User->>CLI: agent-pool stop
    CLI->>Socket: connect & send {method:"stop"}
    Socket->>Daemon: invoke stop handler (calls cancel)
    Daemon->>WG: wait for in-flight goroutines (up to drainTimeout)
    alt drained within timeout
        WG-->>Daemon: done
        Daemon->>Socket: close listener & remove socket file
        Daemon->>CLI: exit 0
    else timeout expired
        Daemon->>Daemon: log warning, exit
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

Poem

🐰 A soft socket hum beneath my paws,
I nibble events and chew the pause,
No ticking fence to rush the architect’s art,
Send "stop" polite — we'll drain before we part,
Two taps and I bolt — quick as a hop, not a pause.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 31.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title "feat: v0.6 daemon lifecycle + observability" accurately captures the main changes: adding daemon lifecycle management via Unix socket (stop/status/watch commands), graceful shutdown, and event-based observability (watch command with ANSI-colored output).
Linked Issues check ✅ Passed The PR comprehensively addresses all five linked issues: #8 (socket-based stop/status commands), #9 (WaitGroup-based graceful drain with 30s timeout and double-signal handling), #10 (event bus with six event types and watch command), #11 (optional session timeouts with context.WithCancel fallback), and #13 (removal of pool_ prefix from 16 MCP tool names across code/tests/docs).
Out of Scope Changes check ✅ Passed All changes directly support the PR objectives: socket server, CLI commands, event bus, graceful drain, optional timeouts, tool name refactoring, launchd plist, and documentation are all within scope of issues #8-#11 and #13. No extraneous changes detected.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch feat/v0.6-daemon-lifecycle

Comment @coderabbitai help to get the list of available commands and usage tips.

Apply CSO (Claude Search Optimization) to skill descriptions:
- Change from workflow summaries to triggering conditions ("Use when...")
- This prevents Claude from shortcutting the skill body

Content improvements:
- pool-ask: Use dispatch+collect (non-blocking) instead of ask_expert
- pool-build: Same dispatch+collect pattern for expert input gathering
- concierge-identity: Add dispatch and collect to tools table, use
  table format for tool reference

[skip-ci]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
internal/config/config.go (1)

65-73: ⚠️ Potential issue | 🟠 Major

Reject negative session_timeout values.

time.ParseDuration accepts negative durations (e.g., "-1s"), and the daemon's timeout > 0 check means a negative value would silently fall through to context.WithCancel and run unbounded instead of timing out. Add validation to fail fast on negative values.

Suggested fix
 func (d DefaultsSection) ParseSessionTimeout() (time.Duration, error) {
 	if d.SessionTimeout == "" {
 		return 0, nil
 	}
 	dur, err := time.ParseDuration(d.SessionTimeout)
 	if err != nil {
 		return 0, fmt.Errorf("parsing defaults.session_timeout %q: %w", d.SessionTimeout, err)
 	}
+	if dur < 0 {
+		return 0, fmt.Errorf("parsing defaults.session_timeout %q: must be >= 0", d.SessionTimeout)
+	}
 	return dur, nil
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/config/config.go` around lines 65 - 73, In ParseSessionTimeout,
time.ParseDuration can return negative durations which the daemon currently
treats as no timeout; after parsing the duration (dur, err :=
time.ParseDuration(d.SessionTimeout)), add a validation that rejects negative
values (if dur < 0) and return a descriptive error (e.g.,
fmt.Errorf("defaults.session_timeout %q must not be negative",
d.SessionTimeout)); leave empty string handling as-is so only explicit negative
durations fail fast.
internal/mcp/config.go (1)

10-20: ⚠️ Potential issue | 🟠 Major

Architect and concierge tools are missing from headless pre-approval.

ExpertToolNames only includes the 6 expert-scope tools. However, the same spawning flow applies to architect and concierge roles—they also go through processInboxMessage() and receive only ExpertToolNames pre-approval, plus defaults from config.

Architect role registers define_contract, send_task, verify_result, and amend_contract via RegisterArchitectTools(). Concierge role registers ask_expert, submit_plan, check_status, and list_experts via RegisterConciergeTools(). These role-specific tools are not in the pre-approved list and would be blocked in Claude's headless mode.

Add ArchitectToolNames and ConciergeToolNames variables with the appropriate tool names in the format mcp__agent-pool__<tool-name>, and append them when spawning architect/concierge roles:

if mail.IsBuiltinRole(expertName) && expertName == "architect" {
    allTools = append(allTools, agentmcp.ArchitectToolNames...)
} else if mail.IsBuiltinRole(expertName) && expertName == "concierge" {
    allTools = append(allTools, agentmcp.ConciergeToolNames...)
}

Or consolidate pre-approval logic by checking cfg.Role in the MCP server registration to determine which tools to pre-approve.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@internal/mcp/config.go` around lines 10 - 20, ExpertToolNames currently only
lists expert-scoped MCP tool names, so architect and concierge role tools
registered by RegisterArchitectTools() and RegisterConciergeTools() are not
pre-approved in headless mode; add two new slices ArchitectToolNames and
ConciergeToolNames containing the mcp__agent-pool__<tool-name> entries for
define_contract, send_task, verify_result, amend_contract and ask_expert,
submit_plan, check_status, list_experts respectively, and update the
spawning/pre-approval logic in processInboxMessage (where allTools is assembled)
to append agentmcp.ArchitectToolNames when mail.IsBuiltinRole(expertName) &&
expertName == "architect" and append agentmcp.ConciergeToolNames when
mail.IsBuiltinRole(expertName) && expertName == "concierge" (alternatively
consolidate by branching on cfg.Role in the MCP server registration path) so
architect/concierge tools are included alongside ExpertToolNames.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/agent-pool/main.go`:
- Around line 118-142: The CLI commands (cmdStop, cmdStatus, cmdWatch) currently
use filepath.Join(poolDir, "daemon.sock") which mismatches the daemon's
resolveSockPath() fallback for long paths; add a shared helper (e.g.,
ResolveSockPath(poolDir string)) that implements the same logic as
resolveSockPath (use the candidate path if <=100 chars, otherwise hash poolDir
with fnv.New32a and return os.TempDir()/ap-{hash}.sock), put it in a shared
package (e.g., internal/config/socket.go), and replace direct filepath.Join
calls in cmdStop/cmdStatus/cmdWatch and the daemon's resolveSockPath to call
ResolveSockPath so both sides compute the same socket path.

In `@internal/daemon/events_test.go`:
- Around line 79-100: The test hardcodes the subscriber buffer size (64) causing
brittle duplication; define an exported constant EventBusBufferCapacity = 64 in
the events package (where the bus/channel is created) and update
internal/daemon/events_test.go to use that constant (e.g., emit
EventBusBufferCapacity + 16 events and assert count == EventBusBufferCapacity)
instead of literal 64 so the test stays correct if the capacity changes;
reference the bus creation symbol (e.g., EventBus or the event subscription
code) to ensure the constant is used where the channel buffer is allocated and
in the test assertion.

In `@internal/daemon/socket.go`:
- Around line 171-181: writeResponse currently ignores encoding errors; modify
socketServer.writeResponse to capture the error returned by
json.NewEncoder(conn).Encode(resp) and, if non-nil, emit a debug-level log
including the error and minimal context (e.g. the resp type/id or a short
string) so client disconnects are diagnosable. Keep the existing
SetWriteDeadline call and ensure you call the socketServer logger (e.g.
s.logger.Debugf or s.log.Debugf depending on the struct field used on
socketServer) rather than panicking or returning an error.

In `@internal/mcp/concierge_tools.go`:
- Around line 512-515: The change replaced the public response key
"pool_experts" with "experts", which breaks callers; restore backwards
compatibility by including the old key as an alias while adding the new key
(i.e., ensure the map returned by the code that builds result contains both
"pool_experts": poolExperts and "experts": poolExperts, and similarly keep
"shared_experts" as-is), or alternatively add an explicit version field to the
response; update the construction that references poolExperts and
poolCfg.Shared.Include so both keys are present for one release.

In `@scripts/com.agent-pool.daemon.plist`:
- Around line 15-29: The plist currently lacks an ExitTimeOut setting so launchd
may SIGKILL the daemon after its 20s default; add an ExitTimeOut key with value
35 (e.g., <key>ExitTimeOut</key> with integer 35) to the existing plist
alongside the other keys (near RunAtLoad/KeepAlive/WorkingDirectory) so the
daemon has 35 seconds to complete its graceful drain before being force-killed.

---

Outside diff comments:
In `@internal/config/config.go`:
- Around line 65-73: In ParseSessionTimeout, time.ParseDuration can return
negative durations which the daemon currently treats as no timeout; after
parsing the duration (dur, err := time.ParseDuration(d.SessionTimeout)), add a
validation that rejects negative values (if dur < 0) and return a descriptive
error (e.g., fmt.Errorf("defaults.session_timeout %q must not be negative",
d.SessionTimeout)); leave empty string handling as-is so only explicit negative
durations fail fast.

In `@internal/mcp/config.go`:
- Around line 10-20: ExpertToolNames currently only lists expert-scoped MCP tool
names, so architect and concierge role tools registered by
RegisterArchitectTools() and RegisterConciergeTools() are not pre-approved in
headless mode; add two new slices ArchitectToolNames and ConciergeToolNames
containing the mcp__agent-pool__<tool-name> entries for define_contract,
send_task, verify_result, amend_contract and ask_expert, submit_plan,
check_status, list_experts respectively, and update the spawning/pre-approval
logic in processInboxMessage (where allTools is assembled) to append
agentmcp.ArchitectToolNames when mail.IsBuiltinRole(expertName) && expertName ==
"architect" and append agentmcp.ConciergeToolNames when
mail.IsBuiltinRole(expertName) && expertName == "concierge" (alternatively
consolidate by branching on cfg.Role in the MCP server registration path) so
architect/concierge tools are included alongside ExpertToolNames.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 68b1662f-c860-4f6f-97bf-263176df54ae

📥 Commits

Reviewing files that changed from the base of the PR and between 2397d65 and 4b408a2.

📒 Files selected for processing (24)
  • cmd/agent-pool/main.go
  • docs/launchd.md
  • docs/plans/2026-04-05-daemon-lifecycle.md
  • docs/test-audit-report.md
  • internal/config/config.go
  • internal/config/config_test.go
  • internal/daemon/daemon.go
  • internal/daemon/daemon_test.go
  • internal/daemon/events.go
  • internal/daemon/events_test.go
  • internal/daemon/socket.go
  • internal/daemon/socket_test.go
  • internal/mcp/architect_tools.go
  • internal/mcp/architect_tools_test.go
  • internal/mcp/concierge_tools.go
  • internal/mcp/concierge_tools_test.go
  • internal/mcp/config.go
  • internal/mcp/tools.go
  • internal/mcp/tools_test.go
  • plugin/concierge-identity.md
  • plugin/skills/pool-ask.md
  • plugin/skills/pool-build.md
  • plugin/skills/pool-status.md
  • scripts/com.agent-pool.daemon.plist

Comment thread cmd/agent-pool/main.go
Comment thread internal/daemon/events_test.go Outdated
Comment thread internal/daemon/socket.go
Comment thread internal/mcp/concierge_tools.go
Comment thread scripts/com.agent-pool.daemon.plist

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@plugin/skills/pool-build.md`:
- Around line 34-35: The docs say to call submit_plan with only the plan body
but the implementation still accepts an optional contracts parameter
(comma-separated contract IDs) in the submit_plan tool and its handler; either
document contracts as an optional parameter (describe format, example, and
behavior) or remove/deprecate the parameter from the submit_plan handler and
tool registration so behavior and docs match; update the submit_plan tool docs
to explicitly mention the optional contracts parameter and its format (or add a
deprecation notice) and ensure the submit_plan handler that parses contracts is
kept in sync with the documentation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 19f59619-c829-4bc7-a2f5-563515c8a966

📥 Commits

Reviewing files that changed from the base of the PR and between 4b408a2 and 2a07d2b.

📒 Files selected for processing (4)
  • plugin/concierge-identity.md
  • plugin/skills/pool-ask.md
  • plugin/skills/pool-build.md
  • plugin/skills/pool-status.md

Comment thread plugin/skills/pool-build.md Outdated
CodeRabbit findings:

1. [Critical] Socket path mismatch — CLI used hardcoded path while daemon
   had fnv fallback for long paths. Extracted shared config.ResolveSockPath
   used by both sides.

2. [Nitpick] Magic number 64 in event bus test — exported EventBufSize
   constant, test uses it instead of hardcoded value.

3. [Nitpick] writeResponse silently ignored errors — added debug log.

4. [Major] list_experts response renamed pool_experts to experts without
   compatibility — added pool_experts as deprecated alias alongside new
   experts key.

5. [Major] launchd plist missing ExitTimeOut — added 35s to exceed the
   30s drain timeout, preventing SIGKILL before graceful shutdown completes.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/agent-pool/main.go`:
- Around line 174-180: The printStatusField helper currently ignores the error
returned by json.Unmarshal which can lead to printing incorrect or empty values;
update printStatusField to capture the error (e.g., err := json.Unmarshal(v,
&s)) and if err != nil simply skip printing the field (return) so only
successfully unmarshalled string values are printed; reference the
printStatusField closure and its use of label, key, data, json.Unmarshal and
variable s when making the change.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: d1b51232-15ac-4ac6-8f67-a7021f493c75

📥 Commits

Reviewing files that changed from the base of the PR and between 2a07d2b and 9132934.

📒 Files selected for processing (8)
  • cmd/agent-pool/main.go
  • internal/config/socket.go
  • internal/daemon/daemon.go
  • internal/daemon/events.go
  • internal/daemon/events_test.go
  • internal/daemon/socket.go
  • internal/mcp/concierge_tools.go
  • scripts/com.agent-pool.daemon.plist

Comment thread cmd/agent-pool/main.go
Comment on lines +174 to +180
printStatusField := func(label, key string) {
if v, ok := data[key]; ok {
var s string
json.Unmarshal(v, &s)
fmt.Printf("%-10s %s\n", label+":", s)
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick | 🔵 Trivial

Consider handling unmarshal errors in printStatusField.

The helper ignores json.Unmarshal errors, which is acceptable for string fields since they'll default to empty. However, for robustness, you might want to skip printing the field if unmarshal fails for non-string types.

💡 Optional: Handle unmarshal errors
 	printStatusField := func(label, key string) {
 		if v, ok := data[key]; ok {
 			var s string
-			json.Unmarshal(v, &s)
-			fmt.Printf("%-10s %s\n", label+":", s)
+			if err := json.Unmarshal(v, &s); err == nil {
+				fmt.Printf("%-10s %s\n", label+":", s)
+			}
 		}
 	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/agent-pool/main.go` around lines 174 - 180, The printStatusField helper
currently ignores the error returned by json.Unmarshal which can lead to
printing incorrect or empty values; update printStatusField to capture the error
(e.g., err := json.Unmarshal(v, &s)) and if err != nil simply skip printing the
field (return) so only successfully unmarshalled string values are printed;
reference the printStatusField closure and its use of label, key, data,
json.Unmarshal and variable s when making the change.

1. [Minor] pool-build skill omitted contracts parameter from submit_plan
   call — restored parameter list matching implementation.

2. [Nitpick] printStatusField ignored unmarshal errors — added fallback
   to print raw JSON when field isn't a string.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/agent-pool/main.go`:
- Around line 277-374: The current watcher only prints events (the scanner loop
handling event struct and switch on e.Type) but never loads or maintains task
state for a live table; update the logic to fetch an initial task snapshot (via
the same backend API used for events) into an in-memory map keyed by task ID,
implement a redraw function that clears the terminal and prints a compact task
table (showing status, expert, model, duration, summary) and call it initially
and whenever events arrive, and in the scanner loop update the in-memory task
entries based on event types ("task.routed", "expert.spawning",
"expert.completed", "expert.failed", "task.cancelled", "task.unblocked") before
calling redraw; keep the existing event log output but use fmt.Printf and the
ANSI constants (reset, green, red, yellow, cyan) to colorize both the table rows
and the appended event line.
- Around line 124-130: The discovered poolDir returned by config.DiscoverPoolDir
must be canonicalized before being passed to config.ResolveSockPath (or
alternately make ResolveSockPath canonicalize internally); update the client
commands that call config.ResolveSockPath (e.g., the stop/status/watch command
handling in main.go where connectAndSend(config.ResolveSockPath(poolDir),
"stop") is invoked) to first run filepath.Abs (or an equivalent normalization)
on poolDir and handle the error, then pass the normalized path into
config.ResolveSockPath (or modify ResolveSockPath to call filepath.Abs at its
start and use the normalized string for both path joining and hashing).
- Around line 255-264: The initial ack read using bufio.NewScanner(conn) and the
NDJSON read loop both ignore scanner.Err() when Scan() returns false; update the
ack block (where `scanner := bufio.NewScanner(conn)` and `var ack
socketResponse` are used) to check `if !scanner.Scan() { if err :=
scanner.Err(); err != nil { fmt.Fprintf(os.Stderr, "error: reading ack: %v\n",
err); os.Exit(1) } fmt.Fprintf(os.Stderr, "error: no ack from daemon\n");
os.Exit(1) }` and update the loop that uses `for scanner.Scan()` to, after the
loop exits, check `if err := scanner.Err(); err != nil { /* surface error
(log/exit/return) */ }` following the same pattern used in `connectAndSend`
(lines 399–409) so frame-size or I/O errors are reported instead of silently
falling through.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: cd701a48-cc25-4588-8faa-623c2371977c

📥 Commits

Reviewing files that changed from the base of the PR and between 9132934 and 85f7ea0.

📒 Files selected for processing (2)
  • cmd/agent-pool/main.go
  • plugin/skills/pool-build.md

Comment thread cmd/agent-pool/main.go
Comment thread cmd/agent-pool/main.go
Comment thread cmd/agent-pool/main.go
Comment on lines +277 to +374
fmt.Println("Watching daemon events (Ctrl-C to stop)...")
fmt.Println()

// ANSI colors
const (
reset = "\033[0m"
green = "\033[32m"
red = "\033[31m"
yellow = "\033[33m"
cyan = "\033[36m"
)

type event struct {
Type string `json:"type"`
Timestamp time.Time `json:"timestamp"`
Data json.RawMessage `json:"data"`
}

for scanner.Scan() {
var e event
if err := json.Unmarshal(scanner.Bytes(), &e); err != nil {
continue
}

ts := e.Timestamp.Format("15:04:05")
var color, detail string

switch e.Type {
case "task.routed":
color = cyan
var d struct {
ID string `json:"id"`
From string `json:"from"`
To string `json:"to"`
}
json.Unmarshal(e.Data, &d)
detail = fmt.Sprintf("%s -> %s (%s)", d.From, d.To, d.ID)

case "expert.spawning":
color = yellow
var d struct {
Expert string `json:"expert"`
TaskID string `json:"task_id"`
Model string `json:"model"`
}
json.Unmarshal(e.Data, &d)
detail = fmt.Sprintf("%s task=%s model=%s", d.Expert, d.TaskID, d.Model)

case "expert.completed":
color = green
var d struct {
Expert string `json:"expert"`
TaskID string `json:"task_id"`
Duration string `json:"duration"`
Summary string `json:"summary"`
}
json.Unmarshal(e.Data, &d)
detail = fmt.Sprintf("%s task=%s %s", d.Expert, d.TaskID, d.Duration)
if d.Summary != "" {
if len(d.Summary) > 60 {
d.Summary = d.Summary[:60] + "..."
}
detail += " " + d.Summary
}

case "expert.failed":
color = red
var d struct {
Expert string `json:"expert"`
TaskID string `json:"task_id"`
ExitCode int `json:"exit_code"`
}
json.Unmarshal(e.Data, &d)
detail = fmt.Sprintf("%s task=%s exit=%d", d.Expert, d.TaskID, d.ExitCode)

case "task.cancelled":
color = red
var d struct {
TaskID string `json:"task_id"`
}
json.Unmarshal(e.Data, &d)
detail = d.TaskID

case "task.unblocked":
color = green
var d struct {
TaskID string `json:"task_id"`
Expert string `json:"expert"`
}
json.Unmarshal(e.Data, &d)
detail = fmt.Sprintf("%s -> %s", d.TaskID, d.Expert)

default:
detail = string(e.Data)
}

fmt.Printf("[%s] %s%-18s%s %s\n", ts, color, e.Type, reset, detail)
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

watch is still event-only.

Issue #10 calls for a live task table plus the event stream, but this implementation never fetches an initial status snapshot or maintains task state for redraws. Operators only get an append-only log.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/agent-pool/main.go` around lines 277 - 374, The current watcher only
prints events (the scanner loop handling event struct and switch on e.Type) but
never loads or maintains task state for a live table; update the logic to fetch
an initial task snapshot (via the same backend API used for events) into an
in-memory map keyed by task ID, implement a redraw function that clears the
terminal and prints a compact task table (showing status, expert, model,
duration, summary) and call it initially and whenever events arrive, and in the
scanner loop update the in-memory task entries based on event types
("task.routed", "expert.spawning", "expert.completed", "expert.failed",
"task.cancelled", "task.unblocked") before calling redraw; keep the existing
event log output but use fmt.Printf and the ANSI constants (reset, green, red,
yellow, cyan) to colorize both the table rows and the appended event line.

1. [Critical] ResolveSockPath now canonicalizes poolDir with filepath.Abs
   internally, ensuring daemon and CLI always compute the same socket
   path regardless of relative vs absolute input.

2. [Major] cmdWatch now checks scanner.Err() after ack read and after
   the event stream loop, surfacing frame-size and I/O errors instead
   of silently dropping out.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cmd/agent-pool/main.go (1)

96-109: ⚠️ Potential issue | 🟠 Major

A race condition exists in the double-signal handling: the second signal can be absorbed by the original NotifyContext handler before stop() executes.

The current implementation waits on ctx.Done(), then calls stop(), then registers a new signal handler. If a second signal arrives during the window between <-ctx.Done() returning and stop() completing, the original NotifyContext handler (which is still active) will absorb it into its internal buffer, leaving the process waiting for a third signal or the drain timeout.

Using a single buffered channel registered once avoids this race:

🐛 Suggested pattern
-	ctx, stop := signal.NotifyContext(context.Background(),
-		syscall.SIGTERM, syscall.SIGINT)
-	defer stop()
-
-	// Double-signal: first signal triggers graceful drain, second forces exit.
-	go func() {
-		<-ctx.Done()
-		stop() // reset signal handling to default
-		sigCh := make(chan os.Signal, 1)
-		signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)
-		<-sigCh
-		logger.Warn("Received second signal, forcing immediate exit")
-		os.Exit(1)
-	}()
+	ctx, cancel := context.WithCancel(context.Background())
+	sigCh := make(chan os.Signal, 2)
+	signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)
+	defer signal.Stop(sigCh)
+	defer cancel()
+
+	// Double-signal: first signal triggers graceful drain, second forces exit.
+	go func() {
+		<-sigCh
+		cancel()
+		<-sigCh
+		logger.Warn("Received second signal, forcing immediate exit")
+		os.Exit(1)
+	}()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/agent-pool/main.go` around lines 96 - 109, The double-signal goroutine
currently races because it calls stop() after ctx.Done() and re-registers a new
signal handler; replace this with a single buffered channel registered once:
create sigCh := make(chan os.Signal, 1) and call signal.Notify(sigCh,
syscall.SIGTERM, syscall.SIGINT) at startup, then drive graceful shutdown from
ctx (use context.WithCancel or the existing ctx) by reading the first signal
from sigCh to trigger the existing shutdown (cancel/stop), and then block on the
same sigCh for a second signal to log via logger.Warn and os.Exit(1); remove the
signal.NotifyContext usage and the re-registration inside the goroutine so no
signals can be lost.
♻️ Duplicate comments (1)
cmd/agent-pool/main.go (1)

247-377: ⚠️ Potential issue | 🟠 Major

watch is still append-only.

Issue #10 calls for a live task table plus the event stream, but this implementation never fetches an initial snapshot or keeps per-task state for redraws. Operators only get a colored log feed right now. The existing status RPC should be enough to seed the first render before the subscribe loop.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/agent-pool/main.go` around lines 247 - 377, The watch command is
append-only; call the existing status RPC to fetch an initial snapshot, seed an
in-memory tasks map, and switch from printing appended lines to redrawing a live
table on each event. Before sending the subscribe request (around the
conn.Encode("subscribe") call), send a JSON status request (method:"status"),
decode the response into a tasks structure (use the same socketResponse/status
shape used elsewhere or define statusResponse), populate a map[string]Task keyed
by task ID, and render it via a new renderTasks function that prints a table (or
clears the terminal and reprints). Inside the event loop (the event struct,
switch on e.Type and the various d structs), update the tasks map according to
events (e.g., task.routed, expert.spawning/completed/failed,
task.cancelled/unblocked) instead of printing detail, then call renderTasks
after each update; keep the subscribe/ack handling, deadline, and signal
handling as-is.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cmd/agent-pool/main.go`:
- Around line 273-279: The signal handler currently closes conn which makes
scanner.Scan() return an error that is then logged as "error: stream
interrupted" even on intentional Ctrl-C; add an intentional-shutdown flag (e.g.,
a package-level uint32 used with sync/atomic or an atomic.Bool) and set it
inside the goroutine before calling conn.Close() (where sigCh, conn.Close() are
used), then when checking scanner.Err() (and the similar check around lines
380-383), suppress logging/exit if the intentional-shutdown flag is set; apply
the same pattern to the other signal handler block so intentional closes don’t
produce error messages.

---

Outside diff comments:
In `@cmd/agent-pool/main.go`:
- Around line 96-109: The double-signal goroutine currently races because it
calls stop() after ctx.Done() and re-registers a new signal handler; replace
this with a single buffered channel registered once: create sigCh := make(chan
os.Signal, 1) and call signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT) at
startup, then drive graceful shutdown from ctx (use context.WithCancel or the
existing ctx) by reading the first signal from sigCh to trigger the existing
shutdown (cancel/stop), and then block on the same sigCh for a second signal to
log via logger.Warn and os.Exit(1); remove the signal.NotifyContext usage and
the re-registration inside the goroutine so no signals can be lost.

---

Duplicate comments:
In `@cmd/agent-pool/main.go`:
- Around line 247-377: The watch command is append-only; call the existing
status RPC to fetch an initial snapshot, seed an in-memory tasks map, and switch
from printing appended lines to redrawing a live table on each event. Before
sending the subscribe request (around the conn.Encode("subscribe") call), send a
JSON status request (method:"status"), decode the response into a tasks
structure (use the same socketResponse/status shape used elsewhere or define
statusResponse), populate a map[string]Task keyed by task ID, and render it via
a new renderTasks function that prints a table (or clears the terminal and
reprints). Inside the event loop (the event struct, switch on e.Type and the
various d structs), update the tasks map according to events (e.g., task.routed,
expert.spawning/completed/failed, task.cancelled/unblocked) instead of printing
detail, then call renderTasks after each update; keep the subscribe/ack
handling, deadline, and signal handling as-is.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 5b534128-da91-4027-8969-74fef671dc6f

📥 Commits

Reviewing files that changed from the base of the PR and between 85f7ea0 and 3afd8bb.

📒 Files selected for processing (2)
  • cmd/agent-pool/main.go
  • internal/config/socket.go

Comment thread cmd/agent-pool/main.go
Comment on lines +273 to +279
// Handle Ctrl-C cleanly
sigCh := make(chan os.Signal, 1)
signal.Notify(sigCh, syscall.SIGTERM, syscall.SIGINT)
go func() {
<-sigCh
conn.Close()
}()

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cd /repo && wc -l cmd/agent-pool/main.go

Repository: cameronsjo/agent-pool

Length of output: 120


🏁 Script executed:

cd /repo && sed -n '273,279p' cmd/agent-pool/main.go

Repository: cameronsjo/agent-pool

Length of output: 120


🏁 Script executed:

cd /repo && sed -n '380,383p' cmd/agent-pool/main.go

Repository: cameronsjo/agent-pool

Length of output: 120


🏁 Script executed:

cd /repo && sed -n '260,300p' cmd/agent-pool/main.go

Repository: cameronsjo/agent-pool

Length of output: 120


🏁 Script executed:

cd /repo && sed -n '370,400p' cmd/agent-pool/main.go

Repository: cameronsjo/agent-pool

Length of output: 120


🏁 Script executed:

find . -name "main.go" -path "*/agent-pool/*" -type f

Repository: cameronsjo/agent-pool

Length of output: 89


🏁 Script executed:

wc -l cmd/agent-pool/main.go

Repository: cameronsjo/agent-pool

Length of output: 91


🏁 Script executed:

sed -n '273,279p' cmd/agent-pool/main.go

Repository: cameronsjo/agent-pool

Length of output: 222


🏁 Script executed:

sed -n '380,383p' cmd/agent-pool/main.go

Repository: cameronsjo/agent-pool

Length of output: 184


🏁 Script executed:

sed -n '260,300p' cmd/agent-pool/main.go

Repository: cameronsjo/agent-pool

Length of output: 967


🏁 Script executed:

sed -n '370,410p' cmd/agent-pool/main.go

Repository: cameronsjo/agent-pool

Length of output: 1294


🏁 Script executed:

sed -n '240,280p' cmd/agent-pool/main.go

Repository: cameronsjo/agent-pool

Length of output: 1179


🏁 Script executed:

sed -n '330,410p' cmd/agent-pool/main.go

Repository: cameronsjo/agent-pool

Length of output: 2292


🏁 Script executed:

rg "scanner.Err\(\)" cmd/agent-pool/main.go -A 2 -B 2

Repository: cameronsjo/agent-pool

Length of output: 514


🏁 Script executed:

rg "net.ErrClosed" . -t go

Repository: cameronsjo/agent-pool

Length of output: 47


Track intentional shutdown to suppress error message on Ctrl-C.

The signal handler closes conn when Ctrl-C is pressed, causing scanner.Scan() to fail. The subsequent scanner.Err() check will return that error and print "error: stream interrupted" before exiting 1, even though this is a normal, graceful shutdown. Set a flag when the signal handler triggers and suppress the error message for intentional closes.

Also applies to: 380-383

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cmd/agent-pool/main.go` around lines 273 - 279, The signal handler currently
closes conn which makes scanner.Scan() return an error that is then logged as
"error: stream interrupted" even on intentional Ctrl-C; add an
intentional-shutdown flag (e.g., a package-level uint32 used with sync/atomic or
an atomic.Bool) and set it inside the goroutine before calling conn.Close()
(where sigCh, conn.Close() are used), then when checking scanner.Err() (and the
similar check around lines 380-383), suppress logging/exit if the
intentional-shutdown flag is set; apply the same pattern to the other signal
handler block so intentional closes don’t produce error messages.

@cameronsjo cameronsjo merged commit 460c182 into main Apr 5, 2026
2 checks passed
@cameronsjo cameronsjo deleted the feat/v0.6-daemon-lifecycle branch April 5, 2026 20:20
@coderabbitai coderabbitai Bot mentioned this pull request Apr 7, 2026
5 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

1 participant