Decouple nvm/node bootstrap from agent Run/Exec commands for clearer failure attribution #72

@zpzjzj

Problem

nodeRuntimeCommandWithGuard (internal/agent/node_install.go:72) wraps the agent's real command (e.g. claude -p ..., codex exec --json ...) with the nvm/Node.js bootstrap and ships the whole thing as a single multi-line shell script to one rt.Exec call.

This means the node-runtime bootstrap and the agent invocation share one Exec, one exit code, one stderr, and one stdout. They are operationally distinct phases but indistinguishable in the result.

Affected call sites

| Agent | Call site | Coupled command |
| --- | --- | --- |
| Claude Code (Run) | internal/agent/claude_code.go:127 | nodeRuntimeCommandWithGuard("claude", buildClaudePrintCmd(...)) |
| Claude Code (MCP install) | internal/agent/claude_code.go:72 | nodeRuntimeCommandWithGuard("claude", cmd) |
| Codex (Run) | internal/agent/codex.go:236 | nodeRuntimeCommandWithGuard("codex", buildCodexRunCmdWithLastMessage(...)) |
| Codex (MCP install) | internal/agent/codex.go:136, :171 | nodeRuntimeCommandWithGuard("codex", cmd.String()) |

qodercli is not affected (its Run/InstallMCP don't use this wrapper, and its install path uses curl ... | bash standalone).

Why this affects judgment / debugging

  1. Misleading error attribution. When the bootstrap fails (e.g. curl to the nvm installer times out, sha256 mismatch, nvm install fails on the offline mirror), the error surfaces as claude-code run failed: ... / codex run failed: .... Operators have to read the multi-line shell script to figure out it was actually a node/nvm problem, not an agent invocation problem.

  2. Polluted stderr feeds signal detectors. providerAuthFailureSignal / providerRateLimitSignal in claude_code.go scan result.Stderr for substrings. Bootstrap noise (curl warnings, nvm install output, sha256 messages) is concatenated into the same buffer and widens the false-positive surface.

  3. Mixed stdout. When the run path also writes stdout.json from the same Exec, bootstrap-stage stdout (e.g. nvm install banners) can be prepended to the actual agent JSONL stream, so the artifact no longer starts with valid agent output.

  4. Bootstrap re-runs after Install. The Install step is supposed to be the one place that prepares the runtime. The Run-time guard makes the bootstrap silently fire again when claude/codex is somehow not on PATH at Run time, masking what should be a clear "Install was never called / Install was wiped" failure.

  5. Harder-to-read shell. The Exec'd script is now ~20 lines of mixed install + invocation logic, with set -e covering both phases, so diagnosing partial failures from the recorded artifact takes extra parsing.
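
To illustrate point 2, here is a minimal Go sketch of how a substring-based detector can fire on bootstrap noise. The function rateLimitSignal is a hypothetical stand-in for providerRateLimitSignal; the substrings it matches and the sample curl line are assumptions made for illustration, not taken from claude_code.go.

```go
package main

import (
	"fmt"
	"strings"
)

// rateLimitSignal is a hypothetical stand-in for providerRateLimitSignal.
// The real substrings are not shown in this issue; these two are assumed.
func rateLimitSignal(stderr string) bool {
	lower := strings.ToLower(stderr)
	return strings.Contains(lower, "rate limit") || strings.Contains(lower, "429")
}

func main() {
	// Illustrative bootstrap noise: a mirror rejecting the nvm installer download.
	bootstrapStderr := "curl: (22) The requested URL returned error: 429\n"
	// The agent itself printed nothing to stderr in this scenario.
	agentStderr := ""

	// Coupled Exec: both phases land in the same buffer.
	fmt.Println(rateLimitSignal(bootstrapStderr + agentStderr)) // true
	// Decoupled Exec: the detector only sees agent output.
	fmt.Println(rateLimitSignal(agentStderr)) // false
}
```

With a shared buffer the detector reports a provider rate limit even though the provider was never contacted.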

Proposed direction

Three options, not married to any one:

  • A. Two Execs. Run the node bootstrap (when needed) as a separate rt.Exec call, then run the agent command as its own Exec. Each has its own exit code, stderr, stdout and timestamps in the trace.
  • B. Lean Run. Strip Run/InstallMCP down to just the agent command (plus optionally a one-liner [ -s "$NVM_DIR/nvm.sh" ] && . "$NVM_DIR/nvm.sh" to source an already-installed nvm). Heavy bootstrap stays in Install only. If claude/codex is missing at Run time, return a structured error pointing the user at skill-up install (the current Check method is already designed for this).
  • C. Tagged sections. If keeping one Exec is necessary, prefix each phase's output with sentinels and split them in result.Stderr / result.Stdout before passing to detectors and persisters.
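
Option C could look roughly like the sketch below. The sentinel string agentMarker and the helper splitPhases are hypothetical names; the sketch assumes the wrapper script echoes the sentinel on its own line (to both stdout and stderr) immediately before invoking the agent. The sample output lines are made up.

```go
package main

import (
	"fmt"
	"strings"
)

// agentMarker is a hypothetical sentinel line echoed by the wrapper script
// right before the agent command starts.
const agentMarker = "::skill-up:agent-start::"

// splitPhases splits captured output at the first sentinel line.
// If the sentinel is absent the agent phase never started, so all
// output is attributed to the bootstrap phase.
func splitPhases(output string) (bootstrap, agent string) {
	before, after, found := strings.Cut(output, agentMarker+"\n")
	if !found {
		return output, ""
	}
	return before, after
}

func main() {
	// Made-up sample: one bootstrap line, the sentinel, one agent line.
	stderr := "installing node via nvm\n" + agentMarker + "\n" + "Error: invalid API key\n"
	bootstrap, agent := splitPhases(stderr)
	fmt.Printf("bootstrap=%q\nagent=%q\n", bootstrap, agent)
}
```

Only the agent half would then be handed to the signal detectors and written to stdout.json. A missing sentinel doubles as evidence that the failure happened during bootstrap.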

Option B (or A) seems most consistent with how qodercli already works.

Reproduction sketch

Temporarily break the nvm bootstrap (e.g. point NVM_SOURCE at a non-existent host) and call skill-up with the claude-code agent on a runtime that already has claude installed — the failure is reported as claude-code run failed, even though claude was never invoked.

Acceptance criteria

  • A node bootstrap failure no longer surfaces as an agent-run/agent-MCP failure.
  • result.Stderr consumed by providerAuthFailureSignal / providerRateLimitSignal contains only output from the agent process.
  • stdout.json artifact contains only agent output, no bootstrap output prefix.
  • qodercli behavior is unchanged.

Labels: enhancement (New feature or request)
