fix(onboard): surface network-policy hint when OpenClaw plugin install fails during sandbox build#5835
fix(onboard): surface network-policy hint when OpenClaw plugin install fails during sandbox build#5835Dongni-Yang wants to merge 6 commits into
Conversation
…the openclaw plugins install step When `openshell sandbox create` fails because the Dockerfile's `openclaw plugins install` RUN step exits non-zero, the build output contains the distinctive strings "openclaw plugins install" or "npm:@openclaw/", but `classifySandboxCreateFailure` had no branch for this pattern and fell through to `kind="unknown"`. The resulting hint was the generic "nemoclaw onboard --resume" line with no indication of the likely cause. Add a `"plugin_install_network_denied"` kind to `SandboxCreateFailure` and a classifier that matches those strings in the openshell create output. `printSandboxCreateRecoveryHints` now emits a targeted message noting that network policy may be blocking outbound access to ClawHub or the npm registry, and suggests disabling the feature (e.g. `NEMOCLAW_WEB_SEARCH_ENABLED=0`) as an alternative if a preset isn't available. Fixes #4127 Signed-off-by: Dongni Yang <dongniy@nvidia.com>
The previous regex matched `openclaw plugins install|npm:@openclaw/` anywhere in the captured output, which would mis-classify a build where that step succeeded and a later step failed — the Docker step header (`Step N: RUN openclaw plugins install ...`) would match even though the failure came from an unrelated subsequent RUN step. Tighten to the Docker error block format: The command '...<plugin text>...' returned a non-zero code `[^']*` is used (not `[^\n]*`) because the embedded shell command often spans multiple lines (chained commands joined with semicolons), and JS character classes match newlines. Add a regression test that verifies the step-header false-positive is rejected. Refs #4127 Signed-off-by: Dongni Yang <dongniy@nvidia.com>
…n-install failure as network-denied The previous classifier fired on any failed `openclaw plugins install` Docker command, including package-not-found (HTTP 404 from the registry), version conflicts, and auth errors — all of which would receive a misleading "network policy blocking egress" hint. Narrow the match to require both: 1. The Docker error block anchored to the failed plugin-install step 2. A network/egress reachability error (ENOTFOUND, EAI_AGAIN, ECONNREFUSED, ETIMEDOUT, fetch failed, etc.) in the captured output This ensures the hint "your sandbox network policy may be blocking outbound plugin-install access" is only shown when the underlying failure is actually a network reachability problem. Test changes: - Updated positive tests to include ENOTFOUND / ECONNREFUSED output matching real npm network error messages (registry.npmjs.org and ClawHub paths respectively) - Added negative test: same Docker failed-command block but E404 package-not-found → classifies as "unknown" - Added direct build-context.test.ts coverage for the new hint branch, asserting the key user-visible strings (plugin-install step, ClawHub, npm registry, network policy, NEMOCLAW_WEB_SEARCH_ENABLED=0, onboard --resume) Refs #4127 Signed-off-by: Dongni Yang <dongniy@nvidia.com>
📝 WalkthroughWalkthroughAdds a new sandbox-create failure kind for plugin-install network denial, classifies matching OpenClaw install errors, and prints recovery hints that mention the plugin-install step, network policy, and ChangesOpenClaw plugin install network denial
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related issues
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
Code Coverage OverviewLanguages: TypeScript TypeScript / code-coverage/pluginThe overall coverage in the branch is 96%. Coverage data for the branch is not yet available. Show a code coverage summary of the most covered files.
TypeScript / code-coverage/cliThe overall coverage in the branch is 47%. Coverage data for the branch is not yet available. Show a code coverage summary of the most covered files.
Updated |
PR Review Advisor — Changes requestedMerge posture: Do not merge yet Action checklist
Findings index
Review findings by urgency: 0 required fixes, 2 items to resolve/justify, 0 in-scope improvements
|
E2E Advisor RecommendationRequired E2E: Dispatch hint: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
Vitest E2E Scenario RecommendationRequired Vitest E2E scenarios: None Full Vitest E2E advisor summaryVitest E2E Scenario AdvisorBase: Required Vitest E2E scenarios
Optional Vitest E2E scenarios
Relevant changed files
|
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/validation.ts`:
- Around line 144-149: The classifier in validation.ts is too broad because it
matches any Docker RUN block containing openclaw plugins install, even when a
later command in the same block fails. Tighten the logic in the network-failure
branch so it only classifies plugin_install_network_denied when the failing
subcommand itself shows plugin-fetch/install evidence, using the existing
validation.ts matching and the relevant openclaw plugins install / npm:`@openclaw`
patterns as anchors. Update validation.test.ts with a regression case where
plugin install succeeds but a later command in the same shell block fails, and
verify it no longer returns the plugin install recovery hint.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: b142153c-913c-41f6-bb87-87ba3729f360
📒 Files selected for processing (4)
src/lib/build-context.test.tssrc/lib/build-context.tssrc/lib/validation.test.tssrc/lib/validation.ts
| if ( | ||
| /The command '[^']*(?:openclaw plugins install|npm:@openclaw\/)[^']*'\s*returned a non-zero code/i.test( | ||
| text, | ||
| ) && | ||
| /ENOTFOUND|EAI_AGAIN|ECONNREFUSED|ETIMEDOUT|ESOCKETTIMEDOUT|network request.*failed|getaddrinfo|fetch failed|socket hang up|network timeout/i.test( | ||
| text, |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟠 Major | ⚡ Quick win
Anchor this classifier to the failing subcommand, not just the enclosing Docker block.
Line 145 only proves the quoted RUN block contains openclaw plugins install; it does not prove that plugin install was the step that failed. The fixture in src/lib/validation.test.ts already shows the same block also runs openclaw doctor --fix, so a later network failure inside that shell block would still classify as plugin_install_network_denied and emit the wrong recovery hint. Please narrow this to plugin-fetch evidence or otherwise isolate the failing subcommand, and add a regression test for “plugin install succeeded, later command in the same block failed”.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/lib/validation.ts` around lines 144 - 149, The classifier in
validation.ts is too broad because it matches any Docker RUN block containing
openclaw plugins install, even when a later command in the same block fails.
Tighten the logic in the network-failure branch so it only classifies
plugin_install_network_denied when the failing subcommand itself shows
plugin-fetch/install evidence, using the existing validation.ts matching and the
relevant openclaw plugins install / npm:`@openclaw` patterns as anchors. Update
validation.test.ts with a regression case where plugin install succeeds but a
later command in the same shell block fails, and verify it no longer returns the
plugin install recovery hint.
…false positives from later RUN block commands The previous network-error predicate matched any ENOTFOUND/ECONNREFUSED/etc. in the captured output, which could fire the plugin-install hint when the plugin install itself succeeded but a later command in the same RUN block (e.g. `openclaw doctor --fix`) failed with a network error. npm's error output consistently prefixes every line with "npm error", whereas commands run after a successful install produce different error formats. Requiring "npm error" + network pattern anchors the evidence to the npm installer specifically, ruling out later-command false positives. Add a regression test: the same RUN block runs plugin install (succeeds) then openclaw doctor (fails with ENOTFOUND but no "npm error" prefix) → correctly returns "unknown". Refs #4127 Signed-off-by: Dongni Yang <dongniy@nvidia.com>
…get divergence Refs #4127 Signed-off-by: Dongni Yang <dongniy@nvidia.com>
Bound the network-evidence regex to the text up to and including the failed-plugin-install Docker error block rather than scanning the entire output. This prevents an npm script that runs after a successful plugin install in the same RUN block from producing a false plugin_install_network_denied classification when that later script emits an npm-prefixed network error. Adds a regression test for this case (npm script post-install in the same RUN block emits npm error ENOTFOUND after the Docker boundary). Refs #4127 Signed-off-by: Dongni Yang <dongniy@nvidia.com>
|
Addressing PRA-2 / CodeRabbit (windowed npm-error search): Both PRA-2 and the CodeRabbit comment correctly identified that scanning the entire output for Fixed in the latest commit ( A regression test was added for this case: Addressing PRA-1 (source-of-truth review): The classification is intentionally localized to NemoClaw's classifier layer. The source-of-truth facts:
|
Selective E2E Results — ✅ All requested jobs passedRun: 28261980087
|
Summary
When `openshell sandbox create` runs and the Docker build fails at the `openclaw plugins install` RUN step due to a network/egress reachability error, NemoClaw previously fell through to the generic "Recovery: nemoclaw onboard --resume" message with no indication of why the build failed.
This PR adds a new `plugin_install_network_denied` failure kind to `classifySandboxCreateFailure()` and a targeted hint in `printSandboxCreateRecoveryHints()` that:
The classifier requires both the Docker failed-command block embedding the plugin-install step and a network/egress reachability error (`ENOTFOUND`, `EAI_AGAIN`, `ECONNREFUSED`, `ETIMEDOUT`, etc.) within the text up to and including the Docker error boundary. Scoping the network-evidence search to that window prevents a network error from an unrelated later RUN block — or from an npm script that runs after a successful plugin install in the same block — from producing a misleading hint. The `npm error` prefix further anchors evidence to npm's own output format.
Changes
Test plan
Fixes #4127
Signed-off-by: Dongni Yang dongniy@nvidia.com