fix(sandbox): add host-mediated gateway restart#5874
Conversation
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds gateway restart control, PID 1 supervisor wiring, descriptor-safe state/config locking, identity-based process supervision, Hermes protocol handling, forward recovery, and timer-bound shields/mutation flows across CLI, scripts, docs, and tests. ChangesGateway restart, PID 1 supervisor, and shields hardening
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related PRs
Suggested labels
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
Code Coverage OverviewLanguages: TypeScript TypeScript / code-coverage/pluginThe overall coverage in the branch is 96%. Coverage data for the branch is not yet available. Show a code coverage summary of the most covered files.
TypeScript / code-coverage/cliThe overall coverage in the branch is 68%. Coverage data for the branch is not yet available. Show a code coverage summary of the most covered files.
Updated |
|
🌿 Preview your docs: https://nvidia-preview-pr-5874.docs.buildwithfern.com/nemoclaw |
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
E2E Advisor RecommendationRequired E2E: Dispatch hint: Auto-dispatched E2E: Full advisor summaryE2E Recommendation AdvisorBase: Required E2E
Optional E2E
New E2E recommendations
Dispatch hint
|
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 4
🧹 Nitpick comments (2)
docs/manage-sandboxes/runtime-controls.mdx (1)
60-60: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winUse
$$nemoclawon this shared page.These Hermes-only references hard-code
nemohermesfor a host command that also exists on the shared alias surface. Use$$nemoclawhere so the generated OpenClaw and Hermes variants render the correct command name consistently.
As per coding guidelines,Use $$nemoclaw for host CLI command examples on shared OpenClaw and Hermes pages.As per path instructions,ask the author to use $$nemoclaw instead so generated OpenClaw and Hermes docs render the right command name.Based on learnings, use a concrete alias only when the command is intentionally OpenClaw-specific.Also applies to: 70-70
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/manage-sandboxes/runtime-controls.mdx` at line 60, The runtime-controls table currently hard-codes the host command as nemohermes on a shared OpenClaw/Hermes page, which should instead use the shared alias surface. Update the affected entry in the docs content that references the gateway restart command to use $$nemoclaw so both generated variants render the correct host CLI name consistently, and apply the same fix to the other referenced occurrence in this section.Sources: Coding guidelines, Path instructions, Learnings
docs/reference/commands.mdx (1)
1334-1334: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winUse
$$nemoclawin the shared reference page.This note is on a shared docs page, and
gateway restartis not Hermes-only. Hard-codingnemohermeshere breaks the variant-friendly command style used elsewhere in this page.
As per coding guidelines,Use $$nemoclaw for host CLI command examples on shared OpenClaw and Hermes pages.As per path instructions,ask the author to use $$nemoclaw instead so generated OpenClaw and Hermes docs render the right command name.Based on learnings, concrete aliases are fine when the command is intentionally agent-specific, which is not the case here.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@docs/reference/commands.mdx` at line 1334, Replace the hard-coded Hermes CLI name in the shared docs example with the shared host-command placeholder used elsewhere on this page. Update the command example in the reference docs so it uses $$nemoclaw instead of nemohermes, keeping the wording variant-friendly for both OpenClaw and Hermes and preserving the existing gateway restart guidance.Sources: Coding guidelines, Path instructions, Learnings
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/actions/sandbox/process-recovery.test.ts`:
- Around line 295-299: The test is reading failure-only fields from
`restartSandboxGateway` without narrowing the union first, so `result.detail` is
not type-safe after `toMatchObject`. Update the assertion in
`process-recovery.test.ts` to explicitly narrow on `result.ok` for the failure
branch before accessing `detail`, while keeping the existing
`restartSandboxGateway` and `deps.buildOpenClawGatewayRestartScript`
expectations intact.
In `@src/lib/actions/sandbox/process-recovery.ts`:
- Around line 617-624: The success log in process recovery is not honoring
quiet, so `process-recovery.ts` still prints the gateway restart message
unconditionally. Update the `processRecovery` restart path around the
`forwardRecovered` checks so the final `console.log` for “Gateway restarted;
health passed; forwards checked/recovered” only runs when `quiet` is false,
matching the existing quiet handling used for the earlier status messages.
- Around line 593-599: The wedge diagnostics path is bypassing the injected exec
behavior by passing the direct sandbox exec helper instead of the dependency
override. In the process-recovery flow that calls waitForRecoveredSandboxGateway
and printGatewayWedgeDiagnostics, make sure the diagnostics invocation uses
deps.executeSandboxExecCommand when present (falling back to the default helper
only if not injected) so callers can fully control exec behavior during
health-timeout recovery.
In `@src/lib/agent/runtime.ts`:
- Around line 282-291: In buildOpenClawGatewayRestartScript and the
gatewayRootGosuLaunchCommand flow, move the root check ahead of any
state-mutating restart steps so OpenClaw verifies root before log setup, lock
removal, or stale-process termination. Use the existing root-check logic from
gatewayRootGosuLaunchCommand and ensure the non-root path exits before
buildGatewayLogSetup, buildGatewayGuardRecoveryLines, rm -rf lock cleanup, and
buildGatewayStopLines are executed.
---
Nitpick comments:
In `@docs/manage-sandboxes/runtime-controls.mdx`:
- Line 60: The runtime-controls table currently hard-codes the host command as
nemohermes on a shared OpenClaw/Hermes page, which should instead use the shared
alias surface. Update the affected entry in the docs content that references the
gateway restart command to use $$nemoclaw so both generated variants render the
correct host CLI name consistently, and apply the same fix to the other
referenced occurrence in this section.
In `@docs/reference/commands.mdx`:
- Line 1334: Replace the hard-coded Hermes CLI name in the shared docs example
with the shared host-command placeholder used elsewhere on this page. Update the
command example in the reference docs so it uses $$nemoclaw instead of
nemohermes, keeping the wording variant-friendly for both OpenClaw and Hermes
and preserving the existing gateway restart guidance.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 4573e314-45ce-43a2-b097-901ac1580774
📒 Files selected for processing (17)
agents/hermes/Dockerfiledocs/manage-sandboxes/install-plugins-hermes.mdxdocs/manage-sandboxes/lifecycle.mdxdocs/manage-sandboxes/runtime-controls.mdxdocs/reference/commands-nemohermes.mdxdocs/reference/commands.mdxdocs/reference/troubleshooting.mdxsrc/commands/sandbox/gateway/restart.tssrc/commands/sandbox/oclif-command-adapters.test.tssrc/lib/actions/sandbox/process-recovery.test.tssrc/lib/actions/sandbox/process-recovery.tssrc/lib/agent/runtime.test.tssrc/lib/agent/runtime.tssrc/lib/cli/command-registry.test.tssrc/lib/cli/public-argv-translation.test.tssrc/lib/cli/public-display-defaults.tstest/hermes-doctor-config-hash.test.ts
Vitest E2E Scenario RecommendationRequired Vitest E2E scenarios: Dispatch required Vitest E2E scenarios:
Full Vitest E2E advisor summaryVitest E2E Scenario AdvisorBase: Required Vitest E2E scenarios
Optional Vitest E2E scenarios
Relevant changed files
|
PR Review Advisor — BlockedMerge posture: Do not merge until addressed Action checklist
Findings index
🚨 Required before mergeAddress these before merging unless a maintainer explicitly overrides the advisor with rationale.
|
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 28253422289
|
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
There was a problem hiding this comment.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/lib/actions/sandbox/process-recovery.ts`:
- Line 401: The restart script is being built with the wrong port value:
`buildHermesGatewayRestartScript` expects the Hermes health/API port for
`_NEMOCLAW_RESTART_HEALTH_PORT`, but `process-recovery.ts` is currently passing
`dashboardPort`. Update the call site to pass the actual Hermes health port from
the recovery context instead of the dashboard port, and make sure the value used
here matches the fixture’s health probe port so
`buildHermesGatewayRestartScript` launches socat/health recovery against the
correct endpoint.
- Around line 395-405: The Hermes recovery path in recoverSandboxProcesses still
prints failure details even when callers request quiet mode, so thread the quiet
flag into this helper and suppress the printGatewayRestartFailure calls when
quiet is true. Update the existing checkAndRecoverSandboxProcesses call site to
pass { quiet }, and use the recoverSandboxProcesses and
printGatewayRestartFailure symbols to locate the affected Hermes branch.
In `@src/lib/agent/runtime.ts`:
- Around line 364-367: The restart validation in runtime.ts should fail closed
instead of falling back to resolving Hermes from PATH. Update the logic in the
validationSteps construction so that if AGENT_BIN at binaryPath is missing or
not executable, it exits with AGENT_MISSING rather than calling command -v on
binaryName. Keep the check within the restart flow that relaunches as gateway,
and ensure the code only accepts the trusted binaryPath for locating the
executable.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e725fd75-9d5c-4add-9d95-ab5fbe819868
📒 Files selected for processing (4)
src/lib/actions/sandbox/process-recovery.tssrc/lib/agent/runtime.test.tssrc/lib/agent/runtime.tstest/process-recovery.test.ts
🚧 Files skipped from review as they are similar to previous changes (1)
- src/lib/agent/runtime.test.ts
Selective E2E Results — ✅ All requested jobs passedRun: 28254560749
|
Selective E2E Results — ✅ All requested jobs passedRun: 28254817942
|
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 28255266980
|
Selective E2E Results — ✅ All requested jobs passedRun: 28255490504
|
Selective E2E Results — ✅ All requested jobs passedRun: 28257003744
|
Vitest E2E Scenario Results — ❌ Some jobs failedRun: 28337415605
|
Vitest E2E Scenario Results — ❌ Some jobs failedRun: 28337418297
|
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Vitest E2E Scenario Results — ❌ Some jobs failedRun: 28337416532
|
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Vitest E2E Scenario Results — ❌ Some jobs failedRun: 28338427259
|
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Selective E2E Results — ✅ All requested jobs passedRun: 28338521527
|
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Vitest E2E Scenario Results —
|
| Job | Result |
|---|---|
| sandbox-rebuild-vitest | |
| sandbox-survival-vitest | ✅ success |
| snapshot-commands-vitest | ✅ success |
| state-backup-restore-vitest | ✅ success |
Vitest E2E Scenario Results —
|
| Job | Result |
|---|---|
| credential-sanitization-vitest | ✅ success |
| hermes-dashboard-vitest | |
| hermes-inference-switch-vitest | |
| openclaw-inference-switch-vitest | ✅ success |
Vitest E2E Scenario Results —
|
| Job | Result |
|---|---|
| cloud-onboard-vitest | ✅ success |
| hermes-e2e-vitest | |
| hermes-root-entrypoint-smoke-vitest | ✅ success |
| hermes-sandbox-secret-boundary-vitest | ✅ success |
Vitest E2E Scenario Results —
|
| Job | Result |
|---|---|
| gateway-guard-recovery | |
| rebuild-hermes-vitest | |
| rebuild-openclaw-vitest | ✅ success |
| shields-config-vitest | ✅ success |
Selective E2E Results — ✅ All requested jobs passedRun: 28338817268
|
Vitest E2E Scenario Results — ✅ All selected jobs passedRun: 28338697335
|
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Vitest E2E Scenario Results —
|
| Job | Result |
|---|---|
| credential-sanitization-vitest | ✅ success |
| hermes-dashboard-vitest | |
| hermes-inference-switch-vitest | |
| openclaw-inference-switch-vitest | ✅ success |
Vitest E2E Scenario Results — ❌ Some jobs failedRun: 28338696474
|
Vitest E2E Scenario Results —
|
| Job | Result |
|---|---|
| cloud-onboard-vitest | ✅ success |
| hermes-e2e-vitest | |
| hermes-root-entrypoint-smoke-vitest | ✅ success |
| hermes-sandbox-secret-boundary-vitest | ✅ success |
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Vitest E2E Scenario Results —
|
| Job | Result |
|---|---|
| cloud-onboard-vitest | ✅ success |
| hermes-e2e-vitest | |
| hermes-root-entrypoint-smoke-vitest | ✅ success |
| hermes-sandbox-secret-boundary-vitest | ✅ success |
Vitest E2E Scenario Results —
|
| Job | Result |
|---|---|
| gateway-guard-recovery | |
| rebuild-hermes-vitest | |
| rebuild-openclaw-vitest | |
| shields-config-vitest | ✅ success |
Vitest E2E Scenario Results —
|
| Job | Result |
|---|---|
| credential-sanitization-vitest | ✅ success |
| hermes-dashboard-vitest | |
| hermes-inference-switch-vitest | |
| openclaw-inference-switch-vitest | ✅ success |
Vitest E2E Scenario Results —
|
| Job | Result |
|---|---|
| sandbox-rebuild-vitest | |
| sandbox-survival-vitest | ✅ success |
| snapshot-commands-vitest | ✅ success |
| state-backup-restore-vitest | ✅ success |
Vitest E2E Scenario Results — ✅ All selected jobs passedRun: 28339472376
|
Vitest E2E Scenario Results — ❌ Some jobs failedRun: 28339472357
|
Vitest E2E Scenario Results — ❌ Some jobs failedRun: 28339472359
|
Vitest E2E Scenario Results — ❌ Some jobs failedRun: 28339472411
|
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Signed-off-by: Aaron Erickson <aerickson@nvidia.com>
Summary
Adds a supported host-mediated
gateway restartcommand for NemoClaw-managed OpenClaw and Hermes gateways so runtime configuration and plugin changes can be applied without weakening the sandbox/gateway user boundary.This revision also hardens the supervised gateway lifecycle: restart and crash recovery act only on a proven process identity and listener, configuration/state changes are transactional, and ambiguous privileged execution or failed built-in health probes fail closed.
Related Issue
Fixes #2426
Fixes #5253
Supersedes #5416
Changes
sandbox:gateway:restartplus publicnemoclaw <name> gateway restartandnemohermes <name> gateway restartrouting.CAP_DAC_OVERRIDE.Type of Change
Quality Gates
Verification
Verifiedin GitHubnpx prek run --from-ref main --to-ref HEADpassesnpm testpasses (broad runtime changes only)npm run docsbuilds without warnings (doc changes only)Verification run locally against the current
mainmerge:npm run build:clinpm run typecheck:clinpm run typechecktsc --noEmitpassed.main'sifcount; all new test files are at zero. The affected suites passed 414 tests with 2 platform skips, and the final forward-listener consumers reran 19/19.test/e2e-non-root-smoke.shpassed 2/2 under--security-opt no-new-privileges.CAP_DAC_OVERRIDEabsent; stale-base group repair is idempotent and target-user step-down still resets supplementary groups.npm run checksshfmt, ShellCheck, Ruff, and Python compilation over changed helpers.git diff --checkgates.npm run docscompleted with 0 errors and 2 pre-existing Fern warnings.The full local CLI hook was run before commit. It still encounters unchanged macOS/Node 22.16 harness failures (plain Node loading TypeScript diagnostic preloads and a GNU
sedassumption) plus broad parallel timeout noise. Branch-related failures exposed by that run were fixed, then commits were retried with onlytest-cliskipped; all other commit hooks and all push hooks passed. Exact-head CI is the final broad-suite proof.Signed-off-by: Aaron Erickson aerickson@nvidia.com
Summary by CodeRabbit
gateway restartfor supported sandboxes, including quiet mode and automatic gateway/port-forward recovery verification.Advisor dispositions (exact head c4223e3)
normalizeCustomEndpointUrland its accepted bridge-host behavior have no diff from base SHA8e4784e8; this PR only adds serialization and shields-posture checks around inference mutations. Changing the endpoint trust policy here would be an unrelated security-policy migration, so it is outside this PR.gatewayUID specifically so the sandbox user cannot kill or restart that child. The live scenario therefore uses trusted sandbox root to TERM/KILL the exact tracked gateway PID, which establishes the same stopped-process state without weakening the new privilege boundary. Recovery then runs from the host, modeling the post-exitstep, and verifies a replacement gateway PID, unchanged PID 1 identity, both config hashes, API health on 8642, dashboard forward/HTML on 18789 when enabled, and Ready/healthy status.test/cli/connect-recovery.test.tsassertsProbe complete: recovered Hermes Agent gateway; the live Hermes scenario separately proves exit code 0 and the recovered runtime/forward state.c4223e3fepassed, including five CLI shards and aggregate, both sandbox-image builds, non-root smoke, gateway isolation, port overrides, sandbox E2E, macOS/WSL, CodeRabbit, GPT-5.5, Nemotron workflow, and CodeQL with 0 annotations (down from 41). The broader cloud matrix remains an advisor recommendation; no unrun cloud job is claimed as proof.