
Allow sandbox supervisor to skip inner network namespace creation on hardened / immutable hosts (rootless Podman, dm-verity, overlayroot) #1194

@crdion

Description

Problem Statement

On hardened and immutable Linux hosts — dm-verity-protected rootfs, overlayroot, rootless Podman, and similar configurations — the sandbox supervisor cannot create its inner network namespace because setns is rejected by the kernel. Everything else in the OpenShell stack works in these environments: the standalone gateway runs natively, the Docker driver talks to a Podman socket cleanly, the sandbox container starts, and the supervisor successfully calls back to the gateway over gRPC. The only step that fails is ip netns add, which is the supervisor's first action after policy validation.

Concretely, this blocks every host where the kernel mount/network namespace policy is more restrictive than a stock mutable Linux distribution — including ARM64 NVIDIA GB10 / DGX Spark systems running an immutable Host OS, Bottlerocket, Talos, MicroOS, CIS-hardened RHEL/Ubuntu images, and any rootless-only environment where unprivileged setns across the host's mount namespace is denied.

We're not asking OpenShell to weaken any isolation on platforms that can enforce it. We're asking for a way to opt sandboxes into a "host-network" execution mode so that on a host where the kernel itself prevents the supervisor from creating its private netns, the sandbox can still run with the rest of the policy engine intact (filesystem, process, seccomp, OPA L7 proxy in pass-through / observation mode).

This is a small surface change with high leverage: it unblocks every overlayroot/immutable/rootless-restricted platform.


Proposed Design

Add a sandbox-level network mode that the supervisor honors at startup, with three values:

| Mode | Behavior |
| --- | --- |
| isolated (current behavior) | Supervisor creates the inner netns + veth pair and runs the OPA L7 proxy with full egress filtering. |
| host | Supervisor skips ip netns add and the veth setup. The sandbox process runs in the container's existing network namespace (already host-mode in the Docker driver after #1080). The OPA L7 proxy still runs, but in pass-through / observation mode (logs verdicts, does not enforce egress). Filesystem, process, and seccomp policies are unchanged. |
| auto (suggested new default) | Supervisor probes capability at startup: it tries unshare(CLONE_NEWNET) or ip netns add once; on EPERM it falls back to host mode and emits a degraded-mode telemetry event. |
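The auto probe can be expressed as pure decision logic. The sketch below is illustrative only — `NetworkMode`, `EffectiveMode`, and `resolve_mode` are hypothetical names, not existing supervisor APIs; the probe closure stands in for a one-shot unshare(CLONE_NEWNET) attempt:

```rust
/// Hypothetical sketch of mode resolution at supervisor startup. The probe
/// closure would attempt unshare(CLONE_NEWNET) (or `ip netns add`) exactly
/// once and report the errno on failure. All names here are illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NetworkMode { Isolated, Host, Auto }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EffectiveMode { Isolated, HostDegraded }

pub fn resolve_mode(
    requested: NetworkMode,
    probe: impl FnOnce() -> Result<(), i32>,
) -> EffectiveMode {
    match requested {
        // Explicit isolated: do not probe; a later netns failure stays fatal.
        NetworkMode::Isolated => EffectiveMode::Isolated,
        // Explicit host: never create the inner netns.
        NetworkMode::Host => EffectiveMode::HostDegraded,
        // Auto: probe once and fall back on any error (EPERM on hardened
        // hosts). This is where a CONFIG:DEGRADED event would be emitted.
        NetworkMode::Auto => match probe() {
            Ok(()) => EffectiveMode::Isolated,
            Err(_errno) => EffectiveMode::HostDegraded,
        },
    }
}
```

The key property is that an explicit isolated request never silently degrades; only auto trades isolation for availability, and only after the kernel has refused.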

Surface changes

  1. CLI / spec — Add a network_mode field to the sandbox template (isolated | host | auto), or alternatively a simpler --host-network flag on openshell sandbox create.
  2. Gateway — Plumb the value through to the driver as part of the existing DriverSandbox proto.
  3. Supervisor — Single decision branch in the netns-creation step. The ip netns add call (and the subsequent veth setup) is gated on network_mode != Host. If host (or auto after probe), skip namespace creation and skip the veth/proxy-redirect setup. The OPA proxy can still run as a sidecar process listening on the container's own loopback for observability.
  4. Telemetry — Emit an OCSF event such as CONFIG:DEGRADED when running in host mode so operators can see clearly that egress enforcement is observation-only on that sandbox.
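The decision branch in surface change (3) amounts to choosing between two step sequences. A minimal sketch, assuming the mode has already been resolved — the step names are hypothetical, and the real supervisor performs these via iproute2 and its proxy setup rather than an explicit plan type:

```rust
/// Illustrative plan of the supervisor's network bring-up, gated on host mode.
/// Names are hypothetical; they mirror the steps described in this issue.
#[derive(Debug, PartialEq, Eq)]
pub enum NetStep {
    CreateNetns,           // ip netns add sandbox-<id>
    CreateVethPair,        // veth-h-<id> <-> veth-s-<id>
    RedirectEgressToProxy, // enforcing OPA L7 proxy
    ProxyObserveLoopback,  // pass-through proxy on the container's loopback
}

pub fn network_plan(host_mode: bool) -> Vec<NetStep> {
    if host_mode {
        // Host mode: skip namespace and veth entirely; the proxy only observes.
        vec![NetStep::ProxyObserveLoopback]
    } else {
        vec![
            NetStep::CreateNetns,
            NetStep::CreateVethPair,
            NetStep::RedirectEgressToProxy,
        ]
    }
}
```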

Why this is the right shape

  • It does not weaken the default isolation — isolated remains available for platforms where it works.
  • It mirrors the approach already taken for the Docker driver in feat(driver-docker): use host networking for sandboxes #1080 (host networking for the container itself). This change brings the supervisor's inner namespace in line with the container's outer namespace on platforms that need it.
  • It is platform-detection-friendly: auto lets the supervisor self-degrade rather than requiring operators to know in advance whether their kernel will permit setns.
  • It keeps every other layer of the policy engine intact (filesystem Landlock, seccomp, OPA, provider env injection). Users on these platforms still get most of OpenShell's value.

Alternatives Considered

  1. Make the gateway run in a kernel-permissive container. Tried — putting the entire k3s cluster image in a privileged Podman container also fails on overlayroot because k3s itself needs veth pairs for pod networking, and kube-proxy needs nft (also blocked). The standalone gateway binary sidesteps this entirely; this issue is the last mile.
  2. Run the sandbox container with broader capabilities. Already done — the supervisor runs with SYS_ADMIN, NET_ADMIN, SYS_PTRACE, SYSLOG, SETUID, SETGID, DAC_READ_SEARCH. The blocker is not capability-based; it is the kernel refusing setns() on the host's mount namespace regardless of the capability set.
  3. Patch the host kernel to permit netns operations on overlayroot. Possible but has security implications for the host, is not portable across distributions, and requires every operator to maintain a custom kernel. Pushing the workaround into the host is the wrong direction; a network_mode flag in OpenShell is portable and explicit.
  4. Use a different CNI plugin. Not applicable — this failure is at the supervisor level, after the container is already created. We confirmed bridge, ptp, host-local, and netavark all fail at the same setns boundary inside the container; CNI is not involved at this point.

Reproduction (against current dev build)

# Platform: ARM64 immutable host (dm-verity rootfs + overlayroot)
# Kernel:   6.11.x
# Podman:   5.4.1 (with netavark, passt, aardvark-dns, crun 1.20)
# OpenShell: 0.0.37-dev.79+g721c39f1 (standalone gateway binary)

# 1. Start the standalone gateway with the Docker driver pointed at the user Podman socket
DOCKER_HOST=unix:///run/user/$(id -u)/podman/podman.sock \
  openshell-gateway \
  --db-url 'sqlite:///path/to/gateway.db?mode=rwc' \
  --drivers docker \
  --disable-tls \
  --port 8080 \
  --sandbox-image ghcr.io/nvidia/openshell-community/sandboxes/base:latest \
  --docker-supervisor-image ghcr.io/nvidia/openshell/supervisor:latest \
  --grpc-endpoint http://<HOST_IP>:8080 \
  --ssh-gateway-host <HOST_IP> \
  --ssh-handshake-secret $(openssl rand -hex 32)

# 2. Register and create a sandbox
openshell gateway add http://127.0.0.1:8080 --local --name local
openshell sandbox create --name test-sandbox

What works (everything up to the netns step)

Using compute driver: docker
Extracting supervisor binary from image to host cache
Server listening: address=0.0.0.0:8080
gateway add: Status: Connected, Version: 0.0.37-dev.79+g721c39f1

Created sandbox: test-sandbox
  [0.0s] Requesting compute...

The container starts. The supervisor runs, calls back to the gateway, and the gateway logs show successful gRPC traffic for the entire bring-up sequence:

  • GetSandboxConfig → 200
  • GetSandboxProviderEnvironment → 200
  • PushSandboxLogs → 200 (repeatedly)

Where it fails (supervisor ip netns add)

Supervisor logs from podman logs <sandbox-container>:

2026-05-04T00:27:18.279Z INFO openshell_sandbox: Starting sandbox
2026-05-04T00:27:18.279Z INFO openshell_sandbox: Fetching sandbox policy via gRPC
2026-05-04T00:27:18.322Z INFO openshell_sandbox: Creating OPA engine from proto policy data
2026-05-04T00:27:18.325Z OCSF CONFIG:VALIDATED [INFO] Validated 'sandbox' user exists in image
2026-05-04T00:27:18.327Z OCSF CONFIG:LOADED [INFO] Fetched provider environment [env_count:0]
2026-05-04T00:27:18.331Z OCSF CONFIG:ENABLED [INFO] TLS termination enabled: ephemeral CA generated
2026-05-04T00:27:18.331Z OCSF CONFIG:CREATING [INFO] Creating network namespace
                          [ns:sandbox-b5266943
                           host_veth:veth-h-b5266943
                           sandbox_veth:veth-s-b5266943]

Error:   × Network namespace creation failed and proxy mode requires isolation.
  │ Ensure CAP_NET_ADMIN and CAP_SYS_ADMIN are available and iproute2 is
  │ installed. Error: ip netns add sandbox-b5266943 failed:
  │ setns: Operation not permitted

The sandbox container then exits 1 and the gateway eventually reports:

Error: × sandbox provisioning timed out after 300s. Last reported status:
  │ DependenciesNotReady: Container is running; waiting for supervisor relay

The supervisor has all the capabilities the error message asks for — they are set by the driver per crates/openshell-driver-podman/src/container.rs. The kernel is rejecting setns because the host's mount namespace policy does not permit it for unprivileged containers, regardless of capability set.


Agent Investigation

We investigated the codebase and the failure path with the help of a coding agent. Findings:

  • Standalone gateway is the right entry point on hardened hosts. Running openshell-gateway natively (rather than via openshell gateway start, which puts k3s + containerd inside a container) sidesteps the broader kernel-level problems with k3s on overlayroot (veth, iptables/nft, kubelet cgroup writes). The standalone path got us 90% of the way to a working sandbox, with no workarounds beyond pointing DOCKER_HOST at the user Podman socket.
  • Docker driver + Podman socket is functional. With podman-docker and the user-level Podman socket, the Docker compute driver connects to Podman cleanly. No code changes were needed on either side. The connection uses bollard against the Docker-compat API exposed by Podman.
  • The Podman driver currently hardcodes bridge mode. crates/openshell-driver-podman/src/container.rs (around lines 110 and 164) sets netns: NetNS { nsmode: "bridge" } unconditionally. Even after the host-networking improvements in feat(driver-docker): use host networking for sandboxes #1080 for the Docker driver, the Podman driver's container spec still requests bridge networking, which trips netavark on overlayroot before the supervisor even starts. Routing via --drivers docker against the same Podman socket is the workaround that gets us past this layer; the real fix would be to bring the Podman driver in line with feat(driver-docker): use host networking for sandboxes #1080.
  • The supervisor's netns-creation is the actual deepest blocker. Once we routed around the driver-level bridge issue by using --drivers docker, the supervisor reached the point where it tries to create its own inner netns inside the (host-networked) container. That is the precise line where setns returns EPERM. This is the smallest, most surgical fix point for the project — it's the topic of this issue.
  • Filesystem / process / seccomp / OPA all work in this environment. We saw the supervisor successfully validate the policy, generate the ephemeral CA, fetch the provider environment, and emit OCSF events. None of those layers are blocked. Only the network-isolation layer is.
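The driver-level fix noted above could be as small as making the hardcoded nsmode value mode-aware. A sketch under the assumption that the Podman driver's spec builder receives the new network_mode — the enum and function are illustrative, not the existing container.rs API:

```rust
/// Hypothetical replacement for the unconditional `nsmode: "bridge"` in the
/// Podman driver's container spec. Names are illustrative.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum NetworkMode { Isolated, Host, Auto }

pub fn podman_nsmode(mode: NetworkMode) -> &'static str {
    match mode {
        // Mirror #1080: host networking for the container itself.
        NetworkMode::Host => "host",
        // Keep today's behavior for isolated; `auto` is mapped to bridge here
        // for simplicity, since the supervisor resolves it later (a driver
        // could also probe and pick "host" when bridge setup would fail).
        NetworkMode::Isolated | NetworkMode::Auto => "bridge",
    }
}
```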

Why this matters

  • Unblocks immutable Host OS images (dm-verity / verified-boot images, A/B-slot edge OSes, etc.).
  • Unblocks rootless-only environments where unprivileged setns across the host mount namespace is denied by policy.
  • Unblocks hardened distributions in the same family — Bottlerocket, Talos, MicroOS, Fedora CoreOS, CIS-hardened images.
  • Aligns the supervisor's network model with the container-runtime-side decision already taken for Docker sandboxes in feat(driver-docker): use host networking for sandboxes #1080 (use host networking).
  • Provides a clean degradation path with explicit OCSF telemetry, so operators always know which sandboxes are running with reduced egress enforcement.

We're happy to test patches on real ARM64 hardware running an immutable Host OS as soon as they land in dev.


Environment

  • Hardware: NVIDIA GB10 / DGX Spark, ARM64 / aarch64
  • Host OS: Immutable Ubuntu 24.04 ARM64 base, dm-verity verified rootfs, overlayroot tmpfs overlay
  • Kernel: 6.11.0 (NVIDIA arm64 build)
  • Container runtime: Podman 5.4.1 + netavark 1.12.1 + passt + aardvark-dns + crun 1.20
  • OpenShell: 0.0.37-dev.79+g721c39f1 (standalone openshell-gateway-aarch64-unknown-linux-gnu)
  • NemoClaw: 0.0.17 (CLI, blocked downstream of this issue)
  • GPU: NVIDIA GB10, Driver 580.x, CUDA 13.0 (CDI working — nvidia.com/gpu=all passes through to Podman containers fine)


Labels

  • os:linux — Bug affects Linux hosts
  • state:triage-needed — Opened without agent diagnostics and needs triage
