feat: add gpu container support to microvm driver #1143
Conversation
Force-pushed ea7a545 to 5943739
```dockerfile
# TODO(gpu): Pin SHA-256 checksum for reproducible builds. Compute with:
#   curl -fsSL <url> | sha256sum
RUN curl -fsSL \
    "https://us.download.nvidia.com/XFree86/Linux-x86_64/${NVIDIA_DRIVER_VERSION}/NVIDIA-Linux-x86_64-${NVIDIA_DRIVER_VERSION}.run" \
```
**Reviewer:**
This seems brittle since the user mode driver version needs to match the kernel mode driver (in the VM) exactly. Why is the user mode driver not installed in the VM itself?
**Author:**
The userspace-in-image split is intentional — the VM guest is a minimal libkrunfw kernel with no package manager and no NVIDIA Container Runtime, so there's nothing to mount or install drivers at boot. The .run --no-kernel-module installer is the only way to get libcuda.so / libnvidia-ml.so into the rootfs without pulling in apt dependencies.
That said, the version-coupling concern is valid. Addressed in this update:
Removed the ARG default — bare docker build now fails immediately with a clear error instead of silently using a stale version.
Added build.sh — sources versions.env (single source of truth) and passes --build-arg. The CLI's --from path also auto-detects versions.env in the Dockerfile context.
Added a version stamp (/etc/openshell-gpu-driver-version) — the VM driver reads this at sandbox creation and warns if it doesn't match the kernel module firmware version.
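The stamp check described above could look roughly like this; a minimal sketch, with function and message names that are illustrative rather than the actual driver code:

```rust
use std::fs;
use std::path::Path;

/// Read the version stamp baked into the rootfs by the Dockerfile and
/// return a warning message if it does not match the kernel module
/// firmware version. Returns None when the versions agree.
fn check_gpu_driver_version(rootfs: &Path, kernel_module_version: &str) -> Option<String> {
    let stamp = rootfs.join("etc/openshell-gpu-driver-version");
    let userspace = fs::read_to_string(&stamp).ok()?.trim().to_string();
    if userspace != kernel_module_version {
        Some(format!(
            "GPU userspace driver {userspace} does not match kernel module {kernel_module_version}"
        ))
    } else {
        None
    }
}
```

The warn-only behavior keeps sandbox creation working with a mismatched image while still surfacing the coupling the reviewer flagged.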
```rust
/// image like `ubuntu:latest` instead of the GPU sandbox Dockerfile.
fn warn_missing_gpu_userspace(rootfs: &Path) {
    let nvidia_smi_candidates = [
        "usr/bin/nvidia-smi",
```
**Reviewer:**
On some systems nvidia-smi is installed at /usr/sbin/nvidia-smi.
**Author:**
Added usr/sbin/nvidia-smi to the candidates.
```bash
# Kernel modules are built for a specific guest kernel version.
# If the running kernel doesn't match, depmod/modprobe will silently fail.
local expected_kver="6.12.76"
```
**Reviewer:**
Can we also move this to a single location? (Codex suggests pins.env).
(alternatively, does it make sense to template this script and render it in the VM just before init?)
**Author:**
Went with pins.env over templating. The init script is include_str!'d at compile time with no existing template infrastructure, and the kernel version is a build-time invariant tied to LIBKRUNFW_REF.
```rust
/// ensures the guest always runs the init script and supervisor that match
/// the running driver binary.
pub fn refresh_runtime_artifacts(rootfs: &Path) -> Result<(), String> {
    let init_path = rootfs.join("srv/openshell-vm-sandbox-init.sh");
```
**Reviewer:**
Question: Why don't we use the global constant here (and elsewhere)?
**Author:**
Good catch — this is just an oversight. SANDBOX_GUEST_INIT_PATH is defined at the top of the file but the two call sites (refresh_runtime_artifacts and prepare_sandbox_rootfs) duplicate the path as an inline string because the constant has a leading / and Path::join would discard the rootfs prefix. The fix is straightforward: use SANDBOX_GUEST_INIT_PATH.trim_start_matches('/') at the join sites, same pattern that require_rootfs_path already uses. Applied the same treatment to SANDBOX_SUPERVISOR_PATH in validate_sandbox_rootfs for consistency.
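The join pattern described above can be sketched as follows; the constant value is taken from the quoted diff, and the helper name is illustrative:

```rust
use std::path::{Path, PathBuf};

const SANDBOX_GUEST_INIT_PATH: &str = "/srv/openshell-vm-sandbox-init.sh";

/// Join an absolute guest path onto a host rootfs prefix. `Path::join`
/// replaces the whole path when handed an absolute component, silently
/// discarding the rootfs prefix, so the leading `/` must be stripped first.
fn rootfs_path(rootfs: &Path, guest_path: &str) -> PathBuf {
    rootfs.join(guest_path.trim_start_matches('/'))
}
```

This is why the call sites had drifted to inline strings: joining the constant directly would have produced `/srv/...` instead of `<rootfs>/srv/...`.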
```dockerfile
# --build-arg NVIDIA_DRIVER_VERSION=<ver> manually.

ARG CUDA_VERSION=12.8.1
ARG UBUNTU_VERSION=22.04
```
**Reviewer:**
The only reason these are set is to define the base image version. Is there value in making the base image more flexible? Are vulnerabilities in out-of-date images a concern in the VM use case? If so, does it make sense to just start with ubuntu:latest instead of an nvidia/cuda base image, especially since we don't actually need access to the CUDA repos defined in the nvidia/cuda image?
**Author:**
Switched to ubuntu:latest.
```dockerfile
&& echo "/usr/local/cuda/lib64" > /etc/ld.so.conf.d/cuda.conf \
&& echo "/usr/lib/x86_64-linux-gnu" >> /etc/ld.so.conf.d/cuda.conf \
```
**Reviewer:**
At least for the CUDA base images, these should not be required. Also, do we only support x86_64 VMs here?
**Author:**
The redundant cuda.conf / ld.so.conf.d block was already removed alongside the switch from nvidia/cuda to ubuntu:latest. The .run installer places libs in /usr/lib/x86_64-linux-gnu/, which is in the default linker search path, so a bare ldconfig is sufficient — no explicit ld.so.conf.d entries needed.
```bash
# version. The open modules support newer kernels better than the
# proprietary DKMS source shipped in /usr/src/.
# Source the shared version pin. Individual overrides still work via
# NVIDIA_OPEN_VERSION environment variable.
```
**Reviewer:**
So if NVIDIA_OPEN_VERSION is specified, then it is up to the user to ensure that the same driver version is used for the kernel modules and the usermode components?
Does it make sense to just use NVIDIA_DRIVER_VERSION everywhere? Does this remove a possible source of error -- at least at this stage?
**Author:**
Unified on NVIDIA_DRIVER_VERSION everywhere. The single pin in versions.env now drives both the Dockerfile's .run installer URL and build-nvidia-modules.sh's open-gpu-kernel-modules tag/tarball/firmware path.
```bash
mkdir -p "${OUTPUT_DIR}"

NPROC="$(nproc 2>/dev/null || echo 4)"
IGNORE_CC_MISMATCH=1 make -C "${NVIDIA_SRC}" \
```
**Reviewer:**
Question: Do we cache these at all, or do we always rebuild this?
**Author:**
The tarball download and extraction are presence-cached (skipped if the archive/source dir already exists), but the make modules step runs unconditionally every time. make's own incremental compilation avoids recompiling unchanged .o files, so re-runs are faster than a clean build, but it still does a full dependency scan and link pass.
Since this is a one-time setup step (mise run vm:nvidia-modules) rather than something that runs on every sandbox create, the current behavior is probably acceptable for now. If we want to optimize it later, we could add a stamp file keyed on sha256(NVIDIA_VERSION + KERNEL_VERSION + kconfig) that skips the entire build when the inputs haven't changed — happy to add that as a follow-up if you think it's worth it at this stage.
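The proposed follow-up could be as simple as the sketch below. It is hypothetical, not part of this PR: it records the build inputs in a stamp file and skips the build when they are unchanged, using a plain string comparison rather than sha256 to stay dependency-free (a hash only matters if the inputs are large, e.g. the full kconfig).

```rust
use std::fs;
use std::path::Path;

/// Return true if a previous build recorded exactly these inputs,
/// meaning the `make modules` step can be skipped entirely.
fn build_is_current(output_dir: &Path, nvidia_version: &str, kernel_version: &str) -> bool {
    let inputs = format!("{nvidia_version}\n{kernel_version}\n");
    fs::read_to_string(output_dir.join(".build-stamp"))
        .map(|prev| prev == inputs)
        .unwrap_or(false)
}

/// Write the stamp after a successful build.
fn record_build(output_dir: &Path, nvidia_version: &str, kernel_version: &str) -> std::io::Result<()> {
    fs::write(
        output_dir.join(".build-stamp"),
        format!("{nvidia_version}\n{kernel_version}\n"),
    )
}
```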
**Reviewer:**
I did see the download and extraction being cached after I asked this. I think what we have is fine for the time being. Let's look at adding a stamp file if we find this problematic. As you say, this is not for every sandbox run.
Signed-off-by: Vincent Caux-Brisebois <vcauxbrisebo@nvidia.com>
Force-pushed bffe5d7 to 1eb4cb3
```rust
/// Load key=value pairs from a `versions.env` file in the given directory.
/// Returns an empty map if the file doesn't exist or can't be read.
fn load_versions_env(context: &Path) -> HashMap<String, String> {
```
**Reviewer:**
nit: This can be used to process any env file and not just versions.env.
**Author:**
Agreed — renamed to load_env_file and parameterized the filename.
```rust
if key.is_empty() {
    return None;
}
Some((key.to_string(), value.trim().to_string()))
```
**Reviewer:**
As a question, how does this handle ENVVAR="value"? (I'm not saying that it should)
**Author:**
It doesn't strip quotes — they'd be included as literal characters in the value. Our .env files use bare assignments so this works today. Added a doc comment noting the limitation rather than growing a full shell parser; if we need that later we should pull in dotenvy.
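A self-contained sketch of the line parser with the behavior described above; quotes pass through as literal characters, and comments, blank lines, and empty keys yield nothing:

```rust
/// Parse a single `.env` line into a (key, value) pair.
/// Limitations (documented, not fixed here): quotes are NOT stripped,
/// so `ENVVAR="value"` yields the value `"value"` including the quotes.
fn parse_env_line(line: &str) -> Option<(String, String)> {
    let line = line.trim();
    if line.is_empty() || line.starts_with('#') {
        return None; // comment or blank line
    }
    let (key, value) = line.split_once('=')?;
    let key = key.trim();
    if key.is_empty() {
        return None; // malformed: `=value` with no key
    }
    Some((key.to_string(), value.trim().to_string()))
}
```

Pulling in dotenvy, as the reply suggests, would be the right move the moment quoted or multi-line values are actually needed.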
```rust
    format!("/lib/modules/{GUEST_KERNEL_VERSION}/extra/nvidia"),
] {
    let p = PathBuf::from(&candidate);
    if dir_has_ko_files(&p) {
```
**Reviewer:**
Question: Why is this the only case where we check for .ko files?
**Author:**
The env-var and state-dir paths (steps 1–2) are explicit operator config, so we trust .is_dir() and let inject_gpu_modules fail with a clear error if there are no .ko files. The host paths (steps 3–4) are speculative probes, so has_nvidia_ko avoids claiming a directory that happens to exist but has no NVIDIA modules. The build-tree paths also use has_nvidia_ko.
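The resolution order described above can be sketched as follows. This is a simplified illustration (three of the four steps, with illustrative names and paths), showing the key asymmetry: explicit config is trusted as-is, while speculative probes must actually contain `nvidia.ko`:

```rust
use std::path::{Path, PathBuf};

/// Require `nvidia.ko` specifically, so a directory that merely exists
/// (or holds unrelated modules) is never claimed by a speculative probe.
fn has_nvidia_ko(dir: &Path) -> bool {
    dir.join("nvidia.ko").is_file()
}

/// Resolve the GPU modules directory in precedence order:
/// 1. explicit env-var override (trusted; later injection fails loudly if empty)
/// 2. provisioned state dir (trusted; existence is enough)
/// 3. speculative host candidates (must contain nvidia.ko to be selected)
fn resolve_modules_dir(
    env_override: Option<PathBuf>,
    state_dir: &Path,
    host_candidates: &[PathBuf],
) -> Option<PathBuf> {
    if let Some(dir) = env_override {
        return Some(dir);
    }
    let provisioned = state_dir.join("gpu-modules");
    if provisioned.is_dir() {
        return Some(provisioned);
    }
    host_candidates.iter().find(|p| has_nvidia_ko(p)).cloned()
}
```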
```rust
/// This is a development convenience — production deployments should use
/// `OPENSHELL_GPU_MODULES_DIR` or pre-provision `<state_dir>/gpu-modules/`.
```
**Reviewer:**
Is there a way to ensure that this code path is not triggered in production? I would assume that there are security concerns with allowing this there?
**Author:**
There was no hard guard, just the heuristic that target/libkrun-build/… won't exist in production installs. Gated discover_build_tree_modules() behind #[cfg(debug_assertions)] so release builds only resolve from the env var or the state-dir provisioned path.
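The gating takes this shape, with an illustrative path; the debug variant compiles only into debug builds, and release builds get a stub that can never resolve a build-tree path:

```rust
use std::path::PathBuf;

/// Debug builds only: probe the local build tree for freshly compiled modules.
#[cfg(debug_assertions)]
fn discover_build_tree_modules() -> Option<PathBuf> {
    let p = PathBuf::from("target/libkrun-build/gpu-modules"); // illustrative path
    p.is_dir().then_some(p)
}

/// Release builds: build-tree discovery does not exist; modules must come
/// from the env var or the provisioned state dir.
#[cfg(not(debug_assertions))]
fn discover_build_tree_modules() -> Option<PathBuf> {
    None
}
```

Using `cfg(debug_assertions)` rather than a runtime flag means the fallback code path is absent from the release binary entirely, not merely disabled.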
```rust
    None
}

fn dir_has_ko_files(dir: &Path) -> bool {
```
**Reviewer:**
Should we look for specific NVIDIA-specific ko modules?
**Author:**
Yes — tightened the check to require nvidia.ko specifically. The previous generic .ko check could have matched unrelated module directories.
```rust
/// host firmware causes version mismatches when the host driver differs
/// from the image's driver version.
fn inject_gpu_firmware(rootfs: &Path, modules_dir: &Path) {
    let fw_dst = rootfs.join("lib/firmware/nvidia");
```
**Reviewer:**
What ensures that this path does not include symlinks that resolve on the host?
**Author:**
Nothing was preventing it — fs::copy follows symlinks. Added symlink detection in copy_dir_contents to skip (with a warning) any entry that is a symlink, so host files can't leak into the guest rootfs through the firmware or modules directories.
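The symlink-skipping copy can be sketched like this (a simplified, non-recursive version; the actual helper also handles subdirectories):

```rust
use std::fs;
use std::path::Path;

/// Copy regular files from `src` into `dst`, skipping symlinks so that
/// host files cannot leak into the guest rootfs through a crafted
/// firmware or modules directory.
fn copy_dir_contents(src: &Path, dst: &Path) -> std::io::Result<()> {
    fs::create_dir_all(dst)?;
    for entry in fs::read_dir(src)? {
        let entry = entry?;
        // symlink_metadata never follows links, so a symlink is detected
        // as such instead of being resolved to its host target.
        let meta = fs::symlink_metadata(entry.path())?;
        if meta.file_type().is_symlink() {
            eprintln!("warning: skipping symlink {:?}", entry.path());
            continue;
        }
        if meta.is_file() {
            fs::copy(entry.path(), dst.join(entry.file_name()))?;
        }
    }
    Ok(())
}
```

Note that a plain `fs::copy` on a symlink would have copied the file the link points to on the host, which is exactly the leak the reviewer asked about.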
```rust
/// Extract the default value from a `GUEST_KERNEL_VERSION="${GUEST_KERNEL_VERSION:-<default>}"`
/// line in pins.env.
fn parse_guest_kernel_version(contents: &str) -> Option<String> {
```
**Reviewer:**
Not critical for this PR, but we should reuse this logic if we add other envvars with similar logic.
**Author:**
Agreed — extracted into a generic parse_env_default(contents, var_name) helper so the same parser is reusable for future pins.
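The generic helper could look like this, a sketch reconstructed from the pattern in the quoted doc comment (the real implementation may differ in details):

```rust
/// Extract `<default>` from a `VAR="${VAR:-<default>}"` line in an env file,
/// for the given variable name. Returns None if no such line exists.
fn parse_env_default(contents: &str, var_name: &str) -> Option<String> {
    // Matches e.g. GUEST_KERNEL_VERSION="${GUEST_KERNEL_VERSION:-6.12.76}"
    let needle = format!("{var_name}=\"${{{var_name}:-");
    for line in contents.lines() {
        if let Some(rest) = line.trim().strip_prefix(&needle) {
            return rest.strip_suffix("}\"").map(str::to_string);
        }
    }
    None
}
```

`parse_guest_kernel_version` then reduces to `parse_env_default(contents, "GUEST_KERNEL_VERSION")`, and any future pin reuses the same parser.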
> **Note:** GPU passthrough requires an **x86_64 host and guest**. The QEMU
> backend uses `qemu-system-x86_64`, and the NVIDIA driver installer /
> kernel module build scripts target x86_64 exclusively. ARM/aarch64 GPU
> passthrough is not yet supported.
**Reviewer:**
Question: Is this detected somewhere?
**Author:**
Not previously — you'd get a confusing QEMU failure. Added an early arch check in the GPU startup path that errors with "GPU passthrough requires x86_64" before attempting to launch QEMU.
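An early guard of that shape can be as small as this sketch (function name and message are illustrative):

```rust
/// Fail fast with a clear error before launching QEMU on non-x86_64 hosts,
/// instead of letting qemu-system-x86_64 or the driver installer fail
/// confusingly later.
fn check_gpu_arch() -> Result<(), String> {
    if std::env::consts::ARCH != "x86_64" {
        return Err(format!(
            "GPU passthrough requires x86_64, but this host is {}",
            std::env::consts::ARCH
        ));
    }
    Ok(())
}
```

`std::env::consts::ARCH` is fixed at compile time, so on a native build this also covers the guest side: the same binary cannot target an x86_64 guest from an aarch64 host through this path.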
Summary

Related Issue

Changes

Testing

- `mise run pre-commit` passes

Checklist