[RFD]: OpenCHAMI-managed templated HPC virtual machines

### Decision Goal

Decide whether OpenCHAMI should pursue an MVP capability for managing a small set of templated, device-aware HPC virtual machines on selected nodes, and choose the initial implementation approach.

The specific decision requested is: should OpenCHAMI prototype a hypervisor workflow in which node inventory, FRU/device discovery, the image catalog, and cloud-init metadata are reused to define and manage static VM templates with deterministic GPU/NIC/SR-IOV/PCI-passthrough bindings?

### Category

Architecture

### Stakeholders / Affected Areas

OpenCHAMI users and operators interested in virtualized HPC environments; SMD/inventory; fru-tracker; image catalog; cloud-init metadata service; Magellan/Redfish-oriented workflows; node provisioning workflows; future hardware-management APIs; sites with GPU- and high-speed-network-heavy nodes; developers evaluating libvirt, Incus, KubeVirt, Redfish virtual BMCs, or other VM lifecycle backends.

### Decision Needed By

No hard deadline proposed.

### Problem Statement

OpenCHAMI already has several building blocks that overlap with what is needed for an HPC-oriented virtual machine management capability:

- A mature node inventory system.
- A fru-tracker service that scrapes Redfish and can identify devices such as GPUs and NICs.
- An image catalog for bootable system images.
- A cloud-init metadata service for post-boot configuration.

Several OpenCHAMI use cases could benefit from being able to launch and manage VMs on HPC nodes.

> The hard part is not basic VM lifecycle. The hard part is binding virtual machines to the correct physical devices on dense HPC nodes. Some target nodes may have many GPUs and many high-speed interfaces, including InfiniBand and Ethernet devices. We need deterministic, auditable handling of PCI passthrough and/or SR-IOV so that a VM template receives the intended GPUs, NICs, ports, VFs, NUMA-local resources, and IOMMU-safe device groups.

General-purpose VM orchestration stacks such as KubeVirt may be useful, but they may not naturally express the exact hardware binding model OpenCHAMI needs. Conversely, building a full VM scheduler or cloud platform is likely too broad for an initial MVP.

The proposed discussion is whether OpenCHAMI should start with a narrower model: a small set of predefined VM templates per supported node type, using OpenCHAMI inventory and metadata systems to describe those templates, and delegating local VM lifecycle to a simpler per-node runtime.

### Proposed Solution

Prototype an OpenCHAMI-managed “templated HPC VM” capability with deliberately narrow scope.

The MVP would focus on static or semi-static VM slots rather than arbitrary dynamic VM scheduling. Each supported node type would have a small number of predefined VM templates. A template would describe the expected image, CPU/memory shape, cloud-init configuration, and exact device bindings.

Example concept:

```yaml
nodeType: dense-gpu-hsn-node
vmSlots:
  - name: slot0
    profile: gpu-dev-small
    image: ubuntu-hpc-base
    cpu:
      cores: 32
      numaPolicy: strict
    memory:
      sizeGiB: 256
      hugepages: true
    devices:
      - name: gpu0
        kind: gpu
        selector:
          physicalLabel: GPU0
          pciAddress: "0000:81:00.0"
      - name: hsn0
        kind: nic
        mode: sriov-vf
        selector:
          pf: mlx5_0
          vfIndex: 0
      - name: data0
        kind: nic
        mode: sriov-vf
        selector:
          pf: ens7f0
          vfIndex: 0
```
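
Because bindings are static, a template can be checked for conflicting device claims before it ever reaches a host. The following is a minimal Python sketch of such a check, assuming the template has already been parsed into plain dictionaries with the field names used in the example above. `binding_key` and `find_conflicts` are hypothetical helper names, not existing OpenCHAMI functions, and the second slot in the sample data is invented to show a conflict.

```python
from collections import defaultdict


def binding_key(device):
    """Return a hashable key for the physical resource a device entry claims."""
    selector = device["selector"]
    if device.get("mode") == "sriov-vf":
        # A VF is identified by its parent PF and its index on that PF.
        return ("vf", selector["pf"], selector["vfIndex"])
    # Anything else is treated as whole-device PCI passthrough.
    return ("pci", selector["pciAddress"].lower())


def find_conflicts(node_template):
    """Map each resource claimed more than once to the slot/device names claiming it."""
    claims = defaultdict(list)
    for slot in node_template["vmSlots"]:
        for device in slot["devices"]:
            claims[binding_key(device)].append(f'{slot["name"]}/{device["name"]}')
    return {key: users for key, users in claims.items() if len(users) > 1}


# Illustrative data only: slot1 deliberately reuses slot0's GPU.
template = {
    "nodeType": "dense-gpu-hsn-node",
    "vmSlots": [
        {"name": "slot0", "devices": [
            {"name": "gpu0", "kind": "gpu",
             "selector": {"physicalLabel": "GPU0", "pciAddress": "0000:81:00.0"}},
            {"name": "data0", "kind": "nic", "mode": "sriov-vf",
             "selector": {"pf": "ens7f0", "vfIndex": 0}},
        ]},
        {"name": "slot1", "devices": [
            {"name": "gpu0", "kind": "gpu",
             "selector": {"physicalLabel": "GPU0", "pciAddress": "0000:81:00.0"}},
            {"name": "data0", "kind": "nic", "mode": "sriov-vf",
             "selector": {"pf": "ens7f0", "vfIndex": 1}},
        ]},
    ],
}

conflicts = find_conflicts(template)
```

A check like this only covers the template itself. Whether the named devices actually exist on a given host is a separate, host-side validation.
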

The MVP should reuse existing OpenCHAMI services where possible:

```
SMD / inventory
  -> node identity and placement context

fru-tracker
  -> physical device inventory from Redfish

image catalog
  -> VM base image references

cloud-init metadata service
  -> guest personalization

new thin coordination layer or host configuration
  -> defines which static VM slots exist on each host

local VM runtime
  -> starts, stops, and reports state for predefined VM slots
```

The initial implementation should consider two candidate MVP paths.

### Option A: Static libvirt/systemd plus virtual BMC

Use host cloud-init and systemd to predefine a static set of libvirt domains on each supported node. The VM definitions would include exact PCI/SR-IOV device bindings. A virtual BMC layer, likely sushy-tools for early prototyping, could expose each VM slot as a Redfish-managed system.
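
To illustrate what an exact PCI binding looks like in a libvirt domain, the sketch below builds a `<hostdev>` element from a PCI address such as the one in the template example. The element layout follows libvirt's domain XML format for PCI passthrough. The helper name `hostdev_xml` is hypothetical, and a real domain definition would need many more elements than this.

```python
import re
import xml.etree.ElementTree as ET

# Full PCI address: domain:bus:slot.function, e.g. 0000:81:00.0
BDF_RE = re.compile(r"^([0-9a-fA-F]{4}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-7])$")


def hostdev_xml(pci_address):
    """Build a libvirt <hostdev> element that passes one host PCI device to a guest."""
    match = BDF_RE.match(pci_address)
    if match is None:
        raise ValueError(f"not a full PCI address (dddd:bb:ss.f): {pci_address!r}")
    domain, bus, slot, function = (part.lower() for part in match.groups())

    # managed="yes" asks libvirt to detach the device from its host driver
    # when the guest starts and to reattach it when the guest stops.
    hostdev = ET.Element("hostdev", mode="subsystem", type="pci", managed="yes")
    source = ET.SubElement(hostdev, "source")
    ET.SubElement(
        source,
        "address",
        domain=f"0x{domain}",
        bus=f"0x{bus}",
        slot=f"0x{slot}",
        function=f"0x{function}",
    )
    return ET.tostring(hostdev, encoding="unicode")
```

Calling `hostdev_xml("0000:81:00.0")` yields a fragment that could be placed inside the `<devices>` section of a predefined domain. Rejecting short or malformed addresses up front keeps the binding unambiguous.
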

In this model, OpenCHAMI could treat VM slots similarly to bare-metal nodes from a power-control perspective.

Expected properties:

- Minimal custom lifecycle API.
- Reuses existing Redfish-oriented workflows.
- Keeps device binding static and auditable.
- Good for proving the OpenCHAMI integration model quickly.
- sushy-tools would be treated as a prototype dependency, not necessarily a production dependency.
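
For reference, power control against a Redfish-managed system is a POST to the standard `ComputerSystem.Reset` action. The sketch below only constructs the request and does not send it. The base URL and system ID are placeholders, `reset_request` is a hypothetical helper, and the set of reset types a given virtual BMC accepts should be confirmed against that implementation.

```python
import json
import urllib.request

# A subset of the ResetType values defined by the Redfish schema.
RESET_TYPES = {"On", "ForceOff", "GracefulShutdown", "ForceRestart"}


def reset_request(base_url, system_id, reset_type):
    """Build (but do not send) a Redfish ComputerSystem.Reset request for one VM slot."""
    if reset_type not in RESET_TYPES:
        raise ValueError(f"unsupported ResetType: {reset_type!r}")
    url = (
        f"{base_url.rstrip('/')}/redfish/v1/Systems/{system_id}"
        "/Actions/ComputerSystem.Reset"
    )
    body = json.dumps({"ResetType": reset_type}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
```

If the virtual BMC behaves like a physical one at this endpoint, existing Redfish-oriented tooling should be able to power VM slots on and off without a VM-specific code path. That is the main property this option is meant to test.
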

### Option B: Incus as the per-node VM runtime

Use Incus as the local VM lifecycle/runtime API and have OpenCHAMI coordinate profiles, images, metadata, and device bindings.

In this model, OpenCHAMI would not expose the VMs primarily as Redfish-managed systems. Instead, a coordinator would call Incus APIs to create/start/stop predefined VM instances or profiles.

Expected properties:

- Avoids custom libvirt XML and most custom lifecycle logic.
- Provides a real VM lifecycle API out of the box.
- Supports templated profiles and explicit device assignment patterns.
- May be a better fit if Redfish compatibility is not required for the MVP.
- Introduces Incus as a new operational dependency.
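
If this option is pursued, the template's device list would need to be translated into Incus device entries. The sketch below shows one possible mapping. The Incus keys used (`type`, `gputype`, `pci`, `nictype`, `parent`) are believed to match Incus device options but should be verified against the Incus documentation for the targeted release, and `incus_devices` is a hypothetical helper. The mapping drops `vfIndex`: with `nictype: sriov`, Incus is expected to select a free VF itself, so whether a specific VF can be pinned is an open question for the prototype.

```python
def incus_devices(slot):
    """Translate one VM slot's device list into Incus-style device entries.

    Assumption: the returned dictionary would be placed under the `devices`
    key of an Incus profile or instance configuration.
    """
    devices = {}
    for dev in slot["devices"]:
        selector = dev["selector"]
        if dev["kind"] == "gpu":
            # Whole-GPU passthrough selected by PCI address.
            devices[dev["name"]] = {
                "type": "gpu",
                "gputype": "physical",
                "pci": selector["pciAddress"],
            }
        elif dev["kind"] == "nic" and dev.get("mode") == "sriov-vf":
            # The VF is taken from the named parent PF; the index is not expressed.
            devices[dev["name"]] = {
                "type": "nic",
                "nictype": "sriov",
                "parent": selector["pf"],
            }
        else:
            raise ValueError(f'unsupported device entry: {dev["name"]!r}')
    return devices
```

InfiniBand devices are deliberately left out of this sketch, since they are likely to need a different Incus device type and a netdev name rather than an RDMA device name such as `mlx5_0`.
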

### Initial MVP scope

The MVP should support only:

- A small number of supported node types.
- A small number of predefined VM templates.
- Single-node VMs.
- Catalog-selected images.
- Cloud-init-based post-boot configuration.
- Deterministic GPU passthrough.
- Deterministic NIC passthrough or SR-IOV VF assignment.
- Start, stop, status, and delete/reset semantics.
- Validation that expected devices exist and are safe to assign.

The MVP should explicitly not attempt to support:

- Arbitrary user-defined VM specs.
- Dynamic multi-node VM scheduling.
- Live migration.
- Overcommit.
- Full cloud semantics.
- Arbitrary user-provided libvirt XML.
- Full Redfish hardware emulation.
- Complete support for every GPU/NIC mode.

### Alternatives Considered

### KubeVirt

KubeVirt provides a Kubernetes-native VM model and may be attractive where Kubernetes or OpenShift is already the control plane. However, the main OpenCHAMI challenge is deterministic binding of specific GPUs, NICs, ports, VFs, NUMA domains, and IOMMU-safe device groups on dense HPC nodes. That may require custom device plugins and substantial topology modeling. This could be valuable later, but it may be too heavy for the MVP.

### Full custom libvirt node agent

A custom per-node agent could expose REST/gRPC lifecycle operations and generate libvirt XML dynamically from OpenCHAMI inventory and VM requests. This offers maximum control, but it requires OpenCHAMI to build and maintain a new runtime agent, validation layer, XML generator, lifecycle engine, and reconciliation model. This may be appropriate later, but it is broader than the first prototype needs.

### Static libvirt/systemd without virtual BMC

This keeps the implementation very small: systemd defines local lifecycle, libvirt runs the VMs, and a tiny agent wraps systemctl/virsh. This is simple and testable, but it creates a new lifecycle API instead of reusing Redfish-like power-control semantics. It is still a strong fallback if virtual BMC tooling proves insufficient.

### sushy-tools virtual BMC

sushy-tools can expose libvirt-backed VMs through a Redfish-like API and could allow OpenCHAMI to model VM slots as Redfish-managed systems. This is attractive for an MVP, but it should be treated cautiously because it is primarily a development/testing tool rather than a hardened production BMC implementation.

### Incus

Incus could replace most of the local VM lifecycle layer, including lifecycle API, profiles, storage/image handling, and device attachment. It may be the strongest contender if OpenCHAMI does not require Redfish/libvirt as the primary interface. The tradeoff is introducing a new runtime dependency and deciding how much OpenCHAMI state should be mirrored into Incus.

### mvpnet / Ethernet-over-MPI

mvpnet is interesting for launching VM clusters inside HPC allocations and experimenting with MPI-backed networking, but it appears less aligned with the immediate goal of per-node lifecycle management for predefined, hardware-bound VM templates. Ethernet-over-MPI may be useful later for specific virtual cluster experiments, but it is not required for this MVP.

### Other Considerations

### Device identity and validation

The key technical risk is hardware identity. The system must reliably map OpenCHAMI/fru-tracker device information to runtime host devices such as PCI BDFs, Linux netdev names, RDMA devices, SR-IOV VFs, NUMA nodes, and IOMMU groups.

Even if VM templates are static, the host should validate before declaring a VM slot available:

```
expected PCI devices exist
expected PF/VF relationships exist
expected NUMA locality matches the template
expected IOMMU groups are safe
expected drivers are bound correctly
expected devices are not already assigned
expected VM definition matches the template
```
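
Several of these checks can be made by reading Linux sysfs. The following sketch covers device presence, NUMA locality, driver binding, and IOMMU group membership for a single PCI device. It does not cover PF/VF relationships, existing assignments, or the VM definition. `check_pci_device` is a hypothetical helper, and the `sysfs_root` parameter exists only so the function can be exercised against a fake tree.

```python
from pathlib import Path


def check_pci_device(bdf, expected_numa, expected_driver, sysfs_root="/sys"):
    """Return a list of problems found for one PCI device; an empty list means OK."""
    dev = Path(sysfs_root) / "bus/pci/devices" / bdf
    if not dev.exists():
        return [f"{bdf}: device not present"]

    problems = []

    # NUMA locality: the kernel reports -1 when no node is associated.
    numa = int((dev / "numa_node").read_text().strip())
    if numa != expected_numa:
        problems.append(f"{bdf}: NUMA node {numa}, expected {expected_numa}")

    # Driver binding: `driver` is a symlink to the bound driver's directory.
    driver_link = dev / "driver"
    driver = driver_link.resolve().name if driver_link.exists() else None
    if driver != expected_driver:
        problems.append(f"{bdf}: driver {driver}, expected {expected_driver}")

    # IOMMU group: list every device that shares the group with this one.
    group_link = dev / "iommu_group"
    if not group_link.exists():
        problems.append(f"{bdf}: no IOMMU group (is the IOMMU enabled?)")
    else:
        peers = sorted(p.name for p in (group_link / "devices").iterdir())
        others = [name for name in peers if name != bdf]
        if others:
            problems.append(f"{bdf}: shares IOMMU group with {', '.join(others)}")

    return problems
```

This version treats any shared IOMMU group as a problem, which is stricter than necessary: a group is still usable if every other member is assigned to the same VM. A real validator would need a policy for that case.
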

### Source of truth

OpenCHAMI should remain the source of truth for inventory, image references, metadata, and high-level VM slot definitions. The node-local runtime should be treated as the execution layer, not the authoritative inventory system.

### Redfish modeling

If VM slots are exposed as Redfish systems, the community should decide whether those virtual systems should appear as first-class nodes in inventory, child resources of a hypervisor host, or a distinct virtual-node resource type.

### Security

The MVP should avoid giving end users direct access to the hypervisor runtime. Whether the backend is libvirt, Incus, or something else, local runtime control should be treated as privileged/root-equivalent. OpenCHAMI should mediate access through service credentials, policy, and audit logging.

### Scheduling and ownership

The MVP can avoid general scheduling by requiring explicit node/slot selection. Future versions may add placement logic or integration with Slurm/Flux/resource-manager allocations.

### SR-IOV lifecycle

The MVP should decide whether VFs are created dynamically or pre-created at host boot. A fixed VF pool per supported node type is likely simpler and easier to audit for the first implementation.
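
As an illustration of the pre-created option, a fixed VF pool can be established at host boot by writing to the PF's `sriov_numvfs` file in sysfs. The sketch below assumes the PF is addressed by its netdev name, as in the template example. `ensure_vf_pool` is a hypothetical helper, and `sysfs_root` is a parameter only so the logic can be tested without real hardware.

```python
from pathlib import Path


def ensure_vf_pool(pf_netdev, wanted, sysfs_root="/sys"):
    """Make the PF expose exactly `wanted` VFs. Return True if anything was written."""
    dev = Path(sysfs_root) / "class/net" / pf_netdev / "device"

    total = int((dev / "sriov_totalvfs").read_text())
    if wanted > total:
        raise ValueError(f"{pf_netdev}: {wanted} VFs requested, device supports {total}")

    numvfs = dev / "sriov_numvfs"
    current = int(numvfs.read_text())
    if current == wanted:
        return False  # already in the desired state, nothing to do

    if current != 0:
        # The kernel does not allow changing one nonzero VF count to another,
        # so the existing VFs have to be removed first.
        numvfs.write_text("0")
    if wanted != 0:
        numvfs.write_text(str(wanted))
    return True
```

On a real host this needs root and destroys any existing VFs when the count changes, so it should only run before VM slots are started. That constraint is one reason a pool that is fixed at boot is easier to audit than dynamic creation.
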

### Guest metadata

The guest should ideally use the existing OpenCHAMI cloud-init metadata service. The community should decide whether the VM runtime injects local cloud-init data directly, or whether guests fetch metadata from OpenCHAMI using a stable instance ID.
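
One way to obtain a stable instance ID is to derive it deterministically from the host node ID and the slot name. The sketch below does this with a name-based UUID and renders the two fields a cloud-init NoCloud `meta-data` document typically carries. The namespace string, the helper names, and the hostname scheme are all assumptions made for illustration.

```python
import uuid

# Hypothetical namespace; any fixed UUID agreed on by the project would do.
SLOT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "vm-slots.openchami.example")


def instance_id(node_id, slot_name):
    """Derive the same instance ID every time for a given node and slot."""
    return str(uuid.uuid5(SLOT_NAMESPACE, f"{node_id}/{slot_name}"))


def nocloud_meta_data(node_id, slot_name):
    """Render a minimal NoCloud meta-data document for one VM slot."""
    return (
        f"instance-id: {instance_id(node_id, slot_name)}\n"
        f"local-hostname: {node_id}-{slot_name}\n"
    )
```

Because cloud-init re-runs per-instance configuration when the instance ID changes, a fully stable ID means a reset slot would look like the same instance to the guest. If reset should trigger first-boot configuration again, the derivation would need an extra input such as a generation counter.
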

### Production readiness

If sushy-tools or similar emulation is used, the RFD should explicitly state that this is for prototype validation unless the community agrees to harden, fork, or replace the virtual BMC component.

### Related Docs / PRs

_No response_
