Decision Goal
Decide whether OpenCHAMI should pursue an MVP capability for managing a small set of templated, device-aware HPC virtual machines on selected nodes, and choose the initial implementation approach. The specific decision requested is: Should OpenCHAMI prototype a hypervisor workflow where node inventory, FRU/device discovery, image catalog, and cloud-init metadata are reused to define and manage static VM templates with deterministic GPU/NIC/SR-IOV/PCI-passthrough bindings?
Category
Architecture
Stakeholders / Affected Areas
OpenCHAMI users and operators interested in virtualized HPC environments; SMD/inventory; fru-tracker; image catalog; cloud-init metadata service; Magellan/Redfish-oriented workflows; node provisioning workflows; future hardware-management APIs; sites with GPU- and high-speed-network-heavy nodes; developers evaluating libvirt, Incus, KubeVirt, Redfish virtual BMCs, or other VM lifecycle backends.
Decision Needed By
No hard deadline proposed.
Problem Statement
OpenCHAMI already has several building blocks that overlap with what is needed for an HPC-oriented virtual machine management capability:
- A mature node inventory system.
- A fru-tracker service that scrapes Redfish and can identify devices such as GPUs and NICs.
- An image catalog for bootable system images.
- A cloud-init metadata service for post-boot configuration.
Several OpenCHAMI use cases could benefit from being able to launch and manage VMs on HPC nodes.
The hard part is not basic VM lifecycle; it is binding virtual machines to the correct physical devices on dense HPC nodes. Some target nodes may have many GPUs and many high-speed interfaces, including InfiniBand and Ethernet devices. The capability needs deterministic, auditable handling of PCI passthrough and/or SR-IOV so that each VM template receives the intended GPUs, NICs, ports, VFs, NUMA-local resources, and IOMMU-safe device groups.
General-purpose VM orchestration stacks such as KubeVirt may be useful, but they may not naturally express the exact hardware binding model OpenCHAMI needs. Conversely, building a full VM scheduler or cloud platform is likely too broad for an initial MVP.
The proposed discussion is whether OpenCHAMI should start with a narrower model: a small set of predefined VM templates per supported node type, using OpenCHAMI inventory and metadata systems to describe those templates, and delegating local VM lifecycle to a simpler per-node runtime.
Proposed Solution
Prototype an OpenCHAMI-managed “templated HPC VM” capability with deliberately narrow scope.
The MVP would focus on static or semi-static VM slots rather than arbitrary dynamic VM scheduling. Each supported node type would have a small number of predefined VM templates. A template would describe the expected image, CPU/memory shape, cloud-init configuration, and exact device bindings.
Example concept:
    nodeType: dense-gpu-hsn-node
    vmSlots:
      - name: slot0
        profile: gpu-dev-small
        image: ubuntu-hpc-base
        cpu:
          cores: 32
          numaPolicy: strict
        memory:
          sizeGiB: 256
          hugepages: true
        devices:
          - name: gpu0
            kind: gpu
            selector:
              physicalLabel: GPU0
              pciAddress: "0000:81:00.0"
          - name: hsn0
            kind: nic
            mode: sriov-vf
            selector:
              pf: mlx5_0
              vfIndex: 0
          - name: data0
            kind: nic
            mode: sriov-vf
            selector:
              pf: ens7f0
              vfIndex: 0
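A slot template like the one above could be made machine-checkable with a small typed model. The following is a minimal sketch only: the class names, the assumed default `mode` of `passthrough` for devices that omit it, and the specific checks are illustrative assumptions, not an existing OpenCHAMI schema.

```python
import re
from dataclasses import dataclass, field

# Hypothetical model of a VM slot template; not an existing OpenCHAMI schema.
PCI_BDF = re.compile(r"^[0-9a-fA-F]{4}:[0-9a-fA-F]{2}:[0-9a-fA-F]{2}\.[0-7]$")


@dataclass
class DeviceBinding:
    name: str
    kind: str                  # "gpu" or "nic"
    selector: dict
    mode: str = "passthrough"  # assumed default when the template omits mode


@dataclass
class VMSlot:
    name: str
    profile: str
    image: str
    cores: int
    memory_gib: int
    devices: list = field(default_factory=list)


def validate_slot(slot: VMSlot) -> list:
    """Return a list of problems; an empty list means the template is well-formed."""
    problems = []
    if slot.cores <= 0:
        problems.append(f"{slot.name}: cores must be positive")
    if slot.memory_gib <= 0:
        problems.append(f"{slot.name}: memory must be positive")
    names, claimed = set(), set()
    for dev in slot.devices:
        if dev.name in names:
            problems.append(f"{slot.name}: duplicate device name {dev.name}")
        names.add(dev.name)
        if dev.mode == "sriov-vf":
            pf, vf = dev.selector.get("pf"), dev.selector.get("vfIndex")
            if pf is None or vf is None:
                problems.append(f"{dev.name}: sriov-vf needs pf and vfIndex")
                continue
            key = ("vf", pf, vf)
        else:
            addr = dev.selector.get("pciAddress", "")
            if not PCI_BDF.match(addr):
                problems.append(f"{dev.name}: invalid pciAddress {addr!r}")
                continue
            key = ("pci", addr.lower())
        # Two bindings must never claim the same physical function or VF.
        if key in claimed:
            problems.append(f"{dev.name}: {key} is already bound in this slot")
        claimed.add(key)
    return problems
```

Returning a list of problems rather than raising on the first one is a deliberate choice here: an operator reviewing a template usually wants every inconsistency reported at once.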
The MVP should reuse existing OpenCHAMI services where possible:
    SMD / inventory
        -> node identity and placement context
    fru-tracker
        -> physical device inventory from Redfish
    image catalog
        -> VM base image references
    cloud-init metadata service
        -> guest personalization
    new thin coordination layer or host configuration
        -> defines which static VM slots exist on each host
    local VM runtime
        -> starts, stops, and reports state for predefined VM slots
The initial implementation should consider two candidate MVP paths.
Option A: Static libvirt/systemd plus virtual BMC
Use host cloud-init and systemd to predefine a static set of libvirt domains on each supported node. The VM definitions would include exact PCI/SR-IOV device bindings. A virtual BMC layer, likely sushy-tools for early prototyping, could expose each VM slot as a Redfish-managed system.
In this model, OpenCHAMI could treat VM slots similarly to bare-metal nodes from a power-control perspective.
Expected properties:
- Minimal custom lifecycle API.
- Reuses existing Redfish-oriented workflows.
- Keeps device binding static and auditable.
- Good for proving the OpenCHAMI integration model quickly.
- sushy-tools would be treated as a prototype dependency, not necessarily a production dependency.
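To illustrate what an exact PCI binding could look like in Option A, the sketch below renders a libvirt `<hostdev>` element for one PCI address. The helper name `pci_hostdev_xml` is hypothetical, and the choice of `managed="yes"` (letting libvirt detach and reattach the host driver) is an assumption that a real deployment might not want.

```python
import xml.etree.ElementTree as ET


def pci_hostdev_xml(bdf: str) -> str:
    """Render a libvirt <hostdev> element for a PCI address such as 0000:81:00.0.

    Hypothetical helper for illustration; a real template would also need to
    handle guest-side addressing and any ordering constraints.
    """
    domain, bus, rest = bdf.split(":")
    slot, function = rest.split(".")
    hostdev = ET.Element("hostdev", mode="subsystem", type="pci", managed="yes")
    source = ET.SubElement(hostdev, "source")
    # libvirt expects each address component as a hexadecimal value.
    ET.SubElement(
        source,
        "address",
        domain=f"0x{domain}",
        bus=f"0x{bus}",
        slot=f"0x{slot}",
        function=f"0x{function}",
    )
    return ET.tostring(hostdev, encoding="unicode")


if __name__ == "__main__":
    print(pci_hostdev_xml("0000:81:00.0"))
```

Because the slot definitions are static, fragments like this could be generated once per node type and reviewed as plain files, which keeps the binding auditable.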
Option B: Incus as the per-node VM runtime
Use Incus as the local VM lifecycle/runtime API and have OpenCHAMI coordinate profiles, images, metadata, and device bindings.
In this model, OpenCHAMI would not expose the VMs primarily as Redfish-managed systems. Instead, a coordinator would call Incus APIs to create/start/stop predefined VM instances or profiles.
Expected properties:
- Avoids custom libvirt XML and most custom lifecycle logic.
- Provides a real VM lifecycle API out of the box.
- Supports templated profiles and explicit device assignment patterns.
- May be a better fit if Redfish compatibility is not required for the MVP.
- Introduces Incus as a new operational dependency.
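As a rough sketch of what the Option B coordinator might send, the function below builds a state-change request for a predefined instance without sending it. It assumes the Incus REST endpoint `PUT /1.0/instances/{name}/state` with an `action` field; that endpoint shape, the instance name `slot0`, and the helper name should all be checked against the Incus API documentation before relying on them.

```python
import json

# Actions this sketch allows; the coordinator would deliberately expose
# only a small subset of what the runtime supports.
ALLOWED_ACTIONS = {"start", "stop", "restart"}


def incus_state_request(instance: str, action: str, timeout: int = 30) -> dict:
    """Describe (but do not send) a state-change request for one VM slot.

    Hypothetical helper: the path and body follow the Incus REST API as
    understood here and are an assumption, not verified behaviour.
    """
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    body = {"action": action, "timeout": timeout, "force": False}
    return {
        "method": "PUT",
        "path": f"/1.0/instances/{instance}/state",
        "body": json.dumps(body),
    }


if __name__ == "__main__":
    print(incus_state_request("slot0", "start"))
```

Separating request construction from transport keeps the policy (which slots, which actions) testable without a running Incus daemon.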
Initial MVP scope
The MVP should support only:
- A small number of supported node types.
- A small number of predefined VM templates.
- Single-node VMs.
- Catalog-selected images.
- Cloud-init-based post-boot configuration.
- Deterministic GPU passthrough.
- Deterministic NIC passthrough or SR-IOV VF assignment.
- Start, stop, status, and delete/reset semantics.
- Validation that expected devices exist and are safe to assign.
The MVP should explicitly not attempt to support:
- Arbitrary user-defined VM specs.
- Dynamic multi-node VM scheduling.
- Live migration.
- Overcommit.
- Full cloud semantics.
- Arbitrary user-provided libvirt XML.
- Full Redfish hardware emulation.
- Complete support for every GPU/NIC mode.
Alternatives Considered
KubeVirt
KubeVirt provides a Kubernetes-native VM model and may be attractive where Kubernetes or OpenShift is already the control plane. However, the main OpenCHAMI challenge is deterministic binding of specific GPUs, NICs, ports, VFs, NUMA domains, and IOMMU-safe device groups on dense HPC nodes. That may require custom device plugins and substantial topology modeling. This could be valuable later, but it may be too heavy for the MVP.
Full custom libvirt node agent
A custom per-node agent could expose REST/gRPC lifecycle operations and generate libvirt XML dynamically from OpenCHAMI inventory and VM requests. This offers maximum control, but it requires OpenCHAMI to build and maintain a new runtime agent, validation layer, XML generator, lifecycle engine, and reconciliation model. This may be appropriate later, but it is broader than the first prototype needs.
Static libvirt/systemd without virtual BMC
This keeps the implementation very small: systemd defines local lifecycle, libvirt runs the VMs, and a tiny agent wraps systemctl/virsh. This is simple and testable, but it creates a new lifecycle API instead of reusing Redfish-like power-control semantics. It is still a strong fallback if virtual BMC tooling proves insufficient.
sushy-tools virtual BMC
sushy-tools can expose libvirt-backed VMs through a Redfish-like API and could allow OpenCHAMI to model VM slots as Redfish-managed systems. This is attractive for an MVP, but it should be treated cautiously because it is primarily a development/testing tool rather than a hardened production BMC implementation.
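If VM slots are exposed through a Redfish-like endpoint, power control would presumably use the standard `ComputerSystem.Reset` action. The sketch below only builds the request; the helper name, the system identifier, and the subset of `ResetType` values accepted are assumptions, and a given virtual BMC may support a different set.

```python
import json

# Subset of Redfish ResetType values assumed sufficient for an MVP.
MVP_RESET_TYPES = {"On", "ForceOff", "GracefulShutdown", "ForceRestart"}


def redfish_reset_request(system_id: str, reset_type: str) -> dict:
    """Describe (but do not send) a Redfish reset for one virtual system."""
    if reset_type not in MVP_RESET_TYPES:
        raise ValueError(f"reset type not allowed in this sketch: {reset_type}")
    return {
        "method": "POST",
        "path": f"/redfish/v1/Systems/{system_id}/Actions/ComputerSystem.Reset",
        "body": json.dumps({"ResetType": reset_type}),
    }


if __name__ == "__main__":
    # "slot0" is a placeholder; real identifiers depend on the virtual BMC.
    print(redfish_reset_request("slot0", "On"))
```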
Incus
Incus could replace most of the local VM lifecycle layer, including lifecycle API, profiles, storage/image handling, and device attachment. It may be the strongest contender if OpenCHAMI does not require Redfish/libvirt as the primary interface. The tradeoff is introducing a new runtime dependency and deciding how much OpenCHAMI state should be mirrored into Incus.
mvpnet / Ethernet-over-MPI
mvpnet is interesting for launching VM clusters inside HPC allocations and experimenting with MPI-backed networking, but it appears less aligned with the immediate goal of per-node lifecycle management for predefined, hardware-bound VM templates. Ethernet-over-MPI may be useful later for specific virtual cluster experiments, but it is not required for this MVP.
Other Considerations
Device identity and validation
The key technical risk is hardware identity. The system must reliably map OpenCHAMI/fru-tracker device information to runtime host devices such as PCI BDFs, Linux netdev names, RDMA devices, SR-IOV VFs, NUMA nodes, and IOMMU groups.
Even if VM templates are static, the host should validate before declaring a VM slot available:
- expected PCI devices exist
- expected PF/VF relationships exist
- expected NUMA locality matches the template
- expected IOMMU groups are safe
- expected drivers are bound correctly
- expected devices are not already assigned
- expected VM definition matches the template
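Several of the checks listed above can be read directly from Linux sysfs. The sketch below covers presence, NUMA node, bound driver, and IOMMU group membership for one PCI device. The function name and the `sysfs_root` parameter (added so the logic can be exercised against a fake tree) are illustrative assumptions, and the IOMMU rule is deliberately conservative: it flags any group shared with another device, even though a shared group can be acceptable when every member is passed to the same VM.

```python
from pathlib import Path


def check_pci_device(bdf: str, expected_driver: str, expected_numa: int,
                     sysfs_root: str = "/sys") -> list:
    """Return a list of problems for one PCI device; empty means it looks assignable."""
    dev = Path(sysfs_root) / "bus" / "pci" / "devices" / bdf
    if not dev.is_dir():
        return [f"{bdf}: device not present"]

    problems = []

    # NUMA locality as reported by the kernel (-1 means unknown).
    numa = int((dev / "numa_node").read_text().strip())
    if numa != expected_numa:
        problems.append(f"{bdf}: NUMA node {numa}, expected {expected_numa}")

    # The driver entry is a symlink whose target is named after the bound driver.
    driver_link = dev / "driver"
    driver = driver_link.resolve().name if driver_link.exists() else None
    if driver != expected_driver:
        problems.append(f"{bdf}: driver {driver}, expected {expected_driver}")

    # Conservative IOMMU rule: the device must be alone in its group.
    group_link = dev / "iommu_group"
    if not group_link.exists():
        problems.append(f"{bdf}: no IOMMU group (is the IOMMU enabled?)")
    else:
        peers = sorted(p.name for p in (group_link / "devices").iterdir())
        others = [p for p in peers if p != bdf]
        if others:
            problems.append(f"{bdf}: shares IOMMU group with {', '.join(others)}")

    return problems
```

A host-side agent could run a check like this for every device in a slot template and mark the slot unavailable if any list is non-empty.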
Source of truth
OpenCHAMI should remain the source of truth for inventory, image references, metadata, and high-level VM slot definitions. The node-local runtime should be treated as the execution layer, not the authoritative inventory system.
Redfish modeling
If VM slots are exposed as Redfish systems, the community should decide whether those virtual systems should appear as first-class nodes in inventory, child resources of a hypervisor host, or a distinct virtual-node resource type.
Security
The MVP should avoid giving end users direct access to the hypervisor runtime. Whether the backend is libvirt, Incus, or something else, local runtime control should be treated as privileged/root-equivalent. OpenCHAMI should mediate access through service credentials, policy, and audit logging.
Scheduling and ownership
The MVP can avoid general scheduling by requiring explicit node/slot selection. Future versions may add placement logic or integration with Slurm/Flux/resource-manager allocations.
SR-IOV lifecycle
The MVP should decide whether VFs are created dynamically or pre-created at host boot. A fixed VF pool per supported node type is likely simpler and easier to audit for the first implementation.
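For the pre-created option, a fixed VF pool could be established at host boot by writing to the PF's `sriov_numvfs` attribute in sysfs. This is a minimal sketch under that assumption: the function name and the `sysfs_root` parameter are hypothetical, and it presumes the PF is addressed by its Linux netdev name.

```python
from pathlib import Path


def ensure_vf_pool(pf_netdev: str, want_vfs: int, sysfs_root: str = "/sys") -> bool:
    """Make sure the PF exposes exactly want_vfs VFs. Returns True if it changed anything.

    Intended to run at host boot, before any VM starts: resetting the count
    destroys existing VFs.
    """
    dev = Path(sysfs_root) / "class" / "net" / pf_netdev / "device"
    total = int((dev / "sriov_totalvfs").read_text())
    if want_vfs > total:
        raise ValueError(f"{pf_netdev}: wants {want_vfs} VFs, device supports {total}")

    numvfs = dev / "sriov_numvfs"
    current = int(numvfs.read_text())
    if current == want_vfs:
        return False  # already in the desired state, nothing to do

    # The kernel rejects changing a non-zero VF count directly,
    # so the count is reset to 0 before the new value is written.
    if current != 0:
        numvfs.write_text("0")
    numvfs.write_text(str(want_vfs))
    return True
```

Making the function idempotent means it can be run from a systemd unit on every boot without side effects once the pool is in place.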
Guest metadata
The guest should ideally use the existing OpenCHAMI cloud-init metadata service. The community should decide whether the VM runtime injects local cloud-init data directly, or whether guests fetch metadata from OpenCHAMI using a stable instance ID.
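One possible way to obtain a stable instance ID is to derive it deterministically from the host node identifier and the slot name, so the same slot always maps to the same metadata record. The sketch below uses a name-based UUID; the namespace string, the function name, and the example node identifier are all hypothetical.

```python
import uuid

# Hypothetical namespace; any fixed UUID agreed on by the project would do.
SLOT_NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "vmslots.openchami.example")


def slot_instance_id(node_id: str, slot_name: str) -> str:
    """Derive a repeatable instance ID for a VM slot on a given host node."""
    return str(uuid.uuid5(SLOT_NAMESPACE, f"{node_id}/{slot_name}"))


if __name__ == "__main__":
    # The node identifier below is a placeholder, not a real inventory entry.
    print(slot_instance_id("x1000c0s0b0n0", "slot0"))
```

A derived ID avoids storing extra state, at the cost that resetting a slot does not change its identity; if cloud-init should re-run after a reset, a generation counter could be folded into the name.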
Production readiness
If sushy-tools or similar emulation is used, the RFD should explicitly state that this is for prototype validation unless the community agrees to harden, fork, or replace the virtual BMC component.
Related Docs / PRs
No response