pci_resource_assignment: add PCI resource assignment crate #3570
When booting Linux directly (without UEFI), there is no firmware to enumerate the PCI bus, assign bus numbers, probe BAR sizes, or program BAR addresses and bridge windows. Devices behind PCIe root ports are invisible to the guest because their config space and MMIO regions are unconfigured. This crate fills that gap for Linux direct boot.

Longer term, this will also replace UEFI's PCI enumeration for all boot modes. Performing resource assignment in the VMM rather than in guest firmware lets us validate the PCI topology and MMIO layout before the guest ever runs, catching configuration errors (undersized apertures, impossible BAR placements, bus exhaustion) as clear VMM-side errors instead of mysterious guest boot failures.

The algorithm has two phases. Phase 1 walks the bus topology depth-first, assigning secondary and subordinate bus numbers to bridges and probing each device's BAR sizes. SR-IOV VF bus requirements are accounted for when setting subordinate bus numbers.

Phase 2 uses hierarchical bottom-up/top-down allocation. Each bridge computes the total aligned resource requirement of its subtree, then the host bridge carves its MMIO aperture among top-level devices, with each bridge subdividing its allocated range among children. This guarantees non-overlapping bridge windows and correct alignment at every level. BARs are split into two pools: non-prefetchable BARs go to low MMIO (32-bit bridge window), while 64-bit prefetchable BARs go to high MMIO (prefetchable bridge window, the only window capable of 64-bit addresses).

The crate is wired into openvmm_core for Linux direct boot: after loading the kernel, state units are temporarily started with VPs held so that config space accesses route through the chipset's MMIO dispatch, the assignment runs, then state units stop again before the guest resumes.
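As an illustration of the phase 1 bus numbering described above, here is a minimal sketch over an in-memory topology. The `Node`, `BridgeBuses`, `reserve`, and `enumerate` names are hypothetical and not taken from the crate, which discovers devices through config space rather than from a prebuilt tree. The sketch assumes that an SR-IOV endpoint simply reserves extra bus numbers, which raises the subordinate bus number of every enclosing bridge.

```rust
#[derive(Debug, PartialEq)]
struct BusExhausted;

/// Hypothetical topology node (assumption: the real crate probes config space instead).
enum Node {
    /// Endpoint that needs `vf_extra_buses` additional bus numbers for SR-IOV VFs.
    Endpoint { vf_extra_buses: u8 },
    /// PCI-to-PCI bridge (for example a root port) with devices on its secondary bus.
    Bridge(Vec<Node>),
}

#[derive(Debug, PartialEq)]
struct BridgeBuses {
    primary: u8,
    secondary: u8,
    subordinate: u8,
}

/// Reserves `count` consecutive bus numbers and returns the first one.
fn reserve(next_free: &mut u16, count: u16) -> Result<u16, BusExhausted> {
    let first = *next_free;
    let end = first + count; // at most 256 + 255, so no u16 overflow
    if end > 256 {
        return Err(BusExhausted);
    }
    *next_free = end;
    Ok(first)
}

/// Depth-first walk of the devices on `bus`, recording bus numbers per bridge.
fn enumerate(
    bus: u8,
    devices: &[Node],
    next_free: &mut u16,
    out: &mut Vec<BridgeBuses>,
) -> Result<(), BusExhausted> {
    for dev in devices {
        match dev {
            Node::Endpoint { vf_extra_buses } => {
                // VF buses must fall inside the enclosing bridge's bus range.
                reserve(next_free, u16::from(*vf_extra_buses))?;
            }
            Node::Bridge(children) => {
                let secondary = reserve(next_free, 1)? as u8;
                let slot = out.len();
                out.push(BridgeBuses { primary: bus, secondary, subordinate: secondary });
                enumerate(secondary, children, next_free, out)?;
                // Subordinate is the highest bus number used anywhere in the subtree.
                out[slot].subordinate = (*next_free - 1) as u8;
            }
        }
    }
    Ok(())
}
```

Starting with `next_free = 1` (bus 0 being the root bus), a bridge that contains a nested bridge whose endpoint reserves two VF buses ends up with a subordinate bus number that covers those reserved buses.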
Pull request overview
Adds a new pci_resource_assignment crate that performs VMM-side PCI bus enumeration and BAR/bridge-window allocation for Linux direct boot (where no firmware is present to do it). The crate works purely through a PciConfigAccess trait. An ECAM-based implementation is wired into openvmm_core, which routes config-space cycles through Chipset MMIO dispatch and runs the assignment after kernel load (and after reset) for LoadMode::Linux VMs with non-empty PCIe host bridges.
Changes:
- New crate with two-phase algorithm: DFS bus enumeration + BAR-size probing, then bottom-up sizing / top-down address assignment, with SR-IOV VF bus reservation and 32-bit vs 64-bit-prefetchable pool splitting.
- New ecam_config_access module providing an ECAM PciConfigAccess impl via the chipset.
- LoadedVm::assign_pci_resources runs the assignment with state units started and VPs held; integrated into initial boot and reset paths.
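To make the BAR-size probing and pool splitting in the list above concrete, here is a minimal sketch of decoding a memory BAR from the value read back after writing all-ones to it. The `Pool`, `BarInfo`, and `decode_memory_bar` names are hypothetical. The bit layout (bit 0 for I/O space, bits 2:1 for the type, bit 3 for prefetchable) is the standard PCI BAR encoding. Placing 32-bit prefetchable BARs in the low pool is an assumption; the description only states where non-prefetchable and 64-bit prefetchable BARs go.

```rust
#[derive(Debug, PartialEq)]
enum Pool {
    /// Low MMIO, reached through the 32-bit bridge window.
    LowMmio,
    /// High MMIO, reached through the prefetchable bridge window.
    HighMmio,
}

#[derive(Debug, PartialEq)]
struct BarInfo {
    size: u64,
    pool: Pool,
}

/// Decodes a memory BAR from the dwords read back after writing 0xFFFF_FFFF.
/// `readback_hi` is only consulted for 64-bit BARs.
/// Returns `None` for I/O BARs and for unimplemented BARs.
fn decode_memory_bar(readback_lo: u32, readback_hi: u32) -> Option<BarInfo> {
    if readback_lo & 0x1 != 0 {
        return None; // I/O space BAR, not handled in this sketch
    }
    let is_64bit = (readback_lo >> 1) & 0x3 == 0x2;
    let prefetchable = readback_lo & 0x8 != 0;

    // Writable address bits; the low four bits are read-only flags.
    let lo_bits = u64::from(readback_lo & !0xF);
    let addr_bits = if is_64bit {
        (u64::from(readback_hi) << 32) | lo_bits
    } else {
        lo_bits
    };
    if addr_bits == 0 {
        return None; // BAR not implemented
    }

    // Treat the upper dword of a 32-bit BAR as all-ones so one formula fits both.
    let mask = if is_64bit { addr_bits } else { addr_bits | 0xFFFF_FFFF_0000_0000 };
    let size = (!mask).wrapping_add(1);

    // Assumption: only 64-bit prefetchable BARs use the high pool.
    let pool = if is_64bit && prefetchable { Pool::HighMmio } else { Pool::LowMmio };
    Some(BarInfo { size, pool })
}
```

For example, a 32-bit non-prefetchable BAR that reads back `0xFFFF_F000` decodes to a 4 KiB BAR in the low pool.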
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| Cargo.toml | Adds pci_resource_assignment workspace member + dependency entry. |
| Cargo.lock | Lockfile entries for the new crate and openvmm_core. |
| vm/devices/pci/pci_resource_assignment/Cargo.toml | New crate manifest. |
| vm/devices/pci/pci_resource_assignment/src/lib.rs | Public API: PciConfigAccess, AssignmentParams, assign_pci_resources, errors, and crate-private result types. |
| vm/devices/pci/pci_resource_assignment/src/enumerate.rs | Phase 1: DFS bus enumeration, BAR size probing, and SR-IOV VF bus reservation. |
| vm/devices/pci/pci_resource_assignment/src/assign.rs | Phase 2: bottom-up sizing, top-down address allocation, and bridge/BAR programming. |
| vm/devices/pci/pci_resource_assignment/src/tests.rs | Unit tests with mock config space covering endpoints, bridges, switches, SR-IOV, errors, and alignment edge cases. |
| openvmm/openvmm_core/Cargo.toml | Adds dependency on the new crate. |
| openvmm/openvmm_core/src/worker/dispatch.rs | Stores Arc<Chipset> in LoadedVmInner; adds assign_pci_resources invoked after load_firmware(false) on boot and reset. |
| openvmm/openvmm_core/src/worker/dispatch/ecam_config_access.rs | New module implementing PciConfigAccess via chipset MMIO at ECAM addresses, and a helper that iterates all host bridges. |
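For context on the ECAM-based access path listed in the table, here is a minimal sketch of the standard ECAM address calculation, where each function gets a 4 KiB config space window. The `ecam_address` helper is hypothetical and is not the module's actual API. It assumes `ecam_base` is the address of bus 0's config space and that accesses are dword-aligned.

```rust
/// Computes the MMIO address of a config-space dword under ECAM.
/// Layout: bus in bits 27:20, device in bits 19:15, function in bits 14:12,
/// register offset in bits 11:0.
/// Returns `None` for out-of-range or misaligned inputs.
fn ecam_address(ecam_base: u64, bus: u8, device: u8, function: u8, offset: u16) -> Option<u64> {
    if device > 31 || function > 7 || offset > 0xFFF || offset % 4 != 0 {
        return None;
    }
    let relative = (u64::from(bus) << 20)
        | (u64::from(device) << 15)
        | (u64::from(function) << 12)
        | u64::from(offset);
    ecam_base.checked_add(relative)
}
```

With a base of `0xE000_0000`, BAR0 (offset `0x10`) of bus 1, device 2, function 3 lands at `0xE011_3010`.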
    let stop_guard = self.inner.partition_unit.temporarily_stop_vps().await;

    // Start state units so device config space is accessible.
    self.state_units.start().await;
All this state unit churn is going to create a lot of tracing. Is there anything we can do to reduce it?
Hmm. I don't have any ideas. Do you?
I wonder if I can change the state unit code to only trace when the time exceeds some threshold.
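One possible shape for the threshold idea floated above, as a synchronous sketch only: time an operation and report the elapsed time only when it reaches a threshold, so fast transitions produce no trace output. `run_timed` is a hypothetical helper, not part of the state unit code, which is async and may need a different structure.

```rust
use std::time::{Duration, Instant};

/// Runs `op` and returns its result, plus `Some(elapsed)` only when the
/// operation took at least `threshold`. A caller would emit a trace event
/// only in the `Some` case.
fn run_timed<T>(threshold: Duration, op: impl FnOnce() -> T) -> (T, Option<Duration>) {
    let start = Instant::now();
    let result = op();
    let elapsed = start.elapsed();
    (result, (elapsed >= threshold).then_some(elapsed))
}
```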
    /// resource requirement of its subtree.
    /// 2. **Top-down assignment**: The host bridge carves its aperture among
    /// top-level devices/bridges. Each bridge subdivides its allocated
    /// range among children, largest-first.
Will this support leaf devices providing static GPAs for reservation in the future?
I think that will require an alternate algorithm. I was thinking we'd probably say that either all devices in a root complex must have static BAR assignments, or none--then there's a pretty simple bottom-up bridge window assignment algorithm. But a mix is difficult, since there's no obvious greedy best-fit algorithm after you've poked a bunch of holes in the aperture.
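For reference, the largest-first top-down subdivision from the doc comment under discussion can be sketched as follows. This is a simplified illustration, not the crate's implementation: `Placement`, `CarveError`, and `carve` are hypothetical names, and the sketch assumes every child requirement is a power-of-two size with natural alignment (true for individual BARs, while bridge windows carry their own alignment rules).

```rust
#[derive(Debug, PartialEq)]
struct Placement {
    /// Index of the child in the input slice.
    index: usize,
    /// Assigned base address.
    base: u64,
}

#[derive(Debug, PartialEq)]
enum CarveError {
    /// A size was zero or not a power of two.
    BadSize,
    /// The children do not fit in the aperture.
    ApertureTooSmall,
}

/// Places each child in `[base, base + len)`, largest first, naturally aligned.
fn carve(base: u64, len: u64, sizes: &[u64]) -> Result<Vec<Placement>, CarveError> {
    let end = base.checked_add(len).ok_or(CarveError::ApertureTooSmall)?;

    // Visit children in descending size order (stable for equal sizes).
    let mut order: Vec<usize> = (0..sizes.len()).collect();
    order.sort_by(|&a, &b| sizes[b].cmp(&sizes[a]));

    let mut cursor = base;
    let mut placements = Vec::with_capacity(sizes.len());
    for index in order {
        let size = sizes[index];
        if !size.is_power_of_two() {
            return Err(CarveError::BadSize);
        }
        // Round the cursor up to the child's natural alignment.
        let start = cursor
            .checked_add(size - 1)
            .map(|v| v & !(size - 1))
            .ok_or(CarveError::ApertureTooSmall)?;
        let next = start.checked_add(size).ok_or(CarveError::ApertureTooSmall)?;
        if next > end {
            return Err(CarveError::ApertureTooSmall);
        }
        placements.push(Placement { index, base: start });
        cursor = next;
    }
    Ok(placements)
}
```

With power-of-two sizes in descending order, the cursor stays aligned for every child after the first, so no padding accumulates between children. That property is lost once statically assigned BARs have punched holes in the aperture, which is the difficulty raised in the reply above.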