
pci_resource_assignment: add PCI resource assignment crate #3570

Merged
jstarks merged 7 commits into microsoft:main from jstarks:pre-alloc
May 28, 2026


@jstarks
Member

@jstarks jstarks commented May 27, 2026

When booting Linux directly (without UEFI), there is no firmware to enumerate the PCI bus, assign bus numbers, probe BAR sizes, or program BAR addresses and bridge windows. Devices behind PCIe root ports are invisible to the guest because their config space and MMIO regions are unconfigured. This crate fills that gap for Linux direct boot.

Longer term, this will also replace UEFI's PCI enumeration for all boot modes. Performing resource assignment in the VMM rather than in guest firmware lets us validate the PCI topology and MMIO layout before the guest ever runs, catching configuration errors (undersized apertures, impossible BAR placements, bus exhaustion) as clear VMM-side errors instead of mysterious guest boot failures.

The algorithm has two phases. Phase 1 walks the bus topology depth-first, assigning secondary and subordinate bus numbers to bridges and probing each device's BAR sizes. SR-IOV VF bus requirements are accounted for when setting subordinate bus numbers.
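A minimal sketch of the Phase 1 bus numbering, assuming an in-memory topology tree. `Node`, `take_buses`, `assign_buses`, and `BusExhausted` are hypothetical names for illustration, not the crate's API (the crate discovers the topology through config space), and reserving VF buses straight from the running counter is a simplification:

```rust
/// Hypothetical in-memory model of a PCI topology (illustration only).
enum Node {
    /// An endpoint; `vf_buses` models extra bus numbers reserved for SR-IOV VFs.
    Endpoint { vf_buses: u8 },
    /// A bridge and the devices on its secondary bus.
    Bridge { children: Vec<Node>, secondary: u8, subordinate: u8 },
}

#[derive(Debug, PartialEq)]
struct BusExhausted;

/// Reserves `count` consecutive bus numbers and returns the first one.
fn take_buses(next: &mut u16, count: u16) -> Result<u8, BusExhausted> {
    let first = *next;
    if first + count > 256 {
        return Err(BusExhausted);
    }
    *next += count;
    Ok(first as u8)
}

/// Depth-first numbering. `next` is the next unused bus number; start it
/// at 1 when the nodes passed in sit on root bus 0.
fn assign_buses(nodes: &mut [Node], next: &mut u16) -> Result<(), BusExhausted> {
    for node in nodes {
        match node {
            Node::Endpoint { vf_buses } => {
                // Reserve VF buses so the parent's subordinate number covers them.
                take_buses(next, u16::from(*vf_buses))?;
            }
            Node::Bridge { children, secondary, subordinate } => {
                *secondary = take_buses(next, 1)?;
                assign_buses(children, next)?;
                // The subordinate bus is the highest bus used below this bridge.
                *subordinate = (*next - 1) as u8;
            }
        }
    }
    Ok(())
}
```

Running out of the 256 available bus numbers surfaces as an error value here, which mirrors the goal of reporting bus exhaustion on the VMM side.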

Phase 2 uses hierarchical bottom-up/top-down allocation. Each bridge computes the total aligned resource requirement of its subtree, then the host bridge carves its MMIO aperture among top-level devices, with each bridge subdividing its allocated range among children. This guarantees non-overlapping bridge windows and correct alignment at every level. BARs are split into two pools: non-prefetchable BARs go to low MMIO (32-bit bridge window), while 64-bit prefetchable BARs go to high MMIO (prefetchable bridge window, the only window capable of 64-bit addresses).
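The bottom-up/top-down idea can be sketched for a single pool as follows. This is an illustration under stated assumptions rather than the crate's code: `Node`, `requirement`, `assign`, `assign_in_aperture`, and `OutOfSpace` are hypothetical, BAR sizes are assumed to be nonzero powers of two, and children are ordered by descending alignment.

```rust
/// Bridge memory windows are programmed with 1 MiB granularity.
const WINDOW_ALIGN: u64 = 1 << 20;

fn align_up(value: u64, align: u64) -> u64 {
    (value + align - 1) & !(align - 1)
}

#[derive(Debug, PartialEq)]
struct OutOfSpace;

/// Hypothetical tree for one pool; `base`/`end` are filled in by `assign`.
enum Node {
    Bar { size: u64, base: u64 },
    Bridge { children: Vec<Node>, base: u64, end: u64 },
}

impl Node {
    /// Bottom-up: (size, alignment) needed by this node's subtree.
    fn requirement(&self) -> (u64, u64) {
        match self {
            Node::Bar { size, .. } => (*size, *size),
            Node::Bridge { children, .. } => {
                let mut reqs: Vec<(u64, u64)> =
                    children.iter().map(Node::requirement).collect();
                reqs.sort_by_key(|r| std::cmp::Reverse(r.1));
                let (mut total, mut align) = (0, WINDOW_ALIGN);
                for (size, a) in reqs {
                    total = align_up(total, a) + size;
                    align = align.max(a);
                }
                (align_up(total, WINDOW_ALIGN), align)
            }
        }
    }

    /// Top-down: place this node at `at` (already aligned by the caller).
    /// Recomputing `requirement` at every level keeps the sketch short.
    fn assign(&mut self, at: u64) {
        let (size, _) = self.requirement();
        match self {
            Node::Bar { base, .. } => *base = at,
            Node::Bridge { children, base, end } => {
                *base = at;
                *end = at + size; // exclusive end of the window
                children.sort_by_key(|c| std::cmp::Reverse(c.requirement().1));
                let mut cursor = at;
                for child in children {
                    let (child_size, child_align) = child.requirement();
                    cursor = align_up(cursor, child_align);
                    child.assign(cursor);
                    cursor += child_size;
                }
            }
        }
    }
}

/// Treats `root` as the host bridge and fits it into the aperture.
fn assign_in_aperture(root: &mut Node, start: u64, len: u64) -> Result<(), OutOfSpace> {
    let (size, align) = root.requirement();
    let base = align_up(start, align);
    if base + size > start + len {
        return Err(OutOfSpace);
    }
    root.assign(base);
    Ok(())
}
```

Because sizing and placement use the same stable ordering, every child lands inside the window its parent computed. The same pass would run once per pool (low MMIO and high MMIO).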

The crate is wired into openvmm_core for Linux direct boot: after loading the kernel, state units are temporarily started with VPs held so that config space accesses route through the chipset's MMIO dispatch, the assignment runs, then state units stop again before the guest resumes.

Copilot AI review requested due to automatic review settings May 27, 2026 00:23
Contributor

Copilot AI left a comment


Pull request overview

Adds a new pci_resource_assignment crate that performs VMM-side PCI bus enumeration and BAR/bridge-window allocation for Linux direct boot (where no firmware is present to do it). The crate works purely through a PciConfigAccess trait. An ECAM-based implementation is wired into openvmm_core, which routes config-space cycles through Chipset MMIO dispatch and runs the assignment after kernel load (and after reset) for LoadMode::Linux VMs with non-empty PCIe host bridges.
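For context, an ECAM region maps each function's 4 KiB config space at a fixed offset derived from its bus, device, and function numbers. A minimal sketch of that address computation follows; `ecam_offset` is a hypothetical helper, not the signature of `PciConfigAccess`, and it assumes the region starts at bus 0:

```rust
/// Byte offset of a config-space register inside an ECAM region.
/// Layout: bus in bits 27:20, device in 19:15, function in 14:12,
/// register offset in 11:0. Returns `None` for out-of-range inputs.
fn ecam_offset(bus: u8, device: u8, function: u8, reg: u16) -> Option<u64> {
    if device >= 32 || function >= 8 || reg >= 4096 {
        return None;
    }
    Some(
        (u64::from(bus) << 20)
            | (u64::from(device) << 15)
            | (u64::from(function) << 12)
            | u64::from(reg),
    )
}
```

An MMIO-backed accessor would add this offset to the host bridge's ECAM base address before issuing the read or write.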

Changes:

  • New crate with two-phase algorithm: DFS bus enumeration + BAR-size probing, then bottom-up sizing / top-down address assignment, with SR-IOV VF bus reservation and 32-bit vs 64-bit-prefetchable pool splitting.
  • New ecam_config_access module providing an ECAM PciConfigAccess impl via the chipset.
  • LoadedVm::assign_pci_resources runs the assignment with state units started and VPs held; integrated into initial boot and reset paths.
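BAR-size probing conventionally writes all ones to a BAR and decodes the value read back. The sketch below decodes such a readback for memory BARs and picks a pool. `MemBar`, `Pool`, and `decode_mem_bar` are hypothetical names, and sending 32-bit prefetchable BARs to the low pool is an assumption, since the description only states where non-prefetchable and 64-bit prefetchable BARs go:

```rust
#[derive(Debug, PartialEq)]
enum Pool {
    Low32,
    High64Prefetchable,
}

#[derive(Debug, PartialEq)]
struct MemBar {
    size: u64,
    is_64bit: bool,
    prefetchable: bool,
}

impl MemBar {
    fn pool(&self) -> Pool {
        // Assumption: only 64-bit prefetchable BARs use the high pool.
        if self.is_64bit && self.prefetchable {
            Pool::High64Prefetchable
        } else {
            Pool::Low32
        }
    }
}

/// Decodes the values read back after writing 0xFFFF_FFFF to a BAR
/// (`lo`) and, for a 64-bit BAR, to the following register (`hi`).
/// Returns `None` for I/O BARs and unimplemented BARs.
fn decode_mem_bar(lo: u32, hi: u32) -> Option<MemBar> {
    if lo & 1 != 0 {
        return None; // bit 0 set means an I/O BAR
    }
    let is_64bit = (lo >> 1) & 0b11 == 0b10;
    let prefetchable = lo & 0b1000 != 0;
    let low = u64::from(lo & !0xF); // strip the read-only type bits
    let mask = if is_64bit {
        (u64::from(hi) << 32) | low
    } else {
        0xFFFF_FFFF_0000_0000 | low
    };
    if (is_64bit && mask == 0) || (!is_64bit && low == 0) {
        return None; // no writable address bits: BAR not implemented
    }
    // The writable address bits form a mask; the size is its two's complement.
    Some(MemBar { size: !mask + 1, is_64bit, prefetchable })
}
```

A real prober would also save the original BAR value before the write and restore it afterwards.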

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| Cargo.toml | Adds pci_resource_assignment workspace member + dependency entry. |
| Cargo.lock | Lockfile entries for the new crate and openvmm_core. |
| vm/devices/pci/pci_resource_assignment/Cargo.toml | New crate manifest. |
| vm/devices/pci/pci_resource_assignment/src/lib.rs | Public API: PciConfigAccess, AssignmentParams, assign_pci_resources, errors, and crate-private result types. |
| vm/devices/pci/pci_resource_assignment/src/enumerate.rs | Phase 1: DFS bus enumeration, BAR size probing, and SR-IOV VF bus reservation. |
| vm/devices/pci/pci_resource_assignment/src/assign.rs | Phase 2: bottom-up sizing, top-down address allocation, and bridge/BAR programming. |
| vm/devices/pci/pci_resource_assignment/src/tests.rs | Unit tests with mock config space covering endpoints, bridges, switches, SR-IOV, errors, and alignment edge cases. |
| openvmm/openvmm_core/Cargo.toml | Adds dependency on the new crate. |
| openvmm/openvmm_core/src/worker/dispatch.rs | Stores Arc<Chipset> in LoadedVmInner; adds assign_pci_resources invoked after load_firmware(false) on boot and reset. |
| openvmm/openvmm_core/src/worker/dispatch/ecam_config_access.rs | New module implementing PciConfigAccess via chipset MMIO at ECAM addresses, and a helper that iterates all host bridges. |

Comment thread vm/devices/pci/pci_resource_assignment/src/enumerate.rs Outdated
Comment thread vm/devices/pci/pci_resource_assignment/src/tests.rs Outdated
Comment thread openvmm/openvmm_core/src/worker/dispatch/ecam_config_access.rs Outdated
Copilot AI review requested due to automatic review settings May 27, 2026 01:02
@jstarks jstarks marked this pull request as ready for review May 27, 2026 01:03
@jstarks jstarks requested a review from a team as a code owner May 27, 2026 01:03
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated 4 comments.

Comment thread vm/devices/pci/pci_resource_assignment/src/tests.rs Outdated
Comment thread vm/devices/pci/pci_resource_assignment/src/assign.rs Outdated
Comment thread vm/devices/pci/pci_resource_assignment/src/enumerate.rs Outdated
Comment thread vm/devices/pci/pci_resource_assignment/src/assign.rs Outdated
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 9 out of 10 changed files in this pull request and generated 5 comments.

Comment thread vm/devices/pci/pci_resource_assignment/src/enumerate.rs Outdated
Comment thread vm/devices/pci/pci_resource_assignment/src/assign.rs
Comment thread vm/devices/pci/pci_resource_assignment/src/assign.rs
Comment thread vm/devices/pci/pci_resource_assignment/src/enumerate.rs
Comment thread vm/devices/pci/pci_resource_assignment/src/enumerate.rs Outdated

let stop_guard = self.inner.partition_unit.temporarily_stop_vps().await;

// Start state units so device config space is accessible.
self.state_units.start().await;
Contributor


All this state unit churn is going to create a lot of tracing. Is there anything we can do to reduce it?

Member Author


Hmm. I don't have any ideas. Do you?

Member Author


I wonder if I can change the state unit code to only trace when the time exceeds some threshold.
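One way to picture the threshold idea is a synchronous sketch using only the standard library. `run_timed` is a hypothetical helper; the actual state unit code is async and uses its own tracing, so this only illustrates the shape of the check:

```rust
use std::time::{Duration, Instant};

/// Runs `op` and reports its duration only when it meets or exceeds
/// `threshold`, so the caller can skip tracing for fast transitions.
fn run_timed<T>(threshold: Duration, op: impl FnOnce() -> T) -> (T, Option<Duration>) {
    let start = Instant::now();
    let result = op();
    let elapsed = start.elapsed();
    (result, (elapsed >= threshold).then_some(elapsed))
}
```

A caller would emit a trace event only when the second element is `Some`, keeping quick start/stop cycles silent.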


/// resource requirement of its subtree.
/// 2. **Top-down assignment**: The host bridge carves its aperture among
/// top-level devices/bridges. Each bridge subdivides its allocated
/// range among children, largest-first.
Contributor


Will this support leaf devices providing static GPAs for reservation in the future?

Member Author


I think that will require an alternate algorithm. I was thinking we'd probably say that either all devices in a root complex must have static BAR assignments, or none. In the all-static case there's a pretty simple bottom-up bridge window assignment algorithm. But a mix is difficult, since there's no obvious greedy best-fit algorithm after you've poked a bunch of holes in the aperture.

@jstarks jstarks merged commit 24d0e18 into microsoft:main May 28, 2026
67 checks passed
@jstarks jstarks deleted the pre-alloc branch May 28, 2026 21:17

4 participants