Comment 16 for bug 1849563

Guilherme G. Piccoli (gpiccoli) wrote : Re: Unable to passthrough GPUs to guest

Hi Dann, thanks a lot for your logs and pretty great bug report here. The thread with edk2 folks is really informative!

I've been doing some research on this topic, and will share here in order to document it.
So, the first thing is about "pci=nocrs" and "pci=realloc". When setting "pci=nocrs", we are telling the kernel to disregard the ACPI resource information (CRS == Current Resource Settings object). In this mode, all memory is available for PCI host bridge allocations except the RAM and other detected reservations, so it bypasses the limitations of the OVMF firmware. This mode was the default 10 years ago, but it was changed due to some incompatibilities, for example on systems with more than one PCI host bridge; the PCI subsystem maintainer then decided to "trust" the FW resource mapping more, and allowed a kernel fallback through the option "nocrs". This is well explained in [0] and [1].

The option "pci=realloc" is somewhat orthogonal to it; basically it allows the kernel to perform memory (re-)assignments for PCI endpoints (aka devices) within the memory space of their PCI host bridge. So, an analogy would be: the PCI host bridge resource is a pile of memory from which devices take what they need for their BARs. With "pci=realloc", we allow the kernel to retry this memory mapping for PCI devices a few times, until it works (or eventually fails). It's natural to use "pci=realloc", and this option is somewhat automatic due to the kernel build-time configuration PCI_REALLOC_ENABLE_AUTO, which is enabled by default in Ubuntu kernels. In summary, "pci=realloc" governs the way the memory of the PCI host bridge is distributed to the PCI devices.
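
To make the analogy a bit more concrete, here is a minimal Python sketch (not kernel code) of handing out BARs from a host bridge window. The function names and the sizes used are hypothetical; the only real-world rule it relies on is that a BAR size is a power of two and the BAR must be naturally aligned to its size. The kernel's actual algorithm is considerably more involved.

```python
def align_up(addr, align):
    """Round addr up to the next multiple of align (align is a power of two)."""
    return (addr + align - 1) & ~(align - 1)


def assign_bars(window_start, window_size, bar_sizes):
    """Place BARs inside [window_start, window_start + window_size).

    Toy model: largest BARs first, each naturally aligned to its own size.
    Returns {bar_index: address} on success, or None if the window is too small.
    """
    window_end = window_start + window_size
    cursor = window_start
    placed = {}
    # Visit BAR indices from the largest BAR to the smallest.
    order = sorted(range(len(bar_sizes)), key=lambda i: bar_sizes[i], reverse=True)
    for i in order:
        size = bar_sizes[i]
        addr = align_up(cursor, size)
        if addr + size > window_end:
            return None  # the "pile of memory" ran out
        placed[i] = addr
        cursor = addr + size
    return placed
```

With a window that is too small for the largest BAR plus the others, the function returns None, which is roughly the situation in which the kernel would give up after its retries.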

Now, regarding the firmware differences between OVMF and seabios. As per the edk2 thread mentioned in the above comment, OVMF has a strict limitation on the PCI64 aperture size. In seabios, things are a bit different: the ACPI table passed to Linux containing the PCI64 aperture information is the DSDT, and this table is built dynamically based on the SSDT construction at boot time (build_ssdt() in the seabios code). This is ultimately based on PCI initialization routines that determine the BARs' sizes and sum all of them, given the information in the PCI devices' configuration space. The functions involved in this process are:

pci_setup() -> pci_bios_check_devices()/pci_bios_map_devices()

There's no limit on the aperture size, which is variable and can accommodate as many devices as the guest memory allows. In a way, this is similar to the way Linux would perform the PCI resource allocations with the "pci=nocrs" parameter.
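
A rough Python sketch of that sizing idea follows. It assumes a simplified model in which the 64-bit window is sized purely from the BAR sizes read from the devices, and that the window start is aligned to the largest BAR. The function name is hypothetical and this is not a port of pci_bios_check_devices().

```python
def window_size_for(bar_sizes):
    """Return the bytes needed to hold all BARs, each aligned to its own size.

    When BARs are packed largest-first from a start address aligned to the
    largest BAR, every BAR lands on a multiple of its own (power-of-two)
    size, so no padding is needed and the total is just the sum.
    """
    for size in bar_sizes:
        if size <= 0 or size & (size - 1):
            raise ValueError(f"BAR size {size:#x} is not a power of two")
    return sum(bar_sizes)
```

The point is that the window grows with the devices present, instead of the devices having to fit into a window chosen in advance.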

Now, OVMF is more complex in nature. The source tree of OVMF is composed of multiple modules. The module MdeModulePkg is responsible for the PCI enumeration in OVMF. There are 2 parts involved in that:

- The aperture is set up in the submodule PciHostBridgeDxe; its size comes from the early portions of the firmware code (submodule OvmfPkg/PlatformPei), in the memory detection routine (and at that point we can hook into it using the experimental parameter X-PciMmio64Mb). This is then passed to PciHostBridgeDxe, which will create a bridge with the memory resources' limits set.

- The PCI enumeration itself (and especially the device dropping in case the aperture is exceeded) happens in the submodule PciBusDxe, through the following functions:
PciBusDriverBindingStart() -> PciEnumerator() -> PciHostBridgeResourceAllocator()
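
As an illustration of the experimental parameter mentioned in the first item, here is a minimal Python sketch that builds the corresponding QEMU arguments. It assumes the knob is exposed as the fw_cfg file "opt/ovmf/X-PciMmio64Mb" and that its value is a decimal string in MiB; that is my understanding of the PlatformPei code, but it should be checked against the OVMF version in use. The helper name is hypothetical.

```python
def pci_mmio64_fw_cfg_args(aperture_mib):
    """Build QEMU command-line arguments requesting a 64-bit PCI aperture.

    aperture_mib: desired aperture size in MiB (e.g. 65536 for 64 GiB).
    The fw_cfg file name is an assumption, see the text above.
    """
    if aperture_mib <= 0:
        raise ValueError("aperture size must be positive")
    return ["-fw_cfg", f"name=opt/ovmf/X-PciMmio64Mb,string={aperture_mib}"]
```

The returned list can be appended to an existing QEMU argument list; the size to request depends on the BARs of the devices being passed through.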

The function PciHostBridgeResourceAllocator() is the one that actually tries to allocate the memory, through what's called the Global Coherency Domain (GCD), the generic edk2/UEFI memory/IO manager. This is done at the PCI bridge "level", and if it fails due to lack of resources then it'll go through the following functions to free resources in the bridge:
PciHostBridgeAdjustAllocation() -> GetMaxResourceConsumerDevice()

At this point, the GPU is discarded in favor of other devices in case its BAR is too large for the OVMF PCI64 aperture limit. For reference, this is the edk2/OVMF commit that limits the PCI64 aperture size by default: 7e5b1b670c ("OvmfPkg: PlatformPei: determine the 64-bit PCI host aperture for X64 DXE")
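
The dropping behaviour can be illustrated with a toy Python model. It assumes, as a simplification, that devices fit whenever the sum of their BAR sizes is within the aperture, and that the largest consumer is rejected each time the allocation fails. The names and sizes are hypothetical, and this is not a translation of PciHostBridgeAdjustAllocation().

```python
def fit_devices(aperture_size, devices):
    """Drop the largest resource consumer until the rest fit the aperture.

    devices: {name: total 64-bit BAR bytes}. Returns (kept, dropped), where
    kept is a dict of surviving devices and dropped is a list of names in
    the order they were rejected.
    """
    kept = dict(devices)
    dropped = []
    while kept and sum(kept.values()) > aperture_size:
        # Loosely mirrors the idea behind GetMaxResourceConsumerDevice().
        victim = max(kept, key=kept.get)
        dropped.append(victim)
        del kept[victim]
    return kept, dropped
```

In this model a GPU with a very large BAR is always the first candidate to be rejected, while small devices such as a NIC survive, which matches what is observed in the guest.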

Cheers,

Guilherme

[0] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/341681/comments/97
[1] https://bugzilla.kernel.org/show_bug.cgi?id=14183