Unable to passthrough GPUs to guest, due to PCI64 aperture limitation

Bug #1849563 reported by dann frazier
This bug affects 2 people
Affects          Status      Importance   Assigned to            Milestone
edk2 (Ubuntu)    Confirmed   Medium       Guilherme G. Piccoli
  Bionic         Confirmed   Medium       Guilherme G. Piccoli
  Eoan           Won't Fix   Medium       Guilherme G. Piccoli
  Focal          Confirmed   Medium       Guilherme G. Piccoli
  Groovy         Won't Fix   Medium       Guilherme G. Piccoli

Bug Description

I'm having issues passing Nvidia Tesla GPUs to an OVMF-mode guest. While I can passthrough other devices to an OVMF-mode guest w/o a problem (e.g. Mellanox Connect-X 5 VFs), I'm seeing a couple different failure modes when passing through a GPU:

1) No output:

---------
$ virsh start virtinst; virsh console virtinst
Domain virtinst started

Connected to domain virtinst
Escape character is ^]
---------

I discovered that I'm able to avoid this by placing the device on a different BSF in the guest.

This results in a hang:
<address type='pci' domain='0x0000' bus='0x00' slot='0x0b' function='0x0'/>

Whilst this gets us further:

<address type='pci' domain='0x0000' bus='0x05' slot='0x02' function='0x0'/>

Though that too fails after OS boot as described next:

2) OS boots, device appears within, but the kernel is unable to configure resources:

[ 4.744211] nvidia-nvlink: Nvlink Core is being initialized, major device number 241
[ 4.750811] NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:
[ 4.750811] NVRM: BAR1 is 0M @ 0x0 (PCI:0000:01:02.0)
[ 4.756960] NVRM: The system BIOS may have misconfigured your GPU.
[ 4.759725] nvidia: probe of 0000:01:02.0 failed with error -1
[ 4.762347] NVRM: The NVIDIA probe routine failed for 1 device(s).
[ 4.766010] NVRM: None of the NVIDIA devices were initialized.
[ 4.769701] nvidia-nvlink: Unregistered the Nvlink Core, major device number 241

I've found that #2 can be worked around w/ 'pci=nocrs'.

Neither issue is reproducible when booting in non-UEFI mode.

I observed this with bionic's ovmf 0~20180205.c0d9813c-2ubuntu0.1, and it is reproducible with Debian's 0~20190828.37eef910-3, and a manually built version of the latest upstream edk2 (@412c96384). Kernel-wise, I experimented with upgrading the guest and later the host from bionic's 4.15 GA to 5.3 hwe-edge kernel w/o any noticeable change in behavior.

Tags: sts
dann frazier (dannf) wrote :

cpaelzer recommends that we retest w/ the q35 machine type as a next step.

dann frazier (dannf) wrote :

I retested w/ q35 and did not have an issue. Here's the command I used:

virt-install --name q35 --machine q35 --memory 4096 --boot uefi --disk /home/ubuntu/q35.img --disk q35-seed.img --hostdev pci_0000_34_00_0 --hostdev pci_0000_36_00_0 --graphics none

But, by default, virt-install placed the devices on different BDFs in the guest when selecting q35:
      <address type='pci' domain='0x0000' bus='0x03' slot='0x00' function='0x0'/>

virsh edit wouldn't let me place the device at slot > 0, so I couldn't do an apples/apples comparison w/ the i440FX.

dann frazier (dannf) wrote :

Urgh, sorry - ignore comment #5. The only reason I didn't see issues is that I hadn't yet installed nvidia-dkms in the guest :( Even with q35 I see the "NVRM: This PCI I/O region assigned to your NVIDIA device is invalid:" and following messages. Further, in q35 mode, pci=nocrs no longer seems to be a functioning workaround.

Dimitri John Ledkov (xnox) wrote :

What is host architecture?
Does host support gpu passthrough?
Was it enabled in the BIOS settings and on the kernel command line?

I.e. VT-d enabled in the BIOS plus intel_iommu=on on the kernel command line for an Intel-based machine. Similar steps are needed on other platforms, e.g. AMD, because passthrough support is specific to the host vendor and hardware.

dann frazier (dannf) wrote : Re: [Bug 1849563] Re: Unable to passthrough GPUs to guest

On Thu, Nov 7, 2019 at 7:50 AM Dimitri John Ledkov
<email address hidden> wrote:
>
> What is host architecture?

amd64 - specifically, an Nvidia DGX2 system.

> Does host support gpu passthrough?
> Was it enabled in the BIOS settings and on the kernel command line?
>
> I.e. VT-d enabled in the BIOS plus intel_iommu=on on the kernel
> command line for an Intel-based machine. Similar steps are needed on
> other platforms, e.g. AMD, because passthrough support is specific to
> the host vendor and hardware.

I think the important piece here - and the reason I filed the issue
against edk2 - is the following from the Description:

"Neither issue is reproducible when booting in non-UEFI mode."

Since I can passthrough devices just fine in legacy BIOS mode, I
assume all of my host BIOS/host kernel configuration is OK.

dann frazier (dannf)
Changed in edk2 (Ubuntu Bionic):
status: New → Confirmed
dann frazier (dannf) wrote : Re: Unable to passthrough GPUs to guest

As an experiment, I retried this test with focal host/guest (on the off-chance that e.g. we were missing something from QEMU or some topology logic in virtinst), but the results were the same.

dann frazier (dannf) wrote :

fyi, passing both pci=realloc pci=nocrs works as a workaround for me for q35 guests.
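To make this workaround persist across reboots, the parameters have to end up on the guest's kernel command line. A minimal sketch, assuming a guest that boots via GRUB and keeps its defaults in /etc/default/grub (Ubuntu cloud images may override this from a file under /etc/default/grub.d/). The helper name add_kernel_params is made up for illustration, and the in-place edit relies on GNU sed:

```shell
#!/bin/sh
# Hypothetical helper: append kernel parameters to GRUB_CMDLINE_LINUX_DEFAULT
# in a GRUB defaults file, skipping parameters that are already present.
# Usage: add_kernel_params FILE PARAM...
add_kernel_params() {
    file=$1; shift
    for param in "$@"; do
        # Already on the line (delimited by a quote or a space)? Then skip it.
        if grep -q "^GRUB_CMDLINE_LINUX_DEFAULT=.*[\" ]${param}[\" ]" "$file"; then
            continue
        fi
        # Re-emit the quoted value with the new parameter appended.
        sed -i "s|^GRUB_CMDLINE_LINUX_DEFAULT=\"\(.*\)\"|GRUB_CMDLINE_LINUX_DEFAULT=\"\1 ${param}\"|" "$file"
    done
}
```

In the guest this would be run as root, e.g. add_kernel_params /etc/default/grub pci=realloc pci=nocrs, followed by update-grub and a reboot; /proc/cmdline then shows whether the parameters took effect.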

dann frazier (dannf) wrote :

Attaching a boot log using an OVMF built w/ DEBUG enabled, which adds some more runes, such as:

PciBus: HostBridge->NotifyPhase(AllocateResources) - Out of Resources
PciBus: [01|00|00] was rejected due to resource confliction.

Guilherme G. Piccoli (gpiccoli) wrote :

Hi Dann, thanks a lot for your logs and pretty great bug report here. The thread with edk2 folks is really informative!

I've been doing some research on this topic, and will share here in order to document it.
So, the first thing is about "pci=nocrs" and "pci=realloc". When setting "pci=nocrs", we are telling the kernel to disregard the ACPI resource information (_CRS is the Current Resource Settings object). In this mode, all memory is available to PCI host bridge allocations except the RAM and other detected reservations, so it bypasses the limitations of the OVMF firmware. This mode was the default 10 years ago, but it was changed due to some incompatibilities, for example on systems with more than one PCI host bridge. The PCI subsystem maintainer then decided to put more trust in the firmware resource mapping, and kept the old behavior available as a kernel fallback through the "nocrs" option. This is well explained in [0] and [1].

The option "pci=realloc" is somewhat orthogonal to it: it allows the kernel to (re-)assign memory to PCI endpoints (i.e. devices) within the memory space of their PCI host bridge. As an analogy, the PCI host bridge resource is a pile of memory from which devices take pieces for their BARs. With "pci=realloc", we allow the kernel to retry this memory mapping for PCI devices a number of times, until it works (or eventually fails). Using "pci=realloc" is natural, and the option is mostly automatic thanks to the kernel build-time configuration PCI_REALLOC_ENABLE_AUTO, which is enabled by default in Ubuntu kernels. In summary, "pci=realloc" governs how the memory of the PCI host bridge is distributed to the PCI devices.

Now, regarding the firmware differences between OVMF and SeaBIOS. As per the edk2 thread mentioned in the above comment, OVMF has a strict limit on the PCI64 aperture size. In SeaBIOS, things are a bit different: the ACPI table passed to Linux that contains the PCI64 aperture information is the DSDT, and this table is built dynamically at boot time based on the SSDT construction (build_ssdt() in the SeaBIOS code). This is ultimately based on PCI initialization routines that determine the BAR sizes and sum all of them, given the information in the PCI devices' configuration space. The functions involved in this process are:

pci_setup() -> pci_bios_check_devices()/pci_bios_map_devices()

There's no limit on the aperture size, which is variable and can accommodate as many devices as the guest memory allows. In a way, this is similar to how Linux performs the PCI resource allocations with the "pci=nocrs" parameter.
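To make the contrast concrete, here is a toy shell model of the two behaviors: summing the BAR sizes (as SeaBIOS is described as doing) versus checking them against a fixed aperture (as OVMF is described as doing). It is not SeaBIOS or OVMF code, it ignores alignment, and the function name fits_aperture is made up for illustration:

```shell
#!/bin/sh
# Toy model, not firmware code: sum the sizes (in MiB) of the 64-bit BARs and
# report whether they fit into an aperture of the given size (in MiB).
# Usage: fits_aperture APERTURE_MIB BAR_MIB...
fits_aperture() {
    aperture_mib=$1; shift
    total_mib=0
    for bar_mib in "$@"; do
        total_mib=$((total_mib + bar_mib))
    done
    if [ "$total_mib" -le "$aperture_mib" ]; then
        echo "fits: need ${total_mib} MiB of ${aperture_mib} MiB"
    else
        echo "out of resources: need ${total_mib} MiB, have ${aperture_mib} MiB"
        return 1
    fi
}
```

With made-up numbers, fits_aperture 32768 16 32768 32 reports a shortfall, while the same BARs fit once the aperture argument is raised to 65536. A firmware that sizes the aperture from the sum never hits the first case.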

Now, OVMF is more complex in nature. The OVMF source tree is composed of multiple modules. The module MdeModulePkg is responsible for the PCI enumeration in OVMF. There are 2 parts involved in that:

- The aperture is calculated in the submodule PciHostBridgeDxe; it comes from the early portions of the firmware code (submodule OvmfPkg/PlatformPei), in the memory detection routine (and at that point we can override it using the experimental parameter X-PciMmio64Mb). It is then passed to PciHostBridgeDxe, which creates a bridge with the memory resource limits set.

- The PCI enumeration itself (and specially the device dropping in case the apertu...
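
For the experimental X-PciMmio64Mb parameter mentioned in the first item above, a minimal sketch of how it might be handed to QEMU. It assumes the knob is read from fw_cfg under the path opt/ovmf/X-PciMmio64Mb and takes the aperture size in MiB as a string; that matches my understanding of upstream OVMF but should be checked against the OVMF version in use. The helper name ovmf_mmio64_args and the example value are illustrative only:

```shell
#!/bin/sh
# Hypothetical helper: print the QEMU arguments that request a 64-bit MMIO
# aperture of the given size (in MiB) from OVMF.
# Assumption: OVMF reads the size from fw_cfg file opt/ovmf/X-PciMmio64Mb.
ovmf_mmio64_args() {
    size_mib=$1
    case $size_mib in
        ''|*[!0-9]*)
            echo "size must be a whole number of MiB" >&2
            return 1
            ;;
    esac
    printf '%s %s\n' "-fw_cfg" "name=opt/ovmf/X-PciMmio64Mb,string=${size_mib}"
}
```

For example, appending $(ovmf_mmio64_args 65536) to a qemu-system-x86_64 command line would request a 64 GiB aperture; with libvirt, the same two arguments could be supplied through the qemu:commandline XML namespace.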


Guilherme G. Piccoli (gpiccoli) wrote :
summary: - Unable to passthrough GPUs to guest
+ Unable to passthrough GPUs to guest, due to PCI64 aperture limitation
no longer affects: edk2 (Ubuntu Disco)
Changed in edk2 (Ubuntu Eoan):
status: New → Confirmed
Changed in edk2 (Ubuntu Focal):
status: New → Confirmed
Changed in edk2 (Ubuntu Groovy):
status: New → Confirmed
Changed in edk2 (Ubuntu Bionic):
importance: Undecided → Medium
Changed in edk2 (Ubuntu Eoan):
importance: Undecided → Medium
Changed in edk2 (Ubuntu Focal):
importance: Undecided → Medium
Changed in edk2 (Ubuntu Groovy):
importance: Undecided → Medium
Changed in edk2 (Ubuntu Bionic):
assignee: nobody → Guilherme G. Piccoli (gpiccoli)
Changed in edk2 (Ubuntu Eoan):
assignee: nobody → Guilherme G. Piccoli (gpiccoli)
Changed in edk2 (Ubuntu Groovy):
assignee: nobody → Guilherme G. Piccoli (gpiccoli)
Changed in edk2 (Ubuntu Focal):
assignee: nobody → Guilherme G. Piccoli (gpiccoli)
tags: added: sts
Changed in edk2 (Ubuntu Eoan):
status: Confirmed → Won't Fix
Brian Murray (brian-murray) wrote :

The Groovy Gorilla has reached end of life, so this bug will not be fixed for that release

Changed in edk2 (Ubuntu Groovy):
status: Confirmed → Won't Fix