PCIe cards passthrough to TCG guest works on 2GB of guest memory but fails on 4GB (vfio_dma_map invalid arg)

Bug #1869006 reported by Marcin Juszkiewicz on 2020-03-25
This bug affects 2 people
Affects: QEMU | Importance: Undecided | Assigned to: Unassigned

Bug Description

During a meeting a coworker asked "has anyone tried to pass through a PCIe card to a guest of another architecture?" and I decided to check it.

Plugged SATA and USB3 controllers into spare slots on the mainboard and started playing. On a 2 GB VM instance it worked (both cold- and hot-plugged). On a 4 GB one it did not:

Error while starting the domain: internal error: process exited while connecting to monitor: 2020-03-25T13:43:39.107524Z qemu-system-aarch64: -device vfio-pci,host=0000:29:00.0,id=hostdev0,bus=pci.3,addr=0x0: VFIO_MAP_DMA: -22
2020-03-25T13:43:39.107560Z qemu-system-aarch64: -device vfio-pci,host=0000:29:00.0,id=hostdev0,bus=pci.3,addr=0x0: vfio 0000:29:00.0: failed to setup container for group 28: memory listener initialization failed: Region mach-virt.ram: vfio_dma_map(0x563169753c80, 0x40000000, 0x100000000, 0x7fb2a3e00000) = -22 (Invalid argument)

Traceback (most recent call last):
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 75, in cb_wrapper
    callback(asyncjob, *args, **kwargs)
  File "/usr/share/virt-manager/virtManager/asyncjob.py", line 111, in tmpcb
    callback(*args, **kwargs)
  File "/usr/share/virt-manager/virtManager/object/libvirtobject.py", line 66, in newfn
    ret = fn(self, *args, **kwargs)
  File "/usr/share/virt-manager/virtManager/object/domain.py", line 1279, in startup
    self._backend.create()
  File "/usr/lib64/python3.8/site-packages/libvirt.py", line 1234, in create
    if ret == -1: raise libvirtError ('virDomainCreate() failed', dom=self)
libvirt.libvirtError: internal error: process exited while connecting to monitor: 2020-03-25T13:43:39.107524Z qemu-system-aarch64: -device vfio-pci,host=0000:29:00.0,id=hostdev0,bus=pci.3,addr=0x0: VFIO_MAP_DMA: -22
2020-03-25T13:43:39.107560Z qemu-system-aarch64: -device vfio-pci,host=0000:29:00.0,id=hostdev0,bus=pci.3,addr=0x0: vfio 0000:29:00.0: failed to setup container for group 28: memory listener initialization failed: Region mach-virt.ram: vfio_dma_map(0x563169753c80, 0x40000000, 0x100000000, 0x7fb2a3e00000) = -22 (Invalid argument)

I played with the memory size: 3054 MB is the maximum value with which the VM still boots with cold-plugged host PCIe cards.

Marcin Juszkiewicz (hrw) wrote :

The QEMU command line for booting the VM, as generated by libvirt:

/usr/bin/qemu-system-aarch64
-name guest=fedora-aarch64-pcie,debug-threads=on
-S
-object secret,id=masterKey0,format=raw,file=/var/lib/libvirt/qemu/domain-1-fedora-aarch64-pcie/master-key.aes
-blockdev {"driver":"file","filename":"/usr/share/edk2/aarch64/QEMU_EFI-pflash.raw","node-name":"libvirt-pflash0-storage","auto-read-only":true,"discard":"unmap"}
-blockdev {"node-name":"libvirt-pflash0-format","read-only":true,"driver":"raw","file":"libvirt-pflash0-storage"}
-blockdev {"driver":"file","filename":"/var/lib/libvirt/qemu/nvram/fedora-aarch64-pcie_VARS.fd","node-name":"libvirt-pflash1-storage","auto-read-only":true,"discard":"unmap"}
-blockdev {"node-name":"libvirt-pflash1-format","read-only":false,"driver":"raw","file":"libvirt-pflash1-storage"}
-machine virt-4.2,accel=tcg,usb=off,dump-guest-core=off,gic-version=2,pflash0=libvirt-pflash0-format,pflash1=libvirt-pflash1-format
-cpu cortex-a57
-m 2048
-overcommit mem-lock=off
-smp 3,sockets=3,cores=1,threads=1
-uuid 139dc97a-1511-480d-b215-c58a5c80e646
-no-user-config
-nodefaults
-chardev socket,id=charmonitor,fd=32,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control
-rtc base=utc
-no-shutdown
-boot strict=on
-device pcie-root-port,port=0x8,chassis=1,id=pci.1,bus=pcie.0,multifunction=on,addr=0x1
-device pcie-root-port,port=0x9,chassis=2,id=pci.2,bus=pcie.0,addr=0x1.0x1
-device pcie-root-port,port=0xa,chassis=3,id=pci.3,bus=pcie.0,addr=0x1.0x2
-device pcie-root-port,port=0xb,chassis=4,id=pci.4,bus=pcie.0,addr=0x1.0x3
-device pcie-root-port,port=0xc,chassis=5,id=pci.5,bus=pcie.0,addr=0x1.0x4
-device pcie-root-port,port=0xd,chassis=6,id=pci.6,bus=pcie.0,addr=0x1.0x5
-device pcie-root-port,port=0xe,chassis=7,id=pci.7,bus=pcie.0,addr=0x1.0x6
-device pcie-root-port,port=0xf,chassis=8,id=pci.8,bus=pcie.0,addr=0x1.0x7
-device virtio-serial-pci,id=virtio-serial0,bus=pci.2,addr=0x0
-netdev tap,fd=29,id=hostnet0
-device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:87:3e:d3,bus=pci.1,addr=0x0
-chardev pty,id=charserial0
-serial chardev:charserial0
-chardev socket,id=charchannel0,fd=33,server,nowait
-device virtserialport,bus=virtio-serial0.0,nr=1,chardev=charchannel0,id=channel0,name=org.qemu.guest_agent.0
-spice port=5900,addr=127.0.0.1,disable-ticketing,image-compression=off,seamless-migration=on
-device virtio-gpu-pci,id=video0,max_outputs=1,bus=pci.7,addr=0x0
-device vfio-pci,host=0000:29:00.0,id=hostdev0,bus=pci.3,addr=0x0
-device vfio-pci,host=0000:28:00.0,id=hostdev1,bus=pci.4,addr=0x0
-device virtio-balloon-pci,id=balloon0,bus=pci.5,addr=0x0
-object rng-random,id=objrng0,filename=/dev/urandom
-device virtio-rng-pci,rng=objrng0,id=rng0,bus=pci.6,addr=0x0
-sandbox on,obsolete=deny,elevateprivileges=deny,spawn=deny,resourcecontrol=deny
-msg timestamp=on

Marcin Juszkiewicz (hrw) wrote :

lspci -vvv output for the cards used (host side):

28:00.0 USB controller: Renesas Technology Corp. uPD720201 USB 3.0 Host Controller (rev 03) (prog-if 30 [XHCI])
 DeviceName: RTL8111EPV
 Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
 Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
 Latency: 0, Cache Line Size: 64 bytes
 Interrupt: pin A routed to IRQ 42
 Region 0: Memory at f7700000 (64-bit, non-prefetchable) [size=8K]
 Capabilities: [50] Power Management version 3
  Flags: PMEClk- DSI- D1- D2- AuxCurrent=375mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
  Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
 Capabilities: [70] MSI: Enable- Count=1/8 Maskable- 64bit+
  Address: 0000000000000000 Data: 0000
 Capabilities: [90] MSI-X: Enable+ Count=8 Masked-
  Vector table: BAR=0 offset=00001000
  PBA: BAR=0 offset=00001080
 Capabilities: [a0] Express (v2) Endpoint, MSI 00
  DevCap: MaxPayload 128 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
   ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0.000W
  DevCtl: CorrErr- NonFatalErr- FatalErr- UnsupReq-
   RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop+
   MaxPayload 128 bytes, MaxReadReq 512 bytes
  DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
  LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <4us, L1 unlimited
   ClockPM+ Surprise- LLActRep- BwNot- ASPMOptComp-
  LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
   ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
  LnkSta: Speed 5GT/s (ok), Width x1 (ok)
   TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
  DevCap2: Completion Timeout: Not Supported, TimeoutDis+, NROPrPrP-, LTR+
    10BitTagComp-, 10BitTagReq-, OBFF Not Supported, ExtFmt-, EETLPPrefix-
    EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
    FRS-, TPHComp-, ExtTPHComp-
    AtomicOpsCap: 32bit- 64bit- 128bitCAS-
  DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR+, OBFF Disabled
    AtomicOpsCtl: ReqEn-
  LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
    Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
    Compliance De-emphasis: -6dB
  LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
    EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
 Capabilities: [100 v1] Advanced Error Reporting
  UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
  UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
  UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
  CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
  CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
  AERCap: First Error Pointer: 00, ECRCGenCap- ECRCGenEn- ECRCChkCap- ECRCChkEn-
   MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
  HeaderLog: 00000000 00000000 00000000 00000000
 Capabilities: [150 v1] Latency Tolerance Reporting
  Max snoop latency: 1048576ns
  Max no ...


Marcin Juszkiewicz (hrw) wrote :

Attaching as a file, since Launchpad wraps it in an ugly way.

summary: changed from "PCIe cards passthrough to TCG guest works on 2GB of guest memory but fails on 4GB" to "PCIe cards passthrough to TCG guest works only up to 3054MB of guest memory"
summary: changed back to "PCIe cards passthrough to TCG guest works on 2GB of guest memory but fails on 4GB (vfio_dma_map invalid arg)"
Marcin Juszkiewicz (hrw) wrote :

Hotplug to a VM with 3055 MB of memory ends the same way:

internal error: unable to execute QEMU command 'device_add': vfio 0000:28:00.0: failed to setup container for group 27: memory listener initialization failed: Region mach-virt.ram: vfio_dma_map(0x55c1aca5c5c0, 0x40000000, 0xbef00000, 0x7f3549000000) = -22 (Invalid argument)

Traceback (most recent call last):
  File "/usr/share/virt-manager/virtManager/addhardware.py", line 1327, in _add_device
    self.vm.attach_device(dev)
  File "/usr/share/virt-manager/virtManager/object/domain.py", line 920, in attach_device
    self._backend.attachDevice(devxml)
  File "/usr/lib64/python3.8/site-packages/libvirt.py", line 606, in attachDevice
    if ret == -1: raise libvirtError ('virDomainAttachDevice() failed', dom=self)
libvirt.libvirtError: internal error: unable to execute QEMU command 'device_add': vfio 0000:28:00.0: failed to setup container for group 27: memory listener initialization failed: Region mach-virt.ram: vfio_dma_map(0x55c1aca5c5c0, 0x40000000, 0xbef00000, 0x7f3549000000) = -22 (Invalid argument)

Marcin Juszkiewicz (hrw) wrote :

14:57 < aw> hrw: under /sys/kernel/iommu_groups/ there's a reserved_regions file for every group. cat the ones associated with the groups for these devices

14:59 < hrw> 14:58 (0s) hrw@puchatek:28$ cat reserved_regions
14:59 < hrw> 0x00000000fee00000 0x00000000feefffff msi
14:59 < hrw> 0x000000fd00000000 0x000000ffffffffff reserved

14:59 < hrw> 14:59 (2s) hrw@puchatek:27$ cat reserved_regions
14:59 < hrw> 0x00000000fee00000 0x00000000feefffff msi
14:59 < hrw> 0x000000fd00000000 0x000000ffffffffff reserved

15:00 < aw> of course, you're on an x86 host, arm has no concept of not mapping memory at 0xfee00000

Summary: an ARM64 TCG VM on an x86 host has guest physical addresses (GPAs) overlapping host reserved IOVA regions, which vfio cannot map. The VM would need holes in its GPA space to account for this. The 2nd and 3rd args of the vfio_dma_map error report are the starting IOVA address and the length, respectively.
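
The numbers in the logs line up with this exactly: mach-virt places guest RAM at GPA 0x40000000 (the second vfio_dma_map argument above), so the failing 4 GB map (length 0x100000000) spans the host's reserved MSI window at 0xfee00000, while 3054 MiB ends right at its start. A back-of-the-envelope check of that ceiling (a standalone sketch built from the constants in the logs, not QEMU code):

    #include <stdio.h>

    /* mach-virt puts guest RAM at GPA 0x40000000 (1 GiB); on this x86
     * host the IOMMU cannot map IOVAs inside the MSI window starting
     * at 0xfee00000 (see the reserved_regions output above). */
    #define GUEST_RAM_BASE   0x40000000ULL
    #define MSI_WINDOW_START 0xfee00000ULL

    int main(void)
    {
        /* Largest contiguous RAM block staying below the MSI window. */
        unsigned long long max = MSI_WINDOW_START - GUEST_RAM_BASE;
        printf("max guest RAM: %llu MiB\n", max >> 20); /* prints 3054 */

        /* The 3055 MB hotplug attempt maps 0xbef00000 bytes, ending at
         * 0xfef00000, 1 MiB inside the reserved window, hence -EINVAL. */
        printf("3055 MiB ends at 0x%llx\n",
               GUEST_RAM_BASE + 0xbef00000ULL);
        return 0;
    }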

Alex Bennée (ajbennee) on 2020-03-25
tags: added: arm passthrough tcg
Peter Maydell (pmaydell) wrote :

My (limited) understanding of PCI was that the board and the PCI controller emulation define what the windows are where PCI BARs can live, and that PCI cards go there. (This certainly works for emulated PCI cards.) It's not clear to me why the host system imposes restrictions on the memory layout of the VM...

Alex Williamson (alex-l-williamson) wrote :

This is not related to the BARs; the mapping of the BARs into the guest is purely virtual and controlled by the guest. The issue is that the device needs to be able to DMA into guest RAM, and to do that transparently (i.e. the guest doesn't know it's being virtualized), we need to map GPAs into the host IOMMU such that the guest interacts with the device in terms of GPAs and the host IOMMU translates those to HPAs. Thus the IOMMU needs to support the GPA range of the guest as IOVA.

However, there are ranges of IOVA space that the host IOMMU cannot map. For example, the MSI range here is handled by the interrupt remapper, not the DMA translation portion of the IOMMU (on physical ARM systems these are one and the same; on x86 they are different components, using different mapping interfaces of the IOMMU). Therefore, if the guest programmed the device to perform a DMA to 0xfee00000, the host IOMMU would see that as an MSI, not a DMA. When we run an x86 VM on an x86 host, both the host and the guest have complementary reserved regions, which avoids this issue.

Also, to expand on what I mentioned on IRC, every x86 host is going to have some reserved range below 4G for this purpose, but if the aarch64 VM has no requirements for memory below 4G, the starting GPA for the VM could be at or above 4G and avoid this issue.
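
Concretely, the mapping step described above is the ioctl that fails in the logs. A minimal sketch of that call against the stock VFIO type1 API (the container fd and the mmap'd guest RAM pointer are placeholders; error handling is elided):

    #include <stdint.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Sketch of how guest RAM gets wired into the host IOMMU.  QEMU's
     * vfio memory listener performs one such map per guest RAM region;
     * 'container' and 'hva' stand in for the real container fd and the
     * mmap'd guest RAM. */
    static int map_guest_ram(int container, void *hva,
                             uint64_t gpa, uint64_t size)
    {
        struct vfio_iommu_type1_dma_map map;

        memset(&map, 0, sizeof(map));
        map.argsz = sizeof(map);
        map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;
        map.vaddr = (uintptr_t)hva;  /* host virtual address */
        map.iova  = gpa;             /* guest physical address as IOVA */
        map.size  = size;

        /* Fails with EINVAL when [iova, iova + size) overlaps a range
         * the host IOMMU cannot map, e.g. 0xfee00000-0xfeefffff here. */
        return ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
    }

For the 4 GB guest above this amounts to map_guest_ram(container, ram, 0x40000000, 0x100000000), i.e. the call shown failing in the error message.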

Peter Maydell (pmaydell) wrote :

That's an awkward flaw in the IOMMU design :-(

Marcin Juszkiewicz (hrw) wrote :

I wanted to try the same on a machine without MSI. But my desktop refuses to boot into a sane environment with the pci=nomsi kernel option.

Do we need something like a table of excluded IOVA regions in ACPI or somewhere, similar to the way we have a table of excluded physical RAM areas?
Or is the range of excluded IOVAs constant on any one architecture, so that it normally doesn't need to be worried about?

Alex Williamson (alex-l-williamson) wrote :

Yes, to support this the vfio driver would need to be able to exclude ranges of guest GPA space. Recent kernels already expose an IOVA list via the vfio API. Clearly, excluding GPA ranges has implications for hotplug. On x86 the ranges are pretty consistent, but IIRC differ between Intel and AMD. The vfio IOVA list was originally developed for an ARM use case though, and I imagine there's little or no consistency there.

Re: pci=nomsi, the reserved range exists regardless of whether MSI is actually used.
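
For reference, the IOVA list mentioned above is exposed through a capability chained onto VFIO_IOMMU_GET_INFO. A rough sketch of walking it (error handling elided; 'container' is a placeholder for an open, attached container fd):

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <linux/vfio.h>

    /* Sketch: print the IOVA ranges the host IOMMU can actually map,
     * using the capability chain newer kernels attach to
     * VFIO_IOMMU_GET_INFO. */
    static void dump_usable_iovas(int container)
    {
        struct vfio_iommu_type1_info *info = calloc(1, sizeof(*info));

        info->argsz = sizeof(*info);
        ioctl(container, VFIO_IOMMU_GET_INFO, info); /* learn full size */
        info = realloc(info, info->argsz);
        ioctl(container, VFIO_IOMMU_GET_INFO, info); /* fetch the caps */

        if (info->flags & VFIO_IOMMU_INFO_CAPS) {
            struct vfio_info_cap_header *hdr =
                (void *)((char *)info + info->cap_offset);
            for (;;) {
                if (hdr->id == VFIO_IOMMU_TYPE1_INFO_CAP_IOVA_RANGE) {
                    struct vfio_iommu_type1_info_cap_iova_range *cap =
                        (void *)hdr;
                    for (unsigned i = 0; i < cap->nr_iovas; i++)
                        printf("usable IOVA: 0x%llx-0x%llx\n",
                               (unsigned long long)cap->iova_ranges[i].start,
                               (unsigned long long)cap->iova_ranges[i].end);
                }
                if (!hdr->next)
                    break;
                hdr = (void *)((char *)info + hdr->next);
            }
        }
        free(info);
    }

On a host like the one above, the printed ranges would be expected to leave a gap over 0xfee00000-0xfeefffff, mirroring the reserved_regions files.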

costinel (costinel) wrote :

I am experiencing the same behaviour with an x86_64 guest on an x86_64 host, which I'm attempting to EFI-boot (not hotplug) with a PCIe GPU passed through.

This discussion (https://www.spinics.net/lists/iommu/msg40613.html) suggests a change in drivers/iommu/intel-iommu.c, but it appears that in the kernel I tried the change is already implemented (linux-image-5.4.0-39-generic).

The hardware is an HP MicroServer Gen8 with the physical slot excluded via conrep in the BIOS (https://www.jimmdenton.com/proliant-intel-dpdk/), and the kernel is rebuilt with the RMRR patch (https://forum.proxmox.com/threads/compile-proxmox-ve-with-patched-intel-iommu-driver-to-remove-rmrr-check.36374/).

A user also reports that on the same hardware it used to work with kernel 5.3 + the RMRR patch (https://forum.level1techs.com/t/looking-for-vfio-wizards-to-troubleshoot-error-vfio-dma-map-22/153539) but stopped working on the 5.4 kernel.

Is this the same issue I'm observing? My QEMU complains with a similar message:

 -device vfio-pci,host=07:00.0,id=hostdev0,bus=pci.4,addr=0x0: vfio_dma_map(0x556eb57939f0, 0xc0000, 0x3ff40000, 0x7f6fc7ec0000) = -22 (Invalid argument)

/sys/kernel/iommu_groups/1/reserved_regions shows:

0x00000000000e8000 0x00000000000e8fff direct
0x00000000000f4000 0x00000000000f4fff direct
0x00000000d5f7e000 0x00000000d5f94fff direct
0x00000000fee00000 0x00000000feefffff msi

costinel (costinel) wrote :

Except that in my case the VM does not boot at all, no matter how little memory I allocate to it.

Alex Williamson (alex-l-williamson) wrote :

You'd need to create a 160 KB VM in order to stay below the required direct-map memory regions imposed by the system firmware (0xe8000 - 0xc0000 = 160 KB). Non-upstream bypasses of those restrictions are clearly not supported. You can find more details regarding the RMRR restriction here:
https://access.redhat.com/sites/default/files/attachments/rmrr-wp1.pdf

QEMU currently has no support for creating a VM memory map based on the restrictions of the host platform IOMMU requirements.
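
For illustration, such a host-aware check could be as simple as testing a planned RAM region against the reserved_regions files quoted above; a rough sketch (a hypothetical helper, not proposed QEMU code):

    #include <stdio.h>
    #include <stdbool.h>

    /* Hypothetical pre-flight check: would VM RAM at [gpa, gpa+size)
     * collide with a host reserved region?  Each line of
     * /sys/kernel/iommu_groups/<N>/reserved_regions reads
     * "start end type". */
    static bool ram_collides(const char *path,
                             unsigned long long gpa,
                             unsigned long long size)
    {
        unsigned long long start, end;
        char type[32];
        bool collides = false;
        FILE *f = fopen(path, "r");

        if (!f)
            return false;
        while (fscanf(f, "%llx %llx %31s", &start, &end, type) == 3) {
            if (gpa <= end && start < gpa + size) { /* ranges overlap */
                printf("RAM 0x%llx+0x%llx overlaps %s region "
                       "0x%llx-0x%llx\n", gpa, size, type, start, end);
                collides = true;
            }
        }
        fclose(f);
        return collides;
    }

Called with gpa=0xc0000 and size=0x3ff40000 (the failing map above), this would flag the 0xe8000 and 0xf4000 direct regions; for the aarch64 case, the 0xfee00000 MSI window.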

costinel (costinel) wrote :

Alex, thanks for the quick answer, but sadly I still do not fully understand the implications, even after reading the PDF paper on the RH website you mention, as well as the vendor advisory at https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-c04781229

When you say "qemu has no support", do you actually mean "qemu people are unable to help you if you break things by bypassing the restrictions in place", or "qemu is designed to not work when restrictions are bypassed"?

Do I understand correctly that the BIOS can modify portions of the system's usable RAM so that vendor-specific software tools can read those addresses, and if so, is there a risk of data corruption if the RMRR restrictions are bypassed?

I did eventually manage to pass through an NVIDIA card in the MicroServer Gen8 to a Windows VM using patched kernel 5.3, along with the vendor instructions to exclude the PCIe slot (aka the conrep solution), but for it to work it still needed the "rmrr patch", i.e. removing the "return -EPERM" line below the "Device is ineligible [...]" message in drivers/iommu/intel-iommu.c.

However, applying the same modification to kernel 5.4 leads to the "VFIO_MAP_DMA: -22" error.

Is there another place in the kernel 5.4 source that must be modified to bring back the 5.3 kernel behaviour? (I.e., I have a stable home Windows VM with GPU passthrough despite it all.)

Alex Williamson (alex-l-williamson) wrote :

> When you say "qemu has no support", do you actually mean "qemu people
> are unable to help you if you break things by bypassing the in-place
> restrictions", or "qemu is designed to not work when restrictions are
> bypassed"?

The former. There are two aspects to this. The first is that the device has address space restrictions which therefore impose address space restrictions on the VM. That makes things like hotplug difficult or impossible to support. That much is something that could be considered a feature which QEMU has not yet implemented.

The more significant aspect when RMRRs are involved in this restriction is that an RMRR is essentially the platform firmware dictating that the host OS must maintain an identity map between the device and a range of physical address space. We don't know the purpose of that mapping, but we can assume that it allows the device to provide ongoing data for platform firmware to consume. This data might include health or sensor information that's vital to the operation of the system. It's therefore not simply a matter of QEMU needing to avoid RMRR ranges; we need to maintain the required identity maps while also manipulating the VM address space. But the former requirement implies that a user owns a device that has DMA access to a range of host memory that's been previously defined as vital to the operation of the platform, and therefore likely exploitable by the user.

The configuration you've achieved appears to have disabled the host kernel restrictions preventing RMRR-encumbered devices from participating in the IOMMU API, but left in place the VM address space implications of those RMRRs. This means that once the device is opened by the user, that firmware-mandated identity mapping is removed, and whatever health or sensor data was being reported by the device to that range is no longer available to the host firmware, which might adversely affect the behavior of the system. Upstream put this restriction in place as the safe thing to do to honor the firmware mapping requirement, and you've circumvented it; therefore you are your own support.

> Do I understand correctly that the BIOS can modify portions of the
> system usable RAM, so the vendor specific software tools can read
> those addresses, and if yes, does this mean is there a risk for
> data corruption if the RMRR restrictions are bypassed?

RMRRs used for devices other than IGD or USB are often associated with reserved memory regions to prevent the host OS from making use of those ranges. It is possible that privileged utilities might interact with these ranges, but AIUI the main use case is for the device to interact with the range, which firmware then consumes. If you were to ignore the RMRR mapping altogether, there is a risk that the device will continue to write whatever health or sensor data it's programmed to report to that IOVA mapping, which could be a guest mapping and cause data corruption.

> Is there other place in the kernel 5.4 source that must be modified
> to bring back the v5.3 kernel behaviour? (ie. I have a stable home
> windows vm with the gpu passthrough despite all)

I think the scenario is that previously the...
