passthrough devices cause >17min boot delay

Bug #1838575 reported by dann frazier
This bug affects 2 people
Affects          Status      Importance   Assigned to   Milestone
linux (Ubuntu)   Confirmed   Low          Unassigned
qemu (Ubuntu)    Invalid     Low          Unassigned

Bug Description

Adding passthrough devices to a guest introduces a boot delay of > 17 minutes. During this time libvirt reports the guest to be "paused". The delay does not seem to scale with the # of hostdevs - that is, 1 hostdev causes about a 17min delay, while 16 only bumps that up to ~18min. Removing all hostdevs avoids the delay.

Revision history for this message
dann frazier (dannf) wrote :

Here's a perf report of a 'sudo perf record -p 6949 -a -F 25000 -e cycles', after libvirt spawned qemu w/ pid 6949.

Changed in qemu (Ubuntu):
importance: Undecided → Medium
assignee: nobody → Rafael David Tinoco (rafaeldtinoco)
status: New → Confirmed
Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

> > sudo perf record -a -F $((250*100)) -e cycles:u --
> > /usr/bin/qemu-system-x86_64 -name guest="guest" ....
>
> I instead attached perf to the qemu process after it was spawned by
> libvirt so I didn't have to worry about producing a working qemu
> cmdline. I let it run for several seconds while still in the paused
> state.

Sure!

> > And check where we're spending so much time for the boot to happen.
> > Note I'm using "cycles:u", getting only userland stack trace probes
> > every (HZ*100).
>
> This showed no samples - apparently all samples were kernel side.

Either kernel or VMX/SVM mode.

> > That might give us enough information.
> >
> > Would you mind opening a launchpad bug (can be private if you prefer)
> > so Christian and I can take a look ?
>
> Sure:
> https://bugs.launchpad.net/ubuntu/+source/qemu/+bug/1838575

Err, sorry to ask you this, but I'm getting magic/endian check errors
from perf report, maybe something related to the arch and perf. Would
you mind reporting to stdio and pasting that into the case? Are you using
a fully updated Eoan (userland/kernel) for everything?
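(Something like "sudo perf report --stdio -i perf.data > report.txt" on your side should produce a text report that can be pasted; exact file names as applicable.)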

(c)inaddy@temp:~/temp$ sudo /usr/lib/linux-tools/5.2.0-8-generic/perf
report -f -v
magic/endian check failed

Tks!

Revision history for this message
Rafael David Tinoco (rafaeldtinoco) wrote :

The attached file is already the stdio output; sorry for the previous message.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I have seen slow boots before, but they scaled with the number of devices and the amount of memory, reaching ~5 min (for >1 TB memory init) and ~2.5 min for 16-device passthrough.

Nvidia has seen those times (the ones I mentioned) and more or less accepted them. I discussed with Anish how using HugePages helps to reduce a lot of these times.

But 17min for just one device certainly feels wrong.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Since I knew memory is often the more painful part, I started with 512 MB, 1 CPU, 1 PCI passthrough device.
Note: I installed debug symbols for glibc and qemu

On init I initially find the guest's CPU thread rather busy (well, it is booting up):
  80.66% CPU 0/KVM [kernel]

Passthrough is successful - lspci from guest:
 00:07.0 3D controller: NVIDIA Corporation Device 1db8 (rev a1)

Usually there are a few phases of execution; here are my timings:
1. from virsh start until there is a console connected
   That is the libvirt prep and the early startup of the qemu process
   until it provides something on the socket representing the console
   => 4 seconds
2. from #1 until we see anything in the console moving
   That usually is mostly consumed by KVM (host) to prep mappings
   and potentially scrub the memory and prepare devices.
   => 12 seconds
3. from #2 until we can log in
   That is the guest kernel, followed by guest userspace initializing
   => 17 seconds

So overall I needed 33 seconds for a guest with one device passed through, which is not too bad.
Your example was the full-size (1.4 TB) guest, and I'd expect that this is what happens in your case.
Trying a few more tests ...
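
For reference, the timestamps above can be taken from the host roughly like this (only a sketch: it assumes a guest named "test-slowstart" that is reachable at 192.168.122.100 once up, and uses the first byte on the console pty as the marker for phase 2):

  t0=$(date +%s)
  virsh start test-slowstart
  # phase 1 ends once a console pty is available
  until pty=$(virsh ttyconsole test-slowstart 2>/dev/null); do sleep 1; done
  t1=$(date +%s); echo "phase 1: $((t1-t0))s"
  # phase 2 ends with the first byte of console output
  sudo head -c1 "$pty" >/dev/null
  t2=$(date +%s); echo "phase 2: $((t2-t1))s"
  # phase 3: sshd being reachable is a reasonable stand-in for "can log in"
  until nc -z 192.168.122.100 22 2>/dev/null; do sleep 1; done
  t3=$(date +%s); echo "phase 3: $((t3-t2))s"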

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I then bumped up to the config you had (but with only one PT device).

- Host phys bits machine type for larger mappings
- more CPUs: 1 -> 32

Adding/removing a PT device in the configs above doesn't change a lot.

As assumed, none of these increased the time tremendously.
Then I went on to bump up the memory size.

T1: use 1.2 TB but no PT device
  #1: 6 sec
  #2: 21 sec
  #3: 16 sec

This only slightly increases #2, due to the additional memory that needs to be set up.

T2: use 1.2 TB with one PT device
  #1: 253 sec
  #2: 20 sec
  #3: 18 sec

The time-consuming part is a single process with kernel-side load.
The associated userspace is qemu, but the load is in the kernel, close to 100%.

Samples: 62K of event 'cycles:ppp', Event count (approx.): 34521154809
Overhead Shared Object Symbol
  73.91% [kernel] [k] clear_page_erms
   9.53% [kernel] [k] clear_huge_page
   1.65% [kernel] [k] follow_trans_huge_pmd

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

IMHO this is the same thing we discussed around March this year
=> https://bugs.launchpad.net/nvidia-dgx-2/+bug/1818891/comments/5.

In an associated mail thread we even discussed the pro/con of changes like
  <allocation mode="immediate|ondemand"/>

We can see above that this is about the (transparent huge page) setup for all of the memory.
We could disable huge pages entirely, but that would make things even less efficient.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

T3: use 1.2 TB with one PT device - THP=off
  #1: 476 seconds
  #2: 31 seconds
  #3: 20 seconds

ubuntu@akis:~$ echo never | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
never
ubuntu@akis:~$ cat /sys/kernel/mm/transparent_hugepage/enabled
always madvise [never]

Samples: 88K of event 'cycles:ppp', Event count (approx.): 44410625251
Overhead Shared Object Symbol
  22.12% [kernel] [k] _raw_spin_lock
  14.56% [kernel] [k] try_charge
  13.00% [kernel] [k] clear_page_erms
  11.48% [kernel] [k] get_page_from_freelist
   8.10% [kernel] [k] handle_pte_fault
   2.04% [kernel] [k] __pagevec_lru_add_fn
   1.80% [kernel] [k] follow_page_pte

We still see page scrubbing and mapping initialization, but no more transparent huge pages.
As expected this changed the perf profile, but it didn't provide much gain
(actually a loss, both on init time and later on).

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Now, to get a good result, let's use 1G huge pages.
Kernel cmdline:
  default_hugepagesz=1G hugepagesz=1G hugepages=1210

Gives:
HugePages_Total: 1210
HugePages_Free: 1210
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 1048576 kB
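
(On Ubuntu this is typically applied by adding those parameters to GRUB_CMDLINE_LINUX_DEFAULT in /etc/default/grub and regenerating the boot config, roughly:

  sudo sed -i 's/^GRUB_CMDLINE_LINUX_DEFAULT="/&default_hugepagesz=1G hugepagesz=1G hugepages=1210 /' /etc/default/grub
  sudo update-grub && sudo reboot

then re-checking the HugePages_* counters as above.)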

Guest config extra:
 <memoryBacking>
   <hugepages/>
 </memoryBacking>

Slightly changed profile:

Samples: 28K of event 'cycles:ppp', Event count (approx.): 19134377327
Overhead Shared Object Symbol
  27.02% [kernel] [k] follow_hugetlb_page
  11.23% [kernel] [k] __get_user_pages
   9.10% [kernel] [k] _raw_spin_lock
   7.58% [kernel] [k] down_read
   7.53% [kernel] [k] up_read
   6.14% [kernel] [k] huge_pte_offset

T4: use 1.2 TB with one PT device, 1G sized explicit HugePages
  #1: 59 seconds
  #2: 21 seconds
  #3: 18 seconds

This is about what you'd expect from such a huge guest, and it works as intended.
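
A quick way to double-check that the guest memory really ended up on the 1G pages (a sketch; it assumes a single qemu process on the host):

  pid=$(pgrep -f qemu-system-x86_64)
  # the large guest RAM mappings should report a 1048576 kB page size
  grep -E 'KernelPageSize|MMUPageSize' /proc/$pid/smaps | sort | uniq -c
  # and HugePages_Free should have dropped by roughly the guest size
  grep Huge /proc/meminfo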

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I mentioned in the last discussion around this that the one thing that could be done is to turn this single-threaded memory init into a multi-threaded action (in the kernel). I doubt that we can make it omit the initialization. And even though it is faster, even the 1G huge page setup could be more efficient.

To consider that, we need to know what exactly it does.
The last calls from userspace before this long one-thread-busy period starts are:
  openat(AT_FDCWD, "/dev/vfio/vfio", O_RDWR|O_CLOEXEC) = 57
  ioctl(57, VFIO_GET_API_VERSION, 0x80002) = 0
  ioctl(57, VFIO_CHECK_EXTENSION, 0x1) = 1
  ioctl(57, VFIO_CHECK_EXTENSION, 0x3) = 1
  ioctl(56, VFIO_GROUP_SET_CONTAINER, 0x7fff4499dc24) = 0
  ioctl(57, VFIO_SET_IOMMU, 0x3) = 0
  ioctl(57, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO or VFIO_IOMMU_GET_INFO or VFIO_IOMMU_SPAPR_TCE_GET_INFO, 0x7fff4499dc30) = 0
  ioctl(12, KVM_CREATE_DEVICE, 0x7fff4499dbd4) = 0
  ioctl(58, KVM_SET_DEVICE_ATTR, 0x7fff4499dbe0) = 0
  ioctl(57, VFIO_DEVICE_PCI_HOT_RESET or VFIO_IOMMU_MAP_DMA, 0x7fff4499da90) = 0
  ioctl(57, VFIO_DEVICE_PCI_HOT_RESET or VFIO_IOMMU_MAP_DMA, 0x7fff4499da90) = 0
  ioctl(57, VFIO_DEVICE_PCI_HOT_RESET or VFIO_IOMMU_MAP_DMA, 0x7fff4499da90) = 0
  ioctl(57, VFIO_DEVICE_PCI_HOT_RESET or VFIO_IOMMU_MAP_DMA, 0x7fff4499da90) = 0
  ioctl(57, VFIO_DEVICE_PCI_HOT_RESET or VFIO_IOMMU_MAP_DMA, 0x7fff4499da90) = 0

Note: from our last thoughts on this it is quite possible that the granularity at which this could be split depends on the locks being used. E.g. per NUMA node, as arbitrary chunks might just contend on the same locks and ping-pong them.

It might be similar to [1], which ended up in [2], after all. Not sure yet whether in the kernel it does the reset or the map.

[1]: https://www.spinics.net/lists/kvm/msg177206.html
[2]: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=e309df5b0c9e67cc929eedd3e32f4907fa49543e

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

A trace into the early phase (first some other init with ~16 CPUs busy), then the long phase with one thread blocking.
So we will see it "enter" the slow phase as well as iterate in it.

There seem to be two phases, one around alloc_pages_current and then one around slot_rmap_walk_next/rmap_get_first.
I'd appreciate it if the kernel team could take a look at the details here and at what the call chains actually are,
and at how this is related to the ioctl that seems to trigger it.

After all the testcase is rather simple:
uvt-simplestreams-libvirt --verbose sync --source http://cloud-images.ubuntu.com/daily arch=amd64 label=daily release=bionic
uvt-kvm create --password ubuntu test-slowstart arch=amd64 release=bionic label=daily
Then add to guest config:
  <memory unit='KiB'>1258291200</memory>
  <currentMemory unit='KiB'>1258291200</currentMemory>
  <type arch='x86_64' machine='pc-i440fx-bionic-hpb'>hvm</type>
    <hostdev mode='subsystem' type='pci' managed='yes'>
      <source>
        <address domain='0x0000' bus='0x34' slot='0x00' function='0x0'/>
      </source>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x07' function='0x0'/>
    </hostdev>

A guest like that is already on the system for testing.

ftrace log (compressed 179M, full 4.8G) can be found on the system.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Summary:
As I mentioned before (on the other bug I referred to),
the problem is that with a PT device qemu needs to reset and map the VFIO devices.
So with >0 PT devices attached it needs an init that scales with the memory size of the guest (see my fast results with PT but small guest memory).

As I outlined before, the "fix" for that is either:
- (admin) use huge pages (1G if possible) for the very huge guests, which will make this much more efficient.
- (kernel development) get the kernel to use more cores for this initialization (currently 1).

Last time we discussed this, Nvidia said they consider 1G huge pages the recommended setup for huge guests (due to the benefits), but not for guests in general (due to the complexity of managing the memory).
I don't know where that ended up, but I expect it to be in some guidelines.
That is why I wondered about the lack of huge pages in the spec to test qemu passthrough performance, if you remember.

I don't think there is much more qemu can do at this point (at least nothing I'd know of), but we can use this re-discovery of the issue as a chance for you (or one of your teammates) to dive into the kernel side and check whether the initialization could be made multi-CPU for better scaling.
I'm adding a kernel task for your consideration.

Changed in linux (Ubuntu):
status: New → Confirmed
Changed in qemu (Ubuntu):
assignee: Rafael David Tinoco (rafaeldtinoco) → nobody
status: Confirmed → Incomplete
Revision history for this message
Colin Ian King (colin-king) wrote :

Naive question: can we tweak the hugepage file settings at run time via
/proc/sys/vm/nr_hugepages and not require the kernel parameters?

Changed in linux (Ubuntu):
importance: Undecided → Medium
assignee: nobody → Colin Ian King (colin-king)
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

You can do so even per-size via e.g. /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
As discussed, the later the allocation the higher the chance it will fail, so re-check the sysfs file after each change to see whether it actually got that much memory.

The default huge page size is only a boot-time parameter.
But you can specify explicit sizes in the libvirt XML.
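
For completeness, a runtime allocation looks roughly like this (always re-checking what was actually granted):

  echo 1210 | sudo tee /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
  # may end up lower than requested if memory is already fragmented
  cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
  grep Huge /proc/meminfo

In the libvirt XML the page size can be given explicitly with a <page size='1048576' unit='KiB'/> element inside <hugepages>.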

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

As assumed, this really seems to be cross-arch and to apply to all sizes.
Here 16 CPUs, 128G on ppc64el:
  #1: 54 seconds
  #2: 7 seconds
  #3: 23 seconds

Upped to 192GB this has:
  #1: 75 seconds
  #2: 5 seconds
  #3: 23 seconds

As a note, in this case I checked that there are ~7 seconds before it goes into this "single thread, kernel busy" load. Those ~7 seconds are probably the other/usual init in qemu/kvm, while the long load is the memory setup discussed above.

It is just the linear scaling that breaks it: if the same per-GB speed applies, ten times as much (~1.28 TB) would be in the ten-minute range as well. The remaining difference might be DIMM speed and bandwidth.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

On this platform strace still confirms the same paths,
and perf does as well (slight arch differences, but still memory setup):
  46.85% [kernel] [k] lruvec_lru_size
  16.89% [kernel] [k] clear_user_page
   5.74% [kernel] [k] inactive_list_is_low.isra.0
   4.48% [kernel] [k] find_next_bit
   3.74% [kernel] [k] cpumask_next
   2.72% [kernel] [k] copypage_power7
   2.40% [kernel] [k] kvm_age_rmapp
   1.94% [kernel] [k] snapshot_refaults
   1.28% [kernel] [k] shrink_page_list

Note: This is 5.2.0-10-generic

Also cross-arch: as soon as no hostdev is passed, the init can be skipped and the 75 seconds above drop down to the ~7 seconds I have seen before the single-thread init starts.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Hmm, with strace showing almost a hang on a single one of those ioctl calls, you'd think this would be easy to spot :-/

But this isn't as clear as expected:
  sudo trace-cmd record -p function_graph -l vfio_pci_ioctl -O graph-time

Disabled all but one CPU to have less concurrency in the trace.
=> Not much better

But it is the handler:
  1276    .ioctl = vfio_pci_ioctl,

There seem to be a lot of calls to vfio_pci_ioctl and not just a few slow ones.
Whatever strace thought, that is not how it looks here.

Starting again with wider traces to check what the entry might be.
sys_splice is high on the radar atm.

As a bonus, the test env now broke and can't do passthrough anymore;
the Emulex firmware breaks and fails.
I'll abort for now, focus on other things, and re-set up the testbed another day.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I built qemu head from git

$ export CFLAGS="-O0 -g"
$ ./configure --disable-user --disable-linux-user --disable-docs --disable-guest-agent --disable-sdl --disable-gtk --disable-vnc --disable-xen --disable-brlapi --enable-fdt --disable-bluez --disable-vde --disable-rbd --disable-libiscsi --disable-libnfs --disable-libusb --disable-usb-redir --disable-seccomp --disable-glusterfs --disable-tpm --disable-numa --disable-slirp --disable-blobs --target-list=ppc64-softmmu
$ make -j

$ virsh nodedev-detach pci_0005_01_00_0 --driver vfio
$ virsh nodedev-detach pci_0005_01_00_1 --driver vfio
$ virsh nodedev-detach pci_0005_01_00_2 --driver vfio
$ virsh nodedev-detach pci_0005_01_00_3 --driver vfio
$ virsh nodedev-detach pci_0005_01_00_4 --driver vfio
$ virsh nodedev-detach pci_0005_01_00_5 --driver vfio

$ sudo ppc64-softmmu/qemu-system-ppc64 -machine pseries-4.1,accel=kvm,usb=off,dump-guest-core=off,cap-cfpc=broken,cap-sbbc=broken,cap-ibs=broken -name guest=test-vfio-slowness -m 131072 -smp 1 -no-user-config -device spapr-pci-host-bridge,index=1,id=pci.1 -drive file=/var/lib/uvtool/libvirt/images/test-huge-mem-init.qcow,format=qcow2,if=none,id=drive-virtio-disk0 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x3,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -device vfio-pci,host=0005:01:00.0,id=hostdev0,bus=pci.1.0,addr=0x1 -device vfio-pci,host=0005:01:00.1,id=hostdev1,bus=pci.1.0,addr=0x2 -device vfio-pci,host=0005:01:00.2,id=hostdev2,bus=pci.1.0,addr=0x3 -device vfio-pci,host=0005:01:00.3,id=hostdev3,bus=pci.1.0,addr=0x4 -device vfio-pci,host=0005:01:00.4,id=hostdev4,bus=pci.1.0,addr=0x5 -device vfio-pci,host=0005:01:00.5,id=hostdev5,bus=pci.1.0,addr=0x6 -msg timestamp=on -display curses

I found VFIO_IOMMU_SPAPR_REGISTER_MEMORY:
96783 0.000088 readlink("/sys/bus/pci/devices/0005:01:00.0/iommu_group", "../../../../kernel/iommu_groups/"..., 4096) = 33 <0.000022>
96783 0.000066 openat(AT_FDCWD, "/dev/vfio/8", O_RDWR|O_CLOEXEC) = 16 <0.000025>
96783 0.000059 ioctl(16, VFIO_GROUP_GET_STATUS, 0x7fffe3fd6e20) = 0 <0.000018>
96783 0.000050 openat(AT_FDCWD, "/dev/vfio/vfio", O_RDWR|O_CLOEXEC) = 17 <0.000014>
96783 0.000049 ioctl(17, VFIO_GET_API_VERSION, 0) = 0 <0.000008>
96783 0.000039 ioctl(17, VFIO_CHECK_EXTENSION, 0x3) = 0 <0.000011>
96783 0.000040 ioctl(17, VFIO_CHECK_EXTENSION, 0x1) = 0 <0.000008>
96783 0.000037 ioctl(17, VFIO_CHECK_EXTENSION, 0x7) = 1 <0.000008>
96783 0.000037 ioctl(16, VFIO_GROUP_SET_CONTAINER, 0x65690e1bb48) = 0 <0.000008>
96783 0.000037 ioctl(17, VFIO_SET_IOMMU, 0x7) = 0 <0.000039>
96783 0.000070 ioctl(17, VFIO_IOMMU_SPAPR_REGISTER_MEMORY <unfinished ...>
96785 10.019032 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) <10.022751>
96785 0.053520 madvise(0x7fabd1f60000, 8257536, MADV_DONTNEED) = 0 <0.000020>
96785 0.007283 exit(0) = ?
96785 0.000072 +++ exited with 0 +++
96783 276.894553 <... ioctl resumed> , 0x7fffe3fd6b70) = 0 <286.974436>
96783 0.000107 --- SIGWINCH {si_signo=SIGWINCH, si_code=SI_KERNEL} ---

This is repeatable and explains why I haven't seen exactly the same thing that I saw on x86.
VFIO_IOMMU_SPAPR_REGI...


Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

On x86 this looks pretty similar, and it is at the place we have seen before:

45397 0.000073 readlink("/sys/bus/pci/devices/0000:21:00.1/iommu_group", "../../../../kernel/iommu_groups/"..., 4096) = 34 <0.000020>
45397 0.000053 openat(AT_FDCWD, "/dev/vfio/45", O_RDWR|O_CLOEXEC) = 31 <0.000033>
45397 0.000060 ioctl(31, VFIO_GROUP_GET_STATUS, 0x7ffd774f80f8) = 0 <0.000015>
45397 0.000041 openat(AT_FDCWD, "/dev/vfio/vfio", O_RDWR|O_CLOEXEC) = 32 <0.000019>
45397 0.000046 ioctl(32, VFIO_GET_API_VERSION, 0x80002) = 0 <0.000011>
45397 0.000038 ioctl(32, VFIO_CHECK_EXTENSION, 0x3) = 1 <0.000012>
45397 0.000035 ioctl(31, VFIO_GROUP_SET_CONTAINER, 0x55758e6acec8) = 0 <0.000011>
45397 0.000033 ioctl(32, VFIO_SET_IOMMU, 0x3) = 0 <0.000174>
45397 0.000199 ioctl(32, VFIO_DEVICE_GET_PCI_HOT_RESET_INFO or VFIO_IOMMU_GET_INFO or VFIO_IOMMU_SPAPR_TCE_GET_INFO, 0x7ffd774f8100) = 0 <0.000012>
45397 0.000038 ioctl(14, KVM_CREATE_DEVICE, 0x7ffd774f8094) = 0 <0.000015>
45397 0.000039 ioctl(33, KVM_SET_DEVICE_ATTR, 0x7ffd774f80a0) = 0 <0.000028>
45397 0.000054 ioctl(32, VFIO_DEVICE_PCI_HOT_RESET or VFIO_IOMMU_MAP_DMA, 0x7ffd774f7f20) = 0 <0.000287>
45397 0.000312 ioctl(32, VFIO_DEVICE_PCI_HOT_RESET or VFIO_IOMMU_MAP_DMA, 0x7ffd774f7f20) = 0 <0.000035>
45397 0.000058 ioctl(32, VFIO_DEVICE_PCI_HOT_RESET or VFIO_IOMMU_MAP_DMA, 0x7ffd774f7f20) = 0 <0.000037>
45397 0.000060 ioctl(32, VFIO_DEVICE_PCI_HOT_RESET or VFIO_IOMMU_MAP_DMA, 0x7ffd774f7f20) = 0 <0.334458>
45397 0.334593 ioctl(32, VFIO_DEVICE_PCI_HOT_RESET or VFIO_IOMMU_MAP_DMA, 0x7ffd774f7f20) = 0 <0.000055>
45397 0.000088 ioctl(32, VFIO_DEVICE_PCI_HOT_RESET or VFIO_IOMMU_MAP_DMA <unfinished ...>
45402 9.654538 <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) <10.000112>
45402 0.000144 madvise(0x7fd51236f000, 8368128, MADV_DONTNEED) = 0 <0.000036>
45402 0.000076 exit(0) = ?
45402 0.000074 +++ exited with 0 +++
45397 45.903415 <... ioctl resumed> , 0x7ffd774f7f20) = 0 <55.558237>

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

The above was through libvirt; now doing it directly with qemu to make it easier to throw into debugging:

$ virsh nodedev-detach pci_0000_21_00_1 --driver vfio

$ qemu/x86_64-softmmu/qemu-system-x86_64 -name guest=test-vfio-slowness -m 131072 -smp 1 -no-user-config -drive file=/var/lib/uvtool/libvirt/images/eoan.qcow,format=qcow2,if=none,id=drive-virtio-disk0 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -device vfio-pci,host=21:00.1,id=hostdev0,bus=pci.0,addr=0x7 -display curses

The speed differs a lot between retries:

First run (after boot):
5204 3.793765 <... ioctl resumed> , 0x7ffffa6687b0) = 0 <13.464192>
Second run (after aborting the former qemu)
5251 179.361434 <... ioctl resumed> , 0x7ffed49fbc70) = 0 <189.030433>
Third run (after aborting the former qemu)
5728 703.886935 <... ioctl resumed> , 0x7fff9f6a7120) = 0 <713.556876>

Maybe a former unclean stop makes it take extra time?
Or there is still cleanup going on in the IOMMU which stalls the follow-on mappings from being established.

Fourth run (after proper guest shutdown + 10 min wait)
6171 11.167221 <... ioctl resumed> , 0x7ffe597f03e0) = 0 <20.838401>
Fifth run (again kill and restart)
6215 1182.244403 <... ioctl resumed> , 0x7fff20a49b10) = 0 <1191.917291>

10 sec iommu counters while in the hang:
       34330395280 cycles
                16 cmd_processed (33.33%)
                 6 cmd_processed_inv (33.37%)
                 0 ign_rd_wr_mmio_1ff8h (33.41%)
                45 int_dte_hit (33.44%)
                 0 int_dte_mis (33.44%)
               345 mem_dte_hit (33.44%)
               907 mem_dte_mis (33.44%)
               111 mem_iommu_tlb_pde_hit (33.44%)
               102 mem_iommu_tlb_pde_mis (33.40%)
              7927 mem_iommu_tlb_pte_hit (33.37%)
              1761 mem_iommu_tlb_pte_mis (33.33%)
                 0 mem_pass_excl (33.30%)
                 0 mem_pass_pretrans (33.29%)
                 0 mem_pass_untrans (33.29%)
                 0 mem_target_abort (33.29%)
              8492 mem_trans_total (33.30%)
                 0 page_tbl_read_gst (33.30%)
               168 page_tbl_read_nst (33.30%)
               129 page_tbl_read_tot ...


Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Just when I thought I understood the pattern.

Sixth run (again kill and restart)
6384 9.826097 <... ioctl resumed> , 0x7ffcc8ed6e20) = 0 <19.495688>

So for now let's summarize that it varies :-/
But it always seems slow.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Many ioctls (as expected) but they are all fast and match what we knew from strace.
Thread 1 "qemu-system-x86" hit Catchpoint 1 (call to syscall ioctl), 0x00007ffff72fae0b in ioctl () at ../sysdeps/unix/syscall-template.S:78
78 in ../sysdeps/unix/syscall-template.S
(gdb) bt
#0 0x00007ffff72fae0b in ioctl () at ../sysdeps/unix/syscall-template.S:78
#1 0x0000555555914217 in vfio_init_container (container=0x555557608430, group_fd=11, errp=0x7fffffffdda8) at /home/ubuntu/qemu/hw/vfio/common.c:1172
#2 0x000055555591445e in vfio_connect_container (group=0x5555576083b0, as=0x5555565ca5a0 <address_space_memory>, errp=0x7fffffffdda8) at /home/ubuntu/qemu/hw/vfio/common.c:1257
#3 0x0000555555914f35 in vfio_get_group (groupid=45, as=0x5555565ca5a0 <address_space_memory>, errp=0x7fffffffdda8) at /home/ubuntu/qemu/hw/vfio/common.c:1482
#4 0x0000555555920575 in vfio_realize (pdev=0x555557600570, errp=0x7fffffffdda8) at /home/ubuntu/qemu/hw/vfio/pci.c:2759
#5 0x0000555555b853d7 in pci_qdev_realize (qdev=0x555557600570, errp=0x7fffffffde20) at hw/pci/pci.c:2096
#6 0x0000555555aac178 in device_set_realized (obj=0x555557600570, value=true, errp=0x7fffffffdff0) at hw/core/qdev.c:835
#7 0x0000555555cb857a in property_set_bool (obj=0x555557600570, v=0x555557602f00, name=0x555555f3b5d2 "realized", opaque=0x5555574e79e0, errp=0x7fffffffdff0) at qom/object.c:2080
#8 0x0000555555cb6731 in object_property_set (obj=0x555557600570, v=0x555557602f00, name=0x555555f3b5d2 "realized", errp=0x7fffffffdff0) at qom/object.c:1272
#9 0x0000555555cb9965 in object_property_set_qobject (obj=0x555557600570, value=0x555557602dc0, name=0x555555f3b5d2 "realized", errp=0x7fffffffdff0) at qom/qom-qobject.c:26
#10 0x0000555555cb6a2a in object_property_set_bool (obj=0x555557600570, value=true, name=0x555555f3b5d2 "realized", errp=0x7fffffffdff0) at qom/object.c:1338
#11 0x0000555555a12cb1 in qdev_device_add (opts=0x5555566a0770, errp=0x55555661e790 <error_fatal>) at qdev-monitor.c:635
#12 0x0000555555a1b11f in device_init_func (opaque=0x0, opts=0x5555566a0770, errp=0x55555661e790 <error_fatal>) at vl.c:2206
#13 0x0000555555e1c7fd in qemu_opts_foreach (list=0x555556468ee0 <qemu_device_opts>, func=0x555555a1b0f4 <device_init_func>, opaque=0x0, errp=0x55555661e790 <error_fatal>)
    at util/qemu-option.c:1170
#14 0x0000555555a2009d in main (argc=14, argv=0x7fffffffe3d8, envp=0x7fffffffe450) at vl.c:4387

The slow one is again the full iommu mapping of the guest memory:
(gdb) bt
#0 0x00007ffff72fae0b in ioctl () at ../sysdeps/unix/syscall-template.S:78
#1 0x0000555555911a9e in vfio_dma_map (container=0x555557608430, iova=4294967296, size=134217728000, vaddr=0x7fe01fe00000, readonly=false) at /home/ubuntu/qemu/hw/vfio/common.c:349
#2 0x000055555591299b in vfio_listener_region_add (listener=0x555557608440, section=0x7fffffffcad0) at /home/ubuntu/qemu/hw/vfio/common.c:664
#3 0x000055555587dfb0 in listener_add_address_space (listener=0x555557608440, as=0x5555565ca5a0 <address_space_memory>) at /home/ubuntu/qemu/memory.c:2714
#4 0x000055555587e3dd in memory_listener_register (listener=0x555557608440, as=0x5555565ca5a0 <address_space_memory>) at /home/ubuntu/qemu/memory.c:2781
#...


Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Reference:
the call from qemu that I think we see above (on x86) is at [1].
If the assumption is correct this time, the kernel-side counterpart would be vfio_iommu_type1_ioctl.

For debugging:
$ gdb qemu/x86_64-softmmu/qemu-system-x86_64
(gdb) catch syscall 16
(gdb) run -m 131072 -smp 1 -no-user-config -device vfio-pci,host=21:00.1,id=hostdev0,bus=pci.0,addr=0x7
(gdb) continue 22
# Now we are "in front" of the bad call.

https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=3e03c495d868f154b26a33c0c2d870273a0b4198;hb=HEAD#l480

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I could next build a test kernel with some debug around the vfio iommu dma map to check how time below that call is spent.
I'm sure that data already is hidden in some of my trace data, but to eventually change/experiment I need to build one anyway.

I expect to summarize and go into a discussion at some point anyway, as even if one were to split the map calls, IOMMUs are full of quirks and constraints that might bite us later.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

Each qemu version is slightly different on the road to this point, but then seems to behave the same.

This one is slightly better at getting "in front" of the slow call that maps all the memory (the 134217728000 bytes are the large >4 GiB RAM region seen in the vfio_dma_map backtrace above, so all smaller mappings just continue).
$ virsh nodedev-detach pci_0000_21_00_1 --driver vfio
$ gdb /usr/bin/qemu-system-x86_64
(gdb) b vfio_dma_map
(gdb) command 1
Type commands for breakpoint(s) 1, one per line.
End with a line saying just "end".
>silent
>if size != 134217728000
 >cont
 >end
>end
(gdb) run -m 131072 -smp 1 -no-user-config -device vfio-pci,host=21:00.1,id=hostdev0,bus=pci.0,addr=0x7 -enable-kvm
(gdb) catch syscall 16
(gdb) c

Also, to correct the qemu link: it is the one from vfio_listener_region_add:
https://git.qemu.org/?p=qemu.git;a=blob;f=hw/vfio/common.c;h=3e03c495d868f154b26a33c0c2d870273a0b4198;hb=HEAD#l664

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

The iommu lock is taken early in there, and the iommu element is what was passed in from userspace;
it represents the VFIO container for this device (container->fd):
qemu:
  if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0
kernel:
static long vfio_iommu_type1_ioctl(void *iommu_data,
        unsigned int cmd, unsigned long arg)
    struct vfio_iommu *iommu = iommu_data;
    [... into vfio_dma_do_map ...]
    mutex_lock(&iommu->lock);
There isn't much divide and conquer splitting that seems easily possible for now :-/

Down there, while this lock is held, the whole memory size must be pinned
-> vfio_pin_pages_remote
which grabs the biggest contiguous chunk it can and then maps it
-> vfio_iommu_map
This is repeated until all of the requested size is handled.

Establishing IOMMU maps is known to be expensive; an assumption would be that in the semi-fast cases either:
- memory is still non-fragmented, so we only need a few calls, or
- the IOMMU is still sort of async-busy from the former work (same calls, but each takes longer)
That should be visible in the number of vfio_pin_pages_remote calls, if we don't miss any.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

I modified the kernel to make a few functions non-inlined so that they are better traceable:
vfio_dma_do_map
vfio_dma_do_unmap
mutex_lock
mutex_unlock
kzalloc
vfio_link_dma
vfio_pin_map_dma
vfio_pin_pages_remote
vfio_iommu_map

Then I ran tracing on this load, limited to the functions in my focus:
$ sudo trace-cmd record -p function_graph -l vfio_iommu_type1_ioctl -l vfio_pci_ioctl -l vfio_set_irqs_validate_and_prepare -l vfio_pci_set_irqs_ioctl -l pci_probe_reset_slot -l pci_probe_reset_bus -l vfio_pci_count_devs -l vfio_pci_fill_devs -l vfio_pci_validate_devs -l vfio_group_get_external_user -l vfio_group_put_external_user -l vfio_external_user_iommu_id -l vfio_dma_do_map -l vfio_dma_do_unmap -l mutex_lock -l mutex_unlock -l vfio_link_dma -l vfio_pin_map_dma -l vfio_pin_pages_remote -l vfio_iommu_map

This I traced twice to compare:
- after a clean boot; duration ~12 seconds
- right afterwards (rerun from gdb); duration ~175 seconds

It shows that the locks are interesting for concurrency considerations, but here they are mostly noise.
OTOH it is interesting that the slow case has many more of them; maybe it needs to unlock more often in between for interactivity. I'll count them in post-processing.
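(Roughly like this, against the trace-cmd report output; the exact file name is whatever the recording produced:)

  trace-cmd report trace.dat > trace.txt
  grep -c vfio_pin_pages_remote trace.txt
  grep -c mutex_lock trace.txt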

The result shows that the locking involved scaled much more:
the duration is ~14.5x, while the locks are ~34x, and the average lock activity between vfio calls grows even more,
from 1.52 to 6.32.

But the calls to vfio_pin_pages_remote / vfio_iommu_map went up as well:
64020 -> 327852, which I guess means smaller chunks.
The timing of these functions also changes slightly; initially it is quite different from later in the trace.
Let's take a clean shot at just those durations and sizes:

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

(systemtap)
probe module("vfio_iommu_type1").function("vfio_iommu_type1_ioctl") {
    printf("New vfio_iommu_type1_ioctl\n");
    start_stopwatch("vfioioctl");
}
probe module("vfio_iommu_type1").function("vfio_iommu_type1_ioctl").return {
    timer=read_stopwatch_ns("vfioioctl")
    printf("Completed vfio_iommu_type1_ioctl: %d\n", timer);
    stop_stopwatch("vfioioctl");
}
probe module("vfio_iommu_type1").function("vfio_pin_pages_remote") {
    timer=read_stopwatch_ns("vfioioctl")
    printf("%ld: %s\n", timer, $$parms);
}

The overhead was significant: the fast case this time took 132 seconds, and the following slow case 209 seconds (still slower, but not by factors).

While the overhead might have influenced the absolute timings too much,
it still helps indicate the size/frequency distribution of these calls, which is what we wanted to know.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

This is a silly but useful distribution check with log10 of the allocation sizes:
Fast:
    108 3
   1293 4
  12133 5
 113330 6
  27794 7
   1119 8
Slow:
    194 3
   1738 4
  17375 5
 143411 6
     55 7
      3 8

I got no warnings about missed calls, but always be aware that some numbers might be off - usually they are OK for relative comparisons.
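
For reference, buckets like the above can be produced with something along these lines (a sketch; it assumes the per-call sizes were first extracted from the systemtap output into one number per line in sizes.txt):

  # bucket by log10 of the size, then count per bucket
  awk '{ if ($1 > 0) print int(log($1)/log(10)) }' sizes.txt | sort -n | uniq -c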

So yeah, the slow case just has to map more, smaller pieces, as that is all it can find.
That explains the "getting worse with system runtime", and I don't think there is much that can be done here.

I'll discuss if there is any gain in splitting this from one thread into many (for very huge memory sizes).

P.S. Finally, I just want to reiterate that using huge pages (pre-allocated, less fragmentation, fewer mappings to establish) really is the "config" way out of this.

Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

As qemu seems to be unable to do much about this, I'll set it to Triaged (we understand what is going on) and Low (not much we can do).

Changed in qemu (Ubuntu):
status: Incomplete → Triaged
importance: Medium → Low
Revision history for this message
Christian Ehrhardt  (paelzer) wrote :

As outlined in the past, conceptually there is nothing qemu can do.
The kernel could in theory make memory zeroing concurrent and thereby scale with the number of CPUs, but that effort has already been started twice and hasn't made it into the kernel yet.

Workarounds (huge pages) are known to shrink that time massively.

Also, this is nothing the kernel team would usually "do" themselves; they would follow upstream on it, so I'm leaving the task open but Low for them.
I realized those discussions were going on but were not linked here:
=> https://<email address hidden>/T/
=> https://lwn.net/Articles/728023/
=> https://blogs.oracle.com/linux/making-kernel-tasks-faster-with-ktask,-an-update

The ktask work is one example. I'd ask the kernel team to keep an eye open for that and be encouraging in the discussions; this really would have a huge benefit in these use cases.

Changed in linux (Ubuntu):
importance: Medium → Low
Changed in qemu (Ubuntu):
status: Triaged → Invalid
Changed in linux (Ubuntu):
assignee: Colin Ian King (colin-king) → nobody