The speed differs a lot between retries:

First run (after boot):
5204 3.793765 <... ioctl resumed> , 0x7ffffa6687b0) = 0 <13.464192>
Second run (after aborting the previous qemu)
5251 179.361434 <... ioctl resumed> , 0x7ffed49fbc70) = 0 <189.030433>
Third run (after aborting the previous qemu)
5728 703.886935 <... ioctl resumed> , 0x7fff9f6a7120) = 0 <713.556876>
Maybe a previous unclean stop makes it take extra time?
Or there is still cleanup going on in the IOMMU which stalls the follow-on mappings from being established.
Fourth run (after proper guest shutdown + 10 min wait)
6171 11.167221 <... ioctl resumed> , 0x7ffe597f03e0) = 0 <20.838401>
Fifth run (kill and restart again)
6215 1182.244403 <... ioctl resumed> , 0x7fff20a49b10) = 0 <1191.917291>
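For reference, timings like the above can be collected by attaching strace to the qemu process, assuming that is how these traces were taken (-r prints relative timestamps, -T the time spent in each syscall):
$ strace -f -r -T -e trace=ioctl -p <qemu-pid>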
The above was through libvirt; now running qemu directly to make it easier to drop into debugging:
$ virsh nodedev-detach pci_0000_21_00_1 --driver vfio
$ qemu/x86_64-softmmu/qemu-system-x86_64 -name guest=test-vfio-slowness -m 131072 -smp 1 -no-user-config -drive file=/var/lib/uvtool/libvirt/images/eoan.qcow,format=qcow2,if=none,id=drive-virtio-disk0 -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -device vfio-pci,host=21:00.1,id=hostdev0,bus=pci.0,addr=0x7 -display curses
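The long ioctl in the traces above is VFIO_IOMMU_MAP_DMA mapping the guest RAM (128G here, -m 131072). As a point of reference, a minimal standalone C sketch to time a single such mapping outside of qemu could look like this — the group number and the 1 GiB test size are made-up placeholders, not from the report, and it needs to run as root with a sufficient memlock limit:

/* Minimal sketch: time one VFIO_IOMMU_MAP_DMA call outside of qemu.
 * Placeholders (not from the report): IOMMU group 26, 1 GiB buffer.
 * Error handling omitted for brevity. */
#include <fcntl.h>
#include <linux/vfio.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <time.h>

int main(void)
{
	size_t sz = 1UL << 30;                    /* 1 GiB test region */
	int container = open("/dev/vfio/vfio", O_RDWR);
	int group = open("/dev/vfio/26", O_RDWR); /* placeholder group */
	struct vfio_iommu_type1_dma_map map = { .argsz = sizeof(map) };
	struct timespec t0, t1;

	/* Attach the group to the container, select the type1 IOMMU */
	ioctl(group, VFIO_GROUP_SET_CONTAINER, &container);
	ioctl(container, VFIO_SET_IOMMU, VFIO_TYPE1_IOMMU);

	map.vaddr = (unsigned long)mmap(NULL, sz, PROT_READ | PROT_WRITE,
					MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	map.iova = 0;
	map.size = sz;
	map.flags = VFIO_DMA_MAP_FLAG_READ | VFIO_DMA_MAP_FLAG_WRITE;

	/* This is the call that pins and maps every page of the region,
	 * i.e. the one that stalls for the 128G guest in the traces above. */
	clock_gettime(CLOCK_MONOTONIC, &t0);
	int rc = ioctl(container, VFIO_IOMMU_MAP_DMA, &map);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("VFIO_IOMMU_MAP_DMA: rc=%d, %.3f s\n", rc,
	       (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
	return 0;
}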
10 sec iommu counters while in the hang:

    34330395280      cycles
             16      cmd_processed            (33.33%)
              6      cmd_processed_inv        (33.37%)
              0      ign_rd_wr_mmio_1ff8h     (33.41%)
             45      int_dte_hit              (33.44%)
              0      int_dte_mis              (33.44%)
            345      mem_dte_hit              (33.44%)
            907      mem_dte_mis              (33.44%)
            111      mem_iommu_tlb_pde_hit    (33.44%)
            102      mem_iommu_tlb_pde_mis    (33.40%)
           7927      mem_iommu_tlb_pte_hit    (33.37%)
           1761      mem_iommu_tlb_pte_mis    (33.33%)
              0      mem_pass_excl            (33.30%)
              0      mem_pass_pretrans        (33.29%)
              0      mem_pass_untrans         (33.29%)
              0      mem_target_abort         (33.29%)
           8492      mem_trans_total          (33.30%)
              0      page_tbl_read_gst        (33.30%)
            168      page_tbl_read_nst        (33.30%)
            129      page_tbl_read_tot        (33.30%)
              0      smi_blk                  (33.30%)
              0      smi_recv                 (33.31%)
              0      tlb_inv                  (33.31%)
              0      vapic_int_guest          (33.31%)
             24      vapic_int_non_guest      (33.31%)

   10.011867065 seconds time elapsed
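For reference (not part of the original report), counters like the above come from the AMD IOMMU PMU and can be collected system-wide with perf, e.g.:
$ perf stat -e 'amd_iommu_0/mem_trans_total/,amd_iommu_0/mem_iommu_tlb_pte_mis/' -a sleep 10
The PMU shows up as amd_iommu or amd_iommu_0 depending on the kernel version; /sys/bus/event_source/devices/ lists the exact name and the available events.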
Then again, the slowness doesn't necessarily have to show up in these counters.
Next step: check the kernel to see whether (on x86) there is any place to parallelize this.
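A standard way to find that place (not from the report) is to profile system-wide during the hang and see where the mapping ioctl spends its time, e.g. whether it sits in the page-pinning loop of drivers/vfio/vfio_iommu_type1.c:
$ perf record -a -g -- sleep 10
$ perf report
If most samples land in a single thread walking the pages of the 128G guest, that would point at a serial loop worth parallelizing.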