I modified the kernel to mark a few functions as non-inlined so they are better traceable:
vfio_dma_do_map
vfio_dma_do_unmap
mutex_lock
mutex_unlock
kzalloc
vfio_link_dma
vfio_pin_map_dma
vfio_pin_pages_remote
vfio_iommu_map
Then I ran tracing on this load, limited to the functions in my focus:
$ sudo trace-cmd record -p function_graph -l vfio_iommu_type1_ioctl -l vfio_pci_ioctl -l vfio_set_irqs_validate_and_prepare -l vfio_pci_set_irqs_ioctl -l pci_probe_reset_slot -l pci_probe_reset_bus -l vfio_pci_count_devs -l vfio_pci_fill_devs -l vfio_pci_validate_devs -l vfio_group_get_external_user -l vfio_group_put_external_user -l vfio_external_user_iommu_id -l vfio_dma_do_map -l vfio_dma_do_unmap -l mutex_lock -l mutex_unlock -l vfio_link_dma -l vfio_pin_map_dma -l vfio_pin_pages_remote -l vfio_iommu_map
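As a side note, a `-l` list this long is tedious to maintain by hand; it could be generated from a plain file with one function name per line. A small shell sketch (the file name and its contents here are just an example, not the full list from the real run):

```shell
# Hypothetical helper: build the "-l func" filter arguments for trace-cmd
# from a plain file containing one function name per line.
cat > funcs.txt <<'EOF'
vfio_dma_do_map
vfio_dma_do_unmap
mutex_lock
mutex_unlock
EOF
# Turn each name into " -l name" and splice into the record command line.
FILTER=$(awk '{printf " -l %s", $1}' funcs.txt)
echo "sudo trace-cmd record -p function_graph$FILTER"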
I traced this twice to compare:
- after a clean boot; duration ~12 seconds
- right afterwards (rerun from gdb); duration ~175 seconds
It shows that the locks are interesting for concurrency considerations, but here they are mostly noise.
OTOH it is interesting that the slow case has far more of them; maybe the code has to unlock more often in between to stay interactive. I'll count them in postprocessing.
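A sketch of that counting over a text dump of the trace; the sample lines below only imitate typical `trace-cmd report` output and are fabricated, not from the real run (the real input would come from `trace-cmd report > report.txt`):

```shell
# Count lock calls and vfio map calls in a report dump, then compute the
# locks-per-vfio-call ratio. sample-report.txt is fabricated example data.
cat > sample-report.txt <<'EOF'
 qemu-1234 [000] 100.000001: funcgraph_entry: vfio_dma_do_map();
 qemu-1234 [000] 100.000002: funcgraph_entry: mutex_lock();
 qemu-1234 [000] 100.000003: funcgraph_entry: mutex_unlock();
 qemu-1234 [000] 100.000004: funcgraph_entry: mutex_lock();
EOF
locks=$(grep -c 'mutex_lock()' sample-report.txt)
vfio=$(grep -c 'vfio_dma_do_map()' sample-report.txt)
awk -v l="$locks" -v v="$vfio" 'BEGIN { printf "locks=%d vfio=%d ratio=%.2f\n", l, v, l/v }'
```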
The result shows that the locking involved scaled much more than the runtime.
Duration grew ~14.5x while locks grew ~34x; the average lock activity between vfio calls rose even more, from 1.52 to 6.32.
But the calls to vfio_pin_pages_remote / vfio_iommu_map went up as well:
64020 -> 327852, which I guess means the same memory is pinned in smaller chunks.
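A back-of-the-envelope check of that guess, assuming both runs map the same total amount of guest memory:

```shell
# If the total pinned memory is constant, ~5x more pin calls means the
# average chunk shrinks to roughly a fifth of its former size.
awk 'BEGIN {
  fast = 64020; slow = 327852;    # pin call counts from the two runs
  ratio = slow / fast;
  printf "call ratio: %.2fx -> average chunk ~%.0f%% of before\n", ratio, 100 / ratio;
}'
```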
The timing of these functions also changes over the run; early in the trace it is much different than later on.
Let's take a clean shot at just those durations and sizes:
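One way to pull just those durations out is to filter the function_graph report with awk. The line format below is an assumption modeled on a typical function_graph dump, and sample-graph.txt is fabricated example data; adjust the field index to whatever `trace-cmd report` actually prints for your run:

```shell
# Average the per-call duration of vfio_pin_pages_remote from a
# function_graph style dump (real input: trace-cmd report > graph.txt).
cat > sample-graph.txt <<'EOF'
 1) + 12.345 us | vfio_pin_pages_remote();
 1) + 98.765 us | vfio_pin_pages_remote();
 1) +  3.210 us | vfio_iommu_map();
EOF
# Field $3 is the duration in microseconds in this assumed layout.
awk '/vfio_pin_pages_remote/ { sum += $3; n++ }
     END { printf "n=%d avg=%.3f us\n", n, sum / n }' sample-graph.txt
```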