libvirt might misdetect guests as not properly shut down when using many hostdev resources
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
libvirt (Ubuntu) | Fix Released | Undecided | Unassigned |
Bionic | Fix Released | Undecided | Unassigned |
Cosmic | Fix Released | Undecided | Unassigned |
Bug Description
[Impact]
* On systems passing through many PCI host devices, the qemu process can already be gone from
  /proc/<pid> while the kernel still holds on to it for quite a while to clean up resources.
  That leads to libvirtd giving up on the process and leaving the guest forever in the state
  "in shutdown", as it cannot know what is going on (a small sketch at the end of this section
  illustrates the two checks involved).
* It turns out the kernel can need quite some time, especially if it has to do a settling
  PCI sub-bus reset per device.
* We have tested and upstreamed changes that adapt the time libvirt waits in those cases.
  Backporting them to Cosmic/Bionic would prepare Ubuntu for that many PCI device
  passthroughs. Further back than that I think the backport noise and regression-risk
  tradeoff is no longer worth it, and the requests I received were only for 18.04.
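As a rough illustration only (this is not libvirt code, and the PID argument is a placeholder for the guest's qemu PID), the Python sketch below shows the two ways of asking "is this process still there?" that can disagree during the window described above: /proc/<pid> can already be gone while signal 0 still reaches the PID because the kernel is still releasing the PCI resources.

    import errno
    import os
    import sys

    def liveness_checks(pid):
        """Compare the /proc view with the signal-0 view of a PID."""
        in_proc = os.path.exists("/proc/%d" % pid)
        try:
            os.kill(pid, 0)          # signal 0: only checks that the PID exists
            signal0 = True
        except OSError as err:
            signal0 = (err.errno == errno.EPERM)  # EPERM: exists, but not ours
        return in_proc, signal0

    if __name__ == "__main__":
        pid = int(sys.argv[1])       # e.g. the qemu PID of the guest being shut down
        in_proc, signal0 = liveness_checks(pid)
        if signal0 and not in_proc:
            print("PID %d: gone from /proc but still held by the kernel "
                  "(the window this bug is about)" % pid)
        else:
            print("PID %d: /proc=%s signal0=%s" % (pid, in_proc, signal0))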
[Test Case]
* Whether a certain PCI device needs the bus reset depends on its kernel driver. One type I
  could prove to have this issue are Nvidia 3D controllers identified as "3D controller:
  NVIDIA Corporation Device 1db8". The concept is general, so other PCI devices might work to
  reproduce the case, but I can't guarantee it, as drivers can implement smarter reset code.
  Keep that in mind when following the repro below.
* The former default delay was ~15 seconds and the guest itself needs ~1-2 seconds, so we
  need at least 14, better 16+, affected PCI devices. Pass those devices to the guest with a
  libvirt snippet like:
  <hostdev mode='subsystem' type='pci' managed='yes'>
    <source>
      <address domain='0x0000' bus='0x34' slot='0x0' function='0x0'/>
    </source>
  </hostdev>
* Let the guest start up as you normally would.
* Now run "virsh shutdown <guestname>".
* You'll see shortly after that the qemu process is gone from /proc/<pid>, but libvirt will
  still spin on its shutdown. That is because signal 0 can still reach it, as that only
  guarantees the existence of the PID (not the process behind it).
* After the timeout expires libvirt would give up and leave the guest in a half-dead state,
  logging something like:
  libvirtd: virProcessKillP... 5114 with SIGKILL: Device or resource busy
* With the fix the maximum delay is scaled as needed and the shutdown works, e.g. taking
  about ~18 seconds with 16 devices (see the sketch after this list for a rough illustration
  of the scaling).
  If one wants to check, log entries at debug level will contain the extra delay scaling,
  e.g. with 40 devices attached you get the proper extra 80 seconds:
  debug : virProcessKillP... extradelay=80
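To make the scaling concrete, here is a minimal Python sketch of the idea, not libvirt's actual implementation: the maximum wait is extended by a per-hostdev allowance. The 2-seconds-per-device factor is an assumption read off the debug example above (40 devices -> extradelay=80), the ~15 second base is the former default mentioned earlier, and the function names are hypothetical.

    import errno
    import os
    import time

    BASE_TIMEOUT = 15   # former fixed maximum wait in seconds (approximate)
    PER_HOSTDEV = 2     # assumed extra seconds allowed per passed-through hostdev

    def pid_gone(pid):
        """True once the kernel has fully released the process."""
        try:
            os.kill(pid, 0)
            return False
        except OSError as err:
            return err.errno == errno.ESRCH

    def wait_for_shutdown(pid, num_hostdevs):
        """Wait for the qemu process to really vanish, allowing extra time
        for per-device PCI cleanup instead of giving up after a fixed delay."""
        deadline = time.monotonic() + BASE_TIMEOUT + PER_HOSTDEV * num_hostdevs
        while time.monotonic() < deadline:
            if pid_gone(pid):
                return True     # guest cleaned up; no "in shutdown" limbo
            time.sleep(0.5)
        return False            # only now would libvirt give up

    # With 40 hostdevs this allows 15 + 80 seconds before giving up; a guest that
    # shuts down cleanly (e.g. ~18s with 16 devices) never waits the full delay.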
[Regression Potential]
* I can think of two potential regressions, but neither is severe enough to stop this fix
  from improving the affected cases.
* R1: For forced shutdowns (at the end of a guest's life) the worst-case timeout after
  delivering SIGKILL, used when the process did not go down on SIGTERM, got bumped. That
  means genuinely hanging cases will need a few seconds more until libvirt gives up on them.
  Those cases would never recover anyway, but the longer time until giving up can be visible
  to e.g. higher level tools/admins.
* R2: The scaling per hostdev was derived from explanations by kernel developers that there
  is some base work plus a 1 second settling period per device. As a result the increased
  maximum timeout can get really high if one day hundreds of hostdevs are passed through.
  But libvirt is the lower layer here; if/once the kernel gets code that no longer scales
  linearly, we can also make this extra delay less than linear. So far nothing like that is
  on the horizon.
* Please keep in mind that both risks spelled out above are only a problem if the guest
  process does not go away on SIGTERM/SIGKILL. In that case it is already in a really bad
  spot and code path; we just wait longer to give it a chance to clean up. In 99+% of cases
  the guest goes away as it should, and those cases do not wait for the full - now extended -
  delay.
[Other Info]
* n/a
---
I had an upstream discussion at [1].
The TL;DR is that when passing through more than a few hostdevs, there can be an odd period while shutting down the guest.
In that period the qemu process is already gone from /proc/<pid> but still reachable via signal 0.
That makes libvirt believe the process will not die (think of zombie processes due to e.g. failed NFS files).
But what really happens is that, depending on the hostdev, the kernel might need up to 1-2 seconds of extra time to deallocate all the PCI resources.
We came up with patches that scale the allowed time with the number of hostdevs, as well as a general bump for the bad case (SIGKILL), following the recommendation of kernel engineers who said that we are in a bad path anyway, so we might as well provide a bit more time to let it clean up.
We should add these changes at least back into Bionic IMHO.
[1]: https:/
Changed in libvirt (Ubuntu Bionic):
status: New → Triaged
status: Triaged → New
description: updated
Changed in libvirt (Ubuntu Bionic):
status: New → Triaged
Test builds of the Cosmic version of libvirt with the upstream accepted patches are building in [1].
Once I have verified that from the PPA and through some regression pre-checks, we can upload to Cosmic.
[1]: https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/3370