Activity log for bug #1788226

Date Who What changed Old value New value Message
2018-08-21 15:45:47 Christian Ehrhardt  bug added bug
2018-08-21 15:45:57 Christian Ehrhardt  nominated for series Ubuntu Cosmic
2018-08-21 15:45:57 Christian Ehrhardt  bug task added libvirt (Ubuntu Cosmic)
2018-08-21 15:45:57 Christian Ehrhardt  nominated for series Ubuntu Bionic
2018-08-21 15:45:57 Christian Ehrhardt  bug task added libvirt (Ubuntu Bionic)
2018-08-21 17:42:07 dann frazier libvirt (Ubuntu Bionic): status New Triaged
2018-08-21 17:42:10 dann frazier libvirt (Ubuntu Bionic): status Triaged New
2018-08-22 05:39:13 Christian Ehrhardt  description

  Old value:

  I had an upstream discussion at [1]. The TL;DR is that when using more than a few host devices as passthrough, there can be an odd time window while shutting down the guest: the qemu process is already gone from /proc/<pid> but is still reachable via signal-0. That makes libvirt believe the process will not die (think of zombie processes due to e.g. failed NFS files). What really happens is that, depending on the hostdev, the kernel might need up to 1-2 seconds of extra time to deallocate all the PCI resources. We came up with patches that scale the allowed time with the number of hostdevs, as well as a general bump for the bad case (SIGKILL), following the recommendation of kernel engineers who said that we are on a bad path anyway, so we could as well provide a bit more time to let it clean up. We should add these changes at least back into Bionic IMHO.
  [1]: https://www.redhat.com/archives/libvir-list/2018-August/msg01295.html

  New value:

  [Impact]
   * Systems using plenty of PCI host devices can have lost the qemu PID from /proc/<pid> while the kernel still holds it for quite a while to clean up resources. That leads to libvirtd giving up on the process and keeping the guest forever in the state "in shutdown", as it cannot know what is going on.
   * It turns out the kernel can need quite some time, especially if it hits a per-device PCI subbus reset with its settling period.
   * We have tested and upstreamed changes that adapt the time libvirt waits in those cases. Backporting those to Cosmic/Bionic would be great for Ubuntu to be ready for that many PCI device passthroughs. Further back than that, I think the backport noise and regression-risk tradeoff is no longer worth it; also, in terms of requests, I only got it for 18.04.

  [Test Case]
   * Whether a certain PCI device needs the bus reset depends on the kernel driver. One that I could prove having this issue is Nvidia 3D controllers of the type "3D controller: NVIDIA Corporation Device 1db8". It is a general concept, so other PCI devices might work to reproduce the case, but I can't guarantee that, as drivers can implement smarter reset code. Keep that in mind when following the repro below.
   * The former default delay was ~15 seconds, with the guest needing ~1-2 seconds. So we need at least 14, better 16+, affected PCI devices. Pass those devices to the guest with a libvirt snippet like:
       <hostdev mode='subsystem' type='pci' managed='yes'>
         <source>
           <address domain='0x0000' bus='0x34' slot='0x0' function='0x0'/>
         </source>
       </hostdev>
   * Let the guest start up as you normally would.
   * Now run "virsh shutdown <guestname>".
   * You'll see shortly afterwards that the qemu process is gone from /proc/<pid>, but libvirt will still spin on its shutdown. That is due to signal 0 still being able to reach it, as it only guarantees the existence of the PID (not the process behind it); a minimal sketch of that check follows after this log.
   * After the timeout expires, libvirt would give up and leave the guest in a half-dead state like:
       libvirtd: virProcessKillPainfully:401 : Failed to terminate process 5114 with SIGKILL: Device or resource busy
   * With the fix the maximum delay is scaled as needed and the shutdown works, e.g. with 16 devices in about ~18 seconds. If one wants to check, log entries at debug level will contain the extra delay scaling; with 40 devices attached you get the proper plus 80 seconds (a sketch of that arithmetic also follows after this log):
       debug : virProcessKillPainfullyDelay:361 : vpid=34896 force=1 extradelay=80

  [Regression Potential]
   * I can think of two potential regressions, but neither is severe enough to stop this fix from improving the affected cases.
   * R1: For forced shutdowns (at the end of life of a guest) the worst-case timeout after delivering SIGKILL, when the process did not go down on SIGTERM, got bumped. That means really hanging cases will need a few seconds more until libvirt gives up on them. Those cases would never recover anyway, but the longer time until giving up can be visible to e.g. higher-level tools/admins.
   * R2: The scaling per host device was derived from explanations by kernel developers that there is some base work plus a 1-second settling period. Because of that, the increased maximum timeout can get really high if one day hundreds of host devices are passed through. But libvirt is on the lower layer here; if/once the kernel has code that no longer scales linearly, we can also make this extra delay less linear - so far nothing like that is on the horizon.
   * Please keep in mind that both risks I spelled out are only a problem if the guest process does not go away on SIGTERM/SIGKILL. That means it is already in a really bad spot and code path; we just wait longer to give it a chance to clean up. 99+% of cases will have guests that go away as they should, and those do not wait the full - now extended - delay.

  [Other Info]
   * n/a

  ---

  I had an upstream discussion at [1]. The TL;DR is that when using more than a few host devices as passthrough, there can be an odd time window while shutting down the guest: the qemu process is already gone from /proc/<pid> but is still reachable via signal-0. That makes libvirt believe the process will not die (think of zombie processes due to e.g. failed NFS files). What really happens is that, depending on the hostdev, the kernel might need up to 1-2 seconds of extra time to deallocate all the PCI resources. We came up with patches that scale the allowed time with the number of hostdevs, as well as a general bump for the bad case (SIGKILL), following the recommendation of kernel engineers who said that we are on a bad path anyway, so we could as well provide a bit more time to let it clean up. We should add these changes at least back into Bionic IMHO.
  [1]: https://www.redhat.com/archives/libvir-list/2018-August/msg01295.html
2018-08-23 00:13:09 Launchpad Janitor libvirt (Ubuntu Cosmic): status New Fix Released
2018-08-23 05:06:17 Christian Ehrhardt  libvirt (Ubuntu Bionic): status New Triaged
2018-08-23 08:34:25 Łukasz Zemczak libvirt (Ubuntu Bionic): status Triaged Fix Committed
2018-08-23 08:34:27 Łukasz Zemczak bug added subscriber Ubuntu Stable Release Updates Team
2018-08-23 08:34:28 Łukasz Zemczak bug added subscriber SRU Verification
2018-08-23 08:34:30 Łukasz Zemczak tags verification-needed verification-needed-bionic
2018-08-23 13:08:03 Christian Ehrhardt  tags verification-needed verification-needed-bionic verification-done verification-done-bionic
2018-09-03 07:58:46 Launchpad Janitor libvirt (Ubuntu Bionic): status Fix Committed Fix Released
2018-09-03 07:58:55 Łukasz Zemczak removed subscriber Ubuntu Stable Release Updates Team
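
A minimal standalone C sketch (not libvirt's actual code) of the liveness check mentioned in the description above: kill(pid, 0) delivers no signal and only asks the kernel whether it still knows the PID, so a qemu process that is stuck in the kernel releasing PCI resources can still look "alive" to this check even though its /proc entry is effectively gone and it no longer responds.

  #include <errno.h>
  #include <signal.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <sys/types.h>
  #include <unistd.h>

  /* Return 1 if the kernel still knows this PID, 0 if it is truly gone. */
  static int pid_still_exists(pid_t pid)
  {
      if (kill(pid, 0) == 0)
          return 1;              /* PID exists (possibly exiting/zombie)  */
      return errno != ESRCH;     /* EPERM also means the process exists   */
  }

  int main(int argc, char **argv)
  {
      pid_t pid = argc > 1 ? (pid_t)atoi(argv[1]) : getpid();

      printf("PID %d %s\n", (int)pid,
             pid_still_exists(pid) ? "still exists" : "is gone");
      return 0;
  }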
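
The extra delay scaling itself is simple arithmetic. The sketch below is illustrative only and not libvirt's implementation (see virProcessKillPainfullyDelay() upstream for that); the constants and the helper name are assumptions derived from the figures quoted in the description: the former ~15 second default wait and the debug line showing extradelay=80 for 40 attached host devices, i.e. roughly 2 extra seconds per device.

  #include <stdio.h>

  /* Assumed figures, taken from the bug description rather than the code: */
  #define BASE_WAIT_SECONDS      15  /* former fixed worst-case wait        */
  #define EXTRA_SECONDS_PER_DEV   2  /* cleanup/settling budget per hostdev */

  /* Hypothetical helper: upper bound on how long to wait for the guest. */
  static unsigned int guest_shutdown_wait(unsigned int nhostdevs)
  {
      return BASE_WAIT_SECONDS + EXTRA_SECONDS_PER_DEV * nhostdevs;
  }

  int main(void)
  {
      /* 16 devices: around the point where the old 15 s limit fell short */
      printf("16 hostdevs -> wait up to %u s\n", guest_shutdown_wait(16));
      /* 40 devices: matches the "extradelay=80" debug line in the log    */
      printf("40 hostdevs -> wait up to %u s\n", guest_shutdown_wait(40));
      return 0;
  }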