guest cleanup script fails to iterate
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
| libvirt (Ubuntu) |
Undecided
|
Unassigned | ||
| Xenial |
Undecided
|
Unassigned | ||
| Artful |
Undecided
|
Unassigned | ||
| Bionic |
Undecided
|
Unassigned |
Bug Description
[Impact]
* Due to a bug in a recent fix guests now are immediately considered shut
down. Due to that the helper to shut down guests cleanly might not wait
long enough.
* backporting the upstream fix to the affected releases.
[Test Case]
* Spawn a few KVM guests through libvirt, e.g.
$ uvt-simplestrea
$ uvt-kvm create --memory 128 --password ubuntu b1 arch=amd64 release=bionic label=daily
$ uvt-kvm create --memory 128 --password ubuntu b2 arch=amd64 release=bionic label=daily
$ uvt-kvm create --memory 128 --password ubuntu b3 arch=amd64 release=bionic label=daily
* A) Shut down the computer and watch how it handles guest shutdown.
* B) You can better track that if instead of a shutdown you call what the
shutdown would via:
$ sudo /usr/lib/
* It will consider guests as shut down no matter what, so you might want
to modify guests to not shut down or if you have a system to do so use
huge guests with plenty of dirty memory which will slow it down.
One good trick is to start the guests right before stopping.
Guests are not yet responding to shutdown signals at that time, so the
issue is revealed (not waiting on shutdown)
To do so you could stop the guests and then do:
$ virsh start b1; virsh start b2; virsh start b3;
$ /usr/lib/
The bad case looks like:
Running guests on default URI: b1, b2, b3
Shutting down guests on default URI...
Starting shutdown on guest: b1
Starting shutdown on guest: b2
Starting shutdown on guest: b3
Waiting for 3 guests to shut down, 120 seconds left
Shutdown of guest b1 complete.
Shutdown of guest b2 complete.
Shutdown of guest b3 complete.
But you can check with virsh list they are still alive.
The improved case at least properly waits like:
...
Waiting for 3 guests to shut down, 120 seconds left
Waiting for 3 guests to shut down, 115 seconds left
Waiting for 3 guests to shut down, 110 seconds left
...
[Regression Potential]
* The fix "only" affects this shutdown handling and no other part of
libvirt which limits the ranges of potential regressions.
* The current issue is that we don't wait enough for the geusts, we can't
wait "less" so we won't regress to be less forgiving slow-shutdowns,
but if a mistake would be in the code the regression risk would
be to slow down shutdowns - tests did not see that, but this section is
about regression POTENTIAL so I thought I add it.
[Other Info]
* n/a
---
The recent fix to libvirt-
(variable scope) but failed to adapt the loop in check_guests_
Due to that it currently detects all guests as "Failed to determine state of guest".
It all works, but guests are just assumed dead and logs are spilled with false-positive warnings.
A suggested fix is at: https:/
Christian Ehrhardt (paelzer) wrote : | #1 |
Christian Ehrhardt (paelzer) wrote : | #2 |
Submitted upstream as https:/
Changed in libvirt (Ubuntu): | |
status: | New → In Progress |
Launchpad Janitor (janitor) wrote : | #3 |
Status changed to 'Confirmed' because the bug affects multiple users.
Changed in libvirt (Ubuntu Artful): | |
status: | New → Confirmed |
Changed in libvirt (Ubuntu Xenial): | |
status: | New → Confirmed |
Dariusz Gadomski (dgadomski) wrote : | #5 |
I can confirm. With the patch applied all guests are being shutdown correctly and the "Failed to determine state of guest" message is gone.
Tested for PARALLEL=2, 5, 10 for 40 guests.
Christian Ehrhardt (paelzer) wrote : | #6 |
Thanks for the testing I added another fix in case guests are fully undefined (or libvirt gone) in the meantime. Otherwise it might report an empty variable guest name.
Submitted as v2: https:/
Dariusz Gadomski (dgadomski) wrote : | #7 |
I have repeated the tests for the second part of patch as well - everything went as expected.
Although I haven't been able to hit the undefined guest scenario before. I'll make a couple of attempts more.
Christian Ehrhardt (paelzer) wrote : | #8 |
Changes just got accepted as:
08bb5eeb tools: fix check_guests_
cce57265 tools: do not report unknown guests in print_guests_
Lets prep a fix for Bionic asap to hopefully get it in the release.
But libvirt is seeded and ISO respins are - umm - unwanted at this moment.
OTOH there are always chances there is a ISO respin anyway so that this could slip in and I'd not want the release to contain this bug if possible.
Therefore I'll upload it but if the release team rejects it we convert it to a very early SRU.
Upload is available as 4.0.0-1ubuntu8 now for review by the release Team.
Janåke Rönnblom (jan-ake) wrote : | #9 |
If you have a cluster (corosync and pacemaker) running a VM as a resource then during a shutdown the VM would be removed from virsh list since its running as a transient domain.
https:/
This would cause the VM name to be unknown when we later try to lookup its name since the VM is removed from virsh when shutdown.
Another way to fix it would be to store the VM name before shutdown. Then it could be printed in print_guests_
-J
Christian Ehrhardt (paelzer) wrote : | #10 |
Thank Janake,
you are right that transients would vanish in that case.
I were able to construct the same issue by concurrently destroy/undefining them simulating some other script running on shutdown. That is rather similar to the corosync case that does things on shutdown.
But with the current set of fixes at least there isn't a try to report on "" names.
Instead it silently accepts that the guest is at least gone, which was the mission of the shutdown script in the first place.
I don't like touching this script too much if not needed to fix issues.
I think storing the name per UUID initially to refer later might be nice but counts as such a case where I'd avoid to touch it. If you think this would be very beneficial open another bug for that aspect and we can discuss about it there.
Launchpad Janitor (janitor) wrote : | #11 |
This bug was fixed in the package libvirt - 4.0.0-1ubuntu8
---------------
libvirt (4.0.0-1ubuntu8) bionic; urgency=medium
* Fix clean shut down of guests on system shutdown (LP: #1764668)
- d/p/ubuntu/
- d/p/ubuntu/
-- Christian Ehrhardt <email address hidden> Tue, 24 Apr 2018 11:09:48 +0200
Changed in libvirt (Ubuntu Bionic): | |
status: | In Progress → Fix Released |
description: | updated |
Christian Ehrhardt (paelzer) wrote : | #12 |
Note: I realized the script might want to re-issue the shutdown command but that is for a different time/bug.
After a final test against [1] with the proposed upload being good I pushed this for SRU.
Also I added a proper SRU Template to the bug.
For SRU Review:
libvirt_
libvirt_
And the same content, tags and branches for xenial/artful to the packaging repositories.
That said waiting for the SRU Team to accept this as next step.
[1]: https:/
description: | updated |
Robie Basak (racb) wrote : | #13 |
12:35 <rbasak> cpaelzer: on bug 1764668
12:35 <ubottu> bug 1764668 in libvirt (Ubuntu Artful) "guest cleanup script fails to iterate" [Undecided,
12:35 <rbasak> cpaelzer: is it possible that some users have guests that fail to shut down in production? In that case, what will happen to them after this SRU?
12:36 <rbasak> eg. will the host now not shut down, or they'll have to wait for a timeout and then it will?
12:37 <cpaelzer> rbasak: hiho
12:39 <cpaelzer> rbasak: this is a two stage thing
12:40 <cpaelzer> rbasak: let me explain ...
12:41 <cpaelzer> before 1.3.1-1ubuntu10.21 there were cases where it could hang on shutdown
12:41 <cpaelzer> this was fixed and good in that regard, but the fix introduced another issue which was missed
12:41 <cpaelzer> so the new case is that all guests get a shutdown signal now, but it immediately detects them as shut down and goes on with the systems shutdown
12:42 <cpaelzer> the latter is fixed by the current SRU
12:42 <cpaelzer> so currently people might have guests that shut down but too fast go on and shut down the full system
12:42 <cpaelzer> with the fix it will wait up to timeout (as defined in the config)
12:43 <cpaelzer> it will not exceed that timeout and hang (as before the last SRU)
12:43 <cpaelzer> but it will also not go on too fast
12:43 <cpaelzer> rbasak: did the above eplxain what you wanted to know?
12:43 <rbasak> Yes, thanks. The case I had in mind can't happen because of the timeout.
tags: | added: regression-update |
Hello ChristianEhrhardt, or anyone else affected,
Accepted libvirt into xenial-proposed. The package will build now and be available at https:/
Please help us by testing this new package. See https:/
If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-
Further information regarding the verification process can be found at https:/
Changed in libvirt (Ubuntu Xenial): | |
status: | Confirmed → Fix Committed |
tags: | added: verification-needed verification-needed-xenial |
Changed in libvirt (Ubuntu Artful): | |
status: | Confirmed → Fix Committed |
tags: | added: verification-needed-artful |
Robie Basak (racb) wrote : | #15 |
Hello ChristianEhrhardt, or anyone else affected,
Accepted libvirt into artful-proposed. The package will build now and be available at https:/
Please help us by testing this new package. See https:/
If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-
Further information regarding the verification process can be found at https:/
Christian Ehrhardt (paelzer) wrote : | #16 |
As described in testcases I set up a few guests and made sure some of them won't shutdown properly. We want to see
1. all tried to shut down
2. waiting up until timeout for the one that does not shut down
## Pre-Fix ##
Xenial:
sudo /usr/lib/
Running guests on default URI: b4, b1, b2, b3
Shutting down guests on default URI...
Starting shutdown on guest: b4
Starting shutdown on guest: b1
Waiting for 4 guests to shut down, 120 seconds left
Failed to determine state of guest: 83964d03-
Failed to determine state of guest: 9f11ae0c-
Shutdown of guest b4 complete.
Shutdown of guest b1 complete.
Starting shutdown on guest: b2
Starting shutdown on guest: b3
Shutdown of guest b2 complete.
Shutdown of guest b3 complete.
root@x:~# virsh list
Id Name State
-------
5 b4 running
6 b1 running
7 b2 running
8 b3 running
Artful:
# sudo /usr/lib/
Running guests on default URI: b1, b2, b3, b4
Shutting down guests on default URI...
Starting shutdown on guest: b1
Starting shutdown on guest: b2
Waiting for 4 guests to shut down, 120 seconds left
Failed to determine state of guest: 80ba5ac4-
Failed to determine state of guest: 688847ca-
Shutdown of guest b1 complete.
Shutdown of guest b2 complete.
Starting shutdown on guest: b3
Starting shutdown on guest: b4
Shutdown of guest b3 complete.
Shutdown of guest b4 complete.
root@a:~# virsh list
Id Name State
-------
1 b1 running
4 b4 running
#### With the fixes from proposed ####
libvirt-
libvirt-
libvirt-
libvirt-
libvirt0/
libvirt-
libvirt0/
Xenial:
root@x:~# sudo /usr/lib/
Running guests on default URI: b4, b2, b1, b3
Shutting down guests on default URI...
Starting shutdown on guest: b4
Starting shutdown on guest: b2
Waiting for 4 guests to shut down, 120 seconds left
Waiting for 4 guests to shut down, 115 seconds left
Waiting for 4 guests to shut down, 110 seconds left
Waiting for 4 guests to shut down, 105 seconds left
Waiting for 4 guests to shut down, 100 seconds left
Waiting for 4 guests to shu...
tags: |
added: verification-done verification-done-artful verification-done-xenial removed: verification-needed verification-needed-artful verification-needed-xenial |
Christian Ehrhardt (paelzer) wrote : | #17 |
@SRU-Team: Dep8 tests show unrelated issues in cockpit for Artful.
I'm in discussion with pitti if we should even mask those tests, considering what fails and what we changed they are really unrelated (and this type of error showed before the changes of this SRU).
See the artful column in [1] for its history.
So I'd ask you to ignore the three hits on cockpit for artful for this SRU.
If pitti and I conclude that this should be a Britney override we will do so independent to this SRU.
Christian Ehrhardt (paelzer) wrote : | #18 |
FYI - we opened [1], and while this needs some processing to be acked an merged I'd ask the SRU team to consider these autopkgtest fails not a blocker to move on with thie SRU here atm.
[1]: https:/
Christian Ehrhardt (paelzer) wrote : | #19 |
In fact the MP above is approved by Pitti who sort of owns cockpit.
Since neither of us can push this change I'd appreciate if on SRU handling this could be done (as it is ~ubuntu-sru scope).
Launchpad Janitor (janitor) wrote : | #20 |
This bug was fixed in the package libvirt - 1.3.1-1ubuntu10.22
---------------
libvirt (1.3.1-
* Fix clean shut down of guests on system shutdown (LP: #1764668)
- d/p/ubuntu/
- d/p/ubuntu/
-- Christian Ehrhardt <email address hidden> Wed, 25 Apr 2018 09:26:12 +0200
Changed in libvirt (Ubuntu Xenial): | |
status: | Fix Committed → Fix Released |
The verification of the Stable Release Update for libvirt has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.
Launchpad Janitor (janitor) wrote : | #22 |
This bug was fixed in the package libvirt - 3.6.0-1ubuntu6.6
---------------
libvirt (3.6.0-1ubuntu6.6) artful; urgency=medium
* Fix clean shut down of guests on system shutdown (LP: #1764668)
- d/p/ubuntu/
- d/p/ubuntu/
-- Christian Ehrhardt <email address hidden> Wed, 25 Apr 2018 09:24:08 +0200
Changed in libvirt (Ubuntu Artful): | |
status: | Fix Committed → Fix Released |
Without the fix they were in my case always detected as immediately shut down which likely is wrong and can end in non-clean shutdown even it would have been clean if the guests had some more time.
This was due to the fact that it picked up the empty external var to loop over and then the returned list was obviously empty.
Tests with the fix:
Tested guests not going away (just kill them so early that they do not respond to shutdown yet)
=> ok (took configured max time)
Running guests on default URI: b1, b2, b3
Shutting down guests on default URI...
Starting shutdown on guest: b1
Starting shutdown on guest: b2
Starting shutdown on guest: b3
Waiting for 3 guests to shut down, 120 seconds left
Waiting for 3 guests to shut down, 115 seconds left
Waiting for 3 guests to shut down, 110 seconds left
Waiting for 3 guests to shut down, 105 seconds left
...
Waiting for 3 guests to shut down, 5 seconds left
Timeout expired while shutting down domains
Tested guests going away one by one (just kill them so early that they do not respond to shutdown yet) - and then virsh destroy them over time
=> ok (detected one by one as they should be)
Running guests on default URI: b1, b2, b3
Shutting down guests on default URI...
Starting shutdown on guest: b1
Starting shutdown on guest: b2
Starting shutdown on guest: b3
Waiting for 3 guests to shut down, 120 seconds left
Waiting for 3 guests to shut down, 115 seconds left
Shutdown of guest b2 complete.
Waiting for 2 guests to shut down, 110 seconds left
Waiting for 2 guests to shut down, 105 seconds left
Shutdown of guest b3 complete.
Waiting for 1 guests to shut down, 100 seconds left
Waiting for 1 guests to shut down, 95 seconds left
Waiting for 1 guests to shut down, 90 seconds left
Shutdown of guest b1 complete.
Tested all guests going away immediately (normal guests)
=> ok, detected all in one "loop"
Running guests on default URI: b1, b2, b3
Shutting down guests on default URI...
Starting shutdown on guest: b1
Starting shutdown on guest: b2
Starting shutdown on guest: b3
Waiting for 3 guests to shut down, 120 seconds left
Shutdown of guest b1 complete.
Shutdown of guest b2 complete.
Shutdown of guest b3 complete.
Tested parallel=2 with guests going away one by one (but not all eventually).
=> ok, one down kicked next shutdown
=> ok partial looping did still work to catch timeout
Running guests on default URI: b1, b2, b3
Shutting down guests on default URI...
Starting shutdown on guest: b1
Starting shutdown on guest: b2
Waiting for 3 guests to shut down, 60 seconds left
Shutdown of guest b1 complete.
Starting shutdown on guest: b3
Waiting for 2 guests to shut down, 55 seconds left
...
Waiting for 2 guests to shut down, 5 seconds left
Timeout expired while shutting down domains
Tested parallel=2 with guests going away normally
=> ok, one down kicked next shutdown
=> ok, loop continued and completed
Running guests on default URI: b1, b2, b3
Shutting down guests on default URI...
Starting shutdown on guest: b1
Starting shutdown on guest: b2
Waiting for 3 guests to shut down, 60 seconds left
Shutdown of guest b1 complete.
Shutdown of guest b2 complete.
Starting shutdown on guest: b3
Shutdown of ...