guest cleanup script fails to iterate

Bug #1764668 reported by Christian Ehrhardt  on 2018-04-17
20
This bug affects 3 people
Affects Status Importance Assigned to Milestone
libvirt (Ubuntu)
Undecided
Unassigned
Xenial
Undecided
Unassigned
Artful
Undecided
Unassigned
Bionic
Undecided
Unassigned

Bug Description

[Impact]

 * Due to a bug in a recent fix guests now are immediately considered shut
   down. Due to that the helper to shut down guests cleanly might not wait
   long enough.

 * backporting the upstream fix to the affected releases.

[Test Case]

 * Spawn a few KVM guests through libvirt, e.g.
   $ uvt-simplestreams-libvirt --verbose sync --source http://cloud-images.ubuntu.com/daily arch=amd64 release=bionic label=daily
   $ uvt-kvm create --memory 128 --password ubuntu b1 arch=amd64 release=bionic label=daily
   $ uvt-kvm create --memory 128 --password ubuntu b2 arch=amd64 release=bionic label=daily
   $ uvt-kvm create --memory 128 --password ubuntu b3 arch=amd64 release=bionic label=daily

 * A) Shut down the computer and watch how it handles guest shutdown.

 * B) You can better track that if instead of a shutdown you call what the
   shutdown would via:
   $ sudo /usr/lib/libvirt/libvirt-guests.sh stop

 * It will consider guests as shut down no matter what, so you might want
   to modify guests to not shut down or if you have a system to do so use
   huge guests with plenty of dirty memory which will slow it down.
   One good trick is to start the guests right before stopping.
   Guests are not yet responding to shutdown signals at that time, so the
   issue is revealed (not waiting on shutdown)
   To do so you could stop the guests and then do:
   $ virsh start b1; virsh start b2; virsh start b3;
   $ /usr/lib/libvirt/libvirt-guests.sh stop

   The bad case looks like:
   Running guests on default URI: b1, b2, b3
   Shutting down guests on default URI...
   Starting shutdown on guest: b1
   Starting shutdown on guest: b2
   Starting shutdown on guest: b3
   Waiting for 3 guests to shut down, 120 seconds left
   Shutdown of guest b1 complete.
   Shutdown of guest b2 complete.
   Shutdown of guest b3 complete.

   But you can check with virsh list they are still alive.
   The improved case at least properly waits like:
   ...
   Waiting for 3 guests to shut down, 120 seconds left
   Waiting for 3 guests to shut down, 115 seconds left
   Waiting for 3 guests to shut down, 110 seconds left
   ...

[Regression Potential]

 * The fix "only" affects this shutdown handling and no other part of
   libvirt which limits the ranges of potential regressions.

 * The current issue is that we don't wait enough for the geusts, we can't
   wait "less" so we won't regress to be less forgiving slow-shutdowns,
   but if a mistake would be in the code the regression risk would
   be to slow down shutdowns - tests did not see that, but this section is
   about regression POTENTIAL so I thought I add it.

[Other Info]

 * n/a

---

The recent fix to libvirt-guests.sh.in works for what it intended to fix
(variable scope) but failed to adapt the loop in check_guests_shutdown correctly.
Due to that it currently detects all guests as "Failed to determine state of guest".

It all works, but guests are just assumed dead and logs are spilled with false-positive warnings.

A suggested fix is at: https://git.launchpad.net/~paelzer/libvirt/commit/?id=2ccca91678cd1091163490fd0b466b58c8be2b79

Download full text (3.8 KiB)

Without the fix they were in my case always detected as immediately shut down which likely is wrong and can end in non-clean shutdown even it would have been clean if the guests had some more time.
This was due to the fact that it picked up the empty external var to loop over and then the returned list was obviously empty.

Tests with the fix:

Tested guests not going away (just kill them so early that they do not respond to shutdown yet)
=> ok (took configured max time)

Running guests on default URI: b1, b2, b3
Shutting down guests on default URI...
Starting shutdown on guest: b1
Starting shutdown on guest: b2
Starting shutdown on guest: b3
Waiting for 3 guests to shut down, 120 seconds left
Waiting for 3 guests to shut down, 115 seconds left
Waiting for 3 guests to shut down, 110 seconds left
Waiting for 3 guests to shut down, 105 seconds left
...
Waiting for 3 guests to shut down, 5 seconds left
Timeout expired while shutting down domains

Tested guests going away one by one (just kill them so early that they do not respond to shutdown yet) - and then virsh destroy them over time
=> ok (detected one by one as they should be)

Running guests on default URI: b1, b2, b3
Shutting down guests on default URI...
Starting shutdown on guest: b1
Starting shutdown on guest: b2
Starting shutdown on guest: b3
Waiting for 3 guests to shut down, 120 seconds left
Waiting for 3 guests to shut down, 115 seconds left
Shutdown of guest b2 complete.
Waiting for 2 guests to shut down, 110 seconds left
Waiting for 2 guests to shut down, 105 seconds left
Shutdown of guest b3 complete.
Waiting for 1 guests to shut down, 100 seconds left
Waiting for 1 guests to shut down, 95 seconds left
Waiting for 1 guests to shut down, 90 seconds left
Shutdown of guest b1 complete.

Tested all guests going away immediately (normal guests)
=> ok, detected all in one "loop"

Running guests on default URI: b1, b2, b3
Shutting down guests on default URI...
Starting shutdown on guest: b1
Starting shutdown on guest: b2
Starting shutdown on guest: b3
Waiting for 3 guests to shut down, 120 seconds left
Shutdown of guest b1 complete.
Shutdown of guest b2 complete.
Shutdown of guest b3 complete.

Tested parallel=2 with guests going away one by one (but not all eventually).
=> ok, one down kicked next shutdown
=> ok partial looping did still work to catch timeout

Running guests on default URI: b1, b2, b3
Shutting down guests on default URI...
Starting shutdown on guest: b1
Starting shutdown on guest: b2
Waiting for 3 guests to shut down, 60 seconds left
Shutdown of guest b1 complete.
Starting shutdown on guest: b3
Waiting for 2 guests to shut down, 55 seconds left
...
Waiting for 2 guests to shut down, 5 seconds left
Timeout expired while shutting down domains

Tested parallel=2 with guests going away normally
=> ok, one down kicked next shutdown
=> ok, loop continued and completed

Running guests on default URI: b1, b2, b3
Shutting down guests on default URI...
Starting shutdown on guest: b1
Starting shutdown on guest: b2
Waiting for 3 guests to shut down, 60 seconds left
Shutdown of guest b1 complete.
Shutdown of guest b2 complete.
Starting shutdown on guest: b3
Shutdown of ...

Read more...

Changed in libvirt (Ubuntu):
status: New → In Progress
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in libvirt (Ubuntu Artful):
status: New → Confirmed
Changed in libvirt (Ubuntu Xenial):
status: New → Confirmed
Dariusz Gadomski (dgadomski) wrote :

I can confirm. With the patch applied all guests are being shutdown correctly and the "Failed to determine state of guest" message is gone.

Tested for PARALLEL=2, 5, 10 for 40 guests.

Thanks for the testing I added another fix in case guests are fully undefined (or libvirt gone) in the meantime. Otherwise it might report an empty variable guest name.

Submitted as v2: https://www.redhat.com/archives/libvir-list/2018-April/msg01825.html

Dariusz Gadomski (dgadomski) wrote :

I have repeated the tests for the second part of patch as well - everything went as expected.

Although I haven't been able to hit the undefined guest scenario before. I'll make a couple of attempts more.

Changes just got accepted as:
 08bb5eeb tools: fix check_guests_shutdown loop
 cce57265 tools: do not report unknown guests in print_guests_shutdown

Lets prep a fix for Bionic asap to hopefully get it in the release.
But libvirt is seeded and ISO respins are - umm - unwanted at this moment.
OTOH there are always chances there is a ISO respin anyway so that this could slip in and I'd not want the release to contain this bug if possible.

Therefore I'll upload it but if the release team rejects it we convert it to a very early SRU.
Upload is available as 4.0.0-1ubuntu8 now for review by the release Team.

Janåke Rönnblom (jan-ake) wrote :

If you have a cluster (corosync and pacemaker) running a VM as a resource then during a shutdown the VM would be removed from virsh list since its running as a transient domain.

https://wiki.libvirt.org/page/VM_lifecycle#Transient_guest_domains_vs_Persistent_guest_domains

This would cause the VM name to be unknown when we later try to lookup its name since the VM is removed from virsh when shutdown.

Another way to fix it would be to store the VM name before shutdown. Then it could be printed in print_guests_shutdown.

-J

Thank Janake,
you are right that transients would vanish in that case.
I were able to construct the same issue by concurrently destroy/undefining them simulating some other script running on shutdown. That is rather similar to the corosync case that does things on shutdown.
But with the current set of fixes at least there isn't a try to report on "" names.
Instead it silently accepts that the guest is at least gone, which was the mission of the shutdown script in the first place.

I don't like touching this script too much if not needed to fix issues.
I think storing the name per UUID initially to refer later might be nice but counts as such a case where I'd avoid to touch it. If you think this would be very beneficial open another bug for that aspect and we can discuss about it there.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libvirt - 4.0.0-1ubuntu8

---------------
libvirt (4.0.0-1ubuntu8) bionic; urgency=medium

  * Fix clean shut down of guests on system shutdown (LP: #1764668)
    - d/p/ubuntu/lp-1764668-do-not-report-unknown-guests.patch
    - d/p/ubuntu/lp-1764668-fix-check_guests_shutdown-loop.patch

 -- Christian Ehrhardt <email address hidden> Tue, 24 Apr 2018 11:09:48 +0200

Changed in libvirt (Ubuntu Bionic):
status: In Progress → Fix Released
description: updated

Note: I realized the script might want to re-issue the shutdown command but that is for a different time/bug.

After a final test against [1] with the proposed upload being good I pushed this for SRU.
Also I added a proper SRU Template to the bug.

For SRU Review:
libvirt_3.6.0-1ubuntu6.6_source.changes
libvirt_1.3.1-1ubuntu10.22_source.changes

And the same content, tags and branches for xenial/artful to the packaging repositories.

That said waiting for the SRU Team to accept this as next step.

[1]: https://launchpad.net/~ci-train-ppa-service/+archive/ubuntu/3248

description: updated
Robie Basak (racb) wrote :

12:35 <rbasak> cpaelzer: on bug 1764668
12:35 <ubottu> bug 1764668 in libvirt (Ubuntu Artful) "guest cleanup script fails to iterate" [Undecided,Confirmed] https://launchpad.net/bugs/1764668
12:35 <rbasak> cpaelzer: is it possible that some users have guests that fail to shut down in production? In that case, what will happen to them after this SRU?
12:36 <rbasak> eg. will the host now not shut down, or they'll have to wait for a timeout and then it will?
12:37 <cpaelzer> rbasak: hiho
12:39 <cpaelzer> rbasak: this is a two stage thing
12:40 <cpaelzer> rbasak: let me explain ...
12:41 <cpaelzer> before 1.3.1-1ubuntu10.21 there were cases where it could hang on shutdown
12:41 <cpaelzer> this was fixed and good in that regard, but the fix introduced another issue which was missed
12:41 <cpaelzer> so the new case is that all guests get a shutdown signal now, but it immediately detects them as shut down and goes on with the systems shutdown
12:42 <cpaelzer> the latter is fixed by the current SRU
12:42 <cpaelzer> so currently people might have guests that shut down but too fast go on and shut down the full system
12:42 <cpaelzer> with the fix it will wait up to timeout (as defined in the config)
12:43 <cpaelzer> it will not exceed that timeout and hang (as before the last SRU)
12:43 <cpaelzer> but it will also not go on too fast
12:43 <cpaelzer> rbasak: did the above eplxain what you wanted to know?
12:43 <rbasak> Yes, thanks. The case I had in mind can't happen because of the timeout.

tags: added: regression-update

Hello ChristianEhrhardt, or anyone else affected,

Accepted libvirt into xenial-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/libvirt/1.3.1-1ubuntu10.22 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed.Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-xenial to verification-done-xenial. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-xenial. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in libvirt (Ubuntu Xenial):
status: Confirmed → Fix Committed
tags: added: verification-needed verification-needed-xenial
Changed in libvirt (Ubuntu Artful):
status: Confirmed → Fix Committed
tags: added: verification-needed-artful
Robie Basak (racb) wrote :

Hello ChristianEhrhardt, or anyone else affected,

Accepted libvirt into artful-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/libvirt/3.6.0-1ubuntu6.6 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed.Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-artful to verification-done-artful. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-artful. In either case, details of your testing will help us make a better decision.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Download full text (6.1 KiB)

As described in testcases I set up a few guests and made sure some of them won't shutdown properly. We want to see
1. all tried to shut down
2. waiting up until timeout for the one that does not shut down

## Pre-Fix ##

Xenial:
sudo /usr/lib/libvirt/libvirt-guests.sh stop

Running guests on default URI: b4, b1, b2, b3
Shutting down guests on default URI...
Starting shutdown on guest: b4
Starting shutdown on guest: b1
Waiting for 4 guests to shut down, 120 seconds left
Failed to determine state of guest: 83964d03-0d3b-47c3-a17d-fd104e2511f8. Not tracking it anymore.
Failed to determine state of guest: 9f11ae0c-b73b-4d4b-abdb-b9acda341e84. Not tracking it anymore.
Shutdown of guest b4 complete.
Shutdown of guest b1 complete.
Starting shutdown on guest: b2
Starting shutdown on guest: b3
Shutdown of guest b2 complete.
Shutdown of guest b3 complete.
root@x:~# virsh list
 Id Name State
----------------------------------------------------
 5 b4 running
 6 b1 running
 7 b2 running
 8 b3 running

Artful:
# sudo /usr/lib/libvirt/libvirt-guests.sh stop

Running guests on default URI: b1, b2, b3, b4
Shutting down guests on default URI...
Starting shutdown on guest: b1
Starting shutdown on guest: b2
Waiting for 4 guests to shut down, 120 seconds left
Failed to determine state of guest: 80ba5ac4-fb68-4ee7-b8d6-313313f29ab4. Not tracking it anymore.
Failed to determine state of guest: 688847ca-5510-462d-ae5e-988d322d1471. Not tracking it anymore.
Shutdown of guest b1 complete.
Shutdown of guest b2 complete.
Starting shutdown on guest: b3
Starting shutdown on guest: b4
Shutdown of guest b3 complete.
Shutdown of guest b4 complete.
root@a:~# virsh list
 Id Name State
----------------------------------------------------
 1 b1 running
 4 b4 running

#### With the fixes from proposed ####

libvirt-bin/artful-proposed 3.6.0-1ubuntu6.6 amd64 [upgradable from: 3.6.0-1ubuntu6.5]
libvirt-clients/artful-proposed 3.6.0-1ubuntu6.6 amd64 [upgradable from: 3.6.0-1ubuntu6.5]
libvirt-daemon/artful-proposed 3.6.0-1ubuntu6.6 amd64 [upgradable from: 3.6.0-1ubuntu6.5]
libvirt-daemon-system/artful-proposed 3.6.0-1ubuntu6.6 amd64 [upgradable from: 3.6.0-1ubuntu6.5]
libvirt0/artful-proposed 3.6.0-1ubuntu6.6 amd64 [upgradable from: 3.6.0-1ubuntu6.5]

libvirt-bin/xenial-proposed 1.3.1-1ubuntu10.22 amd64 [upgradable from: 1.3.1-1ubuntu10.21]
libvirt0/xenial-proposed 1.3.1-1ubuntu10.22 amd64 [upgradable from: 1.3.1-1ubuntu10.21]

Xenial:

root@x:~# sudo /usr/lib/libvirt/libvirt-guests.sh stop

Running guests on default URI: b4, b2, b1, b3
Shutting down guests on default URI...
Starting shutdown on guest: b4
Starting shutdown on guest: b2
Waiting for 4 guests to shut down, 120 seconds left
Waiting for 4 guests to shut down, 115 seconds left
Waiting for 4 guests to shut down, 110 seconds left
Waiting for 4 guests to shut down, 105 seconds left
Waiting for 4 guests to shut down, 100 seconds left
Waiting for 4 guests to shu...

Read more...

tags: added: verification-done verification-done-artful verification-done-xenial
removed: verification-needed verification-needed-artful verification-needed-xenial

@SRU-Team: Dep8 tests show unrelated issues in cockpit for Artful.
I'm in discussion with pitti if we should even mask those tests, considering what fails and what we changed they are really unrelated (and this type of error showed before the changes of this SRU).
See the artful column in [1] for its history.
So I'd ask you to ignore the three hits on cockpit for artful for this SRU.

If pitti and I conclude that this should be a Britney override we will do so independent to this SRU.

[1]: http://autopkgtest.ubuntu.com/packages/cockpit

FYI - we opened [1], and while this needs some processing to be acked an merged I'd ask the SRU team to consider these autopkgtest fails not a blocker to move on with thie SRU here atm.

[1]: https://code.launchpad.net/~paelzer/britney/hints-ubuntu-artful-cockpitfails/+merge/344874

In fact the MP above is approved by Pitti who sort of owns cockpit.
Since neither of us can push this change I'd appreciate if on SRU handling this could be done (as it is ~ubuntu-sru scope).

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libvirt - 1.3.1-1ubuntu10.22

---------------
libvirt (1.3.1-1ubuntu10.22) xenial; urgency=medium

  * Fix clean shut down of guests on system shutdown (LP: #1764668)
    - d/p/ubuntu/lp-1764668-do-not-report-unknown-guests.patch
    - d/p/ubuntu/lp-1764668-fix-check_guests_shutdown-loop.patch

 -- Christian Ehrhardt <email address hidden> Wed, 25 Apr 2018 09:26:12 +0200

Changed in libvirt (Ubuntu Xenial):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for libvirt has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package libvirt - 3.6.0-1ubuntu6.6

---------------
libvirt (3.6.0-1ubuntu6.6) artful; urgency=medium

  * Fix clean shut down of guests on system shutdown (LP: #1764668)
    - d/p/ubuntu/lp-1764668-do-not-report-unknown-guests.patch
    - d/p/ubuntu/lp-1764668-fix-check_guests_shutdown-loop.patch

 -- Christian Ehrhardt <email address hidden> Wed, 25 Apr 2018 09:24:08 +0200

Changed in libvirt (Ubuntu Artful):
status: Fix Committed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers