Hosts are not evacuated reliably

Bug #1907314 reported by Syed Mohammad Adnan Karim
This bug affects 1 person
Affects                   Status   Importance  Assigned to  Milestone
OpenStack Masakari Charm  Invalid  Undecided   Unassigned
masakari                  Expired  Undecided   Unassigned

Bug Description

I have been testing masakari's host evacuation functionality and have observed that it does not always evacuate the VMs on a host when that host is powered off.

I have a 10-node bionic-stein OpenStack cloud running (1 neutron-gateway and 9 nova-compute nodes).
I created 10 instances and also created the following failover segment:

https://pastebin.canonical.com/p/4JV7p4tf2V/
or
https://paste.ubuntu.com/p/ZRg3R9c2z5/

I then SSHed into each host that had instances on it and ran sudo poweroff to shut it down.
I did this one host at a time until a host was powered off but its instance was not evacuated. Once the powered-off host came back up, virsh list --all showed that the instance was still there but not running. In the latest test, this happened on the third node I powered off.

I am using the following bundle:
https://pastebin.canonical.com/p/M2zMWMHjf8/
or
https://paste.ubuntu.com/p/WjVjf2xq2n/

My bundle uses the latest masakari charm from cs:~openstack-charmers-next/masakari so that I can set the following config options:
     check-expired-interval: 60
     notification-expiration: 60
I had a similar experience, though, with the latest charms from cs:~openstack-charmers/masakari.

Syed Mohammad Adnan Karim (karimsye) wrote :
Radosław Piliszek (yoctozepto) wrote :

The pastebin is private or something?

Syed Mohammad Adnan Karim (karimsye) wrote :

@Radosław Piliszek (yoctozepto) I added alternate public links.
Sorry about that.

description: updated
Billy Olsen (billy-olsen) wrote :

In the future, the logs should be publicly accessible as well (or a subset of the logs).

In this case, I can see the following log entries for Masakari and Nova (interwoven with timestamps):

>>> Masakari
3/lxd/4/var/log/masakari/masakari-engine.log:2020-12-08 19:50:57.142 235 INFO masakari.compute.nova [req-009aadf7-5b16-4432-b987-2c4db03e5793 masakari - - - -] Call get server command for instance 56ce6206-3659-495e-a6ea-f4d3d578cc96
3/lxd/4/var/log/masakari/masakari-engine.log:2020-12-08 19:51:09.618 235 INFO masakari.compute.nova [req-3b8f5237-b8f4-4ae1-9cbc-665f4c75e819 masakari - - - -] Call get server command for instance 56ce6206-3659-495e-a6ea-f4d3d578cc96
3/lxd/4/var/log/masakari/masakari-engine.log:2020-12-08 19:51:11.734 235 INFO masakari.compute.nova [req-de081ecf-4201-4542-b092-e21b2f5ce942 masakari - - - -] Call lock server command for instance 56ce6206-3659-495e-a6ea-f4d3d578cc96

>>> Nova
3/lxd/6/var/log/apache2/nova-api-os-compute_access.log:172.27.100.141 - - [08/Dec/2020:19:51:11 +0000] "GET /v2.1/servers/56ce6206-3659-495e-a6ea-f4d3d578cc96 HTTP/1.1" 200 2321 "-" "python-novaclient"

>>> Masakari
3/lxd/4/var/log/masakari/masakari-engine.log:2020-12-08 19:51:12.309 235 INFO masakari.compute.nova [req-de081ecf-4201-4542-b092-e21b2f5ce942 masakari - - - -] Call evacuate command for instance 56ce6206-3659-495e-a6ea-f4d3d578cc96 on host None

>>> Nova
3/lxd/6/var/log/apache2/nova-api-os-compute_access.log:172.27.100.141 - - [08/Dec/2020:19:51:12 +0000] "POST /v2.1/servers/56ce6206-3659-495e-a6ea-f4d3d578cc96/action HTTP/1.1" 400 540 "-" "python-novaclient"

>>> Masakari
3/lxd/4/var/log/masakari/masakari-engine.log:2020-12-08 19:48:43.947 235 INFO masakari.compute.nova [req-2275b29c-7a9f-476c-a7dc-25635aab398b masakari - - - -] Disable nova-compute on node02ob100.maas
2020-12-08 19:48:44.040 235 INFO masakari.engine.drivers.taskflow.host_failure [req-2275b29c-7a9f-476c-a7dc-25635aab398b masakari - - - -] Sleeping 60 sec before starting recovery thread until nova recognizes the node down.
...
3/lxd/4/var/log/masakari/masakari-engine.log:2020-12-08 19:51:14.214 235 INFO masakari.compute.nova [req-7bb8e483-e54e-457f-a110-1f9a95a1b4c8 masakari - - - -] Call unlock server command for instance 56ce6206-3659-495e-a6ea-f4d3d578cc96

>>> Nova
3/lxd/6/var/log/apache2/nova-api-os-compute_access.log:172.27.100.141 - - [08/Dec/2020:19:51:14 +0000] "POST /v2.1/servers/56ce6206-3659-495e-a6ea-f4d3d578cc96/action HTTP/1.1" 202 462 "-" "python-novaclient"

>>> Masakari
3/lxd/4/var/log/masakari/masakari-engine.log:2020-12-08 19:51:15.122 235 ERROR masakari.engine.drivers.taskflow.driver masakari.exception.HostRecoveryFailureException: Failed to evacuate instances '56ce6206-3659-495e-a6ea-f4d3d578cc96' from host 'node02ob100.maas'
3/lxd/4/var/log/masakari/masakari-engine.log:2020-12-08 19:51:15.537 235 ERROR masakari.engine.manager [req-7bb8e483-e54e-457f-a110-1f9a95a1b4c8 masakari - - - -] Failed to process notification '9ace2da7-97d1-48cb-a96a-08ad4624481b'. Reason: Failed to evacuate instances '56ce6206-3659-495e-a6ea-f4d3d578cc96' from host 'node02ob100.maas': masakari.exception.HostRecoveryFailureException: Failed to evacuate instances '56ce...
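The interleaving above can be reproduced with a small script. The following is only a sketch, not the method used in this comment: the function names (parse_line, merge_logs, failed_requests) are made up for illustration, and it assumes that both logs use the same timezone and that the month abbreviation in the Apache log matches the current locale. The access log has only one-second resolution, so the order of entries within the same second is not reliable.

```python
import re
from datetime import datetime

# Masakari engine log timestamp, e.g. "2020-12-08 19:51:12.309"
OSLO_TS = re.compile(r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.(\d{3})")
# Apache access log entry, e.g. '[08/Dec/2020:19:51:12 +0000] "POST /path HTTP/1.1" 400'
APACHE = re.compile(
    r'\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}) [+-]\d{4}\] "(\w+) (\S+) [^"]*" (\d{3})'
)


def parse_line(line):
    """Return (timestamp, http_status_or_None, line), or None if no timestamp is found."""
    m = APACHE.search(line)
    if m:
        # Assumption: the UTC offset is ignored, so both logs must share a timezone.
        ts = datetime.strptime(m.group(1), "%d/%b/%Y:%H:%M:%S")
        return ts, int(m.group(4)), line
    m = OSLO_TS.search(line)
    if m:
        ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        return ts.replace(microsecond=int(m.group(2)) * 1000), None, line
    return None


def merge_logs(lines):
    """Sort all parseable lines from both logs by timestamp (stable for ties)."""
    parsed = [p for p in map(parse_line, lines) if p is not None]
    return sorted(parsed, key=lambda p: p[0])


def failed_requests(lines):
    """Return only the access-log entries with an HTTP status of 400 or higher."""
    return [p for p in merge_logs(lines) if p[1] is not None and p[1] >= 400]
```

Fed the lines quoted above, failed_requests would single out the POST to /action that returned 400 at 19:51:12, in the same second as the evacuate call logged by Masakari.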


Changed in charm-masakari:
status: New → Invalid
Billy Olsen (billy-olsen) wrote :

@Syed - can you comment on whether the configuration changes helped you out?

Changed in masakari:
status: New → Invalid
Billy Olsen (billy-olsen) wrote :

I'm marking this as invalid for the time being as I believe it is configuration related. If this proves to be wrong, please set back to new and comment.

Syed Mohammad Adnan Karim (karimsye) wrote :

@Billy sorry for the delayed response.

Unfortunately it looks like setting the following values on the masakari charm did not help:
options:
  check-expired-interval: 60
  notification-expiration: 120
  evacuation-delay: 120

I redeployed the cloud with this bundle - https://paste.ubuntu.com/p/vS77gCBjKW/ on 7 machines.

You can get the crashdump here:
https://drive.google.com/file/d/1XZyFcdgozs7hSqZPAGwim0O2vRJWycNf/view?usp=sharing

To test the instance evacuation, I spun up 10 instances, and started to power off hosts.
The first host I powered off (nodeo7ob100) evacuated its instances successfully but the second host failed to do so (nodeo8ob100).

The instances that failed to evacuate were in a shutdown state:
https://paste.ubuntu.com/p/CRRvNtpzHb/

Changed in masakari:
status: Invalid → New
Changed in charm-masakari:
status: Invalid → New
Radosław Piliszek (yoctozepto) wrote :

I think you are hitting https://bugs.launchpad.net/masakari/+bug/1782517
The upstream releases are pending, and the Ubuntu team can either pick them up or just cherry-pick the patch (though I believe the upstream releases should happen early next week).

Changed in masakari:
status: New → Incomplete
Changed in charm-masakari:
status: New → Incomplete
Radosław Piliszek (yoctozepto) wrote :

Sorry, I have just noticed this was about Stein - there are no new upstream releases for Stein. Ubuntu has to cherry-pick the patch. :-(

Changed in charm-masakari:
status: Incomplete → Invalid
Launchpad Janitor (janitor) wrote :

[Expired for masakari because there has been no activity for 60 days.]

Changed in masakari:
status: Incomplete → Expired