Comment 4 for bug 1907314

Revision history for this message
Billy Olsen (billy-olsen) wrote :

In the future, the logs should be publicly accessible as well (or a subset of the logs).

In this case, I can see the following log entries for Masakari and Nova (interwoven with timestamps):

>>> Masakari
3/lxd/4/var/log/masakari/masakari-engine.log:2020-12-08 19:50:57.142 235 INFO masakari.compute.nova [req-009aadf7-5b16-4432-b987-2c4db03e5793 masakari - - - -] Call get server command for instance 56ce6206-3659-495e-a6ea-f4d3d578cc96
3/lxd/4/var/log/masakari/masakari-engine.log:2020-12-08 19:51:09.618 235 INFO masakari.compute.nova [req-3b8f5237-b8f4-4ae1-9cbc-665f4c75e819 masakari - - - -] Call get server command for instance 56ce6206-3659-495e-a6ea-f4d3d578cc96
3/lxd/4/var/log/masakari/masakari-engine.log:2020-12-08 19:51:11.734 235 INFO masakari.compute.nova [req-de081ecf-4201-4542-b092-e21b2f5ce942 masakari - - - -] Call lock server command for instance 56ce6206-3659-495e-a6ea-f4d3d578cc96

>>> Nova
3/lxd/6/var/log/apache2/nova-api-os-compute_access.log:172.27.100.141 - - [08/Dec/2020:19:51:11 +0000] "GET /v2.1/servers/56ce6206-3659-495e-a6ea-f4d3d578cc96 HTTP/1.1" 200 2321 "-" "python-novaclient"

>>> Masakari
3/lxd/4/var/log/masakari/masakari-engine.log:2020-12-08 19:51:12.309 235 INFO masakari.compute.nova [req-de081ecf-4201-4542-b092-e21b2f5ce942 masakari - - - -] Call evacuate command for instance 56ce6206-3659-495e-a6ea-f4d3d578cc96 on host None

>>> Nova
3/lxd/6/var/log/apache2/nova-api-os-compute_access.log:172.27.100.141 - - [08/Dec/2020:19:51:12 +0000] "POST /v2.1/servers/56ce6206-3659-495e-a6ea-f4d3d578cc96/action HTTP/1.1" 400 540 "-" "python-novaclient"

>>> Masakari
3/lxd/4/var/log/masakari/masakari-engine.log:2020-12-08 19:48:43.947 235 INFO masakari.compute.nova [req-2275b29c-7a9f-476c-a7dc-25635aab398b masakari - - - -] Disable nova-compute on node02ob100.maas
2020-12-08 19:48:44.040 235 INFO masakari.engine.drivers.taskflow.host_failure [req-2275b29c-7a9f-476c-a7dc-25635aab398b masakari - - - -] Sleeping 60 sec before starting recovery thread until nova recognizes the node down.
...
3/lxd/4/var/log/masakari/masakari-engine.log:2020-12-08 19:51:14.214 235 INFO masakari.compute.nova [req-7bb8e483-e54e-457f-a110-1f9a95a1b4c8 masakari - - - -] Call unlock server command for instance 56ce6206-3659-495e-a6ea-f4d3d578cc96

>>> Nova
3/lxd/6/var/log/apache2/nova-api-os-compute_access.log:172.27.100.141 - - [08/Dec/2020:19:51:14 +0000] "POST /v2.1/servers/56ce6206-3659-495e-a6ea-f4d3d578cc96/action HTTP/1.1" 202 462 "-" "python-novaclient"

>>> Masakari
3/lxd/4/var/log/masakari/masakari-engine.log:2020-12-08 19:51:15.122 235 ERROR masakari.engine.drivers.taskflow.driver masakari.exception.HostRecoveryFailureException: Failed to evacuate instances '56ce6206-3659-495e-a6ea-f4d3d578cc96' from host 'node02ob100.maas'
3/lxd/4/var/log/masakari/masakari-engine.log:2020-12-08 19:51:15.537 235 ERROR masakari.engine.manager [req-7bb8e483-e54e-457f-a110-1f9a95a1b4c8 masakari - - - -] Failed to process notification '9ace2da7-97d1-48cb-a96a-08ad4624481b'. Reason: Failed to evacuate instances '56ce6206-3659-495e-a6ea-f4d3d578cc96' from host 'node02ob100.maas': masakari.exception.HostRecoveryFailureException: Failed to evacuate instances '56ce6206-3659-495e-a6ea-f4d3d578cc96' from host 'node02ob100.maas'

You can see that the evacuation call for the instance to nova receives a HTTP 400. This api call will fail with a HTTP 400 on 3 conditions:

1. the instance is being evacuated to the same host
2. the instance has ports with resource requests (qos)
3. the nova-compute host the instance is running on is not marked as down

The first one is definitely not the case, and I don't believe that qos is enabled on the networking bits. Therefore I suspect that this is encountering case 3.

What I suspect is happening is that the wait_period_after_service_update option is specified as 60 seconds and the nova service takes 60 seconds to mark the compute service as down. This is then a race for the evacuation.

To resolve this race, either reduce the service_down_time interval for nova OR increase the wait_period_after_service_update (called evacuation-delay in the charm config).