>>> Masakari
3/lxd/4/var/log/masakari/masakari-engine.log:2020-12-08 19:51:15.122 235 ERROR masakari.engine.drivers.taskflow.driver masakari.exception.HostRecoveryFailureException: Failed to evacuate instances '56ce6206-3659-495e-a6ea-f4d3d578cc96' from host 'node02ob100.maas'
3/lxd/4/var/log/masakari/masakari-engine.log:2020-12-08 19:51:15.537 235 ERROR masakari.engine.manager [req-7bb8e483-e54e-457f-a110-1f9a95a1b4c8 masakari - - - -] Failed to process notification '9ace2da7-97d1-48cb-a96a-08ad4624481b'. Reason: Failed to evacuate instances '56ce6206-3659-495e-a6ea-f4d3d578cc96' from host 'node02ob100.maas': masakari.exception.HostRecoveryFailureException: Failed to evacuate instances '56ce6206-3659-495e-a6ea-f4d3d578cc96' from host 'node02ob100.maas'
You can see that the evacuation call for the instance to nova receives a HTTP 400. This api call will fail with a HTTP 400 on 3 conditions:
1. the instance is being evacuated to the same host
2. the instance has ports with resource requests (qos)
3. the nova-compute host the instance is running on is not marked as down
The first one is definitely not the case, and I don't believe that qos is enabled on the networking bits. Therefore I suspect that this is encountering case 3.
What I suspect is happening is that the wait_period_after_service_update option is specified as 60 seconds and the nova service takes 60 seconds to mark the compute service as down. This is then a race for the evacuation.
To resolve this race, either reduce the service_down_time interval for nova OR increase the wait_period_after_service_update (called evacuation-delay in the charm config).
In the future, the logs should be publicly accessible as well (or a subset of the logs).
In this case, I can see the following log entries for Masakari and Nova (interwoven with timestamps):
>>> Masakari var/log/ masakari/ masakari- engine. log:2020- 12-08 19:50:57.142 235 INFO masakari. compute. nova [req-009aadf7- 5b16-4432- b987-2c4db03e57 93 masakari - - - -] Call get server command for instance 56ce6206- 3659-495e- a6ea-f4d3d578cc 96 var/log/ masakari/ masakari- engine. log:2020- 12-08 19:51:09.618 235 INFO masakari. compute. nova [req-3b8f5237- b8f4-4ae1- 9cbc-665f4c75e8 19 masakari - - - -] Call get server command for instance 56ce6206- 3659-495e- a6ea-f4d3d578cc 96 var/log/ masakari/ masakari- engine. log:2020- 12-08 19:51:11.734 235 INFO masakari. compute. nova [req-de081ecf- 4201-4542- b092-e21b2f5ce9 42 masakari - - - -] Call lock server command for instance 56ce6206- 3659-495e- a6ea-f4d3d578cc 96
3/lxd/4/
3/lxd/4/
3/lxd/4/
>>> Nova var/log/ apache2/ nova-api- os-compute_ access. log:172. 27.100. 141 - - [08/Dec/ 2020:19: 51:11 +0000] "GET /v2.1/servers/ 56ce6206- 3659-495e- a6ea-f4d3d578cc 96 HTTP/1.1" 200 2321 "-" "python-novaclient"
3/lxd/6/
>>> Masakari var/log/ masakari/ masakari- engine. log:2020- 12-08 19:51:12.309 235 INFO masakari. compute. nova [req-de081ecf- 4201-4542- b092-e21b2f5ce9 42 masakari - - - -] Call evacuate command for instance 56ce6206- 3659-495e- a6ea-f4d3d578cc 96 on host None
3/lxd/4/
>>> Nova var/log/ apache2/ nova-api- os-compute_ access. log:172. 27.100. 141 - - [08/Dec/ 2020:19: 51:12 +0000] "POST /v2.1/servers/ 56ce6206- 3659-495e- a6ea-f4d3d578cc 96/action HTTP/1.1" 400 540 "-" "python-novaclient"
3/lxd/6/
>>> Masakari var/log/ masakari/ masakari- engine. log:2020- 12-08 19:48:43.947 235 INFO masakari. compute. nova [req-2275b29c- 7a9f-476c- a7dc-25635aab39 8b masakari - - - -] Disable nova-compute on node02ob100.maas engine. drivers. taskflow. host_failure [req-2275b29c- 7a9f-476c- a7dc-25635aab39 8b masakari - - - -] Sleeping 60 sec before starting recovery thread until nova recognizes the node down. var/log/ masakari/ masakari- engine. log:2020- 12-08 19:51:14.214 235 INFO masakari. compute. nova [req-7bb8e483- e54e-457f- a110-1f9a95a1b4 c8 masakari - - - -] Call unlock server command for instance 56ce6206- 3659-495e- a6ea-f4d3d578cc 96
3/lxd/4/
2020-12-08 19:48:44.040 235 INFO masakari.
...
3/lxd/4/
>>> Nova var/log/ apache2/ nova-api- os-compute_ access. log:172. 27.100. 141 - - [08/Dec/ 2020:19: 51:14 +0000] "POST /v2.1/servers/ 56ce6206- 3659-495e- a6ea-f4d3d578cc 96/action HTTP/1.1" 202 462 "-" "python-novaclient"
3/lxd/6/
>>> Masakari var/log/ masakari/ masakari- engine. log:2020- 12-08 19:51:15.122 235 ERROR masakari. engine. drivers. taskflow. driver masakari. exception. HostRecoveryFai lureException: Failed to evacuate instances '56ce6206- 3659-495e- a6ea-f4d3d578cc 96' from host 'node02ob100.maas' var/log/ masakari/ masakari- engine. log:2020- 12-08 19:51:15.537 235 ERROR masakari. engine. manager [req-7bb8e483- e54e-457f- a110-1f9a95a1b4 c8 masakari - - - -] Failed to process notification '9ace2da7- 97d1-48cb- a96a-08ad462448 1b'. Reason: Failed to evacuate instances '56ce6206- 3659-495e- a6ea-f4d3d578cc 96' from host 'node02ob100.maas': masakari. exception. HostRecoveryFai lureException: Failed to evacuate instances '56ce6206- 3659-495e- a6ea-f4d3d578cc 96' from host 'node02ob100.maas'
3/lxd/4/
3/lxd/4/
You can see that the evacuation call for the instance to nova receives a HTTP 400. This api call will fail with a HTTP 400 on 3 conditions:
1. the instance is being evacuated to the same host
2. the instance has ports with resource requests (qos)
3. the nova-compute host the instance is running on is not marked as down
The first one is definitely not the case, and I don't believe that qos is enabled on the networking bits. Therefore I suspect that this is encountering case 3.
What I suspect is happening is that the wait_period_ after_service_ update option is specified as 60 seconds and the nova service takes 60 seconds to mark the compute service as down. This is then a race for the evacuation.
To resolve this race, either reduce the service_down_time interval for nova OR increase the wait_period_ after_service_ update (called evacuation-delay in the charm config).