Disabling the nova service fails
During the host failover, Masakari tries to disable the compute service in nova, but when nova-compute is already down the disable request never gets an answer: nova-api blocks on an RPC call to the dead compute host (see the nova-api traceback below) until the load balancer gives up and returns HTTP 504.
Version:
masakari-22.2.0
OpenStack Victoria
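The failure can be reproduced with python-novaclient alone once nova-compute on the target host is stopped. A minimal sketch, assuming an existing keystoneauth1 Session in 'sess' (the host name is taken from the logs below):

from novaclient import client

# Microversion 2.53+ addresses services by UUID, matching the
# novaclient code path visible in the traceback below.
nova = client.Client('2.53', session=sess)
service = nova.services.list(host='eb-rk03-compute02',
                             binary='nova-compute')[0]

# nova-api handles the PUT by RPC-calling the (dead) compute service to
# sync the COMPUTE_STATUS_DISABLED trait; that RPC only times out after
# rpc_response_timeout (60 s by default), by which time the load
# balancer has already answered "Unknown Error (HTTP 504)".
nova.services.disable(service.id)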
Masakari Engine:
Aug 24 10:34:01 infra3-masakari-api-container-0e3ba16d masakari-engine[60]: 2021-08-24 07:34:01.938 60 WARNING masakari.engine.drivers.taskflow.driver [req-cc70cb2b-0871-4487-9b99-3c402ba192cc masakari - - - -] Task 'DisableComputeServiceTask' (b49bdb90-ae03-400a-a415-09655ea2483d) transitioned into state 'FAILURE' from state 'RUNNING'
2 predecessors (most recent first):
Flow 'pre_tasks'
|__Flow 'instance_evacuate_engine': novaclient.exceptions.ClientException: Unknown Error (HTTP 504)
2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver Traceback (most recent call last):
2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver File "/openstack/venvs/masakari-22.2.0/lib/python3.8/site-packages/taskflow/engines/action_engine/executor.py", line 53, in _execute_task
2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver result = task.execute(**arguments)
2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver File "/openstack/venvs/masakari-22.2.0/lib/python3.8/site-packages/masakari/engine/drivers/taskflow/host_failure.py", line 51, in execute
2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver self.novaclient.enable_disable_service(self.context, host_name)
2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver File "/openstack/venvs/masakari-22.2.0/lib/python3.8/site-packages/masakari/compute/nova.py", line 58, in wrapper
2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver res = method(self, ctx, *args, **kwargs)
2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver File "/openstack/venvs/masakari-22.2.0/lib/python3.8/site-packages/masakari/compute/nova.py", line 163, in enable_disable_service
2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver nova.services.disable(service.id)
2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver File "/openstack/venvs/masakari-22.2.0/lib/python3.8/site-packages/novaclient/api_versions.py", line 393, in substitution
2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver return methods[-1].func(obj, *args, **kwargs)
2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver File "/openstack/venvs/masakari-22.2.0/lib/python3.8/site-packages/novaclient/v2/services.py", line 101, in disable
2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver return self._update("/os-services/%s" % service_uuid,
2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver File "/openstack/venvs/masakari-22.2.0/lib/python3.8/site-packages/novaclient/base.py", line 380, in _update
2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver resp, body = self.api.client.put(url, body=body)
2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver File "/openstack/venvs/masakari-22.2.0/lib/python3.8/site-packages/keystoneauth1/adapter.py", line 404, in put
2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver return self.request(url, 'PUT', **kwargs)
2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver File "/openstack/venvs/masakari-22.2.0/lib/python3.8/site-packages/novaclient/client.py", line 78, in request
2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver raise exceptions.from_response(resp, body, url, method)
2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver novaclient.exceptions.ClientException: Unknown Error (HTTP 504)
2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver
Aug 24 10:34:01 infra3-masakari-api-container-0e3ba16d masakari-engine[60]: 2021-08-24 07:34:01.980 60 WARNING masakari.engine.drivers.taskflow.driver [req-cc70cb2b-0871-4487-9b99-3c402ba192cc masakari - - - -] Task 'DisableComputeServiceTask' (b49bdb90-ae03-400a-a415-09655ea2483d) transitioned into state 'REVERTED' from state 'REVERTING'
Aug 24 10:34:01 infra3-masakari-api-container-0e3ba16d masakari-engine[60]: 2021-08-24 07:34:01.989 60 WARNING masakari.engine.drivers.taskflow.driver [req-cc70cb2b-0871-4487-9b99-3c402ba192cc masakari - - - -] Flow 'instance_evacuate_engine' (8f81e90b-dc0f-4138-af31-385f8fb34164) transitioned into state 'REVERTED' from state 'RUNNING'
Aug 24 10:34:01 infra3-masakari-api-container-0e3ba16d masakari-engine[60]: 2021-08-24 07:34:01.990 60 ERROR masakari.engine.manager [req-cc70cb2b-0871-4487-9b99-3c402ba192cc masakari - - - -] Failed to process notification 'ce142084-d849-495a-b3e6-4f327e625135'. Reason: Failed to execute host failure flow for notification 'ce142084-d849-495a-b3e6-4f327e625135'.: masakari.exception.MasakariException: Failed to execute host failure flow for notification 'ce142084-d849-495a-b3e6-4f327e625135'.
Nova API:
Aug 24 10:34:11 infra1-nova-api-container-b8a90cb3 nova-api-wsgi[151]: 2021-08-24 07:34:11.987 151 ERROR nova.compute.api [req-e95057a3-8039-432f-8a32-a658b4c74711 - - - - -] An error occurred while updating the COMPUTE_STATUS_DISABLED trait on compute node resource providers managed by host eb-rk03-compute02. The trait will be synchronized automatically by the compute service when the update_available_resource periodic task runs.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 6a00ae79f44f456dad3b339fe1d96235
2021-08-24 07:34:11.987 151 ERROR nova.compute.api Traceback (most recent call last):
2021-08-24 07:34:11.987 151 ERROR nova.compute.api File "/openstack/venvs/nova-22.2.0/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 433, in get
2021-08-24 07:34:11.987 151 ERROR nova.compute.api return self._queues[msg_id].get(block=True, timeout=timeout)
2021-08-24 07:34:11.987 151 ERROR nova.compute.api File "/openstack/venvs/nova-22.2.0/lib/python3.8/site-packages/eventlet/queue.py", line 322, in get
2021-08-24 07:34:11.987 151 ERROR nova.compute.api return waiter.wait()
2021-08-24 07:34:11.987 151 ERROR nova.compute.api File "/openstack/venvs/nova-22.2.0/lib/python3.8/site-packages/eventlet/queue.py", line 141, in wait
2021-08-24 07:34:11.987 151 ERROR nova.compute.api return get_hub().switch()
2021-08-24 07:34:11.987 151 ERROR nova.compute.api File "/openstack/venvs/nova-22.2.0/lib/python3.8/site-packages/eventlet/hubs/hub.py", line 298, in switch
2021-08-24 07:34:11.987 151 ERROR nova.compute.api return self.greenlet.switch()
2021-08-24 07:34:11.987 151 ERROR nova.compute.api _queue.Empty
2021-08-24 07:34:11.987 151 ERROR nova.compute.api
2021-08-24 07:34:11.987 151 ERROR nova.compute.api During handling of the above exception, another exception occurred:
2021-08-24 07:34:11.987 151 ERROR nova.compute.api
2021-08-24 07:34:11.987 151 ERROR nova.compute.api Traceback (most recent call last):
2021-08-24 07:34:11.987 151 ERROR nova.compute.api File "/openstack/venvs/nova-22.2.0/lib/python3.8/site-packages/nova/compute/api.py", line 5817, in _update_compute_provider_status
2021-08-24 07:34:11.987 151 ERROR nova.compute.api self.rpcapi.set_host_enabled(context, service.host, enabled)
2021-08-24 07:34:11.987 151 ERROR nova.compute.api File "/openstack/venvs/nova-22.2.0/lib/python3.8/site-packages/nova/compute/rpcapi.py", line 1263, in set_host_enabled
2021-08-24 07:34:11.987 151 ERROR nova.compute.api return cctxt.call(ctxt, 'set_host_enabled', enabled=enabled)
2021-08-24 07:34:11.987 151 ERROR nova.compute.api File "/openstack/venvs/nova-22.2.0/lib/python3.8/site-packages/oslo_messaging/rpc/client.py", line 175, in call
2021-08-24 07:34:11.987 151 ERROR nova.compute.api self.transport._send(self.target, msg_ctxt, msg,
2021-08-24 07:34:11.987 151 ERROR nova.compute.api File "/openstack/venvs/nova-22.2.0/lib/python3.8/site-packages/oslo_messaging/transport.py", line 123, in _send
2021-08-24 07:34:11.987 151 ERROR nova.compute.api return self._driver.send(target, ctxt, message,
2021-08-24 07:34:11.987 151 ERROR nova.compute.api File "/openstack/venvs/nova-22.2.0/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 680, in send
2021-08-24 07:34:11.987 151 ERROR nova.compute.api return self._send(target, ctxt, message, wait_for_reply, timeout,
2021-08-24 07:34:11.987 151 ERROR nova.compute.api File "/openstack/venvs/nova-22.2.0/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 669, in _send
2021-08-24 07:34:11.987 151 ERROR nova.compute.api result = self._waiter.wait(msg_id, timeout,
2021-08-24 07:34:11.987 151 ERROR nova.compute.api File "/openstack/venvs/nova-22.2.0/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 559, in wait
2021-08-24 07:34:11.987 151 ERROR nova.compute.api message = self.waiters.get(msg_id, timeout=timeout)
2021-08-24 07:34:11.987 151 ERROR nova.compute.api File "/openstack/venvs/nova-22.2.0/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 435, in get
2021-08-24 07:34:11.987 151 ERROR nova.compute.api raise oslo_messaging.MessagingTimeout(
2021-08-24 07:34:11.987 151 ERROR nova.compute.api oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 6a00ae79f44f456dad3b339fe1d96235
2021-08-24 07:34:11.987 151 ERROR nova.compute.api
Masakari has to be updated for the way nova now operates: the service needs to be flagged as forced down first, before it is disabled. With forced_down set, the disable operation no longer waits for nova-compute to respond.
This was described in a similar issue hitting the same nova API behaviour:
https://bugs.launchpad.net/nova/+bug/1920977

"You shouldn't disable the host by calling the host API, but rather either waiting for the periodic verification (indeed, around 60 secs) or calling the force-down API.
https://docs.openstack.org/api-ref/compute/?expanded=update-forced-down-detail#update-forced-down
... feature that resolved this same issue by first forcing down the compute host before disabling it:
Invert the force-down/service disable order
https://github.com/ClusterLabs/fence-agents/pull/303"
The code change, which I'll propose in a PR, adds two lines to masakari/compute/nova.py immediately before the nova.services.disable() call:

159a160,161
> LOG.info('Marking hypervisor as force-down %s', host_name)
> nova.services.force_down(service.id, True)
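For context, a minimal sketch of where these lines land in enable_disable_service() (paraphrased from the traceback above; the exact surrounding code in the masakari tree may differ):

@translate_nova_exception
def enable_disable_service(self, context, host_name, enable=False):
    """Disable (or re-enable) nova-compute on the given host."""
    nova = novaclient(context)
    service = nova.services.list(host=host_name,
                                 binary='nova-compute')[0]
    if not enable:
        # Proposed addition: mark the service as forced down before
        # disabling it. forced_down is a plain DB flag, so the call
        # succeeds even when nova-compute is unreachable, and nova's
        # servicegroup API then reports the service as down, which
        # makes nova-api skip the compute RPC during the disable.
        LOG.info('Marking hypervisor as force-down %s', host_name)
        nova.services.force_down(service.id, True)
        nova.services.disable(service.id)
    else:
        nova.services.enable(service.id)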
As per the novaclient API reference:
https://docs.openstack.org/python-novaclient/latest/reference/api/novaclient.v2.services.html

force_down(service_uuid, force_down)
    Update the service's forced_down field specified by the service UUID ID.
    Parameters:
        service_uuid – The UUID ID of the service.
        force_down – Whether or not this service was forced down manually by an administrator. This value is useful to know that some 3rd party has verified the service should be marked down.
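Note that forced_down only exists from compute API microversion 2.11 on, and the UUID-based signature quoted above is the >= 2.53 variant that the traceback shows masakari using. With the flag set first, the client-side sequence that failed above returns immediately (same sketch assumptions as before: a keystoneauth1 Session in 'sess'):

from novaclient import client

nova = client.Client('2.53', session=sess)
service = nova.services.list(host='eb-rk03-compute02',
                             binary='nova-compute')[0]

nova.services.force_down(service.id, True)  # DB-only update, returns at once
# With the service now reported down by the servicegroup API, nova-api
# skips the compute RPC and the disable responds without waiting.
nova.services.disable(service.id)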
The thing is, without fencing, we can't be sure that nova-compute is really down. Forcing it down may prove disastrous in the rare case where only Pacemaker dies while nova-compute is merely slow to respond: the instances would then be evacuated off a host that is still running them. We are addressing this with new improvements in Masakari: monitoring via Consul, which is both more robust and more flexible, and moving the fencing capability into the Masakari engine, so that it can be certain the host was really fenced before the service is forced down.