Masakari fails to update hypervisor service status on compute down notification

Bug #1940980 reported by Mihail
Affects: masakari
Status: Triaged
Importance: Medium
Assigned to: Unassigned

Bug Description

Disabling the nova-compute service fails:
during host failover, Masakari tries to disable the service via the Nova API, but because nova-compute on the failed host is already down and cannot respond, the disable request eventually fails with HTTP 504.

Version:
masakari-22.2.0
OpenStack Victoria

Masakari Engine:

Aug 24 10:34:01 infra3-masakari-api-container-0e3ba16d masakari-engine[60]: 2021-08-24 07:34:01.938 60 WARNING masakari.engine.drivers.taskflow.driver [req-cc70cb2b-0871-4487-9b99-3c402ba192cc masakari - - - -] Task 'DisableComputeServiceTask' (b49bdb90-ae03-400a-a415-09655ea2483d) transitioned into state 'FAILURE' from state 'RUNNING'
                                                                            2 predecessors (most recent first):
                                                                              Flow 'pre_tasks'
                                                                              |__Flow 'instance_evacuate_engine': novaclient.exceptions.ClientException: Unknown Error (HTTP 504)
                                                                            2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver Traceback (most recent call last):
                                                                            2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver File "/openstack/venvs/masakari-22.2.0/lib/python3.8/site-packages/taskflow/engines/action_engine/executor.py", line 53, in _execute_task
                                                                            2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver result = task.execute(**arguments)
                                                                            2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver File "/openstack/venvs/masakari-22.2.0/lib/python3.8/site-packages/masakari/engine/drivers/taskflow/host_failure.py", line 51, in execute
                                                                            2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver self.novaclient.enable_disable_service(self.context, host_name)
                                                                            2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver File "/openstack/venvs/masakari-22.2.0/lib/python3.8/site-packages/masakari/compute/nova.py", line 58, in wrapper
                                                                            2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver res = method(self, ctx, *args, **kwargs)
                                                                            2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver File "/openstack/venvs/masakari-22.2.0/lib/python3.8/site-packages/masakari/compute/nova.py", line 163, in enable_disable_service
                                                                            2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver nova.services.disable(service.id)
                                                                            2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver File "/openstack/venvs/masakari-22.2.0/lib/python3.8/site-packages/novaclient/api_versions.py", line 393, in substitution
                                                                            2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver return methods[-1].func(obj, *args, **kwargs)
                                                                            2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver File "/openstack/venvs/masakari-22.2.0/lib/python3.8/site-packages/novaclient/v2/services.py", line 101, in disable
                                                                            2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver return self._update("/os-services/%s" % service_uuid,
                                                                            2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver File "/openstack/venvs/masakari-22.2.0/lib/python3.8/site-packages/novaclient/base.py", line 380, in _update
                                                                            2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver resp, body = self.api.client.put(url, body=body)
                                                                            2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver File "/openstack/venvs/masakari-22.2.0/lib/python3.8/site-packages/keystoneauth1/adapter.py", line 404, in put
                                                                            2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver return self.request(url, 'PUT', **kwargs)
                                                                            2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver File "/openstack/venvs/masakari-22.2.0/lib/python3.8/site-packages/novaclient/client.py", line 78, in request
                                                                            2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver raise exceptions.from_response(resp, body, url, method)
                                                                            2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver novaclient.exceptions.ClientException: Unknown Error (HTTP 504)
                                                                            2021-08-24 07:34:01.938 60 ERROR masakari.engine.drivers.taskflow.driver
Aug 24 10:34:01 infra3-masakari-api-container-0e3ba16d masakari-engine[60]: 2021-08-24 07:34:01.980 60 WARNING masakari.engine.drivers.taskflow.driver [req-cc70cb2b-0871-4487-9b99-3c402ba192cc masakari - - - -] Task 'DisableComputeServiceTask' (b49bdb90-ae03-400a-a415-09655ea2483d) transitioned into state 'REVERTED' from state 'REVERTING'
Aug 24 10:34:01 infra3-masakari-api-container-0e3ba16d masakari-engine[60]: 2021-08-24 07:34:01.989 60 WARNING masakari.engine.drivers.taskflow.driver [req-cc70cb2b-0871-4487-9b99-3c402ba192cc masakari - - - -] Flow 'instance_evacuate_engine' (8f81e90b-dc0f-4138-af31-385f8fb34164) transitioned into state 'REVERTED' from state 'RUNNING'
Aug 24 10:34:01 infra3-masakari-api-container-0e3ba16d masakari-engine[60]: 2021-08-24 07:34:01.990 60 ERROR masakari.engine.manager [req-cc70cb2b-0871-4487-9b99-3c402ba192cc masakari - - - -] Failed to process notification 'ce142084-d849-495a-b3e6-4f327e625135'. Reason: Failed to execute host failure flow for notification 'ce142084-d849-495a-b3e6-4f327e625135'.: masakari.exception.MasakariException: Failed to execute host failure flow for notification 'ce142084-d849-495a-b3e6-4f327e625135'.

Nova API:

Aug 24 10:34:11 infra1-nova-api-container-b8a90cb3 nova-api-wsgi[151]: 2021-08-24 07:34:11.987 151 ERROR nova.compute.api [req-e95057a3-8039-432f-8a32-a658b4c74711 - - - - -] An error occurred while updating the COMPUTE_STATUS_DISABLED trait on compute node resource providers managed by host eb-rk03-compute02. The trait will be synchronized automatically by the compute service when the update_available_resource periodic task runs.: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 6a00ae79f44f456dad3b339fe1d96235
                                                                       2021-08-24 07:34:11.987 151 ERROR nova.compute.api Traceback (most recent call last):
                                                                       2021-08-24 07:34:11.987 151 ERROR nova.compute.api File "/openstack/venvs/nova-22.2.0/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 433, in get
                                                                       2021-08-24 07:34:11.987 151 ERROR nova.compute.api return self._queues[msg_id].get(block=True, timeout=timeout)
                                                                       2021-08-24 07:34:11.987 151 ERROR nova.compute.api File "/openstack/venvs/nova-22.2.0/lib/python3.8/site-packages/eventlet/queue.py", line 322, in get
                                                                       2021-08-24 07:34:11.987 151 ERROR nova.compute.api return waiter.wait()
                                                                       2021-08-24 07:34:11.987 151 ERROR nova.compute.api File "/openstack/venvs/nova-22.2.0/lib/python3.8/site-packages/eventlet/queue.py", line 141, in wait
                                                                       2021-08-24 07:34:11.987 151 ERROR nova.compute.api return get_hub().switch()
                                                                       2021-08-24 07:34:11.987 151 ERROR nova.compute.api File "/openstack/venvs/nova-22.2.0/lib/python3.8/site-packages/eventlet/hubs/hub.py", line 298, in switch
                                                                       2021-08-24 07:34:11.987 151 ERROR nova.compute.api return self.greenlet.switch()
                                                                       2021-08-24 07:34:11.987 151 ERROR nova.compute.api _queue.Empty
                                                                       2021-08-24 07:34:11.987 151 ERROR nova.compute.api
                                                                       2021-08-24 07:34:11.987 151 ERROR nova.compute.api During handling of the above exception, another exception occurred:
                                                                       2021-08-24 07:34:11.987 151 ERROR nova.compute.api
                                                                       2021-08-24 07:34:11.987 151 ERROR nova.compute.api Traceback (most recent call last):
                                                                       2021-08-24 07:34:11.987 151 ERROR nova.compute.api File "/openstack/venvs/nova-22.2.0/lib/python3.8/site-packages/nova/compute/api.py", line 5817, in _update_compute_provider_status
                                                                       2021-08-24 07:34:11.987 151 ERROR nova.compute.api self.rpcapi.set_host_enabled(context, service.host, enabled)
                                                                       2021-08-24 07:34:11.987 151 ERROR nova.compute.api File "/openstack/venvs/nova-22.2.0/lib/python3.8/site-packages/nova/compute/rpcapi.py", line 1263, in set_host_enabled
                                                                       2021-08-24 07:34:11.987 151 ERROR nova.compute.api return cctxt.call(ctxt, 'set_host_enabled', enabled=enabled)
                                                                       2021-08-24 07:34:11.987 151 ERROR nova.compute.api File "/openstack/venvs/nova-22.2.0/lib/python3.8/site-packages/oslo_messaging/rpc/client.py", line 175, in call
                                                                       2021-08-24 07:34:11.987 151 ERROR nova.compute.api self.transport._send(self.target, msg_ctxt, msg,
                                                                       2021-08-24 07:34:11.987 151 ERROR nova.compute.api File "/openstack/venvs/nova-22.2.0/lib/python3.8/site-packages/oslo_messaging/transport.py", line 123, in _send
                                                                       2021-08-24 07:34:11.987 151 ERROR nova.compute.api return self._driver.send(target, ctxt, message,
                                                                       2021-08-24 07:34:11.987 151 ERROR nova.compute.api File "/openstack/venvs/nova-22.2.0/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 680, in send
                                                                       2021-08-24 07:34:11.987 151 ERROR nova.compute.api return self._send(target, ctxt, message, wait_for_reply, timeout,
                                                                       2021-08-24 07:34:11.987 151 ERROR nova.compute.api File "/openstack/venvs/nova-22.2.0/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 669, in _send
                                                                       2021-08-24 07:34:11.987 151 ERROR nova.compute.api result = self._waiter.wait(msg_id, timeout,
                                                                       2021-08-24 07:34:11.987 151 ERROR nova.compute.api File "/openstack/venvs/nova-22.2.0/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 559, in wait
                                                                       2021-08-24 07:34:11.987 151 ERROR nova.compute.api message = self.waiters.get(msg_id, timeout=timeout)
                                                                       2021-08-24 07:34:11.987 151 ERROR nova.compute.api File "/openstack/venvs/nova-22.2.0/lib/python3.8/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 435, in get
                                                                       2021-08-24 07:34:11.987 151 ERROR nova.compute.api raise oslo_messaging.MessagingTimeout(
                                                                       2021-08-24 07:34:11.987 151 ERROR nova.compute.api oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 6a00ae79f44f456dad3b339fe1d96235
                                                                       2021-08-24 07:34:11.987 151 ERROR nova.compute.api

Masakari has to be updated for the way Nova now operates: the service needs to be flagged as forced-down first, before it is disabled. That way the disable operation will not wait for nova-compute to respond.

From the following similar issue related to the change in the Nova API:

https://bugs.launchpad.net/nova/+bug/1920977
"
You shouldn't disable the host by calling the host API, but rather either waiting for the periodic verification (indeed, around 60 secs) or calling the force-down API.
https://docs.openstack.org/api-ref/compute/?expanded=update-forced-down-detail#update-forced-down
... feature that resolved this same issue by first forcing down the compute host before disabling it:
Invert the force-down/service disable order
https://github.com/ClusterLabs/fence-agents/pull/303 "

The code change required, which I'll propose in a PR:
160,161d159
< LOG.info('Marking hypervisor as force-down %s', host_name)
< nova.services.force_down(service.id, True)

As per the nova client API documentation:
https://docs.openstack.org/python-novaclient/latest/reference/api/novaclient.v2.services.html

force_down(service_uuid, force_down)
    Update the service's forced_down field specified by the service UUID ID.
    Parameters:
        service_uuid – The UUID ID of the service.
        force_down – Whether or not this service was forced down manually by an administrator. This value is useful to know that some 3rd party has verified the service should be marked down.
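
For illustration, a minimal sketch of the force-down-then-disable order (not the actual patch): the helper function and its name are made up, but the services.list(), services.force_down() and services.disable() calls follow the python-novaclient API quoted above (microversion 2.53+, where services are addressed by UUID).

from oslo_log import log as logging

LOG = logging.getLogger(__name__)


def disable_failed_compute(nova, host_name):
    """Sketch only: force the nova-compute service down before disabling it.

    'nova' is an authenticated novaclient Client, as masakari already
    builds in masakari/compute/nova.py.
    """
    # Look up the nova-compute service record for the failed host.
    service = nova.services.list(host=host_name, binary='nova-compute')[0]

    # Proposed step: mark the service as forced-down first, so that the
    # subsequent disable does not depend on the dead nova-compute
    # acknowledging anything over RPC.
    LOG.info('Marking nova-compute as forced-down on %s', host_name)
    nova.services.force_down(service.id, True)

    LOG.info('Disabling nova-compute on %s', host_name)
    nova.services.disable(service.id)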

Radosław Piliszek (yoctozepto) wrote :

The thing is, without fencing, we can't be sure that nova is really down. Forcing it down may prove disastrous in situations where it's only Pacemaker that died and Nova is simply taking too long to respond. We are approaching this with new improvements in Masakari: monitoring via Consul, which is both more robust and more flexible, and moving the fencing capability to the Masakari Engine, so that it can be certain that the host was really fenced.

Radosław Piliszek (yoctozepto) changed in masakari:
status: New → Triaged
importance: Undecided → Medium
Mihail (mihail.itgix) wrote :

Good point regarding the fencing. How about introducing a mechanism that checks whether Nova has already detected the host as down: probe every 10 seconds whether the service is in the 'down' state and loop for a configurable amount of time before trying to disable it, instead of attempting the disable once, failing with an exception, and never retrying.
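
Roughly something along these lines (a sketch only; the function name and the 10-second / 2-minute values are illustrative and would need to become configurable):

import time


def wait_until_nova_marks_host_down(nova, host_name,
                                    interval=10, timeout=120):
    """Poll nova until it reports the nova-compute service as 'down'."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        service = nova.services.list(host=host_name,
                                     binary='nova-compute')[0]
        if service.state == 'down':
            return service
        time.sleep(interval)
    return None


# Only attempt the disable once nova itself has noticed the missing
# heartbeats, instead of failing once with a 504 and giving up:
service = wait_until_nova_marks_host_down(nova, host_name)
if service is not None:
    nova.services.disable(service.id)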

Mihail (mihail.itgix) wrote :

I also think a configurable wait time would help. Currently the only available wait setting applies after the attempt to disable the service: wait_period_after_service_update is the time between the service being disabled and the evacuation being attempted. Introducing something like wait_period_before_marking_service_as_disabled would help, because it would give the Nova API time to notice that there are no heartbeats from nova-compute and to mark the service down on its own. AFAIK this heartbeat period is around 60 seconds, so right now the outcome depends on when the failure happens and how fast Masakari responds. I think combining this with the idea above would help avoid the race condition.
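
A sketch of how such an option might be declared (the option name is only the one proposed above and does not exist in masakari today; the 60-second default simply mirrors the heartbeat interval mentioned above, and the placement/group is illustrative):

from oslo_config import cfg

# Hypothetical option, sitting alongside masakari's existing
# wait_period_after_service_update setting.
service_wait_opts = [
    cfg.IntOpt('wait_period_before_marking_service_as_disabled',
               default=60,
               help='Number of seconds to wait before trying to disable '
                    'the nova-compute service, giving the nova API time '
                    'to notice the missing heartbeats and mark the '
                    'service as down on its own.'),
]

cfg.CONF.register_opts(service_wait_opts)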

Radosław Piliszek (yoctozepto) wrote :

Yeah, that one wait that we have there probably does not make much sense. At least the current core team has a hard time explaining it. ;-)

The latest idea I had on it is best described in our etherpad from the Masakari Xena Project Team Gathering (PTG) that we held in April:

https://etherpad.opendev.org/p/masakari-xena-ptg (read from: "(yoctozepto) We came up with the conclusion that:") - yoctozepto is me obviously ;-)

I think your idea is pretty close to mine. It just needs someone to sit down, do it and test it. Heh. :D

BTW, we are going to have another PTG in October: https://etherpad.opendev.org/p/masakari-yoga-ptg
You may want to consider joining. The event is 100% online, open and free to participate for all interested folks.

Radosław Piliszek (yoctozepto) wrote :

I have found the duplicate I was thinking about. We can continue discussing here, though; the two reports should be interlinked.
