Evacuation process message queue timeout exception

Bug #2044494 reported by Haidong Pang
This bug affects 7 people
Affects: OpenStack Compute (nova)
Status: In Progress
Importance: Low
Assigned to: Haidong Pang

Bug Description

Description
===========
During the process of evacuating virtual machine instances,
the message queue experienced a brief service interruption.

As a result, nova-conductor did not receive a response as
expected within the specified timeout period during
the rpc call to nova-scheduler.

However, nova-conductor failed to catch and handle the resulting
exception: it was only logged, and the state of the virtual machine
instance was never set to error.

Users therefore could not see the failure and continued to believe
that the evacuation was still in progress.

Steps to reproduce
==================
The cause of the message queue interruption and timeout has not yet
been traced.

However, whenever the timeout hits the scheduling RPC during an
evacuation, the bug reproduces reliably.

Expected result
===============
During the evacuation process, catch MessagingTimeout and
update the status of both the virtual machine instance and
the migration record to error.
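The expected handling can be sketched with a minimal, self-contained simulation. `Instance`, `Migration`, `select_destinations`, and `evacuate` below are hypothetical stand-ins for illustration, not nova's real objects or APIs:

```python
class MessagingTimeout(Exception):
    """Stand-in for oslo_messaging.exceptions.MessagingTimeout."""

class Instance:
    def __init__(self):
        self.vm_state = "rebuilding"   # transient state during evacuation

class Migration:
    def __init__(self):
        self.status = "accepted"

def select_destinations():
    # Simulates the conductor's synchronous RPC to the scheduler
    # while the message queue is down.
    raise MessagingTimeout("Timed out waiting for a reply")

def evacuate(instance, migration):
    try:
        return select_destinations()
    except MessagingTimeout:
        # Expected behavior: surface the failure to the user instead
        # of leaving the instance stuck in a transient state.
        instance.vm_state = "error"
        migration.status = "error"
        return None

inst, mig = Instance(), Migration()
evacuate(inst, mig)
print(inst.vm_state, mig.status)  # error error
```

With this handling in place, the user sees the instance and migration in the error state and can retry the evacuation, instead of watching an operation that will never complete.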

Actual result
=============
The exception is only written to the logs; users remain unaware of
the actual failure.

Environment
===========
1. Exact version of OpenStack you are running:
   OpenStack Nova Wallaby.

2. Which hypervisor did you use?
   Libvirt + KVM

3. Which storage type did you use?
   Ceph

4. Which networking type did you use?
   Neutron

Logs & Configs
==============

```
2023-10-18T02:46:27.701367401+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server [req-c1780783-e8f6-455b-aacc-0933ea92a90a d21791d7ad564f1bb51a96b388e560c4 e0d2cb309d60449a857aafed38ac8816 - default default] Exception during message handling: oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 00d2886fee7c40078806d881e3e410a4
2023-10-18T02:46:27.701618707+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2023-10-18T02:46:27.701628365+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 433, in get
2023-10-18T02:46:27.701634988+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server return self._queues[msg_id].get(block=True, timeout=timeout)
2023-10-18T02:46:27.701639927+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.6/site-packages/eventlet/queue.py", line 322, in get
2023-10-18T02:46:27.701671277+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server return waiter.wait()
2023-10-18T02:46:27.701676617+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.6/site-packages/eventlet/queue.py", line 141, in wait
2023-10-18T02:46:27.701681586+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server return get_hub().switch()
2023-10-18T02:46:27.701686496+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.6/site-packages/eventlet/hubs/hub.py", line 313, in switch
2023-10-18T02:46:27.701691515+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server return self.greenlet.switch()
2023-10-18T02:46:27.701696805+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server queue.Empty
2023-10-18T02:46:27.701701223+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server
2023-10-18T02:46:27.701705071+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server During handling of the above exception, another exception occurred:
2023-10-18T02:46:27.701709088+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server
2023-10-18T02:46:27.701712896+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2023-10-18T02:46:27.701727924+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.6/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
2023-10-18T02:46:27.701732533+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
2023-10-18T02:46:27.701735989+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
2023-10-18T02:46:27.701739746+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
2023-10-18T02:46:27.701743243+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.6/site-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
2023-10-18T02:46:27.701757300+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
2023-10-18T02:46:27.701761067+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.6/site-packages/nova/conductor/manager.py", line 99, in wrapper
2023-10-18T02:46:27.701764604+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server return fn(self, context, *args, **kwargs)
2023-10-18T02:46:27.701768060+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.6/site-packages/nova/conductor/manager.py", line 1196, in rebuild_instance
2023-10-18T02:46:27.701771597+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server return_alternates=False)
2023-10-18T02:46:27.701775154+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.6/site-packages/nova/conductor/manager.py", line 884, in _schedule_instances
2023-10-18T02:46:27.701778881+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server return_alternates=return_alternates)
2023-10-18T02:46:27.701782427+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.6/site-packages/nova/scheduler/client/query.py", line 42, in select_destinations
2023-10-18T02:46:27.701786164+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server instance_uuids, return_objects, return_alternates)
2023-10-18T02:46:27.701789621+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.6/site-packages/nova/scheduler/rpcapi.py", line 160, in select_destinations
2023-10-18T02:46:27.701793248+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server return cctxt.call(ctxt, 'select_destinations', **msg_args)
2023-10-18T02:46:27.701814618+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.6/site-packages/oslo_messaging/rpc/client.py", line 179, in call
2023-10-18T02:46:27.701819337+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server transport_options=self.transport_options)
2023-10-18T02:46:27.701823685+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.6/site-packages/oslo_messaging/transport.py", line 128, in _send
2023-10-18T02:46:27.701828495+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server transport_options=transport_options)
2023-10-18T02:46:27.701831831+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 682, in send
2023-10-18T02:46:27.701835628+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server transport_options=transport_options)
2023-10-18T02:46:27.701838984+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 670, in _send
2023-10-18T02:46:27.701842631+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server call_monitor_timeout)
2023-10-18T02:46:27.701845958+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 559, in wait
2023-10-18T02:46:27.701849374+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server message = self.waiters.get(msg_id, timeout=timeout)
2023-10-18T02:46:27.701852851+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server File "/usr/local/lib/python3.6/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 437, in get
2023-10-18T02:46:27.701856217+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server 'to message ID %s' % msg_id)
2023-10-18T02:46:27.701865885+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID 00d2886fee7c40078806d881e3e410a4
2023-10-18T02:46:27.701869562+08:00 2023-10-18 02:46:27.698 18 ERROR oslo_messaging.rpc.server 
```

Changed in nova:
assignee: nobody → Haidong Pang (haidong-pang)
status: New → In Progress
Revision history for this message
Haidong Pang (haidong-pang) wrote :

nova-api casts to nova-conductor, an asynchronous RPC, so no exception can propagate back to the API.
When nova-conductor then makes the synchronous RPC call to nova-scheduler to select a target node for the evacuation, the call raises oslo_messaging.exceptions.MessagingTimeout.
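That asymmetry between the two RPC styles can be sketched with hypothetical stand-ins (not oslo.messaging's real API): a cast returns immediately and a downstream failure never reaches the caller, while a call blocks for a reply and raises on timeout.

```python
class MessagingTimeout(Exception):
    """Stand-in for oslo_messaging.exceptions.MessagingTimeout."""

def cast(handler):
    # Fire-and-forget: the message is queued and the caller returns
    # immediately; there is no reply channel, so a downstream failure
    # can never propagate back to the caller.
    return None

def call(handler, reply_arrives=True):
    # Request/reply: the caller blocks for an answer and raises
    # MessagingTimeout when no reply arrives within the deadline.
    if not reply_arrives:
        raise MessagingTimeout("Timed out waiting for a reply")
    return handler()

# nova-api -> nova-conductor: cast, nothing for the API to catch.
assert cast(lambda: 1 / 0) is None

# nova-conductor -> nova-scheduler: call, the timeout surfaces here.
caught = False
try:
    call(lambda: "compute-1", reply_arrives=False)
except MessagingTimeout:
    caught = True
print(caught)  # True
```

This is why the exception can only be handled in nova-conductor: it is the first component in the chain that actually sees it.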

```python
@profiler.trace_cls("rpc")
class ComputeTaskManager(base.Base):
    @targets_cell
    def rebuild_instance(self, context, instance, orig_image_ref, image_ref,
                         injected_files, new_pass, orig_sys_metadata,
                         bdms, recreate, on_shared_storage,
                         preserve_ephemeral=False, host=None,
                         request_spec=None):
        # ...

                try:
                    # if this is a rebuild of instance on the same host with
                    # new image.
                    if not evacuate and orig_image_ref != image_ref:
                        self._validate_image_traits_for_rebuild(context,
                                                                instance,
                                                                image_ref)
                    self._restrict_request_spec_to_cell(
                        context, instance, request_spec)
                    request_spec.ensure_project_and_user_id(instance)
                    request_spec.ensure_network_information(instance)
                    compute_utils.heal_reqspec_is_bfv(
                        context, request_spec, instance)

                    host_lists = self._schedule_instances(context,
                            request_spec, [instance.uuid],
                            return_alternates=False)
                    host_list = host_lists[0]
                    selection = host_list[0]
                    host, node, limits = (selection.service_host,
                            selection.nodename, selection.limits)

                    if recreate:
                        scheduler_utils.fill_provider_mapping(
                            request_spec, selection)

                except (exception.NoValidHost,
                        exception.UnsupportedPolicyException,
                        exception.AllocationUpdateFailed,
                        # the next two can come from fill_provider_mapping and
                        # signals a software error.
                        NotImplementedError,
                        ValueError) as ex:
                    # ...

            # ...

            self.compute_rpcapi.rebuild_instance(
                context,
                instance=instance,
                new_pass=new_pass,
                injected_files=injected_files,
                image_ref=image_ref,
                orig_image_ref=orig_image_ref,
                orig_sys_metadata=orig_sys_metadata,
                bdms=bdms,
                recreate=evacuate,
                on_shared_storage=on_shared_storage,
                preserve_ephemeral=preserve_ephemeral,
                migration=migration,
                host=host,
                node=node,
                limits=limits,
                request_spec=request_spec,
                accel_uuids=accel_uuids)
```

nova-conductor then calls `_schedule_instances`, which makes the synchronous `select_destinations` RPC to nova-scheduler seen in the traceback above.
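One assumed fix direction, mirroring the except tuple in the snippet above, is to include `MessagingTimeout` among the caught exceptions so the existing error path also runs for queue timeouts. The sketch below uses stand-in exception classes and a hypothetical `rebuild_instance`, not nova's real code:

```python
class NoValidHost(Exception): pass             # stand-ins for nova's
class AllocationUpdateFailed(Exception): pass  # exception classes
class MessagingTimeout(Exception): pass        # stand-in for oslo_messaging's

# Hypothetical widened tuple: MessagingTimeout added alongside the
# exceptions rebuild_instance already handles.
HANDLED = (NoValidHost, AllocationUpdateFailed,
           NotImplementedError, ValueError, MessagingTimeout)

def rebuild_instance(schedule):
    state = {"vm_state": "rebuilding", "migration": "accepted"}
    try:
        schedule()
    except HANDLED:
        # Reuse the same error path nova already takes for NoValidHost.
        state["vm_state"] = "error"
        state["migration"] = "error"
    return state

def timed_out_schedule():
    raise MessagingTimeout("Timed out waiting for a reply")

state = rebuild_instance(timed_out_schedule)
print(state)  # {'vm_state': 'error', 'migration': 'error'}
```

Without the added entry, `MessagingTimeout` would escape the except clause entirely and both records would stay in their transient states, which is exactly the behavior this bug describes.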

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/901817

Changed in nova:
importance: Undecided → Low
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/915309

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by "Haidong Pang <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/nova/+/901817
