nova-conductor puts instance in error during live-migration due to remote error MessagingTimeout

Bug #2044235 reported by Pierre Libeau
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: In Progress
Importance: Undecided
Assigned to: Pierre Libeau

Bug Description

Description
===========
Nova-conductor puts the instance into ERROR state when the exception raised during the live migration is not one of the known exception types in _build_live_migrate_task. [1]
The known timeout case comes from _call_livem_checks_on_host, which raises exception.MigrationPreCheckError when it catches messaging.MessagingTimeout. [2]
However, check_can_live_migrate_destination also runs a check against the source host via check_can_live_migrate_source [3], and that call can time out as well. This timeout is not caught properly, because the destination host contacted the source host, got no reply, and the resulting MessagingTimeout was re-raised on the destination side, so the conductor sees it as a remote error ("Remote error: MessagingTimeout") rather than a local MessagingTimeout. A minimal sketch of the two cases is shown after the reference links below.

[1] https://github.com/openstack/nova/blob/master/nova/conductor/manager.py#L523
[2] https://github.com/openstack/nova/blob/master/nova/conductor/tasks/live_migrate.py#L363
[3] https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L8546
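
For illustration only, here is a minimal standalone sketch of the two failure modes, using nothing but oslo.messaging; the helper name is made up and this is not the fix proposed in the review below. A locally raised messaging.MessagingTimeout is the case already converted into MigrationPreCheckError, while the case in this bug arrives as a RemoteError whose exc_type is the string 'MessagingTimeout'.

import oslo_messaging as messaging
from oslo_messaging.rpc.client import RemoteError


def is_messaging_timeout(exc):
    # Case already handled: our own RPC call to the destination timed out.
    if isinstance(exc, messaging.MessagingTimeout):
        return True
    # Case in this bug: the destination's call to the source timed out,
    # so the exception was serialized and re-raised on our side as a
    # RemoteError carrying the original class name in exc_type.
    return isinstance(exc, RemoteError) and exc.exc_type == 'MessagingTimeout'


if __name__ == '__main__':
    local = messaging.MessagingTimeout('Timed out waiting for a reply')
    remote = RemoteError(exc_type='MessagingTimeout',
                         value='Timed out waiting for a reply')
    print(is_messaging_timeout(local), is_messaging_timeout(remote))  # True True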

Steps to reproduce
==================
# Deploy devstack multinode

# Create an instance
openstack server create --image a3cf22ec-3e24-404c-83cd-47a95874e164 --flavor m1.small --network dd824883-17b8-4ecd-881d-6b3cbd758bb6 test-check_can_live_migrate_source-on-dest-node

# On the destination node, add a sleep to check_can_live_migrate_source (nova/compute/rpcapi.py) to leave enough time to stop nova-compute on the source node
% git diff nova/compute/rpcapi.py
diff --git a/nova/compute/rpcapi.py b/nova/compute/rpcapi.py
index b58004c6e6..00ca0bd109 100644
--- a/nova/compute/rpcapi.py
+++ b/nova/compute/rpcapi.py
@@ -608,6 +608,8 @@ class ComputeAPI(object):
         client = self.router.client(ctxt)
         source = _compute_host(None, instance)
         cctxt = client.prepare(server=source, version=version)
+        import time
+        time.sleep(600)
         return cctxt.call(ctxt, 'check_can_live_migrate_source',
                           instance=instance,
                           dest_check_data=dest_check_data)

# Trigger a live migration of the instance (e.g. openstack server migrate --live-migration <server>), then stop nova-compute on the source node and wait
# After a few minutes the instance goes to ERROR state (an optional polling helper to watch for this is sketched below)
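
If it helps to watch the transition from the controller, here is an optional polling sketch. It assumes openstacksdk is installed and that a clouds.yaml entry named "devstack" exists; both are assumptions, not part of the original reproduction steps.

import time

import openstack


def wait_for_error(server_name, cloud='devstack', interval=30):
    # Poll the server status until nova puts it into ERROR state.
    conn = openstack.connect(cloud=cloud)
    while True:
        server = conn.compute.find_server(server_name)
        if server is not None:
            print(server.status)
            if server.status == 'ERROR':
                return server
        time.sleep(interval)


if __name__ == '__main__':
    wait_for_error('test-check_can_live_migrate_source-on-dest-node')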

# The following error can be found in the nova super-conductor log:
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: ERROR nova.conductor.manager
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: WARNING nova.scheduler.utils [None req-8795982e-8a37-4d87-9695-806039a3d89b admin admin] [instance: 4969fe65-11ec-495f-a036-386f83d404b0] Setting instance to ERROR state.: oslo_messaging.rpc.client.RemoteError: Remote error: MessagingTimeout Timed out waiting for a reply to message ID c685b202642c469eac1dc06ac187a49c
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: ERROR oslo_messaging.rpc.server [None req-8795982e-8a37-4d87-9695-806039a3d89b admin admin] Exception during message handling: nova.exception.MigrationError: Migration error: Remote error: MessagingTimeout Timed out waiting for a reply to message ID c685b202642c469eac1dc06ac187a49c
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: ERROR oslo_messaging.rpc.server Traceback (most recent call last):
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/conductor/manager.py", line 505, in _live_migrate
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: ERROR oslo_messaging.rpc.server task.execute()
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/conductor/tasks/base.py", line 25, in wrap
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: ERROR oslo_messaging.rpc.server with excutils.save_and_reraise_exception():
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: ERROR oslo_messaging.rpc.server File "/opt/stack/data/venv/lib/python3.10/site-packages/oslo_utils/excutils.py", line 227, in __exit__
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: ERROR oslo_messaging.rpc.server self.force_reraise()
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: ERROR oslo_messaging.rpc.server File "/opt/stack/data/venv/lib/python3.10/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: ERROR oslo_messaging.rpc.server raise self.value
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/conductor/tasks/base.py", line 23, in wrap
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: ERROR oslo_messaging.rpc.server return original(self)
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/conductor/tasks/base.py", line 40, in execute
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: ERROR oslo_messaging.rpc.server return self._execute()
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/conductor/tasks/live_migrate.py", line 100, in _execute
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: ERROR oslo_messaging.rpc.server self.destination, dest_node, self.limits = self._find_destination()
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/conductor/tasks/live_migrate.py", line 550, in _find_destination
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: ERROR oslo_messaging.rpc.server self._call_livem_checks_on_host(host, provider_mapping)
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/conductor/tasks/live_migrate.py", line 360, in _call_livem_checks_on_host
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: ERROR oslo_messaging.rpc.server check_can_live_migrate_destination(self.context, self.instance,
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: ERROR oslo_messaging.rpc.server File "/opt/stack/nova/nova/compute/rpcapi.py", line 604, in check_can_live_migrate_destination
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: ERROR oslo_messaging.rpc.server return cctxt.call(ctxt, 'check_can_live_migrate_destination', **kwargs)
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: ERROR oslo_messaging.rpc.server File "/opt/stack/data/venv/lib/python3.10/site-packages/oslo_messaging/rpc/client.py", line 190, in call
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: ERROR oslo_messaging.rpc.server result = self.transport._send(
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: ERROR oslo_messaging.rpc.server File "/opt/stack/data/venv/lib/python3.10/site-packages/oslo_messaging/transport.py", line 123, in _send
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: ERROR oslo_messaging.rpc.server return self._driver.send(target, ctxt, message,
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: ERROR oslo_messaging.rpc.server File "/opt/stack/data/venv/lib/python3.10/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 689, in send
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: ERROR oslo_messaging.rpc.server return self._send(target, ctxt, message, wait_for_reply, timeout,
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: ERROR oslo_messaging.rpc.server File "/opt/stack/data/venv/lib/python3.10/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 681, in _send
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: ERROR oslo_messaging.rpc.server raise result
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: ERROR oslo_messaging.rpc.server oslo_messaging.rpc.client.RemoteError: Remote error: MessagingTimeout Timed out waiting for a reply to message ID c685b202642c469eac1dc06ac187a49c
Nov 21 16:40:58 devstack2-multi-node-1-cp nova-conductor[143072]: ERROR oslo_messaging.rpc.server Remote traceback:
Traceback (most recent call last):
  File "/opt/stack/data/venv/lib/python3.10/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 441, in get
    return self._queues[msg_id].get(block=True, timeout=timeout)
  File "/opt/stack/data/venv/lib/python3.10/site-packages/eventlet/queue.py", line 322, in get
    return waiter.wait()
  File "/opt/stack/data/venv/lib/python3.10/site-packages/eventlet/queue.py", line 141, in wait
    return get_hub().switch()
  File "/opt/stack/data/venv/lib/python3.10/site-packages/eventlet/hubs/hub.py", line 313, in switch
    return self.greenlet.switch()
_queue.Empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/stack/data/venv/lib/python3.10/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
    res = self.dispatcher.dispatch(message)
  File "/opt/stack/data/venv/lib/python3.10/site-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
    return self._do_dispatch(endpoint, method, ctxt, args)
  File "/opt/stack/data/venv/lib/python3.10/site-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
    result = func(ctxt, **new_args)
  File "/opt/stack/nova/nova/exception_wrapper.py", line 65, in wrapped
    with excutils.save_and_reraise_exception():
  File "/opt/stack/data/venv/lib/python3.10/site-packages/oslo_utils/excutils.py", line 227, in __exit__
    self.force_reraise()
  File "/opt/stack/data/venv/lib/python3.10/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
    raise self.value
  File "/opt/stack/nova/nova/exception_wrapper.py", line 63, in wrapped
    return f(self, context, *args, **kw)
  File "/opt/stack/nova/nova/compute/utils.py", line 1439, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/opt/stack/nova/nova/compute/manager.py", line 213, in decorated_function
    with excutils.save_and_reraise_exception():
  File "/opt/stack/data/venv/lib/python3.10/site-packages/oslo_utils/excutils.py", line 227, in __exit__
    self.force_reraise()
  File "/opt/stack/data/venv/lib/python3.10/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
    raise self.value
  File "/opt/stack/nova/nova/compute/manager.py", line 203, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/opt/stack/nova/nova/compute/manager.py", line 8546, in check_can_live_migrate_destination
    migrate_data = self.compute_rpcapi.check_can_live_migrate_source(
  File "/opt/stack/nova/nova/compute/rpcapi.py", line 613, in check_can_live_migrate_source
    return cctxt.call(ctxt, 'check_can_live_migrate_source',
  File "/opt/stack/data/venv/lib/python3.10/site-packages/oslo_messaging/rpc/client.py", line 190, in call
    result = self.transport._send(
  File "/opt/stack/data/venv/lib/python3.10/site-packages/oslo_messaging/transport.py", line 123, in _send
    return self._driver.send(target, ctxt, message,
  File "/opt/stack/data/venv/lib/python3.10/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 689, in send
    return self._send(target, ctxt, message, wait_for_reply, timeout,
  File "/opt/stack/data/venv/lib/python3.10/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 678, in _send
    result = self._waiter.wait(msg_id, timeout,
  File "/opt/stack/data/venv/lib/python3.10/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 567, in wait
    message = self.waiters.get(msg_id, timeout=timeout)
  File "/opt/stack/data/venv/lib/python3.10/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 443, in get
    raise oslo_messaging.MessagingTimeout(
oslo_messaging.exceptions.MessagingTimeout: Timed out waiting for a reply to message ID c685b202642c469eac1dc06ac187a49c

Expected result
===============
The failure happens during the live-migration pre-checks, so the instance should remain in ACTIVE state with only the migration record set to error.

Actual result
=============
The instance ends up in ERROR state and the migration is in error; a simplified sketch contrasting the two outcomes is shown below.
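
Standalone illustration of the two outcomes described above (placeholder names, not the actual nova code): a recognised pre-check error should only fail the migration, while an unrecognised exception, such as the RemoteError wrapping MessagingTimeout, additionally puts the instance itself into ERROR.

class MigrationPreCheckError(Exception):
    """Stand-in for nova.exception.MigrationPreCheckError."""


def handle_live_migration_failure(exc, instance, migration):
    migration['status'] = 'error'  # the migration record always ends in error
    if isinstance(exc, MigrationPreCheckError):
        # Expected result: pre-check failed, the instance keeps its state.
        return instance
    # Actual result for an unrecognised exception: the instance goes to ERROR.
    instance['vm_state'] = 'error'
    return instance


if __name__ == '__main__':
    inst = {'vm_state': 'active'}
    mig = {'status': 'running'}
    handle_live_migration_failure(RuntimeError('remote timeout'), inst, mig)
    print(inst, mig)  # {'vm_state': 'error'} {'status': 'error'}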

Environment
===========
I have reproduced the issue on a devstack multinode deployment with the default configuration.

Changed in nova:
assignee: nobody → Pierre Libeau (pierre-libeau)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/901655

Changed in nova:
status: New → In Progress