Live migrate UnexpectedTaskStateError

Bug #1752986 reported by Kevin Stevens on 2018-03-02
This bug affects 3 people
Affects: OpenStack Compute (nova)
Importance: Undecided
Assigned to: Unassigned

Bug Description

Description
===========
Occasionally, when performing live migration, the instance goes to ERROR state with the message "UnexpectedTaskStateError: Conflict updating instance XXXX. Expected: {'task_state': [u'migrating']}. Actual: {'task_state': None}"
So far the migration itself is always successful and the instance is always found residing on the target host. Updating the "node" and "host" columns of the instance's row in the nova database to the destination host and then resetting the instance state to ACTIVE gets things back in order.

The issue appears to occur randomly across 16 uniform compute nodes and seems like a race condition.
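The error text comes from Nova's guarded instance save: the update only succeeds if task_state still holds the expected value when the write lands. A minimal toy sketch of that compare-and-swap mechanism (simplified class and names are illustrative, not Nova's actual code):

```python
class UnexpectedTaskStateError(Exception):
    pass

class Instance:
    """Toy model of Nova's guarded instance save (illustrative only)."""
    def __init__(self, task_state):
        self.task_state = task_state

    def save(self, expected_task_state):
        # Compare-and-swap: refuse the update if another actor has
        # already changed task_state (e.g. cleared it to None).
        if self.task_state != expected_task_state:
            raise UnexpectedTaskStateError(
                "Expected: %r. Actual: %r"
                % (expected_task_state, self.task_state))
        self.task_state = None  # migration finished; clear the state

inst = Instance(task_state='migrating')
inst.save(expected_task_state='migrating')      # succeeds, clears task_state
try:
    inst.save(expected_task_state='migrating')  # state is already None now
except UnexpectedTaskStateError as e:
    print(e)  # prints: Expected: 'migrating'. Actual: None
```

If two code paths both try to finish the same migration, the second save sees task_state already cleared and raises exactly the conflict reported above.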

Steps to reproduce
==================
1. Boot an instance to Compute01
2. Issue a live-migrate command for instance targeting Compute02 (this can be done via Horizon and via python-openstackclient)
    # openstack server migrate --shared-migration --live computehost02 e8928cb2-afae-4cca-93db-f218e9f22324
3. Live-migration works, the instance remains accessible and is moved to the new host. However, ~20% of the time, the instance goes to ERROR state and some cleanup must be done in the database.
    MariaDB [nova]> update instances set node='computehost02',host='computehost02' where uuid= 'e8928cb2-afae-4cca-93db-f218e9f22324';
    # openstack server set --state active e8928cb2-afae-4cca-93db-f218e9f22324

Expected result
===============
The migrated instance should move successfully and return to ACTIVE state.

Actual result
=============
Instances occasionally end up in ERROR state after a "successful" live migration.

Environment
===========
1. This is a Newton environment with Nova Libvirt/KVM backed by Ceph. Networking is provided by Neutron ML2 linux bridge agent.
    root@computehost02:~# nova-compute --version
    14.0.8
    root@computehost02:~# dpkg -l | grep libvir
    ii libvirt-bin 1.3.1-1ubuntu10.15 amd64 programs for the libvirt library
    ii libvirt0:amd64 1.3.1-1ubuntu10.15 amd64 library for interfacing with different virtualization systems
    ii python-libvirt 1.3.1-1ubuntu1.1 amd64 libvirt Python bindings
    root@computehost02:~# dpkg -l | grep qemu
    ii ipxe-qemu 1.0.0+git-20150424.a25a16d-1ubuntu1.2 all PXE boot firmware - ROM images for qemu
    ii qemu 1:2.5+dfsg-5ubuntu10.16 amd64 fast processor emulator
    ii qemu-block-extra:amd64 1:2.5+dfsg-5ubuntu10.16 amd64 extra block backend modules for qemu-system and qemu-utils
    ii qemu-slof 20151103+dfsg-1ubuntu1 all Slimline Open Firmware -- QEMU PowerPC version
    ii qemu-system 1:2.5+dfsg-5ubuntu10.16 amd64 QEMU full system emulation binaries
    ii qemu-system-arm 1:2.5+dfsg-5ubuntu10.16 amd64 QEMU full system emulation binaries (arm)
    ii qemu-system-common 1:2.5+dfsg-5ubuntu10.16 amd64 QEMU full system emulation binaries (common files)
    ii qemu-system-mips 1:2.5+dfsg-5ubuntu10.16 amd64 QEMU full system emulation binaries (mips)
    ii qemu-system-misc 1:2.5+dfsg-5ubuntu10.16 amd64 QEMU full system emulation binaries (miscelaneous)
    ii qemu-system-ppc 1:2.5+dfsg-5ubuntu10.16 amd64 QEMU full system emulation binaries (ppc)
    ii qemu-system-sparc 1:2.5+dfsg-5ubuntu10.16 amd64 QEMU full system emulation binaries (sparc)
    ii qemu-system-x86 1:2.5+dfsg-5ubuntu10.16 amd64 QEMU full system emulation binaries (x86)
    ii qemu-user 1:2.5+dfsg-5ubuntu10.16 amd64 QEMU user mode emulation binaries
    ii qemu-utils 1:2.5+dfsg-5ubuntu10.16 amd64 QEMU utilities
    root@computehost02:~#

    root@cephhost:~# dpkg -l | grep -i ceph
    ii ceph 10.2.10-1xenial amd64 distributed storage and file system
    ii ceph-base 10.2.10-1xenial amd64 common ceph daemon libraries and management tools
    ii ceph-common 10.2.10-1xenial amd64 common utilities to mount and interact with a ceph storage cluster
    ii ceph-mon 10.2.10-1xenial amd64 monitor server for the ceph storage system
    ii ceph-osd 10.2.10-1xenial amd64 OSD server for the ceph storage system
    ii libcephfs1 10.2.10-1xenial amd64 Ceph distributed file system client library
    ii python-cephfs 10.2.10-1xenial amd64 Python libraries for the Ceph libcephfs library
    ii python-rados 10.2.10-1xenial amd64 Python libraries for the Ceph librados library
    ii python-rbd 10.2.10-1xenial amd64 Python libraries for the Ceph librbd library
    root@cephhost:~#

Logs & Configs
==============
This is the specific error message that the target host generates in nova-compute.log:
oslo_messaging.rpc.server [req-1b4d172c-89b7-4e58-9a11-739e3a479499 1926acfd1f341056c1aee48802ca49a4b5e2bbc0145b3f3b1bee46d3a5aa84cc 397b1ee0a002419fbbfa13b43fe81482 - - -] Exception during message handling
oslo_messaging.rpc.server Traceback (most recent call last):
oslo_messaging.rpc.server File "/openstack/venvs/nova-r14.5.0/lib/python2.7/site-packages/oslo_messaging/rpc/server.py", line 133, in _process_incoming
oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
oslo_messaging.rpc.server File "/openstack/venvs/nova-r14.5.0/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 150, in dispatch
oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
oslo_messaging.rpc.server File "/openstack/venvs/nova-r14.5.0/lib/python2.7/site-packages/oslo_messaging/rpc/dispatcher.py", line 121, in _do_dispatch
oslo_messaging.rpc.server result = func(ctxt, **new_args)
oslo_messaging.rpc.server File "/openstack/venvs/nova-r14.5.0/lib/python2.7/site-packages/nova/exception_wrapper.py", line 75, in wrapped
oslo_messaging.rpc.server function_name, call_dict, binary)
oslo_messaging.rpc.server File "/openstack/venvs/nova-r14.5.0/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
oslo_messaging.rpc.server self.force_reraise()
oslo_messaging.rpc.server File "/openstack/venvs/nova-r14.5.0/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
oslo_messaging.rpc.server File "/openstack/venvs/nova-r14.5.0/lib/python2.7/site-packages/nova/exception_wrapper.py", line 66, in wrapped
oslo_messaging.rpc.server return f(self, context, *args, **kw)
oslo_messaging.rpc.server File "/openstack/venvs/nova-r14.5.0/lib/python2.7/site-packages/nova/compute/utils.py", line 613, in decorated_function
oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
oslo_messaging.rpc.server File "/openstack/venvs/nova-r14.5.0/lib/python2.7/site-packages/nova/compute/manager.py", line 216, in decorated_function
oslo_messaging.rpc.server kwargs['instance'], e, sys.exc_info())
oslo_messaging.rpc.server File "/openstack/venvs/nova-r14.5.0/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
oslo_messaging.rpc.server self.force_reraise()
oslo_messaging.rpc.server File "/openstack/venvs/nova-r14.5.0/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
oslo_messaging.rpc.server File "/openstack/venvs/nova-r14.5.0/lib/python2.7/site-packages/nova/compute/manager.py", line 204, in decorated_function
oslo_messaging.rpc.server return function(self, context, *args, **kwargs)
oslo_messaging.rpc.server File "/openstack/venvs/nova-r14.5.0/lib/python2.7/site-packages/nova/compute/manager.py", line 5611, in post_live_migration_at_destination
oslo_messaging.rpc.server instance.save(expected_task_state=task_states.MIGRATING)
oslo_messaging.rpc.server File "/openstack/venvs/nova-r14.5.0/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 210, in wrapper
oslo_messaging.rpc.server ctxt, self, fn.__name__, args, kwargs)
oslo_messaging.rpc.server File "/openstack/venvs/nova-r14.5.0/lib/python2.7/site-packages/nova/conductor/rpcapi.py", line 241, in object_action
oslo_messaging.rpc.server objmethod=objmethod, args=args, kwargs=kwargs)
oslo_messaging.rpc.server File "/openstack/venvs/nova-r14.5.0/lib/python2.7/site-packages/oslo_messaging/rpc/client.py", line 169, in call
oslo_messaging.rpc.server retry=self.retry)
oslo_messaging.rpc.server File "/openstack/venvs/nova-r14.5.0/lib/python2.7/site-packages/oslo_messaging/transport.py", line 97, in _send
oslo_messaging.rpc.server timeout=timeout, retry=retry)
oslo_messaging.rpc.server File "/openstack/venvs/nova-r14.5.0/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 584, in send
oslo_messaging.rpc.server retry=retry)
oslo_messaging.rpc.server File "/openstack/venvs/nova-r14.5.0/lib/python2.7/site-packages/oslo_messaging/_drivers/amqpdriver.py", line 575, in _send
oslo_messaging.rpc.server raise result
oslo_messaging.rpc.server UnexpectedTaskStateError_Remote: Conflict updating instance e8928cb2-afae-4cca-93db-f218e9f22324. Expected: {'task_state': [u'migrating']}. Actual: {'task_state': None}
oslo_messaging.rpc.server Traceback (most recent call last):
oslo_messaging.rpc.server
oslo_messaging.rpc.server File "/openstack/venvs/nova-r14.5.0/lib/python2.7/site-packages/nova/conductor/manager.py", line 87, in _object_dispatch
oslo_messaging.rpc.server return getattr(target, method)(*args, **kwargs)
oslo_messaging.rpc.server
oslo_messaging.rpc.server File "/openstack/venvs/nova-r14.5.0/lib/python2.7/site-packages/oslo_versionedobjects/base.py", line 226, in wrapper
oslo_messaging.rpc.server return fn(self, *args, **kwargs)
oslo_messaging.rpc.server
oslo_messaging.rpc.server File "/openstack/venvs/nova-r14.5.0/lib/python2.7/site-packages/nova/objects/instance.py", line 761, in save
oslo_messaging.rpc.server columns_to_join=_expected_cols(expected_attrs))
oslo_messaging.rpc.server
oslo_messaging.rpc.server File "/openstack/venvs/nova-r14.5.0/lib/python2.7/site-packages/nova/db/api.py", line 838, in instance_update_and_get_original
oslo_messaging.rpc.server expected=expected)
oslo_messaging.rpc.server
oslo_messaging.rpc.server File "/openstack/venvs/nova-r14.5.0/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 169, in wrapper
oslo_messaging.rpc.server return f(*args, **kwargs)
oslo_messaging.rpc.server
oslo_messaging.rpc.server File "/openstack/venvs/nova-r14.5.0/lib/python2.7/site-packages/oslo_db/api.py", line 151, in wrapper
oslo_messaging.rpc.server ectxt.value = e.inner_exc
oslo_messaging.rpc.server
oslo_messaging.rpc.server File "/openstack/venvs/nova-r14.5.0/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
oslo_messaging.rpc.server self.force_reraise()
oslo_messaging.rpc.server
oslo_messaging.rpc.server File "/openstack/venvs/nova-r14.5.0/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
oslo_messaging.rpc.server
oslo_messaging.rpc.server File "/openstack/venvs/nova-r14.5.0/lib/python2.7/site-packages/oslo_db/api.py", line 139, in wrapper
oslo_messaging.rpc.server return f(*args, **kwargs)
oslo_messaging.rpc.server
oslo_messaging.rpc.server File "/openstack/venvs/nova-r14.5.0/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 240, in wrapped
oslo_messaging.rpc.server return f(context, *args, **kwargs)
oslo_messaging.rpc.server
oslo_messaging.rpc.server File "/openstack/venvs/nova-r14.5.0/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 2637, in instance_update_and_get_original
oslo_messaging.rpc.server context, instance_uuid, values, expected, original=instance_ref))
oslo_messaging.rpc.server
oslo_messaging.rpc.server File "/openstack/venvs/nova-r14.5.0/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 2773, in _instance_update
oslo_messaging.rpc.server raise exc(**exc_props)
oslo_messaging.rpc.server
oslo_messaging.rpc.server UnexpectedTaskStateError: Conflict updating instance e8928cb2-afae-4cca-93db-f218e9f22324. Expected: {'task_state': [u'migrating']}. Actual: {'task_state': None}
oslo_messaging.rpc.server
oslo_messaging.rpc.server

Kevin Stevens (kstev) wrote :

Probably related to but not the same as stale bug https://bugs.launchpad.net/nova/+bug/1646841

tags: added: live-migration
Matt Riedemann (mriedem) wrote :

This is a bit of a stretch, but if you're consuming notifications on the rpc queue, you could maybe figure out at what point the task_state on that given instance is getting set to None because it's not obvious to me where the task_state gets set to None while an instance is being live migrated.

https://docs.openstack.org/nova/latest/configuration/config.html#notifications.notify_on_state_change
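With that option enabled, the instance update notifications should show which actor cleared the state. A hedged sketch of filtering captured notification payloads for the suspicious transition (the old_task_state/new_task_state field names are the ones I believe compute.instance.update emits; treat them as an assumption, and the sample events are hypothetical):

```python
# Given a stream of compute.instance.update notification payloads
# (as emitted when notify_on_state_change is set), flag events where
# a 'migrating' task_state was cleared to None.
def find_state_clears(payloads):
    """Return (host, old, new) tuples where migrating state was cleared."""
    hits = []
    for p in payloads:
        old, new = p.get('old_task_state'), p.get('new_task_state')
        if old == 'migrating' and new is None:
            hits.append((p.get('host'), old, new))
    return hits

events = [  # hypothetical captured payloads
    {'host': 'computehost01', 'old_task_state': None,
     'new_task_state': 'migrating'},
    {'host': 'computehost01', 'old_task_state': 'migrating',
     'new_task_state': None},
]
print(find_state_clears(events))
```

Cross-referencing the timestamp and host of the flagged event against the destination's post_live_migration_at_destination log line should show who got there first.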

Kevin Stevens (kstev) wrote :

Great idea. I will try to reproduce this in my lab. I *may* be able to turn on notifications in the affected environment as well, but unfortunately that is not likely.

Kevin Stevens (kstev) wrote :

I got a chance to debug on a couple of hosts. Seems like the source host is messing things up for us when it tries to undefine the corresponding libvirt domain but it is already gone. I think we were actually hitting this bug : https://bugs.launchpad.net/nova/+bug/1662626

We got this fix for Newton which fixed instances getting stuck in MIGRATING status : https://review.openstack.org/#/c/470387/

...but this change to keep method _undefine_domain from breaking the whole migrate process was abandoned for Newton https://review.openstack.org/#/c/508641/

This can probably be closed since, after patching the libvirt driver, I am unable to reproduce the issue.

Thanks

Matt Riedemann (mriedem) wrote :

Cool, glad to hear you got your environment patched.

Henry (guanzy) wrote :

I got this issue in my environment:

Add debug message in code:

nova/compute/manager.py

def _resize_instance(self, context, instance, ...):
    LOG.debug("-------instance.task_state is %s----", instance.task_state)
    instance.task_state = task_states.RESIZE_MIGRATING
    instance.save(expected_task_state=task_states.RESIZE_PREP)

compute.log:
2019-03-21 15:02:44.797 7 DEBUG nova.compute.manager [req-bc599214-d28f-46f3-b889-3d5abac4830b a6b2a018df8844a58bc7896d2cf91a16 4b37919a7bb7450db35b54ebb20ec29b - default default] -------instance.task_state is resize_prep---- _resize_instance /usr/lib/python2.7/site-packages/nova/compute/manager.py:4498

2019-03-21 15:02:44.927 7 ERROR nova.compute.manager [req-bc599214-d28f-46f3-b889-3d5abac4830b a6b2a018df8844a58bc7896d2cf91a16 4b37919a7bb7450db35b54ebb20ec29b - default default] [instance: b0948982-324c-4991-ab93-bd56f6e4ff52] Setting instance vm_state to ERROR: UnexpectedTaskStateError_Remote: Conflict updating instance b0948982-324c-4991-ab93-bd56f6e4ff52. Expected: {'task_state': [u'resize_prep']}. Actual: {'task_state': None}
Traceback (most recent call last):

The task_state was already resize_prep when read, yet the save still reports: Actual: {'task_state': None}
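Note that the debug line and the save are not atomic: the logged value is a snapshot, and another actor can clear the real state in the gap before the guarded update runs. A deterministic sketch of that window (the shared dict stands in for the database row; all names are illustrative):

```python
class UnexpectedTaskStateError(Exception):
    pass

db = {'task_state': 'resize_prep'}   # stands in for the instance's DB row

def guarded_save(expected, new_value):
    # Models Nova's conditional UPDATE ... guarded on task_state.
    if db['task_state'] != expected:
        raise UnexpectedTaskStateError(
            "Expected: %r. Actual: %r" % (expected, db['task_state']))
    db['task_state'] = new_value

snapshot = db['task_state']          # what the LOG.debug line prints
db['task_state'] = None              # another actor clears it in the gap
try:
    guarded_save(snapshot, 'resize_migrating')
except UnexpectedTaskStateError as e:
    print(e)  # prints: Expected: 'resize_prep'. Actual: None
```

So a log line showing resize_prep just before the save is consistent with the conflict: something else zeroed the state between the read and the write.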
