When live migration fails due to an internal error, rollback is not handled correctly

Bug #1788014 reported by sean mooney
This bug affects 3 people
Affects                    Status         Importance  Assigned to     Milestone
OpenStack Compute (nova)   Fix Released   Medium      Matt Riedemann
  Rocky                    In Progress    Medium      Lee Yarwood
  Stein                    Fix Committed  Medium      Matt Riedemann
  Train                    Fix Committed  Medium      Matt Riedemann

Bug Description

Description
===========
While testing live migration between ovs and ovs-dpdk on CentOS 7.5, I encountered
a qemu internal error that resulted in the qemu monitor closing.
When the migration failed during virDomainMigrateToURI3, nova rolled back the
migration, leaving the VM running on the source node. However, the VM no longer
had network connectivity, because the VM port on the source node had no vif type:
the host-id on the port was no longer set.
This is likely due to the source node port binding not being reactivated when the migration fails.

If the VM is hard-rebooted in this state, it enters the error state because the vif bindings are broken.

Note that the qemu error itself is out of scope for this bug. This bug focuses only on the fact
that nova does not correctly roll back the migration so that the VM ends up still running
correctly, with working networking, on the source node when a qemu error is thrown.

Steps to reproduce
==================
A chronological list of steps that reproduce the issue:

1. Deploy a two-node devstack with ovs on the controller and ovs-dpdk on the second node.
2. Configure nova-compute with the libvirt/kvm virt type on both nodes.
3. Allocate hugepages on both nodes and boot a VM on either node.
4. Attempt to live migrate the VM to the other node (see the CLI example below).
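
For the last step, with a Rocky-era client something like the following should work
(exact option names vary between openstack/nova client versions):

    openstack server migrate --live <target-host> <vm-name>
    # or equivalently
    nova live-migration <vm-name> <target-host>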

Expected result
===============
If qemu has a bug that results in a migration failure, the vif bindings should be
restored to their pre-migration state, and if networking cannot be restored the VM
should go directly to the error state.

Actual result
=============
The VM remained running on the source node in the active state.
The VM's interface no longer had a host_id set in its neutron port binding
(binding:host_id was cleared) and the vif type was none.
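
The broken state is visible directly on the neutron port, for example (hypothetical
port ID; column names as reported by a Rocky-era openstack client):

    openstack port show <port-id> -c binding_host_id -c binding_vif_type

Normally both fields point at the source host and a concrete vif type (e.g. ovs);
in this state the host_id is empty and the vif type is none.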

On hard reboot, the VM entered the error state with the following message:
Error: Failed to perform requested operation on instance "dpdk", the instance has an error status: Please try again later [Error: vif_type parameter must be present for this vif_driver implementation].

Environment
===========
1. OpenStack Rocky RC1
   nova sha: afe4512bf66c89a061b1a7ccd3e7ac8e3b1b284d
   neutron sha: 1dda2bca862b1268c0f5ae39b7508f1b1cab6f15

2. Which hypervisor did you use?
   libvirt + kvm
   libvirtd (libvirt) 3.9.0
   QEMU emulator version 2.10.0(qemu-kvm-ev-2.10.0-21.el7_5.4.1)
   centos 7.5

3. Which storage type did you use?
  Boot from volume, using the default LVM-backed storage deployed by devstack.

4. Which networking type did you use?
  Neutron ML2/OVS with ovs and ovs-dpdk.
  This is not really relevant to the bug.

Logs & Configs
==============

n-cpu log

Aug 16 12:19:00 devstack5 nova-compute[14766]: 2018-08-16T11:19:00.057596Z qemu-kvm: load of migration failed: Invalid argument: libvirtError: internal error: qemu unexpectedly closed the monitor: 2018-08-16T11:19:00.055479Z qemu-kvm: get_pci_config_device: Bad config data: i=0x10 read: 61 device: 1 cmask: ff wmask: c0 w1cmask:0
Aug 16 12:19:00 devstack5 nova-compute[14766]: DEBUG nova.virt.libvirt.driver [-] [instance: 05a564c2-2e25-4b07-911f-2ded60fb6756] Migration operation thread notification {{(pid=14766) thread_finished /opt/stack/nova/nova/virt/libvirt/driver.py:7478}}
Aug 16 12:19:00 devstack5 nova-compute[14766]: Traceback (most recent call last):
Aug 16 12:19:00 devstack5 nova-compute[14766]: File "/usr/lib/python2.7/site-packages/eventlet/hubs/hub.py", line 457, in fire_timers
Aug 16 12:19:00 devstack5 nova-compute[14766]: timer()
Aug 16 12:19:00 devstack5 nova-compute[14766]: File "/usr/lib/python2.7/site-packages/eventlet/hubs/timer.py", line 58, in __call__
Aug 16 12:19:00 devstack5 nova-compute[14766]: cb(*args, **kw)
Aug 16 12:19:00 devstack5 nova-compute[14766]: File "/usr/lib/python2.7/site-packages/eventlet/event.py", line 168, in _do_send
Aug 16 12:19:00 devstack5 nova-compute[14766]: waiter.switch(result)
Aug 16 12:19:00 devstack5 nova-compute[14766]: File "/usr/lib/python2.7/site-packages/eventlet/greenthread.py", line 214, in main
Aug 16 12:19:00 devstack5 nova-compute[14766]: result = function(*args, **kwargs)
Aug 16 12:19:00 devstack5 nova-compute[14766]: File "/opt/stack/nova/nova/utils.py", line 810, in context_wrapper
Aug 16 12:19:00 devstack5 nova-compute[14766]: return func(*args, **kwargs)
Aug 16 12:19:00 devstack5 nova-compute[14766]: File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 7134, in _live_migration_operation
Aug 16 12:19:00 devstack5 nova-compute[14766]: LOG.error("Live Migration failure: %s", e, instance=instance)
Aug 16 12:19:00 devstack5 nova-compute[14766]: File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 220, in __exit__
Aug 16 12:19:00 devstack5 nova-compute[14766]: self.force_reraise()
Aug 16 12:19:00 devstack5 nova-compute[14766]: File "/usr/lib/python2.7/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
Aug 16 12:19:00 devstack5 nova-compute[14766]: six.reraise(self.type_, self.value, self.tb)
Aug 16 12:19:00 devstack5 nova-compute[14766]: File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 7127, in _live_migration_operation
Aug 16 12:19:00 devstack5 nova-compute[14766]: bandwidth=CONF.libvirt.live_migration_bandwidth)
Aug 16 12:19:00 devstack5 nova-compute[14766]: File "/opt/stack/nova/nova/virt/libvirt/guest.py", line 673, in migrate
Aug 16 12:19:00 devstack5 nova-compute[14766]: destination, params=params, flags=flags)
Aug 16 12:19:00 devstack5 nova-compute[14766]: File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 186, in doit
Aug 16 12:19:00 devstack5 nova-compute[14766]: result = proxy_call(self._autowrap, f, *args, **kwargs)
Aug 16 12:19:00 devstack5 nova-compute[14766]: File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 144, in proxy_call
Aug 16 12:19:00 devstack5 nova-compute[14766]: rv = execute(f, *args, **kwargs)
Aug 16 12:19:00 devstack5 nova-compute[14766]: File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 125, in execute
Aug 16 12:19:00 devstack5 nova-compute[14766]: six.reraise(c, e, tb)
Aug 16 12:19:00 devstack5 nova-compute[14766]: File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 83, in tworker
Aug 16 12:19:00 devstack5 nova-compute[14766]: rv = meth(*args, **kwargs)
Aug 16 12:19:00 devstack5 nova-compute[14766]: File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1745, in migrateToURI3
Aug 16 12:19:00 devstack5 nova-compute[14766]: if ret == -1: raise libvirtError ('virDomainMigrateToURI3() failed', dom=self)
Aug 16 12:19:00 devstack5 nova-compute[14766]: libvirtError: internal error: qemu unexpectedly closed the monitor: 2018-08-16T11:19:00.055479Z qemu-kvm: get_pci_config_device: Bad config data: i=0x10 read: 61 device: 1 cmask: ff wmask: c0 w1cmask:0
Aug 16 12:19:00 devstack5 nova-compute[14766]: 2018-08-16T11:19:00.055541Z qemu-kvm: Failed to load PCIDevice:config
Aug 16 12:19:00 devstack5 nova-compute[14766]: 2018-08-16T11:19:00.055555Z qemu-kvm: Failed to load virtio-net:virtio
Aug 16 12:19:00 devstack5 nova-compute[14766]: 2018-08-16T11:19:00.055568Z qemu-kvm: error while loading state for instance 0x0 of device '0000:00:03.0/virtio-net'
Aug 16 12:19:00 devstack5 nova-compute[14766]: 2018-08-16T11:19:00.055761Z qemu-kvm: warning: TSC frequency mismatch between VM (2712005 kHz) and host (2712000 kHz), and TSC scaling unavailable
Aug 16 12:19:00 devstack5 nova-compute[14766]: 2018-08-16T11:19:00.057082Z qemu-kvm: warning: TSC frequency mismatch between VM (2712005 kHz) and host (2712000 kHz), and TSC scaling unavailable

Revision history for this message
Matt Riedemann (mriedem) wrote :

What clears out the vif binding host_id and vif_type during the live migration? I would expect those to always be set. Is it something to do with rolling back (deleting) the destination host port binding but not activating the source host port binding?

https://github.com/openstack/nova/blob/722d5b477219f0a2435a9f4ad4d54c61b83219f1/nova/compute/manager.py#L6908

tags: added: compute live-migration
Revision history for this message
Matt Riedemann (mriedem) wrote :

Do you see "Binding ports to destination host" in the logs on the source host when this failure occurs? That would happen when we enter post-copy mode and activate the destination host port bindings.
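
On a systemd-based devstack deployment (an assumption, based on the journald-style
log lines above), that can be checked with something like:

    journalctl -u devstack@n-cpu.service | grep 'Binding ports to destination host'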

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/594139

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: New → In Progress
Matt Riedemann (mriedem)
Changed in nova:
importance: Undecided → Medium
Revision history for this message
Matt Riedemann (mriedem) wrote :

The fix I proposed initially is the wrong one.

Sean's logs point out:

http://paste.openstack.org/show/728539/

That the migration fails, that we are then processing the "Migration completed" lifecycle event, and that this is what triggers the activation of the dest host port binding right before we roll back and delete that dest host port binding.

From the logs:

Aug 21 16:19:16 devstack2 nova-compute[25894]: INFO nova.compute.manager [None req-c8b07cbc-52f7-4d20-aacc-f3036ad90c8d None None] [instance: fead1ca6-beab-4c47-a73e-a3ab7f7c4de2] VM Migration completed (Lifecycle Event)

Aug 21 16:19:17 devstack2 nova-compute[25894]: DEBUG nova.compute.manager [None req-c8b07cbc-52f7-4d20-aacc-f3036ad90c8d None None] [instance: fead1ca6-beab-4c47-a73e-a3ab7f7c4de2] Binding ports to destination host: devstack5 {{(pid=25894) handle_lifecycle_event /opt/stack/nova/nova/compute/manager.py:1130}}

Aug 21 16:19:17 devstack2 nova-compute[25894]: ERROR nova.virt.libvirt.driver [-] [instance: fead1ca6-beab-4c47-a73e-a3ab7f7c4de2] Live Migration failure: internal error: qemu unexpectedly closed the monitor: 2018-08-21T15:19:15.187710Z qemu-kvm: -chardev socket,id=charnet0,path=/var/run/openvswitch/vhuef02ea3f-9a,server: info: QEMU waiting

So we need to fix the libvirt driver's handling of the 'migration completed' event to not send it back up to the compute manager if the job failed.
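
A minimal sketch of that idea, assuming libvirt-python and hypothetical helper and
callback names (this is not the actual nova patch, which lives in the libvirt driver's
lifecycle-event handling):

    import libvirt

    def migration_job_completed(domain):
        """Best-effort check that the domain's migration job actually finished."""
        try:
            # jobStats() returns a dict describing the job; the 'type' key holds a
            # VIR_DOMAIN_JOB_* constant. Behaviour varies with libvirt version, so
            # treat any error as "not completed".
            stats = domain.jobStats()
        except libvirt.libvirtError:
            return False
        return stats.get('type') == libvirt.VIR_DOMAIN_JOB_COMPLETED

    def handle_suspended_migrated(domain, emit_migration_completed, emit_paused):
        # Only bubble the "migration completed" lifecycle event up to the compute
        # manager (which activates the dest host port bindings) when the job really
        # succeeded; otherwise just report the guest as paused.
        if migration_job_completed(domain):
            emit_migration_completed()
        else:
            emit_paused()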

Matt Riedemann (mriedem)
tags: added: rocky-rc-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Fix proposed to branch: master
Review: https://review.openstack.org/594527

Revision history for this message
melanie witt (melwitt) wrote :

Here's a link to the partial fix that merged on 2018-08-23:

https://review.openstack.org/594508

Matt Riedemann (mriedem)
tags: removed: rocky-rc-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/693045

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Artom Lifshitz (<email address hidden>) on branch: master
Review: https://review.opendev.org/693045
Reason: Doesn't seem to be a way to make this work...

Revision history for this message
OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/594139
Reason: Let's just assume that https://review.opendev.org/#/c/594527/ fixes the bug since if we fix the root cause then we don't need to hack around cleaning up the busted port bindings.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/594527
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=aa87b9c288d316b85079e681e0df24354ec1912c
Submitter: Zuul
Branch: master

commit aa87b9c288d316b85079e681e0df24354ec1912c
Author: Matt Riedemann <email address hidden>
Date: Tue Aug 21 15:32:13 2018 -0400

    libvirt: check job status for VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED event

    Change Ic5cab99944df9e501ba2032eb96911c36304494d added handling for
    the VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED event during live migration
    but failed to distinguish between the live migration actually succeeding
    or failing before queueing the EVENT_LIFECYCLE_MIGRATION_COMPLETED
    up into the ComputeManager.handle_lifecycle_event method.

    As a result, failed live migrations will inadvertantly trigger
    activation of the port bindings on the destination host, which
    deactivates the source host port bindings, and then
    _rollback_live_migration will delete those activated dest host port
    bindings and leave the source host port bindings deactivated.

    In this change, if we get the VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED
    event, we attempt to get the job status to determine the course to
    take and only queue the EVENT_LIFECYCLE_MIGRATION_COMPLETED event,
    which triggers the dest host port activation, if we can determine
    the live migration job completed successfully. Otherwise we simply
    report the guest as paused, the same as before Ic5cab9994.

    Change-Id: I6a4252b0c12c41c233299f30ce8294fef21c7b40
    Closes-Bug: #1788014

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/695900

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/train)

Reviewed: https://review.opendev.org/695900
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=27bfd0bc6233c25114504bb363402807752a7ece
Submitter: Zuul
Branch: stable/train

commit 27bfd0bc6233c25114504bb363402807752a7ece
Author: Matt Riedemann <email address hidden>
Date: Tue Aug 21 15:32:13 2018 -0400

    libvirt: check job status for VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED event

    Change Ic5cab99944df9e501ba2032eb96911c36304494d added handling for
    the VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED event during live migration
    but failed to distinguish between the live migration actually succeeding
    or failing before queueing the EVENT_LIFECYCLE_MIGRATION_COMPLETED
    up into the ComputeManager.handle_lifecycle_event method.

    As a result, failed live migrations will inadvertantly trigger
    activation of the port bindings on the destination host, which
    deactivates the source host port bindings, and then
    _rollback_live_migration will delete those activated dest host port
    bindings and leave the source host port bindings deactivated.

    In this change, if we get the VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED
    event, we attempt to get the job status to determine the course to
    take and only queue the EVENT_LIFECYCLE_MIGRATION_COMPLETED event,
    which triggers the dest host port activation, if we can determine
    the live migration job completed successfully. Otherwise we simply
    report the guest as paused, the same as before Ic5cab9994.

    Change-Id: I6a4252b0c12c41c233299f30ce8294fef21c7b40
    Closes-Bug: #1788014
    (cherry picked from commit aa87b9c288d316b85079e681e0df24354ec1912c)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/700774

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/stein)

Reviewed: https://review.opendev.org/700774
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6310e693039cc3b6d2808915a3b4daf7bbffb9a8
Submitter: Zuul
Branch: stable/stein

commit 6310e693039cc3b6d2808915a3b4daf7bbffb9a8
Author: Matt Riedemann <email address hidden>
Date: Tue Aug 21 15:32:13 2018 -0400

    libvirt: check job status for VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED event

    Change Ic5cab99944df9e501ba2032eb96911c36304494d added handling for
    the VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED event during live migration
    but failed to distinguish between the live migration actually succeeding
    or failing before queueing the EVENT_LIFECYCLE_MIGRATION_COMPLETED
    up into the ComputeManager.handle_lifecycle_event method.

    As a result, failed live migrations will inadvertantly trigger
    activation of the port bindings on the destination host, which
    deactivates the source host port bindings, and then
    _rollback_live_migration will delete those activated dest host port
    bindings and leave the source host port bindings deactivated.

    In this change, if we get the VIR_DOMAIN_EVENT_SUSPENDED_MIGRATED
    event, we attempt to get the job status to determine the course to
    take and only queue the EVENT_LIFECYCLE_MIGRATION_COMPLETED event,
    which triggers the dest host port activation, if we can determine
    the live migration job completed successfully. Otherwise we simply
    report the guest as paused, the same as before Ic5cab9994.

    Conflicts:
          nova/virt/libvirt/host.py

    NOTE(mriedem): The conflict on the imports is due to not having
    change I4aeac9b2397bb2f5e130d1e58829a5e549fcb191 in Stein.

    Change-Id: I6a4252b0c12c41c233299f30ce8294fef21c7b40
    Closes-Bug: #1788014
    (cherry picked from commit aa87b9c288d316b85079e681e0df24354ec1912c)
    (cherry picked from commit 27bfd0bc6233c25114504bb363402807752a7ece)

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 20.1.0

This issue was fixed in the openstack/nova 20.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 19.1.0

This issue was fixed in the openstack/nova 19.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/711233

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/rocky)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/rocky
Review: https://review.opendev.org/c/openstack/nova/+/711233
Reason: This branch transitioned to End of Life for this project, open patches needs to be closed to be able to delete the branch.
