libvirt: post_live_migration failures to disconnect volumes result in the rollback of live migrations

Bug #1843639 reported by Lee Yarwood on 2019-09-11
Affects: OpenStack Compute (nova)
Importance: Medium
Assigned to: Lee Yarwood

Bug Description

Description
===========
At present, any exception encountered during post_live_migration on the source after an instance has successfully migrated causes the overall migration to fail. The instance is then listed as running on the source while it is actually running on the destination.

Any such errors should be logged but otherwise ignored, allowing the migration to complete and the instance to continue to be tracked correctly.
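The "log but otherwise ignore" behaviour described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not nova's actual code: the helper name disconnect_volumes_best_effort, the disconnect callable and the volume identifiers are all hypothetical.

```python
import logging

LOG = logging.getLogger(__name__)


def disconnect_volumes_best_effort(volumes, disconnect):
    """Call disconnect(volume) for every volume, logging any failure.

    Hypothetical helper for illustration only. Returns the list of
    volumes whose disconnect call raised.
    """
    failed = []
    for volume in volumes:
        try:
            disconnect(volume)
        except Exception:
            # The guest is already running on the destination, so a cleanup
            # failure on the source is logged rather than re-raised.
            LOG.exception("Ignoring failure to disconnect volume %s", volume)
            failed.append(volume)
    return failed


def flaky_disconnect(volume):
    # Stand-in for a real disconnect call; fails for one volume only.
    if volume == "vol-2":
        raise RuntimeError("target unreachable")


failed = disconnect_volumes_best_effort(
    ["vol-1", "vol-2", "vol-3"], flaky_disconnect)
```

Catching the exception per volume, rather than around the whole loop, means one failing volume does not prevent cleanup of the remaining ones.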

Steps to reproduce
==================
- Live migrate an instance from host A to host B, ensuring post_live_migration fails.

Expected result
===============
Any failures on the source encountered by post_live_migration are logged but the overall migration still completes successfully.

Actual result
=============
The instance and the overall migration are left in error states. Additionally, the instance is reported as residing on the source host while it is actually running on the destination.
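A toy model may help show why a source-side cleanup failure can leave the record pointing at the wrong host. It assumes, purely for illustration, that the cleanup step runs before the host record is updated; complete_migration, the record dictionary and the tolerate_cleanup_errors flag are hypothetical and are not taken from nova.

```python
def complete_migration(record, dest, source_cleanup, tolerate_cleanup_errors):
    """Toy model: source cleanup runs before the host record is updated."""
    try:
        source_cleanup()
    except Exception:
        if not tolerate_cleanup_errors:
            # Strict handling: the failure aborts the remaining bookkeeping,
            # so the record still names the source host.
            record["status"] = "error"
            return record
    # Reached when cleanup succeeded or its failure was tolerated.
    record["host"] = dest
    record["status"] = "completed"
    return record


def failing_cleanup():
    # Stand-in for a volume disconnect that fails on the source.
    raise RuntimeError("volume disconnect failed")


strict = complete_migration(
    {"host": "A", "status": "migrating"}, "B", failing_cleanup, False)
tolerant = complete_migration(
    {"host": "A", "status": "migrating"}, "B", failing_cleanup, True)
```

In this model the strict call ends with the record on host A in an error state, matching the actual result above, while the tolerant call ends with host B and a completed status, matching the expected result.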

Environment
===========
1. Exact version of OpenStack you are running. See the following
  list for all releases: http://docs.openstack.org/releases/

   ba3147420c0a6f8b17a46b1a493b89bcd67af6f1

2. Which hypervisor did you use?
   (For example: Libvirt + KVM, Libvirt + XEN, Hyper-V, PowerKVM, ...)
   What's the version of that?

   Libvirt + KVM

3. Which storage type did you use?
   (For example: Ceph, LVM, GPFS, ...)
   What's the version of that?

   N/A

4. Which networking type did you use?
   (For example: nova-network, Neutron with OpenVSwitch, ...)

   N/A

Matt Riedemann (mriedem) wrote :

I'm not surprised by this, since the _post_live_migration method and the post_live_migration_at_destination method that it calls are both huge and complicated. I've long advocated breaking those giant methods into smaller parts so that we can handle errors like this more correctly. For a backportable fix, though, we'd likely just need to handle the volume errors during post-processing and refactor the code later.

Matt Riedemann (mriedem) wrote :

Similar change from Lee here for refactoring volume handling in _rollback_live_migration:

https://review.opendev.org/#/c/656500/

Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
Lee Yarwood (lyarwood) wrote :

Apologies for the confusion; I was specifically talking about post_live_migration within the libvirt driver itself, not within the compute layer. There are definitely additional issues there, as you've pointed out above, but this bug is specifically about the lack of error handling in the following method:

https://github.com/openstack/nova/blob/7a18209a81539217a95ab7daad6bc67002768950/nova/virt/libvirt/driver.py#L8800-L8810

Thankfully, the fix is pretty straightforward and should be easily backportable. I'll post it shortly, once M3 is cut and the gate is in better shape.

Related fix proposed to branch: master
Review: https://review.opendev.org/682621

Changed in nova:
assignee: nobody → Lee Yarwood (lyarwood)
status: Confirmed → In Progress