Retry after hitting libvirt error code VIR_ERR_OPERATION_INVALID in live migration.

Bug #1799152 reported by Fan Zhang on 2018-10-22
This bug affects 1 person
Affects: OpenStack Compute (nova)
Importance: High
Assigned to: Fan Zhang

Bug Description

Description
===========
When migration of a persistent guest completes, the guest merely shuts
off, but libvirt unhelpfully raises a VIR_ERR_OPERATION_INVALID error
code. In nova, we treat this case as success. However, if the qemu-kvm
process is killed accidentally in the middle of a live migration (for
example by host OOM, which is rare in our environment but has happened a
few times), the domain state becomes SHUTOFF and we also get
VIR_ERR_OPERATION_INVALID when calling `self._domain.jobStats()`.

In that case the migration should be considered failed. Otherwise
post_live_migration() starts to clean up instance files and we lose
customers' data forever.

IMHO, we may need to `pretend` the migration job is still running after
hitting VIR_ERR_OPERATION_INVALID and retry getting the job stats a few
times, with a configurable retry count. If the migration eventually
succeeds, we won't get VIR_ERR_OPERATION_INVALID after some retries, but
the error code keeps occurring if the qemu-kvm process was killed
accidentally.
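
A minimal sketch of the retry idea proposed above. It is not the actual nova patch. `FakeLibvirtError`, `FlakyDomain`, `get_job_stats_with_retry`, `max_retries` and `interval` are hypothetical names made up for illustration, and the constant is a local stand-in for `libvirt.VIR_ERR_OPERATION_INVALID` so that the snippet runs without libvirt installed.

```python
import time

# Local stand-in for libvirt.VIR_ERR_OPERATION_INVALID (assumption: lets the
# sketch run without the libvirt bindings).
VIR_ERR_OPERATION_INVALID = 55


class FakeLibvirtError(Exception):
    """Hypothetical stand-in for libvirt.libvirtError."""

    def __init__(self, code):
        super().__init__("libvirt error code %d" % code)
        self._code = code

    def get_error_code(self):
        return self._code


class FlakyDomain:
    """Hypothetical fake domain: fails `failures` times, then returns stats."""

    def __init__(self, failures):
        self.failures = failures

    def jobStats(self):
        if self.failures > 0:
            self.failures -= 1
            raise FakeLibvirtError(VIR_ERR_OPERATION_INVALID)
        return {"type": "completed"}


def get_job_stats_with_retry(domain, max_retries=3, interval=0.5,
                             sleep=time.sleep):
    """Return job stats, or None if OPERATION_INVALID persists.

    A None result means the error never went away within the retry budget,
    so the caller should treat the migration as failed rather than completed.
    """
    for attempt in range(max_retries + 1):
        try:
            return domain.jobStats()
        except FakeLibvirtError as exc:
            # Only the ambiguous error code is retried; anything else bubbles up.
            if exc.get_error_code() != VIR_ERR_OPERATION_INVALID:
                raise
            if attempt < max_retries:
                sleep(interval)
    return None
```

With this shape, the monitor would only run the post-migration cleanup when stats are eventually returned. A `None` result would mark the migration as failed and leave the instance files alone.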

Steps to reproduce
==================
* Run `nova live-migration <uuid>` on the controller node.
* Once the live migration monitor on the source compute node starts to get JobInfo, kill the qemu-kvm process on the source host.
* Check whether post_live_migration starts to execute on the source host.
* Check whether post_live_migration starts to execute on the destination host.
* Check the image files on both the source and destination hosts.

Expected result
===============

Migration should be considered failed.

Actual result
=============

Post live migration starts to execute on the source host and cleans up the instance files. The instance disappears from both the source and destination hosts.

Environment
===========
1. My environment is packstack with one controller node and two compute nodes; the OpenStack nova release is Queens.

2. Libvirt + KVM

Logs & Configs
==============

Some logs after the qemu-kvm process is killed:

...
2018-09-21 14:08:34.180 11099 DEBUG nova.virt.libvirt.migration [req-d8e0cfab-ea85-4716-a2fe-1307a7004f12 bf015418722f437e9f031efabc7a98e6 ca68d7d736374dbfb38d4ef2f80b2a5c - default default] [instance: ba8feaea-eedc-4b7c-8ffa-01152fc9bde8] Downtime does not need to change update_downtime /usr/lib/python2.7/site-packages/nova/virt/libvirt/migration.py:410
2018-09-21 14:08:34.305 11099 DEBUG nova.virt.libvirt.driver [req-d8e0cfab-ea85-4716-a2fe-1307a7004f12 bf015418722f437e9f031efabc7a98e6 ca68d7d736374dbfb38d4ef2f80b2a5c - default default] [instance: ba8feaea-eedc-4b7c-8ffa-01152fc9bde8] Migration running for 10 secs, memory 100% remaining; (bytes processed=0, remaining=0, total=0) _live_migration_monitor /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py:7394
2018-09-21 14:08:34.886 11099 DEBUG nova.virt.libvirt.guest [req-d8e0cfab-ea85-4716-a2fe-1307a7004f12 bf015418722f437e9f031efabc7a98e6 ca68d7d736374dbfb38d4ef2f80b2a5c - default default] Domain has shutdown/gone away: Requested operation is not valid: domain is not running get_job_info /usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py:720
2018-09-21 14:08:34.887 11099 INFO nova.virt.libvirt.driver [req-d8e0cfab-ea85-4716-a2fe-1307a7004f12 bf015418722f437e9f031efabc7a98e6 ca68d7d736374dbfb38d4ef2f80b2a5c - default default] [instance: ba8feaea-eedc-4b7c-8ffa-01152fc9bde8] Migration operation has completed
2018-09-21 14:08:34.887 11099 INFO nova.compute.manager [req-d8e0cfab-ea85-4716-a2fe-1307a7004f12 bf015418722f437e9f031efabc7a98e6 ca68d7d736374dbfb38d4ef2f80b2a5c - default default] [instance: ba8feaea-eedc-4b7c-8ffa-01152fc9bde8] _post_live_migration() is started..
...

Fan Zhang (fanzhang) on 2018-10-22
Changed in nova:
assignee: nobody → Fan Zhang (fanzhang)
Fan Zhang (fanzhang) on 2018-10-22
description: updated

Fix proposed to branch: master
Review: https://review.openstack.org/612272

Changed in nova:
status: New → In Progress
Fan Zhang (fanzhang) on 2018-10-24
description: updated
tags: added: live-migration
tags: added: libvirt
melanie witt (melwitt) wrote :

This bug looks valid, marking as High as it involves data loss.

Changed in nova:
importance: Undecided → High