Description
===========
When migration of a persistent guest completes, the guest merely shuts off,
but libvirt unhelpfully raises a VIR_ERR_OPERATION_INVALID error code; in the
nova code, we pretend this case means success. But if, in the middle of a
live migration, the qemu-kvm process is killed accidentally (for example by the
host OOM killer, which happens rarely in our environment but has happened a few
times), the domain state is SHUTOFF and we will then get VIR_ERR_OPERATION_INVALID
while trying to call `self._domain.jobStats()`. Under these circumstances the
migration should be considered failed; otherwise the post_live_migration() function
starts to clean up instance files and we lose customers' data forever.
IMHO, we may need to "pretend" the migration job is still running after
hitting VIR_ERR_OPERATION_INVALID and retry fetching job stats a few times,
with a configurable retry count. If the migration really did succeed, we
won't keep getting VIR_ERR_OPERATION_INVALID after a few retries, whereas the error
persists when the qemu-kvm process has been killed accidentally.
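The retry idea above could be sketched roughly as follows. This is a minimal, self-contained sketch, not nova code: `LibvirtError`, `get_job_stats_with_retry`, and the numeric error-code constant are illustrative stand-ins for the real libvirt bindings and a hypothetical helper.

```python
import time

# Illustrative value; in real code use libvirt.VIR_ERR_OPERATION_INVALID.
VIR_ERR_OPERATION_INVALID = 55


class LibvirtError(Exception):
    """Stand-in for libvirt.libvirtError."""

    def __init__(self, code):
        super().__init__(
            "Requested operation is not valid: domain is not running")
        self.code = code

    def get_error_code(self):
        return self.code


def get_job_stats_with_retry(domain, retries=3, delay=0.5):
    """Fetch domain.jobStats(), retrying on VIR_ERR_OPERATION_INVALID.

    Only if the error persists across all retries do we let it
    propagate, so the caller can treat the migration as failed
    rather than completed.
    """
    for attempt in range(retries + 1):
        try:
            return domain.jobStats()
        except LibvirtError as e:
            if e.get_error_code() != VIR_ERR_OPERATION_INVALID:
                raise
            if attempt == retries:
                # Still SHUTOFF after all retries: genuinely gone.
                raise
            time.sleep(delay)
```

The retry count (and delay) would be the configurable knobs suggested above.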
Steps to reproduce
==================
* Run `nova live-migration <uuid>` on the controller node.
* Once the live migration monitor on the source compute node starts to get JobInfo, kill the qemu-kvm process on the source host.
* Check whether post_live_migration starts to execute on the source host.
* Check whether post_live_migration starts to execute on the destination host.
* Check the image files on both the source host and the destination host.
Expected result
===============
Migration should be considered failed.
Actual result
=============
Post live migration on the source host starts to execute and cleans up the instance files. The instance disappears on both the source and destination hosts.
Environment
===========
1. My environment is deployed with Packstack, and the OpenStack Nova release is Queens.
2. Libvirt + KVM
Logs & Configs
==============
Some logs after the qemu-kvm process is killed:
```
...
2018-09-21 14:08:34.180 11099 DEBUG nova.virt.libvirt.migration [req-d8e0cfab-ea85-4716-a2fe-1307a7004f12 bf015418722f437e9f031efabc7a98e6 ca68d7d736374dbfb38d4ef2f80b2a5c - default default] [instance: ba8feaea-eedc-4b7c-8ffa-01152fc9bde8] Downtime does not need to change update_downtime /usr/lib/python2.7/site-packages/nova/virt/libvirt/migration.py:410
2018-09-21 14:08:34.305 11099 DEBUG nova.virt.libvirt.driver [req-d8e0cfab-ea85-4716-a2fe-1307a7004f12 bf015418722f437e9f031efabc7a98e6 ca68d7d736374dbfb38d4ef2f80b2a5c - default default] [instance: ba8feaea-eedc-4b7c-8ffa-01152fc9bde8] Migration running for 10 secs, memory 100% remaining; (bytes processed=0, remaining=0, total=0) _live_migration_monitor /usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py:7394
2018-09-21 14:08:34.886 11099 DEBUG nova.virt.libvirt.guest [req-d8e0cfab-ea85-4716-a2fe-1307a7004f12 bf015418722f437e9f031efabc7a98e6 ca68d7d736374dbfb38d4ef2f80b2a5c - default default] Domain has shutdown/gone away: Requested operation is not valid: domain is not running get_job_info /usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py:720
2018-09-21 14:08:34.887 11099 INFO nova.virt.libvirt.driver [req-d8e0cfab-ea85-4716-a2fe-1307a7004f12 bf015418722f437e9f031efabc7a98e6 ca68d7d736374dbfb38d4ef2f80b2a5c - default default] [instance: ba8feaea-eedc-4b7c-8ffa-01152fc9bde8] Migration operation has completed
2018-09-21 14:08:34.887 11099 INFO nova.compute.manager [req-d8e0cfab-ea85-4716-a2fe-1307a7004f12 bf015418722f437e9f031efabc7a98e6 ca68d7d736374dbfb38d4ef2f80b2a5c - default default] [instance: ba8feaea-eedc-4b7c-8ffa-01152fc9bde8] _post_live_migration() is started..
...
```