Live migration not working as expected when restarting nova-compute service during migration from source node
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Fix Released | Undecided | Alexandre arents |
Train | Fix Released | Undecided | Unassigned |
Bug Description
Description
===========
Environment: Ubuntu 16.04
OpenStack version: Pike
I am trying to live-migrate a VM (block migration) from one compute node to another. Everything looks good unless I restart the nova-compute service: the live migration keeps running underneath via libvirt, but once the VM reaches the destination the database is not updated properly.
Steps to reproduce:
===================
nova.conf ( libvirt setting on both compute nodes )
[libvirt]
live_migration_
live_migration_
live_migration_
live_migration_
live_migration_flag = VIR_MIGRATE_
virt_type = kvm
inject_password = False
disk_cachemodes = network=writeback
live_migration_uri = "qemu+tcp:
live_migration_
block_migration
(default OpenStack live-migration configuration: pre-copy with no tunneling)
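Since the option values above were truncated in the original report, the following is an illustrative sketch of what a pre-copy, non-tunneled `[libvirt]` section of that era might look like; the flag values shown here are assumptions, not the reporter's exact settings:

```ini
[libvirt]
# Illustrative values only -- the reporter's actual settings were truncated.
virt_type = kvm
inject_password = False
disk_cachemodes = network=writeback
# Pre-copy, non-tunneled migration over plain TCP (no VIR_MIGRATE_TUNNELLED)
live_migration_uri = qemu+tcp://%s/system
live_migration_flag = VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_LIVE
block_migration_flag = VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_LIVE,VIR_MIGRATE_NON_SHARED_INC
```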
Source VM root disk: boot from volume, with one ephemeral disk (160 GB).
Trying to migrate the VM from compute1 to compute2; below is my source VM:
(nova show output truncated in the original report; only the OS-EXT- field prefixes survive)
1) nova live-migration --block-migrate <vm-id> compute2
[req-48a3df61-
Migration started; the data and memory transfer is visible (using iftop).
Data transfer between compute nodes using iftop.
Restarted the nova-compute service on the source compute node (where the VM is migrating from).
The live migration keeps going; once it completes, the total data transfer is below (using iftop):
TX: cum: 17.3MB peak: 2.50Mb rates: 11.1Kb 7.11Kb 463Kb
RX: 97.7GB 4.97Gb 3.82Kb 1.93Kb 1.87Gb
TOTAL: 97.7GB 4.97Gb
Once the migration completes, the virsh domain can be seen running on the destination compute node:
root@compute2:~# virsh list --all
Id Name State
-------
3 instance-00000153 running
From the nova-compute.log
Instance <id> has been moved to another host compute1(compute1). There are allocations remaining against the source host that might need to be removed: {u'resources': {u'VCPU': 8, u'MEMORY_MB': 23808, u'DISK_GB': 180}}. _remove_
nova-compute still shows 0 allocated vCPUs (but an 8-core VM was there):
Total usable vcpus: 56, total allocated vcpus: 0 _report_
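One plausible reading of the "total allocated vcpus: 0" line is that the resource tracker only counts instances whose DB host field matches the node it is running on, regardless of what QEMU is actually doing there. The helper below is a hypothetical sketch of that accounting, not nova's actual code:

```python
# Hypothetical sketch: resource accounting that only counts instances whose
# DB "host" field matches this node. Once the instance record no longer
# points at the source node, the source reports 0 allocated vCPUs even
# though the QEMU process (and the libvirt migration) are still live there.
def allocated_vcpus(instances, hostname):
    return sum(inst["vcpus"] for inst in instances if inst["host"] == hostname)

# The 8-vCPU guest's record is attributed to the other node:
instances = [{"name": "instance-00000153", "vcpus": 8, "host": "compute2"}]
print(allocated_vcpus(instances, "compute1"))  # 0
print(allocated_vcpus(instances, "compute2"))  # 8
```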
nova show <vm-id> (the nova DB still shows the source hostname; it is not updated with the new compute node):
(nova show output truncated in the original report; only the OS-EXT- field prefixes survive)
Expected result
===============
The DB should be updated accordingly, or the migration should be aborted.
Actual result
=============
nova show <vm-id> (the nova DB still shows the source hostname; it is not updated with the new compute node):
(nova show output truncated in the original report; only the OS-EXT- field prefixes survive)
virsh list on the destination compute node shows the output below:
root@compute2:~# virsh list --all
Id Name State
-------
3 instance-00000153 running
The entire VM disk data is still present on both compute nodes:
ls /var/lib/
After restarting the nova-compute service on the destination machine, nova-compute logged the warning below:
2018-03-05 11:19:05.942 5791 WARNING nova.compute.
summary: changed from "Live migration not working as Expected when Restarting nova-compute service" to "... service while migration"
description: updated
tags: added: live-migration
summary: changed from "... service while migration" to "... service while migration from source node"
Changed in nova:
status: New → Confirmed
Because of this issue it is possible to lose the instance disk (I have seen it in production).
This scenario is reproducible on a multi-node master devstack deployment:
HOST-A (ignite live block migration of a VM to HOST-B)
| VM MIGRATING(to HOST-B)
| VM MIGRATING(to HOST-B)
| VM MIGRATING(to HOST-B)
| VM MIGRATING(to HOST-B)
| VM MIGRATING(to HOST-B)
| VM MIGRATING(to HOST-B)
| VM MIGRATING(to HOST-B)
| nova-compute restart on HOST-A: during init, Nova resets the state from MIGRATING to ACTIVE with no task -> this is where it goes wrong
| VM ACTIVE no-task(HOST-A), but libvirt continues the block migration in the background, out of Nova's control
| VM ACTIVE no-task(HOST-A), but libvirt continues the block migration in the background, out of Nova's control
| VM ACTIVE no-task(HOST-A), but libvirt continues the block migration in the background, out of Nova's control
| VM ACTIVE no-task(HOST-A), but libvirt continues the block migration in the background, out of Nova's control
| VM ACTIVE no-task(HOST-A), but libvirt continues the block migration in the background, out of Nova's control
| VM ACTIVE no-task(HOST-A), but libvirt continues the block migration in the background, out of Nova's control
| Start another live migration of the same VM (possible because the VM is ACTIVE with no task)
| Nova finds a suitable HOST-C to live-migrate to
| Nova runs pre_live_migration on HOST-C, creating a target base disk ready to receive the libvirt stream
| Nova silently fails to run the libvirt live migration, probably due to the existing libvirt stream HOST-A -> HOST-B
| VM MIGRATING(to HOST-C), but in reality libvirt continue to stream to HOST-B
| VM MIGRATING(to HOST-C), but in reality libvirt continue to stream to HOST-B
| VM MIGRATING(to HOST-C), but in reality libvirt continue to stream to HOST-B
| VM MIGRATING(to HOST-C), but in reality libvirt continue to stream to HOST-B
| VM MIGRATING(to HOST-C), but in reality libvirt continue to stream to HOST-B
| END OF LIBVIRT migration to HOST-B
| Nova catches the end of the live migration and runs the post_migration task on HOST-C instead of HOST-B
| Nova sets the VM to ERROR on HOST-C because qemu was not running on HOST-C, and cleans up the disk on the source HOST-A
| qemu is still running on HOST-B -> a zombie QEMU is created
HOST-C: VM in ERROR with an incomplete disk
So in the end, Nova thinks the VM is on HOST-C (in ERROR, with an incomplete disk), and the disk on the source HOST-A has been dropped during post_migration. HOST-B contains the only consistent disk copy, but that is hard to guess when reading the logs.
I confirm that the solution is, at a minimum, to abort the live migration during instance init.
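The decision the proposed fix implements can be sketched as follows. This is a hypothetical simplification of the init-time state handling, not nova's actual `init_host` code: when the restarted service finds an instance whose task state is still `migrating`, it should abort the in-flight libvirt job (e.g. via libvirt's abortJob) rather than silently resetting the instance to ACTIVE with no task, which is what let the background migration run on out of Nova's control:

```python
# Hypothetical sketch of the init-time decision: do not reset a MIGRATING
# instance to ACTIVE/no-task on service restart; abort the migration first.
def action_on_init(vm_state, task_state):
    if task_state == "migrating":
        # An in-flight libvirt job may still be streaming to the destination;
        # abort it before handing the instance back as ACTIVE.
        return "abort_live_migration"
    if task_state is None and vm_state == "active":
        return "leave_running"
    # Any other combination needs case-by-case reconciliation.
    return "reconcile"

print(action_on_init("active", "migrating"))  # abort_live_migration
print(action_on_init("active", None))         # leave_running
```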