Live migration not working as expected when restarting the nova-compute service during migration on the source node

Bug #1753676 reported by keerthivasan selvaraj
This bug affects 7 people
Affects                         Status        Importance  Assigned to       Milestone
OpenStack Compute (nova)        Fix Released  Undecided   Alexandre Arents  -
OpenStack Compute (nova) Train  Fix Released  Undecided   Unassigned        -

Bug Description

Description
===========

Environment: Ubuntu 16.04
OpenStack version: Pike

I am trying to live-migrate a VM (block migration) from one compute node to another. Everything looks good unless I restart the nova-compute service: the live migration keeps running underneath with the help of libvirt, but once the VM reaches the destination, the database is not updated properly.

Steps to reproduce:
===================

nova.conf ([libvirt] settings on both compute nodes):

[libvirt]
live_migration_bandwidth=1200
live_migration_downtime=100
live_migration_downtime_steps=3
live_migration_downtime_delay=10
live_migration_flag = VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_LIVE
virt_type = kvm
inject_password = False
disk_cachemodes = network=writeback
live_migration_uri = "qemu+tcp://nova@%s/system"
live_migration_tunnelled = False
block_migration_flag = VIR_MIGRATE_UNDEFINE_SOURCE,VIR_MIGRATE_PEER2PEER,VIR_MIGRATE_NON_SHARED_INC

(This is the default OpenStack live migration configuration: pre-copy with no tunnelling.)
Source VM root disk: boot from volume, with one ephemeral disk (160 GB).
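
For reference, the comma-separated flag names in the [libvirt] section above map onto libvirt-python constants. A minimal illustration of the combination this configuration selects (this is not nova's actual flag-parsing code):

import libvirt

live_flags = (libvirt.VIR_MIGRATE_UNDEFINE_SOURCE
              | libvirt.VIR_MIGRATE_PEER2PEER
              | libvirt.VIR_MIGRATE_LIVE)
# The configured block_migration_flag adds incremental non-shared disk copy
block_flags = (libvirt.VIR_MIGRATE_UNDEFINE_SOURCE
               | libvirt.VIR_MIGRATE_PEER2PEER
               | libvirt.VIR_MIGRATE_NON_SHARED_INC)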

Trying to migrate the VM from compute1 to compute2; below is my source VM:

| OS-EXT-SRV-ATTR:host | compute1 |
| OS-EXT-SRV-ATTR:hostname | testcase1-all-ephemernal-boot-from-vol |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute1 |
| OS-EXT-SRV-ATTR:instance_name | instance-00000153 |

1) nova live-migration --block-migrate <vm-id> compute2

[req-48a3df61-3974-46ac-8019-c4c4a0f8a8c8 4a8150eb246a4450829331e993f8c3fd f11a5d3631f14c4f879a2e7dddb96c06 - default default] pre_live_migration data is LibvirtLiveMigrateData(bdms=<?>,block_migration=True,disk_available_mb=6900736,disk_over_commit=<?>,filename='tmpW5ApOS',graphics_listen_addr_spice=x.x.x.x,graphics_listen_addr_vnc=127.0.0.1,image_type='default',instance_relative_path='504028fc-1381-42ca-ad7c-def7f749a722',is_shared_block_storage=False,is_shared_instance_path=False,is_volume_backed=True,migration=<?>,serial_listen_addr=None,serial_listen_ports=<?>,supported_perf_events=<?>,target_connect_addr=<?>) pre_live_migration /openstack/venvs/nova-16.0.6/lib/python2.7/site-packages/nova/compute/manager.py:5437
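
For completeness, the same migration can be triggered and watched through the openstacksdk instead of the nova CLI. A sketch, assuming a clouds.yaml entry named 'devcloud' (a placeholder, not part of this report):

import openstack

conn = openstack.connect(cloud='devcloud')  # 'devcloud' is hypothetical
server = conn.compute.find_server('<vm-id>')
# Equivalent of: nova live-migration --block-migrate <vm-id> compute2
conn.compute.live_migrate_server(server, host='compute2', block_migration=True)

# nova keeps a migration record that can be polled while the job runs
for migration in conn.compute.migrations():
    print(migration.status, migration.source_compute, migration.dest_compute)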

Migration started; the data and memory transfer is visible (using iftop).

Data transfer between compute nodes using iftop:
                                                                                      <= 4.94Gb 4.99Gb 5.01Gb

Restarted the nova-compute service on the source compute node (the one the VM is migrating from).

The live migration is still going; once it completes, below is my total data transfer (using iftop):

TX: cum: 17.3MB peak: 2.50Mb rates: 11.1Kb 7.11Kb 463Kb
RX: 97.7GB 4.97Gb 3.82Kb 1.93Kb 1.87Gb
TOTAL: 97.7GB 4.97Gb
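
One way to confirm that libvirt alone is still driving the migration after the nova-compute restart is to query the domain job on the source host. A sketch using the libvirt Python bindings (instance-00000153 is the domain from this report):

import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('instance-00000153')
info = dom.jobInfo()  # [jobType, timeElapsed, timeRemaining, dataTotal, ...]
if info[0] == libvirt.VIR_DOMAIN_JOB_UNBOUNDED:
    print('migration job still running, with no nova monitoring thread attached')
conn.close()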

Once the migration completes, the virsh domain can be seen running on the destination compute node:

root@compute2:~# virsh list --all
 Id Name State
----------------------------------------------------
 3 instance-00000153 running

From nova-compute.log:

Instance <id> has been moved to another host compute1(compute1). There are allocations remaining against the source host that might need to be removed: {u'resources': {u'VCPU': 8, u'MEMORY_MB': 23808, u'DISK_GB': 180}}. _remove_deleted_instances_allocations /openstack/venvs/nova-16.0.6/lib/python2.7/site-packages/nova/compute/resource_tracker.py:123

nova-compute still reports 0 allocated vCPUs (even though the 8-core VM was there):

Total usable vcpus: 56, total allocated vcpus: 0 _report_final_resource_view /openstack/venvs/nova-16.0.6/lib/python2.7/site-packages/nova/compute/resource_tracker.py:792
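
The "allocations remaining against the source host" message above refers to records held by the placement service, which can be inspected by querying the placement API directly. A sketch, reusing the hypothetical 'devcloud' cloud entry, with the instance UUID left as a placeholder:

import openstack

conn = openstack.connect(cloud='devcloud')  # hypothetical clouds.yaml entry
instance_uuid = '<vm-id>'  # placeholder for the instance from this report
# GET /allocations/{consumer_uuid} lists resources still charged per provider
resp = conn.session.get('/allocations/' + instance_uuid,
                        endpoint_filter={'service_type': 'placement'})
print(resp.json())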

nova show <vm-id> (the nova DB still shows the source hostname; it is not updated with the new compute node):

| OS-EXT-SRV-ATTR:host | compute1 |
| OS-EXT-SRV-ATTR:hostname | testcase1-all-ephemernal-boot-from-vol |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute1 |
| OS-EXT-SRV-ATTR:instance_name | instance-00000153 |

Expected result
===============
The DB should be updated accordingly, or the migration should be aborted.

Actual result
=============

nova show <vm-id> (the nova DB still shows the source hostname; it is not updated with the new compute node):

| OS-EXT-SRV-ATTR:host | compute1 |
| OS-EXT-SRV-ATTR:hostname | testcase1-all-ephemernal-boot-from-vol |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute1 |
| OS-EXT-SRV-ATTR:instance_name | instance-00000153 |

virsh list on the destination compute node shows the following output:

root@compute2:~# virsh list --all
 Id Name State
----------------------------------------------------
 3 instance-00000153 running

Entire vm data is still present on both compute nodes.

ls /var/lib/nova/instances/18d63c06-b124-4ec4-9e36-afcadccaf23e

After restarting the nova-compute service on the destination machine, nova-compute logs the warning below:

2018-03-05 11:19:05.942 5791 WARNING nova.compute.manager [-] [instance: 18d63c06-b124-4ec4-9e36-afcadccaf23e] Instance is unexpectedly not found. Ignore.: InstanceNotFound: Instance 18d63c06-b124-4ec4-9e36-afcadccaf23e could not be found.

summary: Live migration not working as Expected when Restarting nova-compute
- service
+ service while migration
description: updated
tags: added: live-migration
summary: Live migration not working as Expected when Restarting nova-compute
- service while migration
+ service while migration from source node
Changed in nova:
status: New → Confirmed
Revision history for this message
Alexandre arents (aarents) wrote :

Because of this issue it is possible to lose the instance disk (seen in production).
This scenario is reproducible on a multi-node master devstack deployment:

       HOST-A (ignite live block migration of a VM to HOST-B)
         | VM MIGRATING(to HOST-B)
         | VM MIGRATING(to HOST-B)
         | VM MIGRATING(to HOST-B)
         | VM MIGRATING(to HOST-B)
         | VM MIGRATING(to HOST-B)
         | VM MIGRATING(to HOST-B)
         | VM MIGRATING(to HOST-B)
         | nova-compute restart on HOST-A: Nova resets state MIGRATING to ACTIVE no-task during init -> here is the evil
         | VM ACTIVE no-task(HOST-A) but libvirt continue block migration in bg, out of nova control
         | VM ACTIVE no-task(HOST-A) but libvirt continue block migration in bg, out of nova control
         | VM ACTIVE no-task(HOST-A) but libvirt continue block migration in bg, out of nova control
         | VM ACTIVE no-task(HOST-A) but libvirt continue block migration in bg, out of nova control
         | VM ACTIVE no-task(HOST-A) but libvirt continue block migration in bg, out of nova control
         | VM ACTIVE no-task(HOST-A) but libvirt continue block migration in bg, out of nova control
         | Start another live-migration of the same VM (it is possible because VM is active no-task)
         | NOVA find a suitable HOST-C to live-migrate
         | NOVA run prelive_migration on HOST-C, creating a target base disk, ready to receive libvirt stream
         | NOVA silently fails to run a libvirt live migration, probably due to the existing libvirt stream HOST-A -> HOST-B
         | VM MIGRATING(to HOST-C), but in reality libvirt continue to stream to HOST-B
         | VM MIGRATING(to HOST-C), but in reality libvirt continue to stream to HOST-B
         | VM MIGRATING(to HOST-C), but in reality libvirt continue to stream to HOST-B
         | VM MIGRATING(to HOST-C), but in reality libvirt continue to stream to HOST-B
         | VM MIGRATING(to HOST-C), but in reality libvirt continue to stream to HOST-B
         | END OF LIBVIRT migration to HOST-B
         | NOVA caught end of live migration, and RUN post_migration task on HOST-C instead of HOST-B
         | NOVA sets VM to ERROR state on HOST-C because qemu was not running on HOST-C, and it cleans up the disk on SOURCE HOST-A
         | qemu still running on HOST-B -> a zombie QEMU is created
       HOST-C VM ERROR with an incomplete disk

So at the end, Nova thinks the VM is on HOST-C (in error, with an incomplete disk), and the disk on source HOST-A has been dropped during post_migration. HOST-B contains the only consistent disk copy, but that is hard to guess when reading the logs.
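
A zombie like this can be spotted by listing the domains libvirt actually runs on each host and comparing them with the host nova records for every instance. A minimal sketch, using the qemu+tcp URI scheme from the reporter's configuration (HOST-B is the placeholder name from the timeline above):

import libvirt

conn = libvirt.open('qemu+tcp://nova@HOST-B/system')
for dom in conn.listAllDomains():
    state, _reason = dom.state()
    # Any running domain that nova does not map to this host is a zombie
    print(dom.name(), state)
conn.close()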

I confirm the solution is to at least abort the live migration during instance_init.
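
A simplified sketch of that idea, not the exact merged patch (see the review below): during compute service init, ask the virt driver to cancel any in-flight live migration job before the instance state is reset.

from nova.compute import task_states

def _init_instance(self, context, instance):
    if instance.task_state == task_states.MIGRATING:
        try:
            # Cancel the unmonitored libvirt job before resetting state
            self.driver.live_migration_abort(instance)
        except NotImplementedError:
            pass  # driver cannot abort; fall through to the usual reset
    # ... existing init logic continues ...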

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.opendev.org/678016

Changed in nova:
assignee: nobody → Alexandre arents (aarents)
status: Confirmed → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/678016
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=ebcf6e4ce576285949c5a202f2d7d21dc03156ef
Submitter: Zuul
Branch: master

commit ebcf6e4ce576285949c5a202f2d7d21dc03156ef
Author: Alexandre Arents <email address hidden>
Date: Tue Aug 20 13:37:33 2019 +0000

    Abort live-migration during instance_init

    When compute service restart during a live-migration,
    we lose live-migration monitoring thread. In that case
    it is better to early abort live-migration job before resetting
    state of instance, this will avoid API to accept further
    action while unmanaged migration process still run in background.
    It also avoid unexpected/dangerous behavior as describe in related bug.

    Change-Id: Idec2d31cbba497dc4b20912f3388ad2341951d23
    Closes-Bug: #1753676

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/720414

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/train)

Reviewed: https://review.opendev.org/720414
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=6fa8540f2ce40aacca4fcf588a050ed26f66d24c
Submitter: Zuul
Branch: stable/train

commit 6fa8540f2ce40aacca4fcf588a050ed26f66d24c
Author: Alexandre Arents <email address hidden>
Date: Tue Aug 20 13:37:33 2019 +0000

    Abort live-migration during instance_init

    When compute service restart during a live-migration,
    we lose live-migration monitoring thread. In that case
    it is better to early abort live-migration job before resetting
    state of instance, this will avoid API to accept further
    action while unmanaged migration process still run in background.
    It also avoid unexpected/dangerous behavior as describe in related bug.

    Change-Id: Idec2d31cbba497dc4b20912f3388ad2341951d23
    Closes-Bug: #1753676
    (cherry picked from commit ebcf6e4ce576285949c5a202f2d7d21dc03156ef)

tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/c/openstack/nova/+/806881

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/stein)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/stein
Review: https://review.opendev.org/c/openstack/nova/+/806881
Reason: This branch transitioned to End of Life for this project, open patches needs to be closed to be able to delete the branch.

