Restarting destination compute manager during live-migration can cause instance data loss

Bug #1319797 reported by Loganathan Parthipan
This bug affects 2 people
Affects                   Status        Importance  Assigned to    Milestone
OpenStack Compute (nova)  Fix Released  High        David McNally
Icehouse                  Fix Released  Undecided   Unassigned

Bug Description

During compute manager startup, init_host is called. One of its functions is to delete instance data that doesn't belong to this host, i.e. _destroy_evacuated_instances. But this function only checks whether the local instance belongs to the host or not. It doesn't check the task_state.

Suppose a live-migration is in progress and the destination compute manager is restarted. It will find the migrating instance as not belonging to the host and destroy it. This can result in two outcomes:

1. If live-migration is still in progress, the source hypervisor hangs, so a rollback can be triggered by killing the job.
2. However, if live-migration has completed and the post-live-migration-destination is messaged, then by the time the compute manager gets to processing the message, the instance data will already have been deleted. Subsequent periodic tasks only get as far as defining the VM, but there won't be any disks left.

2014-05-08 20:42:33.058 16724 WARNING nova.virt.libvirt.driver [-] Periodic task is updating the host stat, it is trying to get disk instance-00000002, but disk file was removed by concurrent operations such as resize.
2014-05-08 20:43:33.370 16724 WARNING nova.virt.libvirt.driver [-] Periodic task is updating the host stat, it is trying to get disk instance-00000002, but disk file was removed by concurrent operations such as resize.
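
The host-only check described above can be illustrated with a small standalone sketch. This is not nova's code: the `Instance` class, the `instances_to_destroy` helper, the host names, and the `"migrating"` string are hypothetical stand-ins chosen for illustration. It only shows why an instance that is mid-migration to this host gets selected for cleanup.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    """Hypothetical stand-in for an instance record (not nova's object)."""
    uuid: str
    host: str                  # host the database says owns the instance
    task_state: Optional[str]  # assumed value "migrating" while in flight


def instances_to_destroy(local_instances: List[Instance],
                         our_host: str) -> List[Instance]:
    # Host-only check: any instance found locally whose recorded host
    # differs from ours is treated as evacuated. task_state is ignored.
    return [inst for inst in local_instances if inst.host != our_host]


# Instances the hypervisor on "dest" currently knows about.
local = [
    Instance("aaa", host="dest", task_state=None),
    # Incoming live-migration: still owned by "source" in the database.
    Instance("bbb", host="source", task_state="migrating"),
]

doomed = instances_to_destroy(local, our_host="dest")
print([inst.uuid for inst in doomed])  # ['bbb']
```

The migrating instance is selected because its recorded host is still the source until the migration finishes.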

Steps to reproduce:

1. Start live-migration
2. Wait for pre-live-migration to define the destination VM
3. Restart destination compute manager

To see what happens in case 2 (live-migration having completed), put a breakpoint in init_host, delay until the instance is running on the destination, and then let nova-compute continue. In this case you'll end up with an instance directory like this:

ls -l 06ddbe13-577b-4f9f-ac52-0c038aec04d8
total 8
-rw-r--r-- 1 root root 89 May 8 19:59 disk.info
-rw-r--r-- 1 root root 1548 May 8 19:59 libvirt.xml

I verified this in a tripleo devtest environment.

tags: added: compute
Changed in nova:
assignee: nobody → David McNally (dave-mcnally)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/93903

Changed in nova:
status: New → In Progress
Mark McLoughlin (markmc)
Changed in nova:
importance: Undecided → High
Mark McLoughlin (markmc)
tags: added: icehouse-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/93903
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=340cae5466eaf5568c4f0eecb2a2fa7cdbcc0ba4
Submitter: Jenkins
Branch: master

commit 340cae5466eaf5568c4f0eecb2a2fa7cdbcc0ba4
Author: David McNally <email address hidden>
Date: Fri May 16 13:21:26 2014 +0100

    Prevent clean-up of migrating instances on compute init

    During compute manager startup init_host is called. One
    of the functions this carries out is to delete instance
    data that doesn't belong to this host. This function only
    checks if the local instance belongs to the host or not.
    It doesn't check the task_state. This could result in the
    loss of all instance data if it occurred at the wrong
    point during live migration.

    This change checks if the task_state of the instance to
    be deleted is MIGRATING and if so it does not delete the
    instance. Similarly for the task state RESIZE_MIGRATING.

    Change-Id: Ia8c67acf93d71af868907f0711dcc1dfe103560c
    Closes-Bug: 1319797
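
The check described in the commit message above can be sketched as follows. This is a hedged illustration rather than the merged patch: the `Instance` class, the `instances_to_destroy` helper, the host names, and the lowercase task-state strings are assumptions made for the example. Only the idea of skipping instances in the MIGRATING or RESIZE_MIGRATING task state comes from the commit message.

```python
from dataclasses import dataclass
from typing import List, Optional

# Assumed string values for the two task states named in the commit message.
MIGRATING = "migrating"
RESIZE_MIGRATING = "resize_migrating"
SKIP_TASK_STATES = frozenset({MIGRATING, RESIZE_MIGRATING})


@dataclass
class Instance:
    """Hypothetical stand-in for an instance record (not nova's object)."""
    uuid: str
    host: str
    task_state: Optional[str]


def instances_to_destroy(local_instances: List[Instance],
                         our_host: str) -> List[Instance]:
    doomed = []
    for inst in local_instances:
        if inst.host == our_host:
            continue  # belongs to this host, keep it
        if inst.task_state in SKIP_TASK_STATES:
            continue  # migration in flight, do not delete its data
        doomed.append(inst)
    return doomed


local = [
    Instance("aaa", host="dest", task_state=None),
    Instance("bbb", host="source", task_state=MIGRATING),
    Instance("ccc", host="other", task_state=None),  # genuinely evacuated
    Instance("ddd", host="source", task_state=RESIZE_MIGRATING),
]

print([i.uuid for i in instances_to_destroy(local, our_host="dest")])  # ['ccc']
```

Only the instance that belongs elsewhere and has no migration task state remains a cleanup candidate.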

Changed in nova:
status: In Progress → Fix Committed
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/icehouse)

Fix proposed to branch: stable/icehouse
Review: https://review.openstack.org/101832

Changed in nova:
milestone: none → juno-2
status: Fix Committed → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/icehouse)

Reviewed: https://review.openstack.org/101832
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1469c8e14267e27ecc6ced29c91dc1506ce26633
Submitter: Jenkins
Branch: stable/icehouse

commit 1469c8e14267e27ecc6ced29c91dc1506ce26633
Author: David McNally <email address hidden>
Date: Fri May 16 13:21:26 2014 +0100

    Prevent clean-up of migrating instances on compute init

    During compute manager startup init_host is called. One
    of the functions this carries out is to delete instance
    data that doesn't belong to this host. This function only
    checks if the local instance belongs to the host or not.
    It doesn't check the task_state. This could result in the
    loss of all instance data if it occurred at the wrong
    point during live migration.

    This change checks if the task_state of the instance to
    be deleted is MIGRATING and if so it does not delete the
    instance. Similarly for the task state RESIZE_MIGRATING.

    This change adjusts the unit test slightly to match the
    actual code path in icehouse.
    Conflicts:
     nova/tests/compute/test_compute_mgr.py

    Change-Id: Ia8c67acf93d71af868907f0711dcc1dfe103560c
    Closes-Bug: 1319797
    (cherry picked from commit 340cae5466eaf5568c4f0eecb2a2fa7cdbcc0ba4)

tags: added: in-stable-icehouse
Thierry Carrez (ttx)
Changed in nova:
milestone: juno-2 → 2014.2