live-migration job not aborted when live_monitor thread fails

Bug #1905944 reported by Alexandre arents
This bug affects 2 people
Affects                   Status         Importance  Assigned to       Milestone
OpenStack Compute (nova)  Fix Released   Medium      Alexandre arents
Train                     New            Undecided   Unassigned
Ussuri                    New            Undecided   Unassigned
Victoria                  In Progress    Undecided   Unassigned
Wallaby                   Fix Released   Undecided   Unassigned

Bug Description

Description
===========

During live migration, a monitoring thread polls the libvirt job progress
every 0.5s and updates the DB with the job stats. If a control-plane issue
occurs, such as a DB/RPC failure or an unexpected libvirt exception (timeout),
the exception handling does not properly interrupt the libvirt job.
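
The failure mode described above can be illustrated with a small, self-contained simulation. This is only a sketch under stated assumptions: `FakeMigrationJob`, `save_stats` and `monitor` are hypothetical names invented for illustration, not nova or libvirt APIs, and the job is advanced by hand to stand in for libvirt copying data independently of the monitor.

```python
class FakeMigrationJob:
    """Hypothetical stand-in for a libvirt migration job (not a real API)."""

    def __init__(self, total_steps):
        self.total_steps = total_steps
        self.done_steps = 0
        self.aborted = False

    def advance(self):
        # The job progresses on its own unless it has been aborted.
        if not self.aborted and self.done_steps < self.total_steps:
            self.done_steps += 1

    @property
    def completed(self):
        return self.done_steps >= self.total_steps


def save_stats(job, db_up):
    """Pretend to persist job stats; fail when the DB is unavailable."""
    if not db_up:
        raise ConnectionError("simulated DB outage")


def monitor(job, fail_on_poll):
    """Poll the job (every 0.5s in the report) and persist its stats.

    Nothing here cleans up the job if persisting the stats fails.
    """
    polls = 0
    while not job.completed:
        job.advance()
        polls += 1
        save_stats(job, db_up=(polls != fail_on_poll))


job = FakeMigrationJob(total_steps=5)
monitor_failed = False
try:
    monitor(job, fail_on_poll=2)
except ConnectionError:
    # The monitor is gone, but nobody told the job to stop.
    monitor_failed = True

# The job keeps running without any supervision until it completes.
while not job.completed:
    job.advance()

print(monitor_failed, job.completed, job.aborted)
```

In this simulation the monitor dies on its second poll, yet the job still runs to completion and is never aborted, which mirrors the guest being resumed on the target host.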

Steps to reproduce
==================
On a multinode devstack master.

# spawn an instance on source_host
1) openstack server create --flavor m1.small --image cirros-0.5.1-x86_64-disk \
--nic net-id=private inst

# start a live block migration to dest_host, wait a bit (to be in the monitoring thread),
# and trigger an issue, e.g. on the DB.
2) nova live-migration inst ; sleep 6 ; sudo service mysql restart

3) On the source host you can watch the libvirt job progress until it completes and disappears,
because libvirt resumes the guest on the target host (which starts writing data to the target disk)
source_host$ watch -n 1 virsh domjobinfo instance-0000000d

4) On the dest host you will find the instance active
dest_host$ virsh list
 Id Name State
-----------------------------------
 20 instance-0000000d running

5) nova show inst shows the instance still on the source host.
$nova show inst | grep host
| OS-EXT-SRV-ATTR:host | source_host

If an admin tries to recover the instance on the source host, since that is where the nova DB places it,
we can end up in a split-brain where two qemu processes run on two different disks on two hosts
(true story..)

Expected result
===============
If an issue happens, we must at least ensure that the libvirt job is interrupted, preventing
the guest from resuming on the target host.
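
One way to picture this expected behaviour is a wrapper that aborts the job whenever the monitor raises, then re-raises the original error. The following is a minimal sketch, not the actual nova change: `FakeGuest`, `failing_monitor` and `run_live_migration` are hypothetical names used for illustration only, and `abort_job` here merely flips a flag.

```python
import logging

LOG = logging.getLogger(__name__)


class FakeGuest:
    """Hypothetical stand-in for a guest with a running migration job."""

    def __init__(self):
        self.job_running = True
        self.abort_calls = 0

    def abort_job(self):
        # Assumption: aborting stops the job so the guest is never
        # resumed on the target host.
        self.abort_calls += 1
        self.job_running = False


def failing_monitor(guest):
    """Simulate a monitor hitting a DB/RPC/libvirt timeout."""
    raise TimeoutError("simulated control-plane failure")


def run_live_migration(guest, monitor):
    """Run the monitor; on any failure, abort the job and re-raise."""
    try:
        monitor(guest)
    except Exception:
        LOG.warning("Error monitoring migration", exc_info=True)
        try:
            if guest.job_running:
                guest.abort_job()
        except Exception:
            # The abort itself may be refused; log it and still surface
            # the original monitoring error below.
            LOG.warning("Failed to abort migration job", exc_info=True)
        raise


guest = FakeGuest()
raised = False
try:
    run_live_migration(guest, failing_monitor)
except TimeoutError:
    raised = True

print(raised, guest.abort_calls, guest.job_running)
```

Re-raising after the abort keeps the original error visible to the caller, so the instance/migration can still be put in ERROR state while the job itself no longer runs unsupervised.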

Actual result
=============
If an issue happens, the libvirt job continues and brings up the guest on the target host,
while nova still considers it to be on the source.

Changed in nova:
assignee: nobody → Alexandre arents (aarents)
Balazs Gibizer (balazs-gibizer) wrote :
Changed in nova:
status: New → In Progress
tags: added: libvirt live-migration
Changed in nova:
importance: Undecided → Medium
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.opendev.org/c/openstack/nova/+/764435
Committed: https://opendev.org/openstack/nova/commit/39f0af5d18d6bea34fa15b8f7778115b25432749
Submitter: "Zuul (22348)"
Branch: master

commit 39f0af5d18d6bea34fa15b8f7778115b25432749
Author: Alexandre Arents <email address hidden>
Date: Thu Nov 26 15:24:19 2020 +0000

    libvirt: Abort live-migration job when monitoring fails

    During live migration process, a _live_migration_monitor thread
    checks progress of migration on source host, if for any reason
    we hit infrastructure issue involving a DB/RPC/libvirt-timeout
    failure, an Exception is raised to the nova-compute service and
    instance/migration is set to ERROR state.

    The issue is that we may let live-migration job running out of nova
    control. At the end of job, guest is resumed on target host while
    nova still reports it on source host, this may lead to a split-brain
    situation if instance is restarted.

    This change proposes to abort live-migration job if issue occurs
    during _live_migration_monitor.

    Change-Id: Ia593b500425c81e54eb401e38264db5cc5fc1f93
    Closes-Bug: #1905944

Changed in nova:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 24.0.0.0rc1

This issue was fixed in the openstack/nova 24.0.0.0rc1 release candidate.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/wallaby)

Fix proposed to branch: stable/wallaby
Review: https://review.opendev.org/c/openstack/nova/+/837320

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/victoria)

Fix proposed to branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/837321

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/837322

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/837323

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/wallaby)

Reviewed: https://review.opendev.org/c/openstack/nova/+/837320
Committed: https://opendev.org/openstack/nova/commit/76ea8ee37707e0e2160d30cdb5c74d1fcad60de3
Submitter: "Zuul (22348)"
Branch: stable/wallaby

commit 76ea8ee37707e0e2160d30cdb5c74d1fcad60de3
Author: Alexandre Arents <email address hidden>
Date: Thu Nov 26 15:24:19 2020 +0000

    libvirt: Abort live-migration job when monitoring fails

    During live migration process, a _live_migration_monitor thread
    checks progress of migration on source host, if for any reason
    we hit infrastructure issue involving a DB/RPC/libvirt-timeout
    failure, an Exception is raised to the nova-compute service and
    instance/migration is set to ERROR state.

    The issue is that we may let live-migration job running out of nova
    control. At the end of job, guest is resumed on target host while
    nova still reports it on source host, this may lead to a split-brain
    situation if instance is restarted.

    This change proposes to abort live-migration job if issue occurs
    during _live_migration_monitor.

    Change-Id: Ia593b500425c81e54eb401e38264db5cc5fc1f93
    Closes-Bug: #1905944
    (cherry picked from commit 39f0af5d18d6bea34fa15b8f7778115b25432749)

OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 23.2.1

This issue was fixed in the openstack/nova 23.2.1 release.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/train)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/train
Review: https://review.opendev.org/c/openstack/nova/+/837323
Reason: stable/train branch of nova projects' have been tagged as End of Life. All open patches have to be abandoned in order to be able to delete the branch.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/ussuri)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/ussuri
Review: https://review.opendev.org/c/openstack/nova/+/837322
Reason: stable/ussuri branch of openstack/nova transitioned to End of Life and is about to be deleted. To be able to do that, all open patches need to be abandoned.

OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (stable/victoria)

Change abandoned by "Elod Illes <email address hidden>" on branch: stable/victoria
Review: https://review.opendev.org/c/openstack/nova/+/837321
Reason: stable/victoria branch of openstack/nova is about to be deleted. To be able to do that, all open patches need to be abandoned. Please cherry pick the patch to unmaintained/victoria if you want to further work on this patch.
