Resources on destination node aren't cleaned up if live-migration fails

Bug #1488435 reported by Timofey Durakov
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: High
Assigned to: Matt Riedemann

Bug Description

I deployed a multinode devstack environment and tried to live-migrate a volume-backed instance.
The live migration fails while copying the glance image to the destination. When this happens, nova leaves the destination host in an inconsistent state, and any attempt to run the live migration again fails.
Here is a short investigation of the problem:
nova creates the instance directory on the target compute node:
https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L6152
On the second run it fails here, because the destination node wasn't properly cleaned up:
https://github.com/openstack/nova/blob/master/nova/virt/libvirt/driver.py#L6148

It is also strange that when pre_live_migration fails on the destination node, control returns to the source, which then decides whether the destination really needs to be cleaned up. It would be clearer to wrap pre_live_migration in try/except and roll it back before responding to the source node:
https://github.com/openstack/nova/blob/master/nova/compute/manager.py#L4962-L4984
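
A minimal sketch of that proposal, not nova code: `prepare` and `rollback` are hypothetical callables standing in for the destination-side pre_live_migration work and its cleanup. The idea is only that the destination undoes its own partial work and then re-raises, so the source still sees the original failure.

```python
def pre_live_migration_with_rollback(prepare, rollback, instance):
    """Run destination-side preparation and undo it locally if it fails.

    `prepare` and `rollback` are hypothetical stand-ins, not nova APIs.
    """
    try:
        return prepare(instance)
    except Exception:
        try:
            # Clean up on the destination before answering the source.
            rollback(instance)
        except Exception as cleanup_error:
            # Best effort: report the cleanup failure, keep the original error.
            print("cleanup failed for %s: %s" % (instance, cleanup_error))
        # Re-raise the original preparation error for the caller.
        raise
```

With this shape the source no longer has to decide whether the destination needs cleaning up; it only has to handle the error it receives.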

Devstack nova version:
stack@node1:/opt/stack/nova$ git log -1
commit a71912ef463761187ef785118fbb0e51a5e91ba3
Merge: ea4e95e 6b2d616
Author: Jenkins <email address hidden>
Date: Mon Aug 24 16:07:32 2015 +0000

    Merge "Fixed indentation"

Controller with compute node nova.conf:
http://paste.openstack.org/show/426934/

Compute only nova.conf:
http://paste.openstack.org/show/426936/

*UPDATE*
Root cause of the problem:
rollback_live_migration_at_destination does several things. One of them is deallocating nova-network resources via a call to setup_networks_on_host with teardown=True. If this call fails, no clean-up happens on the driver side.
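
The failure mode can be modelled with a simplified sketch. This is not the nova implementation: `FakeNetworkAPI`, `FakeDriver`, `NetworkTeardownError` and both `rollback_*` functions are hypothetical, and only the method names setup_networks_on_host and rollback_live_migration_at_destination come from this report. It shows why an exception from the network teardown skips the driver cleanup, and one way a try/except/finally keeps the driver cleanup reachable.

```python
class NetworkTeardownError(Exception):
    """Hypothetical error standing in for a network API failure."""


class FakeNetworkAPI:
    """Hypothetical network API whose teardown always fails."""

    def setup_networks_on_host(self, instance, host, teardown=False):
        raise NetworkTeardownError("teardown failed on %s" % host)


class FakeDriver:
    """Hypothetical virt driver that records whether cleanup ran."""

    def __init__(self):
        self.cleaned_up = False

    def rollback_live_migration_at_destination(self, instance):
        self.cleaned_up = True


def rollback_unprotected(network_api, driver, instance, host):
    # If the network call raises, the driver cleanup below never runs.
    network_api.setup_networks_on_host(instance, host, teardown=True)
    driver.rollback_live_migration_at_destination(instance)


def rollback_protected(network_api, driver, instance, host):
    try:
        network_api.setup_networks_on_host(instance, host, teardown=True)
    except Exception as exc:
        print("network teardown failed for %s: %s" % (instance, exc))
        raise
    finally:
        # Runs whether or not the network teardown succeeded.
        driver.rollback_live_migration_at_destination(instance)
```

In this sketch the protected variant still propagates the network error to the caller; whether to re-raise or swallow it is a design choice the sketch does not settle.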

Changed in nova:
assignee: nobody → Timofey Durakov (tdurakov)
Changed in nova:
status: New → In Progress
importance: Undecided → Medium
Markus Zoeller (markus_z) (mzoeller) wrote :

A few more details would be good:
* details about the setup
   * `git log -1` for devstack and nova
   * the "local.conf" files (you can use http://paste.openstack.org/ for that)
* steps to reproduce (or record a session with http://showterm.io/ )
* anything caught in the logs?

And right now it reads like this describes 3 issues:
1) "live-migration fails during copying of glance image to destination"
2) "nova leaves destination host in inconsistent state"
3) "attempt to run live-migration again fails"

The upcoming patch sets for these issues would be easier to review if they were easier to tell apart. So would it make sense to split this bug report into three more focused bug reports?

Timofey Durakov (tdurakov) wrote :

@markus_z thank you for the comment, I've updated the bug description with the details you asked for. About the issues you listed:
it looks like we could merge 2 and 3, as they are tightly coupled. I would like to focus on them in this bug. The problem with the glance image requires another bug. I'll file it and add it as a comment to the current bug.

description: updated
Timofey Durakov (tdurakov) wrote :

@markus_z found the reason for the glance error: an incorrect [glance]api_servers value was set on the destination node. So only the second and third issues are valid.

tags: added: live-migrate
removed: live-migration
description: updated
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/227897

Pawel Koniszewski (pawel-koniszewski) wrote :
Timofey Durakov (tdurakov) wrote :

@Pawel, no, the root cause of the problem is different

Pawel Koniszewski (pawel-koniszewski) wrote :

Raising importance to High because it leaves instance files on the destination host, so every subsequent LM of the same VM to that compute node will fail. This will also be a blocker for the implementation of https://review.openstack.org/#/c/228828/ (query and cancel LM), if we agree on this spec, which is very important from an operator's perspective.

Changed in nova:
importance: Medium → High
Changed in nova:
assignee: Timofey Durakov (tdurakov) → John Garbutt (johngarbutt)
Changed in nova:
assignee: John Garbutt (johngarbutt) → Timofey Durakov (tdurakov)
Changed in nova:
assignee: Timofey Durakov (tdurakov) → Matt Riedemann (mriedem)
Changed in nova:
assignee: Matt Riedemann (mriedem) → Timofey Durakov (tdurakov)
Changed in nova:
assignee: Timofey Durakov (tdurakov) → Matt Riedemann (mriedem)
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/227897
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=450f9a063973cca65af76ef2deb4d833ce68572b
Submitter: Jenkins
Branch: master

commit 450f9a063973cca65af76ef2deb4d833ce68572b
Author: Timofey Durakov <email address hidden>
Date: Fri Sep 25 18:06:50 2015 +0300

    Attempt rollback live migrate at dest even if network dealloc fails

    When setup_networks_on_host fails during
    rollback_live_migration_at_destination some resources are
    left because driver.rollback_live_migration_at_destination
    is not called.

    To give driver a chance to cleanup resources,the network
    method is wrapped with a try/except block so we can
    attempt the rollback via the virt driver.

    Closes-bug: #1488435

    Change-Id: I463fa28fe05cf44dc335fe02f6c54d9b0a20d1e4

Changed in nova:
status: In Progress → Fix Committed
Paul Murray (pmurray)
tags: added: live-migration
removed: live-migrate
Changed in nova:
status: Fix Committed → Fix Released