[Nova] Live migration isn't really performed, instance stays on the same compute

Bug #1544564 reported by Rodion Promyshlennikov
This bug affects 1 person
Affects: Mirantis OpenStack (status tracked in 10.0.x)

  Series    Status       Importance   Assigned to
  10.0.x    Confirmed    Medium       Timofey Durakov
  8.0.x     Won't Fix    Medium       Timofey Durakov
  9.x       Won't Fix    Medium       Timofey Durakov

Bug Description

Live migration is not working (only about 1 in 5 attempted live migrations, or fewer, succeeds)

Environment:
MOS 8.0, ISO build 549

Steps to reproduce:
1. Deploy a standard environment with 3 controllers and 2 computes.
2. Launch an instance from an image with a small flavor (I used a CirrOS image).
3. Live-migrate the instance to the second compute node with block_migration=True (this can be done from the CLI or Horizon; the result is the same either way).
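For reference, step 3 can be triggered from the CLI roughly as follows (the instance and host names here are placeholders, not values from the affected environment):

    nova live-migration --block-migrate <instance-name-or-uuid> <target-compute-host>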

Expected Result:
The VM successfully migrates to the other host.

Observed Result:
The VM did not migrate and stays on the source compute node.
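One way to confirm that the instance stayed on the source node is to check its host attribute with admin credentials (instance name is a placeholder):

    nova show <instance-name-or-uuid> | grep OS-EXT-SRV-ATTR:host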

Diagnostic snapshot link:
https://drive.google.com/file/d/0B-QiiEr4w70UR2VCenRlWTI5N0U/view?usp=sharing

description: updated
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Rodion, could you please elaborate on what specific error you see?

description: updated
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

We are taking a closer look at the environment right now.

tags: added: area-nova
removed: nova
tags: added: release-notes
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

OK. So this is a specific case of live migration (block_migrate=True), where the VM's ephemeral disks *are not* shared but are stored on local disks instead, which means they have to be transferred over the network when a VM is live-migrated. This *can* be done, but it is generally not recommended: the disks may be large, so it is inefficient and puts the network under high load.

Depending on the root disk size and network bandwidth, it may take a long time for a VM disk to be transferred. In Liberty a special option was introduced to the libvirt driver to abort stuck live migrations:

[libvirt] live_migration_progress_timeout = 150 (IntOpt) Time to wait, in seconds, for migration to make forward progress in transferring data before aborting the operation. Set to 0 to disable timeouts.

And this is exactly what we see on Rodion's environment:

http://paste.openstack.org/show/486721/

nova-compute aborted the live migration because no progress was made within 150 seconds.

It's unclear at this point why qemu/libvirt failed to report the progress of the block device migration: tcpdump logs show that the disk was actually in the middle of being transferred when the migration was aborted. We'll take a closer look at this.
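For anyone reproducing this, one way to cross-check the transfer from the libvirt side is to query the active migration job on the source compute node (this assumes virsh access on the node; the domain name is a placeholder):

    virsh domjobinfo <libvirt-domain-name>

which should show the data processed and remaining for an ongoing migration.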

User impact is moderate: if block migration fails, the instance continues to run on the source host. The workaround is to increase the live_migration_progress_timeout value in nova.conf, or set it to 0 to disable the timeout completely.
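A sketch of that workaround (the 600 below is only an example value; pick one based on disk sizes and network bandwidth), applied on each compute node:

    # /etc/nova/nova.conf
    [libvirt]
    # allow more time without forward progress before aborting,
    # or set to 0 to disable the timeout entirely
    live_migration_progress_timeout = 600

followed by a restart of the nova-compute service.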

Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

I suggest we downgrade this to High and increase the timeout value in 8.0-mu1. So this should go to release notes in 8.0.

For 9.0, we'll need to look into whether block migration progress reporting can be improved on the qemu/libvirt side.

tags: added: move-to-mu
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

I actually think this should be Medium for 9.0, as block migration is a use case we'd like to avoid, and it is mostly mitigated by increasing the timeout value anyway.

Still, if we can improve qemu/libvirt that would be even better.

tags: added: 8.0 release-notes-done
removed: release-notes
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Oops, forgot to update the importance for 8.0-updates.

Changed in mos:
status: Confirmed → Won't Fix
Revision history for this message
Dina Belova (dbelova) wrote :

Added the move-to-10.0 tag because the bug was moved from 9.0 to 10.0

tags: added: move-to-10.0
Revision history for this message
Vitaly Sedelnik (vsedelnik) wrote :

Won't Fix for 8.0-updates because of Medium importance

Revision history for this message
Yuri Shovkoplias (yuri-shovkoplias) wrote :

Guys, this is not Medium importance; we are hitting this issue in customer deployments

tags: added: customer-found
Revision history for this message
Roman Podoliaka (rpodolyaka) wrote :

Yuriy,

1) as we discussed, you were actually seeing a different issue, not this one

2) the fact that it affects customer deployments does not by itself determine the bug importance, which essentially reflects the user impact and whether the problem can easily be avoided with a workaround

tags: added: 10.0-reviewed
tags: removed: move-to-10.0
tags: removed: move-to-mu
tags: added: ct1