Mirantis OpenStack

Comment 3 for bug 1544564

Revision history for this message

Roman Podoliaka (rpodolyaka) wrote on 2016-02-11:

Ok. So this is a specific case of live migration (block_migrate=True), when VMs ephemeral disks *are not* shared, but stored on local disks instead, which means they have to be transferred over the network, when a VM is live migrated. This *can* be done, but generally not recommended, as disks may be big, so it's simply inefficient and puts network under high load.

Depending on the root disk size and network bandwidth, it may take a long time for a VM disk to be transferred. In Liberty a special option was introduced to the libvirt driver to abort stuck live migrations:

[libvirt] live_migration_progress_timeout = 150 (IntOpt) Time to wait, in seconds, for migration to make forward progress in transferring data before aborting the operation. Set to 0 to disable timeouts.

And this is exactly what we see on the Rodion's environment:

http://paste.openstack.org/show/486721/

nova-compute aborted the live migration because there was no progress during 150s.

It's unclear at this point, why qemu/libvirt failed to report progress of a block device migration, as we can see from tcpdump logs, that the disk was actually in the middle of migration when we stopped it. We'll take a closer look at this.

User impact is moderate: if block migration fails, an instance will continue to run on the source host. The workaround is to increase live_migration_progress_timeout value in nova.conf or set it to 0 to disable timeouts completely.