Currently the libvirt driver's approach to live migration is best characterized as "launch & pray". It starts the live migration operation and then just unconditionally waits for it to finish. It never attempts to tune the migration's behaviour (for example, by raising the maximum permitted downtime), never looks at the data transfer statistics to check whether it is actually making progress, and has no overall timeout.
It is not uncommon for guests to run workloads that preclude live migration from ever completing: they dirty guest RAM (or block devices) faster than the network can transfer it to the destination host. In such a case Nova will just leave the migration running, burning host CPU cycles and wasting network bandwidth until the end of the universe.
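For context, libvirt already exposes the primitives needed to detect this and react. A minimal sketch using the libvirt-python bindings (the domain name, sampling interval, and downtime value are all illustrative, not anything Nova does today):

    import time
    import libvirt

    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName("instance-00000001")   # hypothetical guest mid-migration

    # Sample the migration job statistics twice, a few seconds apart.
    before = dom.jobStats()                        # virDomainGetJobStats()
    time.sleep(5)
    after = dom.jobStats()

    transferred = after["data_processed"] - before["data_processed"]
    remaining_delta = after["data_remaining"] - before["data_remaining"]

    # If the remaining data is not shrinking, the guest is dirtying RAM at
    # least as fast as the network can copy it, so the migration will never
    # converge on its own.
    if remaining_delta >= 0:
        print("not converging: sent %d bytes, but remaining grew by %d"
              % (transferred, remaining_delta))
        # One possible response: permit a longer guest pause at switchover.
        dom.migrateSetMaxDowntime(500, 0)          # milliseconds, illustrative value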
Libvirt exposes many features that Nova could use to do a better job, but the question is which features, and how they should be used. Fortunately Nova is not the first project to come across this problem: the oVirt data center management project faces exactly the same issue. So rather than trying to invent new logic for Nova, we should, as an immediate bug fix task, just copy the oVirt logic from VDSM:
https://github.com/oVirt/vdsm/blob/master/vdsm/virt/migration.py#L430
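In outline, that VDSM code runs the (blocking) migration call in a worker thread while a monitor thread watches the job statistics, and aborts the migration if the amount of data left to send stops shrinking for too long. A loose sketch of that idea, with illustrative constants and names (see the VDSM source above for the real logic):

    import threading
    import time
    import libvirt

    POLL_INTERVAL = 10       # seconds between progress checks (illustrative)
    PROGRESS_TIMEOUT = 150   # abort after this long with no progress (illustrative)

    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName("instance-00000001")          # hypothetical guest

    def migrate():
        # migrateToURI() blocks until the migration completes or is aborted.
        try:
            dom.migrateToURI("qemu+tcp://dest-host/system",   # hypothetical target
                             libvirt.VIR_MIGRATE_LIVE |
                             libvirt.VIR_MIGRATE_PEER2PEER, None, 0)
        except libvirt.libvirtError:
            pass  # aborted by the monitor below, or failed outright

    worker = threading.Thread(target=migrate)
    worker.start()

    lowest = None                 # low-water mark of data left to transfer
    last_progress = time.time()
    while worker.is_alive():
        time.sleep(POLL_INTERVAL)
        remaining = dom.jobStats().get("data_remaining", 0)
        if lowest is None or remaining < lowest:
            lowest = remaining    # made progress: reset the watchdog
            last_progress = time.time()
        elif time.time() - last_progress > PROGRESS_TIMEOUT:
            dom.abortJob()        # stalled: cancel; guest keeps running on source
            break
    worker.join()

The key design point worth copying is that progress is defined as a new low-water mark for untransferred data, so a migration that keeps re-sending dirtied pages without ever getting closer to completion is eventually cancelled rather than left running forever.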
Once this is in users' hands and we have real-world feedback on how it behaves, we will have a much better idea of how and where to focus future efforts.
Related fix proposed to branch: master
Review: https://review.openstack.org/162253