Comment 1 for bug 1753676

Alexandre arents (aarents) wrote :

Because of this issue it is possible to lose an instance disk (I saw it happen in production).
This scenario is reproducible on a multi-node master devstack deployment:

       HOST-A (initiates live block migration of a VM to HOST-B)
         | VM MIGRATING(to HOST-B)
         | VM MIGRATING(to HOST-B)
         | VM MIGRATING(to HOST-B)
         | VM MIGRATING(to HOST-B)
         | VM MIGRATING(to HOST-B)
         | VM MIGRATING(to HOST-B)
         | VM MIGRATING(to HOST-B)
         | nova-compute restarts on HOST-A; during init Nova resets the state from MIGRATING to ACTIVE no-task -> this is where things go wrong
         | VM ACTIVE no-task(HOST-A) but libvirt continues the block migration in the background, outside Nova's control
         | VM ACTIVE no-task(HOST-A) but libvirt continues the block migration in the background, outside Nova's control
         | VM ACTIVE no-task(HOST-A) but libvirt continues the block migration in the background, outside Nova's control
         | VM ACTIVE no-task(HOST-A) but libvirt continues the block migration in the background, outside Nova's control
         | VM ACTIVE no-task(HOST-A) but libvirt continues the block migration in the background, outside Nova's control
         | VM ACTIVE no-task(HOST-A) but libvirt continues the block migration in the background, outside Nova's control
         | Start another live migration of the same VM (this is possible because the VM is ACTIVE no-task)
         | Nova finds a suitable HOST-C to live-migrate to
         | Nova runs pre_live_migration on HOST-C, creating a target base disk, ready to receive the libvirt stream
         | Nova silently fails to start a libvirt live migration, probably because of the existing libvirt stream HOST-A -> HOST-B
         | VM MIGRATING(to HOST-C), but in reality libvirt continues to stream to HOST-B
         | VM MIGRATING(to HOST-C), but in reality libvirt continues to stream to HOST-B
         | VM MIGRATING(to HOST-C), but in reality libvirt continues to stream to HOST-B
         | VM MIGRATING(to HOST-C), but in reality libvirt continues to stream to HOST-B
         | VM MIGRATING(to HOST-C), but in reality libvirt continues to stream to HOST-B
         | END OF LIBVIRT migration to HOST-B
         | Nova catches the end of the live migration and runs the post_migration task on HOST-C instead of HOST-B
         | Nova sets the VM to ERROR state on HOST-C because qemu was not running on HOST-C, and cleans up the disk on source HOST-A
         | qemu is still running on HOST-B -> a zombie qemu is created
       HOST-C VM ERROR with an incomplete disk

So in the end, Nova thinks the VM is on HOST-C (in error, with an incomplete disk), and the disk on source HOST-A has been deleted during post_migration. HOST-B holds the only consistent copy of the disk, but that is hard to work out from the logs.

I confirm that the solution is to at least abort the live migration during instance_init.
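
To make the idea concrete, here is a rough sketch of what "abort during init" could look like. It is not Nova's actual code: `Instance` and `recover_on_init` are hypothetical names made up for illustration. The `domain` argument is assumed to behave like a libvirt-python `virDomain`, i.e. to offer `jobInfo()` (whose first element is the job type) and `abortJob()`. The two job type constants are copied by hand from libvirt's `virDomainJobType` enum so the sketch runs without libvirt installed.

```python
from dataclasses import dataclass
from typing import Optional

# Values assumed to match libvirt's virDomainJobType enum; these two
# types mean a job (for example a migration) is still running.
VIR_DOMAIN_JOB_BOUNDED = 1
VIR_DOMAIN_JOB_UNBOUNDED = 2
ACTIVE_JOB_TYPES = {VIR_DOMAIN_JOB_BOUNDED, VIR_DOMAIN_JOB_UNBOUNDED}


@dataclass
class Instance:
    """Hypothetical stand-in for the Nova instance record."""
    uuid: str
    vm_state: str
    task_state: Optional[str]


def recover_on_init(instance, domain):
    """Clean up an instance found in 'migrating' when the compute service starts.

    Returns True if a still-running hypervisor job was aborted.
    """
    if instance.task_state != "migrating":
        return False

    aborted = False
    # jobInfo()[0] is the job type; anything active here is a migration
    # that the restarted service no longer supervises.
    if domain.jobInfo()[0] in ACTIVE_JOB_TYPES:
        domain.abortJob()
        aborted = True

    # Only now is it safe to present the VM as ACTIVE with no task,
    # because no background stream to the old destination remains.
    instance.vm_state = "active"
    instance.task_state = None
    return aborted
```

The point of the ordering is that the state is reset only after the background job has been stopped, so a second live migration cannot start while libvirt is still streaming to HOST-B.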