Live migration is left unclean when libvirtd is restarted

Bug #1999607 reported by Hiroki Narukawa
Affects: OpenStack Compute (nova)
Status: In Progress
Importance: Undecided
Assigned to: Unassigned

Bug Description

Description
===========

When I restart libvirtd on the source host while a live migration is in progress, the live migration is cancelled, but nova does not roll back, and unclean disks remain on the destination hypervisor.

The cause breaks down into two issues:

  1. In the live migration monitor, nova does not retry after the libvirtd connection fails.
  2. The guest variable in nova.virt.libvirt.driver.LibvirtDriver._live_migration_monitor is tied to the old libvirtd client connection, so even after nova successfully reconnects, that variable is no longer usable (see the sketch below).
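
The direction of a fix can be sketched as follows: re-resolve the guest through the host's current connection and retry the poll instead of failing on the first connection error. This is only a minimal illustration of that idea (poll_job_info_with_retry and its retries/interval parameters are hypothetical names), not the change proposed in the review below.

    import time

    import libvirt


    def poll_job_info_with_retry(host, instance, guest, retries=5, interval=2):
        """Poll guest.get_job_info(), re-resolving the guest if libvirtd restarts.

        host is the nova.virt.libvirt.host.Host wrapper and guest is the
        nova.virt.libvirt.guest.Guest being migrated.
        """
        for attempt in range(retries):
            try:
                return guest, guest.get_job_info()
            except libvirt.libvirtError:
                if attempt == retries - 1:
                    raise
                time.sleep(interval)
                # The old Guest wraps a virDomain bound to the dead connection;
                # look the domain up again via the host's re-established
                # connection so the returned Guest is usable.
                guest = host.get_guest(instance)

The monitor loop would then keep using the guest returned here for its subsequent iterations.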

The patch below mostly fixes this bug.

https://review.opendev.org/c/openstack/nova/+/867077

That patch does not handle the case where the live migration has actually completed but nova missed the result, i.e. libvirtd was restarted within 0.5 seconds of the migration completing. Fixing that behavior would be more complicated, so I would like to scope this ticket to the case where the live migration is cancelled.

Steps to reproduce
==================

1. Start a VM
2. In the VM, run "stress -m 8" to make live migration take a long time
3. Start live migration
4. Restart libvirtd on the source hypervisor

The two commands below, run on the source hypervisor, correspond to steps 3 and 4.

nova live-migration --block-migrate <VM_UUID> <DEST_HOST_NAME>
sudo systemctl restart libvirtd
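
To make sure libvirtd is restarted while the migration job is still running, the job state can first be checked at the libvirt level on the source hypervisor, for example (where <DOMAIN_NAME> is a placeholder for the instance's libvirt domain name):

sudo virsh domjobinfo <DOMAIN_NAME>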

Expected result
===============

Rollback runs and the destination hypervisor is left clean.

Actual result
=============

Rollback does not run and an unused disk remains on the destination hypervisor.
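
With local file-backed storage, the leftover disk can be seen in the instance directory on the destination hypervisor, for example (assuming the default instances_path):

ls /var/lib/nova/instances/<INSTANCE_UUID>/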

After applying the patch below, the expected result described above is obtained.

https://review.opendev.org/c/openstack/nova/+/867077

Environment
===========

1. Exact version of OpenStack you are running. See the following
   list for all releases: http://docs.openstack.org/releases/

An OpenStack Yoga-based version; our environment carries some additional internal patches, so I cannot determine the exact version.

2. Which storage type did you use?

Local storage on the hypervisor

3. Which networking type did you use?

Open vSwitch

Logs & Configs
==============

In the logs below, the VM name, VM UUID, and hostnames are masked.

Logs on the source hypervisor:

2022-12-14 15:13:03.945 2414856 INFO nova.compute.manager [-] [instance: <INSTANCE_UUID> <VM_NAME>] Took 1.50 seconds for pre_live_migration on destination host <DEST_HOST_NAME>.
2022-12-14 15:13:04.537 2414856 INFO nova.virt.libvirt.migration [-] [instance: <INSTANCE_UUID> <VM_NAME>] Increasing downtime to 166 ms after 0 sec elapsed time
2022-12-14 15:13:04.605 2414856 INFO nova.virt.libvirt.driver [-] [instance: <INSTANCE_UUID> <VM_NAME>] Migration running for 0 secs, memory 100% remaining (bytes processed=0, remaining=0, total=0); disk 100% remaining (bytes processed=0, remaining=0, total=0).
2022-12-14 15:13:11.555 2414856 INFO nova.virt.libvirt.driver [None req-f100723b-9689-4d65-9965-abd8b078f3e1 - - - - - -] Connection event '0' reason 'Connection to libvirt lost: 1'
2022-12-14 15:13:11.556 2414856 ERROR nova.virt.libvirt.driver [-] [instance: <INSTANCE_UUID> <VM_NAME>] Live Migration failure: End of file while reading data: Input/output error: libvirt.libvirtError: End of file while reading data: Input/output error
2022-12-14 15:13:11.972 2414856 WARNING nova.virt.libvirt.driver [-] [instance: <INSTANCE_UUID> <VM_NAME>] Error monitoring migration: internal error: client socket is closed: libvirt.libvirtError: internal error: client socket is closed
2022-12-14 15:13:11.972 2414856 ERROR nova.virt.libvirt.driver [instance: <INSTANCE_UUID> <VM_NAME>] Traceback (most recent call last):
2022-12-14 15:13:11.972 2414856 ERROR nova.virt.libvirt.driver [instance: <INSTANCE_UUID> <VM_NAME>] File "/opt/nova/lib64/python3.8/site-packages/nova/virt/libvirt/driver.py", line 10374, in _live_migration
2022-12-14 15:13:11.972 2414856 ERROR nova.virt.libvirt.driver [instance: <INSTANCE_UUID> <VM_NAME>] self._live_migration_monitor(context, instance, guest, dest,
2022-12-14 15:13:11.972 2414856 ERROR nova.virt.libvirt.driver [instance: <INSTANCE_UUID> <VM_NAME>] File "/opt/nova/lib64/python3.8/site-packages/nova/virt/libvirt/driver.py", line 10167, in _live_migration_monitor
2022-12-14 15:13:11.972 2414856 ERROR nova.virt.libvirt.driver [instance: <INSTANCE_UUID> <VM_NAME>] info = guest.get_job_info()
2022-12-14 15:13:11.972 2414856 ERROR nova.virt.libvirt.driver [instance: <INSTANCE_UUID> <VM_NAME>] File "/opt/nova/lib64/python3.8/site-packages/nova/virt/libvirt/guest.py", line 655, in get_job_info
2022-12-14 15:13:11.972 2414856 ERROR nova.virt.libvirt.driver [instance: <INSTANCE_UUID> <VM_NAME>] stats = self._domain.jobStats()
2022-12-14 15:13:11.972 2414856 ERROR nova.virt.libvirt.driver [instance: <INSTANCE_UUID> <VM_NAME>] File "/opt/nova/lib64/python3.8/site-packages/eventlet/tpool.py", line 193, in doit
2022-12-14 15:13:11.972 2414856 ERROR nova.virt.libvirt.driver [instance: <INSTANCE_UUID> <VM_NAME>] result = proxy_call(self._autowrap, f, *args, **kwargs)
2022-12-14 15:13:11.972 2414856 ERROR nova.virt.libvirt.driver [instance: <INSTANCE_UUID> <VM_NAME>] File "/opt/nova/lib64/python3.8/site-packages/eventlet/tpool.py", line 151, in proxy_call
2022-12-14 15:13:11.972 2414856 ERROR nova.virt.libvirt.driver [instance: <INSTANCE_UUID> <VM_NAME>] rv = execute(f, *args, **kwargs)
2022-12-14 15:13:11.972 2414856 ERROR nova.virt.libvirt.driver [instance: <INSTANCE_UUID> <VM_NAME>] File "/opt/nova/lib64/python3.8/site-packages/eventlet/tpool.py", line 132, in execute
2022-12-14 15:13:11.972 2414856 ERROR nova.virt.libvirt.driver [instance: <INSTANCE_UUID> <VM_NAME>] six.reraise(c, e, tb)
2022-12-14 15:13:11.972 2414856 ERROR nova.virt.libvirt.driver [instance: <INSTANCE_UUID> <VM_NAME>] File "/opt/nova/lib64/python3.8/site-packages/six.py", line 719, in reraise
2022-12-14 15:13:11.972 2414856 ERROR nova.virt.libvirt.driver [instance: <INSTANCE_UUID> <VM_NAME>] raise value
2022-12-14 15:13:11.972 2414856 ERROR nova.virt.libvirt.driver [instance: <INSTANCE_UUID> <VM_NAME>] File "/opt/nova/lib64/python3.8/site-packages/eventlet/tpool.py", line 86, in tworker
2022-12-14 15:13:11.972 2414856 ERROR nova.virt.libvirt.driver [instance: <INSTANCE_UUID> <VM_NAME>] rv = meth(*args, **kwargs)
2022-12-14 15:13:11.972 2414856 ERROR nova.virt.libvirt.driver [instance: <INSTANCE_UUID> <VM_NAME>] File "/usr/lib64/python3.8/site-packages/libvirt.py", line 1742, in jobStats
2022-12-14 15:13:11.972 2414856 ERROR nova.virt.libvirt.driver [instance: <INSTANCE_UUID> <VM_NAME>] raise libvirtError('virDomainGetJobStats() failed')
2022-12-14 15:13:11.972 2414856 ERROR nova.virt.libvirt.driver [instance: <INSTANCE_UUID> <VM_NAME>] libvirt.libvirtError: internal error: client socket is closed
2022-12-14 15:13:11.972 2414856 ERROR nova.virt.libvirt.driver [instance: <INSTANCE_UUID> <VM_NAME>]
2022-12-14 15:13:17.581 2414856 WARNING nova.compute.manager [req-4d26fd2a-fc63-4f20-a2fe-db32b8b8bf89 req-a2ce9c4b-0337-45ff-937b-4e923ed44f8b 98868f79e5c24d6eaeee24236f010aaa 0a6387d10ac54fadb3794fc4a71ddd01 - - default default] [instance: <INSTANCE_UUID> <VM_NAME>] Received unexpected event network-vif-plugged-9494fe72-aee4-4341-8d28-8f05ad68042b for instance with vm_state active and task_state migrating.
2022-12-14 15:13:41.826 2414856 INFO nova.virt.libvirt.driver [None req-f100723b-9689-4d65-9965-abd8b078f3e1 - - - - - -] Connection event '1' reason 'None'
2022-12-14 15:13:41.870 2414856 ERROR nova.virt.libvirt.driver [-] [instance: <INSTANCE_UUID> <VM_NAME>] Failed to cancel migration Requested operation is not valid: no job is active on the domain: libvirt.libvirtError: Requested operation is not valid: no job is active on the domain
2022-12-14 15:13:41.874 2414856 ERROR nova.compute.manager [-] [instance: <INSTANCE_UUID> <VM_NAME>] Live migration failed.: libvirt.libvirtError: internal error: client socket is closed
2022-12-14 15:13:41.874 2414856 ERROR nova.compute.manager [instance: <INSTANCE_UUID> <VM_NAME>] Traceback (most recent call last):
2022-12-14 15:13:41.874 2414856 ERROR nova.compute.manager [instance: <INSTANCE_UUID> <VM_NAME>] File "/opt/nova/lib64/python3.8/site-packages/nova/compute/manager.py", line 8625, in _do_live_migration
2022-12-14 15:13:41.874 2414856 ERROR nova.compute.manager [instance: <INSTANCE_UUID> <VM_NAME>] self.driver.live_migration(context, instance, dest,
2022-12-14 15:13:41.874 2414856 ERROR nova.compute.manager [instance: <INSTANCE_UUID> <VM_NAME>] File "/opt/nova/lib64/python3.8/site-packages/nova/virt/libvirt/driver.py", line 9846, in live_migration
2022-12-14 15:13:41.874 2414856 ERROR nova.compute.manager [instance: <INSTANCE_UUID> <VM_NAME>] self._live_migration(context, instance, dest,
2022-12-14 15:13:41.874 2414856 ERROR nova.compute.manager [instance: <INSTANCE_UUID> <VM_NAME>] File "/opt/nova/lib64/python3.8/site-packages/nova/virt/libvirt/driver.py", line 10374, in _live_migration
2022-12-14 15:13:41.874 2414856 ERROR nova.compute.manager [instance: <INSTANCE_UUID> <VM_NAME>] self._live_migration_monitor(context, instance, guest, dest,
2022-12-14 15:13:41.874 2414856 ERROR nova.compute.manager [instance: <INSTANCE_UUID> <VM_NAME>] File "/opt/nova/lib64/python3.8/site-packages/nova/virt/libvirt/driver.py", line 10167, in _live_migration_monitor
2022-12-14 15:13:41.874 2414856 ERROR nova.compute.manager [instance: <INSTANCE_UUID> <VM_NAME>] info = guest.get_job_info()
2022-12-14 15:13:41.874 2414856 ERROR nova.compute.manager [instance: <INSTANCE_UUID> <VM_NAME>] File "/opt/nova/lib64/python3.8/site-packages/nova/virt/libvirt/guest.py", line 655, in get_job_info
2022-12-14 15:13:41.874 2414856 ERROR nova.compute.manager [instance: <INSTANCE_UUID> <VM_NAME>] stats = self._domain.jobStats()
2022-12-14 15:13:41.874 2414856 ERROR nova.compute.manager [instance: <INSTANCE_UUID> <VM_NAME>] File "/opt/nova/lib64/python3.8/site-packages/eventlet/tpool.py", line 193, in doit
2022-12-14 15:13:41.874 2414856 ERROR nova.compute.manager [instance: <INSTANCE_UUID> <VM_NAME>] result = proxy_call(self._autowrap, f, *args, **kwargs)
2022-12-14 15:13:41.874 2414856 ERROR nova.compute.manager [instance: <INSTANCE_UUID> <VM_NAME>] File "/opt/nova/lib64/python3.8/site-packages/eventlet/tpool.py", line 151, in proxy_call
2022-12-14 15:13:41.874 2414856 ERROR nova.compute.manager [instance: <INSTANCE_UUID> <VM_NAME>] rv = execute(f, *args, **kwargs)
2022-12-14 15:13:41.874 2414856 ERROR nova.compute.manager [instance: <INSTANCE_UUID> <VM_NAME>] File "/opt/nova/lib64/python3.8/site-packages/eventlet/tpool.py", line 132, in execute
2022-12-14 15:13:41.874 2414856 ERROR nova.compute.manager [instance: <INSTANCE_UUID> <VM_NAME>] six.reraise(c, e, tb)
2022-12-14 15:13:41.874 2414856 ERROR nova.compute.manager [instance: <INSTANCE_UUID> <VM_NAME>] File "/opt/nova/lib64/python3.8/site-packages/six.py", line 719, in reraise
2022-12-14 15:13:41.874 2414856 ERROR nova.compute.manager [instance: <INSTANCE_UUID> <VM_NAME>] raise value
2022-12-14 15:13:41.874 2414856 ERROR nova.compute.manager [instance: <INSTANCE_UUID> <VM_NAME>] File "/opt/nova/lib64/python3.8/site-packages/eventlet/tpool.py", line 86, in tworker
2022-12-14 15:13:41.874 2414856 ERROR nova.compute.manager [instance: <INSTANCE_UUID> <VM_NAME>] rv = meth(*args, **kwargs)
2022-12-14 15:13:41.874 2414856 ERROR nova.compute.manager [instance: <INSTANCE_UUID> <VM_NAME>] File "/usr/lib64/python3.8/site-packages/libvirt.py", line 1742, in jobStats
2022-12-14 15:13:41.874 2414856 ERROR nova.compute.manager [instance: <INSTANCE_UUID> <VM_NAME>] raise libvirtError('virDomainGetJobStats() failed')
2022-12-14 15:13:41.874 2414856 ERROR nova.compute.manager [instance: <INSTANCE_UUID> <VM_NAME>] libvirt.libvirtError: internal error: client socket is closed
2022-12-14 15:13:41.874 2414856 ERROR nova.compute.manager [instance: <INSTANCE_UUID> <VM_NAME>]

Changed in nova:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote: Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/nova/+/877582
