Live Migration - if libvirt times out, the instance goes to error state but the live migration continues

Bug #1924585 reported by Belmiro Moreira
Affects: OpenStack Compute (nova)
Status: Triaged
Importance: High
Assigned to: Unassigned

Bug Description

Recently we live migrated an entire cell to new hardware and we hit the following problem several times...

During a live migration, Nova monitors the state of the migration by querying libvirt every 0.5 seconds:

https://github.com/openstack/nova/blob/5eab13030bc2708c8900f7ac1bdbc8a111f5f823/nova/virt/libvirt/driver.py#L9452
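For illustration, here is a minimal sketch of that polling pattern (not the actual Nova code; names and structure are simplified):

```
import time

import libvirt


def monitor_migration(dom):
    """Poll libvirt job stats for a migrating domain every 0.5s."""
    while True:
        # A single libvirt timeout here ("cannot acquire state change
        # lock") raises libvirt.libvirtError. In Nova, that exception
        # aborts the monitor and sends the instance to ERROR, even
        # though libvirt/QEMU keep migrating in the background.
        info = dom.jobStats()
        if info.get('type') == libvirt.VIR_DOMAIN_JOB_NONE:
            break  # no active job: the migration has finished
        time.sleep(0.5)
```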

If libvirt times out, the instance is left in a very bad state...
The instance goes to error state. As far as Nova is concerned, the instance is still on the source compute node. However, libvirt continues with the live migration, which will eventually complete on the destination compute node.

I'm using the Stein release, but looking at the current release the code path seems to be the same.

Here's the Stein trace:

```
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6796, in _do_live_migration
    block_migration, migrate_data)
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7581, in live_migration
    migrate_data)
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 8068, in _live_migration
    finish_event, disk_paths)
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7873, in _live_migration_monitor
    info = guest.get_job_info()
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 705, in get_job_info
    stats = self._domain.jobStats()
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 190, in doit
    result = proxy_call(self._autowrap, f, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 148, in proxy_call
    rv = execute(f, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 129, in execute
    six.reraise(c, e, tb)
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 83, in tworker
    rv = meth(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1433, in jobStats
    if ret is None: raise libvirtError ('virDomainGetJobStats() failed', dom=self)
libvirtError: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMemoryStats)
```
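While Nova shows the instance in error state, the still-running migration job can be observed on the source host with libvirt-python, using the same jobStats() call that timed out above (the domain name below is a placeholder):

```
import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('instance-0000abcd')  # placeholder domain name
stats = dom.jobStats()  # the call that timed out in the traceback
# Any job type other than VIR_DOMAIN_JOB_NONE means libvirt still has
# an active job for this domain, i.e. the migration is still running.
print(stats.get('type') != libvirt.VIR_DOMAIN_JOB_NONE)
```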

sean mooney (sean-k-mooney) wrote :

I think going into the error state is still correct unless we can somehow recover later.

Do you know if we ever get to post_live_migration? If so, then

https://review.opendev.org/c/openstack/nova/+/791135 should fix it,
and this is likely just another example of https://bugs.launchpad.net/nova/+bug/1628606

We could likely make this more robust: a timeout on a single iteration of polling the job stats should not be sufficient to break the migration.
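A rough sketch of that idea (the threshold and retry interval here are made up for illustration, not taken from any patch):

```
import time

import libvirt

# Hypothetical threshold: how many consecutive failed polls to
# tolerate before giving up on the migration.
MAX_CONSECUTIVE_FAILURES = 10


def get_job_info_tolerant(dom):
    """Fetch libvirt job stats, retrying transient lock timeouts."""
    failures = 0
    while True:
        try:
            return dom.jobStats()
        except libvirt.libvirtError:
            failures += 1
            if failures >= MAX_CONSECUTIVE_FAILURES:
                raise  # persistent failure: let the caller abort
            time.sleep(0.5)  # transient timeout: retry the poll
```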

tags: added: libvirt live-migration
sean mooney (sean-k-mooney) wrote :

I would triage this as medium since, while this is annoying, especially when you are upgrading and migrating a large number of VMs, live migration is an admin-only operation, so normal users cannot get into this state and end up with the VM in error.
That said, we don't want operators to have to modify the database to fix this, and there is a potential for data corruption if you hard reboot the VM while it is on shared storage and the host record is incorrect, so setting to High.

Changed in nova:
importance: Undecided → High
status: New → Triaged
sean mooney (sean-k-mooney) wrote :

We have a similar downstream issue, by the way: https://bugzilla.redhat.com/show_bug.cgi?id=1959759
