Live Migration - if libvirt times out, the instance goes to error state but the live migration continues

Bug #1924585 reported by Belmiro Moreira
Affects: OpenStack Compute (nova)
Status: Triaged
Importance: High
Assigned to: Unassigned

Bug Description

Recently we live migrated an entire cell to new hardware and we hit the following problem several times...

During a live migration, Nova monitors the state of the migration by querying libvirt every 0.5 seconds:

https://github.com/openstack/nova/blob/5eab13030bc2708c8900f7ac1bdbc8a111f5f823/nova/virt/libvirt/driver.py#L9452
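For illustration, here is a minimal sketch of that polling pattern (not the actual Nova code; names and structure are simplified):

```
import time

import libvirt


def monitor_migration(dom):
    """Poll libvirt job stats for a migrating domain every 0.5s."""
    while True:
        # A single libvirt timeout here ("cannot acquire state change
        # lock") raises libvirt.libvirtError. In Nova, that exception
        # aborts the monitor and sends the instance to ERROR, even
        # though libvirt/QEMU keep migrating in the background.
        info = dom.jobStats()
        if info.get('type') == libvirt.VIR_DOMAIN_JOB_NONE:
            break  # no active job: the migration has finished
        time.sleep(0.5)
```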

If libvirt times out, the instance is left in a very bad state...
The instance goes to error state. As far as Nova is concerned, the instance is still on the source compute node. However, libvirt continues with the live migration, which will eventually complete on the destination compute node.

I'm using the Stein release, but looking at the current release the code path seems to be the same.

Here's the Stein trace:

```
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 6796, in _do_live_migration
    block_migration, migrate_data)
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7581, in live_migration
    migrate_data)
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 8068, in _live_migration
    finish_event, disk_paths)
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/driver.py", line 7873, in _live_migration_monitor
    info = guest.get_job_info()
  File "/usr/lib/python2.7/site-packages/nova/virt/libvirt/guest.py", line 705, in get_job_info
    stats = self._domain.jobStats()
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 190, in doit
    result = proxy_call(self._autowrap, f, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 148, in proxy_call
    rv = execute(f, *args, **kwargs)
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 129, in execute
    six.reraise(c, e, tb)
  File "/usr/lib/python2.7/site-packages/eventlet/tpool.py", line 83, in tworker
    rv = meth(*args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1433, in jobStats
    if ret is None: raise libvirtError ('virDomainGetJobStats() failed', dom=self)
libvirtError: Timed out during operation: cannot acquire state change lock (held by monitor=remoteDispatchDomainMemoryStats)
```
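While Nova shows the instance in error state, the still-running migration job can be observed on the source host with libvirt-python, using the same jobStats() call that timed out above (the domain name below is a placeholder):

```
import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('instance-0000abcd')  # placeholder domain name
stats = dom.jobStats()  # the call that timed out in the traceback
# Any job type other than VIR_DOMAIN_JOB_NONE means libvirt still has
# an active job for this domain, i.e. the migration is still running.
print(stats.get('type') != libvirt.VIR_DOMAIN_JOB_NONE)
```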

sean mooney (sean-k-mooney) wrote :

I think going into the error state is still correct unless we can somehow recover later.

Do you know if we ever get to post_live_migration? If so, then

https://review.opendev.org/c/openstack/nova/+/791135 should fix it,
and this is likely just another example of https://bugs.launchpad.net/nova/+bug/1628606

We could likely make this more robust: a timeout on a single iteration of polling the job stats should not be sufficient to break the migration.
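A rough sketch of that idea (the threshold and retry interval here are made up for illustration, not taken from any patch):

```
import time

import libvirt

# Hypothetical threshold: how many consecutive failed polls to
# tolerate before giving up on the migration.
MAX_CONSECUTIVE_FAILURES = 10


def get_job_info_tolerant(dom):
    """Fetch libvirt job stats, retrying transient lock timeouts."""
    failures = 0
    while True:
        try:
            return dom.jobStats()
        except libvirt.libvirtError:
            failures += 1
            if failures >= MAX_CONSECUTIVE_FAILURES:
                raise  # persistent failure: let the caller abort
            time.sleep(0.5)  # transient timeout: retry the poll
```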

tags: added: libvirt live-migration
sean mooney (sean-k-mooney) wrote :

I would triage this as medium since, while this is annoying, especially when you are upgrading and migrating a large number of VMs, live migration is an admin-only operation, so normal users cannot get into this state and end up with the VM in error.
That said, we don't want operators to have to modify the database to fix this, and there is a potential for data corruption if you hard reboot the VM while it is on shared storage and the host record is incorrect, so setting to High.

Changed in nova:
importance: Undecided → High
status: New → Triaged
sean mooney (sean-k-mooney) wrote :

We have a similar downstream issue, by the way: https://bugzilla.redhat.com/show_bug.cgi?id=1959759
