Live migration of realtime instances is broken
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Fix Released | Medium | Stephen Finucane |
Train | Fix Released | Undecided | Lee Yarwood |
Ussuri | Fix Released | Undecided | Unassigned |
Bug Description
Attempting to live migrate an instance with realtime enabled fails on master (commit d4c857dfcb1). This appears to be a bug with the live migration of pinned instances feature introduced in Train.
# Steps to reproduce
Create a server using realtime attributes and then attempt to live migrate it. For example:
$ openstack flavor create --ram 1024 --disk 0 --vcpu 4 \
--property 'hw:cpu_
--property 'hw:cpu_
--property 'hw:cpu_
realtime
$ openstack server create --os-compute-
--flavor realtime --image cirros-
--boot-
test.realtime
$ openstack server migrate --live-migration test.realtime
# Expected result
Instance should be live migrated.
# Actual result
The live migration never happens. Looking at the logs we see the following error:
Traceback (most recent call last):
File "/usr/local/
timer()
File "/usr/local/
cb(*args, **kw)
File "/usr/local/
waiter.
File "/usr/local/
result = function(*args, **kwargs)
File "/opt/stack/
return func(*args, **kwargs)
File "/opt/stack/
# is still ongoing, or failed
File "/usr/local/
self.
File "/usr/local/
six.
File "/usr/local/
raise value
File "/opt/stack/
# 2. src==running, dst==paused
File "/opt/stack/
destination, params=params, flags=flags)
File "/usr/local/
result = proxy_call(
File "/usr/local/
rv = execute(f, *args, **kwargs)
File "/usr/local/
six.
File "/usr/local/
raise value
File "/usr/local/
rv = meth(*args, **kwargs)
File "/usr/local/
if ret == -1: raise libvirtError ('virDomainMigr
libvirt.
Looking further, we see there are issues with the XML we are generating for the destination. Compare what we have on the source before updating the XML for the destination:
DEBUG nova.virt.
...
<cputune>
<
<vcpupin vcpu="0" cpuset="0"/>
<vcpupin vcpu="1" cpuset="1"/>
<vcpupin vcpu="2" cpuset="4"/>
<vcpupin vcpu="3" cpuset="5"/>
<emulatorpin cpuset="0-1"/>
<vcpusched vcpus="2" scheduler="fifo" priority="1"/>
<vcpusched vcpus="3" scheduler="fifo" priority="1"/>
</cputune>
...
</domain>
{{(pid=12600) _update_numa_xml /opt/stack/
To what we have after the update:
DEBUG nova.virt.
...
<cputune>
<
<vcpupin vcpu="0" cpuset="0"/>
<vcpupin vcpu="1" cpuset="1"/>
<vcpupin vcpu="2" cpuset="4"/>
<vcpupin vcpu="3" cpuset="5"/>
<emulatorpin cpuset="0-1"/>
<vcpusched vcpus="2-3" scheduler="fifo" priority="1"/>
<vcpusched vcpus="3" scheduler="fifo" priority="1"/>
</cputune>
...
</domain>
{{(pid=12600) _update_numa_xml /opt/stack/
The issue is the 'vcpusched' elements. We're assuming there is only one of these elements when updating the XML for the destination [1]. We have to figure out why there are multiple elements and how best to handle this (likely by deleting and recreating everything).
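To illustrate the failure mode, here is a minimal sketch (not nova's actual code) that reproduces the "after" XML from the "before" XML using only the standard library. The function names `naive_update`, `parse_cpu_spec` and `duplicated_vcpus` are hypothetical, and the `<cputune>` snippet keeps only the elements that are fully visible in the logs above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the source-side <cputune> from the log above.
SOURCE_CPUTUNE = """
<cputune>
  <vcpupin vcpu="0" cpuset="0"/>
  <vcpupin vcpu="1" cpuset="1"/>
  <vcpupin vcpu="2" cpuset="4"/>
  <vcpupin vcpu="3" cpuset="5"/>
  <emulatorpin cpuset="0-1"/>
  <vcpusched vcpus="2" scheduler="fifo" priority="1"/>
  <vcpusched vcpus="3" scheduler="fifo" priority="1"/>
</cputune>
"""


def naive_update(cputune, realtime_vcpus):
    """Update only the first <vcpusched>, assuming it is the only one."""
    vcpusched = cputune.find('vcpusched')  # find() returns the first match only
    if vcpusched is not None:
        vcpusched.set('vcpus', realtime_vcpus)
    return cputune


def parse_cpu_spec(spec):
    """Expand a spec such as '2-3,5' into the set {2, 3, 5}."""
    cpus = set()
    for part in spec.split(','):
        if '-' in part:
            low, high = part.split('-')
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return cpus


def duplicated_vcpus(cputune):
    """Return the vCPUs that appear in more than one <vcpusched> element."""
    seen, dupes = set(), set()
    for element in cputune.findall('vcpusched'):
        cpus = parse_cpu_spec(element.get('vcpus'))
        dupes |= seen & cpus
        seen |= cpus
    return dupes


cputune = ET.fromstring(SOURCE_CPUTUNE)
naive_update(cputune, '2-3')
print(duplicated_vcpus(cputune))  # {3}
```

After the naive update, the first element covers vCPUs 2-3 while the stale second element still covers vCPU 3, which matches the XML logged after `_update_numa_xml`.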
I suspect the reason we didn't spot this is because libvirt is rewriting the XML on us. This is what nova is providing libvirt upon boot:
DEBUG nova.virt.
...
<cputune>
<
<emulatorpin cpuset="0-1"/>
<vcpupin vcpu="0" cpuset="0"/>
<vcpupin vcpu="1" cpuset="1"/>
<vcpupin vcpu="2" cpuset="4"/>
<vcpupin vcpu="3" cpuset="5"/>
<vcpusched vcpus="2-3" scheduler="fifo" priority="1"/>
</cputune>
...
</domain>
{{(pid=12600) _get_guest_xml /opt/stack/
but that's changed by the time we get to recalculating things.
The solution is probably to remove all 'vcpusched' elements and recreate them, rather than trying to update stuff inline.
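A rough sketch of that remove-and-recreate approach follows, using the standard library's `xml.etree.ElementTree`. This is an illustration under stated assumptions, not the fix that was merged: `replace_vcpusched` is a hypothetical helper, and the `scheduler` and `priority` defaults are simply the values seen in the logs above.

```python
import xml.etree.ElementTree as ET


def replace_vcpusched(cputune, vcpus, scheduler='fifo', priority='1'):
    """Drop every <vcpusched> child and add a single fresh one.

    Hypothetical helper: 'vcpus' is a libvirt-style spec such as '2-3'.
    Pass an empty string or None to leave no <vcpusched> element at all.
    """
    # findall() returns a list, so removing children while looping is safe.
    for element in cputune.findall('vcpusched'):
        cputune.remove(element)
    if vcpus:
        ET.SubElement(cputune, 'vcpusched',
                      vcpus=vcpus, scheduler=scheduler, priority=priority)
    return cputune


# The shape libvirt reports on the source: one element per realtime vCPU.
cputune = ET.fromstring(
    '<cputune>'
    '<emulatorpin cpuset="0-1"/>'
    '<vcpusched vcpus="2" scheduler="fifo" priority="1"/>'
    '<vcpusched vcpus="3" scheduler="fifo" priority="1"/>'
    '</cputune>'
)
replace_vcpusched(cputune, '2-3')
print(ET.tostring(cputune, encoding='unicode'))
```

Because the result no longer depends on how many 'vcpusched' elements the source XML happened to contain, it behaves the same whether or not libvirt has split the original range into per-vCPU elements.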
[1] https:/
Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
assignee: nobody → Stephen Finucane (stephenfinucane)
tags: added: numa
tags: added: libvirt live-migration
tags: added: realtime
Fix proposed to branch: master
Review: https://review.opendev.org/743568