live-migration failed due to invalid value of cpu set

Bug #1440981 reported by Eli Qiao
This bug affects 6 people
Affects: OpenStack Compute (nova)
Status: In Progress
Importance: Undecided
Assigned to: Eli Qiao

Bug Description

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 457, in fire_timers
    timer()
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/timer.py", line 58, in __call__
    cb(*args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/event.py", line 168, in _do_send
    waiter.switch(result)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 214, in main
    result = function(*args, **kwargs)
  File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 5433, in _live_migration_operation
    instance=instance)
  File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 85, in __exit__
    six.reraise(self.type_, self.value, self.tb)
  File "/opt/stack/nova/nova/virt/libvirt/driver.py", line 5402, in _live_migration_operation
    CONF.libvirt.live_migration_bandwidth)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 183, in doit
    result = proxy_call(self._autowrap, f, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 141, in proxy_call
    rv = execute(f, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 122, in execute
    six.reraise(c, e, tb)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/tpool.py", line 80, in tworker
    rv = meth(*args, **kwargs)
  File "/usr/lib/python2.7/dist-packages/libvirt.py", line 1582, in migrateToURI2
    if ret == -1: raise libvirtError ('virDomainMigrateToURI2() failed', dom=self)
libvirtError: Invalid value '0-15' for 'cpuset.cpus': Invalid argument

Steps to reproduce:

There are 2 compute hosts:
hostA: 16 CPUs
hostB: 4 CPUs

1. Create an instance test1 (which runs on hostA).
2. Live-migrate test1 from hostA to hostB.

Expected result:
test1 is live-migrated to hostB.

Actual result:
The migration fails because the cpuset value '0-15' is invalid.

Findings:
A workaround is to set vcpu_pin_set = "0-3" in the [DEFAULT] section of nova.conf on hostA.
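
To illustrate why '0-15' is rejected while "0-3" works, here is a minimal Python 3 sketch that expands a cpuset string and checks it against a host's CPU count. The function names parse_cpuset and fits_on_host are hypothetical and are not taken from nova or libvirt. The sketch assumes the host's CPUs are numbered 0..N-1 and are all online, and it only handles comma-separated ids and ranges.

```python
def parse_cpuset(spec):
    """Expand a cpuset string such as "0-3,8" into a set of CPU ids.

    Hypothetical helper for illustration; handles only ids and ranges.
    """
    cpus = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                raise ValueError("invalid range: %r" % part)
            cpus.update(range(start, end + 1))
        else:
            cpus.add(int(part))
    return cpus


def fits_on_host(spec, host_cpu_count):
    """Return True if every CPU in spec exists on a host with CPUs 0..N-1.

    Assumption: the host's CPUs are contiguous and all online.
    """
    return parse_cpuset(spec) <= set(range(host_cpu_count))


# hostA has 16 CPUs, hostB has 4 (values from the report above).
print(fits_on_host("0-15", 4))  # cpuset generated on hostA, checked on hostB
print(fits_on_host("0-3", 4))   # cpuset after applying the workaround
```

With the workaround in place, the cpuset only names CPUs that also exist on hostB, which is consistent with the migration then succeeding.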

Eli Qiao (taget-9)
Changed in nova:
assignee: nobody → Eli Qiao (taget-9)
Markus Zoeller (markus_z) (mzoeller) wrote :

I see the same behavior in a System z (arch = s390x) environment.

host A: 2 CPUs (all online)
host B: 4 CPUs (all online)

The instance was launched from an image, with an "instances_path" that is backed by shared storage.

The mentioned workaround (in my case vcpu_pin_set = "0-1") works for me too.

The status of the nova code:
$ git log --oneline -n10
    bf70df2 Merge "neutronv2: only create client once when adding/removing fixed IPs"
    ad329b0 Merge "libvirt: remove volume_drivers config param"
    563af55 Merge "Cancel all waiting events during compute node shutdown"
    cd24e14 Merge "Merge baremetal_nodes func tests between V2 and V2.1"
    6c1b8e0 Merge "Share V2 and V2.1 tenant-networks functional tests"
    a175a07 Merge "Merge sec grp default rules tests between V2 and V2.1"
    983543c Merge "Merge instance_usage_audit_log tests between V2 and V2.1"
    7e77ee3 Merge "Share V2 and V2.1 hosts functional tests"
    300f1fc Merge "Share migrations tests between V2 and V2.1"
    9b674fc Merge "Merging instance_actions tests between V2 and V2.1"

A side note: for this platform I had to use this patch: https://review.openstack.org/#/c/166130/

Eli Qiao (taget-9) wrote :

Some updates. We can fix this issue in one of two ways:

1. Disable this live-migration scenario. The check should be done in can_live_migration in conductor/task/live_migrate.py so that the exception is raised early (or we add a new option to enable/disable live migration).

2. Change the cpuset in the domain XML on the fly to allow this migration.
   We can do this in pre_live_migration on the destination host.
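
As a rough sketch of the second option, the Python 3 snippet below rewrites the cpuset attributes in a libvirt domain XML string so that they name CPUs available on the destination host. The function name rewrite_cpuset and the sample XML are hypothetical, not the proposed nova patch; how the rewritten XML would be handed to the migration call is not shown here. The sketch assumes the pinning lives in the vcpu element and in the vcpupin and emulatorpin elements under cputune.

```python
import xml.etree.ElementTree as ET


def rewrite_cpuset(domain_xml, dest_cpuset):
    """Return domain_xml with every CPU pinning set to dest_cpuset.

    Hypothetical helper: dest_cpuset is a string such as "0-3" that is
    valid on the destination host.
    """
    root = ET.fromstring(domain_xml)

    # <vcpu cpuset='...'> holds the default pinning for the guest.
    vcpu = root.find("vcpu")
    if vcpu is not None and "cpuset" in vcpu.attrib:
        vcpu.set("cpuset", dest_cpuset)

    # <cputune> may carry per-vCPU and emulator thread pinning.
    for pin in root.findall("./cputune/vcpupin"):
        pin.set("cpuset", dest_cpuset)
    emulator = root.find("./cputune/emulatorpin")
    if emulator is not None:
        emulator.set("cpuset", dest_cpuset)

    return ET.tostring(root, encoding="unicode")


# Sample XML made up for illustration, using the instance name from the report.
xml_in = (
    "<domain type='kvm'><name>test1</name>"
    "<vcpu placement='static' cpuset='0-15'>1</vcpu></domain>"
)
xml_out = rewrite_cpuset(xml_in, "0-3")
print(xml_out)
```

Setting every pin to the same destination cpuset is the simplest choice; it gives up any finer per-vCPU placement that the source host had.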

Markus Zoeller (markus_z) (mzoeller) wrote :

In reply to comment #2:
Disabling this scenario looks a bit hard to me. Option 2 seems to be more desirable from my point of view.

I have to double-check if this failure also happens when the instance is launched from a volume.

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/173729

Changed in nova:
status: New → In Progress
OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Joe Gordon (<email address hidden>) on branch: master
Review: https://review.openstack.org/173729
Reason: This review is > 4 weeks without comment, and failed Jenkins the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Yongfeng Du (dolpherdu) wrote :

One workaround is to not pin the CPUs to a specific cpuset. But binding CPUs within one NUMA node is part of the design of NUMA support, for performance, so this is only a workaround.
https://ask.openstack.org/en/question/61485/libvirt-broken-cpuset-on-migration-workarounds/

I am adding this comment here because this does not seem to be an exact duplicate of the other issue.
