CPUPinningInvalid exception occurs when repeatedly evacuating one instance.

Bug #1723005 reported by Charlotte Han
This bug affects 3 people
Affects: OpenStack Compute (nova)
Status: Expired
Importance: High
Assigned to: Unassigned

Bug Description

Description
===========
Evacuating an instance that has a NUMA topology fails: the instance's vm_state becomes ERROR when the pin_cpus operation runs during the rebuild claim. The exception raised is CPUPinningInvalid.

Steps to reproduce
==================
I wrote a monitor process to evacuate instances automatically. It detects compute nodes whose nova-compute service is down and evacuates the instances running on them. While running this process in an automated test, some instances ended up in the ERROR state after being evacuated.
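
A minimal sketch of such a monitor loop, assuming python-novaclient with a keystoneauth1 session; the credentials, endpoint, and poll interval are placeholders, not the reporter's actual script:

# Hypothetical reconstruction of the monitor process described above;
# the real script is not attached to this report.
import time

from keystoneauth1 import loading
from keystoneauth1 import session
from novaclient import client

loader = loading.get_plugin_loader('password')
auth = loader.load_from_options(
    auth_url='http://controller:5000/v3', username='admin',
    password='secret', project_name='admin',
    user_domain_id='default', project_domain_id='default')
nova = client.Client('2', session=session.Session(auth=auth))

while True:
    for svc in nova.services.list(binary='nova-compute'):
        if svc.state != 'down':
            continue
        # Evacuate everything still reported on the dead host; the
        # scheduler picks the target and shared storage is assumed.
        for server in nova.servers.list(
                search_opts={'host': svc.host, 'all_tenants': 1}):
            nova.servers.evacuate(server)
    time.sleep(60)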

$ nova list --all-tenants | grep wdl_chongsheng_vm-2
| c90a1a71-4c5b-418a-b513-907ee1c956a0 | wdl_chongsheng_vm-2 | e3ddf976a1654dd89cf03820cb55b946 | ERROR | - | Running | robot_test_network=192.168.1.147 |

Error logs:
2017-10-10 17:10:31.294 20488 INFO nova.compute.manager [req-36368859-16ae-4e4f-a1f4-03a559942ec5 - - - - -] [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0] Rebuilding instance
2017-10-10 17:10:31.360 20488 INFO nova.compute.claims [req-36368859-16ae-4e4f-a1f4-03a559942ec5 - - - - -] [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0] Attempting claim: memory 2048 MB, disk 0 GB, vcpus 2 CPU
2017-10-10 17:10:31.361 20488 INFO nova.compute.claims [req-36368859-16ae-4e4f-a1f4-03a559942ec5 - - - - -] [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0] Total memory: 63599 MB, used: 25600.00 MB
2017-10-10 17:10:31.361 20488 INFO nova.compute.claims [req-36368859-16ae-4e4f-a1f4-03a559942ec5 - - - - -] [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0] memory limit not specified, defaulting to unlimited
2017-10-10 17:10:31.361 20488 INFO nova.compute.claims [req-36368859-16ae-4e4f-a1f4-03a559942ec5 - - - - -] [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0] Total disk: 170 GB, used: 20.00 GB
2017-10-10 17:10:31.362 20488 INFO nova.compute.claims [req-36368859-16ae-4e4f-a1f4-03a559942ec5 - - - - -] [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0] disk limit not specified, defaulting to unlimited
2017-10-10 17:10:31.362 20488 INFO nova.compute.claims [req-36368859-16ae-4e4f-a1f4-03a559942ec5 - - - - -] [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0] Total vcpu: 32 VCPU, used: 8.00 VCPU
2017-10-10 17:10:31.362 20488 INFO nova.compute.claims [req-36368859-16ae-4e4f-a1f4-03a559942ec5 - - - - -] [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0] vcpu limit not specified, defaulting to unlimited
2017-10-10 17:10:31.399 20488 INFO nova.compute.claims [req-36368859-16ae-4e4f-a1f4-03a559942ec5 - - - - -] [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0] Claim successful
2017-10-10 17:10:31.461 20488 INFO nova.compute.resource_tracker [req-36368859-16ae-4e4f-a1f4-03a559942ec5 - - - - -] Updating from migration c90a1a71-4c5b-418a-b513-907ee1c956a0
2017-10-10 17:10:31.492 20488 ERROR nova.compute.manager [req-36368859-16ae-4e4f-a1f4-03a559942ec5 - - - - -] [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0] Setting instance vm_state to ERROR
2017-10-10 17:10:31.492 20488 ERROR nova.compute.manager [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0] Traceback (most recent call last):
2017-10-10 17:10:31.492 20488 ERROR nova.compute.manager [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 7419, in _error_out_instance_on_exception
2017-10-10 17:10:31.492 20488 ERROR nova.compute.manager [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0] yield
2017-10-10 17:10:31.492 20488 ERROR nova.compute.manager [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0] File "/usr/lib/python2.7/site-packages/nova/compute/manager.py", line 2990, in rebuild_instance
2017-10-10 17:10:31.492 20488 ERROR nova.compute.manager [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0] migration=migration)
2017-10-10 17:10:31.492 20488 ERROR nova.compute.manager [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0] File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 271, in inner
2017-10-10 17:10:31.492 20488 ERROR nova.compute.manager [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0] return f(*args, **kwargs)
2017-10-10 17:10:31.492 20488 ERROR nova.compute.manager [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0] File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 235, in rebuild_claim
2017-10-10 17:10:31.492 20488 ERROR nova.compute.manager [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0] image_meta=image_meta, migration=migration)
2017-10-10 17:10:31.492 20488 ERROR nova.compute.manager [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0] File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 299, in _move_claim
2017-10-10 17:10:31.492 20488 ERROR nova.compute.manager [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0] migration)
2017-10-10 17:10:31.492 20488 ERROR nova.compute.manager [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0] File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 913, in _update_usage_from_migration
2017-10-10 17:10:31.492 20488 ERROR nova.compute.manager [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0] self._update_usage(usage)
2017-10-10 17:10:31.492 20488 ERROR nova.compute.manager [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0] File "/usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py", line 821, in _update_usage
2017-10-10 17:10:31.492 20488 ERROR nova.compute.manager [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0] self.compute_node, usage, free)
2017-10-10 17:10:31.492 20488 ERROR nova.compute.manager [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0] File "/usr/lib/python2.7/site-packages/nova/virt/hardware.py", line 1784, in get_host_numa_usage_from_instance
2017-10-10 17:10:31.492 20488 ERROR nova.compute.manager [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0] host_numa_topology, instance_numa_topology, free=free))
2017-10-10 17:10:31.492 20488 ERROR nova.compute.manager [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0] File "/usr/lib/python2.7/site-packages/nova/virt/hardware.py", line 1651, in numa_usage_from_instances
2017-10-10 17:10:31.492 20488 ERROR nova.compute.manager [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0] newcell.pin_cpus(pinned_cpus)
2017-10-10 17:10:31.492 20488 ERROR nova.compute.manager [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0] File "/usr/lib/python2.7/site-packages/nova/objects/numa.py", line 90, in pin_cpus
2017-10-10 17:10:31.492 20488 ERROR nova.compute.manager [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0] pinned=list(self.pinned_cpus))
2017-10-10 17:10:31.492 20488 ERROR nova.compute.manager [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0] CPUPinningInvalid: Cannot pin/unpin cpus [8, 28] from the following pinned set [1, 5, 8, 9, 21, 25, 28, 29]
2017-10-10 17:10:31.492 20488 ERROR nova.compute.manager [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0]
2017-10-10 17:10:31.617 20488 INFO nova_patch.compute.utils [req-36368859-16ae-4e4f-a1f4-03a559942ec5 - - - - -] Report alarm instance recover error. Details: alarm_instance_recover_error, instance_name: wdl_chongsheng_vm-2, instance_id: c90a1a71-4c5b-418a-b513-907ee1c956a0, action: rebuild_instance, result: True
2017-10-10 17:10:31.778 20488 INFO nova.compute.manager [req-36368859-16ae-4e4f-a1f4-03a559942ec5 - - - - -] [instance: c90a1a71-4c5b-418a-b513-907ee1c956a0] Successfully reverted task state from rebuilding on failure for instance.
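
The check that raises here is NUMACell.pin_cpus() in nova/objects/numa.py (the frame at line 90 above). A simplified paraphrase of the pre-fix logic, fed with the state from the traceback; this is an approximation, not nova's exact code:

# Simplified paraphrase of NUMACell.pin_cpus() as of the version in
# this traceback; an approximation, not nova's exact code.
class CPUPinningInvalid(Exception):
    pass


class NUMACell(object):
    def __init__(self, cpuset):
        self.cpuset = set(cpuset)     # all pCPUs in this host cell
        self.pinned_cpus = set()      # pCPUs already claimed by instances

    def pin_cpus(self, cpus):
        # Pinning fails when any requested pCPU is already pinned --
        # the state the resource tracker hits above.
        if self.pinned_cpus & cpus:
            raise CPUPinningInvalid(
                'Cannot pin/unpin cpus %s from the following pinned '
                'set %s' % (sorted(cpus), sorted(self.pinned_cpus)))
        self.pinned_cpus |= cpus


# Reproduces the message from the log above:
cell = NUMACell(range(32))
cell.pinned_cpus = {1, 5, 8, 9, 21, 25, 28, 29}
cell.pin_cpus({8, 28})   # CPUPinningInvalid: [8, 28] already pinned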

Expected result
===============
Evacuation succeeds every time when repeatedly evacuating an instance whose task_state is None and whose host is down.

Actual result
=============
Evacuation failed and the instance's vm_state became ERROR.

Environment
===========
1. Exact version of OpenStack you are running. See the following
  list for all releases: http://docs.openstack.org/releases/

  git log -1
  commit 219c2660cdc936c9d1469d7629645e05a511fbf0
  Merge: 6b8bbb1 3a19f89
  Author: Jenkins <email address hidden>
  Date: Wed Oct 11 02:32:39 2017 +0000

    Merge "Fix minor input items from previous patches"

2. Which hypervisor did you use?
    Libvirt + KVM

Charlotte Han (hanrong)
Changed in nova:
importance: Undecided → High
description: updated
tags: added: libvirt numa
Changed in nova:
status: New → Confirmed
tags: added: openstack-version.pike
Revision history for this message
Eli Qiao (taget-9) wrote:

Hi,

I don't think the environment you provided is correct.

Environment
===========
1. Exact version of OpenStack you are running. See the following
  list for all releases: http://docs.openstack.org/releases/

  git log -1
  commit 219c2660cdc936c9d1469d7629645e05a511fbf0
  Merge: 6b8bbb1 3a19f89
  Author: Jenkins <email address hidden>
  Date: Wed Oct 11 02:32:39 2017 +0000

    Merge "Fix minor input items from previous patches"

From the error in your log:

"CPUPinningInvalid: Cannot pin/unpin cpus [8, 28] from the following pinned set [1, 5, 8, 9, 21, 25, 28, 29]"

that message format should come from a version older than:

commit b10948913dc27b46114ad80e734aa015327fc1cf
Author: Sergey Nikitin <email address hidden>
Date: Thu Apr 14 11:17:13 2016 +0300

    Added better error messages during (un)pinning CPUs

    Error messages during pinning and unpinning are same.
    It's a confusing especially during migration for example.
    You got message like "Cannot pin/unpin cpus [17] from the
    following pinned set [3]" and you can't understand is it
    a problem with removing a VM from old compute node
    or with booting a VM on new compute node.

    We should split these two cases.

    Change-Id: Ic70ecdfa414fd8558d453fbf1c21640c8acc3ea1
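
After that change, pinning and unpinning raise distinct exceptions with distinct wording, which is why the combined "pin/unpin" message in the log dates the reporter's tree to before this commit. Roughly, reusing the NUMACell sketch from the log section above (paraphrased wording, not nova's verbatim code):

# Paraphrase of the split introduced by the commit above; the message
# wording approximates nova/exception.py after the change.
class CPUUnpinningInvalid(Exception):
    pass


class NUMACellAfterFix(NUMACell):     # NUMACell from the sketch above
    def pin_cpus(self, cpus):
        if self.pinned_cpus & cpus:
            raise CPUPinningInvalid(
                'CPU set to pin %s must be a subset of free CPU set %s'
                % (sorted(cpus),
                   sorted(self.cpuset - self.pinned_cpus)))
        self.pinned_cpus |= cpus

    def unpin_cpus(self, cpus):
        if (self.pinned_cpus & cpus) != cpus:
            raise CPUUnpinningInvalid(
                'CPU set to unpin %s must be a subset of pinned CPU '
                'set %s' % (sorted(cpus), sorted(self.pinned_cpus)))
        self.pinned_cpus -= cpus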

Revision history for this message
Eli Qiao (taget-9) wrote:

It seems that when evacuating the instance to a new host, nova-compute calculates a wrong CPU pin set.

Can you still reproduce it on the latest master branch?
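
A minimal illustration of that suspicion, reusing the simplified NUMACell sketch from the log section above (hypothetical; the real accounting path goes through _update_usage_from_migration()): if the rebuild claim applies the same instance's pinned CPUs to the destination cell a second time, the check trips.

# Hypothetical double accounting during a repeated evacuate: the same
# instance's pinned CPUs are applied to the destination cell once as
# existing usage and again for the rebuild claim.
cell = NUMACell(range(32))
cell.pin_cpus({8, 28})   # first accounting succeeds
cell.pin_cpus({8, 28})   # second accounting raises CPUPinningInvalid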

Eli Qiao (taget-9)
Changed in nova:
status: Confirmed → Incomplete
Revision history for this message
Launchpad Janitor (janitor) wrote:

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired