VM becomes ERROR after confirming a resize, with error CPUUnpinningInvalid on the source node
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Fix Released | Medium | Stephen Finucane |
Train | Fix Released | Undecided | Stephen Finucane |
Ussuri | Fix Released | Undecided | Stephen Finucane |
Bug Description
Description
===========
In my environment it can take some time to clean up the VM on the source node when a resize is confirmed. If the periodic task update_available_resource runs during the confirm-resize process, it may cause an ERROR like CPUUnpinningInvalid.
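
To make the failure mode concrete, here is a minimal, self-contained sketch (not nova's actual code; all names are illustrative) of why the second unpin fails with this error:

# Minimal sketch (NOT nova's code) of the double-unpin race behind
# CPUUnpinningInvalid. Names and structure are illustrative only.

class CPUUnpinningInvalid(Exception):
    pass

class HostNUMATopology:
    """Simplified per-host bookkeeping of pinned CPUs."""

    def __init__(self, pinned):
        self.pinned = set(pinned)

    def unpin_cpus(self, cpus):
        cpus = set(cpus)
        if not cpus.issubset(self.pinned):
            # Mirrors the real error text: "CPU set to unpin ... must be
            # a subset of pinned CPU set ..."
            raise CPUUnpinningInvalid(
                "CPU set to unpin %s must be a subset of pinned CPU set %s"
                % (sorted(cpus), sorted(self.pinned)))
        self.pinned -= cpus

host = HostNUMATopology(pinned=[1, 2, 17, 18])

# The periodic task notices the instance is gone from the source node
# and releases its pins first ...
host.unpin_cpus([1, 2, 17, 18])

# ... then confirm-resize tries to drop the same claim again and fails:
host.unpin_cpus([1, 2, 17, 18])  # raises CPUUnpinningInvalid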
Steps to reproduce
==================
* Set update_resources_interval in /etc/nova/nova.conf on the source node so the update_available_resource periodic task runs frequently.
* Create a "dedicated" VM; the flavor can be:
+----------------------------+--------------------------------+
| Property                   | Value                          |
+----------------------------+--------------------------------+
| OS-FLV-DISABLED:disabled   | ...                            |
| OS-FLV-EXT-DATA:ephemeral  | ...                            |
| disk                       | 80                             |
| extra_specs                | {"hw:cpu_policy": "dedicated"} |
| id                         | 2be0f830-...                   |
| name                       | 4vcpu.4mem...                  |
| os-flavor-access:is_public | ...                            |
| ram                        | 4096                           |
| rxtx_factor                | 1.0                            |
| swap                       |                                |
| vcpus                      | 4                              |
+----------------------------+--------------------------------+
* Resize the VM to another node with a new flavor.
* Confirm the resize. Make sure it takes some time to undefine the VM on the source node; a delay of 30 seconds makes the failure inevitable.
* You will then see the ERROR notice on the dashboard, and the VM goes to ERROR state. (A scripted version of these steps is sketched below.)
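
For reference, the resize and confirm steps can be scripted with openstacksdk. A rough sketch, assuming placeholder cloud, server and flavor names:

import openstack

# Hypothetical reproduction using openstacksdk; the cloud, server and
# flavor names below are placeholders for your environment.
conn = openstack.connect(cloud='mycloud')  # assumes a matching clouds.yaml entry

server = conn.compute.find_server('pinned-vm')            # the "dedicated" VM
new_flavor = conn.compute.find_flavor('8vcpu.8mem.numa')  # target flavor

# Resize the VM (the scheduler moves it to another node) ...
conn.compute.resize_server(server, new_flavor.id)
server = conn.compute.wait_for_server(server, status='VERIFY_RESIZE')

# ... and confirm. If the source-node cleanup is slow, this is the window
# in which update_available_resource races with the unpin.
conn.compute.confirm_server_resize(server)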
Expected result
===============
The VM is resized successfully and its state is ACTIVE.
Actual result
=============
* The VM goes to ERROR state.
* On the dashboard you can see this notice:
Please try again later [Error: CPU set to unpin [1, 2, 18, 17] must be a subset of pinned CPU set []].
Environment
===========
1. Exact version of OpenStack you are running.
Newton with the patch https://...
I am sure it will also happen on newer versions that include https://..., such as Train and Ussuri.
2. Which hypervisor did you use?
Libvirt + KVM
3. Which storage type did you use?
local disk
4. Which networking type did you use?
Neutron with Open vSwitch
Logs & Configs
==============
ERROR log on source node
2020-05-15 10:11:12.324 425843 ERROR nova.compute... [traceback truncated; it ends in the CPUUnpinningInvalid error quoted above]
Changed in nova:
assignee: nobody → kevinzhao (kego)
tags: added: resize
tags: added: numa
This looks like the resize just hit a race condition.
When CPU pinning is used, CPUs are not claimed in placement but in the compute host's NUMA topology blob, so it is perfectly feasible for the claim on a compute node to fail even though the scheduler and placement claims passed.
Marking this as Incomplete, as we don't know which release of nova this is being reported against.
If it's pre-Train, this is invalid, as we expect this to race. If it's Train or later, we should look at why we are not doing the claim as part of the resize-verify step; we should be, in which case this may be valid.
In general, however, we cannot fully resolve this until we have NUMA in placement, as there will always be a race between the scheduler and the compute node resource tracker. But even today there should not be a race between resize-verify and the periodic task, unless the VM raced with another VM for the resources on the host.
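
For illustration, the general shape of a fix is to serialize the confirm path and the periodic task on the same lock, so the confirm path always sees a consistent pinned set. A minimal sketch of that idea, assuming a plain threading.Lock in place of nova's compute-resources semaphore (this is not nova's actual implementation):

import threading

# Rough illustration (not nova's code) of the serialization idea: the
# periodic task and the confirm path take the same lock, so neither can
# observe the other's half-finished bookkeeping.
resource_lock = threading.Lock()  # stand-in for nova's compute-resources semaphore
pinned = {1, 2, 17, 18}           # pins held for the migrated instance

def update_available_resource():
    # Periodic task: recomputes host usage from the tracked instances and
    # migrations. Under the lock it cannot interleave with a confirm that
    # is already dropping the move claim.
    with resource_lock:
        pass  # usage recomputation elided

def confirm_resize(cpus):
    # Confirm path: drops the move claim exactly once, under the same lock.
    with resource_lock:
        assert set(cpus) <= pinned, "would raise CPUUnpinningInvalid"
        pinned.difference_update(cpus)

t = threading.Thread(target=update_available_resource)
t.start()
confirm_resize([1, 2, 17, 18])
t.join()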