Deleting a CPU-pinned instance after changing vcpu_pin_set causes it to go to ERROR

Bug #1836945 reported by Artom Lifshitz
Affects: OpenStack Compute (nova)
Status: Triaged
Importance: Low
Assigned to: Unassigned
Milestone: (none)

Bug Description

Description
===========

If you boot an instance with pinned CPUs (for example by using the 'dedicated' CPU policy), change the vcpu_pin_set option on its compute host, then attempt to delete the instance, it will ERROR out instead of deleting successfully. Subsequent delete attempts work.

Steps to reproduce
==================

1. Configure vcpu_pin_set in nova-cpu.conf:
   [DEFAULT]
   vcpu_pin_set = 0,1

2. Create a flavor with a 'dedicated' CPU policy:
   openstack flavor create --ram 256 --disk 1 --vcpus 2 dedicated
   openstack flavor set --property hw:cpu_policy=dedicated dedicated

3. Boot a VM with that flavor:
   nova boot --nic none \
      --flavor <dedicated UUID> \
      --image 8288bd81-eb26-419a-8d4e-4481da137fd6 test

4. Change vcpu_pin_set:
   [DEFAULT]
   vcpu_pin_set = 3,4

5. Delete the instance (the sketch after these steps shows the CPU-set mismatch this creates):
   nova delete test
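
To make the failure mode concrete, here is a small standalone Python sketch of the vcpu_pin_set syntax and the set comparison that step 5 trips over. This is illustrative only, not nova's code (nova has its own parser in nova/virt/hardware.py):

    def parse_pin_set(spec):
        """Parse a vcpu_pin_set-style string ("0,1", "0-3,^2", ...) into a set of CPU ids."""
        include, exclude = set(), set()
        for chunk in (c.strip() for c in spec.split(',')):
            target = exclude if chunk.startswith('^') else include
            chunk = chunk.lstrip('^')
            if '-' in chunk:
                start, end = (int(x) for x in chunk.split('-'))
                target.update(range(start, end + 1))
            else:
                target.add(int(chunk))
        return include - exclude

    new_host_cpus = parse_pin_set("3,4")   # vcpu_pin_set after step 4
    instance_pinned = {0, 1}               # what the instance was pinned to under "0,1"

    # The instance's pins are no longer a subset of the host's usable CPUs,
    # which is exactly the condition the delete in step 5 trips over.
    print(instance_pinned.issubset(new_host_cpus))   # False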

Expected result
===============

The instance deletes successfully.

Actual result
=============

The instance goes into ERROR.

Environment
===========

master

Logs & Configs
==============

Traceback from nova-compute:

Jul 17 14:28:49 devstack-numa-allinone nova-compute[30309]: ERROR nova.compute.manager Traceback (most recent call last):
Jul 17 14:28:49 devstack-numa-allinone nova-compute[30309]: ERROR nova.compute.manager File "/opt/stack/nova/nova/compute/manager.py", line 8304, in _update_available_resource_for_node
Jul 17 14:28:49 devstack-numa-allinone nova-compute[30309]: ERROR nova.compute.manager startup=startup)
Jul 17 14:28:49 devstack-numa-allinone nova-compute[30309]: ERROR nova.compute.manager File "/opt/stack/nova/nova/compute/resource_tracker.py", line 747, in update_available_resource
Jul 17 14:28:49 devstack-numa-allinone nova-compute[30309]: ERROR nova.compute.manager self._update_available_resource(context, resources, startup=startup)
Jul 17 14:28:49 devstack-numa-allinone nova-compute[30309]: ERROR nova.compute.manager File "/usr/lib/python2.7/site-packages/oslo_concurrency/lockutils.py", line 328, in inner
Jul 17 14:28:49 devstack-numa-allinone nova-compute[30309]: ERROR nova.compute.manager return f(*args, **kwargs)
Jul 17 14:28:49 devstack-numa-allinone nova-compute[30309]: ERROR nova.compute.manager File "/opt/stack/nova/nova/compute/resource_tracker.py", line 788, in _update_available_resource
Jul 17 14:28:49 devstack-numa-allinone nova-compute[30309]: ERROR nova.compute.manager context, instances, nodename)
Jul 17 14:28:49 devstack-numa-allinone nova-compute[30309]: ERROR nova.compute.manager File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1327, in _update_usage_from_instances
Jul 17 14:28:49 devstack-numa-allinone nova-compute[30309]: ERROR nova.compute.manager self._update_usage_from_instance(context, instance, nodename)
Jul 17 14:28:49 devstack-numa-allinone nova-compute[30309]: ERROR nova.compute.manager File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1291, in _update_usage_from_instance
Jul 17 14:28:49 devstack-numa-allinone nova-compute[30309]: ERROR nova.compute.manager nodename, sign=sign)
Jul 17 14:28:49 devstack-numa-allinone nova-compute[30309]: ERROR nova.compute.manager File "/opt/stack/nova/nova/compute/resource_tracker.py", line 1107, in _update_usage
Jul 17 14:28:49 devstack-numa-allinone nova-compute[30309]: ERROR nova.compute.manager cn, usage, free)
Jul 17 14:28:49 devstack-numa-allinone nova-compute[30309]: ERROR nova.compute.manager File "/opt/stack/nova/nova/virt/hardware.py", line 2073, in get_host_numa_usage_from_instance
Jul 17 14:28:49 devstack-numa-allinone nova-compute[30309]: ERROR nova.compute.manager host_numa_topology, instance_numa_topology, free=free))
Jul 17 14:28:49 devstack-numa-allinone nova-compute[30309]: ERROR nova.compute.manager File "/opt/stack/nova/nova/virt/hardware.py", line 1929, in numa_usage_from_instances
Jul 17 14:28:49 devstack-numa-allinone nova-compute[30309]: ERROR nova.compute.manager newcell.pin_cpus(pinned_cpus)
Jul 17 14:28:49 devstack-numa-allinone nova-compute[30309]: ERROR nova.compute.manager File "/opt/stack/nova/nova/objects/numa.py", line 98, in pin_cpus
Jul 17 14:28:49 devstack-numa-allinone nova-compute[30309]: ERROR nova.compute.manager cpuset=list(self.cpuset))
Jul 17 14:28:49 devstack-numa-allinone nova-compute[30309]: ERROR nova.compute.manager CPUPinningUnknown: CPU set to pin [0, 1] must be a subset of known CPU set []
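
The check that raises here is pin_cpus() in nova/objects/numa.py, visible at the bottom of the traceback. A much-simplified, standalone model of it, using illustrative names rather than nova's actual objects:

    class CPUPinningUnknown(Exception):
        pass

    class HostNUMACellSketch(object):
        def __init__(self, cpuset):
            # CPUs this cell knows about, derived from the current vcpu_pin_set
            self.cpuset = set(cpuset)
            self.pinned_cpus = set()

        def pin_cpus(self, cpus):
            cpus = set(cpus)
            if not cpus.issubset(self.cpuset):
                raise CPUPinningUnknown(
                    "CPU set to pin %s must be a subset of known CPU set %s"
                    % (sorted(cpus), sorted(self.cpuset)))
            self.pinned_cpus |= cpus

    # After vcpu_pin_set is changed and nova-compute restarted, the rebuilt host
    # cell no longer contains the CPUs the instance is pinned to, so re-applying
    # the instance's usage while processing the delete blows up:
    cell = HostNUMACellSketch(cpuset=[])   # the "known CPU set []" from the log above
    cell.pin_cpus([0, 1])                  # raises CPUPinningUnknown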

Revision history for this message
Artom Lifshitz (notartom) wrote :

I should add, obviously nova-compute is restarted after step 4.

And I know this is low impact, since a second delete request will succeed, and it's an unlikely situation to begin with. But deleting an instance should always be a valid operation.

Revision history for this message
sean mooney (sean-k-mooney) wrote :

Personally I think this is partly user error. It would only happen if you change vcpu_pin_set in a way that invalidates existing running VMs. If the operator changes vcpu_pin_set without first draining the host of running instances, it is their responsibility to ensure the change does not break those instances.

On the other hand, for delete we probably should just delete the instance and not require a double delete. Modifying vcpu_pin_set can also invalidate the resource tracker's view of the host, making scheduling fail (or succeed) when it should not, and it can cause issues with the placement inventories.

Even with the current code it could cause the total number of allocations to exceed capacity * allocation ratio, preventing any additional VM from being scheduled to the host until usage is reduced to allowed levels.

I think this is one of those cases where we should say "never do this" and document how to safely adjust this config option. Note that since vcpu_pin_set is going away, we will likely want to document how to modify cpu_shared_set and cpu_dedicated_set too.

Triaging this as Low since, as you said, a second delete fixes the issue. A minimal fix to allow the first delete to work, plus a start-up warning when we detect that existing instances are no longer valid with the vcpu_pin_set, would be a nice enhancement.
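
A rough sketch of what such a start-up warning could look like; the inputs here are hypothetical and this is not an existing nova API:

    import logging

    LOG = logging.getLogger(__name__)

    def warn_on_stale_pinning(configured_pin_set, pinned_instances):
        """configured_pin_set: set of host CPU ids parsed from vcpu_pin_set.
        pinned_instances: iterable of (instance_uuid, set of pinned host CPU ids).
        """
        for uuid, pinned in pinned_instances:
            stale = pinned - configured_pin_set
            if stale:
                LOG.warning(
                    "Instance %s is pinned to host CPUs %s which are no longer "
                    "in vcpu_pin_set %s; operations on it may fail until the "
                    "config is corrected or the instance is moved.",
                    uuid, sorted(stale), sorted(configured_pin_set))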

Changed in nova:
importance: Undecided → Low
status: New → Triaged
tags: added: compute config
Revision history for this message
sean mooney (sean-k-mooney) wrote :

Note that technically vcpu_pin_set is read by the compute manager and should control the set of CPUs on a host that are valid across all drivers, so I have not added the libvirt tag. That said, this bug is related to the libvirt driver's implementation of pinning, so we could add the libvirt and numa tags too. It is also indirectly related to placement, in that changing the config value changes the set of cores reported to the resource tracker, which in turn changes the VCPU inventory reported to placement.
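
As a back-of-the-envelope illustration of that placement side effect (the numbers below are made up for illustration; this is not how the resource tracker actually reports inventory):

    vcpu_pin_set = {3, 4}                 # cores reported after the change
    cpu_allocation_ratio = 1.0            # pinned hosts are typically run at 1:1

    vcpu_capacity = int(len(vcpu_pin_set) * cpu_allocation_ratio)   # 2
    vcpu_used = 2                         # the existing 2-vCPU 'dedicated' instance

    # The instance still counts as 2 VCPUs of usage even though its pins (0, 1)
    # fall outside the new set, so usage can reach capacity * allocation ratio
    # and block further scheduling until the instance is deleted or moved.
    print(vcpu_used >= vcpu_capacity)     # True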

tags: added: resource-tracker
tags: added: libvirt numa
Revision history for this message
sean mooney (sean-k-mooney) wrote :

On second thought I will add them, but this could break other drivers too in subtle ways.

Akhil Gudise (akhil-g)
Changed in nova:
assignee: nobody → Akhil Gudise (akhil-g)
Akhil Gudise (akhil-g)
Changed in nova:
assignee: Akhil Gudise (akhil-g) → nobody