cpu_pinning errors after evacuation of instance with cpu_policy

Bug #1688599 reported by Chris Friesen
This bug affects 2 people
Affects                   Status     Importance  Assigned to  Milestone
OpenStack Compute (nova)  Confirmed  Medium      Unassigned

Bug Description

We recently hit an issue where an evacuating instance with a dedicated cpu_policy was pinned to the same host CPUs as other instances with a dedicated cpu_policy. During subsequent resource audits we would see CPU pinning errors.

The root cause appears to be that the resource audit skips the evacuating instance during the migration phase of the audit while the instance is rebuilding on the new host. It appears that _instance_in_resize_state() returned False because the vm_state was vm_states.ERROR. We allow rebuilding from the ERROR state, though, so that check should account for it.
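The check described above can be illustrated with a simplified, self-contained model. The constants and the Instance tuple below are stand-ins rather than nova's real objects, and the allow_error flag is a hypothetical switch used only to contrast the two behaviours. This is a sketch of the idea under those assumptions, not the upstream fix.

```python
from collections import namedtuple

# Stand-in values (assumption): the real constants live in
# nova.compute.vm_states and nova.compute.task_states.
ACTIVE, STOPPED, RESIZED, ERROR = "active", "stopped", "resized", "error"
RESIZING_TASKS = ("resize_prep", "resize_migrating",
                  "resize_migrated", "resize_finish")
REBUILD_TASKS = ("rebuilding", "rebuild_block_device_mapping",
                 "rebuild_spawning")

# Hypothetical minimal instance record with only the two fields we need.
Instance = namedtuple("Instance", ["vm_state", "task_state"])


def instance_in_resize_state(instance, allow_error=False):
    """Return True if the audit should treat the instance as in-flight."""
    if instance.vm_state == RESIZED:
        return True
    accepted_vm_states = [ACTIVE, STOPPED]
    if allow_error:
        # Rebuild (evacuate) is permitted from ERROR, so count it too.
        accepted_vm_states.append(ERROR)
    return (instance.vm_state in accepted_vm_states and
            instance.task_state in RESIZING_TASKS + REBUILD_TASKS)


evacuating = Instance(vm_state=ERROR, task_state="rebuild_spawning")
print(instance_in_resize_state(evacuating))                    # skipped
print(instance_in_resize_state(evacuating, allow_error=True))  # tracked
```

With allow_error left off, the evacuating instance is skipped by the migration phase, which matches the behaviour reported here.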

Chris Friesen (cbf123) wrote :

Even after updating _instance_in_resize_state() to account for rebuilds from vm_states.ERROR, I think there is a further race condition. Down towards the end of _do_rebuild_instance() we call:

        self._update_instance_after_spawn(context, instance)
        instance.save(expected_task_state=[task_states.REBUILD_SPAWNING])

This sets the task_state to None, but the instance's host is not updated to the new host until a bit later, at the bottom of rebuild_instance(). During that window, the newly rebuilt instance will not be accounted for in either _update_usage_from_instances() or _update_usage_from_migrations().
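The window can be modelled with a toy audit. Everything below is hypothetical: audited_uuids(), the Instance and Migration tuples, and the host names are illustrative stand-ins, and the migration pass is assumed to already accept any vm_state (that is, the ERROR fix is applied). It is only meant to show why neither pass picks the instance up.

```python
from collections import namedtuple

# Hypothetical minimal records; not nova's object model.
Instance = namedtuple("Instance", ["uuid", "host", "vm_state", "task_state"])
Migration = namedtuple("Migration", ["instance_uuid", "dest_host"])

REBUILD_TASKS = ("rebuilding", "rebuild_block_device_mapping",
                 "rebuild_spawning")


def audited_uuids(host, instances, migrations):
    """Return the UUIDs a simplified audit on `host` would account for."""
    by_uuid = {inst.uuid: inst for inst in instances}
    # Pass 1: instances whose host field already points at this host.
    tracked = {inst.uuid for inst in instances if inst.host == host}
    # Pass 2: incoming migrations, counted only while a rebuild task
    # state is still set on the instance.
    for mig in migrations:
        inst = by_uuid[mig.instance_uuid]
        if mig.dest_host == host and inst.task_state in REBUILD_TASKS:
            tracked.add(inst.uuid)
    return tracked


migrations = [Migration("uuid-1", "dest")]
spawning = Instance("uuid-1", "source", "error", "rebuild_spawning")
# After instance.save(): task_state cleared, host not yet updated.
in_window = spawning._replace(vm_state="active", task_state=None)
# After rebuild_instance() finishes: host points at the destination.
done = in_window._replace(host="dest")

for label, inst in [("spawning", spawning), ("window", in_window),
                    ("done", done)]:
    print(label, sorted(audited_uuids("dest", [inst], migrations)))
```

In this model the instance is tracked while spawning and again once the host is updated, but the audit sees nothing for it in between.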

Sean Dague (sdague)
summary: - resource audit races against evacuating instance
+ cpu_pinning errors after evacuation of instance with cpu_policy
Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
tags: added: evacuate
Ma Wen Cheng (mars914) wrote :

After the compute node is rebooted, the resource tracker fails to update NUMA usage:

2017-12-19 21:58:39.166 7494 INFO nova.compute.resource_tracker [req-ac68f931-c034-495d-948f-e142e5bedd5c - - - - -] Auditing locally available compute resources for node valor5-dal09-ce47
2017-12-19 21:58:42.890 7494 ERROR nova.compute.manager [req-ac68f931-c034-495d-948f-e142e5bedd5c - - - - -] Error updating resources for node valor5-dal09-ce47.
2017-12-19 21:58:42.890 7494 ERROR nova.compute.manager Traceback (most recent call last):
2017-12-19 21:58:42.890 7494 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 6500, in update_available_resource
2017-12-19 21:58:42.890 7494 ERROR nova.compute.manager rt.update_available_resource(context)
2017-12-19 21:58:42.890 7494 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 528, in update_available_resource
2017-12-19 21:58:42.890 7494 ERROR nova.compute.manager self._update_available_resource(context, resources)
2017-12-19 21:58:42.890 7494 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/oslo_concurrency/lockutils.py", line 274, in inner
2017-12-19 21:58:42.890 7494 ERROR nova.compute.manager return f(*args, **kwargs)
2017-12-19 21:58:42.890 7494 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 573, in _update_available_resource
2017-12-19 21:58:42.890 7494 ERROR nova.compute.manager self._update_usage_from_instances(context, instances)
2017-12-19 21:58:42.890 7494 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 936, in _update_usage_from_instances
2017-12-19 21:58:42.890 7494 ERROR nova.compute.manager self._update_usage_from_instance(context, instance)
2017-12-19 21:58:42.890 7494 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 902, in _update_usage_from_instance
2017-12-19 21:58:42.890 7494 ERROR nova.compute.manager self._update_usage(instance, sign=sign)
2017-12-19 21:58:42.890 7494 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/compute/resource_tracker.py", line 741, in _update_usage
2017-12-19 21:58:42.890 7494 ERROR nova.compute.manager self.compute_node, usage, free)
2017-12-19 21:58:42.890 7494 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/virt/hardware.py", line 1444, in get_host_numa_usage_from_instance
2017-12-19 21:58:42.890 7494 ERROR nova.compute.manager host_numa_topology, instance_numa_topology, free=free))
2017-12-19 21:58:42.890 7494 ERROR nova.compute.manager File "/usr/lib/python2.7/dist-packages/nova/virt/hardware.py", line 1304, in numa_usage_from_instances
2017-12-19 21:58:42.890 7494 ERROR nova.compute.manager pinned_cpus = set(instancecell.cpu_pinning.values())
2017-12-19 21:58:42.890 7494 ERROR nova.compute.manager AttributeError: 'NoneType' object has no attribute 'values'
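The last frame of the traceback shows cpu_pinning being None on an instance cell when .values() is called. A minimal sketch of that failure mode and one possible guard follows. InstanceCell and pinned_host_cpus() are hypothetical stand-ins, not nova's classes or functions, and whether skipping such a cell is the right behaviour in nova is an assumption: a guard like this avoids the crash but could also hide the underlying accounting problem.

```python
from collections import namedtuple

# Hypothetical stand-in for an instance NUMA cell; cpu_pinning maps
# guest CPU -> host CPU, or is None when no pinning was recorded.
InstanceCell = namedtuple("InstanceCell", ["id", "cpu_pinning"])


def pinned_host_cpus(instance_cells):
    """Collect the pinned host CPUs, tolerating cells with no pinning."""
    pinned = set()
    for cell in instance_cells:
        if cell.cpu_pinning is None:
            # An unguarded cell.cpu_pinning.values() here raises
            # AttributeError: 'NoneType' object has no attribute 'values'
            continue
        pinned |= set(cell.cpu_pinning.values())
    return pinned


cells = [InstanceCell(0, {0: 4, 1: 5}), InstanceCell(1, None)]
print(sorted(pinned_host_cpus(cells)))
```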
