Compute manager fails to cleanup compute_nodes not reported by driver

Bug #1161193 reported by David Peraza
Affects                   Status        Importance  Assigned to   Milestone
OpenStack Compute (nova)  Fix Released  Low         David Peraza
Grizzly                   Fix Released  Low         Unassigned

Bug Description

When the virt driver supports multiple nodes and one node is removed from driver support, the compute_nodes records in the DB are not synced with the driver's node list. This causes the scheduler to pick a bad host, resulting in this error:

| fault | {u'message': u'NovaException', u'code': 500, u'details': u'helium51 is not a valid node managed by this compute host. |
| | File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 223, in decorated_function |
| | return function(self, context, *args, **kwargs) |
| | File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 1149, in run_instance |
| | do_run_instance() |
| | File "/usr/lib/python2.6/site-packages/nova/openstack/common/lockutils.py", line 242, in inner |
| | retval = f(*args, **kwargs) |
| | File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 1148, in do_run_instance |
| | admin_password, is_first_time, node, instance) |
| | File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 802, in _run_instance |
| | self._set_instance_error_state(context, instance[\'uuid\']) |
| | File "/usr/lib64/python2.6/contextlib.py", line 23, in __exit__ |
| | self.gen.next() |
| | File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 756, in _run_instance |
| | rt = self._get_resource_tracker(node) |
| | File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 353, in _get_resource_tracker |
| | raise exception.NovaException(msg) |
| | ', u'created': u'2013-03-06T16:47:52Z'}

Two things I see in the code:

First, the list of known nodes does not reflect the DB list but rather the resource tracker dict, whose keys come from driver.get_available_nodes():

known_nodes = set(self._resource_tracker_dict.keys())

That set will therefore never yield orphan compute_nodes in this statement:

for nodename in known_nodes - nodenames:
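
A minimal, self-contained sketch of why that set difference is always empty (the node names here are made up for illustration, not taken from nova): the tracker dict is keyed only by nodes the driver currently reports, so a node the driver has dropped never appears in known_nodes either.

    # Stand-in illustration, not actual nova code.
    driver_nodes = ['helium52']                       # driver.get_available_nodes()
    _resource_tracker_dict = {n: object() for n in driver_nodes}

    known_nodes = set(_resource_tracker_dict.keys())  # {'helium52'}
    nodenames = set(driver_nodes)                     # also {'helium52'}

    print(known_nodes - nodenames)                    # set() -- never any orphans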

Second, even if we fix this by getting known_nodes from the DB through the conductor, this code will always raise an exception:

for nodename in known_nodes - nodenames:
    rt = self._get_resource_tracker(nodename)
    rt.update_available_resource(context, delete=True)

because _get_resource_tracker always checks that the nodename is in driver.get_available_nodes() and raises if it is not.
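
A rough, self-contained simulation of this second problem (FakeDriver, FakeManager, and the node names are stand-ins for illustration; the real nova code differs): even when known_nodes is seeded from the DB, the tracker lookup rejects the orphan node before the cleanup can run.

    class NovaException(Exception):
        pass

    class FakeDriver:
        def get_available_nodes(self):
            return ['helium52']          # 'helium51' was dropped by the driver

    class FakeManager:
        def __init__(self, driver):
            self.driver = driver

        def _get_resource_tracker(self, nodename):
            # mirrors the check described above: reject any node the driver
            # no longer reports
            if nodename not in self.driver.get_available_nodes():
                raise NovaException(
                    '%s is not a valid node managed by this compute host.'
                    % nodename)
            return object()              # stand-in for a ResourceTracker

    manager = FakeManager(FakeDriver())
    known_nodes = {'helium51', 'helium52'}   # pretend this came from the DB
    nodenames = set(manager.driver.get_available_nodes())

    for nodename in known_nodes - nodenames:
        try:
            manager._get_resource_tracker(nodename)   # the cleanup path
        except NovaException as exc:
            print(exc)   # helium51 is not a valid node managed by this compute host.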

To replicate this, you can simply change your hypervisor_hostname, which creates a new record in the nova.compute_nodes table and leaves the old record around. This simulates a compute node that is no longer supported in a multi-node scenario.
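
To confirm the leftover row, a throwaway query against the nova database is enough; the connection string below is only an example and the credentials are assumptions for a typical MySQL setup.

    from sqlalchemy import create_engine, text

    # Example connection string only; adjust user/password/host for your deployment.
    engine = create_engine('mysql://nova:secret@localhost/nova')

    with engine.connect() as conn:
        rows = conn.execute(
            text('SELECT id, hypervisor_hostname, deleted FROM compute_nodes'))
        for row in rows:
            # the row with the old hypervisor_hostname is still present after
            # the rename; that orphan is what the scheduler can still pick
            print(row)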

Suggestion:

Remove the logic that deletes orphan compute_nodes from compute.manager and move it to compute.resource_tracker under the _sync_compute_node method, which already loops through all compute_nodes records for the compute service.
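
A rough, self-contained sketch of that direction (FakeDriver, FakeResourceTracker, and the dict standing in for the compute_nodes table are made up; this is not the merged patch): while syncing, any record whose hypervisor_hostname the driver no longer reports gets deleted.

    class FakeDriver:
        def get_available_nodes(self):
            return ['helium52']                   # driver dropped 'helium51'

    # stand-in for the compute_nodes rows belonging to this compute service
    compute_nodes = {
        1: {'id': 1, 'hypervisor_hostname': 'helium51'},
        2: {'id': 2, 'hypervisor_hostname': 'helium52'},
    }

    class FakeResourceTracker:
        def __init__(self, driver):
            self.driver = driver

        def _sync_compute_node(self, rows):
            available = set(self.driver.get_available_nodes())
            for cn in list(rows.values()):
                if cn['hypervisor_hostname'] not in available:
                    # in nova this would be a conductor/db delete call;
                    # here we just drop the dict entry
                    print('deleting orphan compute_node %s' % cn['id'])
                    del rows[cn['id']]
            # ...normal per-node resource accounting would continue here

    FakeResourceTracker(FakeDriver())._sync_compute_node(compute_nodes)
    print(sorted(compute_nodes))                  # [2] -- only helium52 remains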

Tags: baremetal
David Peraza (dperaza)
Changed in nova:
assignee: nobody → David Peraza (dperaza)
Michael Still (mikal)
Changed in nova:
status: New → Triaged
importance: Undecided → Low
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/25592

Changed in nova:
status: Triaged → In Progress
aeva black (tenbrae)
tags: added: baremetal
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/25592
Committed: http://github.com/openstack/nova/commit/45ce810ab42b202565088dce55db15374204f638
Submitter: Jenkins
Branch: master

commit 45ce810ab42b202565088dce55db15374204f638
Author: David Peraza <email address hidden>
Date: Wed Mar 27 12:44:22 2013 +0000

    Cleans up orphan compute_nodes not cleaned up by compute manager

    Fixes bug 1161193

    Orphan compute_node records can cause the scheduler to pick
    compute nodes that are not handled by driver anymore. This
    can happen if you rename your hypervisor_hostname or in a
    multi-node support where driver does not support a node
    anymore for whatever reason (hardware failure for example).

    Also, removing resource tracker logic to delete nodes since
    resource trackers built in compute manager will never accept
    nodes not reported by driver

    Change-Id: I742d2e81ec0592d952ee5736aa8dce1e5598ef80

Changed in nova:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/grizzly)

Fix proposed to branch: stable/grizzly
Review: https://review.openstack.org/26906

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/grizzly)

Reviewed: https://review.openstack.org/26906
Committed: http://github.com/openstack/nova/commit/7e527ca1a3d6da39e89867619216252a83eec1ba
Submitter: Jenkins
Branch: stable/grizzly

commit 7e527ca1a3d6da39e89867619216252a83eec1ba
Author: David Peraza <email address hidden>
Date: Wed Mar 27 12:44:22 2013 +0000

    Cleans up orphan compute_nodes not cleaned up by compute manager

    Fixes bug 1161193

    Orphan compute_node records can cause the scheduler to pick
    compute nodes that are not handled by driver anymore. This
    can happen if you rename your hypervisor_hostname or in a
    multi-node support where driver does not support a node
    anymore for whatever reason (hardware failure for example).

    Also, removing resource tracker logic to delete nodes since
    resource trackers built in compute manager will never accept
    nodes not reported by driver

    Change-Id: I742d2e81ec0592d952ee5736aa8dce1e5598ef80

Thierry Carrez (ttx)
Changed in nova:
milestone: none → havana-1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: havana-1 → 2013.2