Compute manager fails to cleanup compute_nodes not reported by driver

Bug #1161193 reported by David Peraza
Affects                   Status        Importance  Assigned to   Milestone
OpenStack Compute (nova)  Fix Released  Low         David Peraza
Grizzly                   Fix Released  Low         Unassigned

Bug Description

When the virt driver supports multiple nodes and one node is removed from driver support, the compute_nodes records in the DB are not synced with the driver's node list. This causes the scheduler to pick a bad host, resulting in this error:

| fault | {u'message': u'NovaException', u'code': 500, u'details': u'helium51 is not a valid node managed by this compute host. |
| | File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 223, in decorated_function |
| | return function(self, context, *args, **kwargs) |
| | File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 1149, in run_instance |
| | do_run_instance() |
| | File "/usr/lib/python2.6/site-packages/nova/openstack/common/lockutils.py", line 242, in inner |
| | retval = f(*args, **kwargs) |
| | File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 1148, in do_run_instance |
| | admin_password, is_first_time, node, instance) |
| | File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 802, in _run_instance |
| | self._set_instance_error_state(context, instance[\'uuid\']) |
| | File "/usr/lib64/python2.6/contextlib.py", line 23, in __exit__ |
| | self.gen.next() |
| | File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 756, in _run_instance |
| | rt = self._get_resource_tracker(node) |
| | File "/usr/lib/python2.6/site-packages/nova/compute/manager.py", line 353, in _get_resource_tracker |
| | raise exception.NovaException(msg) |
| | ', u'created': u'2013-03-06T16:47:52Z'}

Two things I see in the code:

First, the list of known nodes does not reflect the DB list but rather the resource tracker dict, whose keys come from driver.get_available_nodes():

known_nodes = set(self._resource_tracker_dict.keys())

That set will therefore never yield orphan compute_nodes in this statement:

for nodename in known_nodes - nodenames:
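
A minimal, self-contained sketch of why that set difference is always empty (the node names here are made up for illustration, not taken from nova): the tracker dict is keyed only by nodes the driver currently reports, so a node the driver has dropped never appears in known_nodes either.

    # Stand-in illustration, not actual nova code.
    driver_nodes = ['helium52']                       # driver.get_available_nodes()
    _resource_tracker_dict = {n: object() for n in driver_nodes}

    known_nodes = set(_resource_tracker_dict.keys())  # {'helium52'}
    nodenames = set(driver_nodes)                     # also {'helium52'}

    print(known_nodes - nodenames)                    # set() -- never any orphans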

Second, even if we fix this by getting known_nodes from the DB through the conductor, this code will always raise an exception:

for nodename in known_nodes - nodenames:
    rt = self._get_resource_tracker(nodename)
    rt.update_available_resource(context, delete=True)

because _get_resource_tracker always checks that the nodename is in driver.get_available_nodes() and raises if it is not.
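
A rough, self-contained simulation of this second problem (FakeDriver, FakeManager, and the node names are stand-ins for illustration; the real nova code differs): even when known_nodes is seeded from the DB, the tracker lookup rejects the orphan node before the cleanup can run.

    class NovaException(Exception):
        pass

    class FakeDriver:
        def get_available_nodes(self):
            return ['helium52']          # 'helium51' was dropped by the driver

    class FakeManager:
        def __init__(self, driver):
            self.driver = driver

        def _get_resource_tracker(self, nodename):
            # mirrors the check described above: reject any node the driver
            # no longer reports
            if nodename not in self.driver.get_available_nodes():
                raise NovaException(
                    '%s is not a valid node managed by this compute host.'
                    % nodename)
            return object()              # stand-in for a ResourceTracker

    manager = FakeManager(FakeDriver())
    known_nodes = {'helium51', 'helium52'}   # pretend this came from the DB
    nodenames = set(manager.driver.get_available_nodes())

    for nodename in known_nodes - nodenames:
        try:
            manager._get_resource_tracker(nodename)   # the cleanup path
        except NovaException as exc:
            print(exc)   # helium51 is not a valid node managed by this compute host.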

To replicate this, you can simply change your hypervisor_hostname, which creates a new record in the nova.compute_nodes table and leaves the old record around. This simulates a compute node that is no longer supported in a multi-node scenario.
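
To confirm the leftover row, a throwaway query against the nova database is enough; the connection string below is only an example and the credentials are assumptions for a typical MySQL setup.

    from sqlalchemy import create_engine, text

    # Example connection string only; adjust user/password/host for your deployment.
    engine = create_engine('mysql://nova:secret@localhost/nova')

    with engine.connect() as conn:
        rows = conn.execute(
            text('SELECT id, hypervisor_hostname, deleted FROM compute_nodes'))
        for row in rows:
            # the row with the old hypervisor_hostname is still present after
            # the rename; that orphan is what the scheduler can still pick
            print(row)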

Suggestion:

Remove the logic that deletes orphan compute_nodes from compute.manager and move it to compute.resource_tracker under the _sync_compute_node method, which already loops through all compute_nodes records for the compute service.
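
A rough, self-contained sketch of that direction (FakeDriver, FakeResourceTracker, and the dict standing in for the compute_nodes table are made up; this is not the merged patch): while syncing, any record whose hypervisor_hostname the driver no longer reports gets deleted.

    class FakeDriver:
        def get_available_nodes(self):
            return ['helium52']                   # driver dropped 'helium51'

    # stand-in for the compute_nodes rows belonging to this compute service
    compute_nodes = {
        1: {'id': 1, 'hypervisor_hostname': 'helium51'},
        2: {'id': 2, 'hypervisor_hostname': 'helium52'},
    }

    class FakeResourceTracker:
        def __init__(self, driver):
            self.driver = driver

        def _sync_compute_node(self, rows):
            available = set(self.driver.get_available_nodes())
            for cn in list(rows.values()):
                if cn['hypervisor_hostname'] not in available:
                    # in nova this would be a conductor/db delete call;
                    # here we just drop the dict entry
                    print('deleting orphan compute_node %s' % cn['id'])
                    del rows[cn['id']]
            # ...normal per-node resource accounting would continue here

    FakeResourceTracker(FakeDriver())._sync_compute_node(compute_nodes)
    print(sorted(compute_nodes))                  # [2] -- only helium52 remains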

Tags: baremetal
David Peraza (dperaza)
Changed in nova:
assignee: nobody → David Peraza (dperaza)
Michael Still (mikal)
Changed in nova:
status: New → Triaged
importance: Undecided → Low
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/25592

Changed in nova:
status: Triaged → In Progress
aeva black (tenbrae)
tags: added: baremetal
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/25592
Committed: http://github.com/openstack/nova/commit/45ce810ab42b202565088dce55db15374204f638
Submitter: Jenkins
Branch: master

commit 45ce810ab42b202565088dce55db15374204f638
Author: David Peraza <email address hidden>
Date: Wed Mar 27 12:44:22 2013 +0000

    Cleans up orphan compute_nodes not cleaned up by compute manager

    Fixes bug 1161193

    Orphan compute_node records can cause the scheduler to pick
    compute nodes that are not handled by driver anymore. This
    can happen if you rename your hypervisor_hostname or in a
    multi-node support where driver does not support a node
    anymore for whatever reason (hardware failure for example).

    Also, removing resource tracker logic to delete nodes since
    resource trackers built in compute manager will never accept
    nodes not reported by driver

    Change-Id: I742d2e81ec0592d952ee5736aa8dce1e5598ef80

Changed in nova:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (stable/grizzly)

Fix proposed to branch: stable/grizzly
Review: https://review.openstack.org/26906

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (stable/grizzly)

Reviewed: https://review.openstack.org/26906
Committed: http://github.com/openstack/nova/commit/7e527ca1a3d6da39e89867619216252a83eec1ba
Submitter: Jenkins
Branch: stable/grizzly

commit 7e527ca1a3d6da39e89867619216252a83eec1ba
Author: David Peraza <email address hidden>
Date: Wed Mar 27 12:44:22 2013 +0000

    Cleans up orphan compute_nodes not cleaned up by compute manager

    Fixes bug 1161193

    Orphan compute_node records can cause the scheduler to pick
    compute nodes that are not handled by driver anymore. This
    can happen if you rename your hypervisor_hostname or in a
    multi-node support where driver does not support a node
    anymore for whatever reason (hardware failure for example).

    Also, removing resource tracker logic to delete nodes since
    resource trackers built in compute manager will never accept
    nodes not reported by driver

    Change-Id: I742d2e81ec0592d952ee5736aa8dce1e5598ef80

Thierry Carrez (ttx)
Changed in nova:
milestone: none → havana-1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: havana-1 → 2013.2