Bug #1771806 “Ironic nova-compute failover creates new resource ...” : Bugs : OpenStack Compute (nova)

Belmiro Moreira (moreira-belmiro-email-lists) on 2018-05-17

tags:

added: ironic placement
removed: placem

Surya Seetharaman (tssurya) on 2018-05-17

Changed in nova:
assignee:	nobody → Surya Seetharaman (tssurya)

Matt Riedemann (mriedem) wrote on 2018-05-23:

#1

This sounds related: https://review.openstack.org/#/c/508555/ - but that change landed in Queens, which CERN should already have, so why doesn't it resolve the problem?

Changed in nova:
status:	New → Confirmed
importance:	Undecided → Medium

Matt Riedemann (mriedem) wrote on 2018-05-31:

#2

Could be related to bug 1750450 as well.

Matt Riedemann (mriedem) wrote on 2018-05-31:

#3

Here is a detailed walk through the RT (ResourceTracker) code that eventually creates the compute node record. It also shows why I don't understand why the change in comment 1 doesn't already handle this issue.

We start here:

https://github.com/openstack/nova/blob/8ab386ed9b6e48343910e08a15ba18325c09f3b6/nova/compute/resource_tracker.py#L539

If we've restarted the nova-compute service, the self.compute_nodes dict will be empty so we have to query for existing compute_nodes records via the host and nodename fields:

https://github.com/openstack/nova/blob/8ab386ed9b6e48343910e08a15ba18325c09f3b6/nova/compute/resource_tracker.py#L566

For ironic computes, the nodename is the ironic node uuid.

If we don't find the compute node there, we check to see if there has been an ironic node rebalance to another physical nova-compute host:

https://github.com/openstack/nova/blob/8ab386ed9b6e48343910e08a15ba18325c09f3b6/nova/compute/resource_tracker.py#L574

That looks up the compute nodes by just the nodename (again, ironic node uuid):

https://github.com/openstack/nova/blob/8ab386ed9b6e48343910e08a15ba18325c09f3b6/nova/compute/resource_tracker.py#L518

If we find the compute node there, we update its host field since we've rebalanced to another nova-compute service on another host.

One thing that could be a problem is if we found more than one compute node record:

https://github.com/openstack/nova/blob/8ab386ed9b6e48343910e08a15ba18325c09f3b6/nova/compute/resource_tracker.py#L531

In that case we'll log an error and then create a new compute node record:

https://github.com/openstack/nova/blob/8ab386ed9b6e48343910e08a15ba18325c09f3b6/nova/compute/resource_tracker.py#L584

So it would be good to know whether you're seeing that error when this happens. Otherwise, the only other ways I can think of that we'd get past all these checks are (1) you didn't have the fix linked in comment 1, or (2) the ironic node uuid changed (which shouldn't happen).
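
The lookup order described above can be sketched roughly as follows. This is a simplified illustration, not the actual nova code: ComputeNode, FakeComputeNodeDB, ResourceTrackerSketch and init_compute_node here are hypothetical stand-ins, and the database is replaced by an in-memory list.

```python
import logging
import uuid as uuidlib
from dataclasses import dataclass, field

LOG = logging.getLogger(__name__)


@dataclass
class ComputeNode:
    """Hypothetical stand-in for a compute_nodes record."""
    host: str
    nodename: str  # for ironic computes, the ironic node uuid
    uuid: str = field(default_factory=lambda: str(uuidlib.uuid4()))


class FakeComputeNodeDB:
    """Hypothetical in-memory replacement for the compute_nodes table."""

    def __init__(self):
        self.records = []

    def get_by_host_and_nodename(self, host, nodename):
        for cn in self.records:
            if cn.host == host and cn.nodename == nodename:
                return cn
        return None

    def get_all_by_nodename(self, nodename):
        return [cn for cn in self.records if cn.nodename == nodename]


class ResourceTrackerSketch:
    def __init__(self, host, db):
        self.host = host
        self.db = db
        self.compute_nodes = {}  # empty right after a service restart

    def init_compute_node(self, nodename):
        # Step 1: use the in-memory cache if it is already populated.
        if nodename in self.compute_nodes:
            return self.compute_nodes[nodename]

        # Step 2: look for an existing record by host and nodename.
        cn = self.db.get_by_host_and_nodename(self.host, nodename)

        if cn is None:
            # Step 3: rebalance check, looking up by nodename only.
            candidates = self.db.get_all_by_nodename(nodename)
            if len(candidates) == 1:
                cn = candidates[0]
                cn.host = self.host  # the node moved to this host
            elif len(candidates) > 1:
                LOG.error("Found more than one compute node record "
                          "for nodename %s", nodename)

        if cn is None:
            # Step 4: nothing usable was found, so create a new record.
            cn = ComputeNode(host=self.host, nodename=nodename)
            self.db.records.append(cn)

        self.compute_nodes[nodename] = cn
        return cn
```

In this sketch, a single existing record for the nodename is reused and only its host changes, so its uuid is preserved. A new record with a new uuid is created only when there is no record at all or when more than one record matches.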

Regardless of this, we should probably add some code to always create the compute node with a predictable uuid if the virt driver can supply one, which the ironic driver can, using the ironic node uuid. Then we'd at least have predictable mappings of compute nodes to the ironic nodes they represent, including the resource providers in placement, since they'd all share the same uuid.
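
As a rough illustration of that idea (not the code that was eventually proposed), a helper could prefer a driver-supplied uuid and fall back to a random one. The function name choose_compute_node_uuid and its parameter are assumptions made for this sketch.

```python
import uuid
from typing import Optional


def choose_compute_node_uuid(driver_node_uuid: Optional[str]) -> str:
    """Pick the uuid for a new compute node record (hypothetical helper).

    If the virt driver can supply a uuid (for ironic, the ironic node
    uuid), reuse it so the compute node and its resource provider share
    the same identifier. Otherwise generate a random one.
    """
    if driver_node_uuid is not None:
        # uuid.UUID() raises ValueError on malformed input, so a bad
        # value from the driver is not silently stored.
        return str(uuid.UUID(driver_node_uuid))
    return str(uuid.uuid4())
```

With this approach, recreating a record for the same ironic node would yield the same uuid again, whereas the random fallback yields a different uuid each time.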

Surya Seetharaman (tssurya) wrote on 2018-05-31:

#4

We don't hit this condition: https://github.com/openstack/nova/blob/8ab386ed9b6e48343910e08a15ba18325c09f3b6/nova/compute/resource_tracker.py#L531 and we have https://review.openstack.org/#/c/508555/ already. Not sure why the record is recreated instead of being updated.

OpenStack Infra (hudson-openstack) wrote on 2018-05-31: Related fix proposed to nova (master)

#5

Related fix proposed to branch: master
Review: https://review.openstack.org/571535

OpenStack Infra (hudson-openstack) wrote on 2018-06-29: Related fix merged to nova (master)

#6

Reviewed: https://review.openstack.org/571535
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9f28727eb75e05e07bad51b6eecce667d09dfb65
Submitter: Zuul
Branch: master

commit 9f28727eb75e05e07bad51b6eecce667d09dfb65
Author: Matt Riedemann <email address hidden>
Date: Thu May 31 13:33:14 2018 -0400

Match ComputeNode.uuid to ironic node uuid in RT

    When the ResourceTracker creates a new ComputeNode
    record, when using the Ironic driver, we can use
    the ironic node uuid to set the compute node uuid
    which will then also get reflected as the compute
    node resource provider uuid in Placement, which
    will be a nice link between the three resources
    which should be 1:1:1 with each other. This isn't
    mandatory nor does it fix any bugs, but it will
    be nice for debugging.

Change-Id: Ia69fabce8e7fd7de101e291fe133c6f5f5f7056a
Related-Bug: #1771806

OpenStack Compute (nova)

Ironic nova-compute failover creates new resource provider removing the resource_provider_aggregates link
