Ironic nova-compute failover creates new resource provider removing the resource_provider_aggregates link

Bug #1771806 reported by Belmiro Moreira
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: Medium
Assigned to: Surya Seetharaman
Milestone: (none)

Bug Description

When using the request_filter functionality, aggregates are mapped to placement_aggregates.
placement_provider_aggregates contains the resource providers mapped in aggregate_hosts.

The problem happens when a nova-compute for ironic fails and hosts are automatically moved to a different nova-compute. In this case a new compute_node entry is created, which results in a new resource provider.

As a consequence, placement_provider_aggregates doesn't have the new resource providers.
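To illustrate the description above, here is a minimal sketch of how the aggregate link is lost. It does not use nova or placement code: the dicts `provider_aggregates` and `compute_nodes` and the helpers `register_node` and `aggregates_for` are hypothetical in-memory stand-ins, and the host and node names are made up.

```python
import uuid

# Hypothetical in-memory stand-ins, not the real nova/placement tables.
provider_aggregates = {}  # resource provider uuid -> set of aggregate names
compute_nodes = {}        # (host, nodename) -> resource provider uuid


def register_node(host, nodename):
    """Create a compute node entry with a fresh random provider uuid."""
    rp_uuid = str(uuid.uuid4())
    compute_nodes[(host, nodename)] = rp_uuid
    return rp_uuid


def aggregates_for(host, nodename):
    """Return the aggregates linked to the provider behind (host, nodename)."""
    rp_uuid = compute_nodes[(host, nodename)]
    return provider_aggregates.get(rp_uuid, set())


# The node is first managed by compute-1 and mapped to an aggregate.
old_rp = register_node("compute-1", "ironic-node-a")
provider_aggregates[old_rp] = {"agg-1"}

# After failover a new entry (and so a new provider uuid) is created.
new_rp = register_node("compute-2", "ironic-node-a")

print(aggregates_for("compute-1", "ironic-node-a"))
print(aggregates_for("compute-2", "ironic-node-a"))
```

Because the aggregate mapping is keyed by the provider uuid, the entry created after failover starts with no aggregates, so a request filter based on aggregates would no longer match it.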

tags: added: ironic placement
removed: placem
Changed in nova:
assignee: nobody → Surya Seetharaman (tssurya)
Matt Riedemann (mriedem) wrote :

This sounds related: https://review.openstack.org/#/c/508555/ - but that was in Queens which CERN should have already so why doesn't that resolve the problem?

Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
Matt Riedemann (mriedem) wrote :

Could be related to bug 1750450 as well.

Matt Riedemann (mriedem) wrote :

Here is a detailed walk through the RT (ResourceTracker) code which eventually creates the compute node record. It also shows why I don't understand how the change in comment 1 fails to handle this issue.

We start here:

https://github.com/openstack/nova/blob/8ab386ed9b6e48343910e08a15ba18325c09f3b6/nova/compute/resource_tracker.py#L539

If we've restarted the nova-compute service, the self.compute_nodes dict will be empty so we have to query for existing compute_nodes records via the host and nodename fields:

https://github.com/openstack/nova/blob/8ab386ed9b6e48343910e08a15ba18325c09f3b6/nova/compute/resource_tracker.py#L566

For ironic computes, the nodename is the ironic node uuid.

If we don't find the compute node there, we check to see if there has been an ironic node rebalance to another physical nova-compute host:

https://github.com/openstack/nova/blob/8ab386ed9b6e48343910e08a15ba18325c09f3b6/nova/compute/resource_tracker.py#L574

That looks up the compute nodes by just the nodename (again, ironic node uuid):

https://github.com/openstack/nova/blob/8ab386ed9b6e48343910e08a15ba18325c09f3b6/nova/compute/resource_tracker.py#L518

If we find the compute node there, we update its host field since we've rebalanced to another nova-compute service on another host.

One thing that could be a problem is if we found more than one compute node record:

https://github.com/openstack/nova/blob/8ab386ed9b6e48343910e08a15ba18325c09f3b6/nova/compute/resource_tracker.py#L531

In that case we'll log an error and then create a new compute node record:

https://github.com/openstack/nova/blob/8ab386ed9b6e48343910e08a15ba18325c09f3b6/nova/compute/resource_tracker.py#L584

So it would be good to know if you're seeing that error when this happens/happened, otherwise the only other way I can think we'd get past all these checks is if (1) you didn't have the fix linked in comment 1 or (2) the ironic node uuid changed (which shouldn't happen).
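The lookup order described in this comment can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual resource_tracker.py code: `ComputeNode`, `TrackerSketch`, `init_compute_node` and the list used as a database are hypothetical names introduced for the example.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ComputeNode:
    host: str
    nodename: str
    uuid: str = field(default_factory=lambda: str(uuid.uuid4()))


class TrackerSketch:
    """Hypothetical, simplified model of the lookup steps in this comment."""

    def __init__(self, host, db):
        self.host = host
        self.db = db             # plain list standing in for the database
        self.compute_nodes = {}  # nodename -> record; empty after a restart
        self.errors = []

    def init_compute_node(self, nodename):
        # Step 1: use the cached record if this service already knows it.
        if nodename in self.compute_nodes:
            return self.compute_nodes[nodename]
        # Step 2: look up an existing record by host and nodename.
        for cn in self.db:
            if cn.host == self.host and cn.nodename == nodename:
                self.compute_nodes[nodename] = cn
                return cn
        # Step 3: rebalance check, look up by nodename only.
        matches = [cn for cn in self.db if cn.nodename == nodename]
        if len(matches) == 1:
            cn = matches[0]
            cn.host = self.host  # the node moved to this service
            self.compute_nodes[nodename] = cn
            return cn
        if len(matches) > 1:
            self.errors.append("more than one record for %s" % nodename)
        # Step 4: nothing usable was found, so create a new record.
        cn = ComputeNode(self.host, nodename)
        self.db.append(cn)
        self.compute_nodes[nodename] = cn
        return cn


db = []
first = TrackerSketch("compute-1", db).init_compute_node("node-a")
moved = TrackerSketch("compute-2", db).init_compute_node("node-a")
print(first is moved, moved.host, len(db))
```

In this sketch a rebalance reuses the single existing record and only changes its host. A new record with a new uuid appears only when no record matches the nodename or when more than one does, which matches the two cases questioned above.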

Regardless of this, we should probably add some code to always create the compute node with a predictable uuid if the virt driver can supply one, which in the case of the ironic driver it can using the ironic node uuid. Then we'd at least have predictable mappings of compute nodes to the ironic nodes they represent, including the resource providers in placement, since they'd all share the same uuid.
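A minimal sketch of that idea, assuming the virt driver can optionally hand back a node uuid. The function name `new_compute_node_uuid` and its parameter are hypothetical and are not taken from the nova code base.

```python
import uuid


def new_compute_node_uuid(driver_node_uuid=None):
    """Pick the uuid for a new compute node record.

    Hypothetical helper: reuse the uuid supplied by the virt driver
    (for ironic, the ironic node uuid) and fall back to a random one.
    """
    if driver_node_uuid is not None:
        # uuid.UUID() validates the string and normalizes its format.
        return str(uuid.UUID(driver_node_uuid))
    return str(uuid.uuid4())


ironic_node_uuid = "1be26c0b-03f2-4d2e-ae87-c02d7f33c123"  # made-up example value
print(new_compute_node_uuid(ironic_node_uuid) == ironic_node_uuid)
```

With a deterministic uuid, a record that is recreated for the same ironic node would map to the same identifier rather than a fresh random one.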

Surya Seetharaman (tssurya) wrote :

We don't hit this condition: https://github.com/openstack/nova/blob/8ab386ed9b6e48343910e08a15ba18325c09f3b6/nova/compute/resource_tracker.py#L531 and we have https://review.openstack.org/#/c/508555/ already. Not sure why the record is recreated instead of being updated.

OpenStack Infra (hudson-openstack) wrote : Related fix proposed to nova (master)

Related fix proposed to branch: master
Review: https://review.openstack.org/571535

OpenStack Infra (hudson-openstack) wrote : Related fix merged to nova (master)

Reviewed: https://review.openstack.org/571535
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=9f28727eb75e05e07bad51b6eecce667d09dfb65
Submitter: Zuul
Branch: master

commit 9f28727eb75e05e07bad51b6eecce667d09dfb65
Author: Matt Riedemann <email address hidden>
Date: Thu May 31 13:33:14 2018 -0400

    Match ComputeNode.uuid to ironic node uuid in RT

    When the ResourceTracker creates a new ComputeNode
    record, when using the Ironic driver, we can use
    the ironic node uuid to set the compute node uuid
    which will then also get reflected as the compute
    node resource provider uuid in Placement, which
    will be a nice link between the three resources
    which should be 1:1:1 with each other. This isn't
    mandatory nor does it fix any bugs, but it will
    be nice for debugging.

    Change-Id: Ia69fabce8e7fd7de101e291fe133c6f5f5f7056a
    Related-Bug: #1771806
