Ironic nova-compute failover creates a new resource provider, removing the resource_provider_aggregates link

Bug #1771806 reported by Belmiro Moreira on 2018-05-17
Affects: OpenStack Compute (nova)
Assigned to: Surya Seetharaman

Bug Description

When using the request_filter functionality, nova host aggregates are mirrored to placement aggregates. The placement_provider_aggregates table then associates resource providers with those aggregates, based on the hosts recorded in aggregate_hosts.

The problem happens when a nova-compute service for ironic fails and its nodes are automatically moved to a different nova-compute service. In this case a new compute_node entry is created, which in turn creates a new resource provider.

As a consequence, placement_provider_aggregates does not contain the new resource providers.
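To illustrate the failure mode, here is a minimal sketch (hypothetical dictionaries, not the actual Nova or placement schema) of why a re-created resource provider loses its aggregate association: the mapping is keyed by the provider uuid that existed when the aggregate was mirrored, so a freshly created provider with a new uuid has no entry.

```python
# Hypothetical in-memory stand-ins for the placement tables described above.
# aggregate name -> resource provider uuids that were members at mirror time
host_aggregates = {"rack1": {"cn-uuid-old"}}
# resource provider uuid -> aggregates (placement_provider_aggregates)
provider_aggregates = {"cn-uuid-old": {"rack1"}}

def aggregates_for_provider(provider_uuid):
    """Look up the aggregates linked to a resource provider."""
    return provider_aggregates.get(provider_uuid, set())

# The original provider is linked to its aggregate.
print(aggregates_for_provider("cn-uuid-old"))  # -> {'rack1'}

# After failover, a new compute_node row (and thus a new resource
# provider with a new uuid) is created; the old mapping does not apply.
print(aggregates_for_provider("cn-uuid-new"))  # -> set()
```

The second lookup coming back empty is exactly the symptom reported: the new resource provider is not in placement_provider_aggregates, so aggregate-based request filtering no longer matches it.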

tags: added: ironic placement
removed: placem
Changed in nova:
assignee: nobody → Surya Seetharaman (tssurya)
Matt Riedemann (mriedem) wrote :

This sounds related: - but that was in Queens, which CERN should already have, so why doesn't that resolve the problem?

Changed in nova:
status: New → Confirmed
importance: Undecided → Medium
Matt Riedemann (mriedem) wrote :

Could be related to bug 1750450 as well.

Matt Riedemann (mriedem) wrote :

Here is a detailed step-through of the ResourceTracker (RT) code that eventually creates the compute node record; it highlights why I'm not sure the change in comment 1 doesn't already handle this issue.

We start here:

If we've restarted the nova-compute service, the self.compute_nodes dict will be empty so we have to query for existing compute_nodes records via the host and nodename fields:

For ironic computes, the nodename is the ironic node uuid.

If we don't find the compute node there, we check to see if there has been an ironic node rebalance to another physical nova-compute host:

That looks up the compute nodes by just the nodename (again, ironic node uuid):

If we find the compute node there, we update its host field, since we've rebalanced to another nova-compute service on another host.

One thing that could be a problem is if we found more than one compute node record:

In that case we'll log an error and then create a new compute node record:

So it would be good to know if you're seeing that error when this happens/happened, otherwise the only other way I can think we'd get past all these checks is if (1) you didn't have the fix linked in comment 1 or (2) the ironic node uuid changed (which shouldn't happen).
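The lookup order described above can be sketched as follows. This is a simplified, illustrative version of the flow, not the real Nova ResourceTracker code; the function and field names are assumptions made for the sketch.

```python
def lookup_compute_node(host, nodename, compute_nodes_db):
    """Sketch of the RT lookup order: (host, nodename) first, then a
    rebalance check by nodename alone. For ironic, nodename is the
    ironic node uuid. compute_nodes_db is a list of dict records."""
    # 1. Look up by host and nodename.
    matches = [cn for cn in compute_nodes_db
               if cn["host"] == host and cn["nodename"] == nodename]
    if matches:
        return matches[0]

    # 2. Rebalance check: look up by nodename (ironic node uuid) alone.
    by_node = [cn for cn in compute_nodes_db if cn["nodename"] == nodename]
    if len(by_node) == 1:
        # The node moved to another nova-compute service; take over the
        # record by updating its host field.
        by_node[0]["host"] = host
        return by_node[0]

    # 3. Zero matches, or more than one (the error case described above):
    # fall through, and a brand new compute node record (and resource
    # provider) gets created.
    return None

# Rebalance example: the node's record belongs to the failed host.
db = [{"host": "old-compute", "nodename": "ironic-node-uuid-1"}]
cn = lookup_compute_node("new-compute", "ironic-node-uuid-1", db)
print(cn["host"])  # -> new-compute (record taken over, not recreated)
```

The bug, as described, is that this takeover path is apparently not taken and a new record is created instead, which is what orphans the placement aggregate link.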

Regardless of this, we should probably add some code to always create the compute node with a predictable uuid if the virt driver can supply one, which in the case of the ironic driver it can using the ironic node uuid. Then we'd at least have predictable mappings of compute nodes to the ironic nodes they represent, including the resource providers in placement, since they'd all share the same uuid.
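A minimal sketch of that idea, with illustrative names (not the actual Nova code): when the virt driver can supply a stable uuid, use it as the compute node uuid; otherwise generate one.

```python
import uuid

def build_compute_node(host, nodename, driver_supplied_uuid=None):
    """Create a compute node record whose uuid is predictable when the
    virt driver supplies one (the ironic driver can supply the ironic
    node uuid), so compute node, ironic node, and placement resource
    provider all share the same identifier."""
    cn_uuid = driver_supplied_uuid or str(uuid.uuid4())
    return {"host": host, "nodename": nodename, "uuid": cn_uuid}

# With the ironic driver, nodename and uuid end up identical.
cn = build_compute_node("compute-1", "ironic-node-uuid-1",
                        driver_supplied_uuid="ironic-node-uuid-1")
print(cn["uuid"])  # -> ironic-node-uuid-1
```

This is the approach taken by the related change merged below: it doesn't fix the rebalance bug itself, but it makes the compute node / ironic node / resource provider mapping predictable for debugging.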

Surya Seetharaman (tssurya) wrote :

We don't hit this condition: and we already have that fix. Not sure why the record is recreated instead of being updated.

Submitter: Zuul
Branch: master

commit 9f28727eb75e05e07bad51b6eecce667d09dfb65
Author: Matt Riedemann <email address hidden>
Date: Thu May 31 13:33:14 2018 -0400

    Match ComputeNode.uuid to ironic node uuid in RT

    When the ResourceTracker creates a new ComputeNode
    record, when using the Ironic driver, we can use
    the ironic node uuid to set the compute node uuid
    which will then also get reflected as the compute
    node resource provider uuid in Placement, which
    will be a nice link between the three resources
    which should be 1:1:1 with each other. This isn't
    mandatory nor does it fix any bugs, but it will
    be nice for debugging.

    Change-Id: Ia69fabce8e7fd7de101e291fe133c6f5f5f7056a
    Related-Bug: #1771806
