Here is a detailed step through the RT code that eventually creates the compute node record, which highlights why I'm not sure the change in comment 1 doesn't already handle this issue.
We start here:
https://github.com/openstack/nova/blob/8ab386ed9b6e48343910e08a15ba18325c09f3b6/nova/compute/resource_tracker.py#L539
If we've restarted the nova-compute service, the self.compute_nodes dict will be empty, so we have to query for existing compute_nodes records via the host and nodename fields:
https://github.com/openstack/nova/blob/8ab386ed9b6e48343910e08a15ba18325c09f3b6/nova/compute/resource_tracker.py#L566
For ironic computes, the nodename is the ironic node uuid.
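The lookup-on-restart flow described above can be sketched roughly like this (a simplified, hypothetical illustration only, not the actual nova code; the dict-based records and the `get_compute_node` helper are stand-ins for the real ComputeNode objects and ResourceTracker methods):

```python
def get_compute_node(cache, db_records, host, nodename):
    """Sketch: after a nova-compute restart the in-memory cache
    (self.compute_nodes) is empty, so fall back to a DB lookup
    keyed on both the host and nodename fields."""
    if nodename in cache:
        return cache[nodename]  # warm cache: no DB lookup needed
    for rec in db_records:
        if rec['host'] == host and rec['nodename'] == nodename:
            cache[nodename] = rec  # repopulate the cache on a hit
            return rec
    return None  # not found; fall through to the rebalance check


# For an ironic compute, nodename is the ironic node uuid.
cache = {}  # empty after a service restart
db = [{'host': 'compute-1', 'nodename': 'aaaa-bbbb'}]
node = get_compute_node(cache, db, 'compute-1', 'aaaa-bbbb')
```

A miss here (for example, the record's host field points at a different nova-compute host) returns None and leads into the rebalance check below.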
If we don't find the compute node there, we check to see if there has been an ironic node rebalance to another physical nova-compute host:
https://github.com/openstack/nova/blob/8ab386ed9b6e48343910e08a15ba18325c09f3b6/nova/compute/resource_tracker.py#L574
That looks up the compute nodes by just the nodename (again, ironic node uuid):
https://github.com/openstack/nova/blob/8ab386ed9b6e48343910e08a15ba18325c09f3b6/nova/compute/resource_tracker.py#L518
If we find the compute node there, we update its host field since we've rebalanced to another nova-compute service on another host.
One thing that could be a problem is if we found more than one compute node record:
https://github.com/openstack/nova/blob/8ab386ed9b6e48343910e08a15ba18325c09f3b6/nova/compute/resource_tracker.py#L531
In that case we'll log an error and then create a new compute node record:
https://github.com/openstack/nova/blob/8ab386ed9b6e48343910e08a15ba18325c09f3b6/nova/compute/resource_tracker.py#L584
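Putting the rebalance, duplicate, and create steps together, the flow looks roughly like this (again a hypothetical sketch with plain dicts standing in for ComputeNode objects; the error message is paraphrased, not the actual log line):

```python
import logging

LOG = logging.getLogger(__name__)


def resolve_or_create(db_records, this_host, nodename):
    """Sketch of the rebalance path: look up by nodename alone (the
    ironic node uuid). Exactly one match means the node moved from
    another host, so we take it over; more than one match is logged
    as an error, and we fall through to creating a new record."""
    matches = [r for r in db_records if r['nodename'] == nodename]
    if len(matches) == 1:
        node = matches[0]
        node['host'] = this_host  # rebalanced: update the host field
        return node
    if len(matches) > 1:
        LOG.error('Found more than one compute node record for '
                  'nodename %s', nodename)
    # Not found (or duplicates): create a brand-new compute node record.
    new_node = {'host': this_host, 'nodename': nodename}
    db_records.append(new_node)
    return new_node
```

The duplicate branch is why seeing that error in the logs would point directly at this code path.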
So it would be good to know if you're seeing that error when this happens/happened. Otherwise, the only other ways I can think we'd get past all of these checks are if (1) you didn't have the fix linked in comment 1, or (2) the ironic node uuid changed (which shouldn't happen).
Regardless of this, we should probably add some code to always create the compute node with a predictable uuid if the virt driver can supply one; in the case of the ironic driver it can, using the ironic node uuid. Then we'd at least have predictable mappings of compute nodes to the ironic nodes they represent, including the resource providers in placement, since they'd all share the same uuid.
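As a rough sketch of that idea (the helper name and its driver_node_uuid parameter are hypothetical; nova's real interface for this would differ):

```python
import uuid


def create_compute_node(host, nodename, driver_node_uuid=None):
    """If the virt driver can supply a stable uuid (for ironic, the
    ironic node uuid, which is also the nodename), reuse it as the
    compute node uuid; otherwise generate a random one. A stable uuid
    gives a predictable mapping from the compute node to the ironic
    node and to the placement resource provider, since all three
    would share the same uuid."""
    cn_uuid = driver_node_uuid or str(uuid.uuid4())
    return {'host': host, 'nodename': nodename, 'uuid': cn_uuid}


# Ironic case: the driver supplies the ironic node uuid.
ironic_uuid = '1f6f0d28-0000-0000-0000-000000000001'
cn = create_compute_node('compute-1', ironic_uuid,
                         driver_node_uuid=ironic_uuid)
```

With this in place, even a wrongly re-created compute node record would get the same uuid as before, so the placement resource provider mapping would survive.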