Compute node HA for ironic doesn't work due to Resource Provider name duplication

Bug #1714248 reported by Hironori Shiina on 2017-08-31
This bug affects 3 people
Affects: OpenStack Compute (nova)
Status: In Progress
Importance: High
Assigned to: John Garbutt

Bug Description

Description
===========
In an environment where there are multiple compute nodes with ironic driver,
when a compute node goes down, another compute node cannot take over ironic nodes.

Steps to reproduce
==================
1. Start multiple compute nodes with the ironic driver.
2. Register one node with ironic.
3. Stop the compute node which manages the ironic node.
4. Create an instance.

Expected result
===============
The instance is created.

Actual result
=============
The instance creation fails.

Environment
===========
1. Exact version of OpenStack you are running.
openstack-nova-scheduler-15.0.6-2.el7.noarch
openstack-nova-console-15.0.6-2.el7.noarch
python2-novaclient-7.1.0-1.el7.noarch
openstack-nova-common-15.0.6-2.el7.noarch
openstack-nova-serialproxy-15.0.6-2.el7.noarch
openstack-nova-placement-api-15.0.6-2.el7.noarch
python-nova-15.0.6-2.el7.noarch
openstack-nova-novncproxy-15.0.6-2.el7.noarch
openstack-nova-api-15.0.6-2.el7.noarch
openstack-nova-conductor-15.0.6-2.el7.noarch

2. Which hypervisor did you use?
ironic

Details
=======
When a nova-compute service goes down, another nova-compute takes over the ironic nodes managed by the failed service by re-balancing the hash ring. The active nova-compute then tries to create a new resource provider with a new ComputeNode object UUID and the hypervisor name (the ironic node UUID) [1][2][3]. This creation fails with a conflict (409) because a resource provider with the same name was already created by the failed nova-compute.
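
For illustration, a minimal sketch of the flow just described, using plain HTTP against the placement API; the endpoint, token, and UUIDs are placeholders, and the real client code lives in nova/scheduler/client/report.py (see [1][2][3]):

    import uuid
    import requests

    PLACEMENT = "http://controller/placement"   # placeholder endpoint
    HEADERS = {"X-Auth-Token": "ADMIN_TOKEN"}   # placeholder token

    ironic_node_uuid = "8904aeeb-a35b-4ba3-848a-73269fdde4d3"  # RP name
    new_compute_node_uuid = str(uuid.uuid4())                  # new RP uuid

    # 1. The surviving nova-compute looks up an RP by its new ComputeNode uuid...
    r = requests.get("%s/resource_providers/%s"
                     % (PLACEMENT, new_compute_node_uuid), headers=HEADERS)
    assert r.status_code == 404   # ...which does not exist yet.

    # 2. ...so it tries to create one, reusing the ironic node uuid as the name.
    r = requests.post("%s/resource_providers" % PLACEMENT, headers=HEADERS,
                      json={"uuid": new_compute_node_uuid,
                            "name": ironic_node_uuid})

    # 3. Placement rejects it: an RP with that name already exists, created by
    #    the failed nova-compute, so the takeover never completes.
    assert r.status_code == 409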

When a new instance is requested, the scheduler gets only the old resource provider for the ironic node [4], so the ironic node is not selected:

WARNING nova.scheduler.filters.compute_filter [req-a37d68b5-7ab1-4254-8698-502304607a90 7b55e61a07304f9cab1544260dcd3e41 e21242f450d948d7af2650ac9365ee36 - - -] (compute02, 8904aeeb-a35b-4ba3-848a-73269fdde4d3) ram: 4096MB disk: 849920MB io_ops: 0 instances: 0 has not been heard from in a while

[1] https://github.com/openstack/nova/blob/stable/ocata/nova/compute/resource_tracker.py#L464
[2] https://github.com/openstack/nova/blob/stable/ocata/nova/scheduler/client/report.py#L630
[3] https://github.com/openstack/nova/blob/stable/ocata/nova/scheduler/client/report.py#L410
[4] https://github.com/openstack/nova/blob/stable/ocata/nova/scheduler/filter_scheduler.py#L183
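
For illustration, a rough sketch (not nova's verbatim code) of the check behind the ComputeFilter warning above: the only resource provider placement returns for the ironic node maps to the dead host, whose service heartbeat is stale.

    import datetime

    SERVICE_DOWN_TIME = 60  # seconds; nova.conf [DEFAULT]/service_down_time default

    def compute_filter_passes(service_updated_at, now=None):
        """Approximation of ComputeFilter: reject hosts whose nova-compute
        service has not reported in within service_down_time."""
        now = now or datetime.datetime.utcnow()
        elapsed = (now - service_updated_at).total_seconds()
        return elapsed <= SERVICE_DOWN_TIME

    # The stale resource provider points at compute02, which stopped
    # heartbeating, so compute_filter_passes() returns False and the ironic
    # node is never scheduled.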

Matt Riedemann (mriedem) on 2017-08-31
tags: added: ironic placement
description: updated
Chris Dent (cdent) wrote :

This isn't the first time we've seen something like this. I wonder if we should think about what the impact would be if we removed the uniq requirement on the name field of a resource provider. It seems like it will inevitably cause problems as people/services start doing things with placement that span arbitrary boundaries (like time in this case) that matter to the client side, but are meaningless to placement.

Sean Dague (sdague) on 2017-09-05
Changed in nova:
status: New → Confirmed
importance: Undecided → High
Mark Goddard (mgoddard) wrote :

I believe I am also seeing this issue. First, a little about the environment.

The control plane is containerised, and deployed using an Ocata release of kolla-ansible. The base OS and container OS are both CentOS 7.3. The RDO nova compute package is openstack-nova-compute-15.0.6-2.el7.noarch. There are 3 OpenStack controllers, each with a nova compute service for ironic. There are 4 ironic baremetal nodes.

I have seen the issue twice now, and as Hironori described, the main user visible symptom is that one of the ironic nodes becomes unschedulable. Digging into the logs, the compute service to which the ironic node has been mapped shows the following messages occurring every minute:

2017-09-13 09:49:42.618 7 INFO nova.scheduler.client.report [req-569e86cc-a2c6-4043-8efa-ea31e14d86dc - - - - -] Another thread already created a resource provider with the UUID 22787651-ab4a-4c8b-b72b-5e20bb3fad2c. Grabbing that record from the placement API.
2017-09-13 09:49:42.631 7 WARNING nova.scheduler.client.report [req-569e86cc-a2c6-4043-8efa-ea31e14d86dc - - - - -] Unable to refresh my resource provider record
2017-09-13 09:49:42.689 7 DEBUG nova.compute.resource_tracker [req-569e86cc-a2c6-4043-8efa-ea31e14d86dc - - - - -] Total usable vcpus: 64, total allocated vcpus: 0 _report_final_resource_view /usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py:688
2017-09-13 09:49:42.690 7 INFO nova.compute.resource_tracker [req-569e86cc-a2c6-4043-8efa-ea31e14d86dc - - - - -] Final resource view: name=5d1535b1-0984-42b3-a574-a62afddd9307 phys_ram=262144MB used_ram=0MB phys_disk=222GB used_disk=0GB total_vcpus=64 used_vcpus=0 pci_stats=[]
2017-09-13 09:49:42.691 7 DEBUG nova.compute.resource_tracker [req-569e86cc-a2c6-4043-8efa-ea31e14d86dc - - - - -] Compute_service record updated for kef1p-phycon0003-ironic:5d1535b1-0984-42b3-a574-a62afddd9307 _update_available_resource /usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py:626

The placement logs are fairly lacking in useful information, even with logging set to debug. Picking out the relevant logs:

2017-09-13 09:51:43.604 20 DEBUG nova.api.openstack.placement.requestlog [req-298e44a2-5944-4322-87b2-e1b28d9fbc6a ac342c8d47c8416580ec6f3affcd287f 4970f0b152ca41dc968b4473bb8a48d9 - default default] Starting request: 10.105.1.3 "POST /resource_providers" __call__ /usr/lib/python2.7/site-packages/nova/api/openstack/placement/requestlog.py:38
2017-09-13 09:51:43.612 20 INFO nova.api.openstack.placement.requestlog [req-298e44a2-5944-4322-87b2-e1b28d9fbc6a ac342c8d47c8416580ec6f3affcd287f 4970f0b152ca41dc968b4473bb8a48d9 - default default] 10.105.1.3 "POST /resource_providers" status: 409 len: 675 microversion: 1.0

We can see here that the scheduler client first tries to GET the resource provider for compute node 22787651-ab4a-4c8b-b72b-5e20bb3fad2c and fails with a 404 Not Found. It then tries to create a resource provider for the compute node, but fails with a 409, presumably because a resource provider already exists with the same name (the ironic node UUID) but a different UUID.

Looking at the DB for further info, here's the troublesome RP:

+---------------------+-------------------...

Mark Goddard (mgoddard) wrote :

Staring at the nova code a little longer, I think I've pieced together what happened.

* An instance was aborted during creation.
* Destroying the instance failed because the node was locked (possibly due to a long running neutron port update), and the retry mechanism maxed out.
* Shortly afterwards, during the compute service's update_available_resource periodic task, the compute node was determined to be an orphan, and deleted.
* Deleting the resource provider for the compute node failed because allocations still existed from the instance that wasn't cleaned up.

This raises a question: why was the compute node seen as orphaned? It happened because the ironic virt driver did not include the node in the list returned by get_available_nodes(). I suspect this is because the ironic node still had an instance_uuid set, but that instance was not mapped to the compute host.
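
For illustration, a condensed paraphrase (not nova's verbatim code) of the orphan handling described above; delete_provider stands in for the report-client call and is named hypothetically.

    def clean_orphan_compute_nodes(driver, compute_nodes_in_db, delete_provider):
        # update_available_resource drops ComputeNode records the virt
        # driver no longer reports.
        reported = set(driver.get_available_nodes())
        for cn in compute_nodes_in_db:
            if cn.hypervisor_hostname not in reported:
                # Here the ironic node was filtered out of get_available_nodes()
                # because it still had an instance_uuid that was not mapped to
                # this host, so a live node was treated as an orphan.
                cn.destroy()
                # Deleting the matching resource provider then fails because
                # allocations from the half-deleted instance still exist.
                delete_provider(cn.uuid)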

Another point worth mentioning is that I ended up deleting the stale resource provider in the DB, and the compute service created another, allowing things to return to normal.
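
For comparison, a hedged sketch of doing the same cleanup through the placement HTTP API rather than editing the database directly; the endpoint and token are placeholders, and the stale UUID is the one from the logs above.

    import requests

    PLACEMENT = "http://controller/placement"   # placeholder endpoint
    HEADERS = {"X-Auth-Token": "ADMIN_TOKEN"}   # placeholder token
    stale_rp_uuid = "22787651-ab4a-4c8b-b72b-5e20bb3fad2c"

    # Remaining allocations block the delete, so remove them per consumer first.
    r = requests.get("%s/resource_providers/%s/allocations"
                     % (PLACEMENT, stale_rp_uuid), headers=HEADERS)
    for consumer_uuid in r.json().get("allocations", {}):
        requests.delete("%s/allocations/%s" % (PLACEMENT, consumer_uuid),
                        headers=HEADERS)

    # Now the stale provider can go; the active nova-compute recreates one
    # with its own ComputeNode uuid on the next periodic task.
    requests.delete("%s/resource_providers/%s" % (PLACEMENT, stale_rp_uuid),
                    headers=HEADERS)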

Mark Goddard (mgoddard) wrote :

Actually, looking at the ironic logs, the node was being provisioned at the time of the abort and the conductor had locked the node as it was handling the IPA callback.

Eric Fried (efried) wrote :

Pursuant to comment #3, why doesn't the new compute host react to the 409 by deleting the old RP?

Eric Fried (efried) wrote :

Oh, I think I understand: at the moment, this code is out of the control of ironic itself; it's all handled by the resource tracker.

So... why does the new compute host try to create the node RP with a *different* UUID? Doesn't the node have an immutable UUID that ought to be used no matter which host is doing the registration, or when, or why?

John Garbutt (johngarbutt) wrote :

The ironic driver just passes the node uuids up to Nova:
https://github.com/openstack/nova/blob/master/nova/virt/ironic/driver.py#L718

It's the resource tracker that decides to create resource providers from that. In a production Pike cloud, I see the RP name matching the ironic node uuid. Not sure where the other uuid came from :S

John Garbutt (johngarbutt) wrote :

Oh wait, it's totally the compute node uuid. RP uuid = ComputeNode uuid and name = ironic node uuid.

Nova creates a new compute node uuid for the same ironic node uuid, because it's on a different nova-compute host. That works fine, see:
https://github.com/openstack/nova/blob/master/nova/db/sqlalchemy/models.py#L117

The ComputeNode unique constraint only checks that ('host', 'hypervisor_hostname', 'deleted') is unique, so we totally allow the ironic node uuid to be registered twice in the compute_nodes table.

If you restarted the dead n-cpu node (yes, I know, that is stupid), it would delete the orphan compute nodes, but in this case it stays dead, so they are left in the DB.

But on the placement side, we have the (new uuid, existing node uuid) pair, and it fails the unique constraint on the name:
https://github.com/openstack/nova/blob/master/nova/db/sqlalchemy/api_models.py#L295

... So someone needs to delete the old resource provider. I guess the logic that creates the new one could check whether the name is used by some other node, delete it as required, re-assign all the allocations to the new node, etc... Yuck.
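
For reference, a condensed paraphrase (not verbatim) of the two constraints contrasted above, based on the linked models:

    from sqlalchemy import Column, Integer, String, schema
    from sqlalchemy.ext.declarative import declarative_base

    BASE = declarative_base()

    class ComputeNode(BASE):
        __tablename__ = 'compute_nodes'
        __table_args__ = (
            # Unique per (host, hypervisor_hostname, deleted): the same ironic
            # node uuid may appear once per nova-compute host.
            schema.UniqueConstraint(
                'host', 'hypervisor_hostname', 'deleted',
                name='uniq_compute_nodes0host0hypervisor_hostname0deleted'),
        )
        id = Column(Integer, primary_key=True)
        host = Column(String(255))
        hypervisor_hostname = Column(String(255))
        deleted = Column(Integer, default=0)

    class ResourceProvider(BASE):
        __tablename__ = 'resource_providers'
        __table_args__ = (
            # But placement requires the name (the ironic node uuid) to be
            # globally unique, hence the 409 on the second registration.
            schema.UniqueConstraint('uuid', name='uniq_resource_providers0uuid'),
            schema.UniqueConstraint('name', name='uniq_resource_providers0name'),
        )
        id = Column(Integer, primary_key=True)
        uuid = Column(String(36))
        name = Column(String(200))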

John Garbutt (johngarbutt) wrote :

So idea for brute force fix...

In the above change, we detect the conflict. We check whether the name is duplicated; if it is, we delete the old resource provider. To do that we probably have to delete all its allocations first, possibly also its inventory, etc.

Once that is all done, we create a new resource provider, then add the appropriate allocations.

... But now we have a race where another instance will come in and try to get allocations on the new thing, just after we create it, but before we add back in the allocations. Boom, dang, the sky is falling down.

Now in the virt driver we know the node has instance_uuid set, so probably not a total disaster, but it seems a nasty race.

... not sure where to go with this.
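
For illustration, a rough, hypothetical sketch of the brute-force sequence (and its race window) described in this comment; the placement helper names are illustrative, not nova's real report-client API.

    def replace_conflicting_provider(placement, new_uuid, name, inventory):
        # The 409 told us the name is taken; find and dismantle the old RP.
        old_rp = placement.get_provider_by_name(name)
        allocations = placement.get_allocations(old_rp['uuid'])  # keep a copy
        for consumer_uuid in allocations:
            placement.delete_allocations(consumer_uuid)
        placement.delete_inventory(old_rp['uuid'])
        placement.delete_provider(old_rp['uuid'])

        # Recreate it under the new ComputeNode uuid.
        placement.create_provider(uuid=new_uuid, name=name)
        placement.set_inventory(new_uuid, inventory)

        # Race window: between create_provider() above and the re-added
        # allocations below, the scheduler could hand this node out again.
        for consumer_uuid, alloc in allocations.items():
            placement.put_allocations(consumer_uuid, new_uuid, alloc)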

John Garbutt (johngarbutt) wrote :

... another possible fix is to change the ComputeNode.uuid to match the ironic_node.uuid, but we hit the same race above on the first restart after we make this change, as we delete all the auto-generated ComputeNode.uuids and update them to ironic_node.uuids. Although at least we only hit the nasty re-calculate race once... Yuck again.

John Garbutt (johngarbutt) wrote :

Better idea: after a rebalance, when we go to create compute nodes, let's check that matching compute node records don't already exist, and either create or re-purpose them depending on what we find.
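
For illustration, a hypothetical sketch of that idea; find_node_by_hostname and create_node stand in for whatever lookup/create helpers the real fix would use and are not existing nova APIs.

    def get_or_repurpose_compute_node(context, host, nodename,
                                      find_node_by_hostname, create_node):
        existing = find_node_by_hostname(context, nodename)
        if existing is not None:
            if existing.host != host:
                # A (now dead) nova-compute already registered this ironic
                # node. Take it over, keeping its uuid so the (uuid, name)
                # pair already recorded in placement stays valid.
                existing.host = host
                existing.save()
            return existing
        # No record anywhere: safe to create a brand new ComputeNode (and RP).
        return create_node(context, host, nodename)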

Changed in nova:
assignee: nobody → John Garbutt (johngarbutt)
status: Confirmed → In Progress

Change abandoned by John Garbutt (<email address hidden>) on branch: master
Review: https://review.openstack.org/508508
