Compute node HA for ironic doesn't work due to the name duplication of Resource Provider

Bug #1714248 reported by Hironori Shiina on 2017-08-31
38
This bug affects 6 people
Affects Status Importance Assigned to Milestone
Ironic
Invalid
Critical
Dmitry Tantsur
OpenStack Compute (nova)
High
John Garbutt
Pike
High
Matt Riedemann

Bug Description

Description
===========
In an environment where there are multiple compute nodes with ironic driver,
when a compute node goes down, another compute node cannot take over ironic nodes.

Steps to reproduce
==================
1. Start multiple compute nodes with ironic driver.
2. Register one node to ironic.
2. Stop a compute node which manages the ironic node.
3. Create an instance.

Expected result
===============
The instance is created.

Actual result
=============
The instance creation is failed.

Environment
===========
1. Exact version of OpenStack you are running.
openstack-nova-scheduler-15.0.6-2.el7.noarch
openstack-nova-console-15.0.6-2.el7.noarch
python2-novaclient-7.1.0-1.el7.noarch
openstack-nova-common-15.0.6-2.el7.noarch
openstack-nova-serialproxy-15.0.6-2.el7.noarch
openstack-nova-placement-api-15.0.6-2.el7.noarch
python-nova-15.0.6-2.el7.noarch
openstack-nova-novncproxy-15.0.6-2.el7.noarch
openstack-nova-api-15.0.6-2.el7.noarch
openstack-nova-conductor-15.0.6-2.el7.noarch

2. Which hypervisor did you use?
ironic

Details
=======
When a nova-compute goes down, another nova-compute will take over ironic nodes managed by the failed nova-compute by re-balancing a hash-ring. Then the active nova-compute tries creating a
new resource provider with a new ComputeNode object UUID and the hypervisor name (ironic node UUID)[1][2][3]. This creation fails with a conflict(409) since there is a resource provider with the same name created by the failed nova-compute.

When a new instance is requested, the scheduler gets only an old resource provider for the ironic node[4]. Then, the ironic node is not selected:

WARNING nova.scheduler.filters.compute_filter [req-a37d68b5-7ab1-4254-8698-502304607a90 7b55e61a07304f9cab1544260dcd3e41 e21242f450d948d7af2650ac9365ee36 - - -] (compute02, 8904aeeb-a35b-4ba3-848a-73269fdde4d3) ram: 4096MB disk: 849920MB io_ops: 0 instances: 0 has not been heard from in a while

[1] https://github.com/openstack/nova/blob/stable/ocata/nova/compute/resource_tracker.py#L464
[2] https://github.com/openstack/nova/blob/stable/ocata/nova/scheduler/client/report.py#L630
[3] https://github.com/openstack/nova/blob/stable/ocata/nova/scheduler/client/report.py#L410
[4] https://github.com/openstack/nova/blob/stable/ocata/nova/scheduler/filter_scheduler.py#L183

Matt Riedemann (mriedem) on 2017-08-31
tags: added: ironic placement
description: updated
Chris Dent (cdent) wrote :

This isn't the first time we've seen something like this. I wonder if we should think about what the impact would be if we removed the uniq requirement on the name field of a resource provider. It seems like it will inevitably cause problems as people/services start doing things with placement that span arbitrary boundaries (like time in this case) that matter to the client side, but are meaningless to placement.

Sean Dague (sdague) on 2017-09-05
Changed in nova:
status: New → Confirmed
importance: Undecided → High
Mark Goddard (mgoddard) wrote :
Download full text (14.8 KiB)

I believe I am also seeing this issue. First, a little about the environment.

The control plane is containerised, and deployed using an Ocata release of kolla-ansible. The base OS and container OS are both CentOS 7.3. The RDO nova compute package is openstack-nova-compute-15.0.6-2.el7.noarch. There are 3 OpenStack controllers, each with a nova compute service for ironic. There are 4 ironic baremetal nodes.

I have seen the issue twice now, and as Hironori described, the main user visible symptom is that one of the ironic nodes becomes unschedulable. Digging into the logs, the compute service to which the ironic node has been mapped shows the following messages occurring every minute:

2017-09-13 09:49:42.618 7 INFO nova.scheduler.client.report [req-569e86cc-a2c6-4043-8efa-ea31e14d86dc - - - - -] Another thread already created a resource provider with the UUID 22787651-ab4a-4c8b-b72b-5e20bb3fad2c. Grabbing that record from the placement API.
2017-09-13 09:49:42.631 7 WARNING nova.scheduler.client.report [req-569e86cc-a2c6-4043-8efa-ea31e14d86dc - - - - -] Unable to refresh my resource provider record
2017-09-13 09:49:42.689 7 DEBUG nova.compute.resource_tracker [req-569e86cc-a2c6-4043-8efa-ea31e14d86dc - - - - -] Total usable vcpus: 64, total allocated vcpus: 0 _report_final_resource_view /usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py:688
2017-09-13 09:49:42.690 7 INFO nova.compute.resource_tracker [req-569e86cc-a2c6-4043-8efa-ea31e14d86dc - - - - -] Final resource view: name=5d1535b1-0984-42b3-a574-a62afddd9307 phys_ram=262144MB used_ram=0MB phys_disk=222GB used_disk=0GB total_vcpus=64 used_vcpus=0 pci_stats=[]
2017-09-13 09:49:42.691 7 DEBUG nova.compute.resource_tracker [req-569e86cc-a2c6-4043-8efa-ea31e14d86dc - - - - -] Compute_service record updated for kef1p-phycon0003-ironic:5d1535b1-0984-42b3-a574-a62afddd9307 _update_available_resource /usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py:626

The placement logs are fairly lacking in useful information, even with logging set to debug. Picking out the relevant logs:

2017-09-13 09:51:43.604 20 DEBUG nova.api.openstack.placement.requestlog [req-298e44a2-5944-4322-87b2-e1b28d9fbc6a ac342c8d47c8416580ec6f3affcd287f 4970f0b152ca41dc968b4473bb8a48d9 - default default] Starting request: 10.105.1.3 "POST /resource_providers" __call__ /usr/lib/python2.7/site-packages/nova/api/openstack/placement/requestlog.py:38
2017-09-13 09:51:43.612 20 INFO nova.api.openstack.placement.requestlog [req-298e44a2-5944-4322-87b2-e1b28d9fbc6a ac342c8d47c8416580ec6f3affcd287f 4970f0b152ca41dc968b4473bb8a48d9 - default default] 10.105.1.3 "POST /resource_providers" status: 409 len: 675 microversion: 1.0

We can see here that the scheduler client first tries to GET the resource_provider for compute node 22787651-ab4a-4c8b-b72b-5e20bb3fad2c, but fails with a 404 not found. Following this, it tries to create a resource provider for the compute node, but fails with a 409, presumably because a resource provider exists with the same name (the ironic node UUID) but a different UUID.

Looking at the DB for further info, here's the troublesome RP:

+---------------------+-------------------...

Mark Goddard (mgoddard) wrote :

Staring at the nova code a little longer, I think I've pieced together what happened.

* An instance was aborted during creation.
* Destroying the instance failed because the node was locked (possibly due to a long running neutron port update), and the retry mechanism maxed out.
* Shortly afterwards, during the compute service's update_available_resource periodic task, the compute node was determined to be an orphan, and deleted.
* Deleting the resource provider for the compute node failed because allocations still existed from the instance that wasn't cleaned up.

This raises a question, why was the compute node seen to be orphaned? This happened because the ironic virt driver did not include the node in the list returned by get_available_nodes(). I suspect this is because the ironic node still had an instance_uuid set, but that instance was not mapped to the compute host.

Another point worth mentioning is that I ended up deleting the stale resource provider in the DB, and the compute service created another, allowing things to return to normal.

Mark Goddard (mgoddard) wrote :

Actually, looking at the ironic logs, the node was being provisioned at the time of the abort and the conductor had locked the node as it was handling the IPA callback.

Eric Fried (efried) wrote :

Pursuant to comment #3, why doesn't the new compute host react to the 409 by deleting the old RP?

Eric Fried (efried) wrote :

Oh, I think I understand: at the moment, this code is out of the control of ironic itself; it's all handled by the resource tracker.

So... why does the new compute host try to create the node RP with a *different* UUID? Doesn't the node have an immutable UUID that ought to be used no matter which host is doing the registration, or when, or why?

John Garbutt (johngarbutt) wrote :

The ironic driver just passes the node uuids up to Nova:
https://github.com/openstack/nova/blob/master/nova/virt/ironic/driver.py#L718

Its the resource tracker that decides to create resource providers from that. In production pike cloud, I see the RP name matching the ironic node uuid. Not sure where the other uuid came from :S

John Garbutt (johngarbutt) wrote :

Oh wait, its totally the compute node uuid. RP uuid=computeNode and name=ironic node uuid

Nova creates a new compute node uuid for the same ironic node uuid, because its on a different nova-compute node. That works fine see:
https://github.com/openstack/nova/blob/master/nova/db/sqlalchemy/models.py#L117

The ComputeNode unique constraint simply checks ('host', 'hypervisor_hostname', 'deleted') is unique, so we totally allow the ironic node uuid to be registered twice in the ComputeNode's table.

If you restarted the dead n-cpu node (yet, I know, that is stupid), it would delete the orphan compute nodes, but in this case it is dead, so that left in the DB.

But on placement side, we have the (new-uuid, existing-node-uuid) pair, and it fails the unique constraint on the name:
https://github.com/openstack/nova/blob/master/nova/db/sqlalchemy/api_models.py#L295

... So someone needs to delete the old resource provider, I guess the logic that creates the new one could check to see if the name is used by some other node, and delete it as required, and re-assign all the allocations to the new node, etc... Yuck.

John Garbutt (johngarbutt) wrote :

So idea for brute force fix...

On above change, we detect the conflict. We see if the name is duplicated, if it is we delete the old resource provider, to do that we probably have to delete all its allocations first, possible also its inventory, etc.

Once that is all done, we create a new resource provider, then add the appropriate allocations.

... But now we have a race where another instance will come in an try to get allocations on the new thing, just after we create it, but before we add back in the allocations. Boom, dang, the sky is falling down.

Now in the virt driver we know the node has the instance_uuid is set, so probably not a total disaster, but it seems a nasty race.

... not sure where to go with this.

John Garbutt (johngarbutt) wrote :

... another possible fix is to change the ComputeNode.uuid to match the ironic_node.uuid, but we hit the same race above on the first restart after we make this change, as we delete all the auto-generated ComputeNode.uuids and update them to ironic_node.uuids. Although at least we only hit the nasty re-calculate race once... Yuck again.

John Garbutt (johngarbutt) wrote :

Better idea: after rebalance, we must create compute nodes, lets check other compute nodes don't already exist, and either create or re-purpose depending on what happens.

Changed in nova:
assignee: nobody → John Garbutt (johngarbutt)
status: Confirmed → In Progress

Change abandoned by John Garbutt (<email address hidden>) on branch: master
Review: https://review.openstack.org/508508

Dmitry Tantsur (divius) wrote :

This started breaking ironic multinode CI, see investigation in https://bugs.launchpad.net/ironic/+bug/1737395

Changed in ironic:
status: New → Triaged
importance: Undecided → Critical
Changed in nova:
assignee: John Garbutt (johngarbutt) → Dmitry Tantsur (divius)
Dmitry Tantsur (divius) on 2017-12-11
Changed in ironic:
status: Triaged → In Progress
assignee: nobody → Dmitry Tantsur (divius)
Changed in nova:
assignee: Dmitry Tantsur (divius) → Matt Riedemann (mriedem)
Matt Riedemann (mriedem) on 2017-12-12
Changed in nova:
assignee: Matt Riedemann (mriedem) → John Garbutt (johngarbutt)
Changed in ironic:
status: In Progress → Invalid

Fix proposed to branch: stable/pike
Review: https://review.openstack.org/527423

Change abandoned by Matt Riedemann (<email address hidden>) on branch: stable/pike
Review: https://review.openstack.org/517925
Reason: This is using the wrong change ID. I've re-proposed the cherry pick here:

https://review.openstack.org/#/c/527423/

Reviewed: https://review.openstack.org/508555
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e3c5e22d1fde7ca916a8cc364f335fba8a3a798f
Submitter: Zuul
Branch: master

commit e3c5e22d1fde7ca916a8cc364f335fba8a3a798f
Author: John Garbutt <email address hidden>
Date: Fri Sep 29 15:48:54 2017 +0100

    Re-use existing ComputeNode on ironic rebalance

    When a nova-compute service dies that is one of several ironic based
    nova-compute services running, a node rebalance occurs to ensure there
    is still an active nova-compute service dealing with requests for the
    given instance that is running.

    Today, when this occurs, we create a new ComputeNode entry. This change
    alters that logic to detect the case of the ironic node rebalance and in
    that case we re-use the existing ComputeNode entry, simply updating the
    host field to match the new host it has been rebalanced onto.

    Previously we hit problems with placement when we get a new
    ComputeNode.uuid for the same ironic_node.uuid. This reusing of the
    existing entry keeps the ComputeNode.uuid the same when the rebalance of
    the ComputeNode occurs.

    Without keeping the same ComputeNode.uuid placement errors out with a 409
    because we attempt to create a ResourceProvider that has the same name
    as an existing ResourceProvdier. Had that worked, we would have noticed
    the race that occurs after we create the ResourceProvider but before we
    add back the existing allocations for existing instances. Keeping the
    ComputeNode.uuid the same means we simply look up the existing
    ResourceProvider in placement, avoiding all this pain and tears.

    Closes-Bug: #1714248
    Co-Authored-By: Dmitry Tantsur <email address hidden>
    Change-Id: I4253cffca3dbf558c875eed7e77711a31e9e3406

Changed in nova:
status: In Progress → Fix Released

This issue was fixed in the openstack/nova 17.0.0.0b3 development milestone.

Reviewed: https://review.openstack.org/527423
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=e95277fa3eba86a7290212313da4dc3c81c286f3
Submitter: Zuul
Branch: stable/pike

commit e95277fa3eba86a7290212313da4dc3c81c286f3
Author: John Garbutt <email address hidden>
Date: Fri Sep 29 15:48:54 2017 +0100

    Re-use existing ComputeNode on ironic rebalance

    When a nova-compute service dies that is one of several ironic based
    nova-compute services running, a node rebalance occurs to ensure there
    is still an active nova-compute service dealing with requests for the
    given instance that is running.

    Today, when this occurs, we create a new ComputeNode entry. This change
    alters that logic to detect the case of the ironic node rebalance and in
    that case we re-use the existing ComputeNode entry, simply updating the
    host field to match the new host it has been rebalanced onto.

    Previously we hit problems with placement when we get a new
    ComputeNode.uuid for the same ironic_node.uuid. This reusing of the
    existing entry keeps the ComputeNode.uuid the same when the rebalance of
    the ComputeNode occurs.

    Without keeping the same ComputeNode.uuid placement errors out with a 409
    because we attempt to create a ResourceProvider that has the same name
    as an existing ResourceProvdier. Had that worked, we would have noticed
    the race that occurs after we create the ResourceProvider but before we
    add back the existing allocations for existing instances. Keeping the
    ComputeNode.uuid the same means we simply look up the existing
    ResourceProvider in placement, avoiding all this pain and tears.

    Closes-Bug: #1714248
    Co-Authored-By: Dmitry Tantsur <email address hidden>
    Change-Id: I4253cffca3dbf558c875eed7e77711a31e9e3406
    (cherry picked from commit e3c5e22d1fde7ca916a8cc364f335fba8a3a798f)

This issue was fixed in the openstack/nova 16.1.1 release.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Duplicates of this bug

Other bug subscribers