Bug #1714248 “Compute node HA for ironic doesn't work due to the...” : Series pike : Bugs : OpenStack Compute (nova)

Matt Riedemann (mriedem) on 2017-08-31

tags:

added: ironic placement

Hironori Shiina (shiina-hironori) on 2017-08-31

description:

updated

Revision history for this message

Chris Dent (cdent) wrote on 2017-09-01:

#1

This isn't the first time we've seen something like this. I wonder if we should think about what the impact would be if we removed the uniq requirement on the name field of a resource provider. It seems like it will inevitably cause problems as people/services start doing things with placement that span arbitrary boundaries (like time in this case) that matter to the client side, but are meaningless to placement.

Sean Dague (sdague) on 2017-09-05

Changed in nova:
status:	New → Confirmed
importance:	Undecided → High

Revision history for this message

Mark Goddard (mgoddard) wrote on 2017-09-13:

#2

Download full text (14.8 KiB)

I believe I am also seeing this issue. First, a little about the environment.

The control plane is containerised, and deployed using an Ocata release of kolla-ansible. The base OS and container OS are both CentOS 7.3. The RDO nova compute package is openstack-nova-compute-15.0.6-2.el7.noarch. There are 3 OpenStack controllers, each with a nova compute service for ironic. There are 4 ironic baremetal nodes.

I have seen the issue twice now, and as Hironori described, the main user visible symptom is that one of the ironic nodes becomes unschedulable. Digging into the logs, the compute service to which the ironic node has been mapped shows the following messages occurring every minute:

2017-09-13 09:49:42.618 7 INFO nova.scheduler.client.report [req-569e86cc-a2c6-4043-8efa-ea31e14d86dc - - - - -] Another thread already created a resource provider with the UUID 22787651-ab4a-4c8b-b72b-5e20bb3fad2c. Grabbing that record from the placement API.
2017-09-13 09:49:42.631 7 WARNING nova.scheduler.client.report [req-569e86cc-a2c6-4043-8efa-ea31e14d86dc - - - - -] Unable to refresh my resource provider record
2017-09-13 09:49:42.689 7 DEBUG nova.compute.resource_tracker [req-569e86cc-a2c6-4043-8efa-ea31e14d86dc - - - - -] Total usable vcpus: 64, total allocated vcpus: 0 _report_final_resource_view /usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py:688
2017-09-13 09:49:42.690 7 INFO nova.compute.resource_tracker [req-569e86cc-a2c6-4043-8efa-ea31e14d86dc - - - - -] Final resource view: name=5d1535b1-0984-42b3-a574-a62afddd9307 phys_ram=262144MB used_ram=0MB phys_disk=222GB used_disk=0GB total_vcpus=64 used_vcpus=0 pci_stats=[]
2017-09-13 09:49:42.691 7 DEBUG nova.compute.resource_tracker [req-569e86cc-a2c6-4043-8efa-ea31e14d86dc - - - - -] Compute_service record updated for kef1p-phycon0003-ironic:5d1535b1-0984-42b3-a574-a62afddd9307 _update_available_resource /usr/lib/python2.7/site-packages/nova/compute/resource_tracker.py:626

The placement logs are fairly lacking in useful information, even with logging set to debug. Picking out the relevant logs:

2017-09-13 09:51:43.604 20 DEBUG nova.api.openstack.placement.requestlog [req-298e44a2-5944-4322-87b2-e1b28d9fbc6a ac342c8d47c8416580ec6f3affcd287f 4970f0b152ca41dc968b4473bb8a48d9 - default default] Starting request: 10.105.1.3 "POST /resource_providers" __call__ /usr/lib/python2.7/site-packages/nova/api/openstack/placement/requestlog.py:38
2017-09-13 09:51:43.612 20 INFO nova.api.openstack.placement.requestlog [req-298e44a2-5944-4322-87b2-e1b28d9fbc6a ac342c8d47c8416580ec6f3affcd287f 4970f0b152ca41dc968b4473bb8a48d9 - default default] 10.105.1.3 "POST /resource_providers" status: 409 len: 675 microversion: 1.0

We can see here that the scheduler client first tries to GET the resource_provider for compute node 22787651-ab4a-4c8b-b72b-5e20bb3fad2c, but fails with a 404 not found. Following this, it tries to create a resource provider for the compute node, but fails with a 409, presumably because a resource provider exists with the same name (the ironic node UUID) but a different UUID.

Looking at the DB for further info, here's the troublesome RP:

+---------------------+-------------------...