Resource_provider entry related to a deleted compute node, unable to migrate vms to the node

Bug #1849701 reported by Giuseppe Petralia
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

Description
===========
Migrating a VM to a node failed with the following error:

"There was a conflict when trying to complete your request.\n\n Conflicting resource provider name: mymachine.maas already exists."

https://paste.ubuntu.com/p/4dxS6d8X8p/

Steps to reproduce
==================

We found that the compute node had been added multiple times. The valid record is the one with created_at 2019-08-22 18:47:31:

mysql> select created_at, deleted_at from compute_nodes where host="mymachine";
+---------------------+---------------------+
| created_at          | deleted_at          |
+---------------------+---------------------+
| 2019-08-22 18:47:31 | NULL                |
| 2019-08-21 11:50:26 | 2019-08-22 11:04:27 |
| 2019-08-22 16:25:52 | 2019-08-22 16:58:42 |
| 2019-08-22 18:42:39 | 2019-08-22 18:45:36 |
+---------------------+---------------------+
4 rows in set (0.00 sec)

The resource provider entry, however, belonged to one of the already deleted compute node records:

mysql> select created_at from resource_providers where name="mymachine.maas";
+---------------------+
| created_at          |
+---------------------+
| 2019-08-22 18:42:40 |
+---------------------+
1 row in set (0.00 sec)
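
The match between the two tables can be checked by comparing timestamps. The following is a minimal Python sketch using only the values from the two queries above. The helper name `closest_compute_node` is hypothetical, and pairing rows by nearest creation time is an assumption made for illustration (it presumes the provider is registered shortly after its compute node record is created), not something nova itself does:

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# (created_at, deleted_at) rows from compute_nodes, as shown above.
COMPUTE_NODES = [
    ("2019-08-22 18:47:31", None),
    ("2019-08-21 11:50:26", "2019-08-22 11:04:27"),
    ("2019-08-22 16:25:52", "2019-08-22 16:58:42"),
    ("2019-08-22 18:42:39", "2019-08-22 18:45:36"),
]

# created_at of the resource_providers row named mymachine.maas.
PROVIDER_CREATED_AT = "2019-08-22 18:42:40"


def closest_compute_node(provider_created_at, nodes):
    """Return the compute_nodes row whose created_at is nearest to the provider's."""
    provider_ts = datetime.strptime(provider_created_at, FMT)
    return min(
        nodes,
        key=lambda row: abs(
            (datetime.strptime(row[0], FMT) - provider_ts).total_seconds()
        ),
    )


created_at, deleted_at = closest_compute_node(PROVIDER_CREATED_AT, COMPUTE_NODES)
print(f"provider matches compute node created at {created_at}, deleted at {deleted_at}")
```

With the data above, the nearest record is the one created at 18:42:39, one second before the provider, and that record was deleted at 18:45:36.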

We tried to delete it:

mysql> delete from resource_providers where name="mymachine.maas";
ERROR 1451 (23000): Cannot delete or update a parent row: a foreign key constraint fails (`nova_api`.`resource_providers`, CONSTRAINT `resource_providers_ibfk_1` FOREIGN KEY (`root_provider_id`) REFERENCES `resource_providers` (`id`))

It is strange that root_provider_id seems to reference the same row of the same table, which makes deleting any row of this table impossible:

mysql> select id,root_provider_id from resource_providers;
+----+------------------+
| id | root_provider_id |
+----+------------------+
|  1 |                1 |
|  4 |                4 |
|  7 |                7 |
| 10 |               10 |
| 13 |               13 |
| 16 |               16 |
| 19 |               19 |
| 22 |               22 |
| 28 |               28 |
| 31 |               31 |
| 34 |               34 |
| 37 |               37 |
| 40 |               40 |
| 43 |               43 |
| 45 |               45 |
| 52 |               52 |
| 55 |               55 |
| 58 |               58 |
| 61 |               61 |
| 64 |               64 |
| 67 |               67 |
| 70 |               70 |
| 73 |               73 |
| 76 |               76 |
| 79 |               79 |
| 82 |               82 |
| 91 |               91 |
+----+------------------+
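
One possible reading of ERROR 1451 here: InnoDB checks foreign keys immediately, row by row, rather than deferring the check to the end of the statement, so a row whose root_provider_id equals its own id counts as a child of itself. The toy Python model below only illustrates that behaviour. It is not MySQL code, the function name `delete_row` is hypothetical, and the final step (clearing the self-reference before deleting) is shown only inside the model and was not tried against the nova_api database:

```python
# Toy model of an immediate (non-deferred) foreign key check.
# rows maps id -> root_provider_id, like the query result above.

def delete_row(rows, row_id):
    """Delete row_id unless some row still references it, itself included."""
    referencing = [rid for rid, root in rows.items() if root == row_id]
    if referencing:
        raise ValueError(f"row {row_id} is still referenced by rows {referencing}")
    del rows[row_id]


rows = {1: 1, 4: 4, 7: 7}

try:
    delete_row(rows, 4)
except ValueError as exc:
    print("delete rejected:", exc)

# In the model, clearing the self-reference removes the only referencing row.
rows[4] = None
delete_row(rows, 4)
print(sorted(rows))
```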

Expected result
===============
The resource provider entry should be deleted when a compute node is deleted, so that VMs can be migrated to the node.

Workaround
==========
We renamed the stale resource provider entry to "invalid":

mysql> update resource_providers set name="invalid" where name="mymachine.maas";
Query OK, 1 row affected (0.01 sec)

Then we restarted nova-compute on the node:

systemctl restart nova-compute

The resource provider entry was recreated:

mysql> select * from resource_providers where name="mymachine.maas";
+---------------------+---------------------+-----+--------------------------------------+------------------+------------+----------+------------------+--------------------+
| created_at          | updated_at          | id  | uuid                                 | name             | generation | can_host | root_provider_id | parent_provider_id |
+---------------------+---------------------+-----+--------------------------------------+------------------+------------+----------+------------------+--------------------+
| 2019-10-24 15:16:51 | 2019-10-24 15:18:12 | 384 | e6dabd5d-d1ed-4fd5-a1e0-0be3b360fb28 | mymachine.maas   |          2 |     NULL |              384 |               NULL |
+---------------------+---------------------+-----+--------------------------------------+------------------+------------+----------+------------------+--------------------+

After that, the migration worked.

Environment
===========
xenial-queens cloud

Nova compute node:

dpkg -l | grep nova
ii nova-api-metadata 2:17.0.10-0ubuntu2.1~cloud0 all OpenStack Compute - metadata API frontend
ii nova-common 2:17.0.10-0ubuntu2.1~cloud0 all OpenStack Compute - common files
ii nova-compute 2:17.0.10-0ubuntu2.1~cloud0 all OpenStack Compute - compute node base
ii nova-compute-kvm 2:17.0.10-0ubuntu2.1~cloud0 all OpenStack Compute - compute node (KVM)
ii nova-compute-libvirt 2:17.0.10-0ubuntu2.1~cloud0 all OpenStack Compute - compute node libvirt support
ii python-nova 2:17.0.10-0ubuntu2.1~cloud0 all OpenStack Compute Python libraries
ii python-novaclient 2:9.1.1-0ubuntu1~cloud0 all client library for OpenStack Compute API - Python 2.7

Nova Cloud Controller:

dpkg -l | grep nova
ii nova-api-os-compute 2:17.0.9-0ubuntu1~cloud0 all OpenStack Compute - OpenStack Compute API frontend
ii nova-common 2:17.0.9-0ubuntu1~cloud0 all OpenStack Compute - common files
ii nova-conductor 2:17.0.9-0ubuntu1~cloud0 all OpenStack Compute - conductor service
ii nova-consoleauth 2:17.0.9-0ubuntu1~cloud0 all OpenStack Compute - Console Authenticator
ii nova-novncproxy 2:17.0.9-0ubuntu1~cloud0 all OpenStack Compute - NoVNC proxy
ii nova-placement-api 2:17.0.9-0ubuntu1~cloud0 all OpenStack Compute - placement API frontend
ii nova-scheduler 2:17.0.9-0ubuntu1~cloud0 all OpenStack Compute - virtual machine scheduler
ii nova-spiceproxy 2:17.0.9-0ubuntu1~cloud0 all OpenStack Compute - spice html5 proxy
ii python-nova 2:17.0.9-0ubuntu1~cloud0 all OpenStack Compute Python libraries
ii python-novaclient 2:9.1.1-0ubuntu1~cloud0 all client library for OpenStack Compute API - Python 2.7

Tags: placement
Ryan Farrell (whereisrysmind) wrote :

The oldest nova log file has the resource provider created event & the start of the error messages:

https://pastebin.canonical.com/p/xKzF5qZNZv/

Matt Riedemann (mriedem) wrote :

Which specific release of nova on queens are you running? When you deleted the compute node, did you do it by deleting the compute service through the DELETE /os-services/{service_id} API? If so, that should clean up the resource provider, assuming it has no allocations against it in placement, since allocations prevent us from deleting it.

Do you have this fix which was released in queens 17.0.5:

https://review.opendev.org/#/c/563698/

If so, as I said, allocations against the resource provider might be why it wasn't deleted when the compute service/node was deleted. There are some related known bugs:

https://bugs.launchpad.net/nova/+bug/1829479

https://bugs.launchpad.net/nova/+bug/1817833

tags: added: placement
Matt Riedemann (mriedem) wrote :

Also, are you stopping the nova-compute service when deleting it? If not, a periodic task on the compute service will recreate the compute_nodes table entry, which is why you'd keep seeing it show up. Note the message in the API reference:

https://docs.openstack.org/api-ref/compute/?expanded=delete-compute-service-detail#delete-compute-service

Matt Riedemann (mriedem)
Changed in nova:
status: New → Incomplete
Launchpad Janitor (janitor) wrote :

[Expired for OpenStack Compute (nova) because there has been no activity for 60 days.]

Changed in nova:
status: Incomplete → Expired