Cleanup pending instances in "building" state

Bug #2007922 reported by eblock@nde.ag
This bug affects 1 person
Affects: OpenStack Compute (nova)
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

Following up on the ML thread [1], it was recommended to create a bug report.
After a network issue in a Victoria cluster (3 control nodes in HA mode, 26 compute nodes), some instance builds were interrupted. Some of them could be cleaned up with 'openstack server delete', but two of them cannot. They already have a mapping but cannot be removed (or "reset-state") by nova. Both are amphora instances from octavia:

control01:~ # openstack server list --project service -c ID -c Name -c Status -f value | grep BUILD
0453a7e5-e4f9-419b-ad71-d837a20ef6bb amphora-0ee32901-0c59-4752-8253-35b66da176ea BUILD
dc8cdc3a-f6b2-469b-af6f-ba2aa130ea9b amphora-4990a47b-fe8a-431a-90ec-5ac2368a5251 BUILD

control01:~ # openstack server delete amphora-0ee32901-0c59-4752-8253-35b66da176ea
No server with a name or ID of
'amphora-0ee32901-0c59-4752-8253-35b66da176ea' exists.

control01:~ # openstack server show 0453a7e5-e4f9-419b-ad71-d837a20ef6bb
ERROR (CommandError): No server with a name or ID of
'0453a7e5-e4f9-419b-ad71-d837a20ef6bb' exists.

The database tables referring to the UUID
0453a7e5-e4f9-419b-ad71-d837a20ef6bb are these:

nova_cell0/instance_id_mappings.ibd
nova_cell0/instance_info_caches.ibd
nova_cell0/instance_extra.ibd
nova_cell0/instances.ibd
nova_cell0/instance_system_metadata.ibd
octavia/amphora.ibd
nova_api/instance_mappings.ibd
nova_api/request_specs.ibd

I can provide both debug logs and database queries, just let me know what exactly is required.

The storage back end is ceph (Pacific), we use neutron with OpenVSwitch, the exact nova versions are:

control01:~ # rpm -qa | grep nova
openstack-nova-conductor-22.2.2~dev15-lp152.1.25.noarch
openstack-nova-api-22.2.2~dev15-lp152.1.25.noarch
openstack-nova-novncproxy-22.2.2~dev15-lp152.1.25.noarch
python3-novaclient-17.2.0-lp152.3.2.noarch
openstack-nova-scheduler-22.2.2~dev15-lp152.1.25.noarch
openstack-nova-22.2.2~dev15-lp152.1.25.noarch
python3-nova-22.2.2~dev15-lp152.1.25.noarch

[1] https://lists.openstack.org/pipermail/openstack-discuss/2023-February/032308.html

Sylvain Bauza (sylvain-bauza) wrote :

First, can you verify whether the instance_mappings table has a DB record for the server UUID?

If so, please tell us which cell the instance is mapped to, so that you can also check the instances table in that cell's DB.

Changed in nova:
status: New → Incomplete
eblock@nde.ag (eblock) wrote :

Yes, there's a mapping:

MariaDB [nova_cell0]> select * from nova_api.instance_mappings where instance_uuid='0453a7e5-e4f9-419b-ad71-d837a20ef6bb';
+---------------------+------------+-------+--------------------------------------+---------+----------------------------------+-------------------+----------------------------------+
| created_at | updated_at | id | instance_uuid | cell_id | project_id | queued_for_delete | user_id |
+---------------------+------------+-------+--------------------------------------+---------+----------------------------------+-------------------+----------------------------------+
| 2023-02-04 09:08:02 | NULL | 47706 | 0453a7e5-e4f9-419b-ad71-d837a20ef6bb | NULL | 015b660a2ea549f8ab665d6218b23528 | 0 | 2437d34dfd704420a575c1595b34bdfe |
+---------------------+------------+-------+--------------------------------------+---------+----------------------------------+-------------------+----------------------------------+

The cell_id field has not been populated, it seems. Here's the output from the nova_cell0.instances table (do you need more columns than these?):

MariaDB [nova_cell0]> select uuid,power_state,vm_state,host,reservation_id,display_name,vm_mode,task_state,cell_name,node,deleted,cleaned from instances where uuid='0453a7e5-e4f9-419b-ad71-d837a20ef6bb';
+--------------------------------------+-------------+----------+------+----------------+----------------------------------------------+---------+------------+-----------+------+---------+---------+
| uuid | power_state | vm_state | host | reservation_id | display_name | vm_mode | task_state | cell_name | node | deleted | cleaned |
+--------------------------------------+-------------+----------+------+----------------+----------------------------------------------+---------+------------+-----------+------+---------+---------+
| 0453a7e5-e4f9-419b-ad71-d837a20ef6bb | 0 | building | NULL | r-ukgidzt0 | amphora-0ee32901-0c59-4752-8253-35b66da176ea | NULL | scheduling | NULL | NULL | 0 | 0 |
+--------------------------------------+-------------+----------+------+----------------+----------------------------------------------+---------+------------+-----------+------+---------+---------+
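To check whether other instances are affected in the same way, one could list every mapping whose cell_id is still NULL. The sketch below is only an illustration: it uses an in-memory SQLite table as a stand-in for nova_api.instance_mappings (reduced to the columns needed), and the second row and the helper name find_unmapped are hypothetical. Against a real deployment, the same SELECT would be run in the MariaDB nova_api database.

```python
import sqlite3

# In-memory stand-in for nova_api.instance_mappings (assumption: only the
# columns needed for this check are modelled here).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE instance_mappings ("
    " id INTEGER PRIMARY KEY,"
    " instance_uuid TEXT NOT NULL UNIQUE,"
    " cell_id INTEGER,"
    " queued_for_delete INTEGER DEFAULT 0)"
)
conn.executemany(
    "INSERT INTO instance_mappings (instance_uuid, cell_id) VALUES (?, ?)",
    [
        # The stuck amphora instance from this report: no cell assigned.
        ("0453a7e5-e4f9-419b-ad71-d837a20ef6bb", None),
        # Hypothetical healthy instance, included only for contrast.
        ("11111111-1111-1111-1111-111111111111", 4),
    ],
)


def find_unmapped(conn):
    """Return the UUIDs of all instance mappings that have no cell_id."""
    rows = conn.execute(
        "SELECT instance_uuid FROM instance_mappings"
        " WHERE cell_id IS NULL ORDER BY id"
    ).fetchall()
    return [row[0] for row in rows]


unmapped = find_unmapped(conn)
print(unmapped)
```

Only the instance without a cell_id is returned; mapped instances are left out.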

Sylvain Bauza (sylvain-bauza) wrote :

OK, I think I can guess the problem: the instance build failed, which is why it is in cell0 (a special cell that holds instances that hit a scheduling failure), but for some reason the cell wasn't set correctly in the instance_mappings table.

Could you please run nova-manage cell_v2 verify_instance --uuid <instance_uuid>?
https://docs.openstack.org/nova/latest/cli/nova-manage.html#cell-v2-verify-instance

It will tell you whether the instance is mapped to a cell or not.

Mohammed Naser (mnaser) wrote :

eblock@nde.ag (eblock) wrote :

It is not mapped:

control01:~ # nova-manage cell_v2 verify_instance --uuid 0453a7e5-e4f9-419b-ad71-d837a20ef6bb
Instance 0453a7e5-e4f9-419b-ad71-d837a20ef6bb is not mapped to a cell
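When several instances need checking, the return code of verify_instance can be evaluated instead of reading the message. The following is a rough sketch, not part of nova: the helper names describe and verify_instance are made up for illustration, and the return-code meanings are taken from the nova-manage documentation linked above, so they should be checked against the installed release.

```python
import subprocess

# Return codes of "nova-manage cell_v2 verify_instance" as described in the
# nova-manage documentation (assumption: unchanged in your release).
VERIFY_RESULTS = {
    0: "mapped to a cell",
    1: "not mapped to a cell",
    2: "cell mapping is missing",
    3: "deleted instance that still has an instance mapping",
    4: "archived instance that still has an instance mapping",
}


def describe(returncode):
    """Translate a verify_instance return code into a short description."""
    return VERIFY_RESULTS.get(
        returncode, "unexpected return code %d" % returncode
    )


def verify_instance(uuid, run=subprocess.run):
    """Run verify_instance for one UUID and describe the result.

    The 'run' parameter exists only so the command runner can be replaced,
    for example in a dry run on a host without nova-manage.
    """
    proc = run(["nova-manage", "cell_v2", "verify_instance", "--uuid", uuid])
    return describe(proc.returncode)
```

On a control node, verify_instance("0453a7e5-e4f9-419b-ad71-d837a20ef6bb") would call nova-manage; for the instance in this report the command reported that it is not mapped to a cell.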

Sylvain Bauza (sylvain-bauza) wrote :

@Mohammed, it could be, but that bug was fixed well before Victoria, so the environment here should already have this transaction.

@eblock, I wonder whether nova-manage cell_v2 map_instances --cell_uuid <cell_uuid> (with cell0 uuid) would work.
https://docs.openstack.org/nova/latest/cli/nova-manage.html#cell-v2-map-instances

Could you maybe try this ?

Sylvain Bauza (sylvain-bauza) wrote :

Actually, a brief discussion on IRC confirmed my suspicion: map_instances shouldn't pick up this instance, since it already has an instance mapping record.

My counter-proposal is to alter the instance mapping record directly, so that it references cell0 (instead of holding the None value).

Sylvain Bauza (sylvain-bauza) wrote :

Yeah, confirmed: you could run nova-manage cell_v2 map_instances --cell_uuid <cell0> and it will correctly find the instance in the instances table. But when it tries to persist the new InstanceMapping, it will fail with a DBDuplicateEntry error, because the instance_mappings table already contains a record for that instance UUID.

https://github.com/openstack/nova/blob/439c67254859485011e7fd2859051464e570d78b/nova/cmd/manage.py#L791-L792

So, yeah, the easiest fix is to alter the instance_mappings record by hand so that it references cell0 ... or to delete the instance_mappings record for that instance and run nova-manage cell_v2 map_instances again (with the --reset parameter).

eblock@nde.ag (eblock) wrote :

Thanks so much, that resolved the issue. I updated the instance_mappings table with the cell_id of cell0, and then I could delete (or show) the instances.

MariaDB [nova_api]> update instance_mappings set cell_id='3' where instance_uuid='0453a7e5-e4f9-419b-ad71-d837a20ef6bb';

One comment on the cell_id: I had to search the cell_mappings table to find the correct id (not the uuid). Maybe I missed it, maybe it's available in newer releases, or maybe this is just a rare case, but I imagine it would help to have the cell_id in the output of 'list_cells' as well. Apart from that, we could close this bug; I'll wait for a comment about the cell_id.
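The id lookup and the update can be combined so that the numeric cell_id never has to be copied by hand. The sketch below is an assumption-laden illustration: it models cell_mappings and instance_mappings as in-memory SQLite tables, relies on cell0 having the fixed all-zero UUID that nova assigns to it, reuses the id 3 from this deployment as sample data, and the helper name map_to_cell0 is hypothetical. On a real system the equivalent statements would be run against nova_api in MariaDB, after taking a backup.

```python
import sqlite3

# nova uses a fixed, all-zero UUID for cell0.
CELL0_UUID = "00000000-0000-0000-0000-000000000000"

# In-memory stand-ins for the two nova_api tables involved (assumption:
# reduced to the columns needed; id 3 mirrors the value found in this report).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE cell_mappings (
        id INTEGER PRIMARY KEY, uuid TEXT NOT NULL UNIQUE, name TEXT);
    CREATE TABLE instance_mappings (
        id INTEGER PRIMARY KEY, instance_uuid TEXT NOT NULL UNIQUE,
        cell_id INTEGER);
    INSERT INTO cell_mappings (id, uuid, name)
        VALUES (3, '00000000-0000-0000-0000-000000000000', 'cell0');
    INSERT INTO instance_mappings (instance_uuid, cell_id)
        VALUES ('0453a7e5-e4f9-419b-ad71-d837a20ef6bb', NULL);
    """
)


def map_to_cell0(conn, instance_uuid):
    """Point an unmapped instance mapping at cell0; return rows changed."""
    row = conn.execute(
        "SELECT id FROM cell_mappings WHERE uuid = ?", (CELL0_UUID,)
    ).fetchone()
    if row is None:
        raise LookupError("no cell0 record found in cell_mappings")
    with conn:  # commit the update, or roll back on error
        cur = conn.execute(
            "UPDATE instance_mappings SET cell_id = ?"
            " WHERE instance_uuid = ? AND cell_id IS NULL",
            (row[0], instance_uuid),
        )
    return cur.rowcount


changed = map_to_cell0(conn, "0453a7e5-e4f9-419b-ad71-d837a20ef6bb")
print(changed)  # prints 1
```

The "cell_id IS NULL" condition keeps the update from touching a mapping that already points to a cell, so running it a second time changes nothing.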

Sylvain Bauza (sylvain-bauza) wrote :

Well, I don't really know the root cause, or why map_instances() didn't add the cell UUID directly when it first created the record here. Now we have a transaction, as Mohammed said, so it shouldn't be a problem.

Closing this bug report now.

Changed in nova:
status: Incomplete → Invalid