Server appears in "openstack server list" but "openstack server (show|delete|etc)" insists it doesn't exist

Bug #1871925 reported by Lars Kellogg-Stedman
This bug affects 6 people
Affects: OpenStack Compute (nova)
Status: Confirmed
Importance: Medium
Assigned to: Unassigned

Bug Description

We found a server that shows up in the output of "openstack server list":

  $ openstack server list | grep 705e3bb6-27a1-4b15-a37b-62f914fc221c
  | 705e3bb6-27a1-4b15-a37b-62f914fc221c | 1mg89chr-bc4a8-sh8xn-worker-slpnh | ERROR | | 1mg89chr-bc4a8-sh8xn-rhcos | m1.s2.xlarge |

But we were otherwise unable to access it:

  $ openstack server show 705e3bb6-27a1-4b15-a37b-62f914fc221c
  No server with a name or ID of '705e3bb6-27a1-4b15-a37b-62f914fc221c' exists.

Examining the nova database, there is an entry in the instances table:

  MariaDB [nova_api]> select id, uuid, vm_state, task_state from nova.instances where uuid = "705e3bb6-27a1-4b15-a37b-62f914fc221c";
  +--------+--------------------------------------+----------+------------+
  | id     | uuid                                 | vm_state | task_state |
  +--------+--------------------------------------+----------+------------+
  | 198335 | 705e3bb6-27a1-4b15-a37b-62f914fc221c | error    | NULL       |
  +--------+--------------------------------------+----------+------------+
  1 row in set (0.00 sec)

There is also an entry in the nova_api.instance_mappings table, but the data appears to be invalid (cell_id is NULL):

  MariaDB [nova_api]> select * from instance_mappings where instance_uuid="705e3bb6-27a1-4b15-a37b-62f914fc221c";
  +---------------------+------------+--------+--------------------------------------+---------+----------------------------------+
  | created_at          | updated_at | id     | instance_uuid                        | cell_id | project_id                       |
  +---------------------+------------+--------+--------------------------------------+---------+----------------------------------+
  | 2020-02-22 08:01:53 | NULL       | 211514 | 705e3bb6-27a1-4b15-a37b-62f914fc221c |    NULL | 593227d1d5d04cba8847d5b6b742e0a7 |
  +---------------------+------------+--------+--------------------------------------+---------+----------------------------------+

Updating the entry so that cell_id is non-null allowed things to work as expected:

  MariaDB [nova_api]> update instance_mappings set cell_id=5 where id=211514;

And now:

  $ openstack --os-cloud kaizen-admin server show 705e3bb6-27a1-4b15-a37b-62f914fc221c -c id -c status
  +--------+--------------------------------------+
  | Field  | Value                                |
  +--------+--------------------------------------+
  | id     | 705e3bb6-27a1-4b15-a37b-62f914fc221c |
  | status | ERROR                                |
  +--------+--------------------------------------+
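
For reference, the cell_id value used here should match the id of the row in nova_api.cell_mappings for the cell database that actually holds the instance record. A quick way to confirm which value to use, with this instance's UUID, might look like:

  MariaDB [nova_api]> select id, name from cell_mappings;
  MariaDB [nova_api]> select uuid, host from nova.instances where uuid = "705e3bb6-27a1-4b15-a37b-62f914fc221c";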

This is a Queens environment. We are currently running nova 17.0.13, but this issue probably cropped up before we updated to that version.

Tags: conductor
melanie witt (melwitt) wrote :

This issue can occur if the attempt to update the instance mapping with a cell_id fails due to a DBError.

There are three places we update the instance mapping with a cell.

  * Putting an instance in cell0 due to a failure to schedule [1]
  * Successfully scheduling to a cell on the first scheduling attempt [2]
  * Cleaning up build artifacts when an instance is deleted in the middle of building [3]

To fix this bug, we need to figure out what we should do if an attempt to update the instance mapping record fails.

Some ideas:

  * delete the instance record to prevent orphaning it ... but note that this can also fail if it too hits a DBError. And can we fill in instance fault information in the build request? How would the user know what happened to their instance?
  * retry the instance mapping cell_id update. How many times?

[1] https://github.com/openstack/nova/blob/7a71408a79dc81f344ee6c7760fa881afb935dfc/nova/conductor/manager.py#L1424
[2] https://github.com/openstack/nova/blob/7a71408a79dc81f344ee6c7760fa881afb935dfc/nova/conductor/manager.py#L1686-L1713
[3] https://github.com/openstack/nova/blob/7a71408a79dc81f344ee6c7760fa881afb935dfc/nova/conductor/manager.py#L1757
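
For context, each of those call sites boils down to the same operation: look up the InstanceMapping for the instance and save it with cell_mapping set. A simplified sketch of that update (not the exact conductor code):

  from nova import objects

  # Simplified sketch of the update performed at the call sites above;
  # if save() raises a DBError, the mapping is left with cell_id=NULL
  # and the instance becomes unreachable through the API, as reported.
  inst_mapping = objects.InstanceMapping.get_by_instance_uuid(
      context, instance.uuid)
  inst_mapping.cell_mapping = cell_mapping
  inst_mapping.save()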

Changed in nova:
importance: Undecided → Medium
status: New → Confirmed
tags: added: conductor
melanie witt (melwitt) wrote :

Note: I found a fourth place in the code where an instance mapping could be updated with a cell_id [4], but I'm not sure that code is ever reached because it's in the build_instances method, which is only called during reschedules, as far as I know:

            # This is populated in scheduler_utils.populate_retry
            num_attempts = local_filter_props.get('retry',
                                                  {}).get('num_attempts', 1)
            if num_attempts <= 1:
                # If this is a reschedule the instance is already mapped to
                # this cell and the BuildRequest is already deleted so ignore
                # the logic below.
                inst_mapping = self._populate_instance_mapping(context,
                                                               instance,
                                                               host)

[4] https://github.com/openstack/nova/blob/7a71408a79dc81f344ee6c7760fa881afb935dfc/nova/conductor/manager.py#L817

melanie witt (melwitt) wrote :

Based on code inspection and looking at similar "important updates" in nova/db/sqlalchemy/api.py, I'm guessing we'll want to add something like @oslo_db_api.wrap_db_retry(max_retries=5, retry_on_deadlock=True) to the instance mapping update call.

But we'd have to know what error is being raised when the update fails, and so far we don't have any logs showing what caused the failure. Is it a DBDeadlock? Is it just a DBError? Is it something else entirely that should be fixed instead of the instance mapping update?

Until we can reproduce the issue or get logs showing what error is being raised, I don't think we can know how to fix this, unfortunately.
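
If it does turn out to be a DBDeadlock, the retry decorator mentioned above could be applied roughly like this (a sketch only; the helper name is made up, not existing nova code):

  from oslo_db import api as oslo_db_api

  @oslo_db_api.wrap_db_retry(max_retries=5, retry_on_deadlock=True)
  def _set_instance_mapping_cell(inst_mapping, cell_mapping):
      # Hypothetical helper: retry the mapping save on DBDeadlock up to
      # five times, mirroring similar retries in nova/db/sqlalchemy/api.py.
      inst_mapping.cell_mapping = cell_mapping
      inst_mapping.save()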

melanie witt (melwitt) wrote :

To work around the issue described in comment 0, where an instance has an instance_mapping with cell_id=NULL but also has a record in the nova_cell0 or nova/nova_cell1 database, manual database updates are required. This involves deleting the instance, which is unusable because it failed to schedule (host=NULL).

Given an example instance stuck in BUILD:

> select * from nova_api.instance_mappings where instance_uuid="428012e6-1ada-4848-824b-d2f33d54a834"\G
*************************** 1. row ***************************
       created_at: 2022-07-05 09:37:10
       updated_at: NULL
               id: 50883
    instance_uuid: 428012e6-1ada-4848-824b-d2f33d54a834
          cell_id: NULL
       project_id: 6a1316f4ee914898b213865ebdc789c9
queued_for_delete: 0
          user_id: d74e2d5a0cf54ad3a22337f9aac03697
1 row in set (0.000 sec)

> select id,uuid,host,deleted from nova_cell0.instances where uuid="428012e6-1ada-4848-824b-d2f33d54a834"\G
*************************** 1. row ***************************
     id: 1206
   uuid: 428012e6-1ada-4848-824b-d2f33d54a834
   host: NULL
deleted: 0
1 row in set (0.005 sec)

and cell_mappings:

> select id,name from nova_api.cell_mappings\G
*************************** 1. row ***************************
                 id: 2
               name: cell0

*************************** 2. row ***************************
                 id: 5
               name: default

2 rows in set (0.013 sec)

Update the nova_api.instance_mappings cell_id to point at the cell the instance record is in:

  update nova_api.instance_mappings set cell_id=2 where instance_uuid="428012e6-1ada-4848-824b-d2f33d54a834";

Finally, delete the instance normally using the nova API:

  openstack server delete 428012e6-1ada-4848-824b-d2f33d54a834
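
If the delete succeeds, the instance row in the cell database should end up soft-deleted (assuming nova's usual soft-delete behavior, where the deleted column becomes non-zero). A quick sanity check:

  > select id,uuid,deleted from nova_cell0.instances where uuid="428012e6-1ada-4848-824b-d2f33d54a834"\G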
