Server appears in "openstack server list" but "openstack server (show|delete|etc)" insists it doesn't exist
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack Compute (nova) | Confirmed | Medium | Unassigned |
Bug Description
We found a server that shows up in the output of "openstack server list":
$ openstack server list | grep 705e3bb6-
| 705e3bb6-
But we were otherwise unable to access it:
$ openstack server show 705e3bb6-
No server with a name or ID of '705e3bb6-
Examining the nova database, there is an entry in the instances table:
MariaDB [nova_api]> select id, uuid, vm_state, task_state from nova.instances where uuid = "705e3bb6-
+----
| id | uuid | vm_state | task_state |
+----
| 198335 | 705e3bb6-
+----
1 row in set (0.00 sec)
There is also an entry in the nova_api instance_mappings table:
MariaDB [nova_api]> select * from instance_mappings where instance_
+----
| created_at | updated_at | id | instance_uuid | cell_id | project_id |
+----
| 2020-02-22 08:01:53 | NULL | 211514 | 705e3bb6-
+----
Updating the entry so that cell_id is non-null allowed things to work as expected:
MariaDB [nova_api]> update instance_mappings set cell_id=5 where id=211514;
And now:
$ openstack --os-cloud kaizen-admin server show 705e3bb6-
+----
| Field | Value |
+----
| id | 705e3bb6-
| status | ERROR |
+----
This is a Queens environment. We are currently running nova 17.0.13, but this issue probably cropped up before we updated to that version.
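To check whether other servers are in the same state, one option is to look for instance_mappings rows with a NULL cell_id whose uuid also has a row in the instances table. The following is only a sketch: it uses an in-memory SQLite database with made-up rows so that it runs anywhere, the two tables are cut down to the columns needed, and the function names are hypothetical. In a real deployment the two tables live in separate databases (nova_api and the cell database). A NULL cell_id is also expected for a short time while a server is still being scheduled, so a hit is a candidate to inspect, not proof of this bug.

```python
import sqlite3


def find_unmapped_instances(conn):
    """Return UUIDs that have an instance row but a mapping without a cell."""
    rows = conn.execute(
        "SELECT m.instance_uuid "
        "FROM instance_mappings AS m "
        "JOIN instances AS i ON i.uuid = m.instance_uuid "
        "WHERE m.cell_id IS NULL "
        "ORDER BY m.instance_uuid"
    ).fetchall()
    return [row[0] for row in rows]


def build_demo_db():
    """Build a tiny stand-in schema; all rows here are made up."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE instances (uuid TEXT PRIMARY KEY, vm_state TEXT)")
    conn.execute(
        "CREATE TABLE instance_mappings "
        "(instance_uuid TEXT PRIMARY KEY, cell_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO instances VALUES (?, ?)",
        [("uuid-orphan", "error"), ("uuid-ok", "active")],
    )
    conn.executemany(
        "INSERT INTO instance_mappings VALUES (?, ?)",
        [
            ("uuid-orphan", None),    # instance row exists, mapping has no cell
            ("uuid-ok", 5),           # healthy mapping
            ("uuid-building", None),  # no instance row yet, so not reported
        ],
    )
    return conn


if __name__ == "__main__":
    print(find_unmapped_instances(build_demo_db()))
```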
This issue can occur if the attempt to update the instance mapping with a cell_id fails due to a DBError.
There are three places we update the instance mapping with a cell.
* Putting an instance in cell0 due to a failure to schedule [1]
* Successful schedule to a cell at the first schedule [2]
* While cleaning up build artifacts when an instance is deleted while in the middle of building [3]
To fix this bug, we need to figure out what we should do if an attempt to update the instance mapping record fails.
Some ideas:
* delete the instance record to prevent orphaning it. Note that the delete can also fail if it hits a DBError too. Can we fill in instance fault information in the build request? How will the user know what happened to their instance?
* retry instance mapping cell_id update. How many times?
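As a rough illustration of the retry idea, here is a minimal sketch of a bounded retry around the mapping update. It is not nova code: DBError here is a local stand-in exception, update_with_retry and FlakyMapping are hypothetical names, and the default of three attempts is an arbitrary choice, since the right number is exactly the open question above.

```python
import time


class DBError(Exception):
    """Local stand-in for the database error mentioned in this report."""


def update_with_retry(update_fn, max_attempts=3, delay=0.0):
    """Call update_fn until it succeeds or max_attempts is reached.

    Re-raises the last DBError when every attempt fails, so the caller
    still has to decide what to do with the instance in that case.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return update_fn()
        except DBError as exc:
            last_error = exc
            if attempt < max_attempts:
                time.sleep(delay)  # pause before the next attempt
    raise last_error


class FlakyMapping:
    """Hypothetical mapping whose save fails a fixed number of times."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0
        self.cell_id = None

    def save(self, cell_id):
        self.calls += 1
        if self.calls <= self.failures:
            raise DBError("simulated failure")
        self.cell_id = cell_id
        return self.cell_id


if __name__ == "__main__":
    mapping = FlakyMapping(failures=2)
    update_with_retry(lambda: mapping.save(5))
    print(mapping.cell_id, mapping.calls)
```

A retry only narrows the window. If every attempt fails, the mapping is still left without a cell_id, so the first idea (cleaning up, or recording a fault the user can see) would still be needed as a fallback.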
[1] https://github.com/openstack/nova/blob/7a71408a79dc81f344ee6c7760fa881afb935dfc/nova/conductor/manager.py#L1424
[2] https://github.com/openstack/nova/blob/7a71408a79dc81f344ee6c7760fa881afb935dfc/nova/conductor/manager.py#L1686-L1713
[3] https://github.com/openstack/nova/blob/7a71408a79dc81f344ee6c7760fa881afb935dfc/nova/conductor/manager.py#L1757