Instances end up with no cell assigned in instance_mappings
| Affects | Status | Importance | Assigned to | Milestone |
|---|---|---|---|---|
| OpenStack Compute (nova) | | Medium | Unassigned | |
| Pike | | Medium | Unassigned | |
| Queens | | Medium | Unassigned | |
Bug Description
There have been situations where, due to an unrelated issue such as an RPC or DB problem, the nova_api instance_mappings table can end up with instances whose cell_id is NULL, which causes annoying and weird behaviour such as undeletable instances.
This only seems to happen while those external infrastructure components are having issues. I have come up with the following script, which loops over all cells, checks where each unmapped instance actually lives, and prints a MySQL query to run to fix it.
This would be nice to have as a nova-manage cell_v2 command to help other users who run into this. Unfortunately I'm a bit short on time, so I haven't been able to nova-ify it, but here it is:
=======
#!/usr/bin/env python
import urlparse

import pymysql

# Connect to the nova_api database (the connection settings here are
# placeholders; adjust them for your environment).
api_conn = pymysql.connect(host='localhost', user='nova_api',
                           password='secret', db='nova_api')
api_cur = api_conn.cursor()

def _get_conn(db):
    # Open a connection to a cell database from its database_connection URL.
    parsed_url = urlparse.urlparse(db)
    conn = pymysql.connect(host=parsed_url.hostname,
                           user=parsed_url.username,
                           password=parsed_url.password,
                           db=parsed_url.path.lstrip('/'))
    return conn.cursor()

# Get list of all cells
api_cur.execute("SELECT uuid, name, database_connection FROM cell_mappings")
CELLS = [{'uuid': uuid, 'name': name, 'db': _get_conn(db)}
         for uuid, name, db in api_cur.fetchall()]

# Get list of all unmapped instances
api_cur.execute(
    "SELECT instance_uuid FROM instance_mappings WHERE cell_id IS NULL")
print "Number of unmapped instances: %s" % api_cur.rowcount

# Go over all unmapped instances
for (instance_uuid,) in api_cur.fetchall():
    instance_cell = None

    # Check which cell contains this instance
    for cell in CELLS:
        cell['db'].execute(
            "SELECT uuid FROM instances WHERE uuid = %s", (instance_uuid,))
        if cell['db'].rowcount != 0:
            instance_cell = cell
            break

    # Update to the correct cell
    if instance_cell:
        print "UPDATE instance_mappings SET cell_id = '%s' WHERE instance_uuid = '%s';" % (instance_cell['uuid'], instance_uuid)
        continue

    # If we reach this point, it's not in any cell?!
    print "%s: not found in any cell" % (instance_uuid,)
=======
Mohammed Naser (mnaser) wrote (#2):
I added an updated version of the script that checks whether a build request exists, which should make it a tad more user-friendly, and fixes the cell_id value.
=======
#!/usr/bin/env python
import urlparse

import pymysql

# Connect to the nova_api database (the connection settings here are
# placeholders; adjust them for your environment).
api_conn = pymysql.connect(host='localhost', user='nova_api',
                           password='secret', db='nova_api')
api_cur = api_conn.cursor()

def _get_conn(db):
    # Open a connection to a cell database from its database_connection URL.
    parsed_url = urlparse.urlparse(db)
    conn = pymysql.connect(host=parsed_url.hostname,
                           user=parsed_url.username,
                           password=parsed_url.password,
                           db=parsed_url.path.lstrip('/'))
    return conn.cursor()

# Get list of all cells
api_cur.execute("SELECT id, name, database_connection FROM cell_mappings")
CELLS = [{'id': id, 'name': name, 'db': _get_conn(db)}
         for id, name, db in api_cur.fetchall()]

# Get list of all unmapped instances
api_cur.execute(
    "SELECT instance_uuid FROM instance_mappings WHERE cell_id IS NULL")
print "Number of unmapped instances: %s" % api_cur.rowcount

# Go over all unmapped instances
unmapped_instances = api_cur.fetchall()
for (instance_uuid,) in unmapped_instances:
    instance_cell = None

    # Check if a build request exists, if so, skip.
    api_cur.execute(
        "SELECT id FROM build_requests WHERE instance_uuid = %s",
        (instance_uuid,))
    if api_cur.rowcount != 0:
        print "%s: build request exists, skipping" % (instance_uuid,)
        continue

    # Check which cell contains this instance
    for cell in CELLS:
        cell['db'].execute(
            "SELECT uuid FROM instances WHERE uuid = %s", (instance_uuid,))
        if cell['db'].rowcount != 0:
            instance_cell = cell
            break

    # Update to the correct cell
    if instance_cell:
        print "UPDATE instance_mappings SET cell_id = '%s' WHERE instance_uuid = '%s';" % (instance_cell['id'], instance_uuid)
        continue

    # If we reach this point, it's not in any cell?!
    print "%s: not found in any cell" % (instance_uuid,)
=======
Fix proposed to branch: master
Review: https:/
Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: New → In Progress
Matt Riedemann (mriedem) wrote (#4):
Thanks for the script. I think it would be useful to have in nova-manage as a kind of cleanup/reporting tool for "disaster recovery" situations.
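The core fix-up logic such a nova-manage command could reuse can be sketched as a pure function. This is my own illustration, not nova code: the function and dict names are hypothetical, and the per-cell instance sets stand in for querying each cell database's instances table.

```python
def find_instance_cell(instance_uuid, cells):
    """Return the cell dict whose instance set contains instance_uuid.

    ``cells`` is a list of dicts like {'id': 1, 'name': 'cell1',
    'instances': set_of_instance_uuids} -- a stand-in for querying
    each cell database's instances table.
    """
    for cell in cells:
        if instance_uuid in cell['instances']:
            return cell
    return None


def fixup_statements(unmapped, cells):
    """Yield the UPDATE statements needed to repair NULL cell_id rows."""
    for instance_uuid in unmapped:
        cell = find_instance_cell(instance_uuid, cells)
        if cell is None:
            # Mirrors the script's "not found in any cell" case.
            yield "-- %s: not found in any cell" % instance_uuid
        else:
            yield ("UPDATE instance_mappings SET cell_id = %d "
                   "WHERE instance_uuid = '%s';"
                   % (cell['id'], instance_uuid))
```

Factoring it this way keeps the DB access at the edges, which would make the eventual nova-manage command easy to unit test.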
Changed in nova:
importance: Undecided → Medium
Mohammed Naser (mnaser) wrote (#5):
I have collected the following stack traces, all of which can leave us in this weird state:
=======
2018-06-24 07:02:42.747 10473 ERROR nova.api.
2018-06-24 07:02:42.747 10473 ERROR nova.api.
--
2018-06-27 14:41:01.211 10474 ERROR nova.api.
2018-06-27 14:41:01.211 10474 ERROR nova.api.
--
2018-07-09 11:45:37.280 10468 ERROR nova.api.
2018-07-09 11:45:37.280 10468 ERROR nova.api.
--
2018-07-13 12:42:11.250 10466 ERROR nova.api.
2018-07-13 12:42:11.250 10466 ERROR nova.api.
--
2018-07-20 09:41:34.312 10473 ERROR nova.api.
2018-07-20 09:41:34.312 10473 ERROR nova.api.
--
2018-07-20 09:45:29.905 10465 ERROR nova.api.
2018-07-20 09:45:29.905 10465 ERROR nova.api.
--
2018-07-20 11:08:13.265 10479 ERROR nova.api.
2018-07-20 11:08:13.265 10479 ERROR nova.api.
--
2018-07-24 08:42:27.400 10468 ERROR nova.api.
2018-07-24 08:42:27.400 10468 ERROR nova.api.
=======
Surya Seetharaman (tssurya) wrote (#6):
Thanks for the script, mnaser! We also run into this situation often. It would be good to get this script in.
Matt Riedemann (mriedem) wrote (#7):
Bug 1784093 opened for the things in comment 5.
Related fix proposed to branch: master
Review: https:/
Matt Riedemann (mriedem) wrote (#9):
https:/
Related fix proposed to branch: master
Review: https:/
Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https:/
Reason: I'm not actively working on this and there are changes to be made (and tested in detail) so I'm going to abandon. Someone else can pick up and run with the ideas in here if they feel the need.
Changed in nova:
status: In Progress → Confirmed
assignee: Matt Riedemann (mriedem) → nobody
Reviewed: https:/
Committed: https:/
Submitter: Zuul
Branch: master
commit 1ff029c1c3792f5
Author: Matt Riedemann <email address hidden>
Date: Wed Jun 19 16:26:07 2019 -0400
Delete InstanceMapping in conductor if BuildRequest is already deleted
The BuildRequest represents a server in the API until the scheduler
picks a host in a cell and we create the instance record in that cell
and update the instance mapping to point at the cell. If the user
deletes the BuildRequest before the instance record is created in a
cell, the conductor schedule_
resource allocations created by scheduler and then continues to the
next instance (if it's a multi-create request). The point is the instance
does not get created in a cell, the BuildRequest is gone, and the
instance mapping is left pointing at no cell - effectively orphaned.
Furthermore, the archive_
I483701a555
for archived instances will not catch and cleanup the orphan instance
mapping because there never was an instance record to delete and archive
(the BuildRequest was deleted before the instance record was created, and
the BuildRequest is hard deleted so there is no archive).
This change simply deletes the InstanceMapping record in case the
BuildRequest is already gone by the time we finish scheduling and we
do not create the instance record in any cell.
Change-Id: Ia03577ae41f010
Related-Bug: #1784074
Related-Bug: #1773945
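The gist of the conductor-side cleanup that commit describes can be modelled in a few lines. This is a hedged sketch, not nova's actual code: the function, exception, and dict stores are all my own stand-ins for the real API database tables.

```python
class BuildRequestNotFound(Exception):
    """Raised when the user deleted the server before scheduling finished."""


def finish_scheduling(instance_uuid, build_requests, instance_mappings):
    """Model of the cleanup: if the BuildRequest was deleted before the
    instance record was created in a cell, also delete the InstanceMapping
    so it is not left orphaned (pointing at no cell)."""
    if instance_uuid not in build_requests:
        # BuildRequest already gone: remove the orphan mapping too.
        instance_mappings.pop(instance_uuid, None)
        raise BuildRequestNotFound(instance_uuid)
    # Normal path: the mapping survives and later gets pointed at a cell.
    return instance_mappings[instance_uuid]
```

Without the `pop`, the mapping would linger with cell_id NULL forever, which is exactly the state the scripts above repair after the fact.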
I think we might be hitting this:
https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/manager.py#L1243
That's where the build request was deleted by the user (the user deleted the instance) before we created the instance in a cell. But that also means they shouldn't be able to list it later, which is why we don't bother updating the instance mapping for that instance: the instance no longer exists as a build request, nor was it created in a cell.
I'm not sure why we don't just update the instance mapping as soon as we create the instance in a cell:
https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/manager.py#L1257
Because in the normal flow, we don't update the instance mapping until much later:
https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/manager.py#L1322
And if anything fails between those two points, the instance will exist in a cell but the instance mapping won't point at it, so you can't operate on the instance; you can still list it, though, because the list routine doesn't go through instance_mappings, it just iterates the cells. Furthermore, the user could delete the instance in this state, but what they'd really be deleting is the build request, and since the instance mapping doesn't point at the cell, we won't know which cell holds the instance in order to delete it.
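To make that window concrete, here is a toy model (my own illustration, not nova code) of why listing works while lookup fails while the mapping's cell_id is NULL:

```python
def lookup_instance(uuid, instance_mappings, cells):
    """API-style lookup: must go through the instance mapping to find
    the cell, so a NULL cell_id means the instance is unreachable."""
    mapping = instance_mappings.get(uuid)
    if mapping is None or mapping['cell_id'] is None:
        return None  # can't find the cell -> can't show/delete/etc.
    return cells[mapping['cell_id']].get(uuid)


def list_instances(cells):
    """List-style path: just iterates every cell database directly."""
    return sorted(u for cell_db in cells.values() for u in cell_db)
```

The asymmetry between the two paths is the whole bug: the same record is visible to one code path and invisible to the other.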