Instances end up with no cell assigned in instance_mappings

Bug #1784074 reported by Mohammed Naser on 2018-07-27
This bug affects 2 people
Affects                   Importance  Assigned to
OpenStack Compute (nova)  Medium      Unassigned
Pike                      Medium      Unassigned
Queens                    Medium      Unassigned

Bug Description

There have been situations where, due to an unrelated issue such as an RPC or DB problem, the nova_api instance_mappings table ends up with instances whose cell_id is NULL. This can cause annoying and weird behaviour such as undeletable instances.

This seems to happen only while these external infrastructure components are having issues. I have come up with the following script, which loops over all unmapped instances, checks which cell each one lives in, and prints a MySQL query to run to fix the mapping.

This would be nice to have as a nova-manage cell_v2 command in case other users run into this. Unfortunately I'm short on time and can't nova-ify it, but here it is:

========================================================================
#!/usr/bin/env python

import urlparse

import pymysql

# Connect to databases
api_conn = pymysql.connect(host='xxxx', port=3306, user='nova_api', passwd='xxxxxxx', db='nova_api')
api_cur = api_conn.cursor()

def _get_conn(db):
  parsed_url = urlparse.urlparse(db)
  conn = pymysql.connect(host=parsed_url.hostname, user=parsed_url.username, passwd=parsed_url.password, db=parsed_url.path[1:])
  return conn.cursor()

# Get list of all cells
api_cur.execute("SELECT uuid, name, database_connection FROM cell_mappings")
CELLS = [{'uuid': uuid, 'name': name, 'db': _get_conn(db)} for uuid, name, db in api_cur.fetchall()]

# Get list of all unmapped instances
api_cur.execute("SELECT instance_uuid FROM instance_mappings WHERE cell_id IS NULL")
print "Number of unmapped instances: %s" % api_cur.rowcount

# Go over all unmapped instances
for (instance_uuid,) in api_cur.fetchall():
  instance_cell = None

  # Check which cell contains this instance
  for cell in CELLS:
    cell['db'].execute("SELECT id FROM instances WHERE uuid = %s", (instance_uuid,))

    if cell['db'].rowcount != 0:
      instance_cell = cell
      break

  # Update to the correct cell
  if instance_cell:
    print "UPDATE instance_mappings SET cell_id = '%s' WHERE instance_uuid = '%s'" % (instance_cell['uuid'], instance_uuid)
    continue

  # If we reach this point, it's not in any cell?!
  print "%s: not found in any cell" % (instance_uuid)
========================================================================

Matt Riedemann (mriedem) wrote :

I think we might be hitting this:

https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/manager.py#L1243

That is the case where the build request was deleted by the user (the user deleted the instance) before we created the instance in a cell. In that case they shouldn't be able to list it later either, so we don't bother updating the instance mapping: the instance neither exists as a build request nor was created in a cell.

I'm not sure why we don't just update the instance mapping as soon as we create the instance in a cell:

https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/manager.py#L1257

Because in the normal flow, we don't update the instance mapping until much later:

https://github.com/openstack/nova/blob/6be7f7248fb1c2bbb890a0a48a424e205e173c9c/nova/conductor/manager.py#L1322

And if anything fails between those times, the instance will exist in a cell but the instance mapping won't point at it, so you can't do things on the instance, but you can list it (because the list routine doesn't go through instance_mappings, it just iterates cells). Furthermore, the user could delete the instance in this case, but what they'd really be deleting is the build request, and since the instance mapping doesn't point to the cell, we won't know which cell to look in to find and delete the instance.
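
The failure window described above can be illustrated with a toy model. This is only a sketch: the dictionaries, `build_instance`, and `list_instances` are hypothetical stand-ins rather than nova code, and the `fail_before_mapping_update` flag simulates an RPC or DB error between the two steps.

```python
# Toy model of the ordering problem; none of these names exist in nova.
instance_mappings = {}         # instance_uuid -> cell id, or None if unmapped
cells = {1: set(), 2: set()}   # cell id -> uuids of instances stored in that cell


def build_instance(instance_uuid, cell_id, fail_before_mapping_update=False):
    """Create the instance in a cell, then point the mapping at that cell."""
    instance_mappings[instance_uuid] = None      # mapping starts with no cell
    cells[cell_id].add(instance_uuid)            # instance record created in the cell
    if fail_before_mapping_update:
        raise RuntimeError("simulated RPC/DB failure")
    instance_mappings[instance_uuid] = cell_id   # mapping only updated later


def list_instances():
    """Listing iterates the cells directly, so it ignores instance_mappings."""
    return sorted(uuid for members in cells.values() for uuid in members)


build_instance("inst-ok", 1)
try:
    build_instance("inst-broken", 2, fail_before_mapping_update=True)
except RuntimeError:
    pass

# Instances whose mapping never received a cell.
orphans = [uuid for uuid, cell in instance_mappings.items() if cell is None]
```

In this model `inst-broken` still shows up in `list_instances()` even though its mapping has no cell, which matches the "listable but not actionable" symptom.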

tags: added: cells
Mohammed Naser (mnaser) wrote :

I have attached an updated version of the script. It now checks whether a build request exists, which should make it a tad more user friendly, and it fixes cell_id (the first version printed the cell UUID instead of the cell id).

================================================================================
#!/usr/bin/env python

import urlparse

import pymysql

# Connect to databases
api_conn = pymysql.connect(host='xxxxxx', port=3306, user='nova_api', passwd='xxxx', db='nova_api')
api_cur = api_conn.cursor()

def _get_conn(db):
  parsed_url = urlparse.urlparse(db)
  conn = pymysql.connect(host=parsed_url.hostname, user=parsed_url.username, passwd=parsed_url.password, db=parsed_url.path[1:])
  return conn.cursor()

# Get list of all cells
api_cur.execute("SELECT id, name, database_connection FROM cell_mappings")
CELLS = [{'id': id, 'name': name, 'db': _get_conn(db)} for id, name, db in api_cur.fetchall()]

# Get list of all unmapped instances
api_cur.execute("SELECT instance_uuid FROM instance_mappings WHERE cell_id IS NULL")
print "Number of unmapped instances: %s" % api_cur.rowcount

# Go over all unmapped instances
unmapped_instances = api_cur.fetchall()
for (instance_uuid,) in unmapped_instances:
  instance_cell = None

  # Check if a build request exists, if so, skip.
  api_cur.execute("SELECT id FROM build_requests WHERE instance_uuid = %s", (instance_uuid,))
  if api_cur.rowcount != 0:
    print "%s: build request exists, skipping" % (instance_uuid,)
    continue

  # Check which cell contains this instance
  for cell in CELLS:
    cell['db'].execute("SELECT id FROM instances WHERE uuid = %s", (instance_uuid,))

    if cell['db'].rowcount != 0:
      instance_cell = cell
      break

  # Update to the correct cell
  if instance_cell:
    print "UPDATE instance_mappings SET cell_id = '%s' WHERE instance_uuid = '%s';" % (instance_cell['id'], instance_uuid)
    continue

  # If we reach this point, it's not in any cell?!
  print "%s: not found in any cell" % (instance_uuid)
================================================================================
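
For experimenting with the decision logic without touching a real deployment, the same three outcomes (fixable, skipped because a build request exists, not found in any cell) can be sketched in Python 3 against in-memory SQLite databases. Everything here is an assumption made for illustration: `find_fixes` is a hypothetical helper, the table definitions are cut down to the columns the script reads, and SQLite uses `?` placeholders where pymysql uses `%s`.

```python
import sqlite3


def find_fixes(api_db, cell_dbs):
    """Classify instance mappings that have no cell.

    api_db is a connection to the API database; cell_dbs maps a cell id to a
    connection to that cell's database. Returns (updates, skipped, missing).
    """
    updates, skipped, missing = [], [], []
    rows = api_db.execute(
        "SELECT instance_uuid FROM instance_mappings WHERE cell_id IS NULL"
    ).fetchall()
    for (instance_uuid,) in rows:
        # A build request means the instance has not reached a cell yet.
        if api_db.execute(
            "SELECT id FROM build_requests WHERE instance_uuid = ?",
            (instance_uuid,),
        ).fetchone():
            skipped.append(instance_uuid)
            continue
        # Look for the instance record in each cell database.
        for cell_id, cell_db in cell_dbs.items():
            if cell_db.execute(
                "SELECT id FROM instances WHERE uuid = ?", (instance_uuid,)
            ).fetchone():
                updates.append((cell_id, instance_uuid))
                break
        else:
            missing.append(instance_uuid)
    return updates, skipped, missing


# Tiny fixture: 'a' is fixable, 'b' has a build request, 'c' is in no cell,
# and 'd' is already mapped.
api = sqlite3.connect(":memory:")
api.executescript("""
    CREATE TABLE instance_mappings (instance_uuid TEXT, cell_id INTEGER);
    CREATE TABLE build_requests (id INTEGER PRIMARY KEY, instance_uuid TEXT);
    INSERT INTO instance_mappings VALUES
        ('a', NULL), ('b', NULL), ('c', NULL), ('d', 1);
    INSERT INTO build_requests (instance_uuid) VALUES ('b');
""")
cell1 = sqlite3.connect(":memory:")
cell1.executescript("""
    CREATE TABLE instances (id INTEGER PRIMARY KEY, uuid TEXT);
    INSERT INTO instances (uuid) VALUES ('a'), ('d');
""")

updates, skipped, missing = find_fixes(api, {1: cell1})
```

Returning the three lists instead of printing SQL keeps the sketch easy to check; a real tool would still need to decide what to do with the `missing` entries.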

Fix proposed to branch: master
Review: https://review.openstack.org/586713

Changed in nova:
assignee: nobody → Matt Riedemann (mriedem)
status: New → In Progress
Matt Riedemann (mriedem) wrote :

Thanks for the script, I think it would be useful to have in nova-manage as a kind of cleanup/reporting tool in case of "disaster recovery" situations.

Changed in nova:
importance: Undecided → Medium
Mohammed Naser (mnaser) wrote :

I have collected the following stack traces, each of which can leave us in a weird state:

========================================================
2018-06-24 07:02:42.747 10473 ERROR nova.api.openstack.wsgi File "/openstack/venvs/nova-17.0.3/lib/python2.7/site-packages/nova/compute/api.py", line 1867, in _delete
2018-06-24 07:02:42.747 10473 ERROR nova.api.openstack.wsgi instance.save()
--
2018-06-27 14:41:01.211 10474 ERROR nova.api.openstack.wsgi File "/openstack/venvs/nova-17.0.3/lib/python2.7/site-packages/nova/compute/api.py", line 2113, in _do_delete
2018-06-27 14:41:01.211 10474 ERROR nova.api.openstack.wsgi instance.save()
--
2018-07-09 11:45:37.280 10468 ERROR nova.api.openstack.wsgi File "/openstack/venvs/nova-17.0.3/lib/python2.7/site-packages/nova/db/sqlalchemy/api.py", line 5738, in action_start
2018-07-09 11:45:37.280 10468 ERROR nova.api.openstack.wsgi action_ref.save(context.session)
--
2018-07-13 12:42:11.250 10466 ERROR nova.api.openstack.wsgi File "/openstack/venvs/nova-17.0.3/lib/python2.7/site-packages/nova/compute/api.py", line 937, in _provision_instances
2018-07-13 12:42:11.250 10466 ERROR nova.api.openstack.wsgi inst_mapping.create()
--
2018-07-20 09:41:34.312 10473 ERROR nova.api.openstack.wsgi File "/openstack/venvs/nova-17.0.3/lib/python2.7/site-packages/nova/compute/api.py", line 925, in _provision_instances
2018-07-20 09:41:34.312 10473 ERROR nova.api.openstack.wsgi build_request.create()
--
2018-07-20 09:45:29.905 10465 ERROR nova.api.openstack.wsgi File "/openstack/venvs/nova-17.0.3/lib/python2.7/site-packages/nova/objects/instance_mapping.py", line 80, in _create_in_db
2018-07-20 09:45:29.905 10465 ERROR nova.api.openstack.wsgi db_mapping.save(context.session)
--
2018-07-20 11:08:13.265 10479 ERROR nova.api.openstack.wsgi File "/openstack/venvs/nova-17.0.3/lib/python2.7/site-packages/nova/compute/api.py", line 901, in _provision_instances
2018-07-20 11:08:13.265 10479 ERROR nova.api.openstack.wsgi req_spec.create()
--
2018-07-24 08:42:27.400 10468 ERROR nova.api.openstack.wsgi File "/openstack/venvs/nova-17.0.3/lib/python2.7/site-packages/nova/objects/build_request.py", line 184, in _create_in_db
2018-07-24 08:42:27.400 10468 ERROR nova.api.openstack.wsgi db_req.save(context.session)
========================================================

Surya Seetharaman (tssurya) wrote :

Thanks for the script, mnaser! We also run into this situation often. It would be good to have this script in nova.

Matt Riedemann (mriedem) wrote :

Bug 1784093 opened for the things in comment 5.

Matt Riedemann (mriedem) wrote :

https://review.opendev.org/655908 turns the script from mnaser into a nova-manage command.

Related fix proposed to branch: master
Review: https://review.opendev.org/666438

Change abandoned by Matt Riedemann (<email address hidden>) on branch: master
Review: https://review.opendev.org/586713
Reason: I'm not actively working on this and there are changes to be made (and tested in detail) so I'm going to abandon. Someone else can pick up and run with the ideas in here if they feel the need.

Matt Riedemann (mriedem) on 2019-06-19
Changed in nova:
status: In Progress → Confirmed
assignee: Matt Riedemann (mriedem) → nobody

Reviewed: https://review.opendev.org/666438
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=1ff029c1c3792f53865c6bdb3dce8d2c51b73ca7
Submitter: Zuul
Branch: master

commit 1ff029c1c3792f53865c6bdb3dce8d2c51b73ca7
Author: Matt Riedemann <email address hidden>
Date: Wed Jun 19 16:26:07 2019 -0400

    Delete InstanceMapping in conductor if BuildRequest is already deleted

    The BuildRequest represents a server in the API until the scheduler
    picks a host in a cell and we create the instance record in that cell
    and update the instance mapping to point at the cell. If the user
    deletes the BuildRequest before the instance record is created in a
    cell, the conductor schedule_and_build_instances method cleans up the
    resource allocations created by scheduler and then continues to the
    next instance (if it's a multi-create request). The point is the instance
    does not get created in a cell, the BuildRequest is gone, and the
    instance mapping is left pointing at no cell - effectively orphaned.

    Furthermore, the archive_deleted_rows command change
    I483701a55576c245d091ff086b32081b392f746e to cleanup instance mappings
    for archived instances will not catch and cleanup the orphan instance
    mapping because there never was an instance record to delete and archive
    (the BuildRequest was deleted before the instance record was created, and
    the BuildRequest is hard deleted so there is no archive).

    This change simply deletes the InstanceMapping record in case the
    BuildRequest is already gone by the time we finish scheduling and we
    do not create the instance record in any cell.

    Change-Id: Ia03577ae41f010b449e47ff5b69b432d74f8467b
    Related-Bug: #1784074
    Related-Bug: #1773945
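
The behaviour the commit message describes can be sketched with a toy model. This is not the code from the change: `build_requests`, `instance_mappings`, and `schedule_one` are hypothetical in-memory stand-ins, and the sketch only shows the idea of dropping the mapping when the build request is already gone.

```python
# Hypothetical in-memory stand-ins for the API database tables.
build_requests = {"inst-1"}                            # uuids with a live BuildRequest
instance_mappings = {"inst-1": None, "inst-2": None}   # uuid -> cell id


def schedule_one(instance_uuid, cell_id, cells):
    """Create the instance in a cell unless its BuildRequest is already gone."""
    if instance_uuid not in build_requests:
        # The user deleted the server before it reached a cell: drop the
        # mapping too, so nothing is left pointing at no cell.
        instance_mappings.pop(instance_uuid, None)
        return False
    cells.setdefault(cell_id, set()).add(instance_uuid)
    instance_mappings[instance_uuid] = cell_id
    build_requests.discard(instance_uuid)
    return True


cells = {}
created_first = schedule_one("inst-1", 1, cells)    # build request still exists
created_second = schedule_one("inst-2", 1, cells)   # build request already deleted
```

After both calls, `inst-2` has neither an instance record nor a mapping, so there is no orphan left for a cleanup script to find.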
