error instances remain in BUILD status and cannot be deleted

Bug #1693438 reported by Masahito Muroi
This bug affects 4 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Critical
Assigned to: Matt Riedemann

Bug Description

Description
===========

When the create server API fails because no valid host is found, the instance remains in BUILD status with the "scheduling" task state. Additionally, users can't delete the instance via the delete server API.

Steps to reproduce
==================

1. Create devstack with the default configuration.
2. Boot instances until the nova scheduler reports "No valid host was found".
3. Check the failed instance's status: it remains "BUILD" and its task state remains "scheduling".
4. Check nova-conductor's log: it contains the error shown in "Logs & Configs" below.
5. Try to delete the failed instance via the delete server API: it cannot be deleted.

Expected result
===============

The instance should go to ERROR status with no task state because of "No valid host was found", and it should then be deletable via the delete server API.

Environment
===========

- git log
Merge: bedcf29 3838d5e
Author: Jenkins <email address hidden>
Date: Thu May 25 01:15:41 2017 +0000

    Merge "Handle uuid in HostAPI._find_service"

- hypervisor: KVM

Logs & Configs
==============

- nova-conductor's log:

nova-conductor[28120]: NoValidHost: No valid host was found. There are not enough hosts available.
nova-conductor[28120]:
nova-conductor[28120]: WARNING nova.scheduler.utils [req-7969ec7f-795d-420d-b847-b5c3c6bc8489 admin admin] [instance: e5e9cfe9-49ec-40a4-b763-d8bda68d5e56] Setting instance to ERROR state.
nova-conductor[28120]: ERROR oslo_messaging.rpc.server [req-7969ec7f-795d-420d-b847-b5c3c6bc8489 admin admin] Exception during message handling
nova-conductor[28120]: ERROR oslo_messaging.rpc.server Traceback (most recent call last):
nova-conductor[28120]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/server.py", line 157, in _process_incoming
nova-conductor[28120]: ERROR oslo_messaging.rpc.server res = self.dispatcher.dispatch(message)
nova-conductor[28120]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 213, in dispatch
nova-conductor[28120]: ERROR oslo_messaging.rpc.server return self._do_dispatch(endpoint, method, ctxt, args)
nova-conductor[28120]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/dispatcher.py", line 183, in _do_dispatch
nova-conductor[28120]: ERROR oslo_messaging.rpc.server result = func(ctxt, **new_args)
nova-conductor[28120]: ERROR oslo_messaging.rpc.server build_requests=build_requests)
nova-conductor[28120]: ERROR oslo_messaging.rpc.server File "/devstack/devstack_data/nova/nova/conductor/manager.py", line 893, in _bury_in_cell0
nova-conductor[28120]: ERROR oslo_messaging.rpc.server exc, legacy_spec)
nova-conductor[28120]: ERROR oslo_messaging.rpc.server File "/devstack/devstack_data/nova/nova/conductor/manager.py", line 355, in _set_vm_state_and_notify
nova-conductor[28120]: ERROR oslo_messaging.rpc.server ex, request_spec)
nova-conductor[28120]: ERROR oslo_messaging.rpc.server File "/devstack/devstack_data/nova/nova/scheduler/utils.py", line 104, in set_vm_state_and_notify
nova-conductor[28120]: ERROR oslo_messaging.rpc.server instance.save()
nova-conductor[28120]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 226, in wrapper
nova-conductor[28120]: ERROR oslo_messaging.rpc.server return fn(self, *args, **kwargs)
nova-conductor[28120]: ERROR oslo_messaging.rpc.server File "/devstack/devstack_data/nova/nova/objects/instance.py", line 781, in save
nova-conductor[28120]: ERROR oslo_messaging.rpc.server columns_to_join=_expected_cols(expected_attrs))
nova-conductor[28120]: ERROR oslo_messaging.rpc.server File "/devstack/devstack_data/nova/nova/db/api.py", line 860, in instance_update_and_get_original
nova-conductor[28120]: ERROR oslo_messaging.rpc.server expected=expected)
nova-conductor[28120]: ERROR oslo_messaging.rpc.server File "/devstack/devstack_data/nova/nova/db/sqlalchemy/api.py", line 180, in wrapper
nova-conductor[28120]: ERROR oslo_messaging.rpc.server return f(*args, **kwargs)
nova-conductor[28120]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_db/api.py", line 150, in wrapper
nova-conductor[28120]: ERROR oslo_messaging.rpc.server ectxt.value = e.inner_exc
nova-conductor[28120]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
nova-conductor[28120]: ERROR oslo_messaging.rpc.server self.force_reraise()
nova-conductor[28120]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
nova-conductor[28120]: ERROR oslo_messaging.rpc.server six.reraise(self.type_, self.value, self.tb)
nova-conductor[28120]: ERROR oslo_messaging.rpc.server File "/usr/local/lib/python2.7/dist-packages/oslo_db/api.py", line 138, in wrapper
nova-conductor[28120]: ERROR oslo_messaging.rpc.server return f(*args, **kwargs)
nova-conductor[28120]: ERROR oslo_messaging.rpc.server File "/devstack/devstack_data/nova/nova/db/sqlalchemy/api.py", line 251, in wrapped
nova-conductor[28120]: ERROR oslo_messaging.rpc.server return f(context, *args, **kwargs)
nova-conductor[28120]: ERROR oslo_messaging.rpc.server File "/devstack/devstack_data/nova/nova/db/sqlalchemy/api.py", line 2673, in instance_update_and_get_original
nova-conductor[28120]: ERROR oslo_messaging.rpc.server columns_to_join=columns_to_join)
nova-conductor[28120]: ERROR oslo_messaging.rpc.server File "/devstack/devstack_data/nova/nova/db/sqlalchemy/api.py", line 1929, in _instance_get_by_uuid
nova-conductor[28120]: ERROR oslo_messaging.rpc.server raise exception.InstanceNotFound(instance_id=uuid)
nova-conductor[28120]: ERROR oslo_messaging.rpc.server InstanceNotFound: Instance e5e9cfe9-49ec-40a4-b763-d8bda68d5e56 could not be found.

- inspections for the bug:
    * the nova_api database has a row for the failed instance in instance_mappings, but its cell_id is NULL

mysql> select instance_uuid, cell_id from instance_mappings;
+--------------------------------------+---------+
| instance_uuid                        | cell_id |
+--------------------------------------+---------+
| 86be67fc-86db-4153-b142-2cdd77874e9a |       2 |
| db4f56bc-5069-4f0a-9fdc-f23e7fdb613d |       2 |
| 072e6db4-9ccd-4da3-935e-15e16f300cc7 |       2 |
| f4d577a7-4c3f-4c03-a875-6d3c1d3441e4 |    NULL |
| 9720785e-210e-4607-a16c-58250d092efc |    NULL |
| 8ced7975-b234-463e-86aa-bd30632812f7 |    NULL |
| 5b675dff-d588-491b-8d13-9f0d95aec131 |       2 |
| 3785265d-b05e-47f6-9ff5-5f2300cb9a1f |    NULL |
| 50dafdb8-dbd4-4d09-a5b4-cb604c1c28e9 |       2 |
| 901d03fb-a079-4707-a50c-8a9ea12f3347 |    NULL |
| b3ddeeb4-d5a5-4fd6-b847-b5bb49d4f984 |    NULL |
| 78439d4c-65cd-432b-9d43-835d583529b7 |       2 |
| e5e9cfe9-49ec-40a4-b763-d8bda68d5e56 |    NULL |
+--------------------------------------+---------+

mysql> select id,uuid,name from cell_mappings;
+----+--------------------------------------+-------+
| id | uuid                                 | name  |
+----+--------------------------------------+-------+
|  1 | 00000000-0000-0000-0000-000000000000 | cell0 |
|  2 | 60d185ff-8de0-4b9a-8832-494d71c3f895 | cell1 |
+----+--------------------------------------+-------+

    * all rows in nova_cell0's instances table remain in the "building" vm_state and "scheduling" task state

mysql> select id, uuid, vm_state, task_state from instances;
+----+--------------------------------------+----------+------------+
| id | uuid                                 | vm_state | task_state |
+----+--------------------------------------+----------+------------+
|  1 | 8ced7975-b234-463e-86aa-bd30632812f7 | building | scheduling |
|  2 | 3785265d-b05e-47f6-9ff5-5f2300cb9a1f | building | scheduling |
|  3 | 901d03fb-a079-4707-a50c-8a9ea12f3347 | building | scheduling |
|  4 | b3ddeeb4-d5a5-4fd6-b847-b5bb49d4f984 | building | scheduling |
|  5 | e5e9cfe9-49ec-40a4-b763-d8bda68d5e56 | building | scheduling |
+----+--------------------------------------+----------+------------+
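
Taken together, the two inspections explain why the delete API fails: the failed instance's row lives only in the nova_cell0 database, but its instance_mappings entry has cell_id NULL, so the API has no cell to target when it looks the instance up. A minimal sketch of that lookup path, using plain dicts mirroring the tables above (illustrative only, not nova's actual code):

```python
# Toy model of the API-level instance lookup; names are illustrative.
cell_mappings = {1: 'cell0', 2: 'cell1'}

# instance_mappings as observed above: the failed instance has cell_id NULL.
instance_mappings = {
    '86be67fc-86db-4153-b142-2cdd77874e9a': 2,
    'e5e9cfe9-49ec-40a4-b763-d8bda68d5e56': None,
}

def locate_instance(uuid):
    """Return the cell name holding the instance, or None if unmapped."""
    cell_id = instance_mappings.get(uuid)
    if cell_id is None:
        # No cell to target: lookups (and therefore delete) dead-end here.
        return None
    return cell_mappings[cell_id]

print(locate_instance('86be67fc-86db-4153-b142-2cdd77874e9a'))  # cell1
print(locate_instance('e5e9cfe9-49ec-40a4-b763-d8bda68d5e56'))  # None
```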

Tags: cellsv2
Changed in nova:
assignee: nobody → Tetsuro Nakamura (tetsuro0907)
Changed in nova:
assignee: Tetsuro Nakamura (tetsuro0907) → Matt Riedemann (mriedem)
status: New → In Progress
Revision history for this message
Matt Riedemann (mriedem) wrote :

Trove CI is failing for the same reason:

http://logs.openstack.org/47/467447/14/check/gate-trove-scenario-dsvm-mariadb-multi-ubuntu-xenial-nv/6efec50/logs/screen-n-cond.txt.gz#_May_26_03_24_31_157974

Sorry to take this bug over, but I noticed the bug report after I pushed the patch.

Changed in nova:
importance: Undecided → Critical
tags: added: cellsv2
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/468401
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=5fac17ae960d36eeb7a642725a37f82e8ca95ec1
Submitter: Jenkins
Branch: master

commit 5fac17ae960d36eeb7a642725a37f82e8ca95ec1
Author: Matt Riedemann <email address hidden>
Date: Fri May 26 08:27:07 2017 -0400

    Use targeted context when burying instances in cell0

    After Iccdf6b80f5fc8adcc8a89ce6ece3f37b6cbcaee2 we need to
    use the yielded context which is targeted to the cell when
    we do DB operations, which in this case is creating the
    instance in the cell0 database and then updating its status.

    There is another place in here where this was missed, which is
    when we're trying to delete a build request which was already
    deleted.

    Closes-Bug: #1693438

    Change-Id: I142f97d691fa55e9824714c9c224f998ad72337e
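
The fix described in the commit message is to use the context *yielded* by target_cell for the cell0 database work, rather than the outer context. A self-contained sketch of that pattern, with dicts standing in for the per-cell databases (Context and target_cell here are simplified stand-ins, not nova.context's real signatures):

```python
from contextlib import contextmanager

# Hypothetical stand-in for nova's RequestContext: it carries a
# reference to the database it is targeted at.
class Context:
    def __init__(self, db):
        self.db = db

cell0_db = {}   # where failed instances get buried
cell1_db = {}   # a normal compute cell

@contextmanager
def target_cell(ctxt, cell_db):
    # Like nova.context.target_cell: yields a context whose DB
    # connection points at the given cell. Callers must do DB work
    # through the *yielded* context inside the block.
    yield Context(cell_db)

uuid = 'e5e9cfe9-49ec-40a4-b763-d8bda68d5e56'
ctxt = Context(cell1_db)  # outer context is not targeted at cell0

with target_cell(ctxt, cell0_db) as cctxt:
    # Create the instance record in cell0 via the targeted context.
    cctxt.db[uuid] = {'vm_state': 'building', 'task_state': 'scheduling'}

    # The bug: updating via the outer ctxt looks in the wrong database,
    # which in nova surfaced as InstanceNotFound in instance.save().
    assert uuid not in ctxt.db

    # The fix: update through the yielded, cell-targeted context.
    cctxt.db[uuid].update({'vm_state': 'error', 'task_state': None})
```

With the targeted context used throughout, the buried instance ends up in cell0 with vm_state "error", which is exactly the expected result from the bug description.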

Changed in nova:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/nova 16.0.0.0b2

This issue was fixed in the openstack/nova 16.0.0.0b2 development milestone.
