vm rebuild fails but confusing state

Bug #1741817 reported by yuanyue
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: In Progress
Importance: Undecided
Assigned to: Stephen Finucane
Milestone: -

Bug Description

Description
===========

If nova-compute crashes on the host and I try to rebuild an instance on that host, the instance gets stuck in rebuilding, which is reasonable because nothing consumes the message from the MQ. However, once nova-compute on this host is started again, the instance ends up in a confusing state.

I am not sure whether this is by design or a bug.
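
For anyone reproducing this, a minimal polling sketch (python-novaclient with a keystoneauth session; the credentials, auth URL, and API version are placeholders, not values from this environment) to watch the status/task_state transitions during the steps below:

import time

from keystoneauth1 import loading, session
from novaclient import client

# Placeholder credentials/URL for illustration only.
loader = loading.get_plugin_loader('password')
auth = loader.load_from_options(auth_url='http://controller:5000/v3',
                                username='admin', password='secret',
                                project_name='admin',
                                user_domain_name='Default',
                                project_domain_name='Default')
nova = client.Client('2.1', session=session.Session(auth=auth))

server_id = '5c6c8913-a160-4784-9e0e-c3e24cf0511c'
for _ in range(30):
    server = nova.servers.get(server_id)
    # task_state and vm_state are exposed via the OS-EXT-STS extension.
    print(server.status,
          getattr(server, 'OS-EXT-STS:task_state'),
          getattr(server, 'OS-EXT-STS:vm_state'))
    time.sleep(10)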

Steps to reproduce
==================

(1) boot an instance

$nova boot --image cirros --flavor mini --nic net-id=4ccb0bc2-272f-46a2-ad32-5f5012a5a49f y
$nova list
+--------------------------------------+------+----------+--------+------------+-------------+--------------------+
| ID | Name | Host | Status | Task State | Power State | Networks |
+--------------------------------------+------+----------+--------+------------+-------------+--------------------+
| 5c6c8913-a160-4784-9e0e-c3e24cf0511c | y | compute1 | ACTIVE | - | Running | vxlan-l=100.53.0.8 |
+--------------------------------------+------+----------+--------+------------+-------------+--------------------+

(2) stop nova-compute on compute1

$nova service-list|grep compute1
| 985b4f89-92b9-4c4e-abd3-3b7d8f5f80a8 | nova-compute | compute1 | nova | enabled | down | 2018-01-08T03:17:25.000000 | - | False |

(3) rebuild the instance

$nova rebuild y cir
$nova list
+--------------------------------------+------+----------+---------+------------+-------------+--------------------+
| ID | Name | Host | Status | Task State | Power State | Networks |
+--------------------------------------+------+----------+---------+------------+-------------+--------------------+
| 5c6c8913-a160-4784-9e0e-c3e24cf0511c | y | compute1 | REBUILD | rebuilding | Running | vxlan-l=100.53.0.8 |
+--------------------------------------+------+----------+---------+------------+-------------+--------------------+

$ rabbitmqctl list_queues
Listing queues ...
...
compute.compute1 1

(4) start nova-compute on compute1

first:

$ nova list
+--------------------------------------+------+----------+--------+------------+-------------+--------------------+
| ID | Name | Host | Status | Task State | Power State | Networks |
+--------------------------------------+------+----------+--------+------------+-------------+--------------------+
| 5c6c8913-a160-4784-9e0e-c3e24cf0511c | y | compute1 | ERROR | - | Running | vxlan-l=100.53.0.8 |
+--------------------------------------+------+----------+--------+------------+-------------+--------------------+

but soon after it gets stuck in the following state:

$ nova list
+--------------------------------------+------+----------+---------+------------+-------------+--------------------+
| ID | Name | Host | Status | Task State | Power State | Networks |
+--------------------------------------+------+----------+---------+------------+-------------+--------------------+
| 5c6c8913-a160-4784-9e0e-c3e24cf0511c | y | compute1 | REBUILD | rebuilding | Running | vxlan-l=100.53.0.8 |
+--------------------------------------+------+----------+---------+------------+-------------+--------------------+

$ nova show y
+--------------------------------------+----------------------------------------------------------+
| Property | Value |
+--------------------------------------+----------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | compute1 |
| OS-EXT-SRV-ATTR:hostname | y |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute1 |
| OS-EXT-SRV-ATTR:instance_name | instance-000025db |
| OS-EXT-SRV-ATTR:kernel_id | |
| OS-EXT-SRV-ATTR:launch_index | 0 |
| OS-EXT-SRV-ATTR:ramdisk_id | |
| OS-EXT-SRV-ATTR:reservation_id | r-wo128mco |
| OS-EXT-SRV-ATTR:root_device_name | /dev/vda |
| OS-EXT-SRV-ATTR:user_data | - |
| OS-EXT-STS:power_state | 1 |
| OS-EXT-STS:task_state | rebuilding |
| OS-EXT-STS:vm_state | error |
| OS-SRV-USG:launched_at | 2018-01-08T03:17:03.000000 |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| config_drive | |
| created | 2018-01-08T03:16:39Z |
| description | - |
| flavor:disk | 1 |
| flavor:ephemeral | 0 |
| flavor:extra_specs | {} |
| flavor:original_name | mini |
| flavor:ram | 64 |
| flavor:swap | 0 |
| flavor:vcpus | 1 |
| hostId | a7837285ad98259e17a7ca2a0fdd9194c47113068489c17ec3c3c4ce |
| host_status | UP |
| id | 5c6c8913-a160-4784-9e0e-c3e24cf0511c |
| image | cir (14440220-3505-4d45-b4b5-f2a184145140) |
| key_name | - |
| locked | False |
| metadata | {} |
| name | y |
| os-extended-volumes:volumes_attached | [] |
| progress | 0 |
| security_groups | default |
| status | REBUILD |
| tags | [] |
| tenant_id | 5673331fd93740f58e9b36ebc56b7183 |
| updated | 2018-01-08T03:20:19Z |
| user_id | 59a94426c0404b5fae4f81d10bfc9b2f |
| vxlan-l network | 100.53.0.8 |
+--------------------------------------+----------------------------------------------------------+

Expected result
===============
Instance status should become ERROR, or the rebuild operation should continue and finish.

Actual result
=============
Instance status gets stuck in REBUILD and task_state gets stuck in rebuilding.
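
A possible manual workaround to get the instance out of the stuck state (a sketch only, not verified here; it uses the os-resetState admin action through the novaclient session from the sketch above, with the IDs taken from this report):

# Possible workaround (not verified here): clear the stuck task_state via the
# os-resetState admin action, then retry the rebuild. 'nova' is the client
# from the earlier sketch; the IDs below are the ones shown in this report.
server_id = '5c6c8913-a160-4784-9e0e-c3e24cf0511c'
image_id = '14440220-3505-4d45-b4b5-f2a184145140'  # the 'cir' image

nova.servers.reset_state(server_id, 'active')  # sets vm_state, clears task_state
nova.servers.rebuild(server_id, image_id)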

Environment
===========

nova-api 2:16.0.1-0ubuntu1~cloud0
nova-common 2:16.0.1-0ubuntu1~cloud0
nova-conductor 2:16.0.1-0ubuntu1~cloud0
nova-consoleauth 2:16.0.1-0ubuntu1~cloud0
nova-novncproxy 2:16.0.1-0ubuntu1~cloud0
nova-placement-api 2:16.0.1-0ubuntu1~cloud0
nova-scheduler 2:16.0.1-0ubuntu1~cloud0
python-nova 2:16.0.1-0ubuntu1~cloud0
python-novaclient 2:9.1.0-0ubuntu1~cloud0

Libvirt + KVM, CEPH, NEUTRON with linuxbridge

Logs
====
On the compute node:

(after nova-compute was started)

2018-01-08 11:20:15.685 2352 DEBUG nova.compute.manager [req-ded7ce8a-9ac4-4fb3-8f4e-a35dd5505c4e - - - - -] [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] Instance failed to rebuild correctly, setting to ERROR state _init_instance /usr/lib/python2.7/dist-packages/nova/compute/manager.py:823

(nova-compute picks up the message from the MQ and continues the rebuild)

2018-01-08 11:20:18.900 2352 INFO nova.compute.manager [req-d635de5c-d5b4-4b20-96c6-663291e5554b 59a94426c0404b5fae4f81d10bfc9b2f 5673331fd93740f58e9b36ebc56b7183 - default default] [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] Rebuilding instance
2018-01-08 11:20:19.101 2352 DEBUG nova.notifications.objects.base [req-d635de5c-d5b4-4b20-96c6-663291e5554b 59a94426c0404b5fae4f81d10bfc9b2f 5673331fd93740f58e9b36ebc56b7183 - default default] Defaulting the value of the field 'projects' to None in FlavorPayload due to 'Cannot call _load_projects on orphaned Flavor object' populate_schema /usr/lib/python2.7/dist-packages/nova/notifications/objects/base.py:125
2018-01-08 11:20:19.108 2352 DEBUG nova.compute.manager [req-d635de5c-d5b4-4b20-96c6-663291e5554b 59a94426c0404b5fae4f81d10bfc9b2f 5673331fd93740f58e9b36ebc56b7183 - default default] [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] Checking state _get_power_state /usr/lib/python2.7/dist-packages/nova/compute/manager.py:1184
2018-01-08 11:20:19.242 2352 DEBUG nova.compute.utils [req-d635de5c-d5b4-4b20-96c6-663291e5554b 59a94426c0404b5fae4f81d10bfc9b2f 5673331fd93740f58e9b36ebc56b7183 - default default] [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] Conflict updating instance 5c6c8913-a160-4784-9e0e-c3e24cf0511c. Expected: {'task_state': [u'rebuilding']}. Actual: {'task_state': None} notify_about_instance_usage /usr/lib/python2.7/dist-packages/nova/compute/utils.py:328
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [req-d635de5c-d5b4-4b20-96c6-663291e5554b 59a94426c0404b5fae4f81d10bfc9b2f 5673331fd93740f58e9b36ebc56b7183 - default default] [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] Setting instance vm_state to ERROR: UnexpectedTaskStateError_Remote: Conflict updating instance 5c6c8913-a160-4784-9e0e-c3e24cf0511c. Expected: {'task_state': [u'rebuilding']}. Actual: {'task_state': None}
Traceback (most recent call last):

  File "/usr/lib/python2.7/dist-packages/nova/conductor/manager.py", line 123, in _object_dispatch
    return getattr(target, method)(*args, **kwargs)

  File "/usr/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 226, in wrapper
    return fn(self, *args, **kwargs)

  File "/usr/lib/python2.7/dist-packages/nova/objects/instance.py", line 785, in save
    columns_to_join=_expected_cols(expected_attrs))

  File "/usr/lib/python2.7/dist-packages/nova/db/api.py", line 859, in instance_update_and_get_original
    expected=expected)

  File "/usr/lib/python2.7/dist-packages/nova/db/sqlalchemy/api.py", line 179, in wrapper
    return f(*args, **kwargs)

  File "/usr/lib/python2.7/dist-packages/oslo_db/api.py", line 150, in wrapper
    ectxt.value = e.inner_exc

  File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()

  File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)

  File "/usr/lib/python2.7/dist-packages/oslo_db/api.py", line 138, in wrapper
    return f(*args, **kwargs)

  File "/usr/lib/python2.7/dist-packages/nova/db/sqlalchemy/api.py", line 250, in wrapped
    return f(context, *args, **kwargs)

  File "/usr/lib/python2.7/dist-packages/nova/db/sqlalchemy/api.py", line 2666, in instance_update_and_get_original
    context, instance_uuid, values, expected, original=instance_ref))

  File "/usr/lib/python2.7/dist-packages/nova/db/sqlalchemy/api.py", line 2802, in _instance_update
    raise exc(**exc_props)

UnexpectedTaskStateError: Conflict updating instance 5c6c8913-a160-4784-9e0e-c3e24cf0511c. Expected: {'task_state': [u'rebuilding']}. Actual: {'task_state': None}

2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] Traceback (most recent call last):
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 6839, in _error_out_instance_on_exception
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] yield
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2821, in rebuild_instance
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] bdms, recreate, on_shared_storage, preserve_ephemeral)
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2869, in _do_rebuild_instance_with_claim
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] self._do_rebuild_instance(*args, **kwargs)
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2943, in _do_rebuild_instance
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] instance.save(expected_task_state=[task_states.REBUILDING])
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] File "/usr/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 210, in wrapper
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] ctxt, self, fn.__name__, args, kwargs)
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] File "/usr/lib/python2.7/dist-packages/nova/conductor/rpcapi.py", line 245, in object_action
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] objmethod=objmethod, args=args, kwargs=kwargs)
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 169, in call
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] retry=self.retry)
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] File "/usr/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 123, in _send
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] timeout=timeout, retry=retry)
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 578, in send
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] retry=retry)
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 569, in _send
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] raise result
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] UnexpectedTaskStateError_Remote: Conflict updating instance 5c6c8913-a160-4784-9e0e-c3e24cf0511c. Expected: {'task_state': [u'rebuilding']}. Actual: {'task_state': None}

xulei (605423512-j) wrote:

The confusing state is caused by nova-compute. When nova-compute restarts, it sets all instances with task_state 'rebuilding' to ERROR. Then nova-compute handles the RPC message to rebuild and raises this error. I think this is proper because nobody knows the instance's true state when it is left as 'rebuilding' in the DB. I will optimize the exception handling.
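
A rough illustration of the two code paths involved (simplified logic only, not the actual nova source): on startup the _init_instance reset flips the instance to ERROR and clears task_state, and when the queued rebuild RPC is later consumed its conditional save, which expects task_state to still be 'rebuilding', conflicts with the cleared value and raises UnexpectedTaskStateError:

# Simplified sketch of the conflict; not the actual nova code, just the
# logic described above.

REBUILDING = 'rebuilding'


class UnexpectedTaskStateError(Exception):
    pass


# State the API layer left in the DB before nova-compute went down.
db_row = {'vm_state': 'active', 'task_state': REBUILDING}


def init_instance(row):
    # On nova-compute startup: an instance found mid-rebuild cannot be
    # trusted, so it is flipped to ERROR and its task_state is cleared.
    if row['task_state'] == REBUILDING:
        row['vm_state'] = 'error'
        row['task_state'] = None


def save(row, updates, expected_task_state):
    # Conditional update: apply only if the stored task_state still matches.
    if row['task_state'] not in expected_task_state:
        raise UnexpectedTaskStateError(
            'Expected: %s. Actual: %s' % (expected_task_state,
                                          row['task_state']))
    row.update(updates)


init_instance(db_row)  # the startup reset runs first
# ...then the rebuild RPC that was waiting in compute.compute1 is consumed
# and tries to save progress, expecting task_state to still be 'rebuilding':
save(db_row, {'vm_state': 'active', 'task_state': None},
     expected_task_state=[REBUILDING])  # raises UnexpectedTaskStateError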

Changed in nova:
assignee: nobody → xulei (605423512-j)
OpenStack Infra (hudson-openstack) wrote: Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/535665

Changed in nova:
status: New → In Progress
Changed in nova:
assignee: xulei (605423512-j) → Stephen Finucane (stephenfinucane)