vm rebuild fails but leaves a confusing state

Bug #1741817 reported by yuanyue on 2018-01-08
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: In Progress
Importance: Undecided
Assigned to: Stephen Finucane

Bug Description

Description
===========

If nova-compute crashes on the host and I then try to rebuild an instance on that host, the rebuild gets stuck in the 'rebuilding' state, which is reasonable because nothing consumes the message from the MQ. However, once nova-compute on that host is started again, the instance ends up in a confusing state.

I am not sure whether this is by design or a bug.

Steps to reproduce
==================

(1) boot an instance

$nova boot --image cirros --flavor mini --nic net-id=4ccb0bc2-272f-46a2-ad32-5f5012a5a49f y
$nova list
+--------------------------------------+------+----------+--------+------------+-------------+--------------------+
| ID | Name | Host | Status | Task State | Power State | Networks |
+--------------------------------------+------+----------+--------+------------+-------------+--------------------+
| 5c6c8913-a160-4784-9e0e-c3e24cf0511c | y | compute1 | ACTIVE | - | Running | vxlan-l=100.53.0.8 |
+--------------------------------------+------+----------+--------+------------+-------------+--------------------+

(2) stop nova-compute on compute1

$nova service-list|grep compute1
| 985b4f89-92b9-4c4e-abd3-3b7d8f5f80a8 | nova-compute | compute1 | nova | enabled | down | 2018-01-08T03:17:25.000000 | - | False |
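
The "down" State above is derived from the service's last heartbeat. A minimal sketch of the check, assuming the default service_down_time of 60 seconds (the real logic lives in Nova's servicegroup DB driver; the names here are illustrative):

import datetime

SERVICE_DOWN_TIME = 60  # seconds; nova.conf [DEFAULT] service_down_time

def service_is_up(updated_at, now=None):
    # A stopped nova-compute stops heartbeating, so once more than
    # service_down_time has passed since its last report it shows "down".
    now = now or datetime.datetime.utcnow()
    return (now - updated_at).total_seconds() <= SERVICE_DOWN_TIME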

(3) rebuild the instance

$nova rebuild y cir
$nova list
+--------------------------------------+------+----------+---------+------------+-------------+--------------------+
| ID | Name | Host | Status | Task State | Power State | Networks |
+--------------------------------------+------+----------+---------+------------+-------------+--------------------+
| 5c6c8913-a160-4784-9e0e-c3e24cf0511c | y | compute1 | REBUILD | rebuilding | Running | vxlan-l=100.53.0.8 |
+--------------------------------------+------+----------+---------+------------+-------------+--------------------+

$ rabbitmqctl list_queues
Listing queues ...
...
compute.compute1 1
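
The "compute.compute1 1" line shows the rebuild request parked in the per-host topic queue with no consumer. A minimal sketch of why it waits there, assuming an oslo.messaging RabbitMQ transport (the real cast is made via nova/compute/rpcapi.py; the arguments here are illustrative):

from oslo_config import cfg
import oslo_messaging as messaging

transport = messaging.get_transport(cfg.CONF)  # e.g. rabbit://...
target = messaging.Target(topic='compute', server='compute1')
client = messaging.RPCClient(transport, target)

# cast() is fire-and-forget: it returns immediately and the message
# stays queued on compute.compute1 until nova-compute on compute1
# comes back to consume it.
client.cast({}, 'rebuild_instance',
            instance_uuid='5c6c8913-a160-4784-9e0e-c3e24cf0511c')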

(4) start nova-compute on compute1

At first:

$ nova list
+--------------------------------------+------+----------+--------+------------+-------------+--------------------+
| ID | Name | Host | Status | Task State | Power State | Networks |
+--------------------------------------+------+----------+--------+------------+-------------+--------------------+
| 5c6c8913-a160-4784-9e0e-c3e24cf0511c | y | compute1 | ERROR | - | Running | vxlan-l=100.53.0.8 |
+--------------------------------------+------+----------+--------+------------+-------------+--------------------+

but soon afterwards it gets stuck in the following state:

$ nova list
+--------------------------------------+------+----------+---------+------------+-------------+--------------------+
| ID | Name | Host | Status | Task State | Power State | Networks |
+--------------------------------------+------+----------+---------+------------+-------------+--------------------+
| 5c6c8913-a160-4784-9e0e-c3e24cf0511c | y | compute1 | REBUILD | rebuilding | Running | vxlan-l=100.53.0.8 |
+--------------------------------------+------+----------+---------+------------+-------------+--------------------+

$ nova show y
+--------------------------------------+----------------------------------------------------------+
| Property | Value |
+--------------------------------------+----------------------------------------------------------+
| OS-DCF:diskConfig | MANUAL |
| OS-EXT-AZ:availability_zone | nova |
| OS-EXT-SRV-ATTR:host | compute1 |
| OS-EXT-SRV-ATTR:hostname | y |
| OS-EXT-SRV-ATTR:hypervisor_hostname | compute1 |
| OS-EXT-SRV-ATTR:instance_name | instance-000025db |
| OS-EXT-SRV-ATTR:kernel_id | |
| OS-EXT-SRV-ATTR:launch_index | 0 |
| OS-EXT-SRV-ATTR:ramdisk_id | |
| OS-EXT-SRV-ATTR:reservation_id | r-wo128mco |
| OS-EXT-SRV-ATTR:root_device_name | /dev/vda |
| OS-EXT-SRV-ATTR:user_data | - |
| OS-EXT-STS:power_state | 1 |
| OS-EXT-STS:task_state | rebuilding |
| OS-EXT-STS:vm_state | error |
| OS-SRV-USG:launched_at | 2018-01-08T03:17:03.000000 |
| OS-SRV-USG:terminated_at | - |
| accessIPv4 | |
| accessIPv6 | |
| config_drive | |
| created | 2018-01-08T03:16:39Z |
| description | - |
| flavor:disk | 1 |
| flavor:ephemeral | 0 |
| flavor:extra_specs | {} |
| flavor:original_name | mini |
| flavor:ram | 64 |
| flavor:swap | 0 |
| flavor:vcpus | 1 |
| hostId | a7837285ad98259e17a7ca2a0fdd9194c47113068489c17ec3c3c4ce |
| host_status | UP |
| id | 5c6c8913-a160-4784-9e0e-c3e24cf0511c |
| image | cir (14440220-3505-4d45-b4b5-f2a184145140) |
| key_name | - |
| locked | False |
| metadata | {} |
| name | y |
| os-extended-volumes:volumes_attached | [] |
| progress | 0 |
| security_groups | default |
| status | REBUILD |
| tags | [] |
| tenant_id | 5673331fd93740f58e9b36ebc56b7183 |
| updated | 2018-01-08T03:20:19Z |
| user_id | 59a94426c0404b5fae4f81d10bfc9b2f |
| vxlan-l network | 100.53.0.8 |
+--------------------------------------+----------------------------------------------------------+

Expected result
===============
The instance status should become ERROR, or the rebuild operation should continue to completion.

Actual result
=============
The instance status stays stuck in REBUILD and the task_state stays stuck in rebuilding.

Environment
===========

nova-api 2:16.0.1-0ubuntu1~cloud0
nova-common 2:16.0.1-0ubuntu1~cloud0
nova-conductor 2:16.0.1-0ubuntu1~cloud0
nova-consoleauth 2:16.0.1-0ubuntu1~cloud0
nova-novncproxy 2:16.0.1-0ubuntu1~cloud0
nova-placement-api 2:16.0.1-0ubuntu1~cloud0
nova-scheduler 2:16.0.1-0ubuntu1~cloud0
python-nova 2:16.0.1-0ubuntu1~cloud0
python-novaclient 2:9.1.0-0ubuntu1~cloud0

Libvirt + KVM, CEPH, NEUTRON with linuxbridge

Logs
====
On compute node:

(after nova-compute started)

2018-01-08 11:20:15.685 2352 DEBUG nova.compute.manager [req-ded7ce8a-9ac4-4fb3-8f4e-a35dd5505c4e - - - - -] [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] Instance failed to rebuild correctly, setting to ERROR state _init_instance /usr/lib/python2.7/dist-packages/nova/compute/manager.py:823
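
This is the recovery branch in _init_instance: on startup, nova-compute treats any instance still in the 'rebuilding' task state as an interrupted rebuild. A simplified sketch of that branch, based on the log message above (the actual code in nova/compute/manager.py differs in detail):

# On nova-compute startup, for each instance owned by this host:
if instance.task_state == 'rebuilding':
    # We cannot tell how far the interrupted rebuild got, so flag the
    # instance as broken and clear its task state.
    LOG.debug("Instance failed to rebuild correctly, "
              "setting to ERROR state", instance=instance)
    instance.vm_state = 'error'
    instance.task_state = None
    instance.save()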

(nova-compute picks up the queued message from the MQ and continues the rebuild)

2018-01-08 11:20:18.900 2352 INFO nova.compute.manager [req-d635de5c-d5b4-4b20-96c6-663291e5554b 59a94426c0404b5fae4f81d10bfc9b2f 5673331fd93740f58e9b36ebc56b7183 - default default] [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] Rebuilding instance
2018-01-08 11:20:19.101 2352 DEBUG nova.notifications.objects.base [req-d635de5c-d5b4-4b20-96c6-663291e5554b 59a94426c0404b5fae4f81d10bfc9b2f 5673331fd93740f58e9b36ebc56b7183 - default default] Defaulting the value of the field 'projects' to None in FlavorPayload due to 'Cannot call _load_projects on orphaned Flavor object' populate_schema /usr/lib/python2.7/dist-packages/nova/notifications/objects/base.py:125
2018-01-08 11:20:19.108 2352 DEBUG nova.compute.manager [req-d635de5c-d5b4-4b20-96c6-663291e5554b 59a94426c0404b5fae4f81d10bfc9b2f 5673331fd93740f58e9b36ebc56b7183 - default default] [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] Checking state _get_power_state /usr/lib/python2.7/dist-packages/nova/compute/manager.py:1184
2018-01-08 11:20:19.242 2352 DEBUG nova.compute.utils [req-d635de5c-d5b4-4b20-96c6-663291e5554b 59a94426c0404b5fae4f81d10bfc9b2f 5673331fd93740f58e9b36ebc56b7183 - default default] [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] Conflict updating instance 5c6c8913-a160-4784-9e0e-c3e24cf0511c. Expected: {'task_state': [u'rebuilding']}. Actual: {'task_state': None} notify_about_instance_usage /usr/lib/python2.7/dist-packages/nova/compute/utils.py:328
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [req-d635de5c-d5b4-4b20-96c6-663291e5554b 59a94426c0404b5fae4f81d10bfc9b2f 5673331fd93740f58e9b36ebc56b7183 - default default] [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] Setting instance vm_state to ERROR: UnexpectedTaskStateError_Remote: Conflict updating instance 5c6c8913-a160-4784-9e0e-c3e24cf0511c. Expected: {'task_state': [u'rebuilding']}. Actual: {'task_state': None}
Traceback (most recent call last):

  File "/usr/lib/python2.7/dist-packages/nova/conductor/manager.py", line 123, in _object_dispatch
    return getattr(target, method)(*args, **kwargs)

  File "/usr/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 226, in wrapper
    return fn(self, *args, **kwargs)

  File "/usr/lib/python2.7/dist-packages/nova/objects/instance.py", line 785, in save
    columns_to_join=_expected_cols(expected_attrs))

  File "/usr/lib/python2.7/dist-packages/nova/db/api.py", line 859, in instance_update_and_get_original
    expected=expected)

  File "/usr/lib/python2.7/dist-packages/nova/db/sqlalchemy/api.py", line 179, in wrapper
    return f(*args, **kwargs)

  File "/usr/lib/python2.7/dist-packages/oslo_db/api.py", line 150, in wrapper
    ectxt.value = e.inner_exc

  File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()

  File "/usr/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)

  File "/usr/lib/python2.7/dist-packages/oslo_db/api.py", line 138, in wrapper
    return f(*args, **kwargs)

  File "/usr/lib/python2.7/dist-packages/nova/db/sqlalchemy/api.py", line 250, in wrapped
    return f(context, *args, **kwargs)

  File "/usr/lib/python2.7/dist-packages/nova/db/sqlalchemy/api.py", line 2666, in instance_update_and_get_original
    context, instance_uuid, values, expected, original=instance_ref))

  File "/usr/lib/python2.7/dist-packages/nova/db/sqlalchemy/api.py", line 2802, in _instance_update
    raise exc(**exc_props)

UnexpectedTaskStateError: Conflict updating instance 5c6c8913-a160-4784-9e0e-c3e24cf0511c. Expected: {'task_state': [u'rebuilding']}. Actual: {'task_state': None}

2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] Traceback (most recent call last):
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 6839, in _error_out_instance_on_exception
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] yield
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2821, in rebuild_instance
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] bdms, recreate, on_shared_storage, preserve_ephemeral)
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2869, in _do_rebuild_instance_with_claim
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] self._do_rebuild_instance(*args, **kwargs)
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] File "/usr/lib/python2.7/dist-packages/nova/compute/manager.py", line 2943, in _do_rebuild_instance
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] instance.save(expected_task_state=[task_states.REBUILDING])
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] File "/usr/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 210, in wrapper
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] ctxt, self, fn.__name__, args, kwargs)
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] File "/usr/lib/python2.7/dist-packages/nova/conductor/rpcapi.py", line 245, in object_action
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] objmethod=objmethod, args=args, kwargs=kwargs)
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] File "/usr/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 169, in call
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] retry=self.retry)
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] File "/usr/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 123, in _send
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] timeout=timeout, retry=retry)
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 578, in send
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] retry=retry)
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] File "/usr/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 569, in _send
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] raise result
2018-01-08 11:20:19.274 2352 ERROR nova.compute.manager [instance: 5c6c8913-a160-4784-9e0e-c3e24cf0511c] UnexpectedTaskStateError_Remote: Conflict updating instance 5c6c8913-a160-4784-9e0e-c3e24cf0511c. Expected: {'task_state': [u'rebuilding']}. Actual: {'task_state': None}
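
The conflict itself comes from Nova's compare-and-swap style instance save: instance.save(expected_task_state=[task_states.REBUILDING]) only succeeds if the database row still holds the expected task state. Because _init_instance already reset the task state at startup, the queued rebuild's expectation no longer matches. A minimal self-contained sketch of the mechanism (the real implementation is _instance_update in nova/db/sqlalchemy/api.py; this is illustrative):

class UnexpectedTaskStateError(Exception):
    pass

def instance_update(row, values, expected_task_state):
    # Conditional update: refuse to write if another actor changed
    # task_state since the caller last read it.
    if row['task_state'] not in expected_task_state:
        raise UnexpectedTaskStateError(
            "Conflict updating instance %s. Expected: {'task_state': %r}. "
            "Actual: {'task_state': %r}"
            % (row['uuid'], expected_task_state, row['task_state']))
    row.update(values)

# In this bug, _init_instance reset task_state to None at startup, so
# the queued rebuild's save(expected_task_state=['rebuilding']) fails:
row = {'uuid': '5c6c8913-a160-4784-9e0e-c3e24cf0511c', 'task_state': None}
try:
    instance_update(row, {'vm_state': 'active'}, ['rebuilding'])
except UnexpectedTaskStateError as e:
    print(e)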

xulei (605423512-j) wrote:

The confusing state is caused by nova-compute. When nova-compute restarts, it sets every instance whose task_state is 'rebuilding' to ERROR. It then handles the queued rebuild RPC, which raises this error. I think this behaviour is proper, because nobody knows an instance's true state when the DB says it is rebuilding. I will improve the exception handling in this path.

Changed in nova:
assignee: nobody → xulei (605423512-j)

Fix proposed to branch: master
Review: https://review.openstack.org/535665

Changed in nova:
status: New → In Progress
Changed in nova:
assignee: xulei (605423512-j) → Stephen Finucane (stephenfinucane)