OpenStack Compute (nova)

Instance recovery needed when Compute service goes down during Reboot

Bug #1072751 reported by Rohit Karajgi on 2012-10-29

This bug affects 1 person

Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Michael Still
Milestone: OpenStack Compute (nova) 12.0.0 "liberty"

Bug Description

Scenario:

If the Compute service goes down just after destroying the instance and before recreating the domain on the hypervisor,
then the instance's task state remains "rebooting" and the instance stays in an inconsistent state after Compute comes back up.
The admin has to recreate the instance on the hypervisor using the instance's XML.

This is another low-probability corner case, but the code could handle it.
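
The window described in the scenario can be modeled in a few lines. This is only a toy simulation, not nova code: `FakeHypervisor`, `Instance`, `reboot`, `ComputeCrashed`, and the `crash_after_destroy` flag are hypothetical names introduced here to show how a crash between "destroy" and "recreate" leaves the record in a rebooting task state with no running domain.

```python
class ComputeCrashed(Exception):
    """Stands in for the nova-compute process dying (hypothetical)."""


class FakeHypervisor:
    """Toy hypervisor that only tracks which domains are running."""

    def __init__(self):
        self.running = set()

    def destroy(self, uuid):
        self.running.discard(uuid)

    def create(self, uuid):
        self.running.add(uuid)


class Instance:
    """Toy instance record holding only a uuid and a task state."""

    def __init__(self, uuid):
        self.uuid = uuid
        self.task_state = None


def reboot(hv, instance, crash_after_destroy=False):
    """Destroy and recreate the domain, optionally 'crashing' in between."""
    instance.task_state = "rebooting"
    hv.destroy(instance.uuid)
    if crash_after_destroy:
        # The service dies here: the domain is gone, the task state is stale.
        raise ComputeCrashed()
    hv.create(instance.uuid)
    instance.task_state = None


hv = FakeHypervisor()
inst = Instance("b7a52ee9")
hv.create(inst.uuid)
try:
    reboot(hv, inst, crash_after_destroy=True)
except ComputeCrashed:
    pass

# The record still says "rebooting", but nothing is running on the hypervisor.
print(inst.task_state, inst.uuid in hv.running)
```

Nothing in this sketch repairs the record after the crash, which is the gap the report asks the Compute service to close at start-up.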

Michael Still (mikal) on 2012-10-30

Changed in nova:
status:	New → Triaged
importance:	Undecided → Medium

Johannes Erdfelt (johannes.erdfelt) on 2014-04-23

tags:

added: libvirt

Akhila C (chetlapalle-akhila-b) on 2014-06-13

Changed in nova:
assignee:	nobody → tcs_openstack_group (tcs-openstack-group)

Kanchan Gupta (kanchan-gupta1) wrote on 2014-06-23:

More information is needed to fix this bug.

So far, the following scenario has been reproduced:
when the nova-compute service is stopped after destroying the instance and
before recreating the domain on the hypervisor, the task state remains
rebooting. When compute comes back, the instance goes into the shutoff state.
It has to be restarted using the instance's XML.

Could you please elaborate on the expected behavior once the compute
service is up and running again: should the instance's state be active and
running instead of shutoff and shutdown?

Jyotsna (jyotsna-priya1) wrote on 2014-07-25:

Unassigning, since there has been no response to the query posted above.

Changed in nova:
assignee:	tcs_openstack_group (tcs-openstack-group) → nobody

Grzegorz Grasza (xek) on 2015-03-20

Changed in nova:
assignee:	nobody → Grzegorz Grasza (xek)

Grzegorz Grasza (xek) on 2015-03-24

Changed in nova:
status:	Triaged → In Progress

Grzegorz Grasza (xek) wrote on 2015-04-02:

I was able to reproduce the error by inserting an ipdb breakpoint in _soft_reboot in nova/virt/libvirt/driver.py and killing the service.

There is code to fix states on startup (cc0be157d005c5588fe5db779fc30fefbf22b44d), but it fails with an error:

Traceback (most recent call last):
  File "/opt/stack/nova/nova/conductor/manager.py", line 420, in _object_dispatch
    return getattr(target, method)(*args, **kwargs)
  File "/opt/stack/nova/nova/objects/base.py", line 163, in wrapper
    result = fn(cls, context, *args, **kwargs)
  File "/opt/stack/nova/nova/objects/instance_action.py", line 170, in event_start
    db_event = db.action_event_start(context, values)
  File "/opt/stack/nova/nova/db/api.py", line 1850, in action_event_start
    return IMPL.action_event_start(context, values)
  File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 5622, in action_event_start
    instance_uuid=values['instance_uuid'])
InstanceActionNotFound: Action for request_id req-4bc0fd19-f392-421d-86c5-c2a519a2b8cc on instance b7a52ee9-7214-4133-bf80-bf94dc2c5af1 not found

OpenStack Infra (hudson-openstack) wrote on 2015-04-02: Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/170123

OpenStack Infra (hudson-openstack) wrote on 2015-08-12: Change abandoned on nova (master)

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/170123
Reason: This patch has been stalled for a long time, so I am abandoning it. Please feel free to restore it when the code is ready for review.

Michael Still (mikal) wrote on 2015-09-03:

I see a different error, but it's been a while since this bug was looked at:

2015-09-03 11:01:55.783 DEBUG nova.objects.instance [req-b5c49e6b-aa07-4ba2-a830-9bab570eb6be None None] Lazy-loading `metadata' on Instance uuid 25bfa30e-2aad-49f1-8cc1-32654916343a from (pid=32116) obj_load_attr /opt/stack/nova/nova/objects/instance.py:864
2015-09-03 11:01:55.830 DEBUG nova.compute.manager [req-b5c49e6b-aa07-4ba2-a830-9bab570eb6be None None] [instance: 25bfa30e-2aad-49f1-8cc1-32654916343a] Checking state from (pid=32116) _get_power_state /opt/stack/nova/nova/compute/manager.py:1317
2015-09-03 11:01:55.954 INFO nova.compute.manager [req-b5c49e6b-aa07-4ba2-a830-9bab570eb6be None None] Task possibly preempted: Conflict updating instance 25bfa30e-2aad-49f1-8cc1-32654916343a. Expected: {'task_state': [u'rebooting_hard', u'reboot_pending_hard', u'reboot_started_hard']}. Actual: {'task_state': u'reboot_started'}
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/poll.py", line 115, in wait
    listener.cb(fileno)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 214, in main
    result = function(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/oslo_service/service.py", line 645, in run_service
    service.start()
  File "/opt/stack/nova/nova/service.py", line 164, in start
    self.manager.init_host()
  File "/opt/stack/nova/nova/compute/manager.py", line 1297, in init_host
    self._init_instance(context, instance)
  File "/opt/stack/nova/nova/compute/manager.py", line 1048, in _init_instance
    reboot_type=reboot_type)
  File "/opt/stack/nova/nova/exception.py", line 89, in wrapped
    payload)
  File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 195, in __exit__
    six.reraise(self.type_, self.value, self.tb)
  File "/opt/stack/nova/nova/exception.py", line 72, in wrapped
    return f(self, context, *args, **kw)
  File "/opt/stack/nova/nova/compute/manager.py", line 329, in decorated_function
    e.format_message())
  File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 195, in __exit__
    six.reraise(self.type_, self.value, self.tb)
  File "/opt/stack/nova/nova/compute/manager.py", line 322, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/opt/stack/nova/nova/compute/manager.py", line 401, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/opt/stack/nova/nova/compute/manager.py", line 379, in decorated_function
    kwargs['instance'], e, sys.exc_info())
  File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 195, in __exit__
    six.reraise(self.type_, self.value, self.tb)
  File "/opt/stack/nova/nova/compute/manager.py", line 367, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/opt/stack/nova/nova/compute/manager.py", line 2830, in reboot_instance
    instance.save(expected_task_state=expected_states)
  File "/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 197, in wrapper
    ctxt, self, fn.__name__, args, kwargs)
  File "/opt/stack/nova/nova/conductor/rpcapi.py", line 233, in object_action
    objmethod=objmethod, args=args, kwargs=kwargs)
  File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/rpc/client.py", line 158, in call
    retry=self.retry)
  File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/transport.py", line 90, in _send
    timeout=timeout, retry=retry)
  File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 431, in send
    retry=retry)
  File "/usr/local/lib/python2.7/dist-packages/oslo_messaging/_drivers/amqpdriver.py", line 422, in _send
    raise result
UnexpectedTaskStateError_Remote: Conflict updating instance 25bfa30e-2aad-49f1-8cc1-32654916343a. Expected: {'task_state': [u'rebooting_hard', u'reboot_pending_hard', u'reboot_started_hard']}. Actual: {'task_state': u'reboot_started'}
Traceback (most recent call last):
  File "/opt/stack/nova/nova/conductor/manager.py", line 444, in _object_dispatch
    return getattr(target, method)(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 213, in wrapper
    return fn(self, *args, **kwargs)
  File "/opt/stack/nova/nova/objects/instance.py", line 728, in save
    columns_to_join=_expected_cols(expected_attrs))
  File "/opt/stack/nova/nova/db/api.py", line 764, in instance_update_and_get_original
    expected=expected)
  File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 216, in wrapper
    return f(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/oslo_db/api.py", line 146, in wrapper
    ectxt.value = e.inner_exc
  File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 195, in __exit__
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/local/lib/python2.7/dist-packages/oslo_db/api.py", line 136, in wrapper
    return f(*args, **kwargs)
  File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 2464, in instance_update_and_get_original
    expected, original=instance_ref))
  File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 2602, in _instance_update
    raise exc(**exc_props)
UnexpectedTaskStateError: Conflict updating instance 25bfa30e-2aad-49f1-8cc1-32654916343a. Expected: {'task_state': [u'rebooting_hard', u'reboot_pending_hard', u'reboot_started_hard']}. Actual: {'task_state': u'reboot_started'}
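
The conflict in this log looks like a guarded update: the save only goes through if the stored task_state is one of the expected values. A minimal sketch of that pattern follows, assuming a plain dict in place of the database row. `save_task_state` is a hypothetical helper, and the exception class is a local stand-in that merely borrows the name seen in the traceback; neither is nova's implementation.

```python
class UnexpectedTaskStateError(Exception):
    """Local stand-in named after the exception in the traceback above."""


# The three hard-reboot states listed as "Expected" in the log.
HARD_REBOOT_STATES = [
    "rebooting_hard",
    "reboot_pending_hard",
    "reboot_started_hard",
]


def save_task_state(db_row, new_state, expected_task_state):
    """Update db_row['task_state'] only if its current value is expected."""
    actual = db_row["task_state"]
    if actual not in expected_task_state:
        raise UnexpectedTaskStateError(
            "Expected: %s. Actual: %r" % (expected_task_state, actual))
    db_row["task_state"] = new_state


# State left behind by the interrupted soft reboot, as in the log.
row = {"task_state": "reboot_started"}
try:
    save_task_state(row, "reboot_started_hard", HARD_REBOOT_STATES)
except UnexpectedTaskStateError as exc:
    print("conflict:", exc)
```

In this model the hard-reboot path rejects the row because `reboot_started` is a soft-reboot state, which matches the "Expected ... Actual ..." pair in the log.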

OpenStack Infra (hudson-openstack) wrote on 2015-09-03:

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/170123
Reason: Actually, I want to rework this one, so re-abandon.

Michael Still (mikal) wrote on 2015-09-03:

So, if I add the soft reboot states to the list of expected states for a hard reboot, nova-compute does the right thing. We get this logged:

2015-09-03 12:59:32.168 INFO nova.compute.manager [req-c5e2e629-5627-4bd3-8168-1a050a79d184 None None] [instance: e2b5d3cc-36a0-492b-8d35-e324c4fda4f4] Instance in transitional state (reboot_started) at start-up and power state is (4), triggering HARD reboot

I think the key here is that the instance power state is reported as SHUTDOWN (4), which can legitimately happen during a soft reboot. However, the nova.compute.utils.get_reboot_type code only expects that power state for a hard reboot.

So, I think in that case we just change the task state to hard reboot pending and keep rolling.
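
A rough sketch of that idea, under stated assumptions: the hard-reboot state strings, `reboot_started`, and SHUTDOWN (4) come from the logs in this report, while `rebooting`, `reboot_pending`, and RUNNING = 1 are assumed here for illustration. `recover_reboot` is a hypothetical helper, not the merged patch, and treating every non-running power state as a reason to escalate is a simplification.

```python
SHUTDOWN = 4  # power state value shown in the log line above
RUNNING = 1   # assumed value, used only for this illustration

# 'reboot_started' appears in the log; the other two soft states are assumed.
SOFT_REBOOT_STATES = ("rebooting", "reboot_pending", "reboot_started")
HARD_REBOOT_STATES = (
    "rebooting_hard",
    "reboot_pending_hard",
    "reboot_started_hard",
)
REBOOT_PENDING_HARD = "reboot_pending_hard"


def recover_reboot(task_state, power_state):
    """Pick the (task_state, reboot_type) to resume with at start-up.

    Returns None when the instance is not in the middle of a reboot.
    """
    if task_state in HARD_REBOOT_STATES:
        return task_state, "HARD"
    if task_state in SOFT_REBOOT_STATES:
        if power_state == RUNNING:
            # The guest is still up, so the soft reboot can be retried.
            return task_state, "SOFT"
        # The domain is already down: move to the hard-reboot path so the
        # later guarded save sees a task state it expects.
        return REBOOT_PENDING_HARD, "HARD"
    return None


print(recover_reboot("reboot_started", SHUTDOWN))
```

The point of rewriting the task state before the hard reboot runs is that the expected-state check then passes instead of raising a conflict.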

Changed in nova:
assignee:	Grzegorz Grasza (xek) → Michael Still (mikalstill)

OpenStack Infra (hudson-openstack) wrote on 2015-09-03: Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/219980

OpenStack Infra (hudson-openstack) wrote on 2015-09-10: Fix merged to nova (master)

Reviewed: https://review.openstack.org/219980
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=22d9e3d2ae7a36ef28ee3b539210e3362f486724
Submitter: Jenkins
Branch: master

commit 22d9e3d2ae7a36ef28ee3b539210e3362f486724
Author: Michael Still <email address hidden>
Date: Thu Sep 3 13:41:10 2015 +1000

    Handle nova-compute failure during a soft reboot

    A soft reboot is soft in the sense that we let the instance
    respond to ACPI events before shutdown. We still shutdown the
    libvirt domain however.

    Therefore, if nova-compute crashes having shutdown the instance
    domain, but before starting it again, when nova-compute
    restarts it will see an instance in a task_state indicating a
    soft reboot, but with the power_state being shutdown. This was
    unexpected and caused an instance.save() to crash out.

    In those cases, change the task_state to one corresponding to
    a hard reboot, and continue on.

    Change-Id: Icdde0bc2e8c8c90ba20f48f010f230ae4d4dca54
    Closes-Bug: #1072751

Changed in nova:
status:	In Progress → Fix Committed

Thierry Carrez (ttx) on 2015-09-24

Changed in nova:
milestone:	none → liberty-rc1
status:	Fix Committed → Fix Released

Thierry Carrez (ttx) on 2015-10-15

Changed in nova:
milestone:	liberty-rc1 → 12.0.0


Related blueprints

Compute instance clean up daemon process
