Instance recovery needed when Compute service goes down during Reboot

Bug #1072751 reported by Rohit Karajgi
This bug affects 1 person
Affects                   Status        Importance  Assigned to    Milestone
OpenStack Compute (nova)  Fix Released  Medium      Michael Still

Bug Description

Scenario:

If the Compute service goes down just after destroying the instance and before recreating the domain on the hypervisor,
then the instance's task state remains rebooting and the instance is left in an inconsistent state after Compute comes back up.
The admin has to recreate the instance on the hypervisor using the instance's XML.

This is another corner-case scenario with low probability, but it could be handled by the code.

Michael Still (mikal)
Changed in nova:
status: New → Triaged
importance: Undecided → Medium
tags: added: libvirt
Changed in nova:
assignee: nobody → tcs_openstack_group (tcs-openstack-group)
Kanchan Gupta (kanchan-gupta1) wrote :

More information is required to fix the bug.

So far the following scenario has been replicated:
If the nova-compute service is stopped after destroying the instance and before
recreating the domain on the hypervisor, the task state remains rebooting.
When the compute service comes back, the instance goes into the shutoff state. It has to be
restarted using the instance's XML.

Could you please elaborate on what the behavior should be when the compute
service is up and running again: should the instance's state be active and
running instead of shutoff and shutdown?

Jyotsna (jyotsna-priya1) wrote :

Proceeding to un-assign since there is no response on the query posted.

Changed in nova:
assignee: tcs_openstack_group (tcs-openstack-group) → nobody
Grzegorz Grasza (xek)
Changed in nova:
assignee: nobody → Grzegorz Grasza (xek)
Grzegorz Grasza (xek)
Changed in nova:
status: Triaged → In Progress
Grzegorz Grasza (xek) wrote :

I was able to reproduce the error by inserting an ipdb breakpoint in _soft_reboot in nova/virt/libvirt/driver.py and killing the service.

There is code to fix states on startup (cc0be157d005c5588fe5db779fc30fefbf22b44d), but there is an error:

Traceback (most recent call last):

  File "/opt/stack/nova/nova/conductor/manager.py", line 420, in _object_dispatch
    return getattr(target, method)(*args, **kwargs)

  File "/opt/stack/nova/nova/objects/base.py", line 163, in wrapper
    result = fn(cls, context, *args, **kwargs)

  File "/opt/stack/nova/nova/objects/instance_action.py", line 170, in event_start
    db_event = db.action_event_start(context, values)

  File "/opt/stack/nova/nova/db/api.py", line 1850, in action_event_start
    return IMPL.action_event_start(context, values)

  File "/opt/stack/nova/nova/db/sqlalchemy/api.py", line 5622, in action_event_start
    instance_uuid=values['instance_uuid'])

InstanceActionNotFound: Action for request_id req-4bc0fd19-f392-421d-86c5-c2a519a2b8cc on instance b7a52ee9-7214-4133-bf80-bf94dc2c5af1 not found

OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/170123

OpenStack Infra (hudson-openstack) wrote : Change abandoned on nova (master)

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/170123
Reason: This patch has been stalled for a long time, so I am abandoning it. Please feel free to restore it when the code is ready for review.

Michael Still (mikal) wrote :

I see a different error, but it's been a while since this bug was looked at:

2015-09-03 11:01:55.783 DEBUG nova.objects.instance [req-b5c49e6b-aa07-4ba2-a830-9bab570eb6be None None] Lazy-loading `metadata' on Instance uuid 25bfa30e-2aad-49f1-8cc1-32654916343a from (pid=32116) obj_load_attr /opt/stack/nova/nova/objects/instance.py:864
2015-09-03 11:01:55.830 DEBUG nova.compute.manager [req-b5c49e6b-aa07-4ba2-a830-9bab570eb6be None None] [instance: 25bfa30e-2aad-49f1-8cc1-32654916343a] Checking state from (pid=32116) _get_power_state /opt/stack/nova/nova/compute/manager.py:1317
2015-09-03 11:01:55.954 INFO nova.compute.manager [req-b5c49e6b-aa07-4ba2-a830-9bab570eb6be None None] Task possibly preempted: Conflict updating instance 25bfa30e-2aad-49f1-8cc1-32654916343a. Expected: {'task_state': [u'rebooting_hard', u'reboot_pending_hard', u'reboot_started_hard']}. Actual: {'task_state': u'reboot_started'}
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/poll.py", line 115, in wait
    listener.cb(fileno)
  File "/usr/local/lib/python2.7/dist-packages/eventlet/greenthread.py", line 214, in main
    result = function(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/oslo_service/service.py", line 645, in run_service
    service.start()
  File "/opt/stack/nova/nova/service.py", line 164, in start
    self.manager.init_host()
  File "/opt/stack/nova/nova/compute/manager.py", line 1297, in init_host
    self._init_instance(context, instance)
  File "/opt/stack/nova/nova/compute/manager.py", line 1048, in _init_instance
    reboot_type=reboot_type)
  File "/opt/stack/nova/nova/exception.py", line 89, in wrapped
    payload)
  File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 195, in __exit__
    six.reraise(self.type_, self.value, self.tb)
  File "/opt/stack/nova/nova/exception.py", line 72, in wrapped
    return f(self, context, *args, **kw)
  File "/opt/stack/nova/nova/compute/manager.py", line 329, in decorated_function
    e.format_message())
  File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 195, in __exit__
    six.reraise(self.type_, self.value, self.tb)
  File "/opt/stack/nova/nova/compute/manager.py", line 322, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/opt/stack/nova/nova/compute/manager.py", line 401, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/opt/stack/nova/nova/compute/manager.py", line 379, in decorated_function
    kwargs['instance'], e, sys.exc_info())
  File "/usr/local/lib/python2.7/dist-packages/oslo_utils/excutils.py", line 195, in __exit__
    six.reraise(self.type_, self.value, self.tb)
  File "/opt/stack/nova/nova/compute/manager.py", line 367, in decorated_function
    return function(self, context, *args, **kwargs)
  File "/opt/stack/nova/nova/compute/manager.py", line 2830, in reboot_instance
    instance.save(expected_task_state=expected_states)
  File "/usr/local/lib/python2.7/dist-packages/oslo_versionedobjects/base.py", line 197, in wrapper
    ctxt, self, fn.__name__, args, kwargs)
  File "/opt/stack/nova/nova/conductor/rpcapi.py", ...
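
The "Task possibly preempted" message in the log above comes from a conditional save: the update only goes through if the stored task_state is one of the expected values. The following is a minimal sketch of that check, not the nova code itself. The function save_task_state, the exception class, and the dictionary standing in for the instance record are all hypothetical; only the task state strings are taken from the log line.

```python
class UnexpectedTaskStateError(Exception):
    """Hypothetical stand-in for the conflict reported in the log."""


def save_task_state(record, new_task_state, expected_task_state):
    """Set record['task_state'] only if its current value is expected.

    `record` is a plain dict standing in for an instance row (assumption).
    """
    actual = record["task_state"]
    if actual not in expected_task_state:
        raise UnexpectedTaskStateError(
            "Expected: %s. Actual: %s" % (expected_task_state, actual))
    record["task_state"] = new_task_state
    return record


# Expected states for a hard reboot, as listed in the log line.
HARD_REBOOT_STATES = [
    "rebooting_hard", "reboot_pending_hard", "reboot_started_hard"]

# The instance was left in a soft reboot task state by the crash.
instance = {"uuid": "example-instance", "task_state": "reboot_started"}

try:
    save_task_state(instance, "reboot_started_hard", HARD_REBOOT_STATES)
    conflict = False
except UnexpectedTaskStateError:
    conflict = True
```

In this sketch the save is rejected and the record keeps its soft reboot task state, which mirrors the situation the traceback describes.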


OpenStack Infra (hudson-openstack) wrote :

Change abandoned by Michael Still (<email address hidden>) on branch: master
Review: https://review.openstack.org/170123
Reason: Actually, I want to rework this one, so re-abandon.

Michael Still (mikal) wrote :

So, if I add the soft reboot states to the list of expected states for a hard reboot, nova-compute does the right thing. We get this logged:

2015-09-03 12:59:32.168 INFO nova.compute.manager [req-c5e2e629-5627-4bd3-8168-1a050a79d184 None None] [instance: e2b5d3cc-36a0-492b-8d35-e324c4fda4f4] Instance in transitional state (reboot_started) at start-up and power state is (4), triggering HARD reboot

I think the key here is that we think that the instance power state is SHUTDOWN (4), which is allowed by a soft reboot. However, the nova.compute.utils.get_reboot_type code only expects that for a hard reboot.

So, I think in that case we just change the task state to a hard reboot pending and keep rolling.
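
The idea above can be sketched roughly as follows. This is an illustration under stated assumptions, not the merged patch: the constants are defined locally, RUNNING = 1 is assumed (only SHUTDOWN = 4 appears in the log), "reboot_pending" is assumed by analogy with the hard reboot states in the earlier log, and task_state_for_recovery is a hypothetical helper name. The local get_reboot_type is a simplified stand-in for the behaviour described for nova.compute.utils.get_reboot_type.

```python
RUNNING = 1   # assumed value, not shown in the thread
SHUTDOWN = 4  # value shown in the log line above

# "reboot_pending" is an assumption by analogy with the hard states.
SOFT_REBOOT_STATES = ("rebooting", "reboot_pending", "reboot_started")
HARD_REBOOT_STATES = (
    "rebooting_hard", "reboot_pending_hard", "reboot_started_hard")


def get_reboot_type(task_state, power_state):
    """Simplified stand-in: anything not running needs a HARD reboot."""
    if power_state != RUNNING:
        return "HARD"
    return "SOFT" if task_state in SOFT_REBOOT_STATES else "HARD"


def task_state_for_recovery(task_state, power_state):
    """Hypothetical helper: pick the reboot type and a matching task state.

    If a HARD reboot is needed but the instance is still in a soft reboot
    task state, move it to a hard reboot pending state so that a later
    save with the hard reboot expected states does not conflict.
    """
    reboot_type = get_reboot_type(task_state, power_state)
    if reboot_type == "HARD" and task_state in SOFT_REBOOT_STATES:
        task_state = "reboot_pending_hard"
    return reboot_type, task_state


# Crash happened after the domain was shut down during a soft reboot.
recovered = task_state_for_recovery("reboot_started", SHUTDOWN)
```

With these assumptions, an instance found in reboot_started with power state 4 ends up in a hard reboot pending state, while an instance that is still running keeps its soft reboot task state.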

Changed in nova:
assignee: Grzegorz Grasza (xek) → Michael Still (mikalstill)
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/219980

OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/219980
Committed: https://git.openstack.org/cgit/openstack/nova/commit/?id=22d9e3d2ae7a36ef28ee3b539210e3362f486724
Submitter: Jenkins
Branch: master

commit 22d9e3d2ae7a36ef28ee3b539210e3362f486724
Author: Michael Still <email address hidden>
Date: Thu Sep 3 13:41:10 2015 +1000

    Handle nova-compute failure during a soft reboot

    A soft reboot is soft in the sense that we let the instance
    respond to ACPI events before shutdown. We still shutdown the
    libvirt domain however.

    Therefore, if nova-compute crashes having shutdown the instance
    domain, but before starting it again, when nova-compute
    restarts it will see an instance in a task_state indicating a
    soft reboot, but with the power_state being shutdown. This was
    unexpected and caused an instance.save() to crash out.

    In those cases, change the task_state to one corresponding to
    a hard reboot, and continue on.

    Change-Id: Icdde0bc2e8c8c90ba20f48f010f230ae4d4dca54
    Closes-Bug: #1072751

Changed in nova:
status: In Progress → Fix Committed
Thierry Carrez (ttx)
Changed in nova:
milestone: none → liberty-rc1
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: liberty-rc1 → 12.0.0