attempting to reboot a shutdown/suspended/crashed/paused instance appears to have failed, but then surprisingly succeeds two minutes later

Bug #1236930 reported by Phil Frost
This bug affects 2 people
Affects: OpenStack Compute (nova)
Status: Fix Released
Importance: Medium
Assigned to: Guangya Liu (Jay Lau)

Bug Description

I am running Havana from precise-proposed in the UCA (nova 1:2013.2~b3-0ubuntu1~cloud0).

To reproduce:

- start an instance
- reboot (sudo reboot) the compute node on which it is running
- after the compute node is done booting, the instance will be off:

root@xen10:~# nova list
+--------------------------------------+------+---------+------------+-------------+-------------------------+
| ID | Name | Status | Task State | Power State | Networks |
+--------------------------------------+------+---------+------------+-------------+-------------------------+
| 4824dce8-d876-4022-a446-3fc8d708ac62 | test | SHUTOFF | None | Shutdown | novanetwork=172.20.46.3 |
+--------------------------------------+------+---------+------------+-------------+-------------------------+

(note that although my hostname has "xen" in it, I'm using KVM. Haven't updated DNS yet...)

- attempt to reboot the instance (nova reboot 4824dce8-d876-4022-a446-3fc8d708ac62)

# nova show 4824dce8-d876-4022-a446-3fc8d708ac62
+--------------------------------------+----------------------------------------------------------+
| Property | Value |
+--------------------------------------+----------------------------------------------------------+
| status | SHUTOFF |
| updated | 2013-10-08T15:28:47Z |
| OS-EXT-STS:task_state | rebooting |

The reboot fails. The compute node will log:

2013-10-08 11:28:55.579 1400 WARNING nova.compute.manager [req-11fe1624-22f6-4348-81c5-185d0ce0d3a0 a70453729dd84bfd8f31019b1bb91e40 46ab32189ab64a4c92f8f64e6c9ed028] [instance: 4824dce8-d876-4022-a446-3fc8d708ac62] trying to reboot a non-running instance: (state: 4 expected: 1)

- attempt to start the instance (nova start 4824dce8-d876-4022-a446-3fc8d708ac62):

produces console output:
ERROR: Instance 4824dce8-d876-4022-a446-3fc8d708ac62 in task_state rebooting. Cannot start while the instance is in this state. (HTTP 400) (Request-ID: req-732224e1-8c34-4754-84f7-7a8476673185)

- wait about 120 seconds, and the compute node will log:
2013-10-08 11:30:56.082 1400 WARNING nova.virt.libvirt.driver [req-11fe1624-22f6-4348-81c5-185d0ce0d3a0 a70453729dd84bfd8f31019b1bb91e40 46ab32189ab64a4c92f8f64e6c9ed028] [instance: 4824dce8-d876-4022-a446-3fc8d708ac62] Failed to soft reboot instance. Trying hard reboot.

Afterwards, the instance will be running.

It's confusing that the reboot logs a failure for a very obvious reason (an instance that is not running can't be *re*booted), yet the instance's state remains "rebooting". I had expected that the reboot had failed, and that OpenStack was in some consistent state. I was then surprised again when in fact it *was* still rebooting; it just took two minutes to do so. It would be less confusing to catch the original error and report the reboot as failed. The log messages are confusing because the first sets the expectation that a non-running instance can't be rebooted, but it can (two minutes later).

Changed in nova:
assignee: nobody → Jay Lau (jay-lau-513)
Guangya Liu (Jay Lau) (jay-lau-513) wrote :

We may need some discussion on this. Actually, for now, nova CAN reboot a STOPPED instance.

The logic is as follows (as you noted):
1)
2013-10-10 15:20:16.980 WARNING nova.compute.manager [req-975a1bd2-5c69-4a59-b506-7318bf599874 admin admin] [instance: 6105f3bf-7f58-4d42-bbbb-ff7186c16c36] trying to reboot a non-running instance: (state: 4 expected: 1)
2) the soft reboot fails and a hard reboot is attempted.
2013-10-10 15:22:17.524 WARNING nova.virt.libvirt.driver [req-975a1bd2-5c69-4a59-b506-7318bf599874 admin admin] [instance: 6105f3bf-7f58-4d42-bbbb-ff7186c16c36] Failed to soft reboot instance. Trying hard reboot.
3) the hard reboot first destroys the instance, then re-creates it and powers it on.

So it seems to be a valid case. Phil, what do you think? Thanks.

Phil Frost (bitglue) wrote : Re: [Bug 1236930] Re: attempting to reboot a shutdown instance appears to have failed, but then suprisingly succeeds two minutes later

On 10/10/2013 03:47 AM, Jay Lau wrote:
> So it seems to be a valid case. Phil, what do you think? Thanks.

I agree it's a valid case, and the logic seems to make sense, after it
all happens and has been investigated. The problem is just that in the
two minutes between the failed soft reboot, and when the hard reboot is
done, it's really confusing. Here's what went through my mind:

- let's reboot the instance
- hum...that's taking a while. Why?
- the logs say it failed, but the API indicates that it's still rebooting.
- let's see if I can reproduce
- let's file a bug report, and manually reset the instance state in the
database (I've run into this before, with other operations)
- what the hell? My instance is running now!

Besides being confusing, it's also unnecessarily slow. In those two
minutes between soft and hard reboot attempts, nothing else can be done
to the instance.

I think this could be avoided two ways:

1) the reboot procedure can check if the instance is not running, and if
so, just start it, instead of attempting to reboot it, since that's
bound to fail

2) the first soft reboot attempt can do a better job of checking for
failures, and if they are encountered, bypass the two minute timeout and
proceed directly to the hard reboot attempt.

Guangya Liu (Jay Lau) (jay-lau-513) wrote : Re: attempting to reboot a shutdown instance appears to have failed, but then suprisingly succeeds two minutes later

Thanks Phil.

>>>>>>>>1) the reboot procedure can check if the instance is not running, and if
so, just start it, instead of attempting to reboot it, since that's
bound to fail
<<<<<<<< An instance has many power states, so we may want to handle each state separately.

NOSTATE = 0x00
RUNNING = 0x01
PAUSED = 0x03
SHUTDOWN = 0x04 # the VM is powered off
CRASHED = 0x06
SUSPENDED = 0x07

If RUNNING, reboot
If PAUSED, unpause
If SHUTDOWN, start
If SUSPENDED, resume
If CRASHED, hard reboot

But this might make the reboot logic too complicated. Instead, what about
simply letting nova compute report an error directly when rebooting a
non-running VM?
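
For reference, the per-state handling listed above could be sketched as a small dispatch table. This is an illustration only: the constant values are the ones quoted in this comment, while `ACTION_FOR_STATE` and `action_for` are hypothetical names and not nova code.

```python
# Power state constants, with the values quoted in the comment above.
NOSTATE = 0x00
RUNNING = 0x01
PAUSED = 0x03
SHUTDOWN = 0x04  # the VM is powered off
CRASHED = 0x06
SUSPENDED = 0x07

# Hypothetical mapping from power state to the action a "reboot" request
# would trigger under the per-state proposal.
ACTION_FOR_STATE = {
    RUNNING: "reboot",
    PAUSED: "unpause",
    SHUTDOWN: "start",
    SUSPENDED: "resume",
    CRASHED: "hard reboot",
}


def action_for(power_state):
    """Return the proposed action, or raise for states with no rule (e.g. NOSTATE)."""
    try:
        return ACTION_FOR_STATE[power_state]
    except KeyError:
        raise ValueError("no reboot action defined for power state %d" % power_state)
```

Every new state needs its own branch and its own error handling, which is the complexity concern raised above.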

>>>>>>>>2) the first soft reboot attempt can do a better job of checking for
failures, and if they are encountered, bypass the two minute timeout and
proceed directly to the hard reboot attempt.
<<<<<<<< How can the soft reboot attempt "do a better job of checking for
failures"? Can you please give more detail?

Thanks!

Phil Frost (bitglue) wrote : Re: [Bug 1236930] Re: attempting to reboot a shutdown instance appears to have failed, but then suprisingly succeeds two minutes later

On 10/10/2013 08:00 AM, Jay Lau wrote:
> what about simply let nova compute report error directly if reboot a not
> running vm?

That's fine by me, though it would mean an incompatible change in
behavior. I don't know what the policy is on such things.

> How to enable "soft reboot attempt can do a better job of checking for
> failures"? Can you please give more detail?

I mean currently, I'm guessing the logic is something like this:

soft_reboot()
while 2 minutes have not elapsed {
  if instance has rebooted {
    return success
  }
}
hard_reboot()

When the soft reboot is attempted, there must be some error, because it
is logged (can't reboot a shutdown instance). But, it has absolutely no
bearing on the reboot procedure; it just waits two minutes, even though
it's clear (from the error) that the soft reboot has failed. The logic
could be this:

try {
  soft_reboot()
} catch {
  hard_reboot()
}
while 2 minutes have not elapsed {
  if instance has rebooted {
    return success
  }
}
hard_reboot()
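
A minimal Python sketch of the fail-fast logic above, assuming the soft reboot signals an immediate failure by raising an exception. `SoftRebootError`, `reboot`, and all of the callables passed in are hypothetical placeholders, not nova or libvirt APIs. Unlike the pseudocode, the error path returns right after the hard reboot instead of falling through to the wait loop.

```python
import time


class SoftRebootError(Exception):
    """Hypothetical error raised when a soft reboot cannot even be attempted."""


def reboot(soft_reboot, hard_reboot, has_rebooted, timeout=120.0,
           poll_interval=1.0, clock=time.monotonic, sleep=time.sleep):
    """Try a soft reboot; fall back to a hard reboot on error or timeout.

    All callables are supplied by the caller. Returns "soft" or "hard"
    to indicate which path completed the reboot.
    """
    try:
        soft_reboot()
    except SoftRebootError:
        # The soft reboot failed outright, so skip the two-minute wait.
        hard_reboot()
        return "hard"

    # The soft reboot was accepted; poll until the guest comes back.
    deadline = clock() + timeout
    while clock() < deadline:
        if has_rebooted():
            return "soft"
        sleep(poll_interval)

    # The guest never came back within the timeout.
    hard_reboot()
    return "hard"
```

The `clock` and `sleep` parameters exist only so the timeout path can be exercised without actually waiting two minutes.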

Or, if a change in behavior is acceptable, as you suggest, it could be:

if instance is not running {
  return failure "can not reboot a non-running instance"
}
...reboot as it is currently done

Either way, the confusion is eliminated.
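
The second option (rejecting the request up front) could look roughly like the following. It is a sketch under assumptions: `InstanceNotRunning` and `check_can_soft_reboot` are hypothetical names, and only the `RUNNING` value is taken from the constants quoted earlier in this report.

```python
RUNNING = 0x01  # value quoted earlier in this report


class InstanceNotRunning(Exception):
    """Hypothetical error for a reboot request against a non-running instance."""


def check_can_soft_reboot(power_state):
    """Raise immediately instead of leaving the instance in task_state rebooting."""
    if power_state != RUNNING:
        raise InstanceNotRunning(
            "cannot reboot a non-running instance (state: %d expected: %d)"
            % (power_state, RUNNING))
```

With a check like this, the caller gets an error straight away and the instance never enters the "rebooting" task state.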

Guangya Liu (Jay Lau) (jay-lau-513) wrote : Re: attempting to reboot a shutdown instance appears to have failed, but then suprisingly succeeds two minutes later

Thanks Phil. The same issue occurs when rebooting a suspended, paused, or crashed VM. I will fix those cases together.

summary: - attempting to reboot a shutdown instance appears to have failed, but
- then suprisingly succeeds two minutes later
+ attempting to reboot a shutdown/suspened/paused/crashed instance appears
+ to have failed, but then surprisingly succeeds two minutes later
Guangya Liu (Jay Lau) (jay-lau-513) wrote : Re: attempting to reboot a shutdown/suspened/crashed instance appears to have failed, but then surprisingly succeeds two minutes later

I will handle the paused VM case in https://bugs.launchpad.net/nova/+bug/1238435

summary: - attempting to reboot a shutdown/suspened/paused/crashed instance appears
- to have failed, but then surprisingly succeeds two minutes later
+ attempting to reboot a shutdown/suspened/crashed instance appears to
+ have failed, but then surprisingly succeeds two minutes later
OpenStack Infra (hudson-openstack) wrote : Fix proposed to nova (master)

Fix proposed to branch: master
Review: https://review.openstack.org/51130

Changed in nova:
status: New → In Progress
Changed in nova:
importance: Undecided → Medium
summary: - attempting to reboot a shutdown/suspened/crashed instance appears to
- have failed, but then surprisingly succeeds two minutes later
+ attempting to reboot a shutdown/suspened/crashed/paused instance appears
+ to have failed, but then surprisingly succeeds two minutes later
Mark McLoughlin (markmc)
tags: added: havana-backport-potential
OpenStack Infra (hudson-openstack) wrote : Fix merged to nova (master)

Reviewed: https://review.openstack.org/51130
Committed: http://github.com/openstack/nova/commit/2392313f562ba6a90ed1ec3fbc507862043fa44f
Submitter: Jenkins
Branch: master

commit 2392313f562ba6a90ed1ec3fbc507862043fa44f
Author: Jay Lau <email address hidden>
Date: Sat Oct 12 13:52:38 2013 +0800

    compute api should throw exception if soft reboot invalid state VM

    When user perform soft reboot to a VM which in suspended/paused/
    stopped/error state, nova compute api should throw exception for
    such state.

    Change-Id: Ic365c6360f6b7407d9de0dac6ff1093484692cf4
    Closes-Bug: #1236930

Changed in nova:
status: In Progress → Fix Committed
Changed in nova:
milestone: none → icehouse-1
Thierry Carrez (ttx)
Changed in nova:
status: Fix Committed → Fix Released
Thierry Carrez (ttx)
Changed in nova:
milestone: icehouse-1 → 2014.1