instance hangs at grub prompt after reboot followed by euca-reboot-instances

Bug #1035279 reported by Louis Bouchard on 2012-08-10
This bug affects 1 person
Affects           Importance   Assigned to
Ubuntu on EC2     Undecided    Unassigned
grub2 (Ubuntu)    High         Unassigned
  Oneiric         High         Unassigned
  Precise         High         Scott Moser

Bug Description

This issue has been reproduced on Diablo and Essex so far.

When running "sudo reboot" in an instance, shortly followed by "euca-reboot-instances {instanceID}", the instance comes back up but is no longer accessible interactively. The kernel is still executing commands, but it is stuck in the early boot phases.

If only one of the two is used ("sudo reboot" from within the instance, or "euca-reboot-instances"), network connectivity comes back as expected.

This can be reproduced with the following steps:

# On the compute node
sudo -s
source creds/novarc
# start instance
instanceid=`euca-run-instances -k novaadmin -t m1.tiny ami-00000006 | grep INSTANCE | awk '{print $2}'`
# wait for instance to start
sleep 60
# reboot the instance with reboot command
ssh -i creds/novaadmin_.key ubuntu@`euca-describe-instances $instanceid | grep INSTANCE | awk '{print $4}'` 'sudo reboot'
# wait 20 seconds
sleep 20
# reboot instance with euca-reboot-instances command
euca-reboot-instances $instanceid

Now euca-describe-instances will show the instance as running, but it is unreachable via ssh. The kvm process for the instance is still visible and running.
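
To tell a hung instance from one that is merely slow to boot, ssh can be polled with a deadline rather than tried once. This is a minimal sketch in POSIX shell: `wait_until` is a hypothetical helper (not part of euca2ools or nova), and the 120 second deadline in the usage comment is an arbitrary assumption.

```shell
# wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
# Hypothetical helper: re-run COMMAND every INTERVAL seconds until it
# succeeds (return 0) or roughly TIMEOUT seconds of sleeping have
# accumulated (return 1). Time spent inside COMMAND is not counted.
wait_until() {
    timeout_s=$1
    interval_s=$2
    shift 2
    elapsed=0
    while ! "$@"; do
        elapsed=$((elapsed + interval_s))
        if [ "$elapsed" -gt "$timeout_s" ]; then
            return 1
        fi
        sleep "$interval_s"
    done
    return 0
}

# Usage (not run here): after the second reboot, give the instance two
# minutes to answer ssh before declaring it hung.
#   ip=`euca-describe-instances $instanceid | grep INSTANCE | awk '{print $4}'`
#   wait_until 120 5 ssh -o ConnectTimeout=5 -i creds/novaadmin_.key ubuntu@$ip true \
#       || echo "instance $instanceid did not come back"
```

The fixed "sleep 20" between the two reboots is best left alone, since the timing of the second reboot is what triggers the problem.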

Related bugs:
  * bug 872244: grub2 recordfail logic prevents headless system from rebooting after power outage
  * bug 669481: Timeout should not be -1 if $recordfail


Louis Bouchard (louis) wrote :

Log files captured during the "sudo reboot" + "euca-reboot-instances" sequence

Louis Bouchard (louis) on 2012-08-16
summary: - network to instance lost after reboot followed by euca-reboot-instances
+ interactive access to instance lost after reboot followed by euca-
+ reboot-instances
description: updated

After further investigation, it turns out that more than network access is lost: all interactive access is lost. The instance is indeed restarted, but it blocks in the early boot phases, while running the initscripts.

Here is how to better reproduce the problem:

1) Start an instance normally and ssh to it
2) Modify the /etc/default/grub file like this:

GRUB_CMDLINE_LINUX_DEFAULT="console=ttyS0 earlyprintk=ttyS0,keep debug=vc"

3) Reboot the instance
 $ sudo reboot

4) ssh to the instance and run 'sudo reboot' closely followed by 'euca-reboot-instances ${instanceid}'

Depending on how the instance fails, the last bits of boot data might be reported by 'euca-get-console-output ${instanceid}' or found in /var/lib/nova/instances/${instanceid}/console.ring. Here are a few examples:

Test 1:

+ [ -f /scripts/init-top/ORDER ]^M
+ . /scripts/init-top/ORDER^M
+ /scripts/init-top/all_generic_ide^M
+ [ -e /conf/param.conf ]^M
+ /scripts/init-top/blacklist^M
+ [ -e /conf/param.conf ]^M
+ /scripts/init-top/udev^M
+ [ -e /conf/param.conf ]^M
+ maybe_break modules^M
+ egrep -q (,|^)modules(,|$)^M
+ echo ^M
[ 0.976683] udevd[80]: starting version 175^M
[ 0.976683] udevd[80]: starting version 175^M
+ [ n != y ]^M
+ log_begin_msg Loading essential drivers^M
+ _log_msg Begin: Loading essential drivers ... ^M
+ [ n = y ]^M
+ printf Begin: Loading essential drivers ... ^M
Begin: Loading essential drivers ... + load_modules^M
+ [ -e /conf/modules ]^M
+ [ n != y ]^M
+ log_end_msg^M
+ _log_msg done.\n^M
+ [ n = y ]^M
+ printf done.\n^M
done.^M
+ [ -n ]^M
+ maybe_break premount^M
+ echo ^M
+ egrep -q (,|^)premount(,|$)^M
+ [ n != y ]^M
+ log_begin_msg Running /scripts/init-premount^M
+ _log_msg Beg

Test 2:

+ egrep -q (,|^)mount(,|$)^M
+ echo ^M
+ log_begin_msg Mounting root file system^M
+ _log_msg Begin: Mounting root file system ... ^M
+ [ n = y ]^M
+ printf Begin: Mounting root file system ... ^M
Begin: Mounting root file system ... + . /scripts/local^M
+ parse_numeric /dev/disk/by-uuid/d26039a8-0240-4301-b38d-fd1aceedac9f^M
+ return^M
+ maybe_break mountroot^M
+ echo ^M
+ egrep -q (,|^)mountroot(,|$)^M
+ mountroot^M
+ pre_mountroot^M
+ [ n != y ]^M
+ log_begin_msg Running /scripts/local-top^M
+ _log_msg Begin: Running /scripts/local-top ... ^M
+ [ n = y ]^M
+ printf Begin: Running /scripts/local-top ... ^M
Begin: Running /scripts/local-top ... + run_scripts /scripts/local-top^M
+ initdir=/scripts/local-top^M
+ [ ! -d /scripts/local-top ]^M
+ return^M
+ [ n != y ]^M
+ log_end_msg^M
+ _log_msg done.\n^M
+ [ n = y ]^M
+ printf done.\n^M
done.^M
+ [ /disk/by-uuid/d26039a8-0240-4301-b38d-fd1aceedac9f = /dev/disk/by-uuid/d26039a8-0240-4301-b38d-fd1aceedac9f ]^M
+ [ -z ]^M
+ wait-for-root /dev/disk/by-uuid/d26039a8-0240-4301-b38d-fd1aceedac9f 30^M

Test 3:

+ egrep -q (,|^)mountroot(,|$)^M
+ mountroot^M
+ pre_mountroot^M
+ [ n != y ]^M
+ log_begin_msg Running /scripts/local-top^M
+ _log_msg Begin: Running /scripts/local-top ... ^M
+ [ n = y ]^M
+ printf Begin: Running /scripts/local-top ... ^M
Begin: Running /scripts/local-top ... + run_scripts /scripts/local-top^M
+ initdir=/scripts/local-top^M
+ [ ! -d /scripts/local-top ]^M
+ return^...


Scott Moser (smoser) wrote :

Hi Louis,
  I actually hit this bug on an openstack instance and it occurred to me what was happening. It's a fairly severe bug in our images, and I think it would be good to get the fix back to 12.04 at a minimum.

The issue occurs when the following sequence of events happens:
 * reboot issued
 * system cleanly shuts down
 * bios comes up, loads grub
 * grub does the 5 second countdown (see GRUB_TIMEOUT in
   /etc/default/grub. The build scripts at [1] set this to 0 in quantal)
 * grub writes that it started the kernel load to the grub environment
   file /boot/grub/grubenv
 * kernel starts running, and boot occurs
 * HARD REBOOT HERE, before the next step has run
 * /etc/init.d/grub-common marks a "clean boot"
   (this is run at S99 via rc.sysvinit, very late in boot)
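
Whether an instance was left in this state can be checked from the grub environment block. This is a minimal sketch, assuming the usual text layout of /boot/grub/grubenv (a header line, then name=value lines, then '#' padding) and assuming the failed-boot flag is stored as recordfail=1. `has_recordfail` is a hypothetical helper that reads the file directly rather than going through grub tools.

```shell
# has_recordfail FILE
# Hypothetical helper: succeed if the grub environment block in FILE
# carries recordfail=1, i.e. grub recorded a boot attempt that was never
# marked clean afterwards.
has_recordfail() {
    grep -q '^recordfail=1$' "$1"
}

# Usage (not run here), on the guest or on a mounted copy of its disk:
#   if has_recordfail /boot/grub/grubenv; then
#       echo "next boot will take the recordfail path"
#   fi
```

On a running system, 'grub-editenv /boot/grub/grubenv list' should print the same variables.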

This is bug 872244, which is fixed in the cloud images in quantal.
Search for GRUB_RECORDFAIL_TIMEOUT in [1].
Also note that this variable was added to grub at [2].

[1] http://bazaar.launchpad.net/~ubuntu-on-ec2/vmbuilder/automated-ec2-builds/view/head:/vmbuilder-cloudimg-fixes
[2] https://code.launchpad.net/~utlemming/ubuntu/quantal/grub2/param-recordfail-timeout/+merge/107243

Scott Moser (smoser) wrote :

Oh yeah, and as to what was happening in the kvm instance at the time?
It was sitting waiting for someone to hit 'enter'.
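
Why grub sits waiting can be modelled with a small sketch. It assumes, going by the title of bug 669481, that the generated grub.cfg uses a timeout of -1 (wait forever) when recordfail is set, and that GRUB_RECORDFAIL_TIMEOUT replaces that value once the fix is in place. `effective_timeout` is a hypothetical function for illustration only, not code taken from grub.

```shell
# effective_timeout RECORDFAIL NORMAL_TIMEOUT [RECORDFAIL_TIMEOUT]
# Hypothetical model of the menu timeout grub ends up with.
#   RECORDFAIL          1 if the previous boot was never marked clean
#   NORMAL_TIMEOUT      the GRUB_TIMEOUT value used on a normal boot
#   RECORDFAIL_TIMEOUT  the GRUB_RECORDFAIL_TIMEOUT value; assumed to fall
#                       back to -1 (wait forever) when unset or empty
effective_timeout() {
    recordfail=$1
    normal_timeout=$2
    recordfail_timeout=${3:--1}
    if [ "$recordfail" = 1 ]; then
        echo "$recordfail_timeout"
    else
        echo "$normal_timeout"
    fi
}

effective_timeout 0 5      # clean previous boot: normal countdown
effective_timeout 1 5      # interrupted boot, no override: wait forever
effective_timeout 1 5 0    # interrupted boot, override of 0: boot at once
```

Under these assumptions, the second call is the situation in this bug: a headless instance has nobody to press a key, so it never gets past the menu.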

Scott Moser (smoser) wrote :

Just a note, I was able to recreate this entirely outside of openstack, using the process described at
https://gist.github.com/3382629 . Then, I just did:
  ssh ubuntu@$IPADDR "sudo reboot" && sleep 12 && virsh reset my-dom

Louis Bouchard (louis) wrote :

@smoser

Looks like this bug should be reassigned to something other than nova? And I suppose that it'll warrant an SRU to get the fix into Precise?

Louis Bouchard (louis) wrote :

Furthermore, this looks like the commit that needs to be backported to Precise

Louis Bouchard (louis) wrote :
Changed in nova:
status: New → In Progress
assignee: nobody → Louis Bouchard (louis-bouchard)
Scott Moser (smoser) on 2012-08-20
tags: added: cloud-images cloud-images-build
description: updated
Changed in nova:
status: In Progress → Invalid
Scott Moser (smoser) wrote :

I'm marking this "In progress" based on the intent to backport to 12.04.
The fix for this bug is already present in the cloud images in 12.10.

description: updated
Changed in ubuntu:
status: New → Triaged
status: Triaged → Fix Released
status: Fix Released → In Progress
Scott Moser (smoser) on 2012-08-20
summary: - interactive access to instance lost after reboot followed by euca-
- reboot-instances
+ instance hangs at grub prompt after reboot followed by euca-reboot-
+ instances
Scott Moser (smoser) on 2012-08-20
Changed in ubuntu:
status: In Progress → Fix Released
importance: Undecided → High
assignee: nobody → Ben Howard (utlemming)
Scott Moser (smoser) wrote :

The builds of 12.04 cloud-images are now adding 'GRUB_RECORDFAIL_TIMEOUT=0' to /etc/default/grub [1].
However, at the moment this will have no effect. For this problem to be completely fixed, we'll also need:
 a.) bug 669481 backported to 12.04, and through -proposed
 b.) a new official cloud image released

--
[1] http://bazaar.launchpad.net/~ubuntu-on-ec2/vmbuilder/automated-ec2-builds/revision/499
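
On an instance built before this change, the same setting could be applied by hand. This is a minimal sketch: `set_recordfail_timeout` is a hypothetical helper, not the code used by the build scripts, and it assumes the defaults file ends with a newline.

```shell
# set_recordfail_timeout FILE VALUE
# Hypothetical helper: set GRUB_RECORDFAIL_TIMEOUT=VALUE in a grub defaults
# file, replacing an existing assignment or appending one if none exists.
set_recordfail_timeout() {
    file=$1
    value=$2
    if grep -q '^GRUB_RECORDFAIL_TIMEOUT=' "$file"; then
        tmp=$(mktemp) || return 1
        # Rewrite through a temporary file so FILE keeps its owner and mode.
        sed "s/^GRUB_RECORDFAIL_TIMEOUT=.*/GRUB_RECORDFAIL_TIMEOUT=$value/" "$file" > "$tmp" \
            && cat "$tmp" > "$file"
        status=$?
        rm -f "$tmp"
        return "$status"
    else
        echo "GRUB_RECORDFAIL_TIMEOUT=$value" >> "$file"
    fi
}

# Usage (not run here; needs root):
#   set_recordfail_timeout /etc/default/grub 0
#   update-grub
```

The update-grub step is needed so that the setting reaches the generated grub.cfg. As noted above, the variable is only honoured by a grub2 package that carries the fix for bug 669481.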

Louis Bouchard (louis) on 2012-08-21
Changed in nova:
assignee: Louis Bouchard (louis-bouchard) → nobody
Louis Bouchard (louis) wrote :

FYI,

the SRU for bug 669481 has now been accepted and the package built into -proposed.

I tested the new package with the process described in comment #8 and it does indeed fix this issue.

I suppose that now we need to wait for it to appear in -updates.

Scott Moser (smoser) wrote :

I've added oneiric to the list of images that will get this tweak in build scripts at revno 526.
http://bazaar.launchpad.net/~ubuntu-on-ec2/vmbuilder/automated-ec2-builds/revision/526

Scott Moser (smoser) wrote :

This fix is present in 12.04 released images with serial 20121026.1 or later (they contain grub at 1.99-21ubuntu3.4).
Marking fix-released.
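
To check a given image against that serial, the two serials can be compared numerically. This is a minimal sketch, assuming a serial has the form YYYYMMDD optionally followed by ".N", and treating a missing ".N" as 0. `serial_ge` is a hypothetical helper and does not validate its input.

```shell
# serial_ge A B
# Hypothetical helper: succeed if image serial A is the same as or later
# than serial B. Serials are assumed to look like 20121026 or 20121026.1.
serial_ge() {
    a_date=${1%%.*}
    b_date=${2%%.*}
    case $1 in *.*) a_rev=${1#*.} ;; *) a_rev=0 ;; esac
    case $2 in *.*) b_rev=${2#*.} ;; *) b_rev=0 ;; esac
    if [ "$a_date" -ne "$b_date" ]; then
        [ "$a_date" -gt "$b_date" ]
    else
        [ "$a_rev" -ge "$b_rev" ]
    fi
}

# Usage (not run here), with the serial of the image under test in $serial:
#   serial_ge "$serial" 20121026.1 && echo "image should contain the fix"
```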

Scott Moser (smoser) wrote :

Marking fix-released in oneiric based on my comment 15, and the fact that oneiric is now EOL.

affects: nova → ubuntu-on-ec2
Changed in ubuntu-on-ec2:
status: Invalid → Fix Released
affects: ubuntu → grub2 (Ubuntu)