t1.micro EC2 instances hang on reboot

Bug #634102 reported by Scott Moser
This bug affects 12 people
Affects Status Importance Assigned to Milestone
cloud-init (Ubuntu)
Fix Released
Medium
Scott Moser
Lucid
Fix Released
High
Scott Moser
Maverick
Fix Released
Medium
Scott Moser

Bug Description

Binary package hint: cloud-init

On Amazon's new t1.micro instance type, there is no ephemeral storage at all. If you run an Ubuntu EBS image on instance type t1.micro and reboot, it will not come back up. mountall will wait indefinitely for /dev/sda2, which is never going to be present.

cloud-init is basically hard coded to expect an 'ephemeral0', while other ephemeral devices are more dynamic.

Our images are registered with block-device-mapping indicating ephemeral0, so the metadata service will include ephemeral0 even though there is not one on the instance itself.

We need to do one of 2 things here:
a.) add 'nobootwait' for the ephemeral0 device (/dev/sda2 in this case)
b.) not write an entry in /etc/fstab (or comment it out) if that device is not present on the first boot.
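
Option b.) could be sketched roughly as follows. This is only an illustration, not cloud-init's code: the function name comment_if_missing is hypothetical, and it assumes the fstab entry starts with the device path exactly as written.

```shell
# Hypothetical helper: comment out the fstab entry for a device that is absent.
# Usage: comment_if_missing FSTAB DEVICE
comment_if_missing() {
    fstab=$1 dev=$2
    # -b is true only if the path exists and is a block device
    [ -b "$dev" ] && return 0
    # prefix the uncommented entry with '#', keeping a backup in FSTAB.dist
    sed -i.dist "\,^${dev}[[:space:]],s,^,#," "$fstab"
}
```

For example, `comment_if_missing /etc/fstab /dev/sda2` (run as root) would leave the file untouched on an instance that really has /dev/sda2.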

There are 2 easy workarounds for this:
1.) ssh in after the first boot and paste the following:
[ "$(uname -m)" = "x86_64" ] && ephd=/dev/sdb || ephd=/dev/sda2
sudo sed -i.dist "\,${ephd},s,^,#," /etc/fstab

2.) launch instance with cloud-config metadata containing:
#cloud-config
mounts:
 - [ ephemeral0 ]

### SRU Information BEGIN ####
1. This bug affects anyone who is going to run an ec2 instance of type t1.micro. It is expected that this will be lots of people, especially those evaluating EC2 and/or Ubuntu. The bug is that the system will only boot and be reachable one time. On subsequent boots, the ssh service will not start, leaving a cloud instance completely unreachable. That is because on first boot an entry is written to /etc/fstab for a device that will never be present.
2. The bug is fixed by
 a.) carefully updating existing entries in /etc/fstab to add 'nobootwait'. Only ephemeral devices are modified (either /dev/sda2 or /dev/sdb), and only if they contain 'comment=cloudconfig'.
 b.) on future first-boots, writing 'nobootwait' for the entry.
3. The patch is available at lp:~cloud-init-dev/cloud-init/lucid, in changes seen at http://bazaar.launchpad.net/~cloud-init-dev/cloud-init/lucid/revision/19?remember=15&compare_revid=15
4. To reproduce:
 a.) start ec2 lucid instance of t1.micro
 ec2-run-instances --region us-east-1 --instance-type t1.micro --key mykey ami-1437dd7d
 b.) ssh to instance and reboot
 sudo reboot
 c.) ssh will not come up, leaving the instance basically dead.
5. The opportunity for regression is almost completely contained in the pre-install script, and here it is very small. The only real negative fallout would be adding 'nobootwait' to an entry in /etc/fstab that the user *wanted* to wait on. This is very unlikely.
######### SRU Information END ##############
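
The update described in item 2a above might look something like the sketch below. It is not the actual pre-install script from the branch; add_nobootwait is a hypothetical name, and the matching rules are taken only from the description (device is /dev/sda2 or /dev/sdb, options contain 'comment=cloudconfig').

```shell
# Hypothetical helper: append 'nobootwait' to cloud-init's ephemeral entries.
# Usage: add_nobootwait FSTAB
add_nobootwait() {
    awk '
        # only the two ephemeral devices, only entries cloud-init wrote,
        # and only if nobootwait is not already among the options
        ($1 == "/dev/sda2" || $1 == "/dev/sdb") &&
        $4 ~ /(^|,)comment=cloudconfig(,|$)/ &&
        $4 !~ /(^|,)nobootwait(,|$)/ {
            # assigning a field rejoins the modified line with single spaces
            $4 = $4 ",nobootwait"
        }
        { print }
    ' "$1" > "$1.new" && mv "$1.new" "$1"
}
```

Running it a second time changes nothing, and entries without 'comment=cloudconfig' are passed through as they are.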

ProblemType: Bug
DistroRelease: Ubuntu 10.04
Package: cloud-init 0.5.10-0ubuntu1.2
ProcVersionSignature: User Name 2.6.32-308.15-ec2 2.6.32.15+drm33.5
Uname: Linux 2.6.32-308-ec2 i686
Architecture: i386
Date: Thu Sep 9 14:42:21 2010
Ec2AMI: ami-1234de7b
Ec2AMIManifest: (unknown)
Ec2AvailabilityZone: us-east-1c
Ec2InstanceType: t1.micro
Ec2Kernel: aki-5037dd39
Ec2Ramdisk: unavailable
PackageArchitecture: all
ProcEnviron:
 PATH=(custom, user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: cloud-init

Revision history for this message
Scott Moser (smoser) wrote :
summary: - cloud-init writes ephemeral0 entry in /etc/fstab on t1.micro type
+ t1.micro instances hang on reboot
Scott Moser (smoser)
description: updated
description: updated
Thierry Carrez (ttx)
Changed in cloud-init (Ubuntu Maverick):
assignee: nobody → Scott Moser (smoser)
importance: Undecided → Medium
status: New → Confirmed
Scott Moser (smoser)
Changed in cloud-init (Ubuntu Lucid):
importance: Undecided → High
milestone: none → lucid-updates
status: New → Triaged
description: updated
Revision history for this message
Ben Howard (behoward) wrote : Re: t1.micro instances hang on reboot

This looks like a problem with the block mapping when the AMI is registered.

After firing up a t1.micro instance, the metadata shows:
block-device-mapping:
         ami: /dev/sda1
         ephemeral0: /dev/sda2
         root: /dev/sda1

While one that does have an ephemeral mapping shows:
block-device-mapping:
         ami: /dev/sda1
         root: /dev/sda1

Revision history for this message
Ben Howard (behoward) wrote :

Sorry...the previous comment should read:

After firing up a t1.micro instance with Lucid, the metadata shows:
block-device-mapping:
         ami: /dev/sda1
         ephemeral0: /dev/sda2
         root: /dev/sda1

While one that does not have an ephemeral mapping shows:
block-device-mapping:
         ami: /dev/sda1
         root: /dev/sda1

Scott Moser (smoser)
description: updated
Revision history for this message
Scott Moser (smoser) wrote : Re: [Bug 634102] Re: t1.micro instances hang on reboot

On Thu, 9 Sep 2010, Ben Howard wrote:

> This looks like a problem with the block mapping when the AMI is
> registered.

Well, yes, sort of.
All Ubuntu images are registered with '--block-device-mapping' for the
smallest type of the given arch. See line 391 at [1]. I recently bumped
into this and added a comment to that effect at [2]. I had actually
considered beginning to register images with 'ephemeral0' through
'ephemeral4', so that by default, the user would at least get all the
ephemeral storage they could potentially have.

> After firing up a t1.micro instance, the metadata shows:
> block-device-mapping:
> ami: /dev/sda1
> ephemeral0: /dev/sda2
> root: /dev/sda1

So, yeah, this is the first instance type where this assumption is now
wrong, and the 'ephemeral0' device as reported in the metadata service won't be
there. The fix that I've added is just to add 'nobootwait' to the entry
that is added in /etc/fstab. That way, reboot won't hang, and if the user
restarted in a larger instance of the given type, they'd get some
ephemeral data.

--
[1] http://bazaar.launchpad.net/~ubuntu-on-ec2/ubuntu-on-ec2/ec2-publishing-scripts/annotate/head%3A/ec2-image2ebs
[2] http://bazaar.launchpad.net/~ubuntu-on-ec2/ubuntu-on-ec2/ec2-publishing-scripts/revision/239
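
The mismatch described above (metadata lists an ephemeral device the instance does not have) can be checked with a small sketch like this one. It assumes the mapping is available as text in the "name: device" form quoted in this thread; missing_ephemerals is a hypothetical name.

```shell
# Hypothetical helper: list ephemeral mappings whose device node is absent.
# Usage: missing_ephemerals < mapping.txt
missing_ephemerals() {
    # keep only "ephemeralN: /dev/..." lines and turn them into "name device"
    sed -n 's/^[[:space:]]*\(ephemeral[0-9][0-9]*\):[[:space:]]*\(.*\)$/\1 \2/p' |
    while read -r name dev; do
        # report the mapping if no block device exists at that path
        [ -b "$dev" ] || echo "$name $dev"
    done
}
```

On a t1.micro with the mapping shown earlier, this would be expected to print the ephemeral0 line, and nothing on an instance type that really has the device.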

Scott Moser (smoser)
description: updated
Thierry Carrez (ttx)
tags: added: server-mrs
Thierry Carrez (ttx)
Changed in cloud-init (Ubuntu Maverick):
milestone: none → ubuntu-10.10
Scott Moser (smoser)
summary: - t1.micro instances hang on reboot
+ t1.micro EC2 instances hang on reboot
description: updated
Revision history for this message
Scott Moser (smoser) wrote :

This is fixed in 0.5.15-0ubuntu1. cloud-init on install will now fix /dev/sda2 or /dev/sdb entries in /etc/fstab to mark them 'nobootwait'. Additionally, cloud-init will *write* entries for ephemeral0 with 'nobootwait'.

Changed in cloud-init (Ubuntu Maverick):
status: Confirmed → Fix Released
Revision history for this message
Martin Pitt (pitti) wrote : Please test proposed package

Accepted cloud-init into lucid-proposed, the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you in advance!

Changed in cloud-init (Ubuntu Lucid):
status: Triaged → Fix Committed
tags: added: verification-needed
Revision history for this message
Scott Moser (smoser) wrote :

I've verified this on both
us-east-1 ami-1234de7b ebs/ubuntu-lucid-10.04-i386-server-20100827
us-east-1 ami-1634de7f ebs/ubuntu-lucid-10.04-amd64-server-20100827

- launch instance
- ssh instance
- verify that '/etc/fstab' has an entry for "/mnt" whose device does not exist
 i386:
 | $ awk '$2 == "/mnt" { print $0 }' /etc/fstab
 | /dev/sda2 /mnt auto defaults,comment=cloudconfig 0 0
 | $ [ -b /dev/sda2 ] || echo "no"
 | no
 amd64:
 | $ awk '$2 == "/mnt" { print $0 }' /etc/fstab
 | /dev/sdb /mnt auto defaults,comment=cloudconfig 0 0
 | $ [ -b /dev/sdb ] || echo "no"
 | no

- enable proposed, update, install cloud-init
 | $ l="deb http://archive.ubuntu.com/ubuntu lucid-proposed main"
 | $ echo "$l" | sudo tee -a /etc/apt/sources.list
 | $ sudo apt-get update && sudo apt-get install cloud-init
 | $ dpkg-query --show cloud-init
 | cloud-init 0.5.10-0ubuntu1.3

- Now, verify that the installation of the package has fixed /etc/fstab
  Notice that 'nobootwait' has been added. In dpkg output, you will
  also see a message like:
    making ephemeral /dev/sda2 in /etc/fstab nobootwait (LP: #634102)
 i386
 | $ awk '$2 == "/mnt" { print $0 }' /etc/fstab
 | /dev/sda2 /mnt auto defaults,comment=cloudconfig,nobootwait 0 0
 amd64
 | $ awk '$2 == "/mnt" { print $0 }' /etc/fstab
 | /dev/sdb /mnt auto defaults,comment=cloudconfig,nobootwait 0 0

- Verify 'reboot', and that you can ssh back in

- Now, verify that cloud-init would write 'nobootwait' on first boot, by
  removing all of /var/lib/cloud so cloud-init thinks this is a first boot
 | $ sudo rm -Rf /var/lib/cloud && reboot

- ssh back in and check /etc/fstab
 i386
 | $ awk '$2 == "/mnt" { print $0 }' /etc/fstab
 | /dev/sda2 /mnt auto defaults,nobootwait,comment=cloudconfig 0 0
 amd64
 | $ awk '$2 == "/mnt" { print $0 }' /etc/fstab
 | /dev/sdb /mnt auto defaults,nobootwait,comment=cloudconfig 0 0

- reboot to test *that* written /etc/fstab, and connect again
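
The fstab checks in the steps above could also be scripted. The following is a sketch only, with a hypothetical function name, and it treats any uncommented entry carrying 'comment=cloudconfig' as one written by cloud-init.

```shell
# Hypothetical check: fail if a cloud-init entry lacks 'nobootwait'.
# Usage: check_nobootwait FSTAB
check_nobootwait() {
    awk '
        $1 !~ /^#/ &&
        $4 ~ /(^|,)comment=cloudconfig(,|$)/ &&
        $4 !~ /(^|,)nobootwait(,|$)/ {
            print "missing nobootwait: " $1 " " $2
            bad = 1
        }
        # exit status 1 if any offending entry was seen, otherwise 0
        END { exit bad + 0 }
    ' "$1"
}
```

Something like `check_nobootwait /etc/fstab && sudo reboot` would then refuse to reboot an instance that still has an unfixed entry.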

tags: added: verification-done
removed: verification-needed
Revision history for this message
Jeff Bauer (jbauer) wrote :

I've verified this works on both:
us-east-1 ami-1234de7b ebs/ubuntu-lucid-10.04-i386-server-20100827
us-east-1 ami-1634de7f ebs/ubuntu-lucid-10.04-amd64-server-20100827

One minor edit to your verification process:
< $ sudo rm -Rf /var/lib/cloud && reboot
> $ sudo rm -Rf /var/lib/cloud && sudo reboot

Changed in cloud-init (Ubuntu Lucid):
status: Fix Committed → Fix Released
status: Fix Released → Fix Committed
Scott Moser (smoser)
Changed in cloud-init (Ubuntu Lucid):
assignee: nobody → Scott Moser (smoser)
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cloud-init - 0.5.10-0ubuntu1.3

---------------
cloud-init (0.5.10-0ubuntu1.3) lucid-proposed; urgency=low

  * fix hang on reboot of ec2's t1.micro (LP: #634102)
 -- Scott Moser <email address hidden> Thu, 09 Sep 2010 12:53:32 -0400

Changed in cloud-init (Ubuntu Lucid):
status: Fix Committed → Fix Released
Revision history for this message
Ray (ray-0711) wrote :

Apparently the fix hasn't made it into the daily releases of Natty on EC2.

This is especially troublesome for someone trying to boot an HVM instance with CUDA enabled, because a reboot is needed after installation of the NVIDIA dev drivers.

Here is a one-liner for a quick fix:
sudo perl -pi -e 's/(nobootwait),(\S+)/$2,$1/' /etc/fstab

Revision history for this message
Scott Moser (smoser) wrote :

@Ray,
  It seems to have been fixed for me. I suspect that you're either not running the AMI you think you are, or some process of your own is adding entries to /etc/fstab.

# us-east-1 ami-1cad5275 hvm/ubuntu-natty-11.04-amd64-server-20110426
$ ec2metadata --instance-type
cc1.4xlarge
$ ec2metadata --ami-id
ami-1cad5275
$ grep -v "^#" /etc/fstab
proc /proc proc nodev,noexec,nosuid 0 0
LABEL=uec-rootfs / ext4 defaults 0 0
/dev/xvdb /mnt auto defaults,nobootwait,comment=cloudconfig 0 2

# us-east-1 ami-4d448424 hvm/ubuntu-natty-daily-amd64-server-20110829
$ ec2metadata --ami-id
ami-4d448424
$ ec2metadata --instance-type
cc1.4xlarge
$ grep -v "^#" /etc/fstab
proc /proc proc nodev,noexec,nosuid 0 0
LABEL=cloudimg-rootfs / ext4 defaults 0 0
/dev/xvdb /mnt auto defaults,nobootwait,comment=cloudconfig 0 2

I also verified 'reboot' was functional on cc1.4xlarge both with no '--block-device-mapping' arguments and with:
 --block-device-mapping /dev/sdb=ephemeral0 --block-device-mapping /dev/sdc=ephemeral1

Revision history for this message
Ray (ray-0711) wrote :

@Scott:

Thanks for testing. I think I was a bit too fast on this one. I think you are right. They are rebooting, but all of my HVM instances took a very long time to reboot yesterday: >>10 min. So I thought they were stuck and I terminated them.

I will do some more testing on this and report back if there are any reproducible problems.
