cloud-init timeout waiting for metadata service on EC2

Bug #745930 reported by Scott Moser on 2011-03-30
This bug affects 3 people
Affects: cloud-init (Ubuntu)
Importance: Low
Assigned to: Unassigned

Bug Description

Binary package hint: cloud-init

In testing for natty beta-1, we hit a timeout waiting for the metadata service.

This occurred when 'cloud-init' ran, meaning that a network interface was already up.

This occurred on m1.large in us-west-1.

ProblemType: Bug
DistroRelease: Ubuntu 11.04
Package: cloud-init 0.6.1-0ubuntu4
ProcVersionSignature: User Name 2.6.38-7.39-virtual 2.6.38
Uname: Linux 2.6.38-7-virtual x86_64
Architecture: amd64
Date: Wed Mar 30 17:56:22 2011
Ec2AMI: ami-bb3261fe
Ec2AMIManifest: ubuntu-images-testing-us-west-1/ubuntu-natty-daily-amd64-server-20110329.manifest.xml
Ec2AvailabilityZone: us-west-1c
Ec2InstanceType: m1.large
Ec2Kernel: aki-9ba0f1de
Ec2Ramdisk: unavailable
ProcEnviron:
 PATH=(custom, user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: cloud-init
UpgradeStatus: No upgrade log present (probably fresh install)

Scott Moser (smoser) wrote :
tags: added: iso-testing
Dave Walker (davewalker) wrote :

Scott, was networking up at that stage, and was it possible to connect to it externally?

Have you only seen this once?

Thanks.

Changed in cloud-init (Ubuntu):
status: New → Incomplete
Scott Moser (smoser) on 2011-03-30
description: updated
Scott Moser (smoser) wrote :

I had seen this recently on a lucid instance, where it caused a publish failure from the daily build scripts.
That occurred on a m1.large of:
  us-west-1 ami-f5bfefb0 canonical ebs/ubuntu-lucid-10.04-amd64-server-20110201.1

Scott Moser (smoser) wrote :

This only occurred once in today's beta-1 testing. I have seen it on very rare occasions, as noted above.

I started digging a bit through my build logs. The daily build process launches an m1.large instance in each region to facilitate populating EBS-root instances. On 2011-03-03, I updated the utility instances that we use to the latest lucid refreshed version (ebs/ubuntu-lucid-10.04-amd64-server-20110201.1), so this first set of numbers covers only the period since that date.

- 66 build logs (17 failed for one reason or another)
- 298 instances launched from 2011-03-03 to 2011-03-30.
- 9 of 298 instances could not be reached via ssh
- 3 of those 9 instances were due to timeout on the metadata service; these were:
  2011-03-26 us-west-1 maverick-desktop-uec-20110326
  2011-03-28 us-west-1 natty-server-uec-20110328
  2011-03-29 us-west-1 lucid-server-uec-20110329

So: 298 instances launched, 3 failures on the metadata service. All of those were in us-west-1 and occurred in the last 5 days.

Going back further, through all the logs I have since ~ 2010-03-16:
- 704 build logs
- 2620 instances launched
- 28 failed to have elastic IP associated and be reached via ssh.
- 4 TOTAL have messages in console suggesting time out on metadata service
  2011-03-01 us-east-1 lucid-server-uec-20110301
  2011-03-26 us-west-1 maverick-desktop-uec-20110326
  2011-03-28 us-west-1 natty-server-uec-20110328
  2011-03-29 us-west-1 lucid-server-uec-20110329
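For scale, the failure rates implied by the counts above work out to roughly 1% for the recent us-west-1 window and well under 1% overall:

```python
# Failure rates implied by the counts quoted above.
recent = 3 / 298    # metadata timeouts since 2011-03-03
overall = 4 / 2620  # metadata timeouts across all retained logs

print(f"recent:  {recent:.2%}")   # ~1.01%
print(f"overall: {overall:.2%}")  # ~0.15%
```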

Changed in cloud-init (Ubuntu):
status: Incomplete → New
importance: Undecided → Low
Eric Hammond (esh) wrote :

In my experience in 2008-2009, just because networking is up on an EC2 instance does not mean that 169.254.169.254 is going to accept connections and requests for meta-data and user-data. You need to wait for this to become available. I had code to do this in Ubuntu AMIs I built back then.

    perl -MIO::Socket::INET -e 'until(new IO::Socket::INET("169.254.169.254:80")){sleep 1}'

I used to think I was waiting for networking to come up, but it became clear that the meta-data service was sometimes not listening even when networking was working and I had a local IP address on the instance.
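The Perl one-liner above can be sketched in Python as well (an illustration, not cloud-init's actual code; the host and port defaults assume the EC2 link-local metadata address):

```python
import socket
import time

def wait_for_metadata(host="169.254.169.254", port=80, timeout=2.0):
    """Block until a TCP connection to host:port succeeds.

    Equivalent in spirit to the Perl loop: retry once per second
    until the metadata service accepts connections.
    """
    while True:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return
        except OSError:
            time.sleep(1)
```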

Eric Hammond (esh) wrote :

On second glance, cloud-init may actually be retrying the connection, and the problem is that sometimes the service takes longer to come up than the current number of retries allows.

Scott Moser (smoser) wrote :

Well, if you look at the attached console logs, you can see that cloud-init is retrying. In lucid, we retried 100 times with an increasing back-off, resulting in 17 minutes or so of trying. The build scripts gave up after 5 minutes.
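The lucid back-off can be sketched as follows (the schedule here is an assumption: sleep time growing by one second every five attempts, which matches the "17 minutes or so" figure):

```python
# Sketch of an increasing back-off schedule (assumed, not copied from
# cloud-init): attempts 0-4 sleep 1s, 5-9 sleep 2s, and so on.
def total_backoff_seconds(retries=100):
    return sum(attempt // 5 + 1 for attempt in range(retries))

print(total_backoff_seconds())  # 1050 seconds, about 17.5 minutes
```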

In natty (both to reduce the pain if cloud-init was installed outside of EC2, and because I had an extreme amount of data -- 2000+ instances showing that waiting did not help), I reduced that timeout significantly, but made it configurable in /etc/cloud/cloud.cfg.
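A hypothetical /etc/cloud/cloud.cfg fragment for tuning this; the key names below follow later cloud-init releases and may differ in the natty package, so check the shipped file:

```yaml
# Hypothetical cloud.cfg fragment (key names assumed from later releases)
datasource:
  Ec2:
    max_wait: 120   # total seconds to wait for the metadata service
    timeout: 50     # seconds allowed per individual request
```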

From the console logs attached:
-- lucid --
  cloud-init running: Sun, 27 Mar 2011 10:14:09 +0000. up 10.57 seconds
  waiting for metadata service at http://169.254.169.254/2009-04-04/meta-data/instance-id
  10:14:11 [ 1/100]: url error [timed out]
  ...
  10:16:56 [31/100]: url error [timed out]

-- natty --
  cloud-init start running: Wed, 30 Mar 2011 17:01:16 +0000. up 10.08 seconds
  2011-03-30 17:01:18,555 - DataSourceEc2.py[WARNING]: waiting for metadata service at http://169.254.169.254/2009-04-04/meta-data/instance-id
  2011-03-30 17:01:18,555 - DataSourceEc2.py[WARNING]: 17:01:18 [ 1/30]: url error [timed out]
  ...
  2011-03-30 17:03:55,735 - DataSourceEc2.py[WARNING]: 17:03:55 [30/30]: url error [timed out]
  2011-03-30 17:04:01,742 - DataSourceEc2.py[CRITICAL]: giving up on md after 165 seconds

So, at the point at which the build publishing scripts gave up, cloud-init had been retrying (after the network had already come up) for 2 minutes and 35 seconds. In the natty test, it gave up after 2 minutes 45 seconds of trying.
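A quick sanity check of the intervals in the quoted logs (assuming the timestamps are as excerpted):

```python
# Elapsed seconds between timestamps quoted from the console logs above.
from datetime import datetime

def elapsed(start, end, fmt="%H:%M:%S"):
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).seconds

print(elapsed("10:14:11", "10:16:56"))  # lucid: 165 s from attempt 1 to 31
print(elapsed("17:01:16", "17:04:01"))  # natty: 165 s from start to giving up
```

Both runs spent the same 2 minutes 45 seconds retrying, which matches natty's "giving up on md after 165 seconds" message.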

If the metadata service isn't up 3 minutes after the instance is booted, then there is a bug in the platform; expecting an instance to wait 3 minutes is not acceptable.

Dave Walker (davewalker) on 2011-04-13
tags: added: server-nrs
Dave Walker (davewalker) on 2011-04-15
Changed in cloud-init (Ubuntu):
milestone: none → ubuntu-11.04
Scott Moser (smoser) on 2011-04-18
Changed in cloud-init (Ubuntu):
milestone: ubuntu-11.04 → none
Dave Walker (davewalker) on 2011-04-19
tags: removed: server-nrs
Scott Moser (smoser) wrote :

I'm marking this WONT FIX, as I do truly believe it is a platform issue, not anything that cloud-init could recover from.

Changed in cloud-init (Ubuntu):
status: New → Won't Fix
mumcowpig (a2907395) wrote :

Don't ask me why I installed the aptitude "task" group uec on my Ubuntu 10.04.2 server edition VirtualBox image. I wanted a minimal, streamlined Ubuntu virtual machine for developing for the Heroku cedar stack, which is based on Ubuntu 10.04 and ruby1.9.2.

My VirtualBox host machine is running the linux-server (PAE) kernel with the DKMS-compiled virtualbox-ose kernel module loaded.

Almost every boot/service hang can be addressed by holding Shift on boot and choosing the (single-user) recovery kernel/initramfs image in GRUB/LILO. From there, I can choose the command prompt with networking and uninstall services like cloud-init. Then I can reboot.

Because of this design problem, cloud-init by default runs before the recovery menu appears. At a minimum, I wish cloud-init would drop to the recovery console after 100 attempts so I could fix this.

Is there perhaps a key sequence or kernel parameter that I could add to the recovery GRUB boot menu that would allow me to disable this service on a one-shot basis?
