cloud-init job sometimes not running in cloud images without ramdisk

Bug #531494 reported by Scott Moser on 2010-03-03
Affects   Status   Importance   Assigned to   Milestone
Ubuntu             High         Unassigned
Lucid              High         Unassigned

Bug Description

Binary package hint: upstart

Somewhere between uec image builds 20100224 (beta1) and 20100228, the cloud-init job stopped running.

cloud-init job has:
  start on (mounted MOUNTPOINT=/ and net-device-up IFACE=eth0)
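(A minimal job with the same condition that does nothing but write a message is a handy way to check whether the condition ever fires; the following is only a sketch, the job name is arbitrary, and it assumes stock upstart 0.6 job syntax:)

  # /etc/init/debug-cloudinit-cond.conf (hypothetical diagnostic job)
  start on (mounted MOUNTPOINT=/ and net-device-up IFACE=eth0)
  console output
  exec echo "cloud-init start condition fired"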

I'll attach the complete diff between the 2 manifests (it is actually fairly large), but the likely candidates are:
-cloud-init 0.5.7-0ubuntu2
+cloud-init 0.5.7-0ubuntu4
-mountall 2.5
+mountall 2.6
-upstart 0.6.5-3
+upstart 0.6.5-4

The cloud-init changes are packaging only.

Two other things to note here:
a.) booting with a ramdisk "fixes" the problem
b.) instances boot fine on ec2 without ramdisk

The instance boots, but it seems that sshd is not running (attempts to ssh to it receive connection refused).
There is no useful information on the serial console; nothing after the kernel's messages.

So far, I've been unable to reproduce this (and get console output) when either
 - mountall --debug is used
 - debug initctl job in place ('start on startup' ... 'exec initctl log-priority info')
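(Written out as a job file, that debug job would look roughly like the following sketch; the file name is hypothetical and only the two quoted stanzas are shown:)

  # /etc/init/debug.conf (hypothetical name)
  start on startup
  exec initctl log-priority info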

ProblemType: Bug
Architecture: amd64
Date: Wed Mar 3 11:29:10 2010
DistroRelease: Ubuntu 10.04
Package: upstart 0.6.5-4
ProcEnviron:
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcVersionSignature: Ubuntu 2.6.32-15.22-server
SourcePackage: upstart
Uname: Linux 2.6.32-15-server x86_64

Scott Moser (smoser) wrote :

I did reproduce the failure with debug enabled, and have attached logs of the same image both succeeding and failing above.

Scott Moser (smoser) wrote :

Re-attaching the successful boot log; the previous one had 2 boots in it. This one has only the first.

Scott Moser (smoser) on 2010-03-03
Changed in upstart (Ubuntu):
importance: Undecided → Critical
Steve Langasek (vorlon) wrote :

The failed log shows:

remote finished
trigger_events: local 0/2 remote 0/0 virtual 11/11 swap 0/0
fsck_update: updating check priorities
try_mount: / waiting for device
try_mount: /tmp waiting for /
udev_catchup: catching up
mountinfo_watcher: mountinfo changed, reparsing
parse_mountinfo_file: updating mounts
update_mount: /: / /dev/sda1 ext3 defaults
update_mount: /dev: /dev devtmpfs devtmpfs mode=0755
update_mount: /proc: /proc proc proc defaults
update_mount: /sys: /sys none sysfs nodev,noexec,nosuid
update_mount: /sys/fs/fuse/connections: /sys/fs/fuse/connections none fusectl optional
update_mount: /sys/kernel/debug: /sys/kernel/debug none debugfs optional
update_mount: /sys/kernel/security: /sys/kernel/security none securityfs optional
update_mount: /dev/pts: /dev/pts none devpts noexec,nosuid,gid=tty,mode=0620

So mountall is never recognizing the rootfs as mounted.

affects: upstart (Ubuntu) → mountall (Ubuntu)
Steve Langasek (vorlon) wrote :

In the success log, we see:

try_udev_device: block /dev/sda (null) (null)
try_udev_device: block /dev/sda3 085367af-00a2-4be4-9acf-acae13c39ace (null)
try_udev_device: block /dev/sda1 29b36e00-d659-4281-a0c2-43a8ad85955c uec-rootfs
try_udev_device: / by name
run_fsck: /: no check required
[...]
spawn: mount -n -a -t ext3 -o remount,defaults /dev/sda1 /
spawn: mount / [257]

this is missing from the failed log.

Steve Langasek (vorlon) wrote :

More precisely, in the failed log we see

try_udev_device: block /dev/sda (null) (null)
try_udev_device: block /dev/sda3 c13c0556-60db-4cac-bd0c-c8b42ba86404 (null)
boredom_timeout: Waiting for /

Scott Moser (smoser) on 2010-03-08
Changed in mountall (Ubuntu):
milestone: none → ubuntu-10.04-beta-1
Scott Moser (smoser) wrote :

Just for a small update: I've been trying to roll upstart back to the older version to see if I can reproduce there, but have not been able to due to bug 532682.

Scott Moser (smoser) wrote :

I took the upstart image that was failing (20100303) and replaced the upstart 0.6.5-4 with 0.6.5-3.

I've yet to make the instance fail (admittedly only 5 tests, but with 0.6.5-4, success was quite rare).

affects: mountall (Ubuntu Lucid) → upstart (Ubuntu Lucid)
Thierry Carrez (ttx) on 2010-03-09
Changed in upstart (Ubuntu Lucid):
status: New → Confirmed

I'm curious as to what is mounting /proc in the log:

    update_mount: /proc: /proc proc proc defaults

Something has run "mount /proc" - where does that come from?

A copy of /var/log/udev from failed and successful boots would be useful

Scott Moser (smoser) wrote :

Current status here:
  I have successfully booted uec image 20100310 6 times on one UEC that I have access to, and booted 20100305 successfully as well. That is good news. Unfortunately, this is not the UEC setup that I had originally seen the issue with 20100303 on. I will try on that later today.

  In the meantime, I looked at what else changed between 20100303 and 20100305. The big change was
-linux-image-2.6.32-14-virtual 2.6.32-14.20
+linux-image-2.6.32-15-virtual 2.6.32-15.22

  To see if that fixed this issue, I got the older kernel, and booted 20100310 with it. It failed to run cloud-init.

  So, in short conclusion, it *seems* like this was fix-released with linux-image-2.6.32-15-virtual. However, there is still the possibility of a race condition. I'll test with the original UEC system to verify the problem is solved with 20100310 there.

Scott Moser (smoser) wrote :

Well, I had high hopes for the above. However, it seems that 20100303.1 is failing on Kirkland's cloud. So, it really seems to me that there must be a race and that all the data above is just happenstance.

I'm going to test 20100310 there also, just to make sure it's apples to apples, but I no longer have high hopes that this is fixed.

Additionally, I can get /var/log/udev from the successful instances, but in the failed instance it seems that file isn't even present.

Scott Moser (smoser) wrote :

At this point
 - 20100310
   boots perfectly in data center,
   fails on Kirkland's cloud.
 - 20100310 with older (alpha-3 kernel)
   shows no output past kernel messages, ie, the cloud-init job is not run.

I'm fairly convinced this is timing related. The data center hardware is generally much more powerful than dustin's laptop cloud. I think that goes both for the switches and the cpu/memory.

Thierry Carrez (ttx) wrote :

Another data point:
Freshly-installed CLC+Walrus+CC+SC / NC, eucalyptus 1.6.2-0ubuntu12

Starting one instance: 20100310 boots perfectly on my cloud-in-a-bag.
Starting two instances at the same time: one of them fails, same symptoms.
Starting two instances at the same time: the two fail.
Letting the cloud cool off, starting one instance: boots ok.

Thierry Carrez (ttx) wrote :

Probably not related to UEC/no-ramdisk: if EC2 systems were slow we could probably reproduce it there as well.

Changed in upstart (Ubuntu Lucid):
status: Confirmed → Incomplete
Colin Watson (cjwatson) wrote :

Scott R asked for some specific information in comments 12 and 13. Could somebody please supply it? (Scott M gave a partial answer in comment 15, but we still don't have a copy of the successful log, nor any hints as to where /proc might be being mounted.)

I'm deferring to beta-2, as we're unlikely to be able to get this from Incomplete to fixed in the remaining time to beta-1.

Changed in upstart (Ubuntu Lucid):
milestone: ubuntu-10.04-beta-1 → ubuntu-10.04-beta-2

On Fri, 12 Mar 2010, Colin Watson wrote:

> Scott R asked for some specific information in comments 12 and 13.
> Could somebody please supply it? (Scott M gave a partial answer in
> comment 15, but we still don't have a copy of the successful log, nor
> any hints as to where /proc might be being mounted.)

I really have no idea what might be mounting /proc. Nothing in cloud-init
is doing it, as cloud-init isn't being called at all as far as I can tell.
I've put other jobs in with the same 'start on' condition that do nothing
but echo messages, and have not seen the messages in the log. Nor is
there any indication in the logs that cloud-init is being called. In
short, I don't know, and Scott R is probably more suited to guess than I
am.

I will attach a successful log; I only didn't do so earlier because I didn't
realize its usefulness without a failed log.

I quickly tried to get a /var/log/udev from an instance running in the Datacenter cloud (which usually boots successfully), but the installation there right now doesn't have a working metadata service.

As such, I was unable to do so. I've asked mathiaz to try, but I don't know if he will be successful. If not, I will get one Sunday or Monday.

To do so, simply get a current lucid server image and start it. If you see messages after the kernel's messages in the console, then you can most likely reach the system (ie, successful boot). Once you've done that, ssh in, grab /var/log/udev, and attach it here.

For consistency, we've been using the 20100310 image, but just about anything after alpha3 shows similar failure. I've not seen issues with the alpha3 image.

Mathias Gug (mathiaz) wrote :

Using the 20100313 amd64 server image on a CLC+Walrus+CC+SC / NC UEC with 1.6.2-0ubuntu13, I was able to successfully boot an image. I've attached /var/log/udev from the instance.

Scott Moser (smoser) wrote :

In order to have reliable boot for beta-1, I'm going to enable ramdisk generation and packaging for uec images (ie, ramdisk for -virtual kernel, but not for -ec2)

Scott Moser (smoser) wrote :

I think I just ran into this on ec2; as far as I know, this is the only time it has happened to me. I'm attaching the (not very useful) console log. Attempts to ssh to the system give connection refused.

RESERVATION r-e8eec980 950047163771 default
INSTANCE i-1300a978 ami-a908e7c0 ec2-184-73-40-120.compute-1.amazonaws.com ip-10-245-195-22.ec2.internal running ec2-keypair.us-east-1 0 m1.small 2010-03-18T14:16:49+0000 us-east-1c aki-f509e69c monitoring-disabled 184.73.40.120 10.245.195.22 ebs
BLOCKDEVICE /dev/sda1 vol-99fe25f0 2010-03-18T14:16:59.000Z

Steve Langasek (vorlon) wrote :

Scott, I'm assuming you're the correct assignee for this targeted/milestoned bug.

Changed in upstart (Ubuntu Lucid):
assignee: nobody → Scott James Remnant (scott)
Thierry Carrez (ttx) wrote :

Updating title to reflect that it's not eucalyptus-specific, and the racy nature of it.

summary: - cloud-init job not running in eucalyptus without ramdisk
+ cloud-init job sometimes not running in cloud images without ramdisk

On Fri, 2010-03-19 at 04:45 +0000, Steve Langasek wrote:

> Scott, assuming you're the correct assignee for this targeted/milestoned
> bug.
>
> ** Changed in: upstart (Ubuntu Lucid)
> Assignee: (unassigned) => Scott James Remnant (scott)
>
No; I don't know anything about this bug -- I've asked the server team
for information and they haven't provided it. Until I actually have
concrete information about this, I can't accept it being assigned to me.

Scott
--
Scott James Remnant
<email address hidden>

Changed in upstart (Ubuntu Lucid):
assignee: Scott James Remnant (scott) → nobody
Scott Moser (smoser) wrote :

I just IRC'd with Scott. I will try Monday to get him access to a system with console access so he can poke around.

Scott Moser (smoser) wrote :

This is readily reproducible using an image launched from libvirt (ie, no eucalyptus involved). I've sent connection information to Scott for a system that reproduces the problem.

Scott Moser (smoser) wrote :

Moving this out of 'incomplete'. Yesterday I gave Scott access to a system that readily reproduces the problem.

Changed in upstart (Ubuntu Lucid):
status: Incomplete → Confirmed
Thierry Carrez (ttx) on 2010-03-30
Changed in upstart (Ubuntu Lucid):
importance: Critical → High
milestone: ubuntu-10.04-beta-2 → none

I can't see any reason that this should be assigned to Upstart; if t'other Scott is right, the system at least starts some services, in which case Upstart must be working just fine.

affects: upstart (Ubuntu Lucid) → Ubuntu Lucid
Scott Moser (smoser) wrote :

I'm attaching a script that recreates this.
It should run on any node controller without any hassle. It does need partition2disk from eucalyptus, and general libvirt functionality, to work.

Scott Moser (smoser) wrote :

Just putting this here for later convenience. The kvm command that libvirt runs is:

 /usr/bin/kvm -S -M pc-0.12 -enable-kvm -m 192 -smp 1 -name 20100406-no-ramdisk -uuid 3cffa9b9-19e4-291e-21c4-9a7c17607793 -nographic -chardev socket,id=monitor,path=/var/lib/libvirt/qemu/20100406-no-ramdisk.monitor,server,nowait -monitor chardev:monitor -boot c -kernel /home/ubuntu/recreate/lucid-server-uec-amd64-vmlinuz-virtual -append root=/dev/sda1 console=ttyS0 init=/bin/bash -drive file=/home/ubuntu/recreate/disk.img,if=scsi,index=0,boot=on -net nic,macaddr=52:54:00:29:0d:91,vlan=0,model=e1000,name=e1000.0 -net tap,fd=43,vlan=0,name=tap.0 -chardev pty,id=serial0 -serial chardev:serial0 -parallel none -usb

Scott Moser (smoser) wrote :

Scott suggested this would be fixed with mountall 2.10. I tested the recreate with the script attached here. Unfortunately, I'm not able to reproduce the failure. I've been unable to recreate with any of 20100406, 20100402 and 20100401.

20100402 and 20100401 have the old mountall.
From 20100401:
$ dpkg-query --show mountall upstart plymouth
mountall 2.9
plymouth 0.8.1-2
upstart 0.6.5-5

While I'm happy it's fixed, I really wish I could see the failure with an older image.

Scott Moser (smoser) wrote :

Good news. I see the failure when recreating with beta-1 (20100317).

I'm going to mark this fix-released based on 20100406 not exhibiting the failure and beta-1 showing it.

Scott Moser (smoser) wrote :

I'm attaching a diff of the broken (beta1/20100317) and apparently fixed (20100401) manifests.
