natty fails ec2 boot on i386

Reported by Scott Moser on 2010-11-01
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Critical
Stefan Bader
Natty
Critical
Stefan Bader

Bug Description

I'm attaching the console output, but our current natty builds (this one was 20101101) do not successfully mount the root filesystem on ec2.

Here is the interesting portion of the log:
Begin: Running /scripts/init-premount ... done.
Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done.
Begin: Running /scripts/local-premount ... done.
[ 0.567065] EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
Begin: Running /scripts/local-bottom ... done.
done.
Begin: Running /scripts/init-bottom ... done.
[ 8.923830] end_request: I/O error, dev sda1, sector 12853456
[ 8.923850] end_request: I/O error, dev sda1, sector 12853456
[ 8.924184] Aborting journal on device sda1-8.
[ 8.926353] EXT4-fs error (device sda1): ext4_journal_start_sb:251: Detected aborted journal
[ 8.926369] EXT4-fs (sda1): Remounting filesystem read-only
[ 8.928419] EXT4-fs error (device sda1): ext4_journal_start_sb:251: Detected aborted journal

Scott Moser (smoser) wrote :
Changed in linux (Ubuntu):
importance: Undecided → Critical
milestone: none → natty-alpha-1
status: New → Confirmed
Scott Moser (smoser) wrote :

I verified that maverick images boot successfully under kvm directly and when running under a uec maverick host. So this would seem to be an ec2-specific issue.

Stefan Bader (smb) wrote :

Not sure this is leading somewhere, but comparing the console log of a Maverick instance with the failed boots here I saw the following:

[ 0.242692] blkfront: sda1: barriers enabled (drain)
[ 0.243443] Setting capacity to 20971520
[ 0.243461] sda1: detected capacity change from 0 to 10737418240
[ 0.244067] blkfront: sdb: barriers enabled (drain)
[ 0.264727] sdb: unknown partition table
[ 0.264928] Setting capacity to 880732160
[ 0.264940] sdb: detected capacity change from 0 to 450934865920
[ 0.265508] blkfront: sdc: barriers enabled (drain)
[ 0.266328] sdc: unknown partition table
[ 0.266507] Setting capacity to 880732160
[ 0.266519] sdc: detected capacity change from 0 to 450934865920

I have not looked deeper yet, but there were no capacity change messages in Maverick. That could just mean the message was not printed there, or that the blkfront driver really was changed to initialize the way a removable block device does, with size 0 first and the real size afterwards. That may need some udev helper to recognize the change event and trigger a rescan of partitions.
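
The capacity lines quoted above appear to pair a sector count with a byte count. A quick check, assuming 512-byte sectors (the log does not state the unit), suggests the two numbers agree for each device. It also compares the failing sector from the bug description against the sda1 sector count, which only means something if both logs come from root volumes of the same size. `sectors_to_bytes` is a helper name made up for this sketch.

```shell
#!/bin/sh
# Assumption: "Setting capacity to N" counts 512-byte sectors.
SECTOR_SIZE=512

# Hypothetical helper: convert a sector count to bytes.
sectors_to_bytes() {
    echo $(( $1 * SECTOR_SIZE ))
}

sectors_to_bytes 20971520     # sda1: compare with 10737418240 in the log
sectors_to_bytes 880732160    # sdb/sdc: compare with 450934865920 in the log

# The I/O error in the bug description was at sector 12853456;
# check whether that is below the sector count reported for sda1.
if [ 12853456 -lt 20971520 ]; then
    echo "failing sector is within the reported capacity"
fi
```

If the comparison holds, the I/O error is not simply a read past the end of the device as finally sized, though it says nothing about what size the device reported at the moment of the failed request.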

Stefan Bader (smb) wrote :

Tried the following today in an m1.large us-east instance:

1. Launched ami-00877069
2. logged in
3. wget https://launchpad.net/ubuntu/+source/linux/2.6.37-2.10/+build/2032158/+files/linux-image-2.6.37-2-virtual_2.6.37-2.10_amd64.deb
4. rebooted

ubuntu@ip-10-194-26-111:~$ uname -a
Linux ip-10-194-26-111 2.6.37-2-virtual #10-Ubuntu SMP Fri Nov 5 12:45:25 UTC 2010 x86_64 GNU/Linux

So this seems to be related to user space or the plumbing layer of the natty image.

Scott Moser (smoser) wrote :

Stefan,
  I tried to reproduce what you did above, and the system did not come back up. Worse, there are no console messages after '[ 145.610072] Restarting system.' (this was on t1.micro). This happened 3 out of 3 times for me.

  This is the first time I've looked at those logs in depth. In both cases, user space comes up and we get to a "mounted MOUNTPOINT=/" event, as the 'cloud-init' messages show. Then something goes awry and we get kernel messages like:

[ 8.923830] end_request: I/O error, dev sda1, sector 12853456
[ 8.923850] end_request: I/O error, dev sda1, sector 12853456
[ 8.924184] Aborting journal on device sda1-8.
[ 8.926353] EXT4-fs error (device sda1): ext4_journal_start_sb:251: Detected aborted journal
[ 8.926369] EXT4-fs (sda1): Remounting filesystem read-only
[ 8.928419] EXT4-fs error (device sda1): ext4_journal_start_sb:251: Detected aborted journal

Stefan Bader (smb) wrote :

Fun, this seems to be related to t1.micro vs m1.large (and maybe larger). I am able to successfully replace the kernel on a m1.large. But the t1.micro is, as you say, not even bothering updating the console messages. And you said kvm boots fine. Was that with just a single vcpu?

On Fri, 12 Nov 2010, Stefan Bader wrote:

> Fun, this seems to be related to t1.micro vs m1.large (and maybe
> larger). I am able to successfully replace the kernel on a m1.large. But
> the t1.micro is, as you say, not even bothering updating the console
> messages. And you said kvm boots fine. Was that with just a single vcpu?

I just verified with today's image that it boots fine in kvm with 2 CPUs:
 kvm -boot a -fda natty-server-uec-amd64-floppy -drive
    file=natty-server-uec-amd64.img,if=virtio,boot=on -curses -smp 2

The path in kvm is slightly different, as it is a different init that runs
before /sbin/init is run, but in the end, it boots fine.

It's different disks though (kvm virtio versus xen block devices).

I don't want to claim this is of any proven importance, but I noted that my t1.micro instances come up as UP (iow 1CPU) and then memory restrictions may or may not be something to have an eye on.

On Fri, 12 Nov 2010, Stefan Bader wrote:

> I don't want to claim this is of any proven importance, but I noted that
> my t1.micro instances come up as UP (iow 1CPU) and then memory
> restrictions may or may not be something to have an eye on.

t1.micro instances have 650 MB or so of memory and a very slow CPU.
However, the images will boot (tested in kvm) in 192M of memory.

Stefan, I tried working this the other way around. Rather than trying natty kernel with maverick user space, I tried maverick kernel with natty user space.

$ ec2-run-instances --region us-east-1 --instance-type m1.small ami-2c28df45
# us-east-1 ami-2c28df45 canonical ebs/ubuntu-natty-daily-i386-server-2010111
$ NATTY_IID=i-aeee8cc3
$ NATTY_VOL=vol-a910fbc1

$ ec2-run-instances --region us-east-1 --instance-type t1.micro --availability-zone us-east-1b ami-aa00f7c3
$ MAVERICK_IID=i-e6ed8f8b

The natty instance failed to boot (actually with no kernel console messages, only grub).

$ ec2-stop-instances ${NATTY_IID}
$ ec2-detach-volume ${NATTY_VOL}
$ ec2-attach-volume ${NATTY_VOL} --instance ${MAVERICK_IID} --device /dev/sdh

then, on the maverick instance

% sudo mount /dev/sdh /mnt
% sudo mount --bind /dev /mnt/dev
% sudo mount --bind /proc /mnt/proc
% sudo mount --bind /sys /mnt/sys
% kdeb=https://launchpad.net/ubuntu/+archive/primary/+files/linux-image-2.6.35-22-virtual_2.6.35-22.35_i386.deb
% ( cd /mnt/tmp && wget ${kdeb} )
% sudo chroot /mnt
[chroot]% dpkg -i /tmp/*.deb
[chroot]% exit
% mnulst=/mnt/boot/grub/menu.lst
% sudo cp ${mnulst} ${mnulst}.dist
% sudo sed -i 's,^default\(.*\)0$,default\12,' ${mnulst}
% diff -u ${mnulst}.dist ${mnulst}
-default 0
+default 2
% for m in proc sys dev; do sudo umount /mnt/$m; done
% sudo umount /mnt

Then,
$ ec2-detach-volume ${NATTY_VOL}
$ ec2-attach-volume ${NATTY_VOL} --instance ${NATTY_IID} --device /dev/sda
$ ec2-start-instances ${NATTY_IID}

Now, the natty user space booted fine with the maverick kernel.
ssh in:
% uname -r
2.6.35-22-virtual
% lsb_release -c
Codename: natty

The above was done on m1.small (i386). I can try other sizes as well.
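
For anyone repeating the procedure above, the sed expression that switches the grub default can be tried on sample input first. This is only a sketch: the real menu.lst contents are not shown in this report, so the sample lines are assumptions, and `bump_default` is a name made up here.

```shell
#!/bin/sh
# Same substitution as in the procedure, reading stdin instead of
# editing the file in place.
# sed only has back-references \1..\9, so "\12" means "\1" followed by
# a literal 2: whatever sat between "default" and the trailing 0 is kept.
bump_default() {
    sed 's,^default\(.*\)0$,default\12,'
}

# Assumed sample input; a real menu.lst has many more lines.
printf 'default 0\ntimeout 3\n' | bump_default

# Lines that do not end in 0 are left alone.
printf 'default saved\n' | bump_default
```

Note that the pattern only checks that the line ends in 0, so a value such as 10 would also be rewritten. That is not a concern when the file starts out at `default 0`, as in the diff shown above.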

Stefan Bader (smb) wrote :

For some reason I don't seem to be failing here:

ami-0834c361 us-east-1c

ubuntu@domU-12-31-39-13-00-CF:~$ ls
ubuntu@domU-12-31-39-13-00-CF:~$ uname -a
Linux domU-12-31-39-13-00-CF 2.6.37-2-virtual #10-Ubuntu SMP Fri Nov 5 12:45:25 UTC 2010 x86_64 GNU/Linux
ubuntu@domU-12-31-39-13-00-CF:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu natty (development branch)
Release: 11.04
Codename: natty

The only fails I see are those of t1.micro. And those do not produce any console output.

Stefan Bader (smb) wrote :

John seemed to have some limited success in getting log files from some of the boots. All in all it seems all of i386 and amd64-t1.micro are affected by this. The fact that we get no logs in most cases sounds like a very early crash. Maybe something racy that sometimes does not happen immediately but then causes random failures later on.

Having so little evidence makes it hard to get any grip on this. Will try to look at Maverick i386 and amd64 boots to get a better feeling on obvious differences.

Stefan Bader (smb) wrote :

Just to add info while I am looking through things:

I booted a maverick and a natty amd64 instance and a maverick i386 instance. Comparing maverick and natty, one difference was the compiler version. To rule that out I compiled a natty kernel in a maverick chroot and tested that on a t1.micro. Same failure. So it's not that.

The next things I saw, whose relevance I still need to figure out how to check:

In Maverick (i386 and amd64) there is this:
[ 0.000000] Scanning 1 areas for low memory corruption
while in natty it says:
[ 0.000000] Scanning 0 areas for low memory corruption

There also seems to be a slightly different memory layout, but I am not sure whether this is just because of virtualization. There was also one line in the maverick i386 boot which was different from the amd64 boot:

[ 0.000000] Reserving virtual address space above 0xf5800000

This is the very first line in dmesg, but again, not sure it is relevant.

Scott Moser (smoser) wrote :

I'm attaching a script that runs the following size/arch/root combinations.

i386__ | ebs_ | m1.small |
i386__ | ebs_ | t1.micro |
i386__ | ins_ | m1.small |
x86_64 | ebs_ | m1.large |
x86_64 | ins_ | m1.large |
x86_64 | ebs_ | t1.micro |
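
The attached script itself is not reproduced in this report. As a rough illustration only, a loop over the same six combinations might look like the dry run below. It prints commands rather than launching anything, `ami_for` is a hypothetical lookup that returns placeholder strings instead of real AMI IDs, and the region is assumed to be us-east-1.

```shell
#!/bin/sh
# Dry run only: prints the launch commands instead of executing them.

# Hypothetical lookup; real AMI IDs depend on the daily build and are
# not known here, so this returns an obvious placeholder.
ami_for() {
    echo "ami-for-$1-$2"
}

print_matrix() {
    # Columns: arch, root store (ebs or instance-store), instance size.
    while read -r arch root size; do
        echo "ec2-run-instances --region us-east-1 --instance-type $size $(ami_for "$arch" "$root")"
    done <<EOF
i386 ebs m1.small
i386 ebs t1.micro
i386 ins m1.small
x86_64 ebs m1.large
x86_64 ins m1.large
x86_64 ebs t1.micro
EOF
}

print_matrix
```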

Scott Moser (smoser) wrote :

I ran the above script with the 20101118 build in us-east-1, and here is
what I found:
arch__ | root | size____ | result
i386__ | ebs_ | m1.small | fail - no kernel console output
i386__ | ebs_ | t1.micro | fail - no kernel console output
i386__ | ins_ | m1.small | fail - no kernel console output
x86_64 | ebs_ | m1.large | success
x86_64 | ins_ | m1.large | success
x86_64 | ebs_ | t1.micro | fail - no kernel console output

Note, that the results above are different from my initially attached console logs. In those, I got console output on both t1.micro and m1.large for x86_64, but both crashed with filesystem errors later.

I'm attaching all the console output. In the 'success' cases it is present; in the failures we don't see anything after grub.
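
One way to read the table above is "fails on i386 or on t1.micro, succeeds otherwise". The sketch below checks that reading against the six rows, retyped without the padding characters. The rule is an interpretation of this one run, not a diagnosis, and `results` and `exceptions` are names made up for the example.

```shell
#!/bin/sh
# The six results from the 20101118 run: arch, root store, size, result.
results() {
    cat <<'EOF'
i386 ebs m1.small fail
i386 ebs t1.micro fail
i386 ins m1.small fail
x86_64 ebs m1.large success
x86_64 ins m1.large success
x86_64 ebs t1.micro fail
EOF
}

# Print any row that contradicts "fail if and only if i386 or t1.micro".
exceptions() {
    results | awk '
        { predicted = ($1 == "i386" || $3 == "t1.micro") ? "fail" : "success" }
        predicted != $4 { print }
    '
}

if [ -z "$(exceptions)" ]; then
    echo "all six results fit: fail on i386 or t1.micro, success otherwise"
fi
```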

summary: - natty kernel fails to mount root on ec2
+ natty fails ec2 boot on i386 or t1.micro

John Johansen (jjohansen) wrote :

I went through and launched multiple instances of each of m1.small, m1.large, and t1.micro in all zones, and none of them booted successfully. I was, however, able to get logs from some of the failed instances. I have attached one of the 6 logs I managed to collect. None of them are significantly different, and all of them end with the same error.

Changed in linux (Ubuntu):
assignee: nobody → Ubuntu Kernel Team (ubuntu-kernel-team)
Andy Whitcroft (apw) on 2010-11-29
Changed in linux (Ubuntu):
assignee: Ubuntu Kernel Team (ubuntu-kernel-team) → Stefan Bader (stefan-bader-canonical)
tags: added: iso-testing
Changed in linux (Ubuntu Natty):
milestone: natty-alpha-1 → natty-alpha-2
Scott Moser (smoser) wrote :

I'm marking this 'fix-released' given:

$ uname -a
Linux ip-10-251-111-51 2.6.37-8-virtual #21-Ubuntu SMP Sun Dec 5 21:11:17 UTC 2010 i686 GNU/Linux
$ dpkg -S /boot/vmlinuz-$(uname -r)
linux-image-2.6.37-8-virtual: /boot/vmlinuz-2.6.37-8-virtual
$ ec2metadata --ami-manifest-path
ubuntu-images-testing-us/ubuntu-natty-daily-i386-server-20101207.manifest.xml
$ ec2metadata --ami-id
ami-b6c036df

So, the i386 boot is fixed. However, we still have issues with t1.micro. I will split that portion of this bug out into its own.

Changed in linux (Ubuntu Natty):
status: Confirmed → Fix Released
Scott Moser (smoser) wrote :

Also should have added, in comment 19,
$ ec2metadata --instance-type
m1.small

Scott Moser (smoser) wrote :

I opened bug 686692 to address t1.micro for i386 and amd64.

summary: - natty fails ec2 boot on i386 or t1.micro
+ natty fails ec2 boot on i386
Scott Moser (smoser) wrote :

I just verified this is fixed in 20110131 builds with 2.6.38-1 kernel.

us-east-1 ami-aa3dcdc3 canonical ubuntu-natty-daily-i386-server-20110131

$ uname -r
2.6.38-1-virtual
$ dpkg -S /boot/vmlinuz-$(uname -r)
linux-image-2.6.38-1-virtual: /boot/vmlinuz-2.6.38-1-virtual
