Ubuntu
linux package

precise fails boot on ec2 hvm

Bug #901305 reported by Scott Moser on 2011-12-07

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	linux (Ubuntu)	Fix Released	High	Unassigned

Bug Description

I tried booting precise kernel today on hvm.
hvm/ubuntu-precise-daily-amd64-server-20111207

The instance was not reachable, and console output seemed to stop in the kernel.

I'll attach the kernel log from get-console-output.

Scott Moser (smoser) wrote on 2011-12-07:

console log of booted instance (10.9 KiB, text/plain)

Joseph Salisbury (jsalisbury) wrote on 2011-12-08:

@Scott

Can you post the version of the last precise kernel that you were able to boot successfully?

tags:	added: precise
tags:	added: kernel-da-key kernel-key

Brad Figg (brad-figg) wrote on 2011-12-08: Test with newer development kernel (3.2.0-3.9)

Thank you for taking the time to file a bug report on this issue.

However, given the number of bugs that the Kernel Team receives during any development cycle, it is impossible for us to review them all. Therefore, we occasionally resort to using automated bots to request further testing. This is such a request.

We have noted that there is a newer version of the development kernel than the one you last tested when this issue was found. Please test again with the newer kernel and indicate in the bug if this issue still exists or not.

If the bug still exists, change the bug status from Incomplete to Confirmed. If the bug no longer exists, change the bug status from Incomplete to Fix Released.

If you want this bot to quit automatically requesting kernel tests, add a tag named: bot-stop-nagging.

Thank you for your help, we really do appreciate it.

Changed in linux (Ubuntu):
status:	Confirmed → Incomplete
tags:	added: kernel-request-3.2.0-3.9

Stefan Bader (smb) on 2011-12-08

Changed in linux (Ubuntu):
assignee:	nobody → Stefan Bader (stefan-bader-canonical)

Andy Whitcroft (apw) on 2011-12-08

tags:	added: bot-stop-nagging
Changed in linux (Ubuntu):
status:	Incomplete → In Progress

Stefan Bader (smb) wrote on 2011-12-08:

I can reproduce this with the Xen 3.4-based CentOS installation. The same image boots without problems on the same host running Xen 4.1.1 / Oneiric dom0. I created a memory dump of the hung system and will look into that next.

Stefan Bader (smb) wrote on 2011-12-08:

The dmesg of the first dump taken does not show any error; it just stops right after the messages about freeing kernel memory. With more attempts I get a mixed set of results: twice the guest booted, one or two more times it hung, and just now it crashed.

Stefan Bader (smb) wrote on 2011-12-08:

Not sure what got the image to boot twice. :( In the majority of boot attempts it just hangs. A slightly deeper look into the dump shows that both CPUs are idle. This sounds very much like some form of interrupt problem. At least the APIC emulation still seems to be in use (rather than the paravirtualized event channels for interrupts/vector callback).

Stefan Bader (smb) wrote on 2011-12-08:

A more in-depth comparison of the dmesg output between Xen 4.1.1 (booting) and Xen 3.4.3 (not booting) shows a few differences. There are some possibly minor differences in reserved memory (likely to fit differently sized ACPI tables). The non-working case does not use the vector callback or the pvops timer. And the working case seems to size the CPU arrays to 15, while the other case only seems to set them up for the 2 vcpus defined.
The big difference seems to be that in the non-working case the message "Trying to unpack rootfs image as initramfs..." appears right before "Freeing initrd memory: 14228k freed", while in the working case there seems to be a whole lot of ACPI and PCI init going on. Weirdly, the time stamps of the two messages look to be roughly the same distance apart.
In the working case there is also an error message saying that xs_reset_watches failed with -38, but I don't know where that comes from or whether it is meaningful. The whole difference may just be due to some trigger getting missed; I just cannot say what is missing here.
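
As an illustration of the kind of comparison described above, here is a minimal Python sketch that strips the kernel time stamps from two saved dmesg captures and diffs the remaining messages. This is not the procedure actually used for this bug; the function names are hypothetical, and the sample lines use made-up time stamps and a placeholder message. Only the two quoted message strings come from this report.

```python
import difflib
import re

# Matches the leading kernel time stamp, e.g. "[    0.123456] ".
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def strip_timestamps(lines):
    """Drop the "[ seconds ]" prefix so identical messages compare equal."""
    return [TIMESTAMP.sub("", line.rstrip("\n")) for line in lines]


def diff_dmesg(working, hanging):
    """Return unified-diff lines between two captures, ignoring time stamps."""
    return list(difflib.unified_diff(
        strip_timestamps(working),
        strip_timestamps(hanging),
        fromfile="working", tofile="hanging", lineterm="", n=0))


if __name__ == "__main__":
    # Hypothetical sample data: the time stamps and the placeholder line are invented.
    working = [
        "[    0.500000] Trying to unpack rootfs image as initramfs...",
        "[    0.700000] (placeholder for additional init messages)",
        "[    0.900000] Freeing initrd memory: 14228k freed",
    ]
    hanging = [
        "[    0.600000] Trying to unpack rootfs image as initramfs...",
        "[    1.000000] Freeing initrd memory: 14228k freed",
    ]
    for line in diff_dmesg(working, hanging):
        print(line)
```

With the time stamps removed, only messages that are really missing or added in one of the boots show up in the diff, which makes it easier to spot where the two boots diverge.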

Stefan Bader (smb) wrote on 2011-12-08:

So, as a very broad range, it seems to have broken between 3.1 and 3.2: our last 3.1 kernel (3.1.0-2.3) seems to boot, while the first 3.2 kernel (3.2.0-1.1) is already broken.

Joseph Salisbury (jsalisbury) on 2011-12-08

tags:	removed: kernel-key

Stefan Bader (smb) wrote on 2011-12-15:

For whatever reason, git bisection failed miserably to reveal which patch caused this regression. It did narrow things down, however, and together with the dumped state of a hanging guest, I think I have tracked the problem down to

commit ddacf5ef684a655abe2bb50c4b2a5b72ae0d5e05
Author: Olaf Hering <email address hidden>
Date: Thu Sep 22 16:14:49 2011 +0200

xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches from old kernel

The problem seems to be that xenstore (at least up to the version of Xen I am using right now, which I deliberately did not update) just ignores the message and does not return an error. As a result, the whole init remains stuck in xs_init(), waiting for the reply to the reset watches message.

At this point, this can mean one of two things:
1. The hypervisor (actually xenstored in dom0) is supposed to be updated to return an error if it cannot handle a message.
2. The kernel messaging code should not rely on getting an answer and should trigger a timeout error instead.

I need to start a discussion upstream to find out the answer.
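
To illustrate the second option, here is a minimal user-space sketch in Python. It is not the kernel's xenbus code and says nothing about how upstream would actually resolve this; `request_with_timeout`, `ReplyTimeout`, and the message strings are hypothetical names chosen for the example. It only shows the general idea of bounding the wait for a reply, so that a peer which silently ignores an unknown message produces an error instead of a hang.

```python
import queue


class ReplyTimeout(Exception):
    """Raised when the peer does not answer within the allowed time."""


def request_with_timeout(send, replies, message, timeout_s):
    """Send a message and wait at most timeout_s seconds for a reply.

    send    -- callable that delivers the message to the peer (hypothetical)
    replies -- queue.Queue on which the peer posts its replies (hypothetical)
    """
    send(message)
    try:
        return replies.get(timeout=timeout_s)
    except queue.Empty:
        raise ReplyTimeout(f"no reply to {message!r} after {timeout_s}s") from None


if __name__ == "__main__":
    replies = queue.Queue()

    def silent_peer(message):
        # Stands in for a peer that ignores messages it does not
        # understand and never answers them.
        pass

    try:
        request_with_timeout(silent_peer, replies, "RESET_WATCHES", 0.1)
    except ReplyTimeout as err:
        # The caller can now continue its init instead of blocking forever.
        print("continuing without reset:", err)
```

The trade-off is picking the timeout value: too short and a slow but working peer is treated as unresponsive, too long and boot is delayed on every host that never answers.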

Stefan Bader (smb) wrote on 2012-01-19:

The problematic patch was reverted before the 3.2 release. Marking the bug as fixed.

Changed in linux (Ubuntu):
assignee:	Stefan Bader (stefan-bader-canonical) → nobody
status:	In Progress → Fix Released
