precise fails boot on ec2 hvm

Bug #901305 reported by Scott Moser
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
linux (Ubuntu) | Fix Released | High | Unassigned |

Bug Description

I tried booting the precise kernel today on hvm, using this image:
hvm/ubuntu-precise-daily-amd64-server-20111207

The instance was not reachable, and console output seemed to stop in the kernel.

I'll attach the kernel log from get-console-output.

Scott Moser (smoser) wrote :
Joseph Salisbury (jsalisbury) wrote :

@Scott

Can you post the last precise kernel you were able to boot with successfully?

tags: added: precise
tags: added: kernel-da-key kernel-key
Brad Figg (brad-figg) wrote : Test with newer development kernel (3.2.0-3.9)

Thank you for taking the time to file a bug report on this issue.

However, given the number of bugs that the Kernel Team receives during any development cycle it is impossible for us to review them all. Therefore, we occasionally resort to using automated bots to request further testing. This is such a request.

We have noted that there is a newer version of the development kernel than the one you last tested when this issue was found. Please test again with the newer kernel and indicate in the bug if this issue still exists or not.

If the bug still exists, change the bug status from Incomplete to Confirmed. If the bug no longer exists, change the bug status from Incomplete to Fix Released.

If you want this bot to quit automatically requesting kernel tests, add a tag named: bot-stop-nagging.

Thank you for your help; we really do appreciate it.

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
tags: added: kernel-request-3.2.0-3.9
Stefan Bader (smb)
Changed in linux (Ubuntu):
assignee: nobody → Stefan Bader (stefan-bader-canonical)
Andy Whitcroft (apw)
tags: added: bot-stop-nagging
Changed in linux (Ubuntu):
status: Incomplete → In Progress
Stefan Bader (smb) wrote :

I can reproduce this with the Xen 3.4-based CentOS installation. The same image boots without problems on the same host running Xen 4.1.1 / Oneiric dom0. I created a memory dump of the hung system and will look into that next.

Stefan Bader (smb) wrote :

The dmesg of the first dump taken does not show any error; it just stops right after the "freeing kernel memory" messages. After more attempts I get a mixed set of results: twice the guest booted, one or two more times it hung, and just now it crashed.

Stefan Bader (smb) wrote :

Not sure what got the image to boot twice. :( In the majority of boot attempts it just hangs. A slightly deeper look into the dump shows both CPUs are idle. This sounds very much like some form of interrupt problem. At least the APIC emulation still seems to be in use (not the paravirtualized event channels for interrupts/vector callback).

Stefan Bader (smb) wrote :

A more in-depth comparison of the dmesg output from Xen 4.1.1 (booting) and Xen 3.4.3 (not booting) shows, for one thing, a few maybe minor differences: the reserved memory differs (likely to fit differently sized ACPI tables), the vector callback and pvops timer are not being used, and the working case seems to size the CPU arrays to 15 while the other case only seems to set them up for the 2 vcpus defined.
The big difference seems to be that in the non-working case we get the message "Trying to unpack rootfs image as initramfs..." right before "Freeing initrd memory: 14228k freed", while in the working case there seems to be a whole lot of ACPI and PCI init going on. Weirdly, the timestamps of the two look to be apart by roughly the same time.
In the working case there is also an error message saying xs_reset_watches failed with -38, but I don't know where that comes from or whether it has any meaning. The whole big difference may just be due to some trigger getting missed; I just cannot say what is missing here.

Stefan Bader (smb) wrote :

So, as a very broad range, it seems to have been broken between 3.1 and 3.2: our last 3.1 kernel (3.1.0-2.3) seems to boot, while the first 3.2 kernel (3.2.0-1.1) is already broken.
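
Narrowing a regression between a known-good and a known-bad build is a binary search. Since the hang here is intermittent (some boots succeed on a broken kernel), a single successful boot is weak evidence that a build is good. The following is a minimal Python sketch under those assumptions, not the procedure actually used in this bug: `first_bad`, the `boots` callback, and the `trials` count are hypothetical names chosen for illustration, and it assumes every build after the first bad one is also bad.

```python
from typing import Callable, Sequence


def first_bad(versions: Sequence[str],
              boots: Callable[[str], bool],
              trials: int = 5) -> str:
    """Return the first version in `versions` that fails to boot.

    Assumes versions[0] is known good, versions[-1] is known bad, and
    the list is ordered from oldest to newest.
    """
    def is_good(version: str) -> bool:
        # Treat a version as good only if it boots `trials` times in a
        # row; all() stops at the first failed boot.
        return all(boots(version) for _ in range(trials))

    lo, hi = 0, len(versions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(versions[mid]):
            lo = mid
        else:
            hi = mid
    return versions[hi]
```

Repeating the boot test only lowers the chance of mislabelling a flaky build as good; it cannot remove it, which is one plausible way a bisection can point at the wrong patch.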

tags: removed: kernel-key
Stefan Bader (smb) wrote :

For what-the-heck reasons, git bisection miserably failed to reveal which patch caused this regression. It did narrow things down, however, and together with the dumped state of a hanging guest, I think I tracked the problem down to

commit ddacf5ef684a655abe2bb50c4b2a5b72ae0d5e05
Author: Olaf Hering <email address hidden>
Date: Thu Sep 22 16:14:49 2011 +0200

    xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches from old kernel

The problem seems to be that (at least up to the version of Xen I am using right now, which I deliberately did not update) xenstore just ignores the message and does not return an error. For that reason the whole init remains stuck in xs_init(), waiting for the reply to the reset watches message.

For now, this can mean one of two things:
1. The hypervisor (actually xenstored in dom0) is supposed to be updated to return an error if it cannot handle a message.
2. The kernel messaging code should expect no answer and trigger a timeout error.
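
Option 2 can be illustrated with a small user-space sketch in Python. This is not the kernel's xenbus code; `request_with_timeout`, `ReplyTimeout`, the `send` callback, and the queue-based reply channel are all hypothetical names assumed for illustration. The point is only that a bounded wait turns a silently ignored message into an error the caller can handle, instead of an indefinite hang.

```python
import queue


class ReplyTimeout(Exception):
    """Raised when no reply arrives within the allowed time."""


def request_with_timeout(send, replies: queue.Queue, message, timeout_s: float):
    """Send `message` and wait at most `timeout_s` seconds for a reply.

    `send` is any callable that delivers the message to the backend;
    `replies` is the queue on which the backend posts its answers.
    """
    send(message)
    try:
        # Block for a bounded time instead of waiting forever.
        return replies.get(timeout=timeout_s)
    except queue.Empty:
        raise ReplyTimeout(f"no reply to {message!r} after {timeout_s}s") from None
```

With a backend that drops the message, the call raises `ReplyTimeout` after the given delay. With a backend that answers, even with an error code, the reply is returned and the caller can carry on.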

I need to start a discussion upstream to find out the answer.

Stefan Bader (smb) wrote :

The problematic patch was reverted before the 3.2 release. Marking the bug as fixed.

Changed in linux (Ubuntu):
assignee: Stefan Bader (stefan-bader-canonical) → nobody
status: In Progress → Fix Released