Bug #651370 “ec2 kernel crash invalid opcode 0000 [#1]” (linux package, Ubuntu)

Scott Moser (smoser) wrote on 2010-09-29:

#1

console log of failed instance (35.1 KiB, text/plain)
BootDmesg.txt (15.7 KiB, text/plain; charset="utf-8")
Dependencies.txt (1.8 KiB, text/plain; charset="utf-8")
ProcCpuinfo.txt (628 bytes, text/plain; charset="utf-8")
ProcInterrupts.txt (981 bytes, text/plain; charset="utf-8")
UdevDb.txt (32.2 KiB, text/plain; charset="utf-8")
UdevLog.txt (68.5 KiB, text/plain; charset="utf-8")

Ubuntu QA Website (ubuntuqa) on 2010-09-29

tags:	added: iso-testing

Scott Moser (smoser) on 2010-09-29

description:	updated

Jeremy Foshee (jeremyfoshee) wrote on 2010-10-01:

#2

Hi Scott,

If you could also please test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

[This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags:	added: kj-triage
Changed in linux (Ubuntu):
status:	New → Incomplete

Scott Moser (smoser) wrote on 2010-10-07:

#3

console log: us-east-1-x86_64-ami-1a9e6a73 (restarted) (38.3 KiB, text/plain)

Scott Moser (smoser) wrote on 2010-10-07:

#4

console log: us-east-1-x86_64-ami-1a9e6a73 (first boot) (31.1 KiB, text/plain)

Scott Moser (smoser) wrote on 2010-10-07:

#5

Moving this to Confirmed; I attached two other console logs showing this failure.
In both cases, the clock jumped forward by hundreds of thousands of seconds.

Changed in linux (Ubuntu):
status:	Incomplete → Confirmed

Scott Moser (smoser) on 2010-10-18

Changed in linux (Ubuntu):
importance:	Undecided → Medium

Brandon Black (blblack) wrote on 2010-10-25:

#6

console.txt (50.9 KiB, text/plain)

Having the same issue on c1.xlarge in us-east-1 (kernel crash on boot related to intel_idle). I've booted the Maverick release AMI several times on m1.large instances fine, but I seem to have a 50%+ failure rate getting it to initially boot without crashing on c1.xlarge. You're going to need to roll new AMIs when/if this bug is fixed, because the failure means inability to boot far enough to get the kernel upgraded in the first place.

FWIW, I'm only even trying Maverick because of the unresolved kernel issues with Lucid on EC2 that have been hard to pin down (divide by zero panics in network-related areas of the kernel, apparent disk i/o lockups triggered by runaway CPU load triggered by apt somehow, etc...). What's going on with kernels on EC2? Is anyone at Ubuntu actually testing them?

Scott Moser (smoser) wrote on 2010-10-25: Re: [Bug 651370] Re: ec2 kernel crash invalid opcode 0000 [#1]

#7

On Mon, 25 Oct 2010, Brandon Black wrote:

> Having the same issue on c1.xlarge in us-east-1 (kernel crash on boot
> related to intel_idle). I've booted the Maverick release AMI several
> times on m1.large instances fine, but I seem to have a 50%+ failure rate
> getting it to initially boot without crashing on c1.xlarge. You're

In my experience the failure rate is much lower than 50%. I've run literally
hundreds of instances. This bug seems to hit in fits.
The kernel team is interested in fixing these bugs.

> going to need to roll new AMIs when/if this bug is fixed, because the
> failure means inability boot far enough to get the kernel upgraded in
> the first place.

Agreed.

> FWIW, I'm only even trying Maverick because of the unresolved kernel
> issues with Lucid on EC2 that have been hard to pin down (divide by zero
> panics in network-related areas of the kernel, apparent disk i/o lockups
> triggered by runaway CPU load triggered by apt somehow, etc...). What's

Could you please open a bug? Use ubuntu-bug /boot/vmlinuz-$(uname -r).
And please attach console output of a kernel panic.
I've not personally seen the bug you're describing.

> going on with kernels on EC2? Is anyone at Ubuntu actually testing
> them?

We do test the kernels. Our test suite
(https://code.launchpad.net/~ubuntu-on-ec2/ubuntu-on-ec2/ec2-test) can
admittedly be improved, but prior to any release we launch dozens of
instances, spanning all sizes in all regions. I recently began
publishing test results at
https://code.launchpad.net/~ubuntu-on-ec2/ubuntu-on-ec2/ec2-test-results .

Brandon Black (blblack) wrote on 2010-10-26:

#8

I tried to look in more detail at the crash this evening, because it's really causing me a lot of headache now. The most recent time I tried to boot a new c1.xlarge in us-east-1 this evening, I had to cycle through the crash/terminate/relaunch cycle 7 times before I got a working instance. I don't have a patch or answer yet, but I have a lot of hints:

1) c1.xlarge seems to be going through some changes of underlying CPU/hardware, which could explain the randomness. It probably depends which hardware you land on. The older ones are Xeon E5410 and the newer ones are Xeon E5506. So far the only times I've gotten non-crashed launches and thought to check, they've all been the E5410's.

2) The exact instruction throwing invalid opcode is MONITOR (0f 01 c8). The instructions MONITOR and MWAIT are used for efficient idling on newer CPUs, which I guess is the whole point of the intel_idle code we're crashing in.

3) These are not the sorts of instructions that can be executed in a VM environment like Xen without special support. Googling reveals discussions/patches to Xen for supporting these instructions in various ways (either as a hypercall encapsulating the whole monitor/wait pair, or masking the capability in CPUID so that Linux doesn't detect support and doesn't try to use it at all). Various related links:

http://lists.xensource.com/archives/html/xen-devel/2010-04/msg00043.html
http://markmail.org/thread/terab63w744x3m2r
http://www.sfr-fresh.com/unix/misc/xen-4.0.1.tar.gz:a/xen-4.0.1/docs/misc/cpuid-config-for-guest.txt

4) intel_idle can be effectively disabled from the kernel commandline with intel_idle.max_cstate=0 ( http://kerneltrap.org/mailarchive/git-commits-head/2010/5/28/40718 ), which will fall back on acpi_idle behavior. If it still crashes, there's also a commandline flag "idle=nomwait" which might prevent acpi_idle from using mwait as well.

I don't know at this point where the true bug lies. It could be that the intel_idle code needs to make an exception to its detection routines under Xen. It could be that some of Amazon's Xen hosts are configured differently (wrt CPUID masking for mwait) than others. It could be any of a number of related things. However, I suspect new AMIs for Maverick on EC2 that disable mwait from the commandline in grub.conf/menu.lst per above might fix this. I'll try making my own AMIs with this change in the morning and see how it goes.
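
For anyone wanting to check which case a given guest is in, the CPUID masking question above roughly comes down to whether the guest still sees the "monitor" CPU flag, which is the name Linux uses in /proc/cpuinfo for MONITOR/MWAIT support. The following is only a sketch under that assumption: the function name advertises_mwait and the two shortened flag lists are made up for illustration and were not taken from a real instance.

```shell
#!/bin/sh
# advertises_mwait FLAGS
# Succeeds if the space-separated CPU flag list contains "monitor",
# the name under which Linux lists MONITOR/MWAIT support.
advertises_mwait() {
    case " $1 " in
        *" monitor "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Hypothetical, shortened flag lists (illustrative only).
masked="fpu de tsc msr pae cx8 sse sse2 ssse3"
unmasked="fpu de tsc msr pae cx8 sse sse2 monitor ssse3"

for flags in "$masked" "$unmasked"; do
    if advertises_mwait "$flags"; then
        echo "monitor advertised: the kernel may try MWAIT"
    else
        echo "monitor hidden: the kernel should not try MWAIT"
    fi
done
```

On a running guest the same function could be fed the real flag list, for example with advertises_mwait "$(grep -m1 '^flags' /proc/cpuinfo | cut -d: -f2)".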

Brandon Black (blblack) wrote on 2010-10-26:

#9

I forgot to add above: on the E5410 c1.xlarge's that do boot successfully, the kernel output contains:

Oct 26 07:37:55 ip-10-243-51-207 kernel: [ 0.210255] intel_idle: MWAIT substates: 0x2220
Oct 26 07:37:55 ip-10-243-51-207 kernel: [ 0.210257] intel_idle: does not run on family 6 model 23

Which I believe means that intel_idle figured out that it needs to disable itself on these. The E5506's are model 26 rather than 23. The intel_idle code has a case statement that switches on this model number. Model 23 (0x17) is commented out for "FUTURE_USE" and thus falls through to the "does not run" condition with the output above. Model 26 (0x1A) has a case statement and will attempt to use intel_idle support.
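
The model-number behaviour described above can be restated as a small lookup. This is not the kernel's code, just a shell sketch that encodes the two models named in this report (23 and 26); the function name intel_idle_verdict and its output words are invented for the example, and every other model is deliberately reported as unknown.

```shell
#!/bin/sh
# intel_idle_verdict FAMILY MODEL
# Prints what this report observed for the given CPU family/model.
# Only the two models discussed above are encoded; anything else is "unknown".
intel_idle_verdict() {
    family=$1
    model=$2
    if [ "$family" -ne 6 ]; then
        echo "unknown"
        return 0
    fi
    case $(printf '0x%02X' "$model") in
        0x17) echo "declines" ;;   # model 23: "does not run on family 6 model 23"
        0x1A) echo "attempts" ;;   # model 26: has a case, driver tries to load
        *)    echo "unknown" ;;
    esac
}

intel_idle_verdict 6 23   # Xeon E5410
intel_idle_verdict 6 26   # Xeon E5506
```

The family and model values for a live instance are the "cpu family" and "model" fields of /proc/cpuinfo.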

Brandon Black (blblack) wrote on 2010-10-26:

#10

So far my test instances with one or both of the MWAIT-related kernel flags have given even worse results than the original: They boot showing intel_idle disabled on E5410 nodes only, but the (assumed) E5506 nodes just terminate themselves quickly with no console log output at all (even after waiting a while). I've opened a web support ticket with Amazon referencing my test AMI and this bug report to ask for their input.

Mikael Gueck (gumi) wrote on 2010-10-27:

#11

I just tried to launch 16 * m2.4xlarge instances with ami-e43e0b90 in the eu-west-1b area, and not a single one would boot up successfully, because of this bug. Any workaround yet?

Brandon Black (blblack) wrote on 2010-10-27:

#12

Well, I had a hunch this morning that perhaps my test AMI was faulty (perhaps some stupid issue related to block-device mapping, etc, which varies between the variations on c1.xlarge), since it wasn't packaged by the same methods/tools as the official one.

It seems this may be the case. Going off the hint from Mikael that m2.4xlarge may exhibit the problems more reliably, I did the following experiment this morning using EBS root persistence to make the change, rather than custom instance-store AMIs:

1) Booted ami-548c783d (Maverick 64-bit EBS official) on m1.large in us-east-1.
2) Logged into this machine and edited /boot/grub/menu.lst manually to add "intel_idle.max_cstate=0 idle=nomwait" to the kernel bootflags.
3) Rebooted, instance came up fine with messages showing intel_idle disabled.
4) Stopped the instance, used ec2-modify-instance-attribute to move it to type m2.4xlarge
5) Booted on m2.4xlarge successfully, no crash (cpuinfo shows Xeon X5550, which is also "model 26" like the failing c1.xlarges)
6) Edited menu.lst to remove the added bootflags and rebooted the instance again, (staying on same m2.4xlarge hardware)
7) Instance crashed on boot in intel_idle code as always

Given these results, I think the kernel flags will work around this issue; I just built a bad test AMI during my first tests yesterday. Could someone rebuild a set of Maverick AMIs with these flags added from the get-go, using whatever the official method of packaging Maverick AMIs is, for public testing among those of us experiencing the bug?

Scott Moser (smoser) wrote on 2010-10-27:

#13

I just created a rebundled instance.
Please try ami-d258acbb
The AMI is owned by my personal account, not Canonical's.

- launch ami-548c783d
  us-east-1 ami-548c783d canonical ebs/ubuntu-maverick-10.10-amd64-server-20101007.1
- modify /boot/grub/menu.lst to have:
  - # kopt=root=LABEL=uec-rootfs ro
  + # kopt=root=LABEL=uec-rootfs ro intel_idle.max_cstate=0 idle=nomwait
- update grub
  sudo update-grub-legacy-ec2
  # keep the local version
- clean up
  sudo rm -Rf /var/lib/cloud/ /home/ubuntu/.ssh /root/.ssh
- sudo poweroff
- ec2-create-image
  ec2-create-image i-02bf1e6f --name "smoser-lp-651370-ubuntu-maverick-10.10-amd64-server-20101007.1" --description "smoser's rebundle of ubuntu-maverick-10.10-amd64-server-20101007.1 to address LP: #651370"
- ec2-modify-image-attribute --launch-permission --add all ami-d258acbb
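
The menu.lst change in the steps above could also be scripted instead of edited by hand. The following is a minimal sketch, assuming the "# kopt=" line looks like the one shown in the steps; the filter name add_idle_flags is made up for the example, and it leaves a line alone if it already mentions intel_idle.max_cstate.

```shell
#!/bin/sh
# add_idle_flags: filter that appends the workaround flags to the
# "# kopt=" line of a menu.lst read on stdin; lines that already
# mention intel_idle.max_cstate are left untouched.
add_idle_flags() {
    sed '/intel_idle\.max_cstate=/!s/^# kopt=.*/& intel_idle.max_cstate=0 idle=nomwait/'
}

# Demonstrate on the kopt line from the steps above.
printf '%s\n' '# kopt=root=LABEL=uec-rootfs ro' | add_idle_flags
```

In practice the filter would be run over a copy of /boot/grub/menu.lst and the result inspected before being put in place. The kernel entries themselves only pick up the new kopt value once update-grub-legacy-ec2 regenerates them.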

Mikael Gueck (gumi) wrote on 2010-10-27:

#14

Brandon's and Scott's workaround works for me partly, but the kernel on an instance started in such a way seems to detect only 32 GB of memory even for a m2.4xlarge instance which should have 68.4 GB available, according to the EC2 instances page. Is this a side-effect of the workaround, or a completely separate bug?

Maverick results:
ubuntu@ip-10-230-9-87:~$ uname -a
Linux ip-10-230-9-87 2.6.35-22-virtual #35-Ubuntu SMP Sat Oct 16 23:19:29 UTC 2010 x86_64 GNU/Linux
ubuntu@ip-10-230-9-87:~$ ec2metadata --instance-type
m2.4xlarge
ubuntu@ip-10-230-9-87:~$ free
             total       used       free     shared    buffers     cached
Mem:      32810684     667628   32143056          0       6444      32152
-/+ buffers/cache:     629032   32181652
Swap:            0          0          0

Expected results (from a SUSE 11 guest):
ip-10-230-45-187:~ # uname -a
Linux ip-10-230-45-187 2.6.32.19-0.3-ec2 #1 SMP 2010-09-17 20:28:21 +0200 x86_64 x86_64 x86_64 GNU/Linux
ip-10-230-45-187:~ # curl http://169.254.169.254/latest/meta-data/instance-type
m2.4xlarge
ip-10-230-45-187:~ # free
             total       used       free     shared    buffers     cached
Mem:      71705116    2361584   69343532          0      10972     126424
-/+ buffers/cache:    2224188   69480928
Swap:            0          0          0
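
To put the two "total" figures in the same unit as the instance size: free reports KiB, so dividing by 1048576 gives GiB. A small sketch of that arithmetic (the helper name kib_to_gib is just for this example):

```shell
#!/bin/sh
# kib_to_gib KIB: print KIB kibibytes as GiB with one decimal place.
kib_to_gib() {
    LC_ALL=C awk -v kib="$1" 'BEGIN { printf "%.1f\n", kib / 1048576 }'
}

kib_to_gib 32810684   # Maverick guest "total" -> 31.3
kib_to_gib 71705116   # SUSE guest "total"     -> 68.4
```

So the Maverick guest sees about 31.3 GiB, while the SUSE guest sees about 68.4 GiB on the same instance type.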

Brandon Black (blblack) wrote on 2010-10-27:

#15

I wasn't able to boot on ami-d258acbb on m2.4xlarge. It seemed to come up without the special kernel options:

[ 0.000000] Linux version 2.6.35-22-virtual (buildd@allspice) (gcc version 4.4.5 (Ubuntu/Linaro 4.4.4-14ubuntu4) ) #33-Ubuntu SMP Sun Sep 19 21:05:42 UTC 2010 (Ubuntu 2.6.35-22.33-virtual 2.6.35.4)
[ 0.000000] Command line: root=LABEL=uec-rootfs ro console=hvc0

And then hung in intel_idle as expected. Also, confirmed apparent 32GB memory limit on this kernel + machine type.
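
The check made here, whether the booted kernel actually received the workaround flags, can be sketched as a small shell test. The function name has_idle_workaround is made up for the example; it simply looks for both flags from this bug as whole words in a command line string.

```shell
#!/bin/sh
# has_idle_workaround CMDLINE
# Succeeds only if CMDLINE carries both workaround flags from this bug.
has_idle_workaround() {
    case " $1 " in
        *" intel_idle.max_cstate=0 "*) ;;
        *) return 1 ;;
    esac
    case " $1 " in
        *" idle=nomwait "*) return 0 ;;
        *) return 1 ;;
    esac
}

# The command line from the failed boot above lacks the flags.
if has_idle_workaround "root=LABEL=uec-rootfs ro console=hvc0"; then
    echo "workaround present"
else
    echo "workaround missing"
fi
```

On an instance that did come up, the same test can be run against the live value with has_idle_workaround "$(cat /proc/cmdline)".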

Brandon Black (blblack) wrote on 2010-10-27:

#16

What's the method for making the S3 AMIs, by the way? When I tried before, I just did standard ec2-bundle-vol stuff inside of a fixed Maverick, but my first attempts failed because the root device did not have LABEL=uec-rootfs in the newly launched instances. For the second generation I manually switched the root to /dev/sda1, but had other mysterious boot failures. Is there some standard tool or script used to package the official AMIs that we can use to produce identical results (with small changes)?

Scott Moser (smoser) wrote on 2010-10-28:

#17

Brandon,
Sorry about failing to get the command line changed in the AMI I rebundled. I really thought I had tested that it had the proper command line before posting here. The problem in my steps above was selecting "keep local version". I should have chosen "use maintainers version".
Regarding simple changes to the S3 AMIs, the easiest thing to do (and actually what I would recommend for *non*-simple changes) is to download the .tar.gz file from http://uec-images.ubuntu.com/releases/maverick/ , extract it, mount it loopback, modify files (or chroot and modify files), and run uec-resize-image (the downloaded filesystem image is only 2G). Then euca-bundle-image, euca-publish-image...

I also registered 'ami-aa42b6c3' and verified boot on a t1.micro and checked it has the command line. John is hoping to get rebuilt kernel images that would have these options in the config. He should point to them sometime soon.

Scott Moser (smoser) wrote on 2010-10-28:

#18

Mikael,
I opened bug 667796 to address the 32G issue.
Brandon,
I opened bug 667793 to address euca-bundle-vol not copying the filesystem label.

I copied you each on the respective bugs.

John Johansen (jjohansen) wrote on 2010-10-28:

#19

There are maverick test kernels at

kernel.ubuntu.com/~jj/linux-image-2.6.35-23-virtual_2.6.35-23.36~ec2_amd64.deb
kernel.ubuntu.com/~jj/linux-image-2.6.35-23-virtual_2.6.35-23.36~ec2_i386.deb

Mikael Gueck (gumi) wrote on 2010-11-01:

#20

John Johansen's suggested -23.36 kernel booted, but still exhibited bug 667796.

Linux ip-10-230-9-131 2.6.35-23-virtual #36~ec2 SMP Thu Oct 28 15:07:00 UTC 2010 x86_64 GNU/Linux

[ 0.000000] PERCPU: Embedded 30 pages/cpu @ffff88000e8c7000 s91520 r8192 d23168 u122880
[ 0.000000] pcpu-alloc: s91520 r8192 d23168 u122880 alloc=30*4096
[ 0.000000] pcpu-alloc: [0] 0 [0] 1 [0] 2 [0] 3 [0] 4 [0] 5 [0] 6 [0] 7
[8698363.527286] trying to map vcpu_info 0 at ffff88000e8d2020, mfn 10569b2, offset 32
[8698363.527290] cpu 0 using vcpu_info at ffff88000e8d2020
[8698363.527292] trying to map vcpu_info 1 at ffff88000e8f0020, mfn 1056994, offset 32
[8698363.527294] cpu 1 using vcpu_info at ffff88000e8f0020
...

ubuntu@ip-10-230-9-131:~$ free
             total       used       free     shared    buffers     cached
Mem:      32810684     669128   32141556          0       7016      32268
-/+ buffers/cache:     629844   32180840
Swap:            0          0          0

Scott Moser (smoser) wrote on 2010-11-01:

#21

Mike,
Thanks for your test. It's interesting that we still see the time travel of roughly 100 days in your dmesg.
I gather the system was otherwise usable, other than showing only 32G of memory?
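
The "roughly 100 days" figure follows from the dmesg timestamp quoted in comment #20: the kernel clock jumps from 0.000000 to 8698363.527286 seconds. A short sketch of the conversion (the helper name secs_to_days is only for this example):

```shell
#!/bin/sh
# secs_to_days SECONDS: print SECONDS as days with one decimal place.
secs_to_days() {
    LC_ALL=C awk -v s="$1" 'BEGIN { printf "%.1f\n", s / 86400 }'
}

secs_to_days 8698363.527286   # timestamp from comment #20 -> 100.7
```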

Scott Moser (smoser) wrote on 2010-11-01:

#22

m2.4xlarge console output showing time travel forward and back (56.7 KiB, text/plain)

I'm attaching console output from a Lucid 10.04 instance launched from:
us-east-1 ami-4a0df923 canonical ebs/ubuntu-lucid-10.04-amd64-server-20101020

This shows very interesting time travel (both forward and backward) on an otherwise functional instance.
Thus, while the kernel time messages are not pretty looking, they don't necessarily correlate with this bug occurring.

John Johansen (jjohansen) on 2010-11-02

Changed in linux (Ubuntu Maverick):
status:	New → In Progress
Changed in linux (Ubuntu):
status:	Confirmed → In Progress

Stefan Bader (smb) on 2010-11-02

description:	updated
Changed in linux (Ubuntu):
assignee:	nobody → Andy Whitcroft (apw)
status:	In Progress → Triaged
Changed in linux (Ubuntu Maverick):
assignee:	nobody → John Johansen (jjohansen)

Stefan Bader (smb) on 2010-11-02

Changed in linux (Ubuntu Maverick):
importance:	Undecided → Medium

Brandon Black (blblack) wrote on 2010-11-02:

#23

Stefan: the ~32 vs ~64GB memory issue is very likely orthogonal and has a separate bug now (bug 667796). This issue is solely about intel_idle vs certain CPU types under Amazon's EC2 (Xen) environment. m2.4xlarge in us-east reproduces the crash on boot readily (and also happens to exhibit the memory limit issue), and c1.xlarge reproduces it some of the time (depending which hardware you are randomly assigned).

Scott Moser (smoser) wrote on 2010-11-03:

#24

@Brandon,
Stefan's comment in the SRU justification about 68G of memory (which should have been 64) is really only suggesting that selection of a larger instance size seems more likely to land you on newer hardware where failure is more likely.

Stefan Bader (smb) wrote on 2010-11-10:

#25

@Brandon, sorry for the late response. Have been traveling. And yes, Scott's reply is right. The comment about 68G was made because selecting this size seems to trigger the crash more reliably. But it has nothing to do with the memory size itself. Just that requesting that size seems to get you a recent Intel box behind the covers. Just found this to happen while looking at another bug about 68G not being detected correctly in Maverick and finding that I never get the instance up due to this.

Launchpad Janitor (janitor) wrote on 2010-11-12:

#26

This bug was fixed in the package linux - 2.6.37-3.11

---------------
linux (2.6.37-3.11) natty; urgency=low

[ Andy Whitcroft ]

  * Revert "ubuntu: AUFS -- update to
    b37c575759dc4535ccc03241c584ad5fe69e3b25"
  * Revert "ubuntu: AUFS -- track changes to the arguements to fop fsync()"
  * Revert "ubuntu: AUFS -- update to standalone 2.6.35-rcN as at 20100601"
  * Revert "ubuntu: AUFS -- update to standalone 2.6.34 as at 20100601"
  * Revert "ubuntu: AUFS -- aufs2 base patch for linux-2.6.34"
  * [Config] Disable intel_idle for -virtual kernels
    - LP: #651370
  * [Config] enforcer -- ensure we never enable CONFIG_IMA
  * debian -- pass the correct flavour name when checking configs
  * [Config] enforcer -- ensure CONFIG_INTEL_IDLE is off for -virtual
  * [Config] ensure CONFIG_IPV6=y for powerpc
  * [Config] enforcer -- ensure CONFIG_IPV6=y
  * ubuntu: AUFS -- aufs2-base.patch aufs2.1-36-UNRELEASED-20101103
  * ubuntu: AUFS -- aufs2-standalone.patch aufs2.1-36-UNRELEASED-20101103
  * ubuntu: AUFS -- update to aufs2.1-36-UNRELEASED-20101103
  * ubuntu: AUFS -- re-enable
  * ubuntu: AUFS -- track changes to work queue initialisation
  * ubuntu: AUFS -- track changes to llseek in v2.6.37-rc1
  * SAUCE: fbcon -- fix race between open and removal of framebuffers
  * SAUCE: fbcon -- fix OOPs triggered by race prevention fixes
    - LP: #614008
  * SAUCE: drm -- stop early access to drm devices

[ Jeremy Kerr ]

  * [Config] Build-in powermac ZILOG serial driver
    - LP: #673346

[ Kees Cook ]

  * SAUCE: nx-emu: use upstream ASLR when possible

[ Tim Gardner ]

  * [Config] Use correct be2iscsi module name in d-i/modules/scsi-modules
    - LP: #628776

[ Upstream Kernel Changes ]

  * i386: NX emulation
  * nx-emu: drop exec-shield sysctl, merge with disable_nx
  * nx-emu: standardize boottime message prefix
  * mmap randomization for executable mappings on 32-bit
  * exec-randomization: brk away from exec rand area
-- Andy Whitcroft <email address hidden> Thu, 11 Nov 2010 23:46:37 +0000

Changed in linux (Ubuntu):
status:	Triaged → Fix Released

marstonstudio (jon-marstonstudio) wrote on 2010-11-27:

#27

Will a fix for this be backported to Maverick?

Martin Pitt (pitti) wrote on 2010-12-07: Please test proposed package

#28

Accepted linux into maverick-proposed, the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you in advance!

Changed in linux (Ubuntu Maverick):
status:	In Progress → Fix Committed
tags:	added: verification-needed

Brad Figg (brad-figg) wrote on 2010-12-08:

#29

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed' to 'verification-done'.

If verification is not done by one week from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

Scott Moser (smoser) wrote on 2010-12-09:

#30

I've verified this:

* start instance of (t1.micro)
   # us-east-1 ami-548c783d ebs/ubuntu-maverick-10.10-amd64-server-20101007.1
* ssh to instance, install kernel, reboot
   % wget https://launchpad.net/ubuntu/+archive/primary/+files/linux-image-2.6.35-24-virtual_2.6.35-24.42_amd64.deb
   % sudo dpkg -i linux-image-2.6.35-24-virtual_2.6.35-24.42_amd64.deb
   % sudo reboot
* ssh to instance again, verify new kernel, then shut down
   % uname -a
     Linux ip-10-202-31-117 2.6.35-24-virtual #42-Ubuntu SMP Thu Dec 2 05:15:26 UTC 2010 x86_64 GNU/Linux
   % sudo poweroff

* for each type c1.xlarge, m2.2xlarge
   $ ec2-stop-instances ${IID}
   $ ec2-modify-instance-attribute --instance-type ${ITYPE} ${IID}
   $ ec2-start-instances ${IID}
   # 5 times test reboot (note, the cpu info hopefully
   # shows E5506 where it failed before)
   $ for i in 1 2 3 4 5; do ssh $EC2_HOST "uname -a; uptime;
         grep Xeon /proc/cpuinfo | head -n 1; sudo reboot" &&
         echo "$i: passed" || echo "$i: failed"; sleep 2m; done
   $ ssh $EC2_HOST sudo poweroff

I got an instance with X5550 in both c1.xlarge and m2.2xlarge and successfully rebooted and connected 5 times in a row.

tags:	added: verification-done
	removed: verification-needed

Stefan Bader (smb) wrote on 2010-12-09:

#31

I can confirm that I am able to boot the latest kernel on an m2.4xlarge instance, which was usually crashing because it landed on hardware that triggered the intel_idle driver to load.

Ed Swierk (eswierk) wrote on 2010-12-09:

#32

Also confirmed that I can boot the kernel from #30 in an m2.4xlarge instance. It still sees only 32 GB of memory, though (bug 667796).

Launchpad Janitor (janitor) wrote on 2010-12-20:

#33

This bug was fixed in the package linux - 2.6.35-24.42

---------------
linux (2.6.35-24.42) maverick-proposed; urgency=low

[ Brad Figg ]

- LP: #683422

[ Colin Ian King ]

  * SAUCE: Allow registration of handler to multiple WMI events with same
    GUID
    - LP: #676997
  * SAUCE: Add WMI hotkeys support for Dell All-In-One series
    - LP: #676997
  * [Config] Enable Dell All-In-One WMI Hotkeys driver
    - LP: #676997

[ David Woodhouse ]

  * [Upstream] Call acpi_video_register() in intel_opregion_init() failure
    path
    - LP: #615947

[ Manoj Iyer ]

  * SAUCE: enable rfkill for rtl8192se driver
    - LP: #640992
  * SAUCE: Enable jack sense for Thinkpad Edge 11
    - LP: #677210

[ Tim Gardner ]

  * [Config] Use correct be2iscsi module name in d-i/modules/scsi-modules
    - LP: #628776
  * [Config] Added NFS and related modules to virtual flavour
    - LP: #659084
  * [Config] Add support for cross compiling armel
  * Simplify the use of CROSS_COMPILER

[ Upstream Kernel Changes ]

  * Revert "(pre-stable) ACPI: enable repeated PCIEXP wakeup by clearing
    PCIEXP_WAKE_STS on resume"
  * Revert "(pre-stable) mm: Move vma_stack_continue into mm.h"
  * x86, cpu: After uncapping CPUID, re-run CPU feature detection
    - LP: #672664
  * ALSA: sound/pci/rme9652: prevent reading uninitialized stack memory
    - LP: #672664
  * ALSA: oxygen: fix analog capture on Claro halo cards
    - LP: #672664
  * ALSA: hda - Add Dell Latitude E6400 model quirk
    - LP: #643891, #672664
  * ALSA: prevent heap corruption in snd_ctl_new()
    - LP: #672664
  * ALSA: rawmidi: fix oops (use after free) when unloading a driver module
    - LP: #672664
  * hwmon: (lis3) Fix Oops with NULL platform data
    - LP: #672664
  * USB: fix bug in initialization of interface minor numbers
    - LP: #672664
  * usb: musb: gadget: fix kernel panic if using out ep with FIFO_TXRX
    style
    - LP: #672664
  * usb: musb: gadget: restart request on clearing endpoint halt
    - LP: #672664
  * HID: hidraw, fix a NULL pointer dereference in hidraw_ioctl
    - LP: #672664
  * HID: hidraw, fix a NULL pointer dereference in hidraw_write
    - LP: #672664
  * ahci: fix module refcount breakage introduced by libahci split
    - LP: #672664
  * lib/list_sort: do not pass bad pointers to cmp callback
    - LP: #672664
  * ACPI: invoke DSDT corruption workaround on all Toshiba Satellite
    - LP: #672664
  * oprofile: Add Support for Intel CPU Family 6 / Model 29
    - LP: #672664
  * oprofile, ARM: Release resources on failure
    - LP: #672664
  * RDMA/cxgb3: Turn off RX coalescing for iWARP connections
    - LP: #672664
  * drm/radeon/kms: fix bad cast/shift in evergreen.c
    - LP: #672664
  * drm/radeon/kms: avivo cursor workaround applies to evergreen as well
    - LP: #672664
  * ARM: 6400/1: at91: fix arch_gettimeoffset fallout
    - LP: #672664
  * ARM: 6395/1: VExpress: Set bit 22 in the PL310 (cache controller)
    AuxCtlr register
    - LP: #672664
  * V4L/DVB: gspca - main: Fix a crash of some webcams on ARM arch
    - LP: #672664
  * V4L/DVB: gspca - sn9c20x: Bad transfer size of Bayer images
    - LP: #672664
  * mmc: sdhci-s3c: fix NULL ptr access in sdhci_s3c_remove
    - LP: #672664
  * x86/amd-iommu: Set iommu configuration flags in enable-loop
    - LP: #672664
  * x86/amd-iommu: Fix rounding-bug in __unmap_single
    - LP: #672664
  * x86/amd-iommu: Work around S3 BIOS bug
    - LP: #672664
  * tracing/x86: Don't use mcount in pvclock.c
    - LP: #672664
  * tracing/x86: Don't use mcount in kvmclock.c
    - LP: #672664
  * ksm: fix bad user data when swapping
    - LP: #672664
  * i7core_edac: fix panic in udimm sysfs attributes registration
    - LP: #672664
  * v4l1: fix 32-bit compat microcode loading translation
    - LP: #672664
  * V4L/DVB: cx231xx: Avoid an OOPS when card is unknown (card=0)
    - LP: #672664
  * V4L/DVB: IR: fix keys beeing stuck down forever
    - LP: #672664
  * V4L/DVB: Don't identify PV SBTVD Hybrid as a DibCom device
    - LP: #672664
  * Input: joydev - fix JSIOCSAXMAP ioctl
    - LP: #672664
  * Input: wacom - fix pressure in Cintiq 21UX2
    - LP: #672664
  * ioat2: fix performance regression
    - LP: #672664
  * mac80211: fix use-after-free
    - LP: #672664
  * x86, hpet: Fix bogus error check in hpet_assign_irq()
    - LP: #672664
  * x86, irq: Plug memory leak in sparse irq
    - LP: #672664
  * ubd: fix incorrect sector handling during request restart
    - LP: #672664
  * OSS: soundcard: locking bug in sound_ioctl()
    - LP: #672664
  * virtio-blk: fix request leak.
    - LP: #672664
  * ring-buffer: Fix typo of time extends per page
    - LP: #672664
  * dmaengine: fix interrupt clearing for mv_xor
    - LP: #672664
  * drivers/gpu/drm/i915/i915_gem.c: Add missing error handling code
    - LP: #672664
  * hrtimer: Preserve timer state in remove_hrtimer()
    - LP: #672664
  * i2c-pca: Fix waitforcompletion() return value
    - LP: #672664
  * reiserfs: fix dependency inversion between inode and reiserfs mutexes
    - LP: #672664
  * reiserfs: fix unwanted reiserfs lock recursion
    - LP: #672664
  * mfd: Ignore non-GPIO IRQs when setting wm831x IRQ types
    - LP: #672664
  * wext: fix potential private ioctl memory content leak
    - LP: #672664
  * atl1: fix resume
    - LP: #672664
  * x86, numa: For each node, register the memory blocks actually used
    - LP: #672664
  * x86, AMD, MCE thresholding: Fix the MCi_MISCj iteration order
    - LP: #672664
  * firewire: ohci: fix TI TSB82AA2 regression since 2.6.35
    - LP: #672664
  * De-pessimize rds_page_copy_user
    - LP: #672664
  * drm/i915: Prevent module unload to avoid random memory corruption
    - LP: #672664
  * drm/i915: fix GMCH power reporting
    - LP: #672664
  * drm: Prune GEM vma entries
    - LP: #672664
  * drm: Hold the mutex when dropping the last GEM reference (v2)
    - LP: #672664
  * drm/radeon: fix PCI ID 5657 to be an RV410
    - LP: #672664
  * drm/radeon/kms: fix possible sigbus in evergreen accel code
    - LP: #672664
  * drm/radeon/kms: fix up encoder info messages for DFP6
    - LP: #672664
  * drm/radeon/kms: fix potential segfault in r600_ioctl_wait_idle
    - LP: #672664
  * drm/radeon/kms: add quirk for MSI K9A2GM motherboard
    - LP: #672664
  * mmc: sdio: fix SDIO suspend/resume regression
    - LP: #672664
  * V4L/DVB: dib7770: enable the current mirror
    - LP: #672664
  * xfs: properly account for reclaimed inodes
    - LP: #672664
  * skge: add quirk to limit DMA
    - LP: #672664
  * r8169: allocate with GFP_KERNEL flag when able to sleep
    - LP: #672664
  * KVM: i8259: fix migration
    - LP: #672664
  * KVM: x86: Fix SVM VMCB reset
    - LP: #672664
  * KVM: x86: Move TSC reset out of vmcb_init
    - LP: #672664
  * KVM: fix irqfd assign/deassign race
    - LP: #672664
  * KVM: Fix reboot on Intel hosts
    - LP: #672664
  * bsg: fix incorrect device_status value
    - LP: #672664
  * Fix VPD inquiry page wrapper
    - LP: #672664
  * virtio: console: Don't block entire guest if host doesn't read data
    - LP: #672664
  * ACPI: Handle ACPI0007 Device in acpi_early_set_pdc
    - LP: #672664
  * powerpc: Initialise paca->kstack before early_setup_secondary
    - LP: #672664
  * powerpc: Don't use kernel stack with translation off
    - LP: #672664
  * b44: fix carrier detection on bind
    - LP: #672664
  * ACPI: enable repeated PCIEXP wakeup by clearing PCIEXP_WAKE_STS on
    resume
    - LP: #613381, #672664
  * ACPI: EC: add Vista incompatibility DMI entry for Toshiba Satellite
    L355
    - LP: #672664
  * ACPI: delete ZEPTO idle=nomwait DMI quirk
    - LP: #672664
  * ACPI: Disable Windows Vista compatibility for Toshiba P305D
    - LP: #672664
  * PM / ACPI: Blacklist systems known to require acpi_sleep=nonvs
    - LP: #672664
  * x86: detect scattered cpuid features earlier
    - LP: #672664
  * agp/intel: Fix cache control for Sandybridge
    - LP: #672664
  * x86-32: Separate 1:1 pagetables from swapper_pg_dir
    - LP: #672664
  * x86-32: Fix dummy trampoline-related inline stubs
    - LP: #672664
  * x86, mm: Fix CONFIG_VMSPLIT_1G and 2G_OPT trampoline
    - LP: #672664
  * setup_arg_pages: diagnose excessive argument size
    - LP: #672664
  * execve: improve interactivity with large arguments
    - LP: #672664
  * execve: make responsive to SIGKILL with large arguments
    - LP: #672664
  * mm: Move vma_stack_continue into mm.h
    - LP: #672664
  * Linux 2.6.35.8
    - LP: #672664
  * SRU:[Config] Disable intel_idle for -virtual kernels
    - LP: #651370
  * smsc95xx: generate random MAC address once, not every ifup
    - LP: #673504, #673509
  * ALSA: HDA: Enable SKU quirks for Realtek
    - LP: #617647
  * ALSA: HDA: Apply SKU override for Acer aspire 7736z
    - LP: #617647
  * net: clear heap allocation for ETHTOOL_GRXCLSRLALL
    - CVE-2010-3861
  * ipc: shm: fix information leak to userland
    - CVE-2010-4072
  * drm/i915: Avoid pageflipping freeze when we miss the flip prepare
    interrupt
    - LP: #680204
  * ALSA: HDA: Add an extra DAC for Realtek ALC887-VD
    - LP: #682596
  * ALSA: hda - Fixed ALC887-VD initial error
    - LP: #682596
 -- Brad Figg <brad.figg@canonical.com>   Tue, 30 Nov 2010 12:29:50 -0800

Changed in linux (Ubuntu Maverick):
status:	Fix Committed → Fix Released
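
With a changelog this long, it can be easier to search for the entry that closes a given bug than to read through it. The sketch below is one way to do that in Python. It assumes the usual layout seen above: each entry starts with "* ", wrapped title lines are indented, and bug references sit on a "- LP: #..." line. The function name entries_for_bug and the sample text are illustrative only and are not part of any Launchpad or Debian tooling.

```python
import re


def entries_for_bug(changelog: str, bug: int) -> list[str]:
    """Return titles of changelog entries whose "- LP:" line cites `bug`."""
    matches = []
    title_parts = []  # lines of the entry title currently being read
    for raw in changelog.splitlines():
        line = raw.strip()
        if line.startswith("* "):
            # A new bullet starts a new entry.
            title_parts = [line[2:]]
        elif line.startswith("- LP:"):
            refs = {int(n) for n in re.findall(r"#(\d+)", line)}
            if bug in refs and title_parts:
                matches.append(" ".join(title_parts))
        elif line.startswith("["):
            # Author header such as "[ Tim Gardner ]": forget the last entry.
            title_parts = []
        elif not line or line.startswith("-"):
            # Blank lines, CVE lines, and the "--" signature are skipped.
            continue
        elif title_parts:
            # Wrapped continuation of the current title.
            title_parts.append(line)
    return matches


# Hypothetical excerpt in the same layout as the changelog above.
sample = """\
linux (2.6.35-24.42) maverick-proposed; urgency=low

  [ Upstream Kernel Changes ]

  * ACPI: enable repeated PCIEXP wakeup by clearing PCIEXP_WAKE_STS on
    resume
    - LP: #613381, #672664
  * Linux 2.6.35.8
    - LP: #672664
"""

print(entries_for_bug(sample, 613381))
```

Run over the full 2.6.35-24.42 changelog with bug number 651370, a lookup like this should pick out the single [Config] entry for -virtual kernels, which is the change that closed this bug.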

Affects		Status	Importance	Assigned to	Milestone
	linux (Ubuntu)	Fix Released	Medium	Andy Whitcroft
	Maverick	Fix Released	Medium	John Johansen
