Comment 8 for bug 651370

Brandon Black (blblack) wrote :

I looked at the crash in more detail this evening, because it's really causing me a lot of headaches now. The last time I tried to boot a new c1.xlarge in us-east-1, I had to go through the crash/terminate/relaunch cycle 7 times before I got a working instance. I don't have a patch or an answer yet, but I have a lot of hints:

1) c1.xlarge seems to be going through some changes of underlying CPU/hardware, which could explain the randomness: it probably depends on which hardware you land on. The older hosts are Xeon E5410 and the newer ones are Xeon E5506. So far, every time I've gotten a non-crashed launch and thought to check, it has been on an E5410.

2) The exact instruction throwing invalid opcode is MONITOR (0f 01 c8). The instructions MONITOR and MWAIT are used for efficient idling on newer CPUs, which I guess is the whole point of the intel_idle code we're crashing in.

3) These are not the sorts of instructions that can be executed in a VM environment like Xen without special support. Googling reveals discussions/patches to Xen for supporting these instructions in various ways (either as a hypercall encapsulating the whole monitor/wait pair, or masking the capability in CPUID so that Linux doesn't detect support and doesn't try to use it at all). Various related links:

http://lists.xensource.com/archives/html/xen-devel/2010-04/msg00043.html
http://markmail.org/thread/terab63w744x3m2r
http://www.sfr-fresh.com/unix/misc/xen-4.0.1.tar.gz:a/xen-4.0.1/docs/misc/cpuid-config-for-guest.txt

4) intel_idle can be effectively disabled from the kernel commandline with intel_idle.max_cstate=0 ( http://kerneltrap.org/mailarchive/git-commits-head/2010/5/28/40718 ), which will fall back on acpi_idle behavior. If it still crashes, there's also a commandline flag "idle=nomwait" which might prevent acpi_idle from using mwait as well.
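
For anyone who wants to compare notes on points 1 and 3, here is a rough Python sketch of the check I mean: it reads /proc/cpuinfo-style text and reports the CPU model plus whether the guest sees the "monitor" flag (which is how Linux lists MONITOR/MWAIT support there). The function name summarize_cpuinfo is made up for illustration, and this only helps on an instance that stays up long enough to run it.

```python
import os


def summarize_cpuinfo(text):
    """Return (model_name, has_monitor) from /proc/cpuinfo-style text.

    model_name is the first "model name" value found (or None);
    has_monitor is True if any "flags" line lists the "monitor" flag.
    """
    model = None
    has_monitor = False
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # blank separator lines between processors
        key = key.strip()
        value = value.strip()
        if key == "model name" and model is None:
            model = value
        elif key == "flags" and "monitor" in value.split():
            has_monitor = True
    return model, has_monitor


if __name__ == "__main__" and os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as f:
        model, has_monitor = summarize_cpuinfo(f.read())
    print("model:", model)
    print("monitor/mwait advertised:", has_monitor)
```

If the flag shows up on the E5506 hosts but not on the E5410 ones, that would point towards a difference in CPUID masking, but that is a guess until someone actually collects the output from both.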

I don't know at this point where the true bug lies. It could be that the intel_idle code needs to make an exception to its detection routines under Xen. It could be that some of Amazon's Xen hosts are configured differently (wrt CPUID masking for mwait) than others. It could be any of a number of related things. However, I suspect new AMIs for Maverick on EC2 that disable mwait from the commandline in grub.conf/menu.lst per above might fix this. I'll try making my own AMIs with this change in the morning and see how it goes.
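
A minimal sketch of the menu.lst change, in Python, assuming a GRUB legacy layout where each boot entry has a line starting with "kernel". The helper name add_kernel_params is hypothetical, and I haven't tested this against a real AMI yet; on Ubuntu images update-grub may regenerate these lines, so the edit may not survive a kernel update.

```python
def add_kernel_params(menu_lst_text, params):
    """Append each missing parameter to every "kernel" line.

    Lines that do not start with the word "kernel" (comments, title,
    root, initrd, ...) are passed through unchanged. Running it twice
    does not add the same parameter twice.
    """
    out = []
    for line in menu_lst_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "kernel":
            missing = [p for p in params if p not in fields]
            if missing:
                line = line.rstrip() + " " + " ".join(missing)
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Illustrative entry only; paths and root device are placeholders.
    example = (
        "title Ubuntu\n"
        "root (hd0)\n"
        "kernel /boot/vmlinuz root=/dev/sda1 ro\n"
        "initrd /boot/initrd.img\n"
    )
    print(add_kernel_params(example, ["intel_idle.max_cstate=0", "idle=nomwait"]))
```

Passing only intel_idle.max_cstate=0 first and adding idle=nomwait later if it still crashes would match the order described in point 4.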