Kernel Panics - ec2

Bug #1178707 reported by justin
This bug affects 2 people
Affects: linux (Ubuntu)
Status: Won't Fix
Importance: High
Assigned to: Unassigned

Bug Description

Kernel versions affected:
 2.6.32-346-ec2 #51-Ubuntu
 2.6.32-309-ec2 #18-Ubuntu SMP
 3.0.0-32-virtual #51~lucid1-Ubuntu SMP
 2.6.32-351-ec2 #64-Ubuntu SMP

Description: Ubuntu 10.04.4 LTS
Release: 10.04

We're seeing kernel panics across a variety of instance types on AWS, all running 10.04 on the following AMIs:
ami-da0cf8b3
ami-3fe54d56

We've tested with various kernels and still have had the issue. I've attached the console output from a few of the servers where the panics occur as well as the
---
AlsaDevices: Error: command ['ls', '-l', '/dev/snd/'] failed with exit code 2: ls: cannot access /dev/snd/: No such file or directory
AplayDevices: Error: [Errno 2] No such file or directory
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
DistroRelease: Ubuntu 10.04
Frequency: Once every few days.
Lspci:

Lsusb: Error: command ['lsusb'] failed with exit code 1:
Package: linux (not installed)
ProcCmdLine: root=LABEL=cloudimg-rootfs ro xencons=hvc0 console=hvc0
ProcEnviron:
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcModules: acpiphp 23989 0 - Live 0xffffffffa0000000
ProcVersionSignature: Ubuntu 3.0.0-32.51~lucid1-virtual 3.0.69
Regression: No
Reproducible: No
Tags: lucid kconfig needs-upstream-testing
Uname: Linux 3.0.0-32-virtual x86_64
UserGroups:
---
AlsaDevices: Error: command ['ls', '-l', '/dev/snd/'] failed with exit code 2: ls: cannot access /dev/snd/: No such file or directory
AplayDevices: Error: [Errno 2] No such file or directory
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
DistroRelease: Ubuntu 10.04
Frequency: Once every few days.
Lspci:

Lsusb: Error: command ['lsusb'] failed with exit code 1:
Package: linux (not installed)
ProcCmdLine: root=/dev/sda1 ro 4
ProcEnviron:
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcModules: ipv6 293511 12 - Live 0xffffffffa0000000
ProcVersionSignature: Ubuntu 2.6.32-309.18-ec2 2.6.32.21+drm33.7
Regression: No
Reproducible: No
Tags: lucid kconfig needs-upstream-testing
Uname: Linux 2.6.32-309-ec2 x86_64
UserGroups:

Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1178707

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: lucid
Revision history for this message
justin (jlintz) wrote :

The instances' workloads range from:

 - an nginx proxy server, which just proxies connections to different backends running an in-memory database
   avg cpu: 30%

 - a server running an in-house in-memory database, taking connections from the nginx proxy servers
   avg cpu: 20%

 - queue worker servers
   avg cpu: 75%

The nginx proxy servers and memory databases generate a large amount of network traffic, usually around 25k pkts/sec on the proxy servers and around 15k pkts/sec on the database servers.
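
(For reference, those packet rates can be sampled straight from the interface counters; a minimal sketch, assuming eth0 is the instance's active interface:)

# fields 2 and 10 after the colon in /proc/net/dev are the RX and TX packet totals
read rx1 tx1 < <(awk '/eth0:/ { sub(/.*:/, ""); print $2, $10 }' /proc/net/dev)
sleep 1
read rx2 tx2 < <(awk '/eth0:/ { sub(/.*:/, ""); print $2, $10 }' /proc/net/dev)
echo "rx pkts/s: $((rx2 - rx1))  tx pkts/s: $((tx2 - tx1))"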

Revision history for this message
justin (jlintz) wrote : BootDmesg.txt

apport information

tags: added: apport-collected
description: updated
Revision history for this message
justin (jlintz) wrote : CurrentDmesg.txt

apport information

Revision history for this message
justin (jlintz) wrote : ProcCpuinfo.txt

apport information

Revision history for this message
justin (jlintz) wrote : ProcInterrupts.txt

apport information

Revision history for this message
justin (jlintz) wrote : UdevDb.txt

apport information

Revision history for this message
justin (jlintz) wrote : UdevLog.txt

apport information

description: updated
Revision history for this message
justin (jlintz) wrote : BootDmesg.txt

apport information

Revision history for this message
justin (jlintz) wrote : CurrentDmesg.txt

apport information

Revision history for this message
justin (jlintz) wrote : ProcCpuinfo.txt

apport information

Revision history for this message
justin (jlintz) wrote : ProcInterrupts.txt

apport information

Revision history for this message
justin (jlintz) wrote : UdevDb.txt

apport information

Revision history for this message
justin (jlintz) wrote : UdevLog.txt

apport information

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Does this only happen on the 10.04 images? Have you also tested other releases?

Changed in linux (Ubuntu):
importance: Undecided → High
tags: added: kernel-da-key
Revision history for this message
justin (jlintz) wrote :

I've only tested this on 10.04 images. It would be a bit difficult to try on a newer release given the software dependencies we currently have.

Revision history for this message
justin (jlintz) wrote :

@Joseph, is there any additional information I can provide to help with the debugging?

Revision history for this message
Stefan Bader (smb) wrote :

Looking at the dmesg snippets from the various kernels, there seem to be multiple pages that have that bad page state. The locations seem random (maybe visualizing them would reveal some pattern). It happens the same way with the ec2 kernel and the virtual flavour, which actually are very different in their Xen code.
So I am wondering whether external factors are involved. Is this happening to all instances used? Is there maybe any pattern when looking at the availability zones, the Xen version (dmesg in the guest) or the visible CPU info (/proc/cpuinfo in the guest)? The instance should remain on the same host if it is only rebooted.
Although the report says multiple instance types (which are?), and that would lessen the chances of some odd hardware problem on the host, as I think I was told each host only serves instances of the same type...
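
(A minimal sketch of how to collect that per-instance information from inside a guest, assuming the standard EC2 metadata endpoint is reachable:)

dmesg | grep -i 'xen version'                 # Xen version as seen by the guest
grep 'model name' /proc/cpuinfo | sort -u     # visible CPU model(s)
curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone; echo
curl -s http://169.254.169.254/latest/meta-data/instance-type; echo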

Revision history for this message
Stefan Bader (smb) wrote :

And just for reference (I had the feeling there was something similar before): bug 1007082 has a comment #36 that claims this was related to fsc on NFS. Is that involved here as well?

Revision history for this message
justin (jlintz) wrote :

No NFS is involved. All the mounts are ephemeral storage.

Instance types seem to be isolated to m1.large and c1.xlarge so far. We have the same configuration running on m2.xlarge that we have for some m1.larges and have not seen crashes there (but I wouldn't rule it out, since we didn't start digging into this issue deeply until recently).

Seen across multiple AZs in us-east. Amazon checked a few instance IDs for us and said they found no hardware issues.

Seen on CPUs
model name : Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz
model name : Intel(R) Xeon(R) CPU E5507 @ 2.27GHz
model name : Intel(R) Xeon(R) CPU E5506 @ 2.13GHz

I only seem to be able to get the Xen version on the 3.0 kernels. Asking Amazon for more info there but on the 3.0 kernels we've seen

Xen version: 3.4.3-2.6.18 (preserve-AD)

Revision history for this message
justin (jlintz) wrote :

Adding an additional backtrace from this morning on Kernel 3

Revision history for this message
Stefan Bader (smb) wrote :

Oh, right, I forgot that the version string came later. But since the symptom is distributed over such a variety of availability zones and even different instance types, it seems rather unlikely to be related to something on the host.

Unfortunately the kernel messages we see only tell us that something failed to release/invalidate a page in the past, and the stacks point to the process that stumbles over this when trying to use that page. So probably the only way to shed some light on this is to guess what may be different (configuration/usage). So first, is there anything different in resources compared to a standard instance?

The other thing that I thought of: if the services running there are somewhat independent, maybe one could stop some of them on affected instances and see whether the problem remains or not.

Revision history for this message
justin (jlintz) wrote :

One common trait these instances share is that they are heavy on network IO. Instances of larger sizes with the same network I/O seem to be stable. Some have sustained bandwidth of 4MB/sec in/out with packet rates of up to 30k/sec.

Revision history for this message
justin (jlintz) wrote :

Here's another backtrace from today; this occurred on a c1.medium, but the backtrace actually contained a mention of "kernel BUG".

Revision history for this message
Stefan Bader (smb) wrote :

That "kernel BUG" probably does not mean that much. Given there seems to be at least one (but likely more) page on the free list which is not really released, this will result in more and more fallout. Is it possible to elaborate more on disk and network setup (at least anything that differers to an instance one would get by default)?

The stack traces here do not help much to find out the problem. Rather this requires to understand the setup and possibly think of something that this would exercise more than other environments.

Revision history for this message
justin (jlintz) wrote :

@Stefan,

One interesting thing is that we are seeing the crashes on m1.larges of a certain server type, but that same type running on m2.xlarge has not seen any crashes. We see the same network and IO patterns in both cases, but no crashes on the larger instance type.

I disabled irqbalance on one group of servers, which appears to have given that class some stability, although I can't say so definitively since I was never able to reproduce the issue; it has been a couple of weeks since we've seen a kernel panic on them since disabling irqbalance.
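
(For the record, roughly what was done — a minimal sketch assuming the stock Lucid irqbalance packaging with an ENABLED switch in /etc/default/irqbalance:)

service irqbalance stop                                        # stop the running daemon
sed -i 's/^ENABLED=.*/ENABLED="0"/' /etc/default/irqbalance    # keep it off across reboots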

Is there any other debug information I can attempt to gather? I'm also testing kernel 3.2 from mainline on a server with irqbalance enabled to see if that makes a difference.

Revision history for this message
Stefan Bader (smb) wrote :

Hard to imagine how dynamically pinning irq handlers to certain CPUs would make a difference. But who knows. If the description of the instance types is correct, the main differences between the two would be that m2 has more memory (17.1 GB vs. 7.5 GB for m1) but only one 420 GB virtual drive, while m1 has two of them. So m2 would be less likely to run out of cache for network IO as often.

And apart from the additional software, is there anything special in the setup that differs from a stock m1.large or m2.xlarge guest?

Revision history for this message
justin (jlintz) wrote :

Yeah, I had read some bug reports about instability with irqbalance on Xen, but I'm just grasping at straws.

The software and configurations are identical on the m1.large and m2.xlarge for this class of servers.

Are there any particular values I could graph and start monitoring to see whether the network IO cache may be related to the issue? Thanks
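
(For anyone following along, one hedged starting point, assuming the standard procfs counters, is to sample the page-cache and dirty-page figures alongside the network counters and graph them:)

# sample cache/buffer figures once a minute and keep them for correlation
while true; do
  date +%s
  grep -E '^(MemFree|Buffers|Cached|Dirty|Writeback):' /proc/meminfo
  sleep 60
done >> /var/log/meminfo-sample.log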

Revision history for this message
Stefan Bader (smb) wrote :

Maybe you have pointers to those reports about irqbalance? I am not really sure what could be monitored to find more information. I went back and looked at all the bad page error messages, and one thing that all of them seem to have in common is that there is a page->mapping set which has bit 0 set. And that points to that page having previously been used for an anonymous mapping.

That may be heap used by libc for malloc, but somehow I would imagine that if that were broken, there would be many more issues. So maybe this can be narrowed down to something that uses mmap with MAP_ANONYMOUS and somehow causes pages to go back onto the pool before they are unmapped...
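
(A rough, hedged way to see which processes hold the most purely anonymous mappings — lines in /proc/<pid>/maps with inode 0 and no backing file:)

for pid in /proc/[0-9]*; do
  n=$(awk 'NF == 5 && $5 == 0' "$pid/maps" 2>/dev/null | wc -l)   # anonymous mappings only
  name=$(awk '{print $2}' "$pid/stat" 2>/dev/null)                # process name from stat
  [ "$n" -gt 0 ] && echo "$n $name ${pid#/proc/}"
done | sort -rn | head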

Revision history for this message
justin (jlintz) wrote :

Stefan,

http://bugzilla.xensource.com/bugzilla/show_bug.cgi?id=430 , granted that is very old and https://bugzilla.redhat.com/show_bug.cgi?id=550724#c81

I also found this related bug, which seems to show crashes similar to ours and was reported by an Amazon engineer:

https://bugs.launchpad.net/ubuntu/+source/linux-ec2/+bug/1052275

Revision history for this message
Stefan Bader (smb) wrote :

The irqbalance problem on Xen.org sounds like the daemon crashing (which is not the case here). In the Red Hat bug report it feels like people say "crash" when they mean "hang". I remember there were some requests about backporting interrupt-related patches, but due to the differences in the EC2 kernels I could not backport all of them. And then you had the same issue using the ec2 kernels and the generic kernels, which do have those changes.

The other bug report looks quite similar, but there the mapping address is an even one, so that would be a file-backed mapping. And the latest update to that report seems to be a problem detected while releasing pages, but it looks to be missing the other information. It looks to be originating from Python in that case; maybe that helps a bit.

That said, when I enable function tracing of the anon_vma_prepare and anon_vma_unlink kernel functions, irqbalance causes mappings and unmappings in a loop. Actually this happens quite a lot with other processes as well. But maybe if you run that too, something sticks out doing many of them (which may increase the chance of things going wrong).

As root, change into /sys/kernel/debug/tracing, then:
echo "anon_vma_prepare anon_vma_unlink" >set_ftrace_filter
echo function >current_tracer
cat trace_pipe | tee /tmp/ftrace.log

Revision history for this message
justin (jlintz) wrote :

It doesn't look like dynamic ftrace is available in the 2.6.32 kernels we are running, only in the 3.x kernels. I assume you meant the unlink_anon_vmas function?

There's a lot of output, so it's really hard to discern much from it. We have phantomjs running on one of the servers experiencing the crashes and see it scroll by a lot.

Revision history for this message
Stefan Bader (smb) wrote :

I tested this on a local Lucid PVM and tracing was available there. Maybe debugfs is not mounted by default on EC2? For 2.6.32 it was anon_vma_unlink. But it probably does not matter that much which kernel; it is more about getting a feeling for how much relative activity the processes generate.
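
(If debugfs is indeed missing, a hedged one-liner to check and mount it before retrying the trace:)

mount | grep -q debugfs || mount -t debugfs none /sys/kernel/debug
# dynamic ftrace is usable if this file exists and lists the anon_vma helpers
grep -c anon_vma /sys/kernel/debug/tracing/available_filter_functions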

I guess I need to do a bit more thinking. I need to find a way to somehow limit any listing to the more likely suspects. Right now it would include all the normal malloc calls done through libc, and that is unlikely to be a path that produces bad pages.

Revision history for this message
justin (jlintz) wrote :

Also not sure if this is helpful, but here's an output of "sysctl -a".

Revision history for this message
Stefan Bader (smb) wrote :

Nothing really substantial, but recently there was a new upstream stable release for 2.6.32 which had some mm updates and also a few places claiming to fix memory leaks. As it is still unclear what causes the problems, it would be good to install that updated kernel on at least one affected instance, just to make sure the issue hasn't been magically solved.

It will still take a bit until the updates become available, so I placed the packages into [1] for convenience.

[1] http://people.canonical.com/~smb/lp1178707/

Revision history for this message
justin (jlintz) wrote :

Unfortunately just saw a panic on those newer kernels as well

Revision history for this message
Stefan Bader (smb) wrote :

That is unfortunate news. Right now I can only think of a somewhat desperate approach. I added a 64-bit dbg1 kernel to the same location as in comment #41. That one hopefully works (I am not really able to test it). If it works as expected, it will dump the memory contents of the page that appears bad on the freelist and immediately panic the machine so it does not produce follow-ups. Maybe the data in that page allows some conclusion about what was or is using it.

Revision history for this message
justin (jlintz) wrote :

Is it supposed to dump the contents to the console? We had 2 crashes this weekend; attached are the stack traces, but I don't really see anything different.

Revision history for this message
justin (jlintz) wrote :

debug kernel stack trace

Revision history for this message
Stefan Bader (smb) wrote :

Thanks, and sorry; yes, the dump would be on the console if I had not messed up the conversion between the reported struct page and the memory I try to read from. So what you saw is basically the dumping function crashing because it accesses the wrong place. I hope I got it right this time; once I get the kernel compiled it will be a dbg2 version at the same location (probably 30 to 60 minutes after this post).

Revision history for this message
justin (jlintz) wrote :

Ok got some more information now.

Revision history for this message
Stefan Bader (smb) wrote :

Hm, so that middle part looks a bit like Python documentation. Could it be part of phantomjs, or of something related to it? By the way, for Lucid/10.04, how is phantomjs obtained? At least it is not a separate package as of Precise/12.04 and later.
I wonder whether any part of that (or something else which is added to the stock instance) makes extensive use of async IO, which IIRC is one of the things in the kernel that handles MAP_ANON pages. That alone probably does not explain why some instances have problems and others do not, but maybe it does when it comes together with less memory and/or a certain throughput to storage or network.

Revision history for this message
justin (jlintz) wrote :

Yeah, that documentation is part of paramiko, which is imported in a shared Python library that some code on this particular server uses (but it does not make use of paramiko itself). PhantomJS is rolled on our own, but it's also not installed or running on other instances where we've seen this issue. Hopefully (an odd thing to wish for) we'll see some more crashes in the next few days and get some more info on what the pages look like.

Revision history for this message
justin (jlintz) wrote :

One crash from this weekend

Revision history for this message
justin (jlintz) wrote :

And another crash from this weekend.

There was a third but the memory page that it dumped out contains some non-public information so I can't post it here

Revision history for this message
justin (jlintz) wrote :

Just saw a crash on Kernel 3.2.46

Attached is the console output.

Revision history for this message
Stefan Bader (smb) wrote :

So the first one did not show any immediately obvious hint. And I think the lockup that was posted in comment #52 is a completely different issue (I am also wondering about the kernel version in there; is that a mainline kernel?). Anyway, that rather looks like a bug for which I thought we had a patch in upstream stable (v3.2.40: "xen: Send spinlock IPI to all waiters"). Certainly not a crash but a lockup, and unlikely to be related to the 2.6.32 bug of bad pages.

The crash from comment #51 could be a little more interesting, though it is at least a different way in which the brokenness is detected. Actually it does not seem to be detected at all; rather, freeing some pages seems to run into a page fault, and the second trace looks to be from adding dynamic memory.

In all recent traces it is phantomjs that is affected (or running on the CPU that produces the error). I wonder, would it be possible to point to the source from which that comes? Normally userspace should not be able to cause that sort of corruption, but maybe the way this code works makes it possible to see what goes wrong.

Revision history for this message
justin (jlintz) wrote :

We're using https://github.com/ariya/phantomjs/tree/1.7. The recent traces are just from machines that are running PhantomJS; we have been seeing crashes on other servers without phantomjs, but I only have the kernel you compiled for us running on those servers, since they crash the most frequently.

Revision history for this message
Stefan Bader (smb) wrote :

Ah yeah. Well, maybe it is not the only way to make it happen, but one rather successful way. I would really love to find anything that allows me to reproduce the problem on a local host, so I grasp at any straw that looks promising.

Revision history for this message
justin (jlintz) wrote :

We've had a few more panics, but the page has been empty several of the times it was printed out. Is it helpful to post any more traces, or is there other information that would be useful to gather for debugging?

Revision history for this message
Stefan Bader (smb) wrote :

I am sorry, I unfortunately got distracted by trying to finish some feature for the next release. And I must admit that right now I have no good idea how to proceed. The pages that got dumped show, at least to me, no pattern that points to a certain process. You might be in a better position there, since you know better what those instances are doing.
The only vague suspect might be something that does asynchronous I/O (just because that would, to a certain degree, use anonymous pages, which seem to be ending up on free lists incorrectly). If there were certain processes you know to use aio, and if that were configurable, it would be worth turning it off and seeing whether the instance survives. Or, if there are independent tasks that cause some of the load and could be turned off and on, maybe it would become more obvious which direction to look in. Though I doubt this is possible.
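
(One hedged way to check whether anything on an instance uses kernel aio at all is to watch the global counter, assuming the standard procfs knob is present:)

# fs/aio-nr is the total number of events reserved by active io_setup() contexts;
# if it stays at 0 while the workload runs, nothing is using kernel aio
watch -n 5 cat /proc/sys/fs/aio-nr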

Revision history for this message
justin (jlintz) wrote :

Hi,

We just completed an upgrade to Precise across our instances, and it looks like the issue still persists on kernel 3.2.0-61-virtual. So far we have only seen it on Amazon's m1.large instances. I've attached a new stack trace.

Revision history for this message
Po-Hsu Lin (cypressyew) wrote :

Closing this bug as Won't Fix, as this kernel / release is no longer supported.
Please feel free to open a new bug report if you're still experiencing this on a newer release (Bionic 18.04.3 / Disco 19.04).
Thanks!

Changed in linux (Ubuntu):
status: Confirmed → Won't Fix