m1.large instances randomly freezing for 5-15 minutes

Bug #741224 reported by Jerome Oufella
This bug affects 4 people
Affects: linux-ec2 (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

I have noticed strange behavior since we started running m1.large instances with AMI ami-fa01f193.
From time to time (1 to 3 times a day), the instance becomes unresponsive for 3 to 15 minutes. Our application, running on the instance, stops responding to requests. However, some low-profile processes (such as collectd) keep running, and we continue receiving statistics.

Here is a description of a typical incident:
* The system is running normally (load level is not a factor; we've had it happen on non-busy servers).
* Suddenly, one of the CPUs becomes stuck at 100%, with a large proportion of system CPU time (see the attached capture from collectd).
* Applications become totally unresponsive.
* The network is *not* totally stopped (we keep receiving collectd statistics).
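For anyone trying to catch a stall in the act, a minimal watch loop along these lines can help (a sketch, assuming the sysstat package provides mpstat on the instance; the log path is arbitrary):

  # Append a timestamped per-CPU utilization snapshot roughly every 5 seconds.
  # A CPU pinned near 100% %sys during a stall will stand out in the log.
  while true; do
      date >> /var/log/cpu-watch.log
      mpstat -P ALL 5 1 >> /var/log/cpu-watch.log
  done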

How to reproduce: there is no deterministic way; just wait.
I have only tested in the us-east-1 region, but the problem occurs on instances in all of its availability zones.
We run different software stacks (Python, Java), and all have been affected.

I tried running other instances from an AMI we have long trusted (ami-da0cf8b3, kernel 2.6.32-309-ec2); so far, after more than 24 hours, the issue has not appeared there, while it keeps occurring on the new AMI.
I am currently testing kernel 2.6.32-314-ec2 to see whether it shows the same behavior.

ProblemType: Bug
DistroRelease: Ubuntu 10.04
Package: linux-ec2 2.6.32.312.13
ProcVersionSignature: Ubuntu 2.6.32-312.24-ec2 2.6.32.27+drm33.12
Uname: Linux 2.6.32-312-ec2 x86_64
Architecture: amd64
Date: Wed Mar 23 14:06:15 2011
Ec2AMI: ami-fa01f193
Ec2AMIManifest: ubuntu-images-us/ubuntu-lucid-10.04-amd64-server-20110201.1.manifest.xml
Ec2AvailabilityZone: us-east-1b
Ec2InstanceType: m1.large
Ec2Kernel: aki-427d952b
Ec2Ramdisk: unavailable
ProcEnviron:
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: linux-meta-ec2

Revision history for this message
David Creemer (dcreemer+launchpad) wrote :

I've been experiencing the same symptoms, though I believe it's related to Java. We're using sun-java6 on 2.6.32-305-ec2 (and also 2.6.32-314-ec2), Ubuntu 10.04.1 x86_64. The problem seems to be exacerbated under load (e.g. when running Hadoop jobs).

This (http://<email address hidden>/msg08703.html) seems related.

Revision history for this message
Stefan Bader (smb) wrote :

This is a general request for testing help with Ubuntu 10.04 (Lucid) on EC2. If there are other problem reports about strange process CPU times, or hangs without any message, it would be good to pass this on.
Trying to pull in a wide range of upstream fixes, I have prepared linux-ec2 packages that include all the fixes to the Xen code that could be found elsewhere. This is quite a big change, so while I have successfully booted the test kernels on various EC2 instance types, booting alone does not provide much insight into what happens under more or less real workloads. So anybody willing and able to help with testing would be highly welcome.

The kernel packages are at: http://people.canonical.com/~smb/lucid-ec2-ng/
To use them: http://uec-images.ubuntu.com/lucid/current/ has current AMIs, and launching with --kernel set to a matching AKI from https://lists.ubuntu.com/archives/ubuntu-cloud/2010-December/000466.html will boot via pv-grub. The new kernel can then be installed via dpkg and activated by rebooting.
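For clarity, here is a hedged sketch of the whole sequence using the classic EC2 API tools; the AMI/AKI IDs and the package filename are placeholders, not real values (pick a current Lucid AMI and the matching pv-grub AKI from the links above):

  # Launch a Lucid instance that boots via pv-grub (--kernel as described above):
  ec2-run-instances ami-XXXXXXXX --kernel aki-XXXXXXXX \
      --instance-type m1.large --key my-keypair
  # On the instance, fetch and install a test kernel package, then reboot into it:
  wget http://people.canonical.com/~smb/lucid-ec2-ng/linux-image-VERSION-ec2_amd64.deb
  sudo dpkg -i linux-image-VERSION-ec2_amd64.deb
  sudo reboot
  # After reboot, confirm the running kernel:
  uname -r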

Please report back anything that seems to be worse or better than before. Thanks!

Revision history for this message
Peter Værlien (VT) (peter-verdandetechnology) wrote :

We have been experiencing something similar with a custom image built with python-vmbuilder, using kernel aki-2407f24d (2.6.32-308.15).

During the hang, the following repeats in kern.log a number of times:

Aug 24 18:13:38 ip-10-83-47-71 kernel: [28107.057125] INFO: task java:4479 blocked for more than 120 seconds.
Aug 24 18:13:38 ip-10-83-47-71 kernel: [28107.057139] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 24 18:13:38 ip-10-83-47-71 kernel: [28107.057144] java D 0000000000000002 0 4479 1230 0x00000000
Aug 24 18:13:38 ip-10-83-47-71 kernel: [28107.057149] ffff8801dda81e00 0000000000000282 0000000000000000 ffff8801dda81d80
Aug 24 18:13:38 ip-10-83-47-71 kernel: [28107.057152] ffffffff81338023 ffff8801dda81dc8 ffff8801dcfccab8 ffff8801dda81fd8
Aug 24 18:13:38 ip-10-83-47-71 kernel: [28107.057155] ffff8801dcfcc700 ffff8801dcfcc700 ffff8801dcfcc700 ffff8801dda81fd8
Aug 24 18:13:38 ip-10-83-47-71 kernel: [28107.057158] Call Trace:
Aug 24 18:13:38 ip-10-83-47-71 kernel: [28107.057169] [<ffffffff81338023>] ? cpumask_next_and+0x23/0x40
Aug 24 18:13:38 ip-10-83-47-71 kernel: [28107.057175] [<ffffffff813a695b>] ? xen_spin_kick+0x4b/0x130
Aug 24 18:13:38 ip-10-83-47-71 kernel: [28107.057181] [<ffffffff810383f8>] ? check_preempt_wakeup+0x2a8/0x3b0
Aug 24 18:13:38 ip-10-83-47-71 kernel: [28107.057186] [<ffffffff814b0587>] ? _spin_unlock_irqrestore+0x77/0x90
Aug 24 18:13:38 ip-10-83-47-71 kernel: [28107.057189] [<ffffffff814aff8d>] rwsem_down_failed_common+0xbd/0x240
Aug 24 18:13:38 ip-10-83-47-71 kernel: [28107.057191] [<ffffffff814b0166>] rwsem_down_read_failed+0x26/0x30
Aug 24 18:13:38 ip-10-83-47-71 kernel: [28107.057195] [<ffffffff81341ec4>] call_rwsem_down_read_failed+0x14/0x30
Aug 24 18:13:38 ip-10-83-47-71 kernel: [28107.057197] [<ffffffff814af302>] ? down_read+0x12/0x20
Aug 24 18:13:38 ip-10-83-47-71 kernel: [28107.057200] [<ffffffff814b2dc4>] do_page_fault+0x2f4/0x390
Aug 24 18:13:38 ip-10-83-47-71 kernel: [28107.057203] [<ffffffff814b0a48>] page_fault+0x28/0x30
Aug 24 18:13:38 ip-10-83-47-71 kernel: [28107.057205] INFO: task java:4480 blocked for more than 120 seconds.
Aug 24 18:13:38 ip-10-83-47-71 kernel: [28107.057209] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Aug 24 18:13:38 ip-10-83-47-71 kernel: [28107.057217] java D 0000000000000001 0 4480 1230 0x00000000
Aug 24 18:13:38 ip-10-83-47-71 kernel: [28107.057220] ffff8801dd603e00 0000000000000282 0000000000000035 ffff8801dd603d80
Aug 24 18:13:38 ip-10-83-47-71 kernel: [28107.057223] 0000000000000000 ffff8801dd603dc8 ffff8801dd70aa38 ffff8801dd603fd8
Aug 24 18:13:38 ip-10-83-47-71 kernel: [28107.057226] ffff8801dd70a680 ffff8801dd70a680 ffff8801dd70a680 ffff8801dd603fd8
Aug 24 18:13:38 ip-10-83-47-71 kernel: [28107.057228] Call Trace:
Aug 24 18:13:38 ip-10-83-47-71 kernel: [28107.057231] [<ffffffff814aff8d>] rwsem_down_failed_common+0xbd/0x240
Aug 24 18:13:38 ip-10-83-47-71 kernel: [28107.057234] [<ffffffff81037178>] ? set_next_entity+0x88/0x90
Aug 24 18:13:38 ip-10-83-47-71 kernel: [28107.057236] [<ffffffff814b0166>] rwsem_down_read_failed+0x26/0x30
Aug 24 18:13:38 ip-10-83-...
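(As an aside for anyone collecting similar traces: these messages come from the kernel's hung-task detector, and the xen_spin_kick frame above points at the Xen paravirtualized spinlock path. The detector can be tuned via sysctl so stalls are reported sooner and more completely; a sketch with suggested values only:)

  # Report blocked tasks after 60s instead of the default 120s, and raise the
  # warning budget so longer stalls are not silently truncated (suggested values).
  sudo sysctl -w kernel.hung_task_timeout_secs=60
  sudo sysctl -w kernel.hung_task_warnings=100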


Changed in linux-ec2 (Ubuntu):
status: New → Confirmed