Review EBS mount procedures

Bug #485563 reported by Josh Koenig
This bug affects 3 people
Affects           Status     Importance  Assigned to  Milestone
PANTHEON Mercury  Confirmed  Undecided   Greg Coit    (none)

Bug Description

Currently there are some open questions about the process used to create EBS mounts. It's also unclear whether a server survives a reboot after an EBS volume has been mounted. This is possibly a known issue with EC2/EBS, but we should investigate.

I just set up a server, mounted EBS, attempted a reboot, and found the instance to be unresponsive.
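
For context, a minimal sketch of the kind of mount procedure in question (the device name, mount point, and IDs below are placeholders, not the exact values from the affected instance):

    # Attach the EBS volume with the EC2 API tools, then format and mount it.
    ec2-attach-volume vol-xxxxxxxx -i i-xxxxxxxx -d /dev/sdh
    mkfs.xfs /dev/sdh
    mkdir -p /vol
    mount -t xfs /dev/sdh /vol

    # Persist the mount across reboots; this fstab entry is what makes the
    # reboot-survival question relevant.
    echo '/dev/sdh  /vol  xfs  noatime  0 0' >> /etc/fstab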

Tags: 0.9 ec2
Revision history for this message
Josh Koenig (joshkoenig) wrote :

Here's some of the syslog (from the EC2 console) from the unresponsive instance.

SGI XFS Quota Management subsystem
 * Setting kernel variables (/etc/sysctl.conf)...                        [ OK ]
 * Setting kernel variables (/etc/sysctl.d/10-console-messages.conf)...  [ OK ]
 * Setting kernel variables (/etc/sysctl.d/10-network-security.conf)...  [ OK ]
 * Activating swap...                                                    [ OK ]
 * Checking root file system...
   fsck 1.41.4 (27-Jan-2009)
   /dev/sda1: clean, 36224/655360 files, 370976/2621440 blocks           [ OK ]
 * Checking file systems...
   fsck 1.41.4 (27-Jan-2009)                                             [ OK ]
 * Mounting local filesystems...
   mount: none already mounted or /sys busy
   mount: according to mtab, sysfs is already mounted on /sys
BUG: soft lockup detected on CPU#0!
BUG: soft lockup detected on CPU#0!

Greg Coit (gregcoit)
Changed in projectmercury:
assignee: nobody → Greg Coit (gregcoit)
status: New → Confirmed
Revision history for this message
Ranger Harke (rharke) wrote :

This sounds like the issue mentioned here (and probably in other places):
http://developer.amazonwebservices.com/connect/thread.jspa?threadID=24402

It appears to be some sort of negative interaction between EC2, EBS, and the xfs filesystem. I have been seeing it about 25% of the time when bringing up Mercury 32-bit 0.81 with an xfs-based EBS volume (using scripts of my own crafting). It is also not limited to boot time; it sometimes happens after mounting the xfs volume manually.

I switched our instance to ext3 and the problem went away. A shame, since I'd really prefer to use xfs, but what can you do...

Oddly, I haven't experienced this on some of our other production instances running a different AMI with a slightly different kernel. But that may just be luck...
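
A minimal sketch of the ext3 switch described above (device name and mount point are placeholders; reformatting destroys existing data, so snapshot or copy the contents off first):

    umount /vol
    mkfs.ext3 /dev/sdh            # reformat the EBS volume as ext3
    mount -t ext3 /dev/sdh /vol
    # and update the /etc/fstab entry to match:
    # /dev/sdh  /vol  ext3  noatime  0 0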

Revision history for this message
Thomas Bonte (toemaz) wrote :

I have seen the same syslog output a couple of times now. When it occurs, it's (almost) impossible to bring the instance back alive: there is no SSH access, and rebooting via the AWS console or the ElasticFox plugin doesn't help either.

Further on in the syslog, I get the following output, which might be related (not sure):
------------[ cut here ]------------
kernel BUG at arch/i386/mm/hypervisor.c:185!
invalid opcode: 0000 [#1]
SMP
last sysfs file: /block/sdh/size
Modules linked in: ipv6(F)(U) xfs(F)(U) xennet(F)(U) xenblk(F)(U) ext3(F)(U) jbd(F)(U) mbcache(F)(U)
CPU: 0
EIP: 0061:[<c101bf0e>] Tainted: GF VLI
EFLAGS: 00210282 (2.6.21.7-2.fc8xen-ec2-v1.0 #2)
EIP is at xen_pgd_pin+0x54/0x5e
eax: ffffffea ebx: c2a28ef8 ecx: 00000001 edx: 00000000
esi: 00007ff0 edi: 00000000 ebp: ed2ab630 esp: c2a28ef8
ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0069
Process apache2 (pid: 16864, ti=c2a28000 task=ece17210 task.ti=c2a28000)
Stack: 00000002 00364984 20739000 0040e720 00000000 c101816d c03e0b00 c1018196
       c2b6ebcc c102250e c2b6e718 c2a28fb8 c106e159 c2a28fb8 bfe70100 01200011
       00000000 ed2ab630 c03e0b00 ecc73a80 c2b6ebd8 c2b6ebec c2b6ebe8 c03e0b00
Call Trace:
 [<c101816d>] __pgd_pin+0x2f/0x3c
 [<c1018196>] mm_pin+0x1c/0x23
 [<c102250e>] copy_process+0xac3/0x10bc
 [<c106e159>] kmem_cache_alloc+0x6b/0x98
 [<c1022b58>] do_fork+0x51/0x13a
 [<c100321e>] sys_clone+0x36/0x3b
 [<c1005688>] syscall_call+0x7/0xb
 [<c1200000>] __sched_text_start+0x240/0x83f
 =======================
Code: eb fe a1 24 c2 38 c1 8b 14 90 81 e2 ff ff ff 7f 89 54 24 04 89 e3 b9 01 00 00 00 31 d2 be f0 7f 00 00 e8 36 54 fe ff 85 c0 79 04 <0f> 0b eb fe 83 c4 0c 5b 5e c3 56 89 c2 53 83 ec 0c c1 ea 0c 80
EIP: [<c101bf0e>] xen_pgd_pin+0x54/0x5e SS:ESP 0069:c2a28ef8
------------[ cut here ]------------

Searching on Google, I found this bug, which relates to a kernel issue: https://bugzilla.redhat.com/show_bug.cgi?id=335961

Revision history for this message
Thomas Bonte (toemaz) wrote :

@Ranger what slightly different kernel are you having success with? With the Mercury 0.81 beta, we have had 3 frozen/killed instances so far, all with kernel Linux version 2.6.21.7-2.fc8xen-ec2-v1.0. Before 0.81, we didn't encounter this problem.

Revision history for this message
Josh Koenig (joshkoenig) wrote : Re: [Bug 485563] Re: Review EBS mount procedures

FWIW, this is a lower-level Amazon issue with the kernel version included in some AMIs and XFS as a mounted EBS filesystem. We're going to have a better set of recommendations soon.

Revision history for this message
Thomas Bonte (toemaz) wrote :

Using an EBS volume formatted with ext3 does solve the problem. I'm looking forward to a long-term solution. Thanks, Josh.

Revision history for this message
Devin Poolman (devinpoolman) wrote :

I'm having this issue now too. I've been able to bring the instance back without terminating it by rebooting the instance and detaching the EBS volume. Is switching to an ext3-formatted EBS volume really the only option?
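
A sketch of the recovery sequence Devin describes, using the EC2 API tools (instance and volume IDs are placeholders; the forced detach is needed because the locked-up instance can't unmount cleanly):

    ec2-reboot-instances i-xxxxxxxx
    ec2-detach-volume vol-xxxxxxxx --force
    # once the instance answers SSH again, the volume can be reattached:
    ec2-attach-volume vol-xxxxxxxx -i i-xxxxxxxx -d /dev/sdh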

Revision history for this message
Josh Koenig (joshkoenig) wrote :

This is a kernel-level issue. Until we can get a new version out with an updated kernel, you should use ext3 as the filesystem.

Revision history for this message
macrocosm (josh-elementaltek) wrote :

Is this fixed in the latest release? I just marched down this path, and while my instance was still accessible, I lost my drives: the superblocks were overwritten and corrupted. It's a good thing I have snapshots.

I just wanted to know whether this was fixed in the latest release, and whether I should still use the ext3 filesystem. Supposedly snapshots are not very straightforward on ext3.
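
On the snapshot point: xfs is popular on EBS precisely because xfs_freeze can quiesce the filesystem for a consistent snapshot, while ext3 on kernels of this era exposes no equivalent userspace freeze, so the usual fallback is to sync (or briefly unmount) before snapshotting. A rough sketch of both patterns, with placeholder IDs and paths:

    # xfs: freeze, snapshot, thaw
    xfs_freeze -f /vol
    ec2-create-snapshot vol-xxxxxxxx
    xfs_freeze -u /vol

    # ext3: best effort without a freeze; flush dirty buffers first
    sync
    ec2-create-snapshot vol-xxxxxxxx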
