maverick on ec2 64bit ext4 deadlock

Bug #666211 reported by Timo Derstappen
This bug affects 13 people
Affects: linux (Ubuntu)
Status: Won't Fix
Importance: High
Assigned to: Unassigned

Bug Description

I created an AMI from the official image (ami-505c6924 – region eu-west-1). The rebundled image works fine for a while but ends up with nearly 100% iowait and a load of 6000.

The instance type is m1.xlarge without ebs.

There are errors in the amazon console before the kernel starts (see attached boot.log):
Failed to read /local/domain/0/backend/vbd/161/2049/feature-barrier.
Failed to read /local/domain/0/backend/vbd/161/2049/feature-flush-cache.

If you are lucky you see errors in the syslog like this:
INFO: task jbd2/sda1-8:235 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/sda1-8 D ffff880003bcd980 0 235 2 0x00000000
ffff8801b5f49b20 0000000000000246 0000000000000000 0000000000015980
ffff8801b5f49fd8 0000000000015980 ffff8801b5f49fd8 ffff8801b571db80
0000000000015980 0000000000015980 ffff8801b5f49fd8 0000000000015980

The machines are unusable after a few hours. I am testing those images right now. There is no heavy load expected. Karmic images work fine. Packages I've installed are build-essential, git-core, ruby, nginx and couchdb. Node.js is compiled manually.

The same error is described on Alestic by Jay Freeman; unfortunately, he didn't open a bug here:
http://alestic.com/2010/10/ec2-ubuntu-maverick#comment-484

AMI: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=4350

Revision history for this message
Jay Freeman (saurik) (saurik) wrote :

We are also getting the following error message from the kernel:

JBD: barrier-based sync failed on sda1-8 - disabling barriers

My theory was that this has to do with the root partition being ext4. I do not know much about bundling AMIs: is this something that is easy for you to change/test with your rebundled AMIs?

Revision history for this message
Jay Freeman (saurik) (saurik) wrote :

Somehow I thought I said this in the last comment, but I see that I didn't: "using the instance's normal root partition, not an EBS root boot".

Revision history for this message
Scott Moser (smoser) wrote :

Jay,
  Could you please get console output of a system that reproduces this bug and attach it here?

Revision history for this message
Scott Moser (smoser) wrote :

I'm attaching a dmesg log of an m1.xlarge in eu-west-1.
I've tried to reproduce this with a couple of benchmarking utilities (stress and dbench) to add load, but have not been able to. I know they're just benchmarks, but I can't trigger it on this one instance here.

Is there any info you could give to help us reproduce?

Revision history for this message
Timo Derstappen (teemow) wrote :

Scott, did you bundle a new AMI from the existing UEC image? That is what we did, and that is probably where the error comes from. Which tools do you use for that? The original instance where I installed all the software and bundled from is still running and hasn't had any issues. Only the cloned instances failed.

Here is a rough description of what I did using ami-tools:
 * Replaced root device label uec-rootfs in menu.lst and fstab with /dev/sda1
 * ec2-bundle-vol -d /mnt --block-device-mapping "root=/dev/sda1" -r x86_64
 * ec2-upload-bundle ...

I replaced the label because otherwise the bundled AMIs failed with the message "root device uec-rootfs not found", and I didn't find another solution for that.
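For illustration, a rough sketch of that replacement (the exact LABEL= token in menu.lst and fstab may differ on your image, so check before running):

  # swap the label reference for the plain device name in both files
  sed -i 's#LABEL=uec-rootfs#/dev/sda1#g' /boot/grub/menu.lst
  sed -i 's#LABEL=uec-rootfs#/dev/sda1#g' /etc/fstab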

Is there a better way to do that? What do you recommend?

Revision history for this message
Scott Moser (smoser) wrote :

Timo,
  If you're booting with the correct kernel (which it appears you are), then I can't imagine that bundle-vol and upload-bundle is what is causing this.
  I've seen similar issues to what is described here in bug 567334. There, I do not see the issue reproducibly, though.
  Do you see this in an easily reproducible manner? I have some 368 logs of maverick ec2 boots in lp:~ubuntu-on-ec2/ubuntu-on-ec2/ec2-test-results/ (looking at files named console-term.txt). 8 of them have the 'jbd2/sda1...blocked' message, but none of them show a hex dump like the one in the summary. All 8 were x86_64 and instance-store: 2 were m1.xlarge, 1 was c1.xlarge, and 5 were m1.large.

Regarding rebundling, I would suggest using the images available at http://uec-images.ubuntu.com, mounting them loopback, making your changes via chroot, and then running ec2-bundle-image and ec2-upload-bundle. This is especially true if you're automating your changes. It's simply a cleaner starting point than a booted system, and it will retain the filesystem label as well.
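For anyone following along, a rough sketch of that workflow (the image filename is an example, and the ec2-bundle-image credential arguments are omitted):

  # download and unpack a released UEC image, then loopback-mount its rootfs
  tar xzf maverick-server-uec-amd64.img.tar.gz
  mkdir -p /mnt/uec
  mount -o loop maverick-server-uec-amd64.img /mnt/uec
  # make changes from inside the image via chroot
  cp /etc/resolv.conf /mnt/uec/etc/resolv.conf
  chroot /mnt/uec apt-get update
  chroot /mnt/uec apt-get install -y build-essential nginx
  umount /mnt/uec
  # bundle and upload the modified image (keys, cert, user and bucket left out)
  ec2-bundle-image -i maverick-server-uec-amd64.img -r x86_64 -d /mnt ...
  ec2-upload-bundle ...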

Scott Moser (smoser)
affects: linux-meta (Ubuntu) → linux (Ubuntu)
Changed in linux (Ubuntu):
importance: Undecided → High
status: New → Confirmed
tags: added: amd64 ec2-images
Revision history for this message
Jay Freeman (saurik) (saurik) wrote :

@scott: I have attached a dmesg dump from a system that had failed.

Revision history for this message
Stefan Bader (smb) wrote :

Adding a few comments here as they come to my mind:

The message about barrier-based sync failing is just informational and can be safely ignored.

Reading through the dmesg from comment #10 and comparing to one gathered from a daily server instance boot:

[ 0.000000] Xen version: 3.0.3-rc5-8.1.14.f (mine was 3.0.3-rc5-8.el5)
...
[ 0.000000] trying to map vcpu_info 0 at ffff880003bc3020, mfn 124f8a, offset 32
[ 0.000000] register_vcpu_info failed: err=-38

Did not see this error in my log.

[ 0.016933] CPU: Physical Processor ID: 0
[ 0.016939] CPU: Processor Core ID: 0

Not sure this is really relevant; the hardware I booted on seemed to have only 2 CPUs and showed a warning about an unsupported number of siblings (4).

[ 0.103804] alloc irq_desc for 16 on node 0
[ 0.103806] alloc kstat_irqs on node 0

Did not see messages like this either, but I suspect my hardware was an AMD dual core while this might be an Intel quad core.

[ 0.171411] intel_idle: MWAIT substates: 0x2220
[ 0.171413] intel_idle: does not run on family 6 model 23

This confirms the previous suspicion. At least it refuses here instead of crashing. Then mostly normal things. The only strange thing is the name of the device in the barrier-based sync failed message: sda1-8. Unfortunately, the way Xen works there is no partition detection in the log, but this at least sounds like sda has 8 partitions...

The following stack traces look very much like something deadlocking on flushing. The jbd2 tasks are transactions; what I am not sure about is pgbouncer (what is it supposed to be doing?). However, it seems to involve aio, and I see there are two patches in 2.6.35-23.36 which address aio completion ordering (coming from the 2.6.35.5 upstream stable release).

So, Jay, do you know what pgbouncer is doing? Maybe it is something not used in the common images. If so, it may make sense to test with the newer kernel versions.

Revision history for this message
Jay Freeman (saurik) (saurik) wrote :

Stefan,

I'm not certain what you mean by "something not used in the common images". To be clear, I do not even know how to make my own images. That's not to say I couldn't figure it out very quickly, but I never have, as I personally do not think that is a good way of using EC2: I instead boot stock Ubuntu images (ami-688c7801 in this case) and then install packages on them. Currently, I do not run any "custom" software other than my Python web application: everything I install comes from the default Ubuntu repositories.

(If required, I can provide the exact set of commands that I manually run to set up one of these EC2 servers. Unfortunately, I have so far been unable to fully automate the bootup of these systems, as the server tends to lock up on me while I'm doing the install, but I remember thinking it was an issue unrelated to this one. I will do some testing of the fully automated boot of these servers tonight and see if I can reproduce those lockups again to see if they look at all related.)

So if you mean "am I making my own AMI that has some kind of modified system software", the answer is "definitely not". However, while I think it is an amazing testament to modern engineering that I can reinstall and reboot computers from a thousand miles away, a completely stock system is a fairly useless endeavor: the stock images are very "stock" and don't do anything at all, as far as I know, out of the box. To be putting any kind of load on them at all, I'm certainly installing some software, even if that software is just a shell-script fork bomb.

In my particular case, I install apache2 with mod_python and "pgbouncer" (from Ubuntu universe), a program from Skype that provides a PostgreSQL pooling proxy server. My python application connects to pgbouncer (which is listening on a named Unix socket and pretends to be PostgreSQL) instead of directly to my database server, which then keeps its own pool of connections to the actual database. This makes the 3200 Apache threads that I normally have running able to rapidly get a database connection without trying to coordinate local in-process pools.

Put differently: pgbouncer should be a fairly boring user-land process. If you are looking at it thinking it is some kind of cool kernel task that increases network security (maybe "bouncing" bad clients) or something, it (maybe sadly ;P) isn't. This software is actually the least popular (but I personally feel best ;P) choice in PostgreSQL pooling proxy servers, an already fairly narrow niche; therefore, I would find it highly unlikely if Timo was also using it, but maybe he will chime in with a "yeah! pgbouncer is AWESOME!" and prove me wrong.

-J

Revision history for this message
Jay Freeman (saurik) (saurik) wrote :

Stefan,

To be clear, both Timo and I were using m1.xlarge instances, which are supposed to have four cores. (You mentioned the hardware you were testing on only had two cores, and therefore you weren't getting the same seemingly-bad and probably-should-be-fixed error messages.)

Also, that log was saved from two weeks ago when I was running into this issue: the difference in Xen version could theoretically then be that Amazon has upgraded their system since then. (I have no idea if that even makes sense, but I figure I'll throw it out there as a possibility.)

-J

Revision history for this message
Jay Freeman (saurik) (saurik) wrote :

Stefan,

These stack traces for pgbouncer are all in sys_write(), by the way, which is then backed by ext4. Both from what I know about how pgbouncer operates and from grepping through its source code, the only file-backed operation it performs is writing to its log file, which it normally does only once a minute unless it is encountering some kind of connection failures.

It looked (and still looks) to me like the filesystem is simply locking up. It should be noted that my dmesg log also includes another process that got stuck: run-parts, which got blocked in a call to sys_getdents().

Also, I looked into the AIO completion-ordering change you mentioned, and it seems totally unrelated. The author of that patch referred to a reproduction of the bug they were fixing, which was a "you now read a bunch of zeros when you were expecting data" race condition, not a deadlock. Specifically, operations involving "unwritten extents" would claim to be "completed" via AIO while they were still pending; the reordering fixed this.

http://www.spinics.net/lists/linux-ext4/msg19590.html
http://thread.gmane.org/gmane.comp.file-systems.ext4/19659

-J

Revision history for this message
Stefan Bader (smb) wrote :

Jay,

To hopefully answer all questions: the "non-standard" question was aimed at understanding what additional packages were installed and how the system is being used when things happen. It sometimes helps to understand the problem better when you know exactly how the system is in use. Knowing the Xen version and the hardware may (or may not) help as well. When I use an m1.xlarge, for example, I seem to get more memory (16G) and fewer CPUs (4, whether those are real cores or hyperthreads). Not that it should normally matter with this deadlock, but sometimes it does.

So for the deadlock, I will have to follow the traces more closely. But at least knowing that pgbouncer is a user-space daemon that does some DB proxying, and seeing aio somewhere in the traces, gives some hint about what types of fs access are involved.

Revision history for this message
Jay Freeman (saurik) (saurik) wrote :

In case this saves anyone's time: the top of those stack traces is garbage. Really, all of those processes are simply blocked in the scheduler: the second entry from the top in all the call stacks is a call to schedule() (which I presume scrambles the registers enough to confuse the stack tracer).

Revision history for this message
Stefan Bader (smb) wrote :

I think all of them are waiting on IO for one reason or another. There are also a lot of lines with '?', which usually means they cannot be fully trusted. The interesting/hard part is to understand how things ended up where they did and whether there possibly is some relationship between two of the blocked ones, or maybe even one process stepping on its own toes. Not that I have got that far yet.

Revision history for this message
Jay Freeman (saurik) (saurik) wrote :

Yes: I'm just telling you that the ? entries at the top of these stacks are all "in the scheduler". jbd2 and run-parts are blocked in io_schedule(), and pgbouncer is blocked in do_get_write_access(). Both of those functions are calling into schedule(), and that's what is actually at the top of the stack. (I disassembled the 2.6.35-22.35-virtual kernel and verified the call points from the non-? second entries down on the stack.)

Revision history for this message
Stefan Bader (smb) wrote :

Yes, you are right. There is one likely place in the path pgbouncer takes that will wait for a buffer to finish being written to disk. And the jbd2 task is waiting for a range of pages to be written out. Maybe related, but I cannot see a reason why this should deadlock. And the same is true for run-parts.

This leads to the question of whether we actually see an ext4 issue here. Unfortunately we have no clue what exactly is running on the other side (dom0). In the past I have seen kvm users having very similar issues on 2.6.32 hosts. There has been a lot of fiddling with the generic writeback interface. And even on bare metal we have seen very poor performance when multiple people ran IO-bound tasks (like kernel compiles, where one user could massively starve the others). This is why the Ubuntu 10.04 kernel has a huge pile of patches backported from 2.6.35.

Another lead could be some patches in recent 2.6.35 that fix a problem in Xen with lost interrupts. If we are waiting for pages to be written to disk and the completion interrupt gets lost, it would show up exactly like it does here.

commit a29059dc766af0bd2783614399972950fc99a99d
    xen: handle events as edge-triggered
...
    The most noticable symptom of these lost events is occasional lockups
    of blkfront.

So if it were the writeback issues on an older dom0, I would expect the messages to eventually go away (though it could take a really long time, potentially more than 10 minutes). This might be completely coincidental, but for some reason run-parts seems to vanish in the 4th batch of messages:

2040s: jbd2 and pgbouncer
2160s: jbd2, pgbouncer and run-parts
2280s: jbd2, pgbouncer and run-parts
2400s: jbd2 and pgbouncer

If it were that and the messages eventually went away, then this would need to be addressed in dom0.

For the lost-interrupt case: that patch only changes the handler, so I guess changing the domU should be effective. As maverick instances use pv-grub, it is simple to try. If you boot your instance and install the other software, you can also do a

wget https://launchpad.net/ubuntu/+source/linux/2.6.35-23.37/+build/2033771/+files/linux-image-2.6.35-23-virtual_2.6.35-23.37_amd64.deb

to download a kernel that includes that (amongst other things) fix. Then you can reboot the instance into that kernel and see whether it shows the issue or seems to solve it.
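For example, roughly (a sketch; pv-grub boots whatever the image's /boot/grub/menu.lst lists, which the kernel package's hooks should update on install):

  sudo dpkg -i linux-image-2.6.35-23-virtual_2.6.35-23.37_amd64.deb
  grep ^title /boot/grub/menu.lst   # check the 2.6.35-23 entry is present/default
  sudo reboot
  uname -r                          # after reboot, should report 2.6.35-23-virtual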

Revision history for this message
Zach Bailey (znbailey) wrote :

I am also having this problem running the 64-bit EBS-backed Alestic EC2 image ami-548c783d on c1.xlarge.

Here is how I am setting up my machine:

1.) Boot a fresh instance
2.) Install the Sun/Oracle Java6 JDK
3.) Download the heritrix web crawler from http://builds.archive.org:8080/maven2/org/archive/heritrix/heritrix/3.1.1-SNAPSHOT/ - heritrix is a Java program which runs in user land, crawls web sites, and writes the results out to disk. Heritrix is installed into /mnt/heritrix because it writes a very large BerkeleyDB database whose size exceeds the EBS device, which is only 15 GB.
4.) Mount a 100 GB EBS volume formatted as ext3 to hold the heritrix crawl results
5.) Start heritrix with a 5 GB max heap (-Xmx5g) and start a crawl job to crawl a couple hundred thousand web sites

Invariably, within a couple of hours, the machine "hangs" and any attempt at I/O blocks eternally. The following warnings/errors appear in /var/log/syslog:

Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290051] INFO: task kjournald:614 blocked for more than 120 seconds.
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290068] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290076] kjournald D ffff880003c9f980 0 614 2 0x00000000
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290082] ffff8801b63d9c30 0000000000000246 0000000000000000 0000000000015980
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290088] ffff8801b63d9fd8 0000000000015980 ffff8801b63d9fd8 ffff8801b7dfadc0
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290092] 0000000000015980 0000000000015980 ffff8801b63d9fd8 0000000000015980
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290097] Call Trace:
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290127] [<ffffffff8117d510>] ? sync_buffer+0x0/0x50
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290133] [<ffffffff815a20f3>] io_schedule+0x73/0xc0
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290136] [<ffffffff8117d555>] sync_buffer+0x45/0x50
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290139] [<ffffffff815a276f>] __wait_on_bit+0x5f/0x90
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290142] [<ffffffff8117c281>] ? submit_bh+0x111/0x140
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290145] [<ffffffff8117d510>] ? sync_buffer+0x0/0x50
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290148] [<ffffffff815a2818>] out_of_line_wait_on_bit+0x78/0x90
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290153] [<ffffffff8107f0c0>] ? wake_bit_function+0x0/0x40
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290156] [<ffffffff8117d506>] __wait_on_buffer+0x26/0x30
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290161] [<ffffffff812205b1>] journal_commit_transaction+0x2f1/0xe30
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290166] [<ffffffff81006afd>] ? __raw_callee_save_xen_irq_disable+0x11/0x1e
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290169] [<ffffffff81006adf>] ? __raw_callee_save_xen_restore_fl+0x11/0x1e
Nov 15 15:27:03 ...

Revision history for this message
MatthiasP (mpdude) wrote :

Not quite sure if it's the same bug, but maybe this adds some more data points to the set.

AMI ID is ami-405c6934, which is the eu-west-1 EBS-backed variant of the image in question.

Just a plain instance boot, connecting four EBS volumes, assembling them into /dev/md0, putting XFS on top, and running a benchmark as described in http://www.mysqlperformanceblog.com/2009/08/06/ec2ebs-single-and-raid-volumes-io-bencmark/.
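For context, a hedged sketch of that kind of setup (device names, RAID level, and mount point are illustrative only, not taken from the report):

  # stripe four attached EBS volumes into one md device and put XFS on top
  mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/sdf /dev/sdg /dev/sdh /dev/sdi
  mkfs.xfs /dev/md0
  mkdir -p /mnt/bench
  mount /dev/md0 /mnt/bench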

When doing the same (on the same volumes) from a 32-bit AMI on a .small instance, the problem does not occur (at least not over a few days), whereas the 64-bit AMI crashed within hours.

Dec 5 22:00:55 ip-10-234-243-114 kernel: [ 6.642845] JBD: barrier-based sync failed on sda1-8 - disabling barriers
Dec 5 22:01:02 ip-10-234-243-114 kernel: [ 13.890171] eth0: no IPv6 routers present
Dec 5 22:04:14 ip-10-234-243-114 kernel: [ 205.786453] SGI XFS with ACLs, security attributes, realtime, large block/inode numbers, no debug enabled
Dec 5 22:04:14 ip-10-234-243-114 kernel: [ 205.789239] SGI XFS Quota Management subsystem
Dec 5 22:04:14 ip-10-234-243-114 kernel: [ 205.790489] Filesystem "md0": Disabling barriers, trial barrier write failed
Dec 5 22:04:14 ip-10-234-243-114 kernel: [ 205.807777] XFS mounting filesystem md0
Dec 5 22:04:15 ip-10-234-243-114 kernel: [ 206.368403] Ending clean XFS mount for filesystem: md0
Dec 5 22:07:36 ip-10-234-243-114 kernel: [ 407.339573] Filesystem "md0": Disabling barriers, trial barrier write failed
Dec 5 22:07:36 ip-10-234-243-114 kernel: [ 407.340479] XFS mounting filesystem md0
Dec 5 22:07:36 ip-10-234-243-114 kernel: [ 407.593179] Ending clean XFS mount for filesystem: md0
Dec 5 22:17:01 ip-10-234-243-114 CRON[1222]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Dec 5 22:31:54 ip-10-234-243-114 kernel: [ 1865.812490] XFS mounting filesystem md0
Dec 5 22:31:54 ip-10-234-243-114 kernel: [ 1866.061941] Ending clean XFS mount for filesystem: md0
Dec 5 23:04:48 ip-10-234-243-114 kernel: [ 3840.190047] INFO: task flush-9:0:1272 blocked for more than 120 seconds.
Dec 5 23:04:48 ip-10-234-243-114 kernel: [ 3840.190061] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 5 23:04:48 ip-10-234-243-114 kernel: [ 3840.190068] flush-9:0 D ffff880003e7d980 0 1272 2 0x00000000
Dec 5 23:04:48 ip-10-234-243-114 kernel: [ 3840.190072] ffff88014d79b640 0000000000000246 ffff880100000000 0000000000015980
Dec 5 23:04:48 ip-10-234-243-114 kernel: [ 3840.190077] ffff88014d79bfd8 0000000000015980 ffff88014d79bfd8 ffff8801d58316e0
Dec 5 23:04:48 ip-10-234-243-114 kernel: [ 3840.190081] 0000000000015980 0000000000015980 ffff88014d79bfd8 0000000000015980
Dec 5 23:04:48 ip-10-234-243-114 kernel: [ 3840.190084] Call Trace:
Dec 5 23:04:48 ip-10-234-243-114 kernel: [ 3840.190095] [<ffffffff815a20f3>] io_schedule+0x73/0xc0
Dec 5 23:04:48 ip-10-234-243-114 kernel: [ 3840.190099] [<ffffffff812a2f1c>] get_request_wait+0xcc/0x1a0
Dec 5 23:04:48 ip-10-234-243-114 kernel: [ 3840.190104] [<ffffffff8107f080>] ? autoremove_wake_function+0x0/0x40
Dec 5 23:04:48 ip-10-234-243-114 kernel: [ 3840.190107] [<ffffffff812a3083>] __make_reques...

Revision history for this message
Paul Bohm (bohmps) wrote :

Previously reported as #666211; completely different userspace setup, same problem. Any workarounds?

Revision history for this message
Jay Freeman (saurik) (saurik) wrote :

Other people experiencing this issue may want to explicitly note the comment on this bug from Stefan (which is now somewhat buried) regarding the potential Xen IRQ misconfiguration in these kernels, and attempt that fix. Unfortunately, my test case is "run my business on this system for a day and wait for my website to go offline and millions of users to complain" ;P.

However, for what it is worth, I do not believe the current 2.6.37 kernel from Natty (2.6.37-12.26) experiences this issue. I was building a new m2.4xlarge database server, which the Maverick kernels do not support correctly (bug 667796), and therefore started experimenting with upgrading just the kernel to Natty (with an APT Pin and all that).

As the system seemed stable while building it and I knew the load characteristics would be drastically different on it than my previous web server backends, I decided to go ahead with it. Then, after that worked out for a while, I decided to risk moving all of my web backend boxes (the ones for which I was experiencing this issue) to that kernel as well.

So far I've been very happy with the results. Until now, I've been forced to use Karmic, with its non-pv-grub kernel, field-upgraded to Maverick. The kernels from Lucid were not acceptable (a seemingly serious performance regression), and Maverick has been "right out" (given this horrendous I/O lock-up issue, and the m2.4xlarge 64GB no-go).

Revision history for this message
Sesshomurai (darren-ontrenet) wrote :

Is it useful to know whether this problem happens in other distros' AMIs with the same kernel rev? Or is this specifically an Ubuntu AMI issue?

Revision history for this message
John Johansen (jjohansen) wrote :

Sesshomurai,

It would indeed be useful to know if this is seen in other distro kernels, but at the moment I don't believe enough testing of the other distro kernels has been done to determine that.

Revision history for this message
Adam Gandelman (gandelman-a) wrote :

I had been triggering a similar bug across a number of m1.large instances in different availability zones running 2.6.35-22-virtual. Mostly, the systems were running a number of rake tasks that were migrating data from one remote database to another and then either (1) logging data locally or (2) mv'ing small files to new locations on the root ext4 filesystem. In either case, I/O to the device would eventually block. Thinking it might be ext4-specific, I tried XFS and ext3, and both eventually ran into the same problem.

I wasn't able to reproduce this on-demand, but it happened consistently enough. I found I was able to reproduce it more easily using DRBD. Creating a simple DRBD cluster and initiating the initial synchronization between nodes, the secondary node (sync target) would eventually stall out while its backing device deadlocked.

After upgrading to 2.6.35-28-virtual across all instances, I found the issue gone. I can only assume it was resolved by the upstream fixes Stefan mentioned above:

  * xen: handle events as edge-triggered
  * xen: use percpu interrupts for IPIs and VIRQs

...which were applied to the 2.6.35-23 kernel.

Can anyone else confirm that upgrading from 2.6.35-22 to later maverick kernels resolves their issues?
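If someone wants to test this, a minimal sketch of the upgrade on a running Maverick instance (assuming the linux-image-virtual meta-package pulls the latest 2.6.35 kernel from maverick-updates):

  sudo apt-get update
  sudo apt-get install linux-image-virtual
  sudo reboot
  uname -r   # confirm the instance came back up on 2.6.35-28 (or later)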

Attached are some traces from 3 instances as well as a URL to a thread on the AWS support forum describing similar behavior.

https://forums.aws.amazon.com/thread.jspa?messageID=224301

Revision history for this message
Mikhail P (m-mihasya) wrote :

Still seeing this issue with 2.6.35-24-virtual.

Adam, you said you are using -28; perhaps there's a further fix somewhere in between? Are you still problem-free?

Revision history for this message
Paul Bohm (bohmps) wrote : Re: [Bug 666211] Re: maverick on ec2 64bit ext4 deadlock

I think barrier=0 in /etc/fstab fixed this for me, but there might have been confounding factors.
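For reference, a hedged example of what that entry could look like; the device and other options are illustrative, and note that disabling barriers trades some crash safety for stability/performance:

  # /etc/fstab (example entry only)
  /dev/sda1  /  ext4  defaults,barrier=0  0  1
  # apply without a reboot
  sudo mount -o remount,barrier=0 /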


Revision history for this message
Mikhail P (m-mihasya) wrote :

Paul, what else did you change?

Are you using RAID, or did you see this happening on individually mounted devices?

Thanks!

Revision history for this message
Adam Gandelman (gandelman-a) wrote :

Mikhail-

I am no longer working at the organization where I encountered this, but upgrading to the *-28 kernel solved the problem for the remaining month or so I was there. I would highly suggest upgrading to the latest 2.6.35 kernel, not just for this fix but for other important updates as well.

Revision history for this message
Paul Bohm (bohmps) wrote :

On individual devices.

I changed the app running on my machine, and at some point I might also have reinstalled, though the problem was previously also present after an earlier reinstall.

I think the barrier setting is the conclusion I came to after reading the source, and, without further testing, I think it is the reason I'm stable now.


Revision history for this message
David Taylor (david-taylor) wrote :

It sounds like I've bumped into the same problem (twice so far!) as MatthiasP (mpdude).

I'm using ami-af7e2eea in us-west-1 on c1.xlarge. I have 8 x 128GB EBS volumes in a RAID10 array using mdadm. At some point /dev/md0 would "freeze" and load would shoot up from <5 to >300-400.

Any attempts to interrogate the mounted filesystem would hang and be uninterruptible.

When I ran "mdadm --examine" on each of the devices it returned "state: clean" on all of them except one. That command would never return and I'd have to Ctrl-C to interrupt it.

I had the same log messages in /var/log/syslog, so I won't re-paste them here.

Any ideas what the cause is? Better yet, the fix? I see some suggestions that kernel upgrades might help. What are the prevailing thoughts on that, has it been confirmed?

Also, how did you perform the kernel upgrade? Did you build your own AMI or are you upgrading it after launch?

Cheers,
David.

Revision history for this message
Brandon Black (blblack) wrote :

As a matter of practicality, given this and other problematic bugs with EC2 on the stock Maverick and Lucid kernels (there are several that range from annoying to flat-out unreliable or un-(re)bootable), I had been running my Maverick-based instances with the Karmic kernels pretty successfully for a long time (by adding the Karmic repos to sources.list, installing linux-image-2.6.31-307-ec2 from them, and deleting the Maverick kernel as part of my cloud-init script on first boot).
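A rough sketch of that approach as it would look today (the archive line and the check for leftover Maverick kernels are assumptions, not the exact script described above):

  echo 'deb http://old-releases.ubuntu.com/ubuntu karmic main universe' | sudo tee /etc/apt/sources.list.d/karmic.list
  sudo apt-get update
  sudo apt-get install linux-image-2.6.31-307-ec2
  dpkg -l 'linux-image*' | grep 2.6.35   # list the Maverick kernels still installed, then remove them
  sudo reboot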

I'd recommend that to you now (and you still can, if you must, using the old-releases.ubuntu.com mirror), but Karmic was dropped from support some months back. This leaves basically no stable, supported option for an Ubuntu-based distribution with a reasonably decent quality kernel for EC2. I'd recommend switching distros; these problems have been simmering far too long to expect a sudden fix to come your way.

Revision history for this message
Stefan Bader (smb) wrote :

@David, from what you describe this sounds like something is blocking the device completely, and that could potentially have a few causes. However, the call done by mdadm --examine should be a fairly basic read of some sector(s). That rather suggests the one virtual disk is blocking, which could mean an interrupt is being missed or was never received in the first place.
Debugging this kind of thing is very painful, as there is never much hard evidence. The problem usually happens earlier, and then the "task blocked" messages get spat out by anything that is directly or indirectly waiting on the IO. As I mentioned earlier, there have been patches to address some interrupt issues. So it would be helpful to know what exact kernel version you are running, and to confirm whether this is also an m1.xlarge or a different kind of instance. Another question would be whether you are bound to stick with Maverick or could try moving to Oneiric completely (or at least Natty).

Revision history for this message
Ben Howard (darkmuggle-deactivatedaccount) wrote :

To concur with comment #37, I speculate that you have a slow EBS volume or that you aren't able to commit things fast enough due to your heavy I/O. Performance of EBS volumes can vary widely. One thing to remember is that EBS disks _ARE NETWORK ATTACHED STORAGE_, and with that comes all the fun that network attached storage brings.

I think you can do some tuning here and see where you get. Try setting the following sysctl values (these will force uncommitted disk writes to be flushed sooner rather than later). You can play with the settings, as this is a delicate balance between performance and safety.
vm.dirty_writeback_centisecs = 300 ( force a flush after three seconds )
vm.dirty_ratio = 5 ( no more than 5% of memory can be dirty pages )

My hunch is that you are using at least an m1.large or c1.medium and you are saturating the network links used to flush the disk writes, while at the same time pulling more data onto the disk(s), preventing the flush from completing. The default settings on Maverick allow for 20% of memory and 5 seconds when flushing. Reducing the ratios will affect your performance, but I suspect it will stabilize your system and make sure that everything gets to disk.
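A minimal sketch of applying those values (the drop-in file name is just a convention):

  # apply immediately
  sudo sysctl -w vm.dirty_writeback_centisecs=300
  sudo sysctl -w vm.dirty_ratio=5
  # persist across reboots
  printf 'vm.dirty_writeback_centisecs = 300\nvm.dirty_ratio = 5\n' | sudo tee /etc/sysctl.d/60-ec2-writeback.conf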

Another tactic would be to use the ephemeral store as a "temp" directory -- push your data to the ephemeral storage and, when it is ready for permanence, commit it to the RAIDed EBS volumes.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote : Unsupported series, setting status to "Won't Fix".

This bug was filed against a series that is no longer supported and so is being marked as Won't Fix. If this issue still exists in a supported series, please file a new bug.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: Confirmed → Won't Fix