maverick on ec2 64bit ext4 deadlock

Bug #666211 reported by Timo Derstappen
This bug affects 13 people
Affects: linux (Ubuntu)
Status: Won't Fix
Importance: High
Assigned to: Unassigned

Bug Description

I created an AMI from the official image (ami-505c6924 – region eu-west-1). The rebundled image works fine for a while but ends up with nearly 100% iowait and a load of 6000.

The instance type is m1.xlarge without ebs.

There are errors in the amazon console before the kernel starts (see attached boot.log):
Failed to read /local/domain/0/backend/vbd/161/2049/feature-barrier.
Failed to read /local/domain/0/backend/vbd/161/2049/feature-flush-cache.

If you are lucky you see errors in the syslog like this:
INFO: task jbd2/sda1-8:235 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/sda1-8 D ffff880003bcd980 0 235 2 0x00000000
ffff8801b5f49b20 0000000000000246 0000000000000000 0000000000015980
ffff8801b5f49fd8 0000000000015980 ffff8801b5f49fd8 ffff8801b571db80
0000000000015980 0000000000015980 ffff8801b5f49fd8 0000000000015980

The machines are unusable after a few hours. I am testing those images right now. There is no heavy load expected. Karmic images work fine. Packages I've installed are build-essential, git-core, ruby, nginx and couchdb. Node.js is compiled manually.

The same error is described on Alestic by Jay Freeman; unfortunately, he didn't open a bug here:
http://alestic.com/2010/10/ec2-ubuntu-maverick#comment-484

AMI: http://developer.amazonwebservices.com/connect/entry.jspa?externalID=4350

Revision history for this message
Jay Freeman (saurik) (saurik) wrote :

We are also getting the following error message from the kernel:

JBD: barrier-based sync failed on sda1-8 - disabling barriers

My theory was that this has to do with the root partition being ext4. I do not know much about bundling AMIs: is this something that is easy for you to change/test with your rebundled AMIs?

Revision history for this message
Jay Freeman (saurik) (saurik) wrote :

Somehow I thought I said this in the last comment, but I see that I didn't: "using the instance's normal root partition, not an EBS root boot".

Revision history for this message
Scott Moser (smoser) wrote :

Jay,
  Could you please get console output of a system that reproduces this bug and attach it here?

Revision history for this message
Scott Moser (smoser) wrote :

I'm attaching a dmesg log of an m1.xlarge in eu-west-1.
I've tried to reproduce this with a couple of benchmarking utilities (stress and dbench) to add load, but have not been able to. I know they're just benchmarks, but I can't trigger it on this one instance here.

Is there any info you could give to help us reproduce?

Revision history for this message
Timo Derstappen (teemow) wrote :

Scott, did you bundle a new AMI from the existing UEC image? That is what we did, and that is probably where the error comes from. Which tools do you use for that? The original instance where I installed all the software and bundled from is still running and hasn't had any issues. Only the cloned instances failed.

Here is a rough description of what I did using ami-tools:
 * Replaced root device label uec-rootfs in menu.lst and fstab with /dev/sda1
 * ec2-bundle-vol -d /mnt --block-device-mapping "root=/dev/sda1" -r x86_64
 * ec2-upload-bundle ...

I replaced the label because otherwise the bundled AMIs failed with the message "root device uec-rootfs not found", and I didn't find another solution for that.
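For illustration, a rough sketch of that replacement (the exact LABEL= token in menu.lst and fstab may differ on your image, so check before running):

  # swap the label reference for the plain device name in both files
  sed -i 's#LABEL=uec-rootfs#/dev/sda1#g' /boot/grub/menu.lst
  sed -i 's#LABEL=uec-rootfs#/dev/sda1#g' /etc/fstab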

Is there a better way to do that? What do you recommend?

Revision history for this message
Scott Moser (smoser) wrote :

Timo,
  If you're booting with the correct kernel (which it appears you are), then I can't imagine that bundle-vol and upload-bundle is what is causing this.
  I've seen similar issues to what is described here in bug 567334. There, I do not see the issue reproducibly, though.
  Do you see this in an easily reproducible manner? I have some 368 logs of maverick ec2 boots in lp:~ubuntu-on-ec2/ubuntu-on-ec2/ec2-test-results/ (looking at files named console-term.txt). 8 of them have the 'jbd2/sda1...blocked' message, but none of them show a hex dump like the one in the summary. All 8 were x86_64 and instance-store: 2 were m1.xlarge, 1 was c1.xlarge, and 5 were m1.large.

Regarding rebundling, I would suggest using the images available at http://uec-images.ubuntu.com, mounting them loopback, making your changes via chroot, and then running ec2-bundle-image and ec2-upload-bundle. This is especially true if you're automating your changes. It's simply a cleaner starting point than a booted system, and it will retain the filesystem label as well.
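For anyone following along, a rough sketch of that workflow (the image filename is an example, and the ec2-bundle-image credential arguments are omitted):

  # download and unpack a released UEC image, then loopback-mount its rootfs
  tar xzf maverick-server-uec-amd64.img.tar.gz
  mkdir -p /mnt/uec
  mount -o loop maverick-server-uec-amd64.img /mnt/uec
  # make changes from inside the image via chroot
  cp /etc/resolv.conf /mnt/uec/etc/resolv.conf
  chroot /mnt/uec apt-get update
  chroot /mnt/uec apt-get install -y build-essential nginx
  umount /mnt/uec
  # bundle and upload the modified image (keys, cert, user and bucket left out)
  ec2-bundle-image -i maverick-server-uec-amd64.img -r x86_64 -d /mnt ...
  ec2-upload-bundle ...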

Scott Moser (smoser)
affects: linux-meta (Ubuntu) → linux (Ubuntu)
Changed in linux (Ubuntu):
importance: Undecided → High
status: New → Confirmed
tags: added: amd64 ec2-images
Revision history for this message
Jay Freeman (saurik) (saurik) wrote :

@scott: I have attached a dmesg dump from a system that had failed.

Revision history for this message
Stefan Bader (smb) wrote :

Adding a few comments here as they come to my mind:

The message about barrier-based sync failing is just informational and can be safely ignored.

Reading through the dmesg from comment #10 and comparing to one gathered from a daily server instance boot:

[ 0.000000] Xen version: 3.0.3-rc5-8.1.14.f (mine was 3.0.3-rc5-8.el5)
...
[ 0.000000] trying to map vcpu_info 0 at ffff880003bc3020, mfn 124f8a, offset 32
[ 0.000000] register_vcpu_info failed: err=-38

Did not see this error in my log.

[ 0.016933] CPU: Physical Processor ID: 0
[ 0.016939] CPU: Processor Core ID: 0

Not sure this is really relevant; the hardware I booted on seemed to have only 2 CPUs and showed a warning about an unsupported number of siblings (4).

[ 0.103804] alloc irq_desc for 16 on node 0
[ 0.103806] alloc kstat_irqs on node 0

Did not see messages like this either, but I suspect my hardware was an AMD dual core while this might be an Intel quad core.

[ 0.171411] intel_idle: MWAIT substates: 0x2220
[ 0.171413] intel_idle: does not run on family 6 model 23

This confirms the previous suspicion. At least it refuses here instead of crashing. Then mostly normal things. The only strange thing is the name of the device in the barrier-based sync failed message: sda1-8. Unfortunately, the way Xen works there is no partition detection in the log, but this at least sounds like sda has 8 partitions...

The following stack traces look very much like something deadlocking on flushing. The jbd2 tasks are transactions; what I am not sure about is pgbouncer (what is it supposed to be doing?). However, it seems to involve aio, and I see there are two patches in 2.6.35-23.36 which address aio completion ordering (coming from the 2.6.35.5 upstream stable release).

So, Jay, do you know what pgbouncer is doing? Maybe it is something not used in the common images. If so, it may make sense to test with the newer kernel versions.

Revision history for this message
Jay Freeman (saurik) (saurik) wrote :

Stefan,

I'm not certain what you mean by "something not used in the common images". To be clear, I do not even know how to make my own images. That's not to say I couldn't figure it out very quickly, but I never have, as I personally do not think that is a good way of using EC2: I instead boot stock Ubuntu images (ami-688c7801 in this case) and then install packages on them. Currently, I do not run any "custom" software other than my Python web application: everything I install comes from the default Ubuntu repositories.

(If required, I can provide the exact set of commands that I manually run to set up one of these EC2 servers. Unfortunately, I have so far been unable to fully automate the bootup of these systems, as the server tends to lock up on me while I'm doing the install, but I remember thinking it was an issue unrelated to this one. I will do some testing of the fully automated boot of these servers tonight and see if I can reproduce those lockups again to see if they look at all related.)

So if you mean "am I making my own AMI that has some kind of modified system software", the answer is "definitely not". However, while I think it is an amazing testament to modern engineering that I can reinstall and reboot computers from a thousand miles away, a completely stock system is a fairly useless endeavor: the stock images are very "stock" and don't do anything at all, as far as I know, out of the box. To be putting any kind of load on them at all, I'm certainly installing some software, even if that software is just a shell-script fork bomb.

In my particular case, I install apache2 with mod_python and "pgbouncer" (from Ubuntu universe), a program from Skype that provides a PostgreSQL pooling proxy server. My python application connects to pgbouncer (which is listening on a named Unix socket and pretends to be PostgreSQL) instead of directly to my database server, which then keeps its own pool of connections to the actual database. This makes the 3200 Apache threads that I normally have running able to rapidly get a database connection without trying to coordinate local in-process pools.

Put differently: pgbouncer should be a fairly boring user-land process. If you are looking at it thinking it is some kind of cool kernel task that increases network security (maybe "bouncing" bad clients) or something, it (maybe sadly ;P) isn't. This software is actually the least popular (but I personally feel best ;P) choice in PostgreSQL pooling proxy servers, an already fairly narrow niche; therefore, I would find it highly unlikely if Timo was also using it, but maybe he will chime in with a "yeah! pgbouncer is AWESOME!" and prove me wrong.

-J

Revision history for this message
Jay Freeman (saurik) (saurik) wrote :

Stefan,

To be clear, both Timo and I were using m1.xlarge instances, which are supposed to have four cores. (You mentioned the hardware you were testing on only had two cores, and therefore you weren't getting the same seemingly-bad and probably-should-be-fixed error messages.)

Also, that log was saved from two weeks ago when I was running into this issue: the difference in Xen version could theoretically then be that Amazon has upgraded their system since then. (I have no idea if that even makes sense, but I figure I'll throw it out there as a possibility.)

-J

Revision history for this message
Jay Freeman (saurik) (saurik) wrote :

Stefan,

These stack traces for pgbouncer are all in sys_write(), by the way, which is then backed by ext4. Both from what I know about how pgbouncer operates and from grepping through its source code, the only file-backed operation it performs is writing to its log file, which it normally does only once a minute unless it is encountering some kind of connection failures.

It looked (and still looks) to me like the filesystem is simply locking up. It should be noted that my dmesg log also includes another process that got stuck: run-parts, which got blocked in a call to sys_getdents().

Also, I looked into the AIO completion-ordering change you mentioned, and it seems totally unrelated. The author of that patch referred to a reproduction of the bug they were fixing, which was a "you now read a bunch of zeros when you were expecting data" race condition, not a deadlock. Specifically, operations involving "unwritten extents" would claim to be "completed" via AIO while they were still pending; the reordering fixed this.

http://www.spinics.net/lists/linux-ext4/msg19590.html
http://thread.gmane.org/gmane.comp.file-systems.ext4/19659

-J

Revision history for this message
Stefan Bader (smb) wrote :

Jay,

To hopefully answer all questions: the "non-standard" question was aimed at understanding what additional packages were installed and how the system is being used when things happen. It sometimes helps to understand the problem better when you know exactly how the system is in use. Knowing the Xen version and the hardware may (or may not) help as well. When I use an m1.xlarge, for example, I seem to get more memory (16G) and fewer CPUs (4, whether those are real cores or hyperthreads). Not that it should normally matter with this deadlock, but sometimes it does.

So for the deadlock, I will have to follow the traces more closely. But at least knowing that pgbouncer is a user-space daemon that does some DB proxying, and seeing aio somewhere in the traces, gives some hint about what types of fs access are involved.

Revision history for this message
Jay Freeman (saurik) (saurik) wrote :

In case this saves anyone's time: the top of those stack traces is garbage. Really, all of those processes are simply blocked in the scheduler: the second entry from the top in all the call stacks is a call to schedule() (which I presume scrambles the registers enough to confuse the stack tracer).

Revision history for this message
Stefan Bader (smb) wrote :

I think all of them are waiting on IO for one reason or another. There are also a lot of lines with '?', which usually means they cannot be fully trusted. The interesting/hard part is to understand how things ended up where they did and whether there possibly is some relationship between two of the blocked ones, or maybe even one process stepping on its own toes. Not that I have got that far yet.

Revision history for this message
Jay Freeman (saurik) (saurik) wrote :

Yes: I'm just telling you that the ? entries at the top of these stacks are all "in the scheduler". jbd2 and run-parts are blocked in io_schedule(), and pgbouncer is blocked in do_get_write_access(). Both of those functions are calling into schedule(), and that's what is actually at the top of the stack. (I disassembled the 2.6.35-22.35-virtual kernel and verified the call points from the non-? second entries down on the stack.)

Revision history for this message
Stefan Bader (smb) wrote :

Yes, you are right. There is one likely place in the path pgbouncer takes that will wait for a buffer to finish being written to disk. And the jbd2 task is waiting for a range of pages to be written out. Maybe related, but I cannot see a reason why this should deadlock. And the same is true for run-parts.

This leads to the question of whether we actually see an ext4 issue here. Unfortunately we have no clue what exactly is running on the other side (dom0). In the past I have seen kvm users having very similar issues on 2.6.32 hosts. There has been a lot of fiddling with the generic writeback interface. And even on bare metal we have seen very poor performance when multiple people ran IO-bound tasks (like kernel compiles, where one user could massively starve the others). This is why the Ubuntu 10.04 kernel has a huge pile of patches backported from 2.6.35.

Another lead could be some patches in recent 2.6.35 that fix a problem in Xen with lost interrupts. If we are waiting for pages to be written to disk and the completion interrupt gets lost, it would show up exactly like it does here.

commit a29059dc766af0bd2783614399972950fc99a99d
    xen: handle events as edge-triggered
...
    The most noticable symptom of these lost events is occasional lockups
    of blkfront.

So if it were the writeback issues on an older dom0, I would expect the messages to eventually go away (though it could take a really long time, potentially more than 10 minutes). This might be completely coincidental, but for some reason run-parts seems to vanish in the 4th batch of messages:

2040s: jbd2 and pgbouncer
2160s: jbd2, pgbouncer and run-parts
2280s: jbd2, pgbouncer and run-parts
2400s: jbd2 and pgbouncer

If it were that and the messages eventually went away, then this would need to be addressed in dom0.

For the lost-interrupt case: that patch only changes the handler, so I guess changing the domU should be effective. As maverick instances use pv-grub, it is simple to try. If you boot your instance and install the other software, you can also do a

wget https://launchpad.net/ubuntu/+source/linux/2.6.35-23.37/+build/2033771/+files/linux-image-2.6.35-23-virtual_2.6.35-23.37_amd64.deb

to download a kernel that includes that (amongst other things) fix. Then you can reboot the instance into that kernel and see whether it shows the issue or seems to solve it.
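For example, roughly (a sketch; pv-grub boots whatever the image's /boot/grub/menu.lst lists, which the kernel package's hooks should update on install):

  sudo dpkg -i linux-image-2.6.35-23-virtual_2.6.35-23.37_amd64.deb
  grep ^title /boot/grub/menu.lst   # check the 2.6.35-23 entry is present/default
  sudo reboot
  uname -r                          # after reboot, should report 2.6.35-23-virtual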

Revision history for this message
Zach Bailey (znbailey) wrote :

I am also having this problem running the 64-bit EBS-backed Alestic EC2 image ami-548c783d on c1.xlarge.

Here is how I am setting up my machine:

1.) Boot a fresh instance
2.) Install the Sun/Oracle Java6 JDK
3.) Download the heritrix web crawler from http://builds.archive.org:8080/maven2/org/archive/heritrix/heritrix/3.1.1-SNAPSHOT/ - heritrix is a Java program which runs in user land, crawls web sites, and writes the results out to disk. Heritrix is installed into /mnt/heritrix because it writes a very large BerkeleyDB database whose size exceeds the EBS device, which is only 15 GB.
4.) Mount a 100 GB EBS volume formatted as ext3 to hold the heritrix crawl results
5.) Start heritrix with a 5 GB max heap (-Xmx5g) and start a crawl job to crawl a couple hundred thousand web sites

Invariably, within a couple of hours, the machine "hangs" and any attempt at I/O blocks eternally. The following warnings/errors appear in /var/log/syslog:

Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290051] INFO: task kjournald:614 blocked for more than 120 seconds.
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290068] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290076] kjournald D ffff880003c9f980 0 614 2 0x00000000
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290082] ffff8801b63d9c30 0000000000000246 0000000000000000 0000000000015980
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290088] ffff8801b63d9fd8 0000000000015980 ffff8801b63d9fd8 ffff8801b7dfadc0
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290092] 0000000000015980 0000000000015980 ffff8801b63d9fd8 0000000000015980
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290097] Call Trace:
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290127] [<ffffffff8117d510>] ? sync_buffer+0x0/0x50
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290133] [<ffffffff815a20f3>] io_schedule+0x73/0xc0
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290136] [<ffffffff8117d555>] sync_buffer+0x45/0x50
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290139] [<ffffffff815a276f>] __wait_on_bit+0x5f/0x90
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290142] [<ffffffff8117c281>] ? submit_bh+0x111/0x140
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290145] [<ffffffff8117d510>] ? sync_buffer+0x0/0x50
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290148] [<ffffffff815a2818>] out_of_line_wait_on_bit+0x78/0x90
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290153] [<ffffffff8107f0c0>] ? wake_bit_function+0x0/0x40
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290156] [<ffffffff8117d506>] __wait_on_buffer+0x26/0x30
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290161] [<ffffffff812205b1>] journal_commit_transaction+0x2f1/0xe30
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290166] [<ffffffff81006afd>] ? __raw_callee_save_xen_irq_disable+0x11/0x1e
Nov 15 15:27:03 domU-12-31-39-0A-B6-71 kernel: [39840.290169] [<ffffffff81006adf>] ? __raw_callee_save_xen_restore_fl+0x11/0x1e
Nov 15 15:27:03 ...

Revision history for this message
MatthiasP (mpdude) wrote :

Not quite sure if it's the same bug, but maybe this adds some more data points to the set.

AMI ID is ami-405c6934, which is the eu-west-1 EBS-backed variant of the image in question.

Just a plain instance boot, connecting four EBS volumes, assembling them into /dev/md0, putting XFS on top, and running a benchmark as described in http://www.mysqlperformanceblog.com/2009/08/06/ec2ebs-single-and-raid-volumes-io-bencmark/.
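For context, a hedged sketch of that kind of setup (device names, RAID level, and mount point are illustrative only, not taken from the report):

  # stripe four attached EBS volumes into one md device and put XFS on top
  mdadm --create /dev/md0 --level=0 --raid-devices=4 /dev/sdf /dev/sdg /dev/sdh /dev/sdi
  mkfs.xfs /dev/md0
  mkdir -p /mnt/bench
  mount /dev/md0 /mnt/bench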

When doing the same (on the same volumes) from a 32-bit AMI on a .small instance, the problem does not occur (at least not over a few days), whereas the 64-bit AMI crashed within hours.

Dec 5 22:00:55 ip-10-234-243-114 kernel: [ 6.642845] JBD: barrier-based sync failed on sda1-8 - disabling barriers
Dec 5 22:01:02 ip-10-234-243-114 kernel: [ 13.890171] eth0: no IPv6 routers present
Dec 5 22:04:14 ip-10-234-243-114 kernel: [ 205.786453] SGI XFS with ACLs, security attributes, realtime, large block/inode numbers, no debug enabled
Dec 5 22:04:14 ip-10-234-243-114 kernel: [ 205.789239] SGI XFS Quota Management subsystem
Dec 5 22:04:14 ip-10-234-243-114 kernel: [ 205.790489] Filesystem "md0": Disabling barriers, trial barrier write failed
Dec 5 22:04:14 ip-10-234-243-114 kernel: [ 205.807777] XFS mounting filesystem md0
Dec 5 22:04:15 ip-10-234-243-114 kernel: [ 206.368403] Ending clean XFS mount for filesystem: md0
Dec 5 22:07:36 ip-10-234-243-114 kernel: [ 407.339573] Filesystem "md0": Disabling barriers, trial barrier write failed
Dec 5 22:07:36 ip-10-234-243-114 kernel: [ 407.340479] XFS mounting filesystem md0
Dec 5 22:07:36 ip-10-234-243-114 kernel: [ 407.593179] Ending clean XFS mount for filesystem: md0
Dec 5 22:17:01 ip-10-234-243-114 CRON[1222]: (root) CMD ( cd / && run-parts --report /etc/cron.hourly)
Dec 5 22:31:54 ip-10-234-243-114 kernel: [ 1865.812490] XFS mounting filesystem md0
Dec 5 22:31:54 ip-10-234-243-114 kernel: [ 1866.061941] Ending clean XFS mount for filesystem: md0
Dec 5 23:04:48 ip-10-234-243-114 kernel: [ 3840.190047] INFO: task flush-9:0:1272 blocked for more than 120 seconds.
Dec 5 23:04:48 ip-10-234-243-114 kernel: [ 3840.190061] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Dec 5 23:04:48 ip-10-234-243-114 kernel: [ 3840.190068] flush-9:0 D ffff880003e7d980 0 1272 2 0x00000000
Dec 5 23:04:48 ip-10-234-243-114 kernel: [ 3840.190072] ffff88014d79b640 0000000000000246 ffff880100000000 0000000000015980
Dec 5 23:04:48 ip-10-234-243-114 kernel: [ 3840.190077] ffff88014d79bfd8 0000000000015980 ffff88014d79bfd8 ffff8801d58316e0
Dec 5 23:04:48 ip-10-234-243-114 kernel: [ 3840.190081] 0000000000015980 0000000000015980 ffff88014d79bfd8 0000000000015980
Dec 5 23:04:48 ip-10-234-243-114 kernel: [ 3840.190084] Call Trace:
Dec 5 23:04:48 ip-10-234-243-114 kernel: [ 3840.190095] [<ffffffff815a20f3>] io_schedule+0x73/0xc0
Dec 5 23:04:48 ip-10-234-243-114 kernel: [ 3840.190099] [<ffffffff812a2f1c>] get_request_wait+0xcc/0x1a0
Dec 5 23:04:48 ip-10-234-243-114 kernel: [ 3840.190104] [<ffffffff8107f080>] ? autoremove_wake_function+0x0/0x40
Dec 5 23:04:48 ip-10-234-243-114 kernel: [ 3840.190107] [<ffffffff812a3083>] __make_reques...

Revision history for this message
Paul Bohm (bohmps) wrote :

Previously reported as #666211; completely different userspace setup, same problem. Any workarounds?

Revision history for this message
Jay Freeman (saurik) (saurik) wrote :

Other people experiencing this issue may want to explicitly note the comment on this bug from Stefan (which is now somewhat buried) regarding the potential Xen IRQ misconfiguration in these kernels, and attempt that fix. Unfortunately, my test case is "run my business on this system for a day and wait for my website to go offline and millions of users to complain" ;P.

However, for what it is worth, I do not believe the current 2.6.37 kernel from Natty (2.6.37-12.26) experiences this issue. I was building a new m2.4xlarge database server, which the Maverick kernels do not support correctly (bug 667796), and therefore started experimenting with upgrading just the kernel to Natty (with an APT Pin and all that).

As the system seemed stable while building it and I knew the load characteristics would be drastically different on it than my previous web server backends, I decided to go ahead with it. Then, after that worked out for a while, I decided to risk moving all of my web backend boxes (the ones for which I was experiencing this issue) to that kernel as well.

So far I've been very happy with the results. Until now, I've been forced to use Karmic, with its non-pv-grub kernel, field-upgraded to Maverick. The kernels from Lucid were not acceptable (a seemingly serious performance regression), and Maverick has been "right out" (given this horrendous I/O lock-up issue, and the m2.4xlarge 64GB no-go).

Revision history for this message
Sesshomurai (darren-ontrenet) wrote :

Is it useful to know whether this problem happens in other distros' AMIs with the same kernel rev? Or is this specifically an Ubuntu AMI issue?

Revision history for this message
John Johansen (jjohansen) wrote :

Sesshomurai,

It would indeed be useful to know if this is seen in other distro kernels, but at the moment I don't believe enough testing of the other distro kernels has been done to determine that.

Revision history for this message
Adam Gandelman (gandelman-a) wrote :

I had been triggering a similar bug across a number of m1.large instances in different availability zones running 2.6.35-22-virtual. Mostly, the systems were running a number of rake tasks that were migrating data from one remote database to another and then either (1) logging data locally or (2) mv'ing small files to new locations on the root ext4 filesystem. In either case, I/O to the device would eventually block. Thinking it might be ext4-specific, I tried XFS and ext3, and both eventually ran into the same problem.

I wasn't able to reproduce this on-demand, but it happened consistently enough. I found I was able to reproduce it more easily using DRBD. Creating a simple DRBD cluster and initiating the initial synchronization between nodes, the secondary node (sync target) would eventually stall out while its backing device deadlocked.

After upgrading to 2.6.35-28-virtual across all instances, I found the issue gone. I can only assume it was resolved by the upstream fixes Stefan mentioned above:

  * xen: handle events as edge-triggered
  * xen: use percpu interrupts for IPIs and VIRQs

...which were applied to the 2.6.35-23 kernel.

Can anyone else confirm that upgrading from 2.6.35-22 to later maverick kernels resolves their issues?
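If someone wants to test this, a minimal sketch of the upgrade on a running Maverick instance (assuming the linux-image-virtual meta-package pulls the latest 2.6.35 kernel from maverick-updates):

  sudo apt-get update
  sudo apt-get install linux-image-virtual
  sudo reboot
  uname -r   # confirm the instance came back up on 2.6.35-28 (or later)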

Attached are some traces from 3 instances as well as a URL to a thread on the AWS support forum describing similar behavior.

https://forums.aws.amazon.com/thread.jspa?messageID=224301

Revision history for this message
Mikhail P (m-mihasya) wrote :

Still seeing this issue with 2.6.35-24-virtual.

Adam, you said you are using -28; perhaps there's a further fix somewhere in between? Are you still problem-free?

Revision history for this message
Paul Bohm (bohmps) wrote : Re: [Bug 666211] Re: maverick on ec2 64bit ext4 deadlock

I think barrier=0 in /etc/fstab fixed this for me, but there might have been confounding factors.
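For reference, a hedged example of what that entry could look like; the device and other options are illustrative, and note that disabling barriers trades some crash safety for stability/performance:

  # /etc/fstab (example entry only)
  /dev/sda1  /  ext4  defaults,barrier=0  0  1
  # apply without a reboot
  sudo mount -o remount,barrier=0 /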


Revision history for this message
Mikhail P (m-mihasya) wrote :

Paul, what else did you change?

Are you using RAID, or did you see this happening on individually mounted devices?

Thanks!

Revision history for this message
Adam Gandelman (gandelman-a) wrote :

Mikhail-

I am no longer working at the organization where I encountered this, but upgrading to the *-28 kernel solved the problem for the remaining month or so I was there. I would highly suggest upgrading to the latest 2.6.35 kernel, not just for this fix but for other important updates as well.

Revision history for this message
Paul Bohm (bohmps) wrote :

On individual devices.

I changed the app running on my machine, and at some point I might also have reinstalled, though the problem was previously also present after an earlier reinstall.

I think the barrier setting is the conclusion I came to after reading the source, and, without further testing, I think it is the reason I'm stable now.


Revision history for this message
David Taylor (david-taylor) wrote :

It sounds like I've bumped into the same problem (twice so far!) as MatthiasP (mpdude).

I'm using ami-af7e2eea in us-west-1 on c1.xlarge. I have 8 x 128GB EBS volumes in a RAID10 array using mdadm. At some point /dev/md0 would "freeze" and load would shoot up from <5 to >300-400.

Any attempts to interrogate the mounted filesystem would hang and be uninterruptible.

When I ran "mdadm --examine" on each of the devices it returned "state: clean" on all of them except one. That command would never return and I'd have to Ctrl-C to interrupt it.

I had the same log messages in /var/log/syslog, so I won't re-paste them here.

Any ideas what the cause is? Better yet, the fix? I see some suggestions that kernel upgrades might help. What are the prevailing thoughts on that, has it been confirmed?

Also, how did you perform the kernel upgrade? Did you build your own AMI or are you upgrading it after launch?

Cheers,
David.

Revision history for this message
Brandon Black (blblack) wrote :

As a matter of practicality, given this and other problematic bugs with EC2 on the stock Maverick and Lucid kernels (there are several that range from annoying to flat-out unreliable or un-(re)bootable), I had been running my Maverick-based instances with the Karmic kernels pretty successfully for a long time (by adding the Karmic repos to sources.list, installing linux-image-2.6.31-307-ec2 from them, and deleting the Maverick kernel as part of my cloud-init script on first boot).
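A rough sketch of that approach as it would look today (the archive line and the check for leftover Maverick kernels are assumptions, not the exact script described above):

  echo 'deb http://old-releases.ubuntu.com/ubuntu karmic main universe' | sudo tee /etc/apt/sources.list.d/karmic.list
  sudo apt-get update
  sudo apt-get install linux-image-2.6.31-307-ec2
  dpkg -l 'linux-image*' | grep 2.6.35   # list the Maverick kernels still installed, then remove them
  sudo reboot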

I'd recommend that to you now (and you still can, if you must, using the old-releases.ubuntu.com mirror), but Karmic was dropped from support some months back. This leaves basically no stable, supported option for an Ubuntu-based distribution with a reasonably decent quality kernel for EC2. I'd recommend switching distros; these problems have been simmering far too long to expect a sudden fix to come your way.

Revision history for this message
Stefan Bader (smb) wrote :

@David, from what you describe this sounds like something is blocking the device completely, and that could potentially have a few causes. However, the call done by mdadm --examine should be a fairly basic read of some sector(s). That rather suggests the one virtual disk is blocking, which could mean an interrupt is being missed or was never received in the first place.
Debugging this kind of thing is very painful, as there is never much hard evidence. The problem usually happens earlier, and then the "task blocked" messages get spat out by anything that is directly or indirectly waiting on the IO. As I mentioned earlier, there have been patches to address some interrupt issues. So it would be helpful to know what exact kernel version you are running, and to confirm whether this is also an m1.xlarge or a different kind of instance. Another question would be whether you are bound to stick with Maverick or could try moving to Oneiric completely (or at least Natty).

Revision history for this message
Ben Howard (darkmuggle-deactivatedaccount) wrote :

To concur with comment #37, I speculate that you have a slow EBS volume or that you aren't able to commit things fast enough due to your heavy I/O. Performance of EBS volumes can vary widely. One thing to remember is that EBS disks _ARE NETWORK ATTACHED STORAGE_, and with that comes all the fun that network attached storage brings.

I think you can do some tuning here and see where you get. Try setting the following sysctl values (these will force uncommitted disk writes to be flushed sooner rather than later). You can play with the settings, as this is a delicate balance between performance and safety.
vm.dirty_writeback_centisecs = 300 ( force a flush after three seconds )
vm.dirty_ratio = 5 ( no more than 5% of memory can be dirty pages )

My hunch is that you are using at least an m1.large or c1.medium and you are saturating the network links used to flush the disk writes, while at the same time pulling more data onto the disk(s), preventing the flush from completing. The default settings on Maverick allow for 20% of memory and 5 seconds when flushing. Reducing the ratios will affect your performance, but I suspect it will stabilize your system and make sure that everything gets to disk.
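A minimal sketch of applying those values (the drop-in file name is just a convention):

  # apply immediately
  sudo sysctl -w vm.dirty_writeback_centisecs=300
  sudo sysctl -w vm.dirty_ratio=5
  # persist across reboots
  printf 'vm.dirty_writeback_centisecs = 300\nvm.dirty_ratio = 5\n' | sudo tee /etc/sysctl.d/60-ec2-writeback.conf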

Another tactic would be to use the ephemeral store as a "temp" directory -- push your data to the ephemeral storage and, when it is ready for permanence, commit it to the RAIDed EBS volumes.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote : Unsupported series, setting status to "Won't Fix".

This bug was filed against a series that is no longer supported and so is being marked as Won't Fix. If this issue still exists in a supported series, please file a new bug.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: Confirmed → Won't Fix