Comment 27 for bug 666211

Adam Gandelman (gandelman-a) wrote :

I had been triggering a similar bug across a number of m1.large instances in different availability zones, all running 2.6.35-22-virtual. The systems were mostly running rake tasks that migrated data from one remote database to another and then either (1) logged data locally or (2) mv'ed small files to new locations on the root ext4 filesystem. In either case, I/O to the device would eventually block. Thinking it might be ext4-specific, I tried XFS and ext3, and both eventually ran into the same problem.
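For anyone trying to spot the same symptom, here is a rough sketch that lists tasks stuck in uninterruptible sleep (state "D" in /proc/<pid>/stat), which is how blocked I/O usually shows up. This is not how the attached traces were collected; the function names are made up for illustration, and a task in state D is only a hint, since healthy systems show it briefly too.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or
    # parentheses, so split on the first "(" and the last ")".
    open_paren = stat_text.index("(")
    close_paren = stat_text.rindex(")")
    pid = int(stat_text[:open_paren])
    comm = stat_text[open_paren + 1:close_paren]
    state = stat_text[close_paren + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited or its stat file was unreadable
        if state == "D":
            blocked.append((pid, comm))
    return blocked


if __name__ == "__main__":
    for pid, comm in blocked_tasks():
        print(pid, comm)
```

If the same pids keep showing up across several runs a few seconds apart, that matches the behavior described here much better than a single hit.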

I wasn't able to reproduce this on demand, but it happened consistently enough. I found I could reproduce it more easily using DRBD. After I created a simple DRBD cluster and started the initial synchronization between nodes, the secondary node (sync target) would eventually stall while its backing device deadlocked.

After I upgraded all instances to 2.6.35-28-virtual, the issue went away. I can only assume it was resolved by the upstream fixes Stefan mentioned above:

  * xen: handle events as edge-triggered
  * xen: use percpu interrupts for IPIs and VIRQs

...which were applied to the 2.6.35-23 kernel.
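
To make it quicker to report which side of the fix an instance is on, here is a small sketch that compares the running kernel release against 2.6.35-23. It assumes the release string looks like "2.6.35-22-virtual" and that the ABI number after the dash is what matters; the function name is only illustrative, and kernels outside the 2.6.35 series are reported as unknown rather than guessed at.

```python
import platform
import re

# Assumption taken from this comment: the two xen fixes landed in 2.6.35-23.
FIXED_BASE = "2.6.35"
FIXED_ABI = 23


def has_xen_event_fixes(release):
    """Return True or False for 2.6.35-N releases, None for any other series."""
    match = re.match(r"^(\d+\.\d+\.\d+)-(\d+)", release)
    if match is None or match.group(1) != FIXED_BASE:
        return None  # different series; this check says nothing about it
    return int(match.group(2)) >= FIXED_ABI


if __name__ == "__main__":
    release = platform.release()  # same string as `uname -r`
    print(release, has_xen_event_fixes(release))
```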

Can anyone else confirm that upgrading from 2.6.35-22 to later maverick kernels resolves their issues?

Attached are some traces from 3 instances as well as a URL to a thread on the AWS support forum describing similar behavior.

https://forums.aws.amazon.com/thread.jspa?messageID=224301