Comment 19 for bug 666211

Stefan Bader (smb) wrote :

Yes, you are right. There is one likely place in the path pgbouncer takes where it will wait for a buffer to finish being written to disk, and the jbd2 task is waiting for a range of pages to be written out. These may be related, but I cannot see a reason why this should deadlock. The same is true for run-parts.

This leads to the question of whether we are actually seeing an ext4 issue here. Unfortunately we do not know for sure what is running on the other side (dom0). In the past I have seen kvm users having very similar issues on 2.6.32 hosts. There has been a lot of fiddling with the generic writeback interface. Even on bare metal we have seen very poor performance when multiple people ran IO-bound tasks (such as kernel compiles, where one user could massively starve the others). This is why the Ubuntu 10.04 kernel has a huge pile of patches backported from 2.6.35.

Another lead could be some patches in recent 2.6.35 that fix a problem in xen with lost interrupts. If we are waiting for pages to be written to disk and the completion interrupt gets lost, it would show up just like it does here.

commit a29059dc766af0bd2783614399972950fc99a99d
    xen: handle events as edge-triggered
...
    The most noticable symptom of these lost events is occasional lockups
    of blkfront.

If this were the writeback issue on an older dom0, I would expect the messages to go away eventually (though it could take a really long time, potentially more than 10 minutes). It might be pure coincidence, but for some reason run-parts seems to vanish in the 4th batch of messages:

2040s: jbd2 and pgbouncer
2160s: jbd2, pgbouncer and run-parts
2280s: jbd2, pgbouncer and run-parts
2400s: jbd2 and pgbouncer

If that is the cause and the messages do go away, then this would need to be addressed in dom0.
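To keep watching whether tasks drop out of later batches, the batch list above can be regenerated from the kernel log. This is only a rough sketch: it assumes printk timestamps are enabled and that the warnings use the usual hung task wording ("INFO: task NAME:PID blocked for more than N seconds."). The function name summarize_hung_tasks is made up for this example, and task names containing spaces would be cut at the first space.

```shell
# Group hung-task warnings by kernel timestamp (whole seconds) and list
# the task names reported at each timestamp, in order of first appearance.
# Assumed input line format (device name and PID are only illustrative):
#   [ 2040.123456] INFO: task jbd2/sda1-8:245 blocked for more than 120 seconds.
summarize_hung_tasks() {
    # Keep only the hung-task lines, reduced to "<seconds> <task name>".
    sed -n 's/.*\[ *\([0-9]*\)\.[0-9]*\] INFO: task \(.*\):[0-9]* blocked for more than.*/\1 \2/p' |
    awk '{
            if ($1 in seen) {
                seen[$1] = seen[$1] ", " $2   # another task in the same batch
            } else {
                seen[$1] = $2                 # first task of a new batch
                ts[++n] = $1                  # remember batch order
            }
         }
         END { for (i = 1; i <= n; i++) print ts[i] "s: " seen[ts[i]] }'
}
```

It would be used as "dmesg | summarize_hung_tasks". If run-parts (or any other task) stops appearing in later batches, that would fit the slow writeback theory better than a lost interrupt.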

For the lost interrupt case: that patch only changes the handler, so I guess changing the domU kernel should be effective. As maverick instances use pv-grub, this is simple to try. Once you have booted your instance and installed the other software, you can also run

wget https://launchpad.net/ubuntu/+source/linux/2.6.35-23.37/+build/2033771/+files/linux-image-2.6.35-23-virtual_2.6.35-23.37_amd64.deb

to download a kernel that includes that fix (amongst other things). After installing the package, you can reboot the instance into that kernel and see whether the issue still shows up or is solved.
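Put together, the test could look roughly like the sketch below. The helper names install_test_kernel and running_test_kernel are made up for this example, and it assumes that the package's install scripts update the boot menu that pv-grub reads, so that the new kernel becomes the default on the next boot. The expected "uname -r" string is derived from the package file name, which is also an assumption about the naming convention.

```shell
url=https://launchpad.net/ubuntu/+source/linux/2.6.35-23.37/+build/2033771/+files/linux-image-2.6.35-23-virtual_2.6.35-23.37_amd64.deb

# File name is everything after the last "/" of the URL.
deb=${url##*/}

# Assumed naming: linux-image-<kernel release>_<package version>_<arch>.deb
release=${deb#linux-image-}
release=${release%%_*}

# Download and install the test kernel, then reboot into it.
install_test_kernel() {
    wget "$url" && sudo dpkg -i "$deb" && sudo reboot
}

# After the reboot: succeeds only if the running kernel is the test kernel.
running_test_kernel() {
    [ "$(uname -r)" = "$release" ]
}
```

Run install_test_kernel once, and after the instance comes back check running_test_kernel before repeating the workload. If the instance is still on the old kernel, the results say nothing about the fix.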