Writing big files to NFS target causes system lock up

Bug #561210 reported by Ancoron Luziferis
This bug affects 57 people
Affects           Status        Importance  Assigned to   Milestone
linux (Ubuntu)    Fix Released  Medium      Tim Gardner
  Lucid           Fix Released  Undecided   Tim Gardner
  Maverick        Fix Released  Undecided   Tim Gardner
  Natty           Fix Released  Medium      Tim Gardner

Bug Description

I'm experiencing complete system lock ups occasionally when writing big files to an NFS target.

I can't remember having such issues with Jaunty, but at least Karmic and Lucid are affected in exactly the same way.

In most cases just the "mv" or "cp" process hangs for a while, but sometimes the whole system freezes, including X.

Although some things keep running (e.g. the clock on my G15 Keyboard LCD display doesn't freeze, so the g15daemon keeps running as usual), I'm unable to get the system back to a normal state. Even waiting a whole night for the NFS task to finish doesn't succeed. In such a case the NFS server side is completely idle, not receiving any data from the client. I have to hard reset the client machine to get a working system again.

The server is a Busybox NAS system with the following exports options:
rw,no_wdelay,no_root_squash,insecure_locks,no_subtree_check

The clients are mounting with these options:
rw,rsize=32768,wsize=32768,hard,intr,noatime
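
For reference, this is roughly how those options would be laid out in the server's /etc/exports and the clients' /etc/fstab; the export path, client subnet, server address and mount point below are made up for illustration:

    # /etc/exports on the NAS (hypothetical path and client range)
    /export/data 192.168.1.0/24(rw,no_wdelay,no_root_squash,insecure_locks,no_subtree_check)

    # /etc/fstab on the clients (hypothetical server and mount point)
    192.168.1.10:/export/data  /media/data  nfs  rw,rsize=32768,wsize=32768,hard,intr,noatime  0  0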

All machines are connected through a LevelOne GSW-0803T 8-port GBit switch with Cat6e cables. No wireless here. No other issues with networking here. The client machines are both AMD64 AMD Quad-Cores, both have GigaByte mainboards using the internal GBit Ethernet connector for networking.

Some time ago I also used a Dell Inspiron (32-bit Pentium-M system) notebook with a 100MBit connection on this network without such issues.

The following gets logged during one of those lock-ups:

INFO: task mv:26028 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
mv D 00000000ffffffff 0 26028 2040 0x00000000
 ffff880100198f08 0000000000000086 0000000000015b80 0000000000015b80
 ffff880001dc83c0 ffff880100199fd8 0000000000015b80 ffff880001dc8000
 0000000000015b80 ffff880100199fd8 0000000000015b80 ffff880001dc83c0
Call Trace:
 [<ffffffffa0418280>] ? nfs_wait_bit_uninterruptible+0x0/0x20 [nfs]
 [<ffffffff8153e697>] io_schedule+0x47/0x70
 [<ffffffffa041828e>] nfs_wait_bit_uninterruptible+0xe/0x20 [nfs]
 [<ffffffff8153eeef>] __wait_on_bit+0x5f/0x90
 [<ffffffff81013cae>] ? apic_timer_interrupt+0xe/0x20
 [<ffffffffa0418280>] ? nfs_wait_bit_uninterruptible+0x0/0x20 [nfs]
 [<ffffffff8153ef98>] out_of_line_wait_on_bit+0x78/0x90
 [<ffffffff81085340>] ? wake_bit_function+0x0/0x40
 [<ffffffffa041826f>] nfs_wait_on_request+0x2f/0x40 [nfs]
 [<ffffffffa041c66f>] nfs_wait_on_requests_locked+0x7f/0xd0 [nfs]
 [<ffffffffa041daae>] nfs_sync_mapping_wait+0x9e/0x1a0 [nfs]
 [<ffffffffa041dc31>] nfs_wb_page+0x81/0xe0 [nfs]
 [<ffffffffa040cb17>] nfs_release_page+0x57/0x70 [nfs]
 [<ffffffff810f2a52>] try_to_release_page+0x32/0x50
 [<ffffffff811016a3>] shrink_page_list+0x453/0x5f0
 [<ffffffff81101b4d>] shrink_inactive_list+0x30d/0x7e0
 [<ffffffff810fbcda>] ? determine_dirtyable_memory+0x1a/0x30
 [<ffffffff810fbd87>] ? get_dirty_limits+0x27/0x2f0
 [<ffffffff811020b1>] shrink_list+0x91/0xf0
 [<ffffffff811022a7>] shrink_zone+0x197/0x240
 [<ffffffff811023c2>] shrink_zones+0x72/0x100
 [<ffffffff811024ce>] do_try_to_free_pages+0x7e/0x330
 [<ffffffff8110287f>] try_to_free_pages+0x6f/0x80
 [<ffffffff811003c0>] ? isolate_pages_global+0x0/0x50
 [<ffffffff810f992a>] __alloc_pages_slowpath+0x27a/0x580
 [<ffffffff810f9d8e>] __alloc_pages_nodemask+0x15e/0x1a0
 [<ffffffff8112cc07>] alloc_pages_current+0x87/0xd0
 [<ffffffff81132728>] new_slab+0x248/0x310
 [<ffffffff81134fb9>] __slab_alloc+0x169/0x2d0
 [<ffffffff810f5ab5>] ? mempool_alloc_slab+0x15/0x20
 [<ffffffff811354e4>] kmem_cache_alloc+0xe4/0x150
 [<ffffffff810f5ab5>] mempool_alloc_slab+0x15/0x20
 [<ffffffff810f5c53>] mempool_alloc+0x63/0x140
 [<ffffffff81085300>] ? autoremove_wake_function+0x0/0x40
 [<ffffffffa041ce70>] nfs_writedata_alloc+0x20/0xc0 [nfs]
 [<ffffffffa041cf32>] nfs_flush_one+0x22/0xf0 [nfs]
 [<ffffffffa0418097>] nfs_pageio_doio+0x37/0x80 [nfs]
 [<ffffffffa0418134>] nfs_pageio_add_request+0x54/0x100 [nfs]
 [<ffffffffa041c47d>] nfs_page_async_flush+0x9d/0xf0 [nfs]
 [<ffffffffa041c557>] nfs_do_writepage+0x87/0x90 [nfs]
 [<ffffffffa041cc9e>] nfs_writepages_callback+0x1e/0x40 [nfs]
 [<ffffffff810fcac7>] write_cache_pages+0x227/0x4d0
 [<ffffffffa041cc80>] ? nfs_writepages_callback+0x0/0x40 [nfs]
 [<ffffffffa041cc09>] nfs_writepages+0xb9/0x130 [nfs]
 [<ffffffffa041cf10>] ? nfs_flush_one+0x0/0xf0 [nfs]
 [<ffffffff810fcdc1>] do_writepages+0x21/0x40
 [<ffffffff810f40ab>] __filemap_fdatawrite_range+0x5b/0x60
 [<ffffffff810f43df>] filemap_fdatawrite+0x1f/0x30
 [<ffffffff810f4425>] filemap_write_and_wait+0x35/0x50
 [<ffffffffa041055b>] nfs_setattr+0x15b/0x180 [nfs]
 [<ffffffff810f4c26>] ? generic_file_aio_read+0xb6/0x1d0
 [<ffffffff81013b0e>] ? common_interrupt+0xe/0x13
 [<ffffffff810f36de>] ? find_get_page+0x1e/0xa0
 [<ffffffff810f50c9>] ? filemap_fault+0xb9/0x460
 [<ffffffff8106c4a7>] ? current_fs_time+0x27/0x30
 [<ffffffff8115b7db>] notify_change+0x16b/0x350
 [<ffffffff8116a16c>] utimes_common+0xdc/0x1b0
 [<ffffffff812b6eba>] ? __up_read+0x9a/0xc0
 [<ffffffff8116a2e1>] do_utimes+0xa1/0xf0
 [<ffffffff81543378>] ? do_page_fault+0x158/0x3b0
 [<ffffffff8116a442>] sys_utimensat+0x32/0x90
 [<ffffffff810131b2>] system_call_fastpath+0x16/0x1b

Other tasks raising such messages (the same session):

[103440.890116] INFO: task kswapd0:52 blocked for more than 120 seconds.
[103440.890576] INFO: task Xorg:1300 blocked for more than 120 seconds.
[103440.891199] INFO: task plasma-desktop:1987 blocked for more than 120 seconds.
[103440.892078] INFO: task mv:26028 blocked for more than 120 seconds.
[103560.890060] INFO: task kswapd0:52 blocked for more than 120 seconds.
[103560.890516] INFO: task Xorg:1300 blocked for more than 120 seconds.
[103560.891139] INFO: task plasma-desktop:1987 blocked for more than 120 seconds.
[103560.892002] INFO: task mv:26028 blocked for more than 120 seconds.
[103680.890072] INFO: task kswapd0:52 blocked for more than 120 seconds.
[103680.890535] INFO: task Xorg:1300 blocked for more than 120 seconds.

Revision history for this message
barbz (p-barbz) wrote :

I have the same problem in lucid.

Mounting nfs in fstab via
192.168.1.128:/mnt/Array1 /media/Array1 nfs rsize=8192,wsize=8192,timeo=14,intr,noatime,nodiratime

Transfers of files larger than 1 GB cause the system to lock up once the first 1 GB has been transferred.

Base install of 10.04 beta 2 with nfs-common and portmap installed.

Paul

Revision history for this message
Andre Roth (lynx-deactivatedaccount) wrote :

Our NFS boot environment is affected by the same problem as we are trying to get Lucid Lynx ready.

Apparently this has been fixed in kernel versions 2.6.33.2 and 2.6.32.11 according to:
http://bbs.archlinux.org/viewtopic.php?pid=739477

I really hope this will be patched in the Ubuntu 2.6.32 kernels soon.

andré

Revision history for this message
Ancoron Luziferis (ancoron) wrote :

Yep, the upstream commit seems to be this one:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=bb6fbc4548b9ae7ebbd06ef72f00229df259d217

But in addition this one should also be considered for a backport:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=d812e575822a2b7ab1a7cadae2571505ec6ec2bd

So, let's backport them and see if they really fix the issue...

Revision history for this message
Ancoron Luziferis (ancoron) wrote :

Well, I just saw that 2.6.32.11 is already the current version for Lucid, so it should be fine.

If it is not, then the problem lies elsewhere.

Revision history for this message
Ancoron Luziferis (ancoron) wrote :

Just made some tests with some ISOs (690 MiB - 3.7 GiB) and the issue seems to be fixed.

Now I got consistent read/write speed back again.

Revision history for this message
Andre Roth (lynx-deactivatedaccount) wrote : Re: [Bug 561210] Re: Writing big files to NFS target causes system lock up

On 22.04.2010 22:57, Ancoron Luziferis wrote:
> Well, just to see that 2.6.32.11 is already the current version for
> Lucid. So it should be fine.
>
> If it is not then it is a problem elsewhere.
>
>
Where did you find this information?
I tried to build a vanilla kernel and patch it with the Ubuntu patches, but I was unable to find them anywhere...

Regards
 andré

tags: added: kj-triage
Revision history for this message
Christoph Lechleitner (lech) wrote :

I seem to suffer from the same problem on up-to-date Lucid x86_64 with Ubuntu kernel 2.6.32-21-generic.
I assume the problem only occurs when the target server is considerably slower than the local machine, like an ultra-slow NAS serving an SSD-boosted developer machine.

Revision history for this message
Jeremy Foshee (jeremyfoshee) wrote :

Hi Ancoron,

This bug was reported a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue? Can you try with the latest development release of Ubuntu? ISO CD images are available from http://cdimage.ubuntu.com/releases/ .

If it remains an issue, could you run the following command from a Terminal (Applications->Accessories->Terminal). It will automatically gather and attach updated debug information to this report.

apport-collect -p linux 561210

Also, if you could test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

    [This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: needs-kernel-logs
tags: added: needs-upstream-testing
Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
Ancoron Luziferis (ancoron) wrote :

@Christoph: If you still experience this bug please do that apport-collect thing.

For me it is fine here. Tested with two different amd64 machines accessing a single NAS (1 Gb network, software-RAID-5 on busybox NAS, write speed 20 - 30 MiB/s). Also simultaneous read/write access doesn't yield any problem here.

How I mount them:

rw,rsize=32768,wsize=32768,hard,intr,noatime

Revision history for this message
Christoph Lechleitner (lech) wrote :

I only had the problem with an extremely slow NAS as the target, and I gave the damn thing away a week ago.
So I have no easy way of reproducing it now, sorry.

Revision history for this message
Ancoron Luziferis (ancoron) wrote :

Well, today on one of my machines here at home this issue is back:

[616201.460064] INFO: task kswapd0:52 blocked for more than 120 seconds.
[616201.460072] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[616201.460079] kswapd0 D 0000000000000000 0 52 2 0x00000000
[616201.460090] ffff880128d2f720 0000000000000046 0000000000015bc0 0000000000015bc0
[616201.460100] ffff88012af8df80 ffff880128d2ffd8 0000000000015bc0 ffff88012af8dbc0
[616201.460108] 0000000000015bc0 ffff880128d2ffd8 0000000000015bc0 ffff88012af8df80
[616201.460117] Call Trace:
[616201.460153] [<ffffffffa03a62b0>] ? nfs_wait_bit_uninterruptible+0x0/0x20 [nfs]
[616201.460166] [<ffffffff8153eb87>] io_schedule+0x47/0x70
[616201.460192] [<ffffffffa03a62be>] nfs_wait_bit_uninterruptible+0xe/0x20 [nfs]
[616201.460201] [<ffffffff8153f3df>] __wait_on_bit+0x5f/0x90
[616201.460211] [<ffffffff811346a6>] ? __slab_free+0x96/0x120
[616201.460235] [<ffffffffa03a62b0>] ? nfs_wait_bit_uninterruptible+0x0/0x20 [nfs]
[616201.460243] [<ffffffff8153f488>] out_of_line_wait_on_bit+0x78/0x90
[616201.460252] [<ffffffff81085360>] ? wake_bit_function+0x0/0x40
[616201.460277] [<ffffffffa03a629f>] nfs_wait_on_request+0x2f/0x40 [nfs]
[616201.460302] [<ffffffffa03aa6af>] nfs_wait_on_requests_locked+0x7f/0xd0 [nfs]
[616201.460329] [<ffffffffa03abaee>] nfs_sync_mapping_wait+0x9e/0x1a0 [nfs]
[616201.460354] [<ffffffffa03abc71>] nfs_wb_page+0x81/0xe0 [nfs]
[616201.460376] [<ffffffffa039ab2f>] nfs_release_page+0x5f/0x80 [nfs]
[616201.460384] [<ffffffff810f2bb2>] try_to_release_page+0x32/0x50
[616201.460392] [<ffffffff81101833>] shrink_page_list+0x453/0x5f0
[616201.460402] [<ffffffff8113b419>] ? mem_cgroup_del_lru+0x39/0x40
[616201.460409] [<ffffffff81100517>] ? isolate_lru_pages+0x227/0x260
[616201.460417] [<ffffffff81101cdd>] shrink_inactive_list+0x30d/0x7e0
[616201.460426] [<ffffffff810116c0>] ? __switch_to+0xd0/0x320
[616201.460434] [<ffffffff81076e2c>] ? lock_timer_base+0x3c/0x70
[616201.460441] [<ffffffff810778b5>] ? try_to_del_timer_sync+0x75/0xd0
[616201.460449] [<ffffffff81102241>] shrink_list+0x91/0xf0
[616201.460455] [<ffffffff81102437>] shrink_zone+0x197/0x240
[616201.460463] [<ffffffff811034c9>] balance_pgdat+0x659/0x6d0
[616201.460470] [<ffffffff81100550>] ? isolate_pages_global+0x0/0x50
[616201.460477] [<ffffffff8110363e>] kswapd+0xfe/0x150
[616201.460485] [<ffffffff81085320>] ? autoremove_wake_function+0x0/0x40
[616201.460492] [<ffffffff81103540>] ? kswapd+0x0/0x150
[616201.460498] [<ffffffff81084fa6>] kthread+0x96/0xa0
[616201.460506] [<ffffffff810141ea>] child_rip+0xa/0x20
[616201.460513] [<ffffffff81084f10>] ? kthread+0x0/0xa0
[616201.460520] [<ffffffff810141e0>] ? child_rip+0x0/0x20

This was an extract job for a rather small archive (just ~ 400 MiB) from the NAS, to the NAS (I know this is bad practice).

What also came up is that again KDE4 completely freezes, until the lock is released:

[616201.460551] INFO: task plasma-desktop:7429 blocked for more than 120 seconds.
[616201.460556] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[616201.460561] plasma-deskto D 0000000000000000 0 7429 ...


Revision history for this message
Christoph Lechleitner (lech) wrote :

My desktop also froze for those 120 seconds.
I use Gnome, but there are KDE-based widgets running, namely klipper.

Regarding buffer size: larger write buffers (backed by enough RAM) defer or even avoid the point at which the buffer runs full.
I think we agree that's when the freeze occurs.

Does the kernel allow some kind of temporary "overbooking" for write buffers, like it does for RAM?
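
(For what it's worth, the kernel does buffer writes up to a tunable fraction of memory before forcing writeback; these sysctls can be inspected and lowered so writeback starts earlier. The values below are only illustrative, not a recommendation:)

    # show the current dirty-page thresholds (percent of memory)
    sysctl vm.dirty_background_ratio vm.dirty_ratio
    # start background writeback earlier and cap the amount of dirty data lower
    sudo sysctl vm.dirty_background_ratio=5 vm.dirty_ratio=10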

Revision history for this message
Christoph Lechleitner (lech) wrote :

If my assumption about the freeze trigger is correct, it should be possible to provoke the problem even against fast NFS servers by setting the buffer size extremely low, i.e. to the minimum allowed by the NFS driver.
I'll try this out, but due to a business trip it won't happen before the end of the upcoming week.
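
(A mount of that kind could look like the following; the server address and paths are hypothetical:)

    sudo mount -t nfs -o rw,hard,intr,rsize=64,wsize=64 192.168.1.10:/export/data /mnt/slowtest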

Revision history for this message
Christoph Lechleitner (lech) wrote :

Today I tried copying very large files to an NFS share mounted with rsize=64,wsize=64 and I could reproduce freezes of about 30 seconds, but no 120-second freezes and nothing in the dmesg output.
I am on Lucid's amd64 kernel 2.6.32-22.

Revision history for this message
jjbig (a-launchpad-net-jjbig-dittri-ch) wrote :

I have to confirm this bug as well. Same effect: everything hangs for some time while transferring bigger data to NFS. I'm using Ubuntu 10.04 64-bit (2.6.32-22-generic #36-Ubuntu SMP Thu Jun 3 19:31:57 UTC 2010 x86_64 GNU/Linux). dmesg doesn't report anything.

Revision history for this message
Jeremy Foshee (jeremyfoshee) wrote :

Ancoron,
     Per the new kernel team policy, I'd like to close this issue out for you as fix released.

jjbig / Christoph,
    Also, per ubuntu kernel team policy, could I get the two of you to file new bugs for your issues? This will enable us to approach your bugs from an individual standpoint and rule out any hardware as affecting the core problem.

Thanks!

~JFo

Changed in linux (Ubuntu):
status: Incomplete → Fix Released
Revision history for this message
Ancoron Luziferis (ancoron) wrote :

Jeremy,

I disagree with the status change to "Fix Released" as it is not "fixed". The problem still occurs. It didn't occur on my main workstation here because I have raised the rsize/wsize to 32MB on the NFS mounts and I am only issuing one or two NFS "transactions" at a time, so it didn't come up in the first place.

However, even if I raise the rsize/wsize I can still reproduce this issue by throwing a bit more parallel work at the NFS mounts. On a "standard" NFS mount just a "cp" of a 400 MB file from the NFS mount back to the same mount is sufficient to raise this issue again. This doesn't occur every time, but around every second or third run at least.

I'll try to find the workload required to get my 32MB mounts stuck too. After that I'll be able to build a small script as a test case for that.
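
(A minimal sketch of such a test case, with hypothetical paths, could be as simple as repeatedly copying a file that already lives on the NFS mount back onto the same mount:)

    #!/bin/bash
    # hypothetical ~400 MB file already sitting on the NFS mount
    SRC=/media/nas/test-400M.bin
    for i in $(seq 1 10)
    do
        cp "$SRC" "$SRC.copy$i"
    done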

Revision history for this message
JimWright (jim-jim-wright) wrote :

I believe we have also been affected by this issue. I've managed to recreate it using a single virtual machine running on VMWare server and only using the loopback interface.

First I created a 64-bit VM using VMWare Server with 2 CPUs, 512MB RAM and 8GB disk. Into that I performed a fresh install of 64-bit Ubuntu 10.04 LTS desktop edition (as it was all I had to hand), using the image file:

    ubuntu-10.04-desktop-amd64.iso

Once installed, I added the nfs-kernel-server package. At this stage I have not upgraded any packages from the versions that come on the CD.

In /etc/exports I added the line:

/srv *(rw,sync,no_subtree_check)

In /etc/fstab I added the line:

localhost:/srv /mnt/srv nfs rw 0 2

Then I executed the following commands:

# exportfs -a
# mkdir /mnt/srv
# mount /mnt/srv

In /srv I created a 512MB file (512MB was chosen to match the size of RAM on the virtual machine)

# dd if=/dev/urandom of=/srv/test bs=1M count=512

Then I executed a continual gzip loop accessing the file over NFS using the loopback interface, and writing its results back over NFS.

# while true
> do
> gzip -c /mnt/srv/test >/mnt/srv/test.gz
> done

I was running top in another virtual console and within seconds the load on the virtual machine rose rapidly (> 10), gzip was no longer consuming CPU, and the machine appeared to "lock up" permanently and never recover.

On rebooting the following messages were in /var/syslog:

Jun 21 23:22:12 lucid kernel: [ 1202.033262] INFO: task kswapd0:36 blocked for more than 120 seconds.
Jun 21 23:22:12 lucid kernel: [ 1202.150847] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun 21 23:22:12 lucid kernel: [ 1202.286349] kswapd0 D 0000000000000000 0 36 2 0x00000000
Jun 21 23:22:12 lucid kernel: [ 1202.286349] ffff880017881720 0000000000000046 0000000000015bc0 0000000000015bc0
Jun 21 23:22:12 lucid kernel: [ 1202.286349] ffff88001d79df80 ffff880017881fd8 0000000000015bc0 ffff88001d79dbc0
Jun 21 23:22:12 lucid kernel: [ 1202.286349] 0000000000015bc0 ffff880017881fd8 0000000000015bc0 ffff88001d79df80
Jun 21 23:22:12 lucid kernel: [ 1202.286349] Call Trace:
Jun 21 23:22:12 lucid kernel: [ 1202.288343] [<ffffffffa01d02b0>] ? nfs_wait_bit_uninterruptible+0x0/0x20 [nfs]
Jun 21 23:22:12 lucid kernel: [ 1202.288552] [<ffffffff8153eb57>] io_schedule+0x47/0x70
Jun 21 23:22:12 lucid kernel: [ 1202.288574] [<ffffffffa01d02be>] nfs_wait_bit_uninterruptible+0xe/0x20 [nfs]
Jun 21 23:22:12 lucid kernel: [ 1202.288580] [<ffffffff8153f3af>] __wait_on_bit+0x5f/0x90
Jun 21 23:22:12 lucid kernel: [ 1202.288593] [<ffffffffa01d02b0>] ? nfs_wait_bit_uninterruptible+0x0/0x20 [nfs]
Jun 21 23:22:12 lucid kernel: [ 1202.288599] [<ffffffff8153f458>] out_of_line_wait_on_bit+0x78/0x90
Jun 21 23:22:12 lucid kernel: [ 1202.288605] [<ffffffff81085360>] ? wake_bit_function+0x0/0x40
Jun 21 23:22:12 lucid kernel: [ 1202.288627] [<ffffffffa01d029f>] nfs_wait_on_request+0x2f/0x40 [nfs]
Jun 21 23:22:12 lucid kernel: [ 1202.288640] [<ffffffffa01d46af>] nfs_wait_on_requests_locked+0x7f/0xd0 [nfs]
Jun 21 23:22:12 lucid kernel: [ 1202.288658] [<ffffffffa01d...


Changed in linux (Ubuntu):
status: Fix Released → Triaged
Revision history for this message
Ancoron Luziferis (ancoron) wrote :

@Jim: That looks to me like a test case :)

All this information leads me to the conclusion that this blocking behavior occurs most commonly when the system runs out of free RAM and starts to swap heavily.

I'm currently running a vanilla kernel 2.6.35-rc3 and with that I wasn't able to reproduce the blocking here, but I noticed that when the system starts this heavy swapping even my hardware-accelerated mouse cursor gets stuck sometimes (reminds me of an old crappy Windows box I used to have). This, plus the fact that the whole user interface (at least with KDE4) sometimes locks up, points me to something that isn't necessarily related to NFS itself.

As every file operation in Linux also uses system memory to speed things up, it could be that the priority of what gets swapped out and what gets swapped in doesn't suit the individual needs. In the worst case, if something like plasma-desktop is chosen to be swapped out, parts of it are immediately scheduled for swap-in again, as plasma-desktop updates periodically. That way the wrong swapping strategy could introduce this issue.

On the other hand, I don't understand why the hell heavy swapping is able to interfere with a hardware-accelerated mouse cursor at all. This symptom alone leads me to another pointer: interrupt handling. But I don't know enough about that to go any further.

I just issued a test on my 2.6.35-rc3 box:
- set up an NFS mount with rsize=128,wsize=128
- make the system memory (4GB) almost completely used by other processes (very few file cache/buffered)
- issue a "gunzip" of a 4.1 GiB file from the NFS mount to the NFS mount

And now guess what?

Yes, I got those hung tasks again even with a much newer kernel, so the problem isn't addressed upstream and hence can't be fixed in Lucid either.

Although this time it is a bit different, because I don't get NFS-related backtraces here. Instead (and this is what I currently think is closer to the cause of the problem) I always get calls for memory allocation and/or paging requests. So the system is swapping heavily, and the swapping takes time, as we all know.

But instead of just waiting for the scheduled operation to complete, it "blocks" and therefore interferes with user interaction entirely.

To verify that problem I set up some KVMs on my machine at work (a 6-core AMD64, 8 GiB RAM) and even without any NFS mounts or exports I got very similar behavior. Although I haven't provoked a complete lockup yet (well, I do need to get some work done there), I also got a temporarily freezing "hardware accelerated" mouse cursor and a completely non-responsive desktop (for a few seconds up to a minute).

Here it goes for my machine with kernel 2.6.35-rc3:

[240602.803784] INFO: task kwin:2102 blocked for more than 120 seconds.
[240602.803787] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[240602.803789] kwin D 00000000ffffffff 0 2102 2100 0x00000000
[240602.803793] ffff8801281d93d8 0000000000000086 ffff880073c09cc8 0000000000015840
[240602.803796] ffff8801281d9fd8 0000000000015840 ffff8801281d9fd8 ffff88012a2196d0
[240602.803798] 0000000000015840 0000000000015840 ffff8801281d9fd8 0000000000015840
[...

tags: added: kernel-fs kernel-needs-review
removed: needs-kernel-logs needs-upstream-testing
Revision history for this message
tom (thomas-gutzler) wrote :

I'm running
Linux io 2.6.32-23-server #37-Ubuntu SMP Fri Jun 11 09:11:11 UTC 2010 x86_64 GNU/Linux
with all upgrades installed and I got a freeze today. Since this is a multi-user file server I'm not sure what exactly was going on at the time. Unfortunately, I cannot try out different kernels either as the machine is always in use.

Jul 7 11:29:57 io kernel: [254640.440019] INFO: task jbd2/sdb-8:942 blocked for more than 120 seconds.
Jul 7 11:29:57 io kernel: [254640.440033] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 7 11:29:57 io kernel: [254640.440042] jbd2/sdb-8 D 00000000ffffffff 0 942 2 0x00000000
Jul 7 11:29:57 io kernel: [254640.440047] ffff88007bd71d20 0000000000000046 0000000000015bc0 0000000000015bc0
Jul 7 11:29:57 io kernel: [254640.440052] ffff88007bff9ab0 ffff88007bd71fd8 0000000000015bc0 ffff88007bff96f0
Jul 7 11:29:57 io kernel: [254640.440056] 0000000000015bc0 ffff88007bd71fd8 0000000000015bc0 ffff88007bff9ab0
Jul 7 11:29:57 io kernel: [254640.440060] Call Trace:
Jul 7 11:29:57 io kernel: [254640.440069] [<ffffffff8121b9a1>] jbd2_journal_commit_transaction+0x1c1/0x1250
Jul 7 11:29:57 io kernel: [254640.440075] [<ffffffff81076c7c>] ? lock_timer_base+0x3c/0x70
Jul 7 11:29:57 io kernel: [254640.440080] [<ffffffff81085090>] ? autoremove_wake_function+0x0/0x40
Jul 7 11:29:57 io kernel: [254640.440084] [<ffffffff81222f6d>] kjournald2+0xbd/0x220
Jul 7 11:29:57 io kernel: [254640.440088] [<ffffffff81085090>] ? autoremove_wake_function+0x0/0x40
Jul 7 11:29:57 io kernel: [254640.440091] [<ffffffff81222eb0>] ? kjournald2+0x0/0x220
Jul 7 11:29:57 io kernel: [254640.440094] [<ffffffff81084d16>] kthread+0x96/0xa0
Jul 7 11:29:57 io kernel: [254640.440099] [<ffffffff810141ea>] child_rip+0xa/0x20
Jul 7 11:29:57 io kernel: [254640.440102] [<ffffffff81084c80>] ? kthread+0x0/0xa0
Jul 7 11:29:57 io kernel: [254640.440105] [<ffffffff810141e0>] ? child_rip+0x0/0x20
Jul 7 11:29:57 io kernel: [254640.440113] INFO: task nfsd:1753 blocked for more than 120 seconds.
Jul 7 11:29:57 io kernel: [254640.440119] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 7 11:29:57 io kernel: [254640.440126] nfsd D 0000000000000000 0 1753 2 0x00000000
Jul 7 11:29:57 io kernel: [254640.440130] ffff88007b345b10 0000000000000046 0000000000015bc0 0000000000015bc0
Jul 7 11:29:57 io kernel: [254640.440135] ffff88007cbc4890 ffff88007b345fd8 0000000000015bc0 ffff88007cbc44d0
Jul 7 11:29:57 io kernel: [254640.440139] 0000000000015bc0 ffff88007b345fd8 0000000000015bc0 ffff88007cbc4890
Jul 7 11:29:57 io kernel: [254640.440143] Call Trace:
Jul 7 11:29:57 io kernel: [254640.440147] [<ffffffff81219ce1>] start_this_handle+0x251/0x4b0
Jul 7 11:29:57 io kernel: [254640.440150] [<ffffffff81085090>] ? autoremove_wake_function+0x0/0x40
Jul 7 11:29:57 io kernel: [254640.440154] [<ffffffff81559cde>] ? _spin_lock+0xe/0x20
Jul 7 11:29:57 io kernel: [254640.440158] [<ffffffff8121a115>] jbd2_journal_start+0xb5/0x100
Jul 7 11:29:57 io kernel: [254640.440162] [<ffffffff811f80b8>] ext4_journal_start_sb+0x58/0x90
Jul 7 11:29:57 io kernel: [254640.440167]...


Revision history for this message
Timo Harmonen (timo-harmonen) wrote :

I can repro this issue quite easily with my setup. I'm running two amd64 kvm guests on an amd64 host system with 8GB of memory. The NFS server is running on the host, and the guests rely heavily on it. All systems are up to date; the kernel is 2.6.32-23.

The guests hang when they access the NFS mounts heavily; it seems that write operations are needed to trigger it. First I used NFSv3, then switched to NFSv4, but it didn't really help.

host export:
/srv/mmedia 172.16.0.0/16(rw,nohide,insecure,no_subtree_check,async)

guest fstab mount:
172.16.1.1:/mmedia /mmedia nfs4 _netdev,auto 0 0

I have had this issue since upgrading to Lucid, and never had anything like this with Karmic, where I had exactly the same setup.

dmesg log attached, both from the host and a guest.

One way to repro this is to run a script on the guest that processes (copies) image files over NFS; this hangs after processing around 20-50 files. System load starts to increase after the script hangs - I have seen loads way over 200. After this happens, all other processes accessing NFS mounts hang as well. I cannot reboot and have to hard reset the guest.

syslog from the guest:
--------------------------
Jul 12 13:42:14 scotty kernel: [ 360.190575] INFO: task perl:4360 blocked for more than 120 seconds.
Jul 12 13:42:14 scotty kernel: [ 360.190585] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 12 13:42:14 scotty kernel: [ 360.190592] perl D 0000000000000000 0 4360 4358 0x00000000
Jul 12 13:42:14 scotty kernel: [ 360.190605] ffff8800b02ffc48 0000000000000082 0000000000015bc0 0000000000015bc0
Jul 12 13:42:14 scotty kernel: [ 360.190616] ffff8800ae73c890 ffff8800b02fffd8 0000000000015bc0 ffff8800ae73c4d0
Jul 12 13:42:14 scotty kernel: [ 360.190624] 0000000000015bc0 ffff8800b02fffd8 0000000000015bc0 ffff8800ae73c890
Jul 12 13:42:14 scotty kernel: [ 360.190633] Call Trace:
Jul 12 13:42:14 scotty kernel: [ 360.190729] [<ffffffffa014a3b0>] ? nfs_wait_bit_uninterruptible+0x0/0x20 [nfs]
Jul 12 13:42:14 scotty kernel: [ 360.190788] [<ffffffff81541357>] io_schedule+0x47/0x70
Jul 12 13:42:14 scotty kernel: [ 360.190816] [<ffffffffa014a3be>] nfs_wait_bit_uninterruptible+0xe/0x20 [nfs]
Jul 12 13:42:14 scotty kernel: [ 360.190824] [<ffffffff81541bbf>] __wait_on_bit+0x5f/0x90
Jul 12 13:42:14 scotty kernel: [ 360.190850] [<ffffffffa014a3b0>] ? nfs_wait_bit_uninterruptible+0x0/0x20 [nfs]
Jul 12 13:42:14 scotty kernel: [ 360.190860] [<ffffffff81541c68>] out_of_line_wait_on_bit+0x78/0x90
Jul 12 13:42:14 scotty kernel: [ 360.190905] [<ffffffff81085470>] ? wake_bit_function+0x0/0x40
Jul 12 13:42:14 scotty kernel: [ 360.190931] [<ffffffffa014a39f>] nfs_wait_on_request+0x2f/0x40 [nfs]
Jul 12 13:42:14 scotty kernel: [ 360.190964] [<ffffffffa014e7df>] nfs_wait_on_requests_locked+0x7f/0xd0 [nfs]
Jul 12 13:42:14 scotty kernel: [ 360.190992] [<ffffffffa014fc1e>] nfs_sync_mapping_wait+0x9e/0x1a0 [nfs]
Jul 12 13:42:14 scotty kernel: [ 360.191027] [<ffffffffa0150009>] nfs_write_mapping+0x79/0xb0 [nfs]
Jul 12 13:42:14 scotty kernel: [ 360.191060] [<ffffffff8115f7d0>] ? mntput_no_expire+0x30/0x110
Jul 12 13:42:14 scotty kernel: [ 360.191087] [<ffffffff...


Andy Whitcroft (apw)
tags: added: kernel-candidate kernel-reviewed
removed: kernel-needs-review
tags: removed: kernel-candidate
Revision history for this message
no!chance (ralf-fehlau) wrote :

I have the same problem on my machine (10.04 LTS, linux-image-2.6.32-23-server 2.6.32-23.37, 64Bit). My system is up-to-date. During the last system freeze, I discovered that no data was written to the nfs share for at least 15 minutes.

I tried to login at the console and got a lot of messages like this:

INFO: task .... blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

I couldn't log in and had to hard reset my machine. After boot up, I found nothing in the system logs.

Andy Whitcroft (apw)
Changed in linux (Ubuntu):
importance: Undecided → Medium
Revision history for this message
David McBride (david-mcbride) wrote :

Just ran into this bug, affecting me on a new Lucid 64-bit install. Memory exhaustion is unlikely, as the machine has 8GB of RAM and a fast network link.

This may be related to bug #585657.

Remote SSH still works, and the NFS transfer I started a couple of hours ago is still running, slowly, though a process listing blocks and both X and the console are completely unusable.

(Which is a bit of a showstopper for rolling out a few hundred Lucid-based desktops..)

Revision history for this message
David McBride (david-mcbride) wrote :

(My test machine eventually unblocked when the 90GB NFS transfer finally finished. Still, that took a long time, and it was totally unusable until then.)

Checking the current kernel sources, the kernel patches referenced in bug #585657 are clearly already applied. Reviewing recent development history, it appears this bug entry (patch at end) might also be relevant:

  https://bugzilla.kernel.org/show_bug.cgi?id=16056

I'm going to try testing that patch locally to see if it fixes this particular problem.
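
(For anyone wanting to try the same, a rough sketch of one way to rebuild the Lucid kernel with an extra patch applied is below; the patch file name is hypothetical and the build targets are the usual Ubuntu kernel packaging ones:)

    sudo apt-get build-dep linux-image-$(uname -r)
    apt-get source linux-image-$(uname -r)
    cd linux-2.6.32*
    patch -p1 < ~/nfs-writeback-fix.patch   # hypothetical patch file taken from the kernel bugzilla
    fakeroot debian/rules clean
    fakeroot debian/rules binary-generic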

Revision history for this message
David McBride (david-mcbride) wrote :

Applying the patch referenced in the previous comment, the situation with regard to NFS deadlocking is improved -- the local terminal no longer locks up -- but there are continued issues with processes blocking in 'D' (disk-wait) when they should not.

(For example, automounted NFS volumes not involved in a transfer in progress which try to automatically unmount after a period of inactivity will have their 'umount' processes block in disk-wait until the transfer has completed.)

Revision history for this message
David McBride (david-mcbride) wrote :

And, indeed, if the umount process is stuck for long enough, the following kernel stack-trace is emitted:

Jul 26 21:38:49 illustrious kernel: [ 838.729063] INFO: task umount.nfs:2570 blocked for more than 120 seconds.
Jul 26 21:38:49 illustrious kernel: [ 838.729069] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 26 21:38:49 illustrious kernel: [ 838.729073] umount.nfs D 0000000000000000 0 2570 1 0x00000000
Jul 26 21:38:49 illustrious kernel: [ 838.729080] ffff8801d5747d98 0000000000000086 0000000000015bc0 0000000000015bc0
Jul 26 21:38:49 illustrious kernel: [ 838.729087] ffff880210ec03c0 ffff8801d5747fd8 0000000000015bc0 ffff880210ec0000
Jul 26 21:38:49 illustrious kernel: [ 838.729092] 0000000000015bc0 ffff8801d5747fd8 0000000000015bc0 ffff880210ec03c0
Jul 26 21:38:49 illustrious kernel: [ 838.729098] Call Trace:
Jul 26 21:38:49 illustrious kernel: [ 838.729110] [<ffffffff811650d0>] ? bdi_sched_wait+0x0/0x20
Jul 26 21:38:49 illustrious kernel: [ 838.729115] [<ffffffff811650de>] bdi_sched_wait+0xe/0x20
Jul 26 21:38:49 illustrious kernel: [ 838.729123] [<ffffffff8153f3af>] __wait_on_bit+0x5f/0x90
Jul 26 21:38:49 illustrious kernel: [ 838.729127] [<ffffffff811650d0>] ? bdi_sched_wait+0x0/0x20
Jul 26 21:38:49 illustrious kernel: [ 838.729132] [<ffffffff8153f458>] out_of_line_wait_on_bit+0x78/0x90
Jul 26 21:38:49 illustrious kernel: [ 838.729140] [<ffffffff81085360>] ? wake_bit_function+0x0/0x40
Jul 26 21:38:49 illustrious kernel: [ 838.729144] [<ffffffff81165094>] ? bdi_queue_work+0xa4/0xe0
Jul 26 21:38:49 illustrious kernel: [ 838.729149] [<ffffffff8116640f>] bdi_sync_writeback+0x6f/0x80
Jul 26 21:38:49 illustrious kernel: [ 838.729154] [<ffffffff81166440>] sync_inodes_sb+0x20/0x30
Jul 26 21:38:49 illustrious kernel: [ 838.729160] [<ffffffff81169f12>] __sync_filesystem+0x82/0x90
Jul 26 21:38:49 illustrious kernel: [ 838.729164] [<ffffffff81169ff9>] sync_filesystems+0xd9/0x130
Jul 26 21:38:49 illustrious kernel: [ 838.729171] [<ffffffff8115ece1>] sys_umount+0xb1/0xd0
Jul 26 21:38:49 illustrious kernel: [ 838.729178] [<ffffffff810131b2>] system_call_fastpath+0x16/0x1b

I've also identified that my test NFS mount was using UDP, and performs several times *better* in terms of IO throughput when switched to TCP operation. This lack of performance (and responsiveness) may have been masking some other issues, so I'll use TCP for future testing.
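
(Switching an NFS mount from UDP to TCP is just a mount option; for example, with a hypothetical server and mount point:)

    sudo mount -t nfs -o rw,hard,intr,proto=tcp 192.168.1.10:/export/data /mnt/data
    # "tcp" is accepted as a shorthand for proto=tcp; "udp" / proto=udp selects UDP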

Revision history for this message
David McBride (david-mcbride) wrote :

I've compiled up and retried my test transfers with Linus's current latest RC, 2.6.35-rc6, and found that this problem does not recur.

Revision history for this message
Alex (d-f0rce) wrote :

Will the fix from 2.6.35-rc6 be backported to Lucid?

Revision history for this message
Christoph Lechleitner (lech) wrote :

Is there a specific patch that addresses our problem?

Revision history for this message
David Ressman (davidressman) wrote :

If this is truly a duplicate of bug #585657, I can verify that I see this problem in 10.04 with both Ubuntu's 2.6.32-24.39 and with the stock kernel.org 2.6.32.18.

Revision history for this message
Tim Gardner (timg-tpi) wrote :

I'm not having any luck reproducing this problem on a 2.6.32-24.41 NFS server, using a simple mount from the client thusly: 'sudo mount 10.0.2.210:/export /mnt' and blasting a 90GB file onto the server: 'dd if=/dev/zero of=/mnt/users/rtg/bf.txt bs=512 count=188743680'. This is over a 100Mb switch.

Changed in linux (Ubuntu):
assignee: nobody → Tim Gardner (timg-tpi)
status: Triaged → In Progress
Revision history for this message
JimWright (jim-jim-wright) wrote :

Hi Tim,

After your post I retried my test case (see comment #23 above), except this time using KVM instead of VMWare Server. All details of the virtual machine, i.e. 2xCPU/512MB RAM/8GB HDD are the same - it's just using KVM instead of VMWare.

I can confirm that under KVM and a vanilla install of 10.04(.0) the problem occurs exactly as I described before, and within seconds. This is using a 64bit kernel version 2.6.32-21.32.

I then upgraded the virtual machine to the latest kernel 2.6.32-24.41 and ran the same test case again. This kernel seemed much more stable and did not exhibit problems. Multiple iterations of the gzip loop ran successfully. So I tried running a second loop of gzip processes; thus there were now two gzip processes both accessing the NFS mount. After a minute or so this caused the issue to occur, together with the same messages appearing in syslog, the load skyrocketing and the system freezing. So whilst 2.6.32-24.41 appears to me to be better, I do not believe the fundamental problem has been solved. Repeating the test subsequent times, it sometimes took two simultaneous gzip processes to trigger the error, and other times three. I used the following script to run the gzip processes:

#!/bin/bash
while true
do
    gzip -c /mnt/srv/test >/mnt/srv/test$$.gz
done

If I could also make a couple of observations: everyone I have seen having this issue (or issues that look, in my opinion, suspiciously close to this one) seems to be running a 64-bit kernel - I have not tested a 32-bit kernel. My test case uses the loopback interface, so switch speed and/or network card should be irrelevant. Also, having some load on the NFS client seems to be important; that is why my test case gzips a file of random data. I'm not sure your simple dd would generate enough of a real workload.

Jim

BTW In the original test case I missed out a "chmod 777 /srv", but that really isn't going to change anything.

Revision history for this message
Richard Huddleston (rhuddusa) wrote :

@Tim

I doubt the problem would exhibit itself on a 100 Mb switch unless your NFS server was writing to a storage medium with lower throughput... unless you are writing to a USB flash drive, I'm pretty sure your storage can keep up with a 100 Mb switch.

I only see this problem when my NFS writes are bound by the server's storage-medium write speed, i.e. I never max out the network/client throughput.

Revision history for this message
Tim Gardner (timg-tpi) wrote :

JimWright - I was focused on the original reporter's setup, so I missed your comments in #23. It does appear that this seems to happen most when the NFS server is saturated, so it makes sense that a loopback mount would do it. I'll give your setup a try.

Revision history for this message
Tim Gardner (timg-tpi) wrote :

JimWright - I never did manage to reproduce your lockup using a VM, but I was able to do it on bare metal: a dual-CPU 6-core machine (24 threads) with 16GB RAM. It took 48 instances of 'dd if=/dev/zero of=test.$i bs=1M count=512' over an NFS local mount to get it to lock up (after about 5 attempts). Nothing is leaping out at me in the dmesg. See attached.
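
(For reference, a small sketch of that kind of parallel-writer load; the mount point is hypothetical and the instance count can be adjusted:)

    #!/bin/bash
    # start 48 concurrent 512 MB writers on an NFS mount and wait for them all
    DIR=/mnt/nfs-test
    for i in $(seq 1 48)
    do
        dd if=/dev/zero of="$DIR/test.$i" bs=1M count=512 &
    done
    wait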

Revision history for this message
kylea (kylea) wrote :

It may be related to LVM - I get the issue on my root disk, which is mounted via LVM. If I run an update, the update stalls with these errors:

kernel: [ 2281.420086] dpkg D 0000000000000000 0 24196 24133 0x00000000
[ 2281.420090] ffff880168c87db8 0000000000000082 0000000000015bc0 0000000000015bc0
[ 2281.420094] ffff88021a7931a0 ffff880168c87fd8 0000000000015bc0 ffff88021a792de0
[ 2281.420097] 0000000000015bc0 ffff880168c87fd8 0000000000015bc0 ffff88021a7931a0
[ 2281.420100] Call Trace:
[ 2281.420109] [<ffffffff81166d80>] ? bdi_sched_wait+0x0/0x20
[ 2281.420112] [<ffffffff81166d8e>] bdi_sched_wait+0xe/0x20
[ 2281.420117] [<ffffffff815591df>] __wait_on_bit+0x5f/0x90
[ 2281.420119] [<ffffffff81166d80>] ? bdi_sched_wait+0x0/0x20
[ 2281.420122] [<ffffffff81559288>] out_of_line_wait_on_bit+0x78/0x90
[ 2281.420126] [<ffffffff810850d0>] ? wake_bit_function+0x0/0x40
[ 2281.420129] [<ffffffff81166d44>] ? bdi_queue_work+0xa4/0xe0
[ 2281.420131] [<ffffffff811680ef>] bdi_sync_writeback+0x6f/0x80
[ 2281.420134] [<ffffffff81168120>] sync_inodes_sb+0x20/0x30
[ 2281.420137] [<ffffffff8116bc92>] __sync_filesystem+0x82/0x90
[ 2281.420140] [<ffffffff8116bd79>] sync_filesystems+0xd9/0x130
[ 2281.420142] [<ffffffff8116be31>] sys_sync+0x21/0x40
[ 2281.420146] [<ffffffff810131b2>] system_call_fastpath+0x16/0x1b
[ 2401.422564] INFO: task dpkg:24196 blocked for more than 120 seconds.
[ 2401.422568] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2401.422570] dpkg D 0000000000000000 0 24196 24133 0x00000000
[ 2401.422574] ffff880168c87db8 0000000000000082 0000000000015bc0 0000000000015bc0
[ 2401.422578] ffff88021a7931a0 ffff880168c87fd8 0000000000015bc0 ffff88021a792de0
[ 2401.422581] 0000000000015bc0 ffff880168c87fd8 0000000000015bc0 ffff88021a7931a0
[ 2401.422584] Call Trace:
[ 2401.422593] [<ffffffff81166d80>] ? bdi_sched_wait+0x0/0x20
[ 2401.422596] [<ffffffff81166d8e>] bdi_sched_wait+0xe/0x20
[ 2401.422601] [<ffffffff815591df>] __wait_on_bit+0x5f/0x90
[ 2401.422603] [<ffffffff81166d80>] ? bdi_sched_wait+0x0/0x20
[ 2401.422606] [<ffffffff81559288>] out_of_line_wait_on_bit+0x78/0x90
[ 2401.422610] [<ffffffff810850d0>] ? wake_bit_function+0x0/0x40
[ 2401.422613] [<ffffffff81166d44>] ? bdi_queue_work+0xa4/0xe0
[ 2401.422616] [<ffffffff811680ef>] bdi_sync_writeback+0x6f/0x80
[ 2401.422618] [<ffffffff81168120>] sync_inodes_sb+0x20/0x30
[ 2401.422621] [<ffffffff8116bc92>] __sync_filesystem+0x82/0x90
[ 2401.422624] [<ffffffff8116bd79>] sync_filesystems+0xd9/0x130
[ 2401.422626] [<ffffffff8116be31>] sys_sync+0x21/0x40
[ 2401.422631] [<ffffffff810131b2>] system_call_fastpath+0x16/0x1b

Revision history for this message
JimWright (jim-jim-wright) wrote :

TimGardner - From your dmesg output it appears your bare metal test was running kernel version 2.6.35, is that correct? Also the stack trace looks noticeably different. Is there an easy way I can test that kernel version on my 10.04 LTS setup (i.e. is there a package I can install on Lucid Lynx)?

We also experience this lockup on bare metal which is currently making us very nervous about using NFS for large files. I only produced the virtual machine test case to try and constrain the possible variables that may trigger the bug. What parameters did you use for your VM (CPUs/RAM/HDD and Ubuntu & Kernel versions)? BTW I was never able to reproduce the issue with a single CPU, I have always needed multiple CPUs.

I also noticed another difference between your dd test and my gzip test (in addition to the generating CPU load issue I mentioned before) - my gzip process both reads and writes from the NFS mount, whilst your dd only writes. I will see if I can reproduce the issue on my VM using your dd test.

Jim

Revision history for this message
Kim Botherway (dj-dvant) wrote :

Hi,

For me NFS locks up most often with large files when Gnome is running. E.g. I have two machines, one with a 4-core CPU/4GB RAM running Gnome (machine A) and a much smaller system with a single core/1GB RAM (machine B).

Same kernel version, copying the same file.

Machine A to B NFS pauses
Machine B to A NFS runs at max hard drive speed
Machine A to B (without Gnome running) NFS runs at max hard drive speed

Kim

Revision history for this message
Tim Gardner (timg-tpi) wrote :

JimWright: - After not being able to repro the problem using a 10.04 VM with -updates applied (8 CPUs, 512MB, 20GB) using VmWare Workstation as the hypervisor on a 10.04 host, I decided to try Maverick on bare metal. I'll have a better shot at enlisting upstream help if I can demo the problem on a more recent kernel.

I did try your 'gzip' method of reading and writing, but as mentioned, could never get it to fail even on bare metal. I'm next going to see if this issue still exists on 2.6.36-rc3.

Revision history for this message
JimWright (jim-jim-wright) wrote :

Tim:

I managed to get your dd testcase to trigger the issue on a VM running Ubuntu 10.04, Kernel 2.6.32-24.42, 8 CPUs, 512MB RAM, 8GB HDD. I've attached the dmesg output from that test.

I downloaded the latest daily snapshot of Maverick Ubuntu 10.10 (kernel 2.6.35-19.28) and installed it onto an identically configured VM. I was unable to trigger the issue I have seen using either my gzip testcase or your dd testcase after many hours of runtime.

So it would seem that the issue has been fixed in a more recent kernel, and that needs to be backported to the Ubuntu 10.04 kernel?

Jim

Revision history for this message
JimWright (jim-jim-wright) wrote :

FYI. Looking at David McBride's comments on this bug report, particularly #30, and checking the current kernel source package for Ubuntu 10.04, it appears that the patch identified in https://bugzilla.kernel.org/show_bug.cgi?id=16056 (comments 17&18) has not been backported to the 2.6.32 kernel. I am highly confident that this would fix the issue I've been having, as it always appears to be kswapd that gets blocked first.

Jim

Revision history for this message
Tim Gardner (timg-tpi) wrote :

JimWright - That particular patch just arrived via stable updates. Please try Ubuntu-2.6.32-25.43 which was just uploaded to -proposed yesterday.

Revision history for this message
JimWright (jim-jim-wright) wrote :

Tim - thanks I have tried upgrading my virtual machine to the 2.6.32-25.43 kernel as you suggested. I now believe I experience the same failure that you observed on bare metal using the maverick 2.6.35 kernel - see the attached dmesg log from my latest test. So it is definitely an improvement of sorts.

Revision history for this message
Tim Gardner (timg-tpi) wrote :

JimWright - But you're still doing the loopback mount test, right? Upstream has indicated that it's not a supportable case, i.e., http://marc.info/?l=linux-kernel&m=128335681711984&w=2

Revision history for this message
JimWright (jim-jim-wright) wrote :

Tim - yes I was still doing it over the loopback interface. Thanks for that link, I guess that all kind of makes sense.

Part of my problem also stems from the change from autofs4 to autofs5. This has a separate bug (https://bugs.launchpad.net/ubuntu/+source/autofs5/+bug/517139) filed against it. Thus, due to the default hosts file installed by Ubuntu, you can end up with NFS mounts over the loopback interface. I can work around that issue myself through some additional manual configuration, as suggested in that bug report, but it might be worth pointing out to the autofs bug/people that it results in an unsupported configuration out of the box, so to speak.

Many thanks for your help Tim. I think the latest proposed kernel and a bit of manual tweaking will fix the issue I've been having. I'll let everyone else comment on whether it fixes their issues.

Revision history for this message
Timo Harmonen (timo-harmonen) wrote :

So this fix is also in Maverick beta? Unfortunately then it does not seem to fix those problems I'm having.

I installed Maverick beta (2.6.35-19-server, amd64, clean installation) under kvm and was able to repro some of my earlier problems.

I see this most often when using torrentflux to download multiple large (1GB) files simultaneously and save them in an NFS-mounted folder. This is the easiest method to repro this in Lucid, and it seems to trigger the case in Maverick as well. I have also had this problem when using the emusic.com emusicj (Java) mp3 downloader and when renaming multiple image files with renrot (these methods I have tried only in Lucid). These all process files over NFS, but don't utilize much CPU.

I wrote some simple test scripts to try to repro this. The first script uses dd to read and write several files simultaneously, but that hung only once (syslog1 attached) even when I let it run for hours. Another script simultaneously downloads six 8MB files using wget, and that seemed to hang more easily. I ran it twice, and both times it hung (dmesg2, dmesg3).

Revision history for this message
Andy Whitcroft (apw) wrote :

We have had some success preventing this behaviour by adjusting a kernel tunable. If those who are able to reproduce this could try increasing the sysctl 'vm.min_free_kbytes' and see how that affects their tests, that would help. We have been running with this value at approximately 10x its default for our testing, though this is likely an excessive bump.

    # sysctl vm.min_free_kbytes
    vm.min_free_kbytes = 8080
    # sysctl vm.min_free_kbytes=80800
    #

Please report any testing back here on this bug. Thanks.
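
(If the tunable turns out to help, it can be made persistent across reboots via /etc/sysctl.conf; the value below just mirrors the roughly 10x bump mentioned above and is only for testing:)

    # /etc/sysctl.conf
    vm.min_free_kbytes = 80800

    # apply without rebooting
    sudo sysctl -p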

Revision history for this message
Alex (d-f0rce) wrote :

@Andy Whitcroft

Your sysctl setting on its own did not solve the problem for me. However while I was googling for vm.min_free_kbytes to check out what it actually does, I came across this site: http://russ.garrett.co.uk/2009/01/01/linux-kernel-tuning/

So I set these values on both, the server and the client:

## increase amount of kernel memory
vm.min_free_kbytes = 65536

## increase TCP max buffer size
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

## increase TCP autotuning buffer limits
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216

With these settings my problem is gone using the current lucid kernel 2.6.32-24.42. The system still gets a bit laggy on very high NFS network load but it does not stall anymore.

Revision history for this message
Alex (d-f0rce) wrote :

Sorry for spamming, but there is no edit function.

@Andy Whitcroft

I did not try "vm.min_free_kbytes = 65536" on its own.
So maybe your value of 8080 was not high enough for my system.

Revision history for this message
Timo Harmonen (timo-harmonen) wrote :

I tried adjusting vm.min_free_kbytes, and also those other settings suggested by Alex. They didn't seem to have any impact; the test script still hung as before. I tried with 8080, 65536 and 80800, and also with the suggested TCP settings.

It seems that I can repro this case 100% with my wget test script. I have now run it a total of 9 times, and at minimum it took 18 downloads in 77 seconds and at maximum 264 downloads in 1120 seconds before NFS stalled. I have attached the script in case it helps in trying to repro the problem (sorry for abusing ubuntu.com in testing :)). There shouldn't be anything special in my system setup, except that I run everything under kvm.

Revision history for this message
Alex (d-f0rce) wrote :

@Timo Harmonen

I tried your test case but my internet connection is far too slow to trigger the problem with your script. However, I modified your script to use cp instead of wget, so instead of fetching the file from a remote server I just copy it from the local hard disk to the NFS directory. I hope this modification does not invalidate your test case.

The script is still running, but I do not even notice any lag:

------
File download count: 1026 Elapsed: 420 s
 20:50:19 up 6:24, 2 users, load average: 3.95, 3.22, 1.94
------

Can you give it a try with "cp" instead of "wget". Does it trigger the problem for you?
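
(For reference, a sketch of roughly what the cp variant of the loop looks like; this is only an approximation with hypothetical paths, not the attached script:)

    #!/bin/bash
    # copy a local file onto the NFS mount six times in parallel, in a loop,
    # printing a running counter like the output quoted above
    SRC=/tmp/testfile-8M.bin     # hypothetical local source file
    DST=/media/nfs-test          # hypothetical NFS mount point
    START=$(date +%s)
    count=0
    while true
    do
        for i in $(seq 6)
        do
            cp "$SRC" "$DST/file_$count" &
            count=$((count+1))
        done
        wait
        echo "File copy count: $count  Elapsed: $(( $(date +%s) - START )) s"
    done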

Revision history for this message
Timo Harmonen (timo-harmonen) wrote :

@Alex

You are quite right, just using cp is enough to trigger this. I ran it three times in Maverick beta, and it stalled every time.
-----
File copy count: 450 Elapsed: 150 s
File copy count: 252 Elapsed: 85 s
File copy count: 972 Elapsed: 325 s
-----

Revision history for this message
Alex (d-f0rce) wrote :

@Timo Harmonen

I reverted my sysctl settings and indeed it stalled again when using your cp-script. After reenabling the sysctl settings and copying about 40GB I still could not get it to stall. So it seems that the sysctl settings really fixed the problem for me. At least with Lucid kernel 2.6.32-24.42.

For comparison here are my NFS settings. All clients are connected via GB-LAN:

Server: rw,sync,no_subtree_check,no_root_squash
Client: hard,fg,rsize=16384,wsize=16384,acl

I hope you'll find a solution for your problem, too. Good Luck.

Revision history for this message
getnuked (getnuked) wrote :

A week ago I installed a fresh Lucid 10.04 amd64 desktop onto a workstation (Athlon II 240, 4GB ECC RAM, 1TB SATA II disk). Within a day this machine locked up with no response to keyboard or mouse. I could ping it, yet I couldn't ssh to it; luckily Magic SysRq + REISUB was able to sync the local disk, yet it wouldn't reboot. After looking at the logs I noticed the NFS and kswapd errors that eventually brought me to this bug report (I have attached a portion of /var/log/messages showing similar errors).

At first I couldn't reliably reproduce the lockup; it just happened on its own. However, I was able to reproduce it in a few minutes by running a simple loop which copied a CD image to and from an NFS mount and then diffed the contents. I later found that I could cause the lockup to occur in under 10 seconds by adding a second instance of the copy loop while also running memtester on half (2GB) of the RAM (which allocates and mlocks the RAM). If I do this test from a VT I can watch the kmesg/nfs dmesg logs you see at the top of this bug report being displayed on the VT in real time.
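
(A rough sketch of that reproduction, with hypothetical paths; memtester takes the amount of memory to lock and an iteration count, and a second instance of the copy loop can be started for the faster lockup:)

    # terminal 1: lock half of the 4 GB of RAM to squeeze the page cache
    sudo memtester 2048M 10

    # terminal 2: copy a CD image to and from the NFS mount and compare the result
    while true
    do
        cp /tmp/image.iso /mnt/nfs/image.iso
        cp /mnt/nfs/image.iso /tmp/image.copy.iso
        diff /tmp/image.iso /tmp/image.copy.iso || echo "MISMATCH"
    done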

I am using autofs to mount nfs using the following parameters:

server: rw,sync,no_root_squash,no_subtree_check
client: rw,hard,intr,tcp,fg,nfsvers=3,rsize=32768,wsize=32768

After reading this bug report and the ones from the kernel development I got the impression that the problem was fixed in more recent kernels. Luckily the kernel-ppa team has ported the maverick 2.6.35 kernel for use in lucid. I used the following commands to try out the 2.6.35-21 maverick kernel on the lucid workstation. Unfortunately the lock up happened even with the maverick kernel.

sudo add-apt-repository ppa:kernel-ppa/ppa
sudo apt-get update
sudo apt-get install linux-headers-2.6.35-21-generic linux-image-generic-lts-backport-maverick
sudo reboot

Apparently this NFS bug is present not only in 2.6.32 but all the way up to 2.6.35 (four different releases), which ultimately means anyone expecting to use Lucid or Maverick with NFS will either have to live with lock-ups or hope that it eventually gets fixed.

Is there something unique to all of our systems that is masking this from being found during normal regression testing, or perhaps I should ask if NFS is even part of the regular testing?

Revision history for this message
Andy Whitcroft (apw) wrote :

@BlueBuntu -- is the NFS server going non-responsive in that dmesg output the location from which the mounts are made? If so the behaviour on the client is likely correct. I would look to see why the server is non-responsive in that case:

    Sep 18 02:03:01 stratos kernel: [21811.409952] nfs: server 192.168.14.4 not responding, still trying

Revision history for this message
getnuked (getnuked) wrote :

@Andy Whitcroft, yes the nfs server in my situation (192.168.14.4) is the location from which the mounts are made by the workstation that locked up, sorry I should have specified that.

Whenever the NFS client running the Lucid/Maverick kernel locked up, I first checked all our other NFS client workstations to ensure their mounts were working fine, in addition to the NFS server. (In fact my first step was to check our gigabit switches, since I have had network equipment failures in the past that resulted in nearly the same kind of symptoms with locked-up NFS clients; of course, in those situations there were no kernel errors like the ones seen here, other than the usual "nfs: server X not responding, still trying", and simply replacing the switch with a spare fixed the problem in those instances.)

Another thing I forgot to mention previously is that we are using gigabit nics and switches along with jumbo frames on the lan (9k packet size). I am not sure if that has had some impact, yet I would be interested to find out if others have experienced this lockup with or without using jumbo frames on a gigabit lan.

Revision history for this message
Timo Harmonen (timo-harmonen) wrote :

This bug is the sole reason I have had to skip upgrading to Lucid, and as the problem also seems to be present in Maverick, it looks like I will be stuck with Karmic for even longer.

Please find below a recap of my findings; I hope it helps with investigating this.

I can reliably reproduce this case with both the latest Maverick beta (2.6.35-22) and Lucid (2.6.32-24). I have only tested on amd64.

Here are the minimal steps needed in my environment:

1. On an amd64 Maverick (or Lucid) host, install an amd64 Maverick (or Lucid) KVM guest.

2. Export an NFS share on the host and mount it in the guest (a concrete sketch follows the script below). I use the following settings:
  server: rw,nohide,insecure,no_subtree_check,async
  client: timeo=14,rw,_netdev

3. Run the following script in the guest. The script hangs every time, at the latest after creating a few thousand files.

cd <insert-your-nfspath-here>
k=0
while true
do
     for i in $(seq 6)
     do
         dd bs=1k count=8000 if=/dev/zero of="file_$k" &
         k=$((k+1))
     done
     wait
done
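
For completeness, the export and mount from step 2 could look roughly like this; the export path /srv/nfstest and the host address 192.168.122.1 (libvirt's default bridge) are assumptions, not values from my setup:

# on the host: export a directory with the server options from step 2
echo '/srv/nfstest 192.168.122.0/24(rw,nohide,insecure,no_subtree_check,async)' | sudo tee -a /etc/exports
sudo exportfs -ra

# in the guest: mount it with the client options from step 2
sudo mkdir -p /mnt/nfstest
sudo mount -t nfs -o timeo=14,rw 192.168.122.1:/srv/nfstest /mnt/nfstest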

Revision history for this message
getnuked (getnuked) wrote :

I just received an email titled "Ubuntu 9.04 reaches end-of-life on October 23, 2010" on the Ubuntu Security mailing list. 9.04 is the most recent version of Ubuntu that does not exhibit the NFS locking symptom as described in this bug (I have experienced this bug with both Lucid and Maverick kernels, I believe other people have reported Karmic kernels exhibiting the same behavior). As a result I respectfully request that the priority of this bug be raised since we now have one month (to the day) before people in our situation have to abandon using Ubuntu or keep using a distro (Jaunty) that will not be supported.

I have been a big admirer of the Ubuntu development team and have had nothing but the best to say about Ubuntu for the past two years. During that time I have been promoting the use of Ubuntu, to both friends and colleagues, having helped deploy it to many desktops and servers. However, I have to say I am really disappointed that this bug has been around since April (5 months now), and no progress has been made. NFS is a major feature and without it how can Ubuntu expect to be taken seriously for use in anything other than on a home PC or laptop?

On a side note, can anyone report any successes with other distros using recent (2.6.3x) kernels? I ask this primarily to help research how (and if) other distros are getting around this problem, in addition to looking for alternatives should the bug still exist when Ubuntu pulls the plug on Jaunty.

Revision history for this message
Bruce Edge (bruce-edge) wrote :

Medium priority !!!!

The flagship server edition hangs on NFS writes, and this is a MEDIUM priority!!!!!!

This has been broken since 10.04 was released, 10.04.1 still has it, and it's still MEDIUM priority!???

Revision history for this message
Bruce Edge (bruce-edge) wrote :

0 %> uname -a

Linux topaz 2.6.32-24-server #43-Ubuntu SMP Thu Sep 16 16:05:42 UTC 2010 x86_64 GNU/Linux

New hardware, new Ubuntu = can't use NFS:

Sep 28 10:12:50 topaz kernel: [604724.768485] SGI XFS with ACLs, security attributes, realtime, large block/inode numbers, no debug enabled
Sep 28 10:12:50 topaz kernel: [604724.770961] SGI XFS Quota Management subsystem
Sep 28 10:12:50 topaz kernel: [604724.801090] JFS: nTxBlock = 8192, nTxLock = 65536
Sep 28 10:12:50 topaz kernel: [604724.853982] NTFS driver 2.1.29 [Flags: R/O MODULE].
Sep 28 10:12:50 topaz kernel: [604724.901501] QNX4 filesystem 0.2.3 registered.
Sep 28 10:12:50 topaz kernel: [604724.993765] Btrfs loaded
Sep 29 15:14:06 topaz kernel: [709201.511036] INFO: task tar:2926 blocked for more than 120 seconds.
Sep 29 15:14:06 topaz kernel: [709201.526445] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 29 15:14:06 topaz kernel: [709201.542739] tar D 0000000000000002 0 2926 16813 0x00000004
Sep 29 15:14:06 topaz kernel: [709201.542751] ffff880211945c48 0000000000000082 0000000000015bc0 0000000000015bc0
Sep 29 15:14:06 topaz kernel: [709201.542761] ffff88015fa931a0 ffff880211945fd8 0000000000015bc0 ffff88015fa92de0
Sep 29 15:14:06 topaz kernel: [709201.542770] 0000000000015bc0 ffff880211945fd8 0000000000015bc0 ffff88015fa931a0
Sep 29 15:14:06 topaz kernel: [709201.542778] Call Trace:
Sep 29 15:14:06 topaz kernel: [709201.542814] [<ffffffffa01e4380>] ? nfs_wait_bit_uninterruptible+0x0/0x20 [nfs]
Sep 29 15:14:06 topaz kernel: [709201.542831] [<ffffffff81558b97>] io_schedule+0x47/0x70
Sep 29 15:14:06 topaz kernel: [709201.542856] [<ffffffffa01e438e>] nfs_wait_bit_uninterruptible+0xe/0x20 [nfs]
Sep 29 15:14:06 topaz kernel: [709201.542866] [<ffffffff815591bf>] __wait_on_bit+0x5f/0x90
Sep 29 15:14:06 topaz kernel: [709201.542885] [<ffffffffa01e4380>] ? nfs_wait_bit_uninterruptible+0x0/0x20 [nfs]
Sep 29 15:14:06 topaz kernel: [709201.542893] [<ffffffff81559268>] out_of_line_wait_on_bit+0x78/0x90
Sep 29 15:14:06 topaz kernel: [709201.542907] [<ffffffff810850f0>] ? wake_bit_function+0x0/0x40
Sep 29 15:14:06 topaz kernel: [709201.542925] [<ffffffffa01e436f>] nfs_wait_on_request+0x2f/0x40 [nfs]
Sep 29 15:14:06 topaz kernel: [709201.542945] [<ffffffffa01e879f>] nfs_wait_on_requests_locked+0x7f/0xd0 [nfs]
Sep 29 15:14:06 topaz kernel: [709201.542966] [<ffffffffa01e9bde>] nfs_sync_mapping_wait+0x9e/0x1a0 [nfs]
Sep 29 15:14:06 topaz kernel: [709201.542988] [<ffffffffa01e9fc9>] nfs_write_mapping+0x79/0xb0 [nfs]
Sep 29 15:14:06 topaz kernel: [709201.543010] [<ffffffffa01ea037>] nfs_wb_all+0x17/0x20 [nfs]
Sep 29 15:14:06 topaz kernel: [709201.543031] [<ffffffffa01d8f7a>] nfs_do_fsync+0x2a/0x60 [nfs]
Sep 29 15:14:06 topaz kernel: [709201.543047] [<ffffffffa01d91c5>] nfs_file_flush+0x75/0xa0 [nfs]
Sep 29 15:14:06 topaz kernel: [709201.543057] [<ffffffff8114225c>] filp_close+0x3c/0x90
Sep 29 15:14:06 topaz kernel: [709201.543064] [<ffffffff81142367>] sys_close+0xb7/0x120
Sep 29 15:14:06 topaz kernel: [709201.543073] [<ffffffff810131b2>] system_call_fastpath+0x16/0x1b
Sep 29 15:16:06 topaz kernel: [709321.540858] INFO: task tar:2926 blocked...

Revision history for this message
David McBride (david-mcbride) wrote :

Hi Thag,

I don't know if this helps you, but I'm currently running NFS servers and clients successfully on 10.04 after having replaced the kernel with Linus's upstream 2.6.35.6.

When testing with 2.6.35, enabling "Forced pre-emption" appeared to expose similar in-kernel deadlocking bugs which did not occur with just voluntary pre-emption enabled.
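
For anyone comparing kernel builds, these preemption modes correspond to the standard kconfig symbols; the snippet below is just a sketch of the relevant options for the "voluntary pre-emption" build that worked for me, not a full configuration:

# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set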

Note that some earlier point releases of 2.6.35 did not *serve* NFS properly from volumes backed by XFS, and would spuriously return "Stale NFS filehandle" for inodes which were valid; I'd recommend using stable release .6 or later if you're running in a similar configuration.

Finally, I'm sympathetic towards the Ubuntu chaps -- they're trying to roll full distribution releases every 6 months to a pre-determined schedule with (what appears to be) not enough manpower and not enough time to push patches back upstream. They're going to drop things.

Revision history for this message
ab (aaronb) wrote :

The large file transfer issue mentioned in the original posting sounds like the problem I'm having too, but I'm not using NFS; I'm using an mdadm RAID1 setup. When a large file is copied, only the dpkg/apt/synaptic package tools run slowly. The rest of the system seems to function well, including text terminals, gdm, and file transfers within and between non-RAID drives.

I do not remember having this problem with any other Ubuntu flavors or versions. It started when I set up RAID. If I remove the RAID, or do not transfer large files while running the package tools, it does not seem to cause problems.

My machine:
- OS: Ubuntu Linux 10.10
- CPU: Intel Q8200 Core 2 Quad

Revision history for this message
Hernan (hernan-123) wrote :

I'm having a similar problem, but copying a file from Ubuntu to a Windows share using smb/CIFS (as part of a backup triggered from a cron job).
The problem occurs erratically, and after hours I have to reboot the machine (only a 'ping' from another machine reveals that the system is alive...).
I'm running Ubuntu 10.04 Lucid; the file is on an LVM volume on top of raid1/mdadm, with swap directly on raid1/mdadm.
uname : Linux 2.6.32-25-generic #45-Ubuntu SMP Sat Oct 16 19:52:42 UTC 2010 x86_64 GNU/Linux

First lines in kernel log:
Nov 17 04:00:05 hernan kernel: [381217.309261] EXT4-fs (dm-1): ext4_orphan_cleanup: deleting unreferenced inode 3679614
...
Nov 17 04:00:05 hernan kernel: [381217.427197] EXT4-fs (dm-1): ext4_orphan_cleanup: deleting unreferenced inode 3679431
Nov 17 04:00:05 hernan kernel: [381217.427217] EXT4-fs (dm-1): 13 orphan inodes deleted
Nov 17 04:00:05 hernan kernel: [381217.427221] EXT4-fs (dm-1): recovery complete
Nov 17 04:00:05 hernan kernel: [381217.661830] EXT4-fs (dm-1): mounted filesystem with ordered data mode
Nov 17 04:12:40 hernan kernel: [381972.960362] CIFS VFS: Error -4 sending data on socket to server
Nov 17 04:12:40 hernan kernel: [381972.964846] CIFS VFS: No response to cmd 47 mid 34736
Nov 17 04:12:40 hernan kernel: [381972.964850] CIFS VFS: No response to cmd 47 mid 34734
Nov 17 04:12:40 hernan kernel: [381972.964854] CIFS VFS: Write2 ret -11, wrote 0
Nov 17 04:12:40 hernan kernel: [381972.964868] CIFS VFS: No response to cmd 47 mid 34711
Nov 17 04:12:40 hernan kernel: [381972.964877] CIFS VFS: No response to cmd 47 mid 34735
Nov 17 04:16:28 hernan kernel: [382200.850064] INFO: task kswapd0:52 blocked for more than 120 seconds.
Nov 17 04:16:28 hernan kernel: [382200.850071] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Nov 17 04:16:28 hernan kernel: [382200.850073] kswapd0 D ffffffffa004f3e0 0 52 2 0x00000000
Nov 17 04:16:28 hernan kernel: [382200.850078] ffff8801398e1590 0000000000000046 0000000000015bc0 0000000000015bc0
Nov 17 04:16:28 hernan kernel: [382200.850083] ffff88013a84df80 ffff8801398e1fd8 0000000000015bc0 ffff88013a84dbc0
Nov 17 04:16:28 hernan kernel: [382200.850087] 0000000000015bc0 ffff8801398e1fd8 0000000000015bc0 ffff88013a84df80
Nov 17 04:16:28 hernan kernel: [382200.850091] Call Trace:
Nov 17 04:16:28 hernan kernel: [382200.850102] [<ffffffff81541b6d>] schedule_timeout+0x22d/0x300
Nov 17 04:16:28 hernan kernel: [382200.850107] [<ffffffff81012b0e>] ? common_interrupt+0xe/0x13
Nov 17 04:16:28 hernan kernel: [382200.850111] [<ffffffff81542ace>] __down+0x7e/0xc0
Nov 17 04:16:28 hernan kernel: [382200.850116] [<ffffffff81089501>] down+0x41/0x50
Nov 17 04:16:28 hernan kernel: [382200.850123] [<ffffffffa0488719>] cifs_reconnect_tcon+0x1a9/0x2f0 [cifs]
Nov 17 04:16:28 hernan kernel: [382200.850130] [<ffffffffa048daf7>] small_smb_init+0x37/0x80 [cifs]
Nov 17 04:16:28 hernan kernel: [382200.850136] [<ffffffffa048eeba>] CIFSSMBWrite2+0x7a/0x290 [cifs]
Nov 17 04:16:28 hernan kernel: [382200.850143] [<ffffffffa049a8af>] cifs_write+0x1bf/0x480 [cifs]
Nov 17 04:16:28 hernan kernel: [382200.850150] [<ffffffffa049d6ee>] T.1064+0xee/0x190 [cifs]
Nov...


Revision history for this message
Eric Pfeiffer (eric-pfeiffer) wrote :

I had the same issue with all of my Linux clients here accessing a gigabit NAS. However, after messing around with this issue for the last few weeks, I finally found this hint:

http://publib.boulder.ibm.com/infocenter/pseries/v5r3/index.jsp?topic=/com.ibm.aix.prftungd/doc/prftungd/HT_prftungd_impr_nfs_client_writing_perf.htm

Mounting the NFS shares with the combehind option on the clients solved the problem for me.

Revision history for this message
Eric Pfeiffer (eric-pfeiffer) wrote :

OK, sorry all - I fooled myself. After breaking the NFS mounts with this incompatible option I was actually writing to local storage... please forget about this. Anyway, I'm now on the list and looking forward to a solution to this issue.

Revision history for this message
Eric Pfeiffer (eric-pfeiffer) wrote :

Just in case this can help someone (if not, sorry for my frequent posts):

After a long time investigating which clients really have the lock-ups and which do not, I concluded that in my case the problem was caused by the simple fact that the affected clients were reachable only through NAT (most of them VMs, Ubuntu 9.04 .. 10.10). Admittedly, I should have realized earlier that NFS over NAT is a 'not so good' idea (bidirectional port handling, file locks, etc.). But the resulting errors are very similar to what is described in this report.

So what can help if getting a dedicated IP for every machine is not an option?

Do not mount the shares via fstab; instead, take a look at autofs, a really great utility for doing on-demand mounts and also for handling connections that are not 100% reliable.

Sample config after apt-get install autofs5:

add to /etc/auto.master:
/mnt/myserver /etc/auto.guests --timeout=60 --ghost

create /etc/auto.guests and add (adjust the options to your needs):
guests -defaults myserver:/your_export_path/guests

create a symbolic link where you really want to access the share (target first, link name second):
ln -s /mnt/myserver/guests /home/guests
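
After editing the maps, restart autofs (e.g. "sudo service autofs restart"); the first access to /home/guests then triggers the mount on demand.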

One thing remains: I can now write huge files to the NFS server even from clients behind NAT, but the speed decreases the larger the file gets (which reminds me of comment #71 - though I can't find a similar mount option in Ubuntu). Does anyone have any ideas?

Revision history for this message
Nrm (smith32-35) wrote :

Hi everyone,

I've got the same problem, and if I use my Wi-Fi card instead it's "solved".
My Ethernet card is:

09:00.0 Ethernet controller: Atheros Communications Atheros AR8132 / L1c Gigabit Ethernet Adapter (rev c0)

Revision history for this message
StoatWblr (stoatwblr) wrote :

Your wifi card is likely a _lot_ slower than the ethernet. This only seems to manifest at high throughputs.

Revision history for this message
Nuno Sucena Almeida (slug-debian) wrote :

Still having this problem almost daily (Lucid 10.04 amd64). I tried all the workarounds (sysctl tweaks, the newer 2.6.35-23-generic kernel, etc.); none worked.

Revision history for this message
David Ressman (davidressman) wrote :

This is truly insane. We have a support contract, *and* we've provided Canonical with a patch that solves the problem, but we still can't get them to add it into the mainline kernel.

Revision history for this message
Tim Gardner (timg-tpi) wrote :

'NFS: kswapd must not block in nfs_release_page' was part of the 2.6.32 stable update series and was released in 2.6.32-25.44

Changed in linux (Ubuntu Lucid):
assignee: nobody → Tim Gardner (timg-tpi)
status: New → Fix Released
Revision history for this message
Tim Gardner (timg-tpi) wrote :

b608b283a962caaa280756bc8563016a71712acf was released as part of 2.6.35

Changed in linux (Ubuntu Maverick):
assignee: nobody → Tim Gardner (timg-tpi)
status: New → Fix Released
Changed in linux (Ubuntu Natty):
status: In Progress → Fix Released
Revision history for this message
David Ressman (davidressman) wrote :

I'm sorry my comment sounded snarky--this caught me on an (unrelated) bad day. I ran 2.6.32.27-generic, and the problem occurred almost immediately. This problem appears to be in nfs_do_fsync(), not nfs_release_page().

Revision history for this message
David Ressman (davidressman) wrote :

(i.e., please reopen this bug)

Revision history for this message
Tim Gardner (timg-tpi) wrote :

David - I think you should start a new bug. Though the symptoms are similar, it appears to be a different bug. Also, how about asking Trond Myklebust <email address hidden> if he agrees with your assessment re: nfs_do_fsync().

Revision history for this message
David Ressman (davidressman) wrote :

Actually, I think you might be correct, although it looks like bug 585657 could be the one.

Revision history for this message
David Ressman (davidressman) wrote :

(and the patch was actually a backport of a patch that Trond committed to 2.6.35.something)

Revision history for this message
Sean Clarke (sean-clarke) wrote :

Seeing this on Maverick - especially when booting KVM images over NFS. The server is almost unusable and the system never recovers until a reboot.

Linux enterprise 2.6.35-27-server #47-Ubuntu SMP Fri Feb 11 23:09:19 UTC 2011 x86_64 GNU/Linux

Revision history for this message
David McGiven (davidmcgivenn) wrote :

Dear All,

I'm having this issue as well, but I don't understand what the proposed solution is. Where is that 2.6.35.something kernel that fixes the problem?

Please help!

I can see the bug with those two kernels :
2.6.35-020635rc1
2.6.32-30.59

Thanks

Revision history for this message
David McGiven (davidmcgivenn) wrote :

This is fixed with 2.6.32-31

Revision history for this message
Ken Pratt (kenpratt) wrote :

Bug is back in 3.0.0-17-generic #30-Ubuntu SMP Thu Mar 8 20:45:39 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

When copying a large file over NFSv4, Nautilus reports 606.6MB in the first progress bar update to the GUI and then hangs forever. The rest of my desktop remains responsive; Nautilus, however, is not. I can still ssh to the machine that exports the NFS share I am copying to, so networking is fine. However, any attempt to access the NFS shares from the command line simply hangs forever.

I am copying over GigE to a FitPCI2 (a small Atom-based computer).

I am using a fully updated Ubuntu

LSB Version: core-2.0-amd64:core-2.0-noarch:core-3.0-amd64:core-3.0-noarch:core-3.1-amd64:core-3.1-noarch:core-3.2-amd64:core-3.2-noarch:core-4.0-amd64:core-4.0-noarch
Distributor ID: Ubuntu
Description: Ubuntu 11.10
Release: 11.10
Codename: oneiric

The only reported error shows up in syslog. It is the error indicating that the nautilus process is unresponsive.

The only way to clear the situation is to shut down - which then halts, I assume stuck unmounting the NFS shares - so I have to power cycle.

I am using nfsv4 via autofs.

Revision history for this message
cpisbell (chrisisbell) wrote :

Having a similar problem on:

Linux chris-laptop 3.0.0-19-generic #32-Ubuntu SMP Thu Apr 5 18:22:38 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux.

The wired Ethernet interface seems to fail until the system is rebooted. (The slower wireless interface - which is running in parallel to the same switch - was not affected.)

The system log shows the following:

Apr 19 18:17:31 chris-laptop kernel: [ 2400.492131] INFO: task python:4151 blocked for more than 120 seconds.
Apr 19 18:17:31 chris-laptop kernel: [ 2400.492135] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 19 18:17:31 chris-laptop kernel: [ 2400.492138] python D ffffffff81805120 0 4151 4150 0x00000000
Apr 19 18:17:31 chris-laptop kernel: [ 2400.492142] ffff880134cf1bb8 0000000000000046 ffff880134cf1b58 ffffffff81032a79
Apr 19 18:17:31 chris-laptop kernel: [ 2400.492147] ffff880134cf1fd8 ffff880134cf1fd8 ffff880134cf1fd8 0000000000012a40
Apr 19 18:17:31 chris-laptop kernel: [ 2400.492150] ffff880138f10000 ffff880136660000 ffff880134cf1b98 ffff88013fd132c0
Apr 19 18:21:31 chris-laptop kernel: [ 2640.492205] Call Trace:
Apr 19 18:21:31 chris-laptop kernel: [ 2640.492213] [<ffffffff81032a79>] ? default_spin_lock_flags+0x9/0x10
Apr 19 18:21:31 chris-laptop kernel: [ 2640.492218] [<ffffffff8110a320>] ? __lock_page+0x70/0x70
Apr 19 18:21:31 chris-laptop kernel: [ 2640.492222] [<ffffffff815f22df>] schedule+0x3f/0x60
Apr 19 18:21:31 chris-laptop kernel: [ 2640.492225] [<ffffffff815f238f>] io_schedule+0x8f/0xd0
Apr 19 18:21:31 chris-laptop kernel: [ 2640.492227] [<ffffffff8110a32e>] sleep_on_page+0xe/0x20
Apr 19 18:21:31 chris-laptop kernel: [ 2640.492230] [<ffffffff815f2baf>] __wait_on_bit+0x5f/0x90
Apr 19 18:21:31 chris-laptop kernel: [ 2640.492233] [<ffffffff8110a498>] wait_on_page_bit+0x78/0x80
Apr 19 18:21:31 chris-laptop kernel: [ 2640.492237] [<ffffffff81081de0>] ? autoremove_wake_function+0x40/0x40
Apr 19 18:21:31 chris-laptop kernel: [ 2640.492240] [<ffffffff8110a5ac>] filemap_fdatawait_range+0x10c/0x1a0
Apr 19 18:21:31 chris-laptop kernel: [ 2640.492243] [<ffffffff8110bfd8>] filemap_write_and_wait_range+0x68/0x80
Apr 19 18:21:31 chris-laptop kernel: [ 2640.492247] [<ffffffff81194ac2>] vfs_fsync_range+0x42/0xa0
Apr 19 18:21:31 chris-laptop kernel: [ 2640.492250] [<ffffffff81194b8c>] vfs_fsync+0x1c/0x20
Apr 19 18:21:31 chris-laptop kernel: [ 2640.492272] [<ffffffffa06632e3>] nfs_file_flush+0x53/0x80 [nfs]
Apr 19 18:21:31 chris-laptop kernel: [ 2640.492276] [<ffffffff811668df>] filp_close+0x3f/0x90
Apr 19 18:21:31 chris-laptop kernel: [ 2640.492280] [<ffffffff81060faa>] put_files_struct.part.14+0x7a/0xe0
Apr 19 18:21:31 chris-laptop kernel: [ 2640.492283] [<ffffffff81062aa8>] put_files_struct+0x18/0x20
Apr 19 18:21:31 chris-laptop kernel: [ 2640.492286] [<ffffffff81062b74>] exit_files+0x54/0x70
Apr 19 18:21:31 chris-laptop kernel: [ 2640.492289] [<ffffffff8106308d>] do_exit+0x19d/0x440
Apr 19 18:21:31 chris-laptop kernel: [ 2640.492291] [<ffffffff810634d4>] do_group_exit+0x44/0xa0
Apr 19 18:21:31 chris-laptop kernel: [ 2640.492294] [<ffffffff81063547>] sys_exit_group+0x17/0x20
Apr 19 18:21:31 chris-laptop kernel: [ 2640.492297]...


Revision history for this message
Goktug YILDIRIM (goktug-yildirim) wrote :

Hi,

I have the same NFS problem with 12.04 and kernel 3.2.0-25. The symptoms are the same as described above and easy to reproduce.
I'd like to know whether there is a fix by now.

Thanks.

PS: In order to assist, I can provide logs or run tests if needed.

Revision history for this message
dth (destotelhorus) wrote :

Bug affects 12.04 as well:

# uname -a
Linux vm-orion 3.2.0-23-generic #36-Ubuntu SMP Tue Apr 10 20:39:51 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

# lsb_release --all
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 12.04 LTS
Release: 12.04
Codename: precise

The machine in question is a VM running under KVM. It has 1 GiB of RAM and no swap.

Please inform me if I can be of any further assistance.

Revision history for this message
icedfusion (icedfusion) wrote :

Also seeing this bug on 12.04.

When copying large files, the speed of the copy drops off dramatically until at some point it freezes completely and I lose Nautilus and any networking activity/access to the server. I have to reboot the server to get it back on the network.

This issue makes NFS unusable.

Copying via SMB works the way I would expect NFS to work: fast and (almost) easy.

# uname -a
Linux iceserver 3.2.0-29-generic #46-Ubuntu SMP Fri Jul 27 17:03:23 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

I am wondering if it is a 64bit issue.

Cheers

Revision history for this message
Jason (jaseeverett) wrote :

This bug is also a problem in Ubuntu 12.10.

Linux zoop 3.5.0-18-generic #29-Ubuntu SMP Fri Oct 19 10:26:51 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

I am fully updated to today.

Revision history for this message
Timo (timo-de) wrote :

Having the same problems:

NFS (v4) share on freeBSD 8.03 up-to-date (Jan 2013)
Ubuntu 12.10 64bit up-to-date (Jan 2013)

I can copy (read) big files from the NFS share at consistent rates of about 70 MB/s.
But when I write to the share, it starts at 70 MB/s for about 3 seconds, then the transfer speed drops to 0 MB/s (really zero, as seen with nfsstat); after about 80 seconds the transfer speed goes back up to about 40 MB/s for roughly 10 seconds, then drops to 0 MB/s again. This repeats until the file is fully transferred.
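
For anyone who wants to reproduce this observation, something along these lines generates the write while watching the client-side NFS counters; the mount point /mnt/nfs and the file name are placeholders:

dd if=/dev/zero of=/mnt/nfs/bigfile bs=1M count=8192 conv=fdatasync   # write an 8 GB test file
watch -n 1 nfsstat -c                                                 # in a second terminal: client RPC counters, refreshed every second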

Revision history for this message
Jan (jan-wiele) wrote :

I also have this problem, even across different shares: saving a big download (8 GB) to /pub/games results in a lockup of my desktop. Programs are unable to access /home (I cannot launch any programs, KDE Plasma no longer responds, and opening a new page in Chrome results in "Waiting for the cache").

Client: Ubuntu 12.10 (GNU/Linux 3.7.0-7-generic x86_64)
Server: Ubuntu 12.04.1 LTS (GNU/Linux 3.2.28-bcache+ x86_64)

Revision history for this message
Ceiling Cat (ceilingcat-r) wrote :

Having the same problem on:
Client: Ubuntu 12.04.2 LTS (GNU/Linux 3.2.0-69-generic-pae i686)
Server: Ubuntu 12.04.5 LTS (GNU/Linux 3.2.0-77-generic x86_64)

This is frighteningly bad.
