INFO: task nfsd:5047 blocked for more than 120 seconds.

Bug #606523 reported by Bruce Edge
This bug affects 3 people

Affects: linux (Ubuntu)
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

This is triggered by high-volume NFS traffic. This box has 3 fibre channel volumes that are exported over NFS.
I started with a 10.04 linux-server kernel, which hung regularly, at least once a day.
I then started upgrading through the PPA kernels; the frequency has decreased, but all of the kernels below have this same problem:

ii linux-headers-2.6.32-22-server 2.6.32-22.36 Linux kernel headers for version 2.6.32 on x
ii linux-image-2.6.32-22-server 2.6.32-22.36 Linux kernel image for version 2.6.32 on x86
ii linux-image-2.6.32-23-server 2.6.32-23.37 Linux kernel image for version 2.6.32 on x86
ii linux-image-2.6.34-5-server 2.6.34-5.14~lucid1 Linux kernel image for version 2.6.34 on x86
ii linux-image-2.6.35-6-server 2.6.35-6.9~lucid1 Linux kernel image for version 2.6.35 on x86
ii linux-image-2.6.35-7-server 2.6.35-7.11~lucid1 Linux kernel image for version 2.6.35 on x86
ii linux-image-2.6.35-8-server 2.6.35-8.13~lucid1 Linux kernel image for version 2.6.35 on x86
ii linux-image-server 2.6.32.23.24 Linux kernel image on Server Equipment.

The only thing that's a bit unusual in this config is that, in order to give all servers consistent automounter paths, we map all exported devices via NFS, which means users reach even a local drive through an automounted NFS path. Not optimal, yes, but it solves a lot of other problems.

dmesg and lspci output are attached.
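When triaging reports like this one, it helps to pull the affected task name and PID straight out of the log. A small illustrative sketch (the sample line mirrors the message format in this report; the `sed` pattern is an assumption about that format, not part of the original report):

```shell
# Extract the task name and PID from a hung-task warning line.
# Sample input copied from the message format shown in this bug.
line='INFO: task nfsd:5047 blocked for more than 120 seconds.'
parsed=$(echo "$line" | sed -n 's/.*INFO: task \([^:]*\):\([0-9]*\) blocked.*/task=\1 pid=\2/p')
echo "$parsed"
```

Running the same expression over `dmesg` output would list every hung task the watchdog reported.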

Revision history for this message
Bruce Edge (bruce-edge) wrote :
tags: added: kj-triage
Revision history for this message
Bruce Edge (bruce-edge) wrote :

This kernel still has the same problem:
linux-image-2.6.35-8-server 2.6.35-8.13~lucid1

Jul 21 11:59:38 tonic kernel: [415800.840069] INFO: task nfsd:2880 blocked for more than 120 seconds.
Jul 21 11:59:38 tonic kernel: [415800.840073] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jul 21 11:59:38 tonic kernel: [415800.840077] nfsd D 000000010279c5ff 0 2880 2 0x00000000
Jul 21 11:59:38 tonic kernel: [415800.840082] ffff8803f52f5950 0000000000000046 ffff8803f52f58c0 0000000000015900
Jul 21 11:59:38 tonic kernel: [415800.840087] ffff8803f52f5fd8 0000000000015900 ffff8803f52f5fd8 ffff8803f3905b80
Jul 21 11:59:38 tonic kernel: [415800.840092] 0000000000015900 0000000000015900 ffff8803f52f5fd8 0000000000015900
Jul 21 11:59:38 tonic kernel: [415800.840096] Call Trace:
Jul 21 11:59:38 tonic kernel: [415800.840106] [<ffffffff814e50ac>] ? ip_finish_output+0x14c/0x310
Jul 21 11:59:38 tonic kernel: [415800.840112] [<ffffffff8159d32f>] __mutex_lock_slowpath+0xff/0x190
Jul 21 11:59:38 tonic kernel: [415800.840116] [<ffffffff8159cd23>] mutex_lock+0x23/0x50
Jul 21 11:59:38 tonic kernel: [415800.840121] [<ffffffff811019a2>] generic_file_aio_write+0x52/0xd0
Jul 21 11:59:38 tonic kernel: [415800.840125] [<ffffffff81101950>] ? generic_file_aio_write+0x0/0xd0
Jul 21 11:59:38 tonic kernel: [415800.840129] [<ffffffff81151713>] do_sync_readv_writev+0xd3/0x110
Jul 21 11:59:38 tonic kernel: [415800.840161] [<ffffffffa016e3aa>] ? cache_check+0x5a/0x1f0 [sunrpc]
Jul 21 11:59:38 tonic kernel: [415800.840167] [<ffffffff8128e749>] ? apparmor_file_permission+0x19/0x20
Jul 21 11:59:38 tonic kernel: [415800.840172] [<ffffffff8125d3b6>] ? security_file_permission+0x16/0x20
Jul 21 11:59:38 tonic kernel: [415800.840176] [<ffffffff8115274f>] do_readv_writev+0xcf/0x1f0
Jul 21 11:59:38 tonic kernel: [415800.840180] [<ffffffff8108607b>] ? set_groups+0x3b/0x60
Jul 21 11:59:38 tonic kernel: [415800.840195] [<ffffffffa022d9d6>] ? nfsd_setuser+0x126/0x2c0 [nfsd]
Jul 21 11:59:38 tonic kernel: [415800.840201] [<ffffffff811a54e0>] ? dquot_file_open+0x0/0x50
Jul 21 11:59:38 tonic kernel: [415800.840204] [<ffffffff811528b8>] vfs_writev+0x48/0x60
Jul 21 11:59:38 tonic kernel: [415800.840213] [<ffffffffa022717d>] nfsd_vfs_write+0x10d/0x420 [nfsd]
Jul 21 11:59:38 tonic kernel: [415800.840219] [<ffffffff81150142>] ? dentry_open+0x52/0xc0
Jul 21 11:59:38 tonic kernel: [415800.840228] [<ffffffffa02276e1>] ? nfsd_open+0x101/0x1a0 [nfsd]
Jul 21 11:59:38 tonic kernel: [415800.840237] [<ffffffffa02279f7>] nfsd_write+0xe7/0x100 [nfsd]
Jul 21 11:59:38 tonic kernel: [415800.840247] [<ffffffffa023137f>] nfsd3_proc_write+0xaf/0x140 [nfsd]
Jul 21 11:59:38 tonic kernel: [415800.840256] [<ffffffffa022243e>] nfsd_dispatch+0xfe/0x250 [nfsd]
Jul 21 11:59:38 tonic kernel: [415800.840270] [<ffffffffa0163774>] svc_process_common+0x344/0x610 [sunrpc]
Jul 21 11:59:38 tonic kernel: [415800.840275] [<ffffffff81055b70>] ? default_wake_function+0x0/0x20
Jul 21 11:59:38 tonic kernel: [415800.840288] [<ffffffffa0163b50>] svc_process+0x110/0x150 [sunrpc]
Jul 21 11:59:38 tonic kernel: [415800.840297] [<ffffffffa0222ae5>] nfsd+0xc5/0x170 [nfs...

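The `echo 0 > /proc/sys/kernel/hung_task_timeout_secs` hint in the trace refers to the kernel's hung-task watchdog (CONFIG_DETECT_HUNG_TASK); writing 0 merely silences the warning, it does not fix the underlying stall. A hedged sketch for inspecting the current timeout, guarded so it degrades gracefully on kernels or containers without the knob:

```shell
# Inspect the hung-task watchdog timeout; requires CONFIG_DETECT_HUNG_TASK.
# Read-only, so no root needed; guarded in case the knob is absent.
f=/proc/sys/kernel/hung_task_timeout_secs
if [ -r "$f" ]; then
  msg="current hung-task timeout: $(cat "$f")s"
else
  msg="hung-task watchdog knob not present"
fi
echo "$msg"
```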

Revision history for this message
Bruce Edge (bruce-edge) wrote :

No love,
Linux tonic 2.6.35-9-server #14~lucid1-Ubuntu SMP Sun Jul 18 22:44:45 UTC 2010 x86_64 GNU/Linux
has the same problem:

Jul 22 08:43:10 tonic kernel: [55920.730905] nfsd D 00000001005482f1 0 4452 2 0x00000000
Jul 22 08:43:10 tonic kernel: [55920.730909] ffff8801ef767950 0000000000000046 ffff880100000000 0000000000015900
Jul 22 08:43:10 tonic kernel: [55920.730913] ffff8801ef767fd8 0000000000015900 ffff8801ef767fd8 ffff8801ef780000
Jul 22 08:43:10 tonic kernel: [55920.730917] 0000000000015900 0000000000015900 ffff8801ef767fd8 0000000000015900
Jul 22 08:43:10 tonic kernel: [55920.730921] Call Trace:
Jul 22 08:43:10 tonic kernel: [55920.730925] [<ffffffff8159d32f>] __mutex_lock_slowpath+0xff/0x190
Jul 22 08:43:10 tonic kernel: [55920.730929] [<ffffffff8159cd23>] mutex_lock+0x23/0x50
Jul 22 08:43:10 tonic kernel: [55920.730933] [<ffffffff811019a2>] generic_file_aio_write+0x52/0xd0
Jul 22 08:43:10 tonic kernel: [55920.730936] [<ffffffff81101950>] ? generic_file_aio_write+0x0/0xd0
Jul 22 08:43:10 tonic kernel: [55920.730940] [<ffffffff81151713>] do_sync_readv_writev+0xd3/0x110
Jul 22 08:43:10 tonic kernel: [55920.730944] [<ffffffff8159e49e>] ? _raw_spin_lock+0xe/0x20
Jul 22 08:43:10 tonic kernel: [55920.730948] [<ffffffff8128e749>] ? apparmor_file_permission+0x19/0x20
Jul 22 08:43:10 tonic kernel: [55920.730952] [<ffffffff8125d3b6>] ? security_file_permission+0x16/0x20
Jul 22 08:43:10 tonic kernel: [55920.730956] [<ffffffff8115274f>] do_readv_writev+0xcf/0x1f0
Jul 22 08:43:10 tonic kernel: [55920.730959] [<ffffffff8108607b>] ? set_groups+0x3b/0x60
Jul 22 08:43:10 tonic kernel: [55920.730969] [<ffffffffa026b9d6>] ? nfsd_setuser+0x126/0x2c0 [nfsd]
Jul 22 08:43:10 tonic kernel: [55920.730973] [<ffffffff811a54e0>] ? dquot_file_open+0x0/0x50
Jul 22 08:43:10 tonic kernel: [55920.730976] [<ffffffff811528b8>] vfs_writev+0x48/0x60
Jul 22 08:43:10 tonic kernel: [55920.730985] [<ffffffffa026517d>] nfsd_vfs_write+0x10d/0x420 [nfsd]
Jul 22 08:43:10 tonic kernel: [55920.730989] [<ffffffff81150142>] ? dentry_open+0x52/0xc0
Jul 22 08:43:10 tonic kernel: [55920.730998] [<ffffffffa02656e1>] ? nfsd_open+0x101/0x1a0 [nfsd]
Jul 22 08:43:10 tonic kernel: [55920.731007] [<ffffffffa02659f7>] nfsd_write+0xe7/0x100 [nfsd]
Jul 22 08:43:10 tonic kernel: [55920.731017] [<ffffffffa026f37f>] nfsd3_proc_write+0xaf/0x140 [nfsd]
Jul 22 08:43:10 tonic kernel: [55920.731025] [<ffffffffa026043e>] nfsd_dispatch+0xfe/0x250 [nfsd]
Jul 22 08:43:10 tonic kernel: [55920.731039] [<ffffffffa01b8774>] svc_process_common+0x344/0x610 [sunrpc]
Jul 22 08:43:10 tonic kernel: [55920.731043] [<ffffffff81055b70>] ? default_wake_function+0x0/0x20
Jul 22 08:43:10 tonic kernel: [55920.731056] [<ffffffffa01b8b50>] svc_process+0x110/0x150 [sunrpc]
Jul 22 08:43:10 tonic kernel: [55920.731064] [<ffffffffa0260ae5>] nfsd+0xc5/0x170 [nfsd]
Jul 22 08:43:10 tonic kernel: [55920.731072] [<ffffffffa0260a20>] ? nfsd+0x0/0x170 [nfsd]
Jul 22 08:43:10 tonic kernel: [55920.731076] [<ffffffff8107d9c6>] kthread+0x96/0xa0
Jul 22 08:43:10 tonic kernel: [55920.731080] [<ffffffff8100aee4>] kernel_thread_helper+0x4/0x10
Jul 22 08:43:10 tonic kernel: [55920.731083] [<fff...


Revision history for this message
Bruce Edge (bruce-edge) wrote :

This appears to be related to overlaid IP addresses (an alias address on the same NIC). Since specifying clientaddr explicitly, the hang has not recurred.

Adding clientaddr=135.149.75.130 may have fixed this:
tonic:/home on /import/tonic type nfs (rw,nolock,tcp,rsize=32768,wsize=32768,noacl,clientaddr=135.149.75.130,sloppy,addr=135.149.75.130)
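For reference, the same workaround expressed as an /etc/fstab entry (an illustrative sketch built from the mount output above; the server, mount point, and option values would need adapting to the local setup):

```
tonic:/home  /import/tonic  nfs  rw,tcp,rsize=32768,wsize=32768,noacl,clientaddr=135.149.75.130  0  0
```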

0 #> ifconfig
eth0 Link encap:Ethernet HWaddr 00:17:08:50:dd:52
          inet addr:135.149.75.130 Bcast:135.149.75.255 Mask:255.255.255.0
          inet6 addr: fe80::217:8ff:fe50:dd52/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:48928413 errors:0 dropped:0 overruns:0 frame:0
          TX packets:90708433 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:7286982784 (7.2 GB) TX bytes:124013948263 (124.0 GB)
          Interrupt:28

eth0:1 Link encap:Ethernet HWaddr 00:17:08:50:dd:52
          inet addr:135.149.75.70 Bcast:135.149.75.255 Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          Interrupt:28
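Because eth0 carries both 135.149.75.130 and the eth0:1 alias 135.149.75.70, the kernel's source-address selection can vary per connection unless clientaddr pins it. A hedged sketch for checking which route (and source address) the kernel would pick for a destination; the address is taken from this report, and the `ip` call is guarded since iproute2 may be absent in minimal environments:

```shell
# Show the route and source address the kernel would use for a destination.
# Destination address taken from this bug report; adapt as needed.
dst=135.149.75.130
if command -v ip >/dev/null 2>&1; then
  out=$(ip route get "$dst" 2>/dev/null) || out="no route information for $dst"
else
  out="iproute2 not installed"
fi
echo "$out"
```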

Bruce Edge (bruce-edge)
tags: added: kernel nfs panic
Revision history for this message
colo (johannes-truschnigg) wrote :

I believe we're seeing similar problems with our setup. We have a 24-disk RAID10 array on our box, with a 22 TB XFS filesystem exported over NFSv3 to our VMware cluster. During initial load testing with iometer and dd, we triggered strange behaviour on the part of nfsd, which made common operations (such as readdir()) on the mounted export excruciatingly slow (we're talking more than an hour for a simple `ls` to complete in an empty directory). Changing from the stock Lucid 2.6.32 kernel to later releases seemingly made things go away during load testing, but the problem popped up again later, once the system moved into semi-production as the backup storage for the aforementioned cluster. Other hardware involved: Intel Corporation 82598EB 10-Gigabit Ethernet adapters driven by ixgbe, and two Adaptec AAC-RAID controllers driven by aacraid, on an Intel 5520-based dual-socket, four-core (HT disabled) Nehalem machine.

We see backtraces like the following:
[150122.133802] INFO: task nfsd:2145 blocked for more than 120 seconds.
[150122.133853] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[150122.133934] nfsd D ffff880001e15880 0 2145 2 0x00000000
[150122.133937] ffff8806614e1cd0 0000000000000046 ffff8806614e1cd0 ffff8806614e1fd8
[150122.133940] ffff88065b7dc4a0 0000000000015880 0000000000015880 ffff8806614e1fd8
[150122.133942] 0000000000015880 ffff8806614e1fd8 0000000000015880 ffff88065b7dc4a0
[150122.133945] Call Trace:
[150122.133947] [<ffffffff81576838>] __mutex_lock_slowpath+0xe8/0x170
[150122.133949] [<ffffffff8157647b>] mutex_lock+0x2b/0x50
[150122.133954] [<ffffffffa03e62ff>] nfsd_unlink+0xaf/0x240 [nfsd]
[150122.133960] [<ffffffffa03edd54>] nfsd3_proc_remove+0x84/0x100 [nfsd]
[150122.133964] [<ffffffffa03df3fb>] nfsd_dispatch+0xbb/0x210 [nfsd]
[150122.133972] [<ffffffffa021d625>] svc_process_common+0x325/0x650 [sunrpc]
[150122.133977] [<ffffffffa03dfa60>] ? nfsd+0x0/0x150 [nfsd]
[150122.133984] [<ffffffffa021da83>] svc_process+0x133/0x150 [sunrpc]
[150122.133988] [<ffffffffa03dfb1d>] nfsd+0xbd/0x150 [nfsd]
[150122.133990] [<ffffffff8107f8d6>] kthread+0x96/0xa0
[150122.133993] [<ffffffff8100be64>] kernel_thread_helper+0x4/0x10
[150122.133995] [<ffffffff8107f840>] ? kthread+0x0/0xa0
[150122.133997] [<ffffffff8100be60>] ? kernel_thread_helper+0x0/0x10

This happened after copying over a few VM images from our primary to our backup storage over NFS(v3). The machine doesn't crash, but NFS performance is rather unimpressive during and after these operations.

I'll investigate whether Thag's suggested workaround is applicable to our situation and, if so, whether it helps get things working normally. However, since as far as I know we're not using multiple IPv4 addresses on our NICs, I'm on the lookout for alternative solutions to the problem, or theories about what may cause it.

Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 606523

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: lucid
Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired