Comment 5 for bug 606523

colo (johannes-truschnigg) wrote :

I believe we're seeing similar problems with our setup. We have a 24-disk RAID10 array in our box, with a 22 TB XFS filesystem exported over NFS(v3) to our VMware cluster. During initial load testing with iometer and dd, we triggered strange behaviour in nfsd that made common operations (such as readdir()) on the mounted export excruciatingly slow (we're talking more than an hour for a simple `ls` to complete in an empty directory). Switching from the stock Lucid 2.6.32 kernel to later releases (seemingly) made the problem go away during load testing, but it popped up again later, once the system was moved into semi-production as the backup storage for the aforementioned cluster. Other hardware involved: Intel Corporation 82598EB 10-Gigabit Ethernet adapters driven by ixgbe, and two Adaptec AAC-RAID controllers driven by aacraid, in an Intel 5520-based dual-socket, four-core (HT disabled) Nehalem machine.
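
For reference, the load test boiled down to something along these lines (the paths, sizes, export and mount options here are illustrative, not our exact configuration):

# /etc/exports on the storage box (options are examples only)
/srv/backup  10.0.0.0/24(rw,async,no_subtree_check)

# on an NFSv3 client: large sequential writes, similar to what iometer produced
mount -t nfs -o vers=3,rsize=1048576,wsize=1048576 storage:/srv/backup /mnt/backup
dd if=/dev/zero of=/mnt/backup/testfile bs=1M count=100000 oflag=direct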

We see backtraces like the following:
[150122.133802] INFO: task nfsd:2145 blocked for more than 120 seconds.
[150122.133853] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[150122.133934] nfsd D ffff880001e15880 0 2145 2 0x00000000
[150122.133937] ffff8806614e1cd0 0000000000000046 ffff8806614e1cd0 ffff8806614e1fd8
[150122.133940] ffff88065b7dc4a0 0000000000015880 0000000000015880 ffff8806614e1fd8
[150122.133942] 0000000000015880 ffff8806614e1fd8 0000000000015880 ffff88065b7dc4a0
[150122.133945] Call Trace:
[150122.133947] [<ffffffff81576838>] __mutex_lock_slowpath+0xe8/0x170
[150122.133949] [<ffffffff8157647b>] mutex_lock+0x2b/0x50
[150122.133954] [<ffffffffa03e62ff>] nfsd_unlink+0xaf/0x240 [nfsd]
[150122.133960] [<ffffffffa03edd54>] nfsd3_proc_remove+0x84/0x100 [nfsd]
[150122.133964] [<ffffffffa03df3fb>] nfsd_dispatch+0xbb/0x210 [nfsd]
[150122.133972] [<ffffffffa021d625>] svc_process_common+0x325/0x650 [sunrpc]
[150122.133977] [<ffffffffa03dfa60>] ? nfsd+0x0/0x150 [nfsd]
[150122.133984] [<ffffffffa021da83>] svc_process+0x133/0x150 [sunrpc]
[150122.133988] [<ffffffffa03dfb1d>] nfsd+0xbd/0x150 [nfsd]
[150122.133990] [<ffffffff8107f8d6>] kthread+0x96/0xa0
[150122.133993] [<ffffffff8100be64>] kernel_thread_helper+0x4/0x10
[150122.133995] [<ffffffff8107f840>] ? kthread+0x0/0xa0
[150122.133997] [<ffffffff8100be60>] ? kernel_thread_helper+0x0/0x10

This happened after copying over a few VM images from our primary to our backup storage over NFS(v3). The machine doesn't crash, but NFS performance is rather unimpressive during and after these operations.

I'll investigate whether Thag's suggested workaround is applicable in our situation and, if it is, whether it helps get things working normally again. However, since we're not using multiple IPv4 addresses on our NICs AFAIK, I'm on the lookout for alternative solutions to the problem, or theories about what may cause it.
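
For completeness, I'll double-check the single-address assumption with something like:

ip -4 addr show

on the relevant interfaces before ruling that workaround out entirely.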