I'm adding some more test data here:
As a workaround I tried installing an old Ubuntu 2.6 kernel (linux-image-2.6.35-31-generic_2.6.35-31.63_amd64.deb) on 12.04.1. I saw a number of locking issues reported and thought these might be caused by running that kernel in the wrong environment. But even now, after downgrading the servers back to 10.10 and keeping the clients on 12.04.1, I still see kernel messages like the following:
[ 5474.132324] ------------[ cut here ]------------
[ 5474.132346] WARNING: at /build/buildd/linux-2.6.35/net/sunrpc/sched.c:597 rpc_exit_task+0x5c/0x60 [sunrpc]()
[ 5474.132349] Hardware name: PowerEdge R710
[ 5474.132351] Modules linked in: ipmi_si mpt2sas raid_class mptctl ipmi_devintf ipmi_msghandler dell_rbu nfsd autofs4 xfs exportfs nfs lockd fscache nfs_acl auth_rpcgss sunrpc joydev ftdi_sio usbhid hid bnx2 usbserial shpchp psmouse i7core_edac serio_raw edac_core hed lp power_meter parport dcdbas ses enclosure mptsas mptscsih mptbase usb_storage scsi_transport_sas megaraid_sas [last unloaded: ipmi_si]
[ 5474.132386] Pid: 1746, comm: rpciod/16 Tainted: G W 2.6.35-32-server #67-Ubuntu
[ 5474.132388] Call Trace:
[ 5474.132399] [<ffffffff810616df>] warn_slowpath_common+0x7f/0xc0
[ 5474.132403] [<ffffffff8106173a>] warn_slowpath_null+0x1a/0x20
[ 5474.132414] [<ffffffffa016bd4c>] rpc_exit_task+0x5c/0x60 [sunrpc]
[ 5474.132426] [<ffffffffa016c52e>] __rpc_execute+0x5e/0x280 [sunrpc]
[ 5474.132437] [<ffffffffa016c7f0>] ? rpc_async_schedule+0x0/0x20 [sunrpc]
[ 5474.132448] [<ffffffffa016c805>] rpc_async_schedule+0x15/0x20 [sunrpc]
[ 5474.132455] [<ffffffff8107b395>] run_workqueue+0xc5/0x1a0
[ 5474.132460] [<ffffffff8107b513>] worker_thread+0xa3/0x110
[ 5474.132464] [<ffffffff810801a0>] ? autoremove_wake_function+0x0/0x40
[ 5474.132468] [<ffffffff8107b470>] ? worker_thread+0x0/0x110
[ 5474.132472] [<ffffffff8107fc26>] kthread+0x96/0xa0
[ 5474.132477] [<ffffffff8100aea4>] kernel_thread_helper+0x4/0x10
[ 5474.132481] [<ffffffff8107fb90>] ? kthread+0x0/0xa0
[ 5474.132484] [<ffffffff8100aea0>] ? kernel_thread_helper+0x0/0x10
[ 5474.132487] ---[ end trace 5a3838b115992a79 ]---
[ 6091.800511] ------------[ cut here ]------------
[ 6091.800532] WARNING: at /build/buildd/linux-2.6.35/net/sunrpc/sched.c:597 rpc_exit_task+0x5c/0x60 [sunrpc]()
[ 6091.800536] Hardware name: PowerEdge R710
[ 6091.800537] Modules linked in: ipmi_si mpt2sas raid_class mptctl ipmi_devintf ipmi_msghandler dell_rbu nfsd autofs4 xfs exportfs nfs lockd fscache nfs_acl auth_rpcgss sunrpc joydev ftdi_sio usbhid hid bnx2 usbserial shpchp psmouse i7core_edac serio_raw edac_core hed lp power_meter parport dcdbas ses enclosure mptsas mptscsih mptbase usb_storage scsi_transport_sas megaraid_sas [last unloaded: ipmi_si]
[ 6091.800572] Pid: 1744, comm: rpciod/14 Tainted: G W 2.6.35-32-server #67-Ubuntu
[ 6091.800575] Call Trace:
[ 6091.800585] [<ffffffff810616df>] warn_slowpath_common+0x7f/0xc0
[ 6091.800590] [<ffffffff8106173a>] warn_slowpath_null+0x1a/0x20
[ 6091.800601] [<ffffffffa016bd4c>] rpc_exit_task+0x5c/0x60 [sunrpc]
[ 6091.800612] [<ffffffffa016c52e>] __rpc_execute+0x5e/0x280 [sunrpc]
[ 6091.800623] [<ffffffffa016c7f0>] ? rpc_async_schedule+0x0/0x20 [sunrpc]
[ 6091.800634] [<ffffffffa016c805>] rpc_async_schedule+0x15/0x20 [sunrpc]
[ 6091.800642] [<ffffffff8107b395>] run_workqueue+0xc5/0x1a0
[ 6091.800646] [<ffffffff8107b513>] worker_thread+0xa3/0x110
[ 6091.800650] [<ffffffff810801a0>] ? autoremove_wake_function+0x0/0x40
[ 6091.800654] [<ffffffff8107b470>] ? worker_thread+0x0/0x110
[ 6091.800658] [<ffffffff8107fc26>] kthread+0x96/0xa0
[ 6091.800663] [<ffffffff8100aea4>] kernel_thread_helper+0x4/0x10
[ 6091.800667] [<ffffffff8107fb90>] ? kthread+0x0/0xa0
[ 6091.800671] [<ffffffff8100aea0>] ? kernel_thread_helper+0x0/0x10
[ 6091.800673] ---[ end trace 5a3838b115992a7a ]---
On the client I see:
[ 7061.756411] INFO: task unzip:8081 blocked for more than 120 seconds.
[ 7061.767633] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7061.790039] unzip D 0000000000000007 0 8081 8041 0x00000000
[ 7061.790044] ffff8805ec807b48 0000000000000086 ffff880500000000 ffffffff00000007
[ 7061.790051] ffff8805ec807fd8 ffff8805ec807fd8 ffff8805ec807fd8 00000000000137c0
[ 7061.790063] ffff880608a02e00 ffff8805fb9f1700 ffff8805ec807b28 ffff880617c74080
[ 7061.790075] Call Trace:
[ 7061.790082] [<ffffffff81117130>] ? __lock_page+0x70/0x70
[ 7061.790090] [<ffffffff816590ff>] schedule+0x3f/0x60
[ 7061.790097] [<ffffffff816591af>] io_schedule+0x8f/0xd0
[ 7061.790105] [<ffffffff8111713e>] sleep_on_page+0xe/0x20
[ 7061.790112] [<ffffffff816599cf>] __wait_on_bit+0x5f/0x90
[ 7061.790119] [<ffffffff811172a8>] wait_on_page_bit+0x78/0x80
[ 7061.790127] [<ffffffff8108acc0>] ? autoremove_wake_function+0x40/0x40
[ 7061.790135] [<ffffffff811173bc>] filemap_fdatawait_range+0x10c/0x1a0
[ 7061.790144] [<ffffffff8111747b>] filemap_fdatawait+0x2b/0x30
[ 7061.790151] [<ffffffff811a17b9>] writeback_single_inode+0x399/0x430
[ 7061.790159] [<ffffffff811a18ca>] sync_inode+0x7a/0xc0
[ 7061.790169] [<ffffffffa01a20b3>] nfs_wb_all+0x43/0x50 [nfs]
[ 7061.790177] [<ffffffffa01937f8>] nfs_setattr+0x138/0x140 [nfs]
[ 7061.790181] [<ffffffff8119402b>] notify_change+0x1bb/0x360
[ 7061.790185] [<ffffffff8117617b>] chmod_common+0xbb/0xc0
[ 7061.790189] [<ffffffff8117d0ba>] ? sys_newstat+0x2a/0x40
[ 7061.790193] [<ffffffff811770bf>] sys_fchmod+0x4f/0x80
[ 7061.790197] [<ffffffff81663602>] system_call_fastpath+0x16/0x1b
and the NFS mount hangs. Sometimes the clients manage to recover, but often they hang completely.
It seems my initial test on Debian was wrong in one respect: the Debian testing kernels do at least put less load on the server. I cannot comment on the other issues yet. However, the linked Debian bug report discusses that the above-mentioned patch has been removed from their kernels, and that seems to have at least some positive effect.
Is any Ubuntu kernel developer following this? Could you provide a test kernel with the patch removed?
I'm currently trying to set up a test environment, but fixing my production environment has priority :-(
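For anyone else watching for these symptoms: both the server-side rpc_exit_task WARNING and the client-side hung-task message are easy to count in a captured kernel log. A minimal sketch follows; the function name is my own invention (not an existing tool), and it assumes the log was saved to a file first, e.g. with `dmesg > kern.log`.

```shell
# Sketch: count the two NFS symptoms shown above in a saved kernel log.
# count_nfs_symptoms is a hypothetical helper, not part of any package.
count_nfs_symptoms() {
    log="$1"
    # Server side: the WARNING from net/sunrpc/sched.c in rpc_exit_task
    server_hits=$(grep -c 'rpc_exit_task' "$log")
    # Client side: the hung-task watchdog firing on a blocked process
    client_hits=$(grep -c 'blocked for more than' "$log")
    echo "rpc_exit_task warnings: $server_hits"
    echo "hung tasks: $client_hits"
}
```

Running it periodically (e.g. from cron against a fresh `dmesg` dump) should show whether a candidate kernel actually makes the warnings go away, rather than relying on whether a client happens to hang.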