Comment 24 for bug 879334

Karsten Suehring (suehring) wrote:

I'm adding some more test data here:

As a workaround, I tried installing an old Ubuntu 2.6 kernel (linux-image-2.6.35-31-generic_2.6.35-31.63_amd64.deb) on 12.04.1.
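
For reference, this is roughly how I did the downgrade (a sketch, assuming the .deb has already been fetched from Launchpad into the current directory):

# Install the old kernel package; its postinst regenerates the GRUB menu.
sudo dpkg -i linux-image-2.6.35-31-generic_2.6.35-31.63_amd64.deb
# Keep apt from replacing the package on the next upgrade:
echo "linux-image-2.6.35-31-generic hold" | sudo dpkg --set-selections
# Then reboot and select the 2.6.35 entry from the GRUB menu.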

I saw a number of locking issues reported and thought they might be caused by running that kernel in the wrong environment. However, even after downgrading the servers back to 10.10 and keeping the clients on 12.04.1, I still see kernel messages like the following:

[ 5474.132324] ------------[ cut here ]------------
[ 5474.132346] WARNING: at /build/buildd/linux-2.6.35/net/sunrpc/sched.c:597 rpc_exit_task+0x5c/0x60 [sunrpc]()
[ 5474.132349] Hardware name: PowerEdge R710
[ 5474.132351] Modules linked in: ipmi_si mpt2sas raid_class mptctl ipmi_devintf ipmi_msghandler dell_rbu nfsd autofs4 xfs exportfs nfs lockd fscache nfs_acl auth_rpcgss sunrpc joydev ftdi_sio usbhid hid bnx2 usbserial shpchp psmouse i7core_edac serio_raw edac_core hed lp power_meter parport dcdbas ses enclosure mptsas mptscsih mptbase usb_storage scsi_transport_sas megaraid_sas [last unloaded: ipmi_si]
[ 5474.132386] Pid: 1746, comm: rpciod/16 Tainted: G W 2.6.35-32-server #67-Ubuntu
[ 5474.132388] Call Trace:
[ 5474.132399] [<ffffffff810616df>] warn_slowpath_common+0x7f/0xc0
[ 5474.132403] [<ffffffff8106173a>] warn_slowpath_null+0x1a/0x20
[ 5474.132414] [<ffffffffa016bd4c>] rpc_exit_task+0x5c/0x60 [sunrpc]
[ 5474.132426] [<ffffffffa016c52e>] __rpc_execute+0x5e/0x280 [sunrpc]
[ 5474.132437] [<ffffffffa016c7f0>] ? rpc_async_schedule+0x0/0x20 [sunrpc]
[ 5474.132448] [<ffffffffa016c805>] rpc_async_schedule+0x15/0x20 [sunrpc]
[ 5474.132455] [<ffffffff8107b395>] run_workqueue+0xc5/0x1a0
[ 5474.132460] [<ffffffff8107b513>] worker_thread+0xa3/0x110
[ 5474.132464] [<ffffffff810801a0>] ? autoremove_wake_function+0x0/0x40
[ 5474.132468] [<ffffffff8107b470>] ? worker_thread+0x0/0x110
[ 5474.132472] [<ffffffff8107fc26>] kthread+0x96/0xa0
[ 5474.132477] [<ffffffff8100aea4>] kernel_thread_helper+0x4/0x10
[ 5474.132481] [<ffffffff8107fb90>] ? kthread+0x0/0xa0
[ 5474.132484] [<ffffffff8100aea0>] ? kernel_thread_helper+0x0/0x10
[ 5474.132487] ---[ end trace 5a3838b115992a79 ]---
[ 6091.800511] ------------[ cut here ]------------
[ 6091.800532] WARNING: at /build/buildd/linux-2.6.35/net/sunrpc/sched.c:597 rpc_exit_task+0x5c/0x60 [sunrpc]()
[ 6091.800536] Hardware name: PowerEdge R710
[ 6091.800537] Modules linked in: ipmi_si mpt2sas raid_class mptctl ipmi_devintf ipmi_msghandler dell_rbu nfsd autofs4 xfs exportfs nfs lockd fscache nfs_acl auth_rpcgss sunrpc joydev ftdi_sio usbhid hid bnx2 usbserial shpchp psmouse i7core_edac serio_raw edac_core hed lp power_meter parport dcdbas ses enclosure mptsas mptscsih mptbase usb_storage scsi_transport_sas megaraid_sas [last unloaded: ipmi_si]
[ 6091.800572] Pid: 1744, comm: rpciod/14 Tainted: G W 2.6.35-32-server #67-Ubuntu
[ 6091.800575] Call Trace:
[ 6091.800585] [<ffffffff810616df>] warn_slowpath_common+0x7f/0xc0
[ 6091.800590] [<ffffffff8106173a>] warn_slowpath_null+0x1a/0x20
[ 6091.800601] [<ffffffffa016bd4c>] rpc_exit_task+0x5c/0x60 [sunrpc]
[ 6091.800612] [<ffffffffa016c52e>] __rpc_execute+0x5e/0x280 [sunrpc]
[ 6091.800623] [<ffffffffa016c7f0>] ? rpc_async_schedule+0x0/0x20 [sunrpc]
[ 6091.800634] [<ffffffffa016c805>] rpc_async_schedule+0x15/0x20 [sunrpc]
[ 6091.800642] [<ffffffff8107b395>] run_workqueue+0xc5/0x1a0
[ 6091.800646] [<ffffffff8107b513>] worker_thread+0xa3/0x110
[ 6091.800650] [<ffffffff810801a0>] ? autoremove_wake_function+0x0/0x40
[ 6091.800654] [<ffffffff8107b470>] ? worker_thread+0x0/0x110
[ 6091.800658] [<ffffffff8107fc26>] kthread+0x96/0xa0
[ 6091.800663] [<ffffffff8100aea4>] kernel_thread_helper+0x4/0x10
[ 6091.800667] [<ffffffff8107fb90>] ? kthread+0x0/0xa0
[ 6091.800671] [<ffffffff8100aea0>] ? kernel_thread_helper+0x0/0x10
[ 6091.800673] ---[ end trace 5a3838b115992a7a ]---
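
To get a feel for how often the server hits this warning, I grep the logs (a sketch; this assumes the Ubuntu default of kernel messages going to /var/log/kern.log):

# Count occurrences of the sched.c:597 warning so far:
grep -c 'sched.c:597 rpc_exit_task' /var/log/kern.log
# Watch for new ones live:
tail -f /var/log/kern.log | grep --line-buffered rpc_exit_task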

On the client I see:

[ 7061.756411] INFO: task unzip:8081 blocked for more than 120 seconds.
[ 7061.767633] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 7061.790039] unzip D 0000000000000007 0 8081 8041 0x00000000
[ 7061.790044] ffff8805ec807b48 0000000000000086 ffff880500000000 ffffffff00000007
[ 7061.790051] ffff8805ec807fd8 ffff8805ec807fd8 ffff8805ec807fd8 00000000000137c0
[ 7061.790063] ffff880608a02e00 ffff8805fb9f1700 ffff8805ec807b28 ffff880617c74080
[ 7061.790075] Call Trace:
[ 7061.790082] [<ffffffff81117130>] ? __lock_page+0x70/0x70
[ 7061.790090] [<ffffffff816590ff>] schedule+0x3f/0x60
[ 7061.790097] [<ffffffff816591af>] io_schedule+0x8f/0xd0
[ 7061.790105] [<ffffffff8111713e>] sleep_on_page+0xe/0x20
[ 7061.790112] [<ffffffff816599cf>] __wait_on_bit+0x5f/0x90
[ 7061.790119] [<ffffffff811172a8>] wait_on_page_bit+0x78/0x80
[ 7061.790127] [<ffffffff8108acc0>] ? autoremove_wake_function+0x40/0x40
[ 7061.790135] [<ffffffff811173bc>] filemap_fdatawait_range+0x10c/0x1a0
[ 7061.790144] [<ffffffff8111747b>] filemap_fdatawait+0x2b/0x30
[ 7061.790151] [<ffffffff811a17b9>] writeback_single_inode+0x399/0x430
[ 7061.790159] [<ffffffff811a18ca>] sync_inode+0x7a/0xc0
[ 7061.790169] [<ffffffffa01a20b3>] nfs_wb_all+0x43/0x50 [nfs]
[ 7061.790177] [<ffffffffa01937f8>] nfs_setattr+0x138/0x140 [nfs]
[ 7061.790181] [<ffffffff8119402b>] notify_change+0x1bb/0x360
[ 7061.790185] [<ffffffff8117617b>] chmod_common+0xbb/0xc0
[ 7061.790189] [<ffffffff8117d0ba>] ? sys_newstat+0x2a/0x40
[ 7061.790193] [<ffffffff811770bf>] sys_fchmod+0x4f/0x80
[ 7061.790197] [<ffffffff81663602>] system_call_fastpath+0x16/0x1b

After these messages, the NFS mount hangs. Sometimes the clients manage to recover, but often they hang completely.
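
Next time a client hangs I will try to capture more state. Here is my rough plan (standard tools from nfs-common and sysrq; the choice of rpcdebug flags is my own guess at what is most useful here):

# Dump all blocked (D-state) tasks to the kernel log:
echo w | sudo tee /proc/sysrq-trigger
# Per-operation NFS client RPC counters:
nfsstat -c
# Turn on sunrpc debug logging, reproduce the hang, then turn it off again:
sudo rpcdebug -m rpc -s call xprt
# ... reproduce the hang ...
sudo rpcdebug -m rpc -c call xprt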

It seems that my initial test on Debian was wrong: the Debian testing kernels at least put less load on the server. I cannot comment on the other issues yet, but the linked Debian bug report discusses that the above-mentioned patch has been removed from their kernels, which seems to have at least some positive effect.

Is any Ubuntu kernel developer following this? Could you provide a test kernel with the patch removed?

I'm currently trying to set up a test environment, but fixing my production environment has priority :-(