nfs4_reclaim_locks: unhandled error crashes applications and creates high load

Bug #932687 reported by coli
This bug affects 7 people
Affects              Status      Importance  Assigned to  Milestone
linux (Ubuntu)       Incomplete  Medium      Unassigned
nfs-utils (Ubuntu)   Invalid     Undecided   Unassigned

Bug Description

We tried to move our Natty clients to Oneiric but hit a show-stopper bug: Oneiric seems to have a problem with NFS. We use NFS for our home folders with strict permissions. The NFSv4 server is running Solaris.

As said, there is no problem on Natty, only on Oneiric. Oneiric runs fine for some minutes, then we get output like this in dmesg:

[ 7778.934514] nfs4_reclaim_locks: unhandled error -10024. Zeroing state
[ 7778.934521] nfs4_reclaim_open_state: Lock reclaim failed!
[ 7869.899811] nfs4_reclaim_locks: unhandled error -10024. Zeroing state
[ 7869.899818] nfs4_reclaim_open_state: Lock reclaim failed!
[ 7869.938180] nfs4_reclaim_locks: unhandled error -10024. Zeroing state
[ 7869.938184] nfs4_reclaim_open_state: Lock reclaim failed!
[ 7869.950989] nfs4_reclaim_locks: unhandled error -10024. Zeroing state
[ 7869.950993] nfs4_reclaim_open_state: Lock reclaim failed!
[ 7869.977253] nfs4_reclaim_locks: unhandled error -10024. Zeroing state
[ 7869.977258] nfs4_reclaim_open_state: Lock reclaim failed!
[ 7870.364422] nfs4_reclaim_locks: unhandled error -10024. Zeroing state
[ 7870.364429] nfs4_reclaim_open_state: Lock reclaim failed!
[ 7870.594833] nfs4_reclaim_locks: unhandled error -10024. Zeroing state
[ 7870.594839] nfs4_reclaim_open_state: Lock reclaim failed!
[ 7870.652639] nfs4_reclaim_locks: unhandled error -10024. Zeroing state
[ 7870.652644] nfs4_reclaim_open_state: Lock reclaim failed!
[ 7870.678166] nfs4_reclaim_locks: unhandled error -10024. Zeroing state
[ 7870.678171] nfs4_reclaim_open_state: Lock reclaim failed!
[ 7880.217148] nfs4_reclaim_locks: unhandled error -10024. Zeroing state
[ 7880.217155] nfs4_reclaim_open_state: Lock reclaim failed!
[ 7880.277521] nfs4_reclaim_locks: unhandled error -10024. Zeroing state
[ 7880.277527] nfs4_reclaim_open_state: Lock reclaim failed!
[ 7880.374106] nfs4_reclaim_locks: unhandled error -10024. Zeroing state
[ 7880.374113] nfs4_reclaim_open_state: Lock reclaim failed!
[ 7880.440398] nfs4_reclaim_locks: unhandled error -10024. Zeroing state
[ 7880.440404] nfs4_reclaim_open_state: Lock reclaim failed!
[ 7880.451121] nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
[ 7880.451330] nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
[ 7880.451520] nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
[ 7880.451738] nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
[ 7880.451921] nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
[ 7880.452099] nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
[ 7880.452279] nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
[ 8160.252077] INFO: task claws-mail:23156 blocked for more than 120 seconds.
[ 8160.252085] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 8160.252092] claws-mail D 0000000000000000 0 23156 1848 0x00000004
[ 8160.252103] ffff8800363e3a08 0000000000000046 ffff8800363e39a8 ffffffffa03275f0
[ 8160.252109] ffff8800363e3fd8 ffff8800363e3fd8 ffff8800363e3fd8 0000000000012a40
[ 8160.252115] ffff8800191edc80 ffff88003d7cdc80 ffff8800363e39e8 ffff88003fc132c0
[ 8160.252121] Call Trace:
[ 8160.252148] [<ffffffffa03275f0>] ? rpc_put_task+0x10/0x20 [sunrpc]
[ 8160.252158] [<ffffffff8110a180>] ? __lock_page+0x70/0x70
[ 8160.252164] [<ffffffff815eff1f>] schedule+0x3f/0x60
[ 8160.252168] [<ffffffff815effcf>] io_schedule+0x8f/0xd0
[ 8160.252173] [<ffffffff8110a18e>] sleep_on_page+0xe/0x20
[ 8160.252177] [<ffffffff815f07ef>] __wait_on_bit+0x5f/0x90
[ 8160.252182] [<ffffffff8110a378>] wait_on_page_bit+0x78/0x80
[ 8160.252189] [<ffffffff81081c50>] ? autoremove_wake_function+0x40/0x40
[ 8160.252194] [<ffffffff8110a48c>] filemap_fdatawait_range+0x10c/0x1a0
[ 8160.252216] [<ffffffffa03be1d0>] ? nfs_writedata_alloc+0x150/0x150 [nfs]
[ 8160.252233] [<ffffffffa03b89e0>] ? nfs_free_request+0x90/0x90 [nfs]
[ 8160.252243] [<ffffffff81115211>] ? do_writepages+0x21/0x40
[ 8160.252252] [<ffffffff8110bd5b>] ? __filemap_fdatawrite_range+0x5b/0x60
[ 8160.252261] [<ffffffff8110bdc8>] filemap_write_and_wait_range+0x68/0x80
[ 8160.252271] [<ffffffff811940e2>] vfs_fsync_range+0x42/0xa0
[ 8160.252277] [<ffffffff811941ac>] vfs_fsync+0x1c/0x20
[ 8160.252295] [<ffffffffa03ad2e3>] nfs_file_flush+0x53/0x80 [nfs]
[ 8160.252301] [<ffffffff811661ff>] filp_close+0x3f/0x90
[ 8160.252307] [<ffffffff81060f3a>] put_files_struct.part.14+0x7a/0xe0
[ 8160.252312] [<ffffffff81062a08>] put_files_struct+0x18/0x20
[ 8160.252316] [<ffffffff81062ad4>] exit_files+0x54/0x70
[ 8160.252320] [<ffffffff81062fed>] do_exit+0x19d/0x440
[ 8160.252325] [<ffffffff8107186a>] ? __dequeue_signal+0x6a/0xb0
[ 8160.252330] [<ffffffff81063434>] do_group_exit+0x44/0xa0
[ 8160.252334] [<ffffffff8107406d>] get_signal_to_deliver+0x27d/0x3f0
[ 8160.252340] [<ffffffff8100a7e6>] do_signal+0x56/0x180
[ 8160.252348] [<ffffffff8104e94d>] ? set_next_entity+0x9d/0xb0
[ 8160.252352] [<ffffffff8104e5e9>] ? finish_task_switch+0x49/0xf0
[ 8160.252356] [<ffffffff815ef8c4>] ? __schedule+0x3d4/0x700
[ 8160.252361] [<ffffffff8100aad5>] do_notify_resume+0x65/0x80
[ 8160.252368] [<ffffffff815fa490>] int_signal+0x12/0x17
[ 8280.252073] INFO: task firefox:5030 blocked for more than 120 seconds.
[ 8280.252080] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 8280.252087] firefox D ffffffff81805120 0 5030 1 0x00000004
[ 8280.252097] ffff88000bfc7a08 0000000000000046 ffff88001156adf0 ffff88000bfc7b98
[ 8280.252105] ffff88000bfc7fd8 ffff88000bfc7fd8 ffff88000bfc7fd8 0000000000012a40
[ 8280.252111] ffffffff81c0b020 ffff88003645c560 ffff88000bfc79e8 ffff88003fc132c0
[ 8280.252117] Call Trace:
[ 8280.252130] [<ffffffff8110a180>] ? __lock_page+0x70/0x70
[ 8280.252137] [<ffffffff815eff1f>] schedule+0x3f/0x60
[ 8280.252141] [<ffffffff815effcf>] io_schedule+0x8f/0xd0
[ 8280.252145] [<ffffffff8110a18e>] sleep_on_page+0xe/0x20
[ 8280.252150] [<ffffffff815f07ef>] __wait_on_bit+0x5f/0x90
[ 8280.252154] [<ffffffff8110a378>] wait_on_page_bit+0x78/0x80
[ 8280.252165] [<ffffffff81081c50>] ? autoremove_wake_function+0x40/0x40
[ 8280.252170] [<ffffffff8110a48c>] filemap_fdatawait_range+0x10c/0x1a0
[ 8280.252177] [<ffffffff81115211>] ? do_writepages+0x21/0x40
[ 8280.252181] [<ffffffff8110bd5b>] ? __filemap_fdatawrite_range+0x5b/0x60
[ 8280.252186] [<ffffffff8110bdc8>] filemap_write_and_wait_range+0x68/0x80
[ 8280.252192] [<ffffffff811940e2>] vfs_fsync_range+0x42/0xa0
[ 8280.252196] [<ffffffff811941ac>] vfs_fsync+0x1c/0x20
[ 8280.252217] [<ffffffffa03ad2e3>] nfs_file_flush+0x53/0x80 [nfs]
[ 8280.252223] [<ffffffff811661ff>] filp_close+0x3f/0x90
[ 8280.252229] [<ffffffff81060f3a>] put_files_struct.part.14+0x7a/0xe0
[ 8280.252233] [<ffffffff81062a08>] put_files_struct+0x18/0x20
[ 8280.252237] [<ffffffff81062ad4>] exit_files+0x54/0x70
[ 8280.252243] [<ffffffff81062fed>] do_exit+0x19d/0x440
[ 8280.252251] [<ffffffff8107186a>] ? __dequeue_signal+0x6a/0xb0
[ 8280.252260] [<ffffffff81063434>] do_group_exit+0x44/0xa0
[ 8280.252268] [<ffffffff8107406d>] get_signal_to_deliver+0x27d/0x3f0
[ 8280.252277] [<ffffffff8100a7e6>] do_signal+0x56/0x180
[ 8280.252285] [<ffffffff811afe27>] ? fcntl_setlk+0x67/0x220
[ 8280.252294] [<ffffffff81178e42>] ? do_fcntl+0x1b2/0x340
[ 8280.252302] [<ffffffff8100aad5>] do_notify_resume+0x65/0x80
[ 8280.252311] [<ffffffff815fa490>] int_signal+0x12/0x17
[ 8280.252322] INFO: task claws-mail:23156 blocked for more than 120 seconds.
[ 8280.252328] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 8280.252333] claws-mail D 0000000000000000 0 23156 1848 0x00000004
[ 8280.252342] ffff8800363e3a08 0000000000000046 ffff8800363e39a8 ffffffffa03275f0
[ 8280.252350] ffff8800363e3fd8 ffff8800363e3fd8 ffff8800363e3fd8 0000000000012a40
[ 8280.252360] ffff8800191edc80 ffff88003d7cdc80 ffff8800363e39e8 ffff88003fc132c0
[ 8280.252368] Call Trace:
[ 8280.252395] [<ffffffffa03275f0>] ? rpc_put_task+0x10/0x20 [sunrpc]
[ 8280.252404] [<ffffffff8110a180>] ? __lock_page+0x70/0x70
[ 8280.252412] [<ffffffff815eff1f>] schedule+0x3f/0x60
[ 8280.252419] [<ffffffff815effcf>] io_schedule+0x8f/0xd0
[ 8280.252427] [<ffffffff8110a18e>] sleep_on_page+0xe/0x20
[ 8280.252434] [<ffffffff815f07ef>] __wait_on_bit+0x5f/0x90
[ 8280.252442] [<ffffffff8110a378>] wait_on_page_bit+0x78/0x80
[ 8280.252451] [<ffffffff81081c50>] ? autoremove_wake_function+0x40/0x40
[ 8280.252459] [<ffffffff8110a48c>] filemap_fdatawait_range+0x10c/0x1a0
[ 8280.252476] [<ffffffffa03be1d0>] ? nfs_writedata_alloc+0x150/0x150 [nfs]
[ 8280.252491] [<ffffffffa03b89e0>] ? nfs_free_request+0x90/0x90 [nfs]
[ 8280.252495] [<ffffffff81115211>] ? do_writepages+0x21/0x40
[ 8280.252500] [<ffffffff8110bd5b>] ? __filemap_fdatawrite_range+0x5b/0x60
[ 8280.252505] [<ffffffff8110bdc8>] filemap_write_and_wait_range+0x68/0x80
[ 8280.252509] [<ffffffff811940e2>] vfs_fsync_range+0x42/0xa0
[ 8280.252513] [<ffffffff811941ac>] vfs_fsync+0x1c/0x20
[ 8280.252524] [<ffffffffa03ad2e3>] nfs_file_flush+0x53/0x80 [nfs]
[ 8280.252529] [<ffffffff811661ff>] filp_close+0x3f/0x90
[ 8280.252534] [<ffffffff81060f3a>] put_files_struct.part.14+0x7a/0xe0
[ 8280.252538] [<ffffffff81062a08>] put_files_struct+0x18/0x20
[ 8280.252542] [<ffffffff81062ad4>] exit_files+0x54/0x70
[ 8280.252546] [<ffffffff81062fed>] do_exit+0x19d/0x440
[ 8280.252550] [<ffffffff8107186a>] ? __dequeue_signal+0x6a/0xb0
[ 8280.252555] [<ffffffff81063434>] do_group_exit+0x44/0xa0
[ 8280.252561] [<ffffffff8107406d>] get_signal_to_deliver+0x27d/0x3f0
[ 8280.252570] [<ffffffff8100a7e6>] do_signal+0x56/0x180
[ 8280.252578] [<ffffffff8104e94d>] ? set_next_entity+0x9d/0xb0
[ 8280.252586] [<ffffffff8104e5e9>] ? finish_task_switch+0x49/0xf0
[ 8280.252591] [<ffffffff815ef8c4>] ? __schedule+0x3d4/0x700
[ 8280.252596] [<ffffffff8100aad5>] do_notify_resume+0x65/0x80
[ 8280.252601] [<ffffffff815fa490>] int_signal+0x12/0x17
[ 8280.252608] INFO: task firefox:15992 blocked for more than 120 seconds.
[ 8280.252613] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 8280.252618] firefox D ffffffff81805120 0 15992 1728 0x00000004
[ 8280.252627] ffff8800362b9c18 0000000000000086 ffff880011569d70 ffff8800362b9da8
[ 8280.252636] ffff8800362b9fd8 ffff8800362b9fd8 ffff8800362b9fd8 0000000000012a40
[ 8280.252645] ffff88003d698000 ffff88003d662e40 ffff8800362b9bf8 ffff88003fd132c0
[ 8280.252654] Call Trace:
[ 8280.252661] [<ffffffff8110a180>] ? __lock_page+0x70/0x70
[ 8280.252665] [<ffffffff815eff1f>] schedule+0x3f/0x60
[ 8280.252668] [<ffffffff815effcf>] io_schedule+0x8f/0xd0
[ 8280.252673] [<ffffffff8110a18e>] sleep_on_page+0xe/0x20
[ 8280.252676] [<ffffffff815f07ef>] __wait_on_bit+0x5f/0x90
[ 8280.252681] [<ffffffff8110a378>] wait_on_page_bit+0x78/0x80
[ 8280.252685] [<ffffffff81081c50>] ? autoremove_wake_function+0x40/0x40
[ 8280.252690] [<ffffffff8110a48c>] filemap_fdatawait_range+0x10c/0x1a0
[ 8280.252694] [<ffffffff81115211>] ? do_writepages+0x21/0x40
[ 8280.252699] [<ffffffff8110bd5b>] ? __filemap_fdatawrite_range+0x5b/0x60
[ 8280.252703] [<ffffffff8110a54b>] filemap_fdatawait+0x2b/0x30
[ 8280.252707] [<ffffffff8110c7b4>] filemap_write_and_wait+0x44/0x60
[ 8280.252726] [<ffffffffa03b06d5>] nfs_getattr+0x105/0x120 [nfs]
[ 8280.252735] [<ffffffff8116c8fe>] vfs_getattr+0x4e/0x80
[ 8280.252741] [<ffffffff8116c988>] vfs_fstatat+0x58/0x70
[ 8280.252745] [<ffffffff8116c9db>] vfs_stat+0x1b/0x20
[ 8280.252748] [<ffffffff8116cb1a>] sys_newstat+0x1a/0x40
[ 8280.252752] [<ffffffff811695e5>] ? fput+0x25/0x30
[ 8280.252756] [<ffffffff8100b705>] ? math_state_restore+0x45/0x60
[ 8280.252762] [<ffffffff815f312e>] ? do_device_not_available+0xe/0x10
[ 8280.252769] [<ffffffff815fb17b>] ? device_not_available+0x1b/0x20
[ 8280.252778] [<ffffffff815fa1c2>] system_call_fastpath+0x16/0x1b

The affected applications, such as firefox or claws-mail, stop responding, and killing them leaves zombie processes.
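For readers trying to interpret the numeric codes in the dmesg output above, here is a minimal sketch that maps them to the NFSv4 status names defined in RFC 3530 (the kernel logs them negated). The table is only a subset, and the function name `summarize` and the sample list are illustrative, not part of any kernel or Ubuntu tool.

```python
import re
from collections import Counter

# Subset of the NFSv4 status codes (nfsstat4) from RFC 3530.
# The kernel prints them as negative numbers, e.g. -10024.
NFS4_ERRORS = {
    10011: "NFS4ERR_EXPIRED",
    10022: "NFS4ERR_STALE_CLIENTID",
    10023: "NFS4ERR_STALE_STATEID",
    10024: "NFS4ERR_OLD_STATEID",
    10025: "NFS4ERR_BAD_STATEID",
    10026: "NFS4ERR_BAD_SEQID",
}

# Matches lines such as:
# "nfs4_reclaim_locks: unhandled error -10024. Zeroing state"
PATTERN = re.compile(r"(nfs4_reclaim_\w+): unhandled error -(\d+)")


def summarize(lines):
    """Count (function, error name) pairs found in dmesg lines."""
    counts = Counter()
    for line in lines:
        match = PATTERN.search(line)
        if match:
            func, code = match.group(1), int(match.group(2))
            name = NFS4_ERRORS.get(code, f"unknown ({code})")
            counts[(func, name)] += 1
    return counts


# Sample lines copied from the report above.
sample = [
    "[ 7778.934514] nfs4_reclaim_locks: unhandled error -10024. Zeroing state",
    "[ 7778.934521] nfs4_reclaim_open_state: Lock reclaim failed!",
    "[ 7880.451121] nfs4_reclaim_open_state: unhandled error -10026. Zeroing state",
]

for (func, name), n in summarize(sample).items():
    print(f"{func}: {name} x{n}")
```

Read this way, -10024 is an old-stateid error and -10026 is a bad-sequence-id error, which suggests the client and server disagree about lock and open state; the sketch only names the codes and does not diagnose the cause.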
---
AlsaVersion: Advanced Linux Sound Architecture Driver Version 1.0.24.
AplayDevices:
 **** List of PLAYBACK Hardware Devices ****
 card 0: Intel [HDA Intel], device 0: ALC260 Analog [ALC260 Analog]
   Subdevices: 1/1
   Subdevice #0: subdevice #0
ApportVersion: 1.23-0ubuntu4
Architecture: amd64
ArecordDevices:
 **** List of CAPTURE Hardware Devices ****
 card 0: Intel [HDA Intel], device 0: ALC260 Analog [ALC260 Analog]
   Subdevices: 1/1
   Subdevice #0: subdevice #0
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: sebastian 1782 F.... xfce4-volumed
                      sebastian 1800 F.... pulseaudio
                      sebastian 1811 F.... xfce4-mixer-plu
CRDA: Error: [Errno 2] No such file or directory
Card0.Amixer.info:
 Card hw:0 'Intel'/'HDA Intel at 0xf8000000 irq 41'
   Mixer name : 'Realtek ALC260'
   Components : 'HDA:10ec0260,17348601,00100400'
   Controls : 18
   Simple ctrls : 10
CurrentDmesg: Error: command ['sh', '-c', 'dmesg | comm -13 --nocheck-order /var/log/dmesg -'] failed with exit code 1: comm: /var/log/dmesg: Permission denied
DistroRelease: Ubuntu 11.10
IwConfig: Error: [Errno 2] No such file or directory
Lsusb: Error: [Errno 2] No such file or directory
MachineType: FUJITSU SIEMENS D2151-A1
Package: nfs-utils
ProcEnviron:
 PATH=(custom, user)
 SHELL=/bin/bash
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-3.0.0-15-generic root=UUID=e35df589-f3cc-42a9-bff5-0c2cb1a7c0c2 ro quiet
ProcVersionSignature: Ubuntu 3.0.0-15.26-generic 3.0.13
RfKill: Error: [Errno 2] No such file or directory
Tags: oneiric
UdevDb: Error: [Errno 2] No such file or directory
Uname: Linux 3.0.0-15-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: sysadmin www-sg
WifiSyslog:

dmi.bios.date: 11/17/2005
dmi.bios.vendor: FUJITSU SIEMENS // Phoenix Technologies Ltd.
dmi.bios.version: 5.00 R1.07.2151.A1
dmi.board.name: D2151-A1
dmi.board.vendor: FUJITSU SIEMENS
dmi.board.version: S26361-D2151-A1
dmi.chassis.type: 6
dmi.chassis.vendor: FUJITSU SIEMENS
dmi.modalias: dmi:bvnFUJITSUSIEMENS//PhoenixTechnologiesLtd.:bvr5.00R1.07.2151.A1:bd11/17/2005:svnFUJITSUSIEMENS:pnD2151-A1:pvr:rvnFUJITSUSIEMENS:rnD2151-A1:rvrS26361-D2151-A1:cvnFUJITSUSIEMENS:ct6:cvr:
dmi.product.name: D2151-A1
dmi.sys.vendor: FUJITSU SIEMENS

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.3 kernel [1] (not a kernel in the daily directory). Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag (only that one tag; please leave the other tags). This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text.

If this bug is fixed by the mainline kernel, please add the following tag 'kernel-fixed-upstream-KERNEL-VERSION'. For example, if kernel version 3.3-rc2 fixed the issue, the tag would be: 'kernel-fixed-upstream-v3.3-rc2'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[1] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.3-rc3-precise/

affects: linux-meta (Ubuntu) → linux (Ubuntu)
tags: added: needs-upstream-testing oneiric regression-release
Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: kernel-da-key
Revision history for this message
Brad Figg (brad-figg) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 932687

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Revision history for this message
coli (sebastian-coli) wrote : AcpiTables.txt

apport information

tags: added: apport-collected
description: updated
Revision history for this message
coli (sebastian-coli) wrote : AlsaDevices.txt

apport information

Revision history for this message
coli (sebastian-coli) wrote : Card0.Amixer.values.txt

apport information

Revision history for this message
coli (sebastian-coli) wrote : Card0.Codecs.codec.2.txt

apport information

Revision history for this message
coli (sebastian-coli) wrote : Lspci.txt

apport information

Revision history for this message
coli (sebastian-coli) wrote : PciMultimedia.txt

apport information

Revision history for this message
coli (sebastian-coli) wrote : ProcCpuinfo.txt

apport information

Revision history for this message
coli (sebastian-coli) wrote : ProcInterrupts.txt

apport information

Revision history for this message
coli (sebastian-coli) wrote : ProcModules.txt

apport information

Revision history for this message
coli (sebastian-coli) wrote : PulseSinks.txt

apport information

Revision history for this message
coli (sebastian-coli) wrote : PulseSources.txt

apport information

Revision history for this message
coli (sebastian-coli) wrote : UdevLog.txt

apport information

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
coli (sebastian-coli) wrote :

@Joseph: Which kernel should I test? The mainline archive at http://kernel.ubuntu.com/~kernel-ppa/mainline/ only has v3.2-rc4 for Oneiric. Or should I manually compile the most recent kernel?

Revision history for this message
coli (sebastian-coli) wrote :

Okay, downgrading to 3.0.0-12 doesn't change anything, but downgrading to the latest Natty kernel, 2.6.38-13-generic, seems to have fixed the problem, at least so far (no hangs within two hours now).

If someone has the same problem: I just downloaded the Natty kernel from http://packages.ubuntu.com/natty-updates/linux-image-2.6.38-13-generic and installed it with dpkg -i. Of course that's no long-term solution, because Ubuntu packages may require kernel 3+.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

It would be great if you could try the kernel available at:
[1] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.3-rc3-precise/

There is a .deb file located at that link, so you should be able to download it and run:

sudo dpkg -i FILE_NAME.deb

Revision history for this message
Steve Langasek (vorlon) wrote :

Since this is an NFS client issue with kernel backtraces, it's definitely a bug in the kernel, not in the nfs-utils package.

Changed in nfs-utils (Ubuntu):
status: New → Invalid
Revision history for this message
coli (sebastian-coli) wrote :

Okay some more information:

The problem doesn't appear when just one Oneiric machine is running (in that case it runs fine!), but as soon as a second Oneiric installation is used at the same time, the problem appears. It also appears when I install the Natty kernel on both of those machines and they run at the same time, but it didn't happen when both machines were running Natty with the same Natty kernel that the Oneiric machines use now. At the moment, all of our workstations are running Natty with the Natty kernel without problems.

Strange problem. I will try the kernel linked above now and report back soon!

To sum it up at the moment:

1 Natty with Natty-Kernel: runs fine
1 Oneiric with Oneiric-Kernel: runs fine
1 Oneiric with Natty-Kernel: runs fine
2 or more Natties with Natty-Kernel: run fine
2 or more Oneirics with Oneiric-Kernel: problem appears
2 or more Oneirics with Natty-Kernel: problem appears

Revision history for this message
coli (sebastian-coli) wrote :

Okay, tested: the problem also appears when both Oneirics are running kernel 3.3.0-030300rc3-generic.

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

This issue appears to be an upstream bug, since you tested the latest upstream kernel. Would it be possible for you to open an upstream bug report at bugzilla.kernel.org [1]? That will allow the upstream Developers to examine the issue, and may provide a quicker resolution to the bug.

If you are comfortable with opening a bug upstream, it would be great if you could report the upstream bug number back in this bug report. That will allow us to link this bug to the upstream report.

[1] https://wiki.ubuntu.com/Bugs/Upstream/kernel

tags: added: kernel-bug-exists-upstream
removed: needs-upstream-testing
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

I will also perform some searches upstream to see if this issue is being discussed.

Changed in linux (Ubuntu):
status: Confirmed → Triaged
Revision history for this message
Peter Winterer (peter-winterer) wrote :

Is there any news about this issue?
After upgrading from Ubuntu 10.04 to Ubuntu 12.04 (Precise), we are getting the same errors as described in this bug report. We have a Solaris NFS server and several Ubuntu boxes as NFSv4 clients. This worked fine with 10.04; with 12.04 we are getting these errors:

[ 308.555535] NFS: nfs4_reclaim_open_state: Lock reclaim failed!
 [ 309.139900] nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
 [ 309.140185] nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
 [ 309.882763] NFS: v4 server returned a bad sequence-id error on an unconfirmed sequence ec5be010!
 [ 311.029716] BUG: unable to handle kernel NULL pointer dereference at 000000a4
 [ 311.029800] IP: [<xxxxxxxx>] nfs4_alloc_lockdata+0x1c/0x1c0 [nfs]
 [ 311.029882] *pdpt = 000000002ce7f001 *pde = 0000000000000000
 [ 311.029951] Oops: 0000 [#1] SMP
 [ 311.029980] Modules linked in: autofs4 rfcomm bnep bluetooth parport_pc ppdev binfmt_misc nfsd nfs lockd fscache auth_rpcgss nfs_acl sunrpc joydev snd_hda_codec_realtek snd_hda_intel snd_hda_codec radeon snd_hwdep snd_pcm snd_seq_midi snd_rawmidi snd_seq_midi_event psmouse snd_seq snd_timer serio_raw snd_seq_device ttm hid_logitech_dj drm_kms_helper drm snd soundcore snd_page_alloc i2c_algo_bit mac_hid mei(C) lp parport usbhid hid e1000e
 [ 311.030353]
 [ 311.030367] Pid: 3312, comm: xxx.yyy.zzz.ff- Tainted: G C 3.2.0-24-generic-pae #37-Ubuntu
 [ 311.030475] EIP: 0060:[<f88c0b6c>] EFLAGS: 00010282 CPU: 1
 [ 311.030525] EIP is at nfs4_alloc_lockdata+0x1c/0x1c0 [nfs]
 [ 311.030563] EAX: 00000088 EBX: f39b69a0 ECX: ec5be700 EDX: 00000050
 [ 311.030611] ESI: ec5be700 EDI: 00000006 EBP: ec75be88 ESP: ec75be70
 [ 311.030658] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
 [ 311.030700] Process xxx.yyy.zzz.ff- (pid: 3312, ti=ec75a000 task=ec45d860 task.ti=ec75a000)
 [ 311.030761] Stack:
 [ 311.030781] ec692e80 f39b69a0 ed839480 f39b69a0 00000002 00000006 ec75bed4 f88c0daa
 [ 311.030853] 00000050 f7402580 ec75bec8 00000000 f39b1b00 ec75beb8 f88d2898 00000000
 [ 311.030930] edab8f60 00000001 f88dbeb8 00000000 00000000 ed839480 ee192c80 f39b69a0
 [ 311.031009] Call Trace:
 [ 311.031041] [<f88c0daa>] _nfs4_do_setlk.isra.37+0x9a/0x1f0 [nfs]
 [ 311.031096] [<f88c12ba>] nfs4_lock_expired+0x6a/0xb0 [nfs]
 [ 311.031148] [<f88c9ab6>] nfs4_reclaim_locks.isra.17+0x86/0x140 [nfs]
 [ 311.031200] [<f88a82af>] ? put_nfs_open_context+0xf/0x20 [nfs]
 [ 311.031253] [<f88ca85a>] nfs4_reclaim_open_state+0x8a/0x280 [nfs]
 [ 311.031308] [<f88cab10>] nfs4_do_reclaim+0xc0/0x100 [nfs]
 [ 311.031367] [<f88cad00>] nfs4_state_manager+0x1b0/0x2a0 [nfs]
 [ 311.031417] [<c106b817>] ? recalc_sigpending+0x17/0x40
 [ 311.031470] [<f88cadf0>] ? nfs4_state_manager+0x2a0/0x2a0 [nfs]
 [ 311.031530] [<f88cae0c>] nfs4_run_state_manager+0x1c/0x30 [nfs]
 [ 311.031573] [<c107956d>] kthread+0x6d/0x80
 [ 311.031609] [<c1079500>] ? flush_kthread_worker+0x80/0x80
 [ 311.031655] [<c15af6be>] kernel_thread_helper+0x6/0x10
 [ 311.031693] Code: c7 04 24 e6 4f 8d f8 e8 b8 1b cd c8 eb a8 90 55 89 e5 57 56 53 83 ec 0c 3e 8d 74 26 00 89 45 ec 8b 41 08 89 ce 89 55 e8 8b 55...


Revision history for this message
coli (sebastian-coli) wrote :

We also experienced a crash of our file server, which may be related to this problem. Can you confirm this on your setup as well? We aren't fully sure whether the two problems are related.

In the meantime, as we couldn't solve this problem, we're using NFSv3 on Oneiric without problems.

Revision history for this message
Peter Winterer (peter-winterer) wrote :

I experienced no server-side crashes. This is really a crazy bug!
We have 25 Linux boxes with 10.04 LTS and everything works fine. I started to upgrade them one by one. Everything still worked after I upgraded the first box to 12.04. However, after upgrading the second one, both 12.04 boxes crash with the errors described above.
The really bad workaround is to use NFSv3 on 12.04; that is working for us.

Maybe it relates to this bug, however the fix is not working for us:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/974664

Revision history for this message
Peter Winterer (peter-winterer) wrote :

After rebooting the Ubuntu 12.04 box and logging in to the system, I found the following.
The "mount" command shows:
server:/path on /home/user type nfs4 (rw,nosuid,proto=tcp,port=2049,sloppy,addr=IP ,clientaddr=0.0.0.0)
clientaddr=0.0.0.0 is definitely wrong.
To correct this, I had to stop NetworkManager from managing the network interface.
I added the following to /etc/network/interfaces:
..
iface eth0 inet dhcp
auto eth0
..
After rebooting, the mount command shows the right clientaddr:
server:/path on /home/user type nfs4 (rw,nosuid,proto=tcp,port=2049,sloppy,addr=IP ,clientaddr=IP)

So far, it works without crashes and error messages.
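The check described above can be automated. The following is a rough sketch, assuming the `mount` output format quoted in this comment; the function name `suspicious_nfs4_mounts` and the sample addresses (taken from the 192.0.2.0/24 documentation range) are hypothetical.

```python
import re

# Matches lines of the form quoted above:
# "server:/path on /home/user type nfs4 (opt1,opt2=value,...)"
MOUNT_RE = re.compile(
    r"^(?P<src>\S+) on (?P<target>\S+) type (?P<fstype>\S+) \((?P<opts>[^)]*)\)"
)


def suspicious_nfs4_mounts(mount_output):
    """Return (mount point, clientaddr) for nfs4 mounts whose
    clientaddr option is missing, empty, or 0.0.0.0."""
    bad = []
    for line in mount_output.splitlines():
        match = MOUNT_RE.match(line.strip())
        if not match or match.group("fstype") != "nfs4":
            continue
        opts = {}
        for item in match.group("opts").split(","):
            key, _, value = item.strip().partition("=")
            opts[key] = value
        addr = opts.get("clientaddr")
        if addr in (None, "", "0.0.0.0"):
            bad.append((match.group("target"), addr))
    return bad


# Hypothetical sample in the same shape as the output quoted above.
sample = """\
server:/path on /home/user type nfs4 (rw,nosuid,proto=tcp,port=2049,sloppy,addr=192.0.2.1 ,clientaddr=0.0.0.0)
server:/data on /srv/data type nfs4 (rw,nosuid,proto=tcp,port=2049,addr=192.0.2.1,clientaddr=192.0.2.20)
/dev/sda1 on / type ext4 (rw,errors=remount-ro)
"""

for target, addr in suspicious_nfs4_mounts(sample):
    print(f"{target}: clientaddr={addr}")
```

On a live system, the text could come from `subprocess.run(["mount"], capture_output=True, text=True).stdout`. A flagged mount only means the option looks wrong; as later comments in this thread show, a correct clientaddr did not remove the errors for everyone.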

Revision history for this message
coli (sebastian-coli) wrote :

We don't use NetworkManager; we have always configured the interfaces through /etc/network/interfaces. But we still have this problem.

I figured out that autofs mounts have the correct clientaddr set while fstab mounts don't. Our fstab mounts also have clientaddr=0.0.0.0. When I manually unmount and remount them, they get the right clientaddr, but the problem still exists; at least, the error messages still show up in dmesg.

I also found this Red Hat bug report, https://bugzilla.redhat.com/show_bug.cgi?id=732748 , which claims that the problem is fixed with these patches: http://article.gmane.org/gmane.linux.nfs/48705 .

I am using kernel 3.3.6-030306-generic now. We still get the "nfs4_reclaim_open_state: Lock reclaim failed!" messages in dmesg, but no hang so far.

Does your workaround still work for you? Have you had any more hangs or error messages in dmesg?

Does anyone know what the clientaddr field is used for?

Revision history for this message
Peter Winterer (peter-winterer) wrote :

Since the NFS client address has been correct, it still works for us with four 12.04 boxes: no crashes and no error messages anymore. There are probably several causes for the NFSv4 reclaim error. Keep in mind that we have a Solaris 10 NFS server... maybe there is something wrong on the server side?

Revision history for this message
Andreas Heinlein (aheinlein) wrote :

Any news on this? We're experiencing exactly the same problems as described by Peter, except that the workaround doesn't work for us.
We have a lot of Ubuntu 10.04 LTS clients running with /home mounted through NFSv4, with a Debian 6.0 server. We also had a single test machine running 12.04 for several months without problems. Last Friday, I upgraded a second machine and the described problems began.
We also had a server crash on Friday, and I'm not sure whether it is related. The server stopped with "Out of memory and no killable processes left." Apparently, it started killing processes to free up memory. The logs say it was due to imapd claiming more memory, but that could well be wrong. What we also see on the server is that two out of four rpciod kernel threads are stuck in the 'D' state, which apparently also causes a permanent load level of at least 2.0. It doesn't seem to have any real performance impact, though. These stuck threads go away when you reboot the server, but return as soon as you fire up the 12.04 boxes.
We already had network cards configured by /etc/network/interfaces, so Peter's workaround doesn't work for us. I have now removed the /home line from fstab and instead mount /home manually on these two boxes. The clientaddr field is now correct (it was 0.0.0.0 before), and everything seems to work now.
This still needs to be resolved quickly. I suspect there are some protocol incompatibilities here; we already went back on the server from kernel 3.2.0 (from Debian backports) to the official squeeze kernel 2.6.32 because we had problems with ever-increasing load on the server. Maybe going to 3.2.0 on the server again would help now, since both client and server would then be running the same kernel version again. But I cannot upgrade all boxes to 12.04 beforehand just to test. I will try to set up a test environment and post the results.
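To watch for the stuck rpciod threads mentioned above, one could list tasks in the uninterruptible 'D' state by reading /proc/<pid>/stat. This is a minimal sketch for Linux; the helper names `parse_stat` and `tasks_in_d_state` are made up for illustration, and it only looks at top-level /proc/<pid> entries, not per-thread entries under /proc/<pid>/task.

```python
import os


def parse_stat(stat_text):
    """Split the start of a /proc/<pid>/stat line into (pid, comm, state)."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so locate the last ')' instead of splitting on whitespace.
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    pid = int(stat_text[:left])
    comm = stat_text[left + 1:right]
    state = stat_text[right + 1:].split()[0]
    return pid, comm, state


def tasks_in_d_state(proc_root="/proc"):
    """Return (pid, comm) for every task currently in state 'D'."""
    found = []
    if not os.path.isdir(proc_root):
        return found  # not a Linux-style /proc; nothing to report
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # task exited meanwhile, or the line was unreadable
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in tasks_in_d_state():
        print(pid, comm)
```

Tasks in 'D' state count toward the load average even when idle on the CPU, which is consistent with the permanent load of about 2.0 reported for two stuck threads.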

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

The v3.7-rc4 kernel is now available. It would be great if you could test this latest kernel, which can be downloaded from:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.7-rc4-raring/

Note that you need to install both the linux-image and linux-image-extra packages.

Thanks in advance!

Revision history for this message
Bryan Quigley (bryanquigley) wrote :

The v3.9-rc6 kernel has some more fixes that appear to be related. Could anyone try it? Is anyone still experiencing this issue?

http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.9-rc6-raring/

Instructions are here: https://wiki.ubuntu.com/Kernel/MainlineBuilds#Installing_Mainline_Kernels

Changed in linux (Ubuntu):
status: Triaged → Confirmed
Revision history for this message
Bryan Quigley (bryanquigley) wrote :

The 3.9 kernel has been released and if you are still having this issue please give it a try: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.9.1-saucy/

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Revision history for this message
Cyril Aknine (darylounet) wrote :

I'm running Ubuntu 13.04 on the NFS server and Ubuntu 12.04.2 on the NFS clients, with the latest saucy kernel 3.9.2 on both clients and server. I'm still having the issue:

2013-05-14 20:13:54.0 front Warn kernel front kernel: [16314892.359664] NFS: nfs4_reclaim_open_state: Lock reclaim failed!
2013-05-14 20:13:54.0 front Warn kernel front kernel: [16314892.376015] NFS: nfs4_reclaim_open_state: Lock reclaim failed!
2013-05-14 20:13:54.0 front Warn kernel front kernel: [16314892.392631] NFS: nfs4_reclaim_open_state: Lock reclaim failed!
2013-05-14 20:13:54.0 front Warn kernel front kernel: [16314892.411032] NFS: nfs4_reclaim_open_state: Lock reclaim failed!
2013-05-14 20:13:54.0 front Warn kernel front kernel: [16314892.442049] NFS: nfs4_reclaim_open_state: Lock reclaim failed!
2013-05-14 20:13:54.0 front Warn kernel front kernel: [16314892.463733] NFS: nfs4_reclaim_open_state: Lock reclaim failed!
2013-05-14 20:13:54.0 front Warn kernel front kernel: [16314892.463280] NFS: nfs4_reclaim_open_state: Lock reclaim failed!
2013-05-14 20:13:54.0 front Warn kernel front kernel: [16314892.483293] NFS: nfs4_reclaim_open_state: Lock reclaim failed!
2013-05-14 20:13:54.0 front Warn kernel front kernel: [16314892.512223] NFS: nfs4_reclaim_open_state: Lock reclaim failed!
2013-05-14 20:13:54.0 front Warn kernel front kernel: [16314892.526620] NFS: nfs4_reclaim_open_state: Lock reclaim failed!
2013-05-14 20:13:20.0 front Warn kernel front kernel: [23047649.204043] NFS: nfs4_reclaim_open_state: Lock reclaim failed!
2013-05-14 20:13:20.0 front Warn kernel front kernel: [23047649.241656] NFS: nfs4_reclaim_open_state: Lock reclaim failed!
2013-05-14 20:13:20.0 front Warn kernel front kernel: [23047649.267490] NFS: nfs4_reclaim_open_state: Lock reclaim failed!
2013-05-14 20:13:20.0 front Warn kernel front kernel: [23047649.276897] NFS: nfs4_reclaim_open_state: Lock reclaim failed!
2013-05-14 20:13:20.0 front Warn kernel front kernel: [23047649.220605] NFS: nfs4_reclaim_open_state: Lock reclaim failed!
2013-05-14 20:13:20.0 front Warn kernel front kernel: [23047649.258010] NFS: nfs4_reclaim_open_state: Lock reclaim failed!
2013-05-14 20:13:20.0 front Warn kernel front kernel: [23047649.295832] NFS: nfs4_reclaim_open_state: Lock reclaim failed!
2013-05-14 20:13:20.0 front Warn kernel front kernel: [23047649.748445] NFS: v4 server 10.0.0.254 returned a bad sequence-id error!
2013-05-14 20:13:20.0 front Warn kernel front kernel: [23047649.747233] NFS: v4 server 10.0.0.254 returned a bad sequence-id error!
2013-05-14 20:13:20.0 front Warn kernel front kernel: [23047649.748046] NFS: v4 server 10.0.0.254 returned a bad sequence-id error!
2013-05-14 20:13:20.0 front Error kernel front kernel: [23047649.749241] NFS: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
2013-05-14 20:13:20.0 front Error kernel front kernel: [23047649.749622] NFS: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state
2013-05-14 20:13:20.0 front Warn kernel front kernel: [16314858.377681] NFS: nfs4_reclaim_open_state: Lock reclaim failed!
2013-05-14 20:13:20.0 front Error kernel ...
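
For reference, the numeric codes in these messages are NFSv4 status values. Below is a minimal sketch for decoding them from log lines, assuming the status table from RFC 3530 (only the subset around the codes seen in this report). The helper name decode_line is hypothetical and not part of any existing tool. Note that 10026 maps to NFS4ERR_BAD_SEQID, which matches the "bad sequence-id" warnings logged just before it.

```python
import re

# Subset of NFSv4 status codes as defined in RFC 3530.
# Only the values near those seen in this report are listed.
NFS4_ERRORS = {
    10022: "NFS4ERR_STALE_CLIENTID",
    10023: "NFS4ERR_STALE_STATEID",
    10024: "NFS4ERR_OLD_STATEID",
    10025: "NFS4ERR_BAD_STATEID",
    10026: "NFS4ERR_BAD_SEQID",
}

# Matches e.g. "nfs4_reclaim_open_state: unhandled error -10026."
PATTERN = re.compile(r"(nfs4_reclaim_\w+): unhandled error -(\d+)")


def decode_line(line):
    """Return (function, code, symbolic name) for a matching log line, else None."""
    match = PATTERN.search(line)
    if match is None:
        return None
    code = int(match.group(2))
    return match.group(1), code, NFS4_ERRORS.get(code, "unknown")


if __name__ == "__main__":
    sample = "NFS: nfs4_reclaim_open_state: unhandled error -10026. Zeroing state"
    print(decode_line(sample))
```

Piping dmesg or syslog output through this line by line gives a quick view of which state errors the server is returning.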


Revision history for this message
Bryan Quigley (bryanquigley) wrote :

@darylounet
Please tell us more about your setup: on the server, the contents of /etc/exports, and on the client, the contents of /etc/fstab, etc. Are you using Kerberos encryption?

How often does it happen? Do you have to do something to cause it? Can you reproduce it at will?

Revision history for this message
Cyril Aknine (darylounet) wrote :

OK, all my servers are VMs hosted on AWS EC2.

I have one NFS server (called "back") that I recently upgraded to Ubuntu 13.04 with the saucy 3.9.2 kernel (I haven't tried 3.9.3 yet).
I have 1 to n web servers (called "front") that run PHP files which read and write media and cache files on the NFS server. The "front" servers scale up and down according to CPU usage.

I can reproduce the bug at will: every time autoscaling adds at least one "front" server, I get the bug. So I have to block autoscaling and keep only one "front" server for things to work.

In practice I can't run only one server in production, so I switched back to NFSv3.

I don't use Kerberos. Here are some configuration files:

root@back:~# cat /etc/default/nfs-kernel-server
RPCNFSDCOUNT=8
RPCNFSDPRIORITY=0
RPCMOUNTDOPTS=--manage-gids
NEED_SVCGSSD=
RPCSVCGSSDOPTS=
RPCNFSDOPTS=

root@back:~# cat /etc/exports
/srv/exports *(rw,async,fsid=0,no_subtree_check,no_root_squash)
/srv/exports/data *(rw,async,no_subtree_check,no_root_squash)

root@front:~# cat /etc/fstab
LABEL=cloudimg-rootfs / ext4 defaults,noatime 0 0
/dev/xvdf /srv/apps ext4 defaults,noatime 0 0
## This is the NFS mount ##
10.0.0.254:/data /srv/nfs nfs4 defaults,rsize=32768,wsize=32768,noatime 0 0
###################
/dev/xvdb /mnt auto defaults,noatime,nobootwait,comment=cloudconfig 0 2

Revision history for this message
Bryan Quigley (bryanquigley) wrote :

If you can distill it to an easy-to-rebuild setup that isn't dependent on AWS, that would be very helpful. Does the issue happen when you scale up even if there isn't load? In other words, is one of your scripts or apps important for reproducing this issue? If so, can it be narrowed down to something you can share?

Revision history for this message
Cyril Aknine (darylounet) wrote :

No, it happens only in the production environment under high load. I think it's related to multiple concurrent writes by our software, eZ Publish.

Revision history for this message
Bryan Quigley (bryanquigley) wrote :

I was hoping for something we could reproduce at will without impacting production machines. If you can find a way to simulate the load to cause this issue*, we can proceed. If not, I'm not sure we have another way to proceed on this bug.

I'm going to run some of these tests myself, but you appear to have a setup conducive to this bug.

*You can try something like iozone3, dbench, or the phoronix-test-suite.
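
If none of those tools fits, a small script may be enough to generate concurrent locked writes. Below is a minimal sketch, assuming the trigger is several processes contending for a lock on the same file; that is only a guess based on the description in this thread, and there is no guarantee it reproduces the bug. The function names, the file name locktest.dat, and the default worker and iteration counts are all hypothetical.

```python
import fcntl
import os


def worker(path, iterations, tag):
    """Append `iterations` lines to `path`, holding an exclusive lock per write."""
    for i in range(iterations):
        with open(path, "a") as f:
            fcntl.lockf(f, fcntl.LOCK_EX)  # blocks until the lock is granted
            f.write(f"{tag} {i}\n")
            f.flush()                      # push data out before releasing the lock
            fcntl.lockf(f, fcntl.LOCK_UN)


def run_load(directory, workers=8, iterations=1000):
    """Fork `workers` processes that contend for a lock on one shared file."""
    path = os.path.join(directory, "locktest.dat")
    pids = []
    for n in range(workers):
        pid = os.fork()
        if pid == 0:
            # Child: do the work, then exit immediately without parent cleanup.
            status = 1
            try:
                worker(path, iterations, f"w{n}")
                status = 0
            finally:
                os._exit(status)
        pids.append(pid)
    # Parent: wait for every child; a wait status of 0 means a clean exit.
    results = [os.waitpid(pid, 0)[1] for pid in pids]
    return path, all(status == 0 for status in results)


if __name__ == "__main__":
    import sys
    # Pass the directory to test as the first argument.
    print(run_load(sys.argv[1]))
```

To approximate the setup described above, run it at the same time on two or more clients, pointing each at the same directory on the NFSv4 mount (for example /srv/nfs from the fstab posted earlier), and watch dmesg on the clients.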

Revision history for this message
gergnz (gergnz) wrote :

FWIW, just did an upgrade on some systems.

All Precise machines, all running -39.

Upgraded to -48, this bug appeared.

Rolled back the clients only to -39, bug disappeared.

Site is a semi heavy web service with 1 NFS server, and 2 NFS clients running nginx/php.

The issue was seen almost immediately after nginx/php started up: lots of threads got stuck and load went up due to high I/O wait.

fstab:
10.1.2.3:/www /data/www nfs defaults 0 0
exports:
/data/shared 10.1.2.0/24(rw,fsid=0,insecure,no_subtree_check,async,no_root_squash)
/data/shared/www 10.1.2.0/24(rw,nohide,insecure,no_subtree_check,async,no_root_squash)
