NFS server: lockd: server not responding

Bug #181996 reported by Denis Sidorov on 2008-01-11
This bug affects 1 person
Affects Status Importance Assigned to Milestone
linux (Ubuntu)
Undecided
Unassigned
Gutsy
Undecided
Unassigned
linux-source-2.6.22 (Ubuntu)
High
Unassigned
Gutsy
High
Unassigned

Bug Description

Running NFS server on Ubuntu Server 7.10 x86 (Pentium-3, 1G RAM).
- linux-image-2.6.22-14-server (2.6.22-14.47)
- nfs-kernel-server (1:1.1.1~git-20070709-3ubuntu1)
- nfs-common (1:1.1.1~git-20070709-3ubuntu1)

NFS clients (ubuntu, gentoo, fedora core) mount home directories from the server.
Works fine for a while after reboot, but at some moment (30 minutes to several days after last reboot) client applications (firefox, thunderbird, openoffice, ...) would freeze at start and the following error message can be seen in the syslog:

Jan 11 14:08:33 jig kernel: [ 5527.793749] lockd: server tango not responding, still trying
Jan 11 14:08:34 jig kernel: [ 5529.029039] lockd: server tango not responding, still trying
Jan 11 14:08:45 jig kernel: [ 5540.246812] lockd: server tango not responding, still trying

The nfsd, rpc.statd, rpc.mountd processes keep running on server. No relevant errors can be found in server syslog.

Restarting the nfs-kernel-server (on server) and nfs-common (on both server and client) would not help - the problem persists.

Have also tried nfs-user-server instead of nfs-kernel-server - no luck.

The only way to make it work is to reboot the server.

description: updated
Denis Sidorov (sidorov-denis) wrote :

Since I have downgraded kernel to 2.6.20 (a week ago), the error does not show up anymore.
It appears to be a bug in the kernel, because I found a similar issue reported for Fedora Core 7, also running 2.6.22.

the.jxc (jonathan-spiderfan) wrote :

I can confirm the same on a brand new Gutsy install with 2.6.22-14. It usually occurs within 24 hours. When it happens, Skype, Amarok and other apps hang first, but eventually all apps hang. A client reboot does nothing. Restarting nfs-kernel-server doesn't help (it doesn't clear the broken lockd; see below). Only a daily server reboot resolves anything.

On the client side (also 2.6.22-14) I see:
syslog.0:Jan 17 23:23:21 romita kernel: [28308.368819] lockd: server kirby not responding, still trying

On the server side, if I restart nfs-kernel-server, I see:
Jan 18 08:55:37 kirby kernel: [62797.376546] lockd_down: lockd failed to exit, clearing pid

...and on the server side I will now see TWO "[lockd]" processes where before I saw one.

I don't have a 2.6.20 kernel to go back to on my new server. This is basically making my server totally unusable. I'm looking at having to drop nfs and use samba instead. GACK!

the.jxc (jonathan-spiderfan) wrote :

Feel free to contact me if I can offer help debugging this.

the.jxc (jonathan-spiderfan) wrote :

OK, I turned on debug with:

echo "65535" > /proc/sys/sunrpc/nlm_debug

There seems to be a problem when lockd enters garbage collection. Here's the last of the debug seen from lockd on the server side.

[ 2277.091005] lockd: request from 192.168.1.210, port=864
[ 2277.091018] lockd: LOCK called
[ 2277.091022] lockd: nlm_lookup_host(192.168.1.210, p=6, v=4, my role=server, name=romita)
[ 2277.091026] lockd: get host romita
[ 2277.091027] lockd: nsm_monitor(romita)
[ 2277.091031] lockd: nlm_file_lookup (01070001 00288001 00000000 926e57da d142d9c6 dabb48bd c2a30bcf 0028c75d)
[ 2277.091035] lockd: creating file for (01070001 00288001 00000000 926e57da d142d9c6 dabb48bd c2a30bcf 0028c75d)
[ 2277.091047] lockd: found file f7be3900 (count 0)
[ 2277.091050] lockd: nlmsvc_lock(sda1/2672477, ty=0, pi=58832, 1073741824-1073741824, bl=0)
[ 2277.091054] lockd: nlmsvc_lookup_block f=f7be3900 pd=58832 1073741824-1073741824 ty=0
[ 2277.091056] lockd: nlm_lookup_host(192.168.1.210, p=6, v=4, my role=server, name=romita)
[ 2277.091058] lockd: get host romita
[ 2277.091062] lockd: created block ecbfe6c0...
[ 2277.091066] lockd: vfs_lock_file returned 0
[ 2277.091068] lockd: freeing block ecbfe6c0...
[ 2277.091069] lockd: release host romita
[ 2277.091071] lockd: nlm_release_file(f7be3900, ct = 2)
[ 2277.091073] lockd: nlmsvc_lock returned 0
[ 2277.091075] lockd: LOCK status 0
[ 2277.091076] lockd: release host romita
[ 2277.091078] lockd: nlm_release_file(f7be3900, ct = 1)

[ 2277.091298] lockd: request from 192.168.1.210, port=864
[ 2277.091302] lockd: LOCK called
[ 2277.091304] lockd: nlm_lookup_host(192.168.1.210, p=6, v=4, my role=server, name=romita)
[ 2277.091306] lockd: get host romita
[ 2277.091307] lockd: nsm_monitor(romita)
[ 2277.091310] lockd: nlm_file_lookup (01070001 00288001 00000000 926e57da d142d9c6 dabb48bd c2a30bcf 0028c75d)
[ 2277.091316] lockd: found file f7be3900 (count 0)
[ 2277.091319] lockd: nlmsvc_lock(sda1/2672477, ty=0, pi=58832, 1073741826-1073742335, bl=0)
[ 2277.091322] lockd: nlmsvc_lookup_block f=f7be3900 pd=58832 1073741826-1073742335 ty=0
[ 2277.091325] lockd: nlm_lookup_host(192.168.1.210, p=6, v=4, my role=server, name=romita)
[ 2277.091327] lockd: host garbage collection
[ 2277.091328] lockd: nlmsvc_mark_resources

Nothing more is seen from the lockd after the start of the GC. Looking at earlier GC runs from the syslog, the pattern is:

[ 2037.388911] lockd: nlm_lookup_host(192.168.1.210, p=6, v=4, my role=server, name=romita)
[ 2037.388914] lockd: host garbage collection
[ 2037.388916] lockd: nlmsvc_mark_resources
[ 2037.388920] nlm_gc_hosts skipping romita (cnt 0 use 0 exp 455264)
[ 2037.388922] nlm_gc_hosts skipping ditko (cnt 0 use 0 exp 460016)
[ 2037.388924] lockd: get host romita

So it finds a couple of entries (skips 'em) and then breaks out to carry on immediately with "get host". I'm assuming that GC is invoked as part of lookup handling, and doesn't just get triggered asynchronously.

Anyhow, this looks like a good spot to start digging. I don't see anything running in top (does lockd even show in top?), but the process is still in the ps table. It just doesn't do anything...


the.jxc (jonathan-spiderfan) wrote :

OK, I added more debug to /usr/src/linux/fs/lockd/host.c and installed a new lockd module. Seems like it's getting lost somewhere in nlmsvc_mark_resources(). I'll keep digging.

the.jxc (jonathan-spiderfan) wrote :

It's getting lost in nlm_inspect_file ().

[ 693.679373] lockd: mutex acquired, checking 128 file hash entries
[ 693.679375] lockd: got entry in list 58
[ 693.679376] lockd: inspecting file

        dprintk("lockd: mutex acquired, checking %d file hash entries\n", FILE_NRHASH);
        for (i = 0; i < FILE_NRHASH; i++) {
                hlist_for_each_entry_safe(file, pos, next, &nlm_files[i], f_list) {
                        dprintk("lockd: got entry in list %d\n", i);
                        file->f_count++;
                        mutex_unlock(&nlm_file_mutex);

                        /* Traverse locks, blocks and shares of this file
                         * and update file->f_locks count */
                        dprintk("lockd: inspecting file\n");
                        if (nlm_inspect_file(host, file, match))
                                ret = 1;

                        dprintk("lockd: inspection complete\n");

...it never returns from nlm_inspect_file (...).

the.jxc (jonathan-spiderfan) wrote :

Right, this appears (unsurprisingly) to be a mutex contention issue, on the file-specific mutex.

See the attached trace: server-kirby-v3.dmsg

Key parts are:

[ 5845.725268] lockd: request from 192.168.1.210, port=860
[ 5845.725272] lockd: LOCK called
[ 5845.725274] lockd: nlm_lookup_host(192.168.1.210, p=6, v=4, my role=server, name=romita)
[ 5845.725504] lockd: get host romita
[ 5845.725506] lockd: found host in cache
[ 5845.725507] lockd: nsm_monitor(romita)
[ 5845.725509] lockd: nlm_file_lookup (01070001 00288001 00000000 926e57da d142d9c6 dabb48bd c2a30bcf 0028c75d)
[ 5845.725789] lockd: found file f7aa2840 (count 0)
[ 5845.725792] lockd: nlmsvc_lock(sda1/2672477, ty=0, pi=80357, 1073741826-1073742335, bl=0)
[ 5845.725806] lockd: nlmsvc_lookup_block f=f7aa2840 pd=80357 1073741826-1073742335 ty=0
[ 5845.725809] lockd: nlm_lookup_host(192.168.1.210, p=6, v=4, my role=server, name=romita)
[ 5845.726186] lockd: host garbage collection
[ 5845.726188] lockd: nlmsvc_mark_resources
[ 5845.726189] lockd: nlm_traverse_files
[ 5845.726255] lockd: mutex acquired, checking 128 file hash entries
[ 5845.726257] lockd: got entry in list 29
[ 5845.726259] lockd: inspecting file f=f7afd480
[ 5845.726260] lockd: traverse blocks
[ 5845.726262] lockd: locking file mutex
[ 5845.726483] lockd: unlocking file mutex
[ 5845.726485] lockd: traverse shares
[ 5845.726486] lockd: traverse locks
[ 5845.726488] lockd: inspection complete
[ 5845.726625] lockd: check file for release
...
(Same pattern repeated for several other files)
...
[ 5845.728644] lockd: got entry in list 58
[ 5845.728645] lockd: inspecting file f=f7aa2840
[ 5845.728646] lockd: traverse blocks
[ 5845.728648] lockd: locking file mutex
...
The final debug is from nlmsvc_traverse_blocks() in /usr/src/linux/fs/lockd/svclock.c

        dprintk("lockd: locking file mutex\n");
        mutex_lock(&file->f_mutex);
        list_for_each_entry_safe(block, next, &file->f_blocks, b_flist) {
                dprintk("lockd: trying block for host %p\n", host);
                ...
        }
        dprintk("lockd: unlocking file mutex\n");
        mutex_unlock(&file->f_mutex);

And it's clear now that we're calling mutex_lock() and never returning from it.

The important note is that all the previous file checks worked. Why is there a mutex
already held on only this file? Note that this is the file from the request that
actually triggered the GC. Presumably the mutex is taken for this file, then we run
the GC, and we attempt to take the same mutex out again. I'll trawl the code and
confirm this.

If so, the fix is probably to move the call to the GC so that it's outside the handling
for the actual RPC call. In fact, the mutex isn't strictly required for the GC because
in this case we're only counting host references. But it looks like we're doing our
reference count by piggybacking on some other code which actually does sweeps of
file locks, so we can't just remove the mutexes.

the.jxc (jonathan-spiderfan) wrote :

OK, let's follow the request and see what is performing a file mutex lock.

[ 5845.725268] lockd: request from 192.168.1.210, port=860
This is from the main lockd kernel thread function.
static void lockd (...) in svc.c. It invokes svc_process().

[ 5845.725272] lockd: LOCK called
Via some xdr magic, preprocessor, and function lookup table, our main handler function
nlmsvc_proc_lock (...) from svcproc.c is called.

[ 5845.725274] lockd: nlm_lookup_host(192.168.1.210, p=6, v=4, my role=server, name=romita)
[ 5845.725504] lockd: get host romita
[ 5845.725506] lockd: found host in cache
nlmsvc_proc_lock (...) invokes nlmsvc_retrieve_args (...) also in svcproc.c to get/parse
some args, including the host. In this case, the host is found in the cache.

[ 5845.725507] lockd: nsm_monitor(romita)
nlmsvc_retrieve_args (...) also monitors the host in some way that isn't clear to me yet.
It doesn't appear to be related to our problem, so that can be put aside for now.

[ 5845.725509] lockd: nlm_file_lookup (01070001 00288001 00000000 926e57da d142d9c6 dabb48bd c2a30bcf 0028c75d)
nlmsvc_retrieve_args (...) also does a file lookup by calling nlm_lookup_file(...) This debug
is from nlm_lookup_file (even though it says nlm_file_lookup). We take out the file table
mutex here, but not the file-specific mutex. We initialise the file mutex here, so from this
point onwards we need to be looking out for file specific locks.
[ 5845.725789] lockd: found file f7aa2840 (count 0)

[ 5845.725792] lockd: nlmsvc_lock(sda1/2672477, ty=0, pi=80357, 1073741826-1073742335, bl=0)
Now nlmsvc_proc_lock (...) calls nlmsvc_lock (...) from svclock.c to do the actual locking. Very
first thing, right after the debug, this takes out the mutex on the file...

        /* Lock file against concurrent access */
        mutex_lock(&file->f_mutex);

The corresponding...
        mutex_unlock(&file->f_mutex);
...is right down the bottom of nlmsvc_lock (...)...
out:
        mutex_unlock(&file->f_mutex);
        nlmsvc_release_block(block);
        dprintk("lockd: nlmsvc_lock returned %u\n", ret);

...but we don't get that far. I think we've found it then. But let's carry on...

[ 5845.725806] lockd: nlmsvc_lookup_block f=f7aa2840 pd=80357 1073741826-1073742335 ty=0
The call from nlmsvc_lock (...) to nlmsvc_lookup_block (...) is right after the file-specific
mutex lock is taken out. We don't find an existing block, so nlmsvc_lock (...) creates a
new one by calling nlmsvc_create_block (...).

[ 5845.725809] lockd: nlm_lookup_host(192.168.1.210, p=6, v=4, my role=server, name=romita)
nlmsvc_create_block (...) calls nlmsvc_lookup_host (...) ...

[ 5845.726186] lockd: host garbage collection
...which decides it's time to take out the trash.

[ 5845.726188] lockd: nlmsvc_mark_resources
[ 5845.726189] lockd: nlm_traverse_files
[ 5845.726255] lockd: mutex acquired, checking 128 file hash entries
[ 5845.726257] lockd: got entry in list 29
[ 5845.726259] lockd: inspecting file f=f7afd480
[ 5845.726260] lockd: traverse blocks
[ 5845.726262] lockd: locking file mutex
...which goes through all the files fine, until it comes to the specific file for which we are currently
serving the request.


the.jxc (jonathan-spiderfan) wrote :

Hmm... one quick-fix approach would seem to be to pass LOCK_RECURSIVE to mutex_init.

On that note, it's not clear to me yet why we're even using mutexes here. Isn't there only a single lockd process? And in that case, all these mutexes are private, no? Or is it possible to start two lockd's for higher performance (not something I've ever done).

Alternatively, create a new function:

/*
 * Check to see if it's time to sweep the garbage out of the hosts structures.
 */
static void
nlm_gc_hosts_if_needed(void)
{
        if (time_after_eq(jiffies, next_gc))
                nlm_gc_hosts();
}

...remove the corresponding code from nlm_lookup_host (...), and invoke nlm_gc_hosts_if_needed from somewhere outside the file-specific mutex code. Maybe in the lockd main loop, after each call to svc_process (...).

I think I'll try that with my code. I'm just a bit worried about performance impact of making all file mutexes recursive. Surely a recursive mutex has to be a bit of a hit compared to the vanilla version?

the.jxc (jonathan-spiderfan) wrote :

I went with the nlm_gc_hosts_if_needed () approach. Stable so far. Debug shows completion of GC.

[ 6879.405447] lockd: request from 192.168.1.211, port=729
[ 6879.405454] lockd: LOCK called
[ 6879.405458] lockd: nlm_lookup_host(192.168.1.211, p=6, v=4, my role=server, name=ditko)
[ 6879.405460] lockd: get host ditko
[ 6879.405461] lockd: found host in cache
[ 6879.405463] lockd: nsm_monitor(ditko)
[ 6879.405466] lockd: nlm_file_lookup (01070001 00288001 00000000 926e57da d142d9c6 dabb48bd c2a30bcf 002a0bca)
[ 6879.405470] lockd: creating file for (01070001 00288001 00000000 926e57da d142d9c6 dabb48bd c2a30bcf 002a0bca)
[ 6879.405477] lockd: found file f7a3fcc0 (count 0)
[ 6879.405481] lockd: nlmsvc_lock(sda1/2755530, ty=1, pi=95, 0-9223372036854775807, bl=0)
[ 6879.405484] lockd: nlmsvc_lookup_block f=f7a3fcc0 pd=95 0-9223372036854775807 ty=1
[ 6879.405487] lockd: nlm_lookup_host(192.168.1.211, p=6, v=4, my role=server, name=ditko)
[ 6879.405488] lockd: get host ditko
[ 6879.405489] lockd: found host in cache
[ 6879.405492] lockd: created block ef70db80...
[ 6879.405495] lockd: vfs_lock_file returned 0
[ 6879.405497] lockd: freeing block ef70db80...
[ 6879.405498] lockd: release host ditko
[ 6879.405500] lockd: nlm_release_file(f7a3fcc0, ct = 2)
[ 6879.405502] lockd: nlmsvc_lock returned 0
[ 6879.405503] lockd: LOCK status 0
[ 6879.405504] lockd: release host ditko
[ 6879.405506] lockd: nlm_release_file(f7a3fcc0, ct = 1)
[ 6879.405512] lockd: host garbage collection
[ 6879.405513] lockd: nlmsvc_mark_resources
[ 6879.405515] lockd: nlm_traverse_files
[ 6879.405516] lockd: mutex acquired, checking 128 file hash entries
[ 6879.405519] lockd: got entry in list 109
[ 6879.405520] lockd: inspecting file f=f7a3fcc0
[ 6879.405521] lockd: traverse blocks
[ 6879.405525] lockd: locking file mutex
[ 6879.405526] lockd: unlocking file mutex
[ 6879.405527] lockd: traverse shares
[ 6879.405528] lockd: traverse locks
[ 6879.405530] lockd: inspection complete
[ 6879.405531] lockd: check file for release
[ 6879.405532] lockd: nlm_traverse_files finally releasing mutex
[ 6879.405533] lockd: nlm_traverse_files completed
[ 6879.405535] lockd: now removing inactive hostsnlm_gc_hosts skipping romita (cnt 0 use 0 exp 1672246)
[ 6879.405538] nlm_gc_hosts skipping ditko (cnt 0 use 1 exp 1672627)
[ 6879.405540] lockd: completed host garbage collection, next at (1642627 + 15000 = 1657627)
[ 6879.406106] lockd: request from 192.168.1.211, port=729
...

I'm missing a \n in a dprintk. Otherwise looks sweet.

the.jxc (jonathan-spiderfan) wrote :

This is fixed in the 2.6.24 kernel series.

I installed:

linux-image-2.6.24-5-generic_2.6.24-5.8_i386.deb
linux-ubuntu-modules-2.6.24-5-generic_2.6.24-5.9_i386.deb

from:

http://packages.ubuntu.com/hardy/base/

(After making the changes to yaird required to install it)
vi /usr/lib/yaird/perl/Input.pm
--- Input.pm.orig	2007-10-22 18:29:27.000000000 +0200
+++ Input.pm	2007-12-11 15:39:52.000000000 +0100
@@ -54,6 +54,11 @@
 		my $devLink = Conf::get('sysFs')
 			. "/class/input/$handler/device";
 		my $hw = readlink ($devLink);
+		if (defined ($hw) && $hw =~ s!^(\.\./)+(class/input/input\d+)$!$2!) {
+			# Linux 2.6.23 eventX -> inputX link
+			$devLink = Conf::get('sysFs') . '/' . $hw . '/device';
+			$hw = readlink ($devLink);
+		}
 		if (defined ($hw)) {
 			unless ($hw =~ s!^(\.\./)+devices/!!) {
 				# imagine localised linux (/sys/geraete ...)
...it all works fine. I'll try and track down the patchset required to fix the Gibbon kernel.

Ben Beuchler (insyte) wrote :

Any progress on a patch? I'm running into the same problem. If not, would you mind providing a bit more info describing the necessary steps to get the 2.6.24 kernel installed on a Gutsy server?

Thanks...

the.jxc (jonathan-spiderfan) wrote :

Second part is easy. Fix yaird as above, download the .deb files, and install them both with "dpkg -i". I had no hassles with that.

J. Bruce Fields suggested the following two patches, but I didn't use those.

http://git.linux-nfs.org/?p=trondmy/nfs-2.6.git;a=commitdiff;h=255129d1e9ca0ed3d69d5517fae3e03d7ab4b806
http://git.linux-nfs.org/?p=trondmy/nfs-2.6.git;a=commitdiff;h=a6d85430424d44e946e0946bfaad607115510989

...I just downloaded the ubuntu source for the kernel I had, and manually patched the lockd driver.

Russel Winder (russel) wrote :

I am running fully up to date Gutsy server and am getting what I think is the same problem as is reported here. After an indeterminate amount of time and/or activity, the [lockd] process on the server goes from S state to D state and all queries from clients result in messages such as:

Feb 28 07:59:36 balin kernel: [73693.569139] lockd: server dimen not responding, still trying

and hang forever.

I tried stopping and then starting nfs-common and nfs-kernel-server but the [lockd] process remains and in state D. Killing it explicitly has no apparent effect. A new [lockd] process appears in the process table after the restart of nfs-kernel-server but it appears not to be used.

The only remedy appears to be to reboot the server and then it seems all the clients.

It seems that the solution to the problem may now be known, so I guess the question is when will an update to the Gutsy kernel be issued? I guess it goes without saying that it would be good if the kernel issued with Hardy does not have this problem?

Thanks.

Yeah,

I know how to fix the problem, but I have no idea how to get a patch into Gutsy. Any ideas who I would contact?

J.

Russel Winder wrote:
> I am running fully up to date Gutsy server and am getting what I think
> is the same problem as is reported here. [...]

Russel Winder (russel) wrote :

I would have thought that the Ubuntu Kernel Team would have looked at this problem -- especially as there is a putative fix. However, it seems it may not yet have even been triaged by them. The problem, at least as I see it, is that there is no regularity to the failure. This must make it hard to actively work on.

the.jxc (jonathan-spiderfan) wrote :

No,

The failure is very regular. It happens whenever the garbage collection is performed as a result of a lock request.

J.

Brett Sealey (brett-sealey) wrote :

I've been seeing it for a while now, but only when I run an application on the nfs client that intensively uses file locking.

The only fix is to reboot the server.

When it occurs, the following hangs on the client (in the flock):
       time flock ~/junk echo ok; rm ~/junk

[note: flock is in the util-linux package]

A fix in Gutsy seems simple and would be very nice.

From the comments here it seems this is resolved for the Hardy kernel, so I'm marking this "Fix Released" against the Hardy 'linux' kernel source package. The kernel stable release update policy is fairly strict: https://wiki.ubuntu.com/KernelUpdates . If someone could confirm that the two patches mentioned in comment https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/181996/comments/15 resolve the issue for Gutsy, the kernel team may take this into consideration for an SRU. Until then, the task against 2.6.22 will be closed. Thanks.

Changed in linux:
status: New → Fix Released
Changed in linux-source-2.6.22:
status: New → Won't Fix
Jesper Krogh (jesper) wrote :

I can confirm that the above 2 patches solve the problem.

The problem is really grave, making the NFS server in Gutsy barely usable. The locking problem occurred about every second day here. I applied the patch over a week ago and haven't seen the problem since.

Jesper

Jesper Krogh (jesper) wrote :

Leann Ogasawara: Should we provide more to get a SRU for this bug in gutsy?

Jesper

the.jxc (jonathan-spiderfan) wrote :

What's an SRU? I'd love to know more about the process for getting fixes into Ubuntu. Please explain!

Jesper Krogh (jesper) wrote :

SRU is a StableReleaseUpdate; that's described in the links above. It's the process for getting fixes pushed to a "stable release".

Jesper Krogh (jesper) wrote :

Changing to Confirmed, as described by Leann Ogasawara, now that the patches are confirmed to work on a Gutsy system.

Changed in linux-source-2.6.22:
status: Won't Fix → Confirmed

Hi Jesper,

Thanks so much for testing and the feedback. I've reopened the Gutsy nomination and have reassigned to the kernel team.

For anyone wanting more information about the Stable Release Policy also refer to: https://wiki.ubuntu.com/StableReleaseUpdates .

Thanks again for the testing and the help. We definitely appreciate your patience and cooperation.

Changed in linux:
status: New → Invalid
Changed in linux-source-2.6.22:
assignee: nobody → ubuntu-kernel-team
importance: Undecided → High
milestone: none → gutsy-updates
status: New → Triaged
assignee: nobody → ubuntu-kernel-team
importance: Undecided → High
milestone: none → gutsy-updates
status: Confirmed → Triaged
JT (spikyjt) wrote :

I'd just like to add that I have this problem too and thank all those who have provided debugging info. This bug has been crippling my system for some time, and confusing me greatly.

I would like to tentatively ask if there is any further progress with adding the patch into a release update? I shall test the patches myself to add another confirmed success with them (I hope) and report back.

I have to say I find it a little scary that this kernel version could go out as a "stable" release with this bug in it. Do not many people use NFS in ubuntu circles? I thought it would be considered an essential service.

Thanks again for all your help.

Russel Winder (russel) wrote :

I am now running Hardy with kernel 2.6.24-16-server and have not seen this problem for 8 days now. Is it the case that the kernel was patched and this is a patched kernel? If it is I am very happy and thankful to those who did the debugging and the patching. If not, then has the problem been circumvented?

Thanks.

Jesper Krogh (jk-novozymes) wrote :

Well, since the problem is only present on the Gutsy kernel, it is quite obvious that you cannot reproduce it on the Hardy kernel. The patch above is from the patch stream between Gutsy and Hardy.

Jesper

Changed in linux-source-2.6.22:
assignee: ubuntu-kernel-team → colin-king
Tim Gardner (timg-tpi) wrote :

There are a series of NFS patches pending on the SRU process. Any day now...

Changed in linux-source-2.6.22:
assignee: colin-king → timg-tpi
status: Triaged → Fix Committed
Tim Gardner (timg-tpi) on 2008-05-28
Changed in linux-source-2.6.22:
status: Triaged → Fix Committed
Shang Wu (shangwu) wrote :

Any update on this? Has it been released yet??

Tim Gardner (timg-tpi) wrote :

Released in 2.6.22-14.53

Changed in linux-source-2.6.22:
assignee: timg-tpi → nobody
status: Fix Committed → Fix Released
status: Fix Committed → Fix Released
Vincent A (vja) wrote :

After getting the same problem last week ("lockd: server ... not responding, timed out" on client; unkillable lockd on server) I had a look at the source of the linux-image-2.6.22-15-generic package that we're using. To my surprise, I couldn't confirm that the patches mentioned in https://bugs.launchpad.net/ubuntu/+source/linux-source-2.6.22/+bug/181996/comments/15 had been applied. Can anyone comment on this?

Details:
$ dpkg -s linux-image-generic |grep ^Version
Version: 2.6.22.15.22
$ apt-get source linux-image-2.6.22-15-generic
[...]
$ less linux-source-2.6.22-2.6.22/fs/lockd/svclock.c

Bart Swennen (bswennen) wrote :

I've come to the same conclusion as Vincent in https://bugs.launchpad.net/ubuntu/+source/linux-source-2.6.22/+bug/181996/comments/35 : the 2.6.22-15 kernel seems not to have those patches applied ... any chance it will in the near future ?

I've looked at the sources in linux-source-2.6.22_2.6.22-15.58_all.deb

Upgrading to hardy is not (yet) an option, but we really would like to use a `normal' Ubuntu-gutsy-kernel, which we cannot now because of this bug.

Eckart Haug (ubuntu-syntacs) wrote :

Upgrading to Hardy won't help; still the same.

client:
2.6.24-19-generic #1 SMP Wed Aug 20 22:56:21 UTC 2008 i686 GNU/Linux

says:
Oct 7 10:54:15 lagaffe kernel: [ 3099.897267] lockd: server tide not responding, still trying
Oct 7 10:54:16 lagaffe kernel: [ 3101.624752] lockd: server tide not responding, still trying

server:
2.6.24-19-server #1 SMP Sat Jul 12 00:40:01 UTC 2008 i686 GNU/Linux

says:
Oct 7 10:56:15 tide kernel: [3364891.912872] lockd: server lagaffe not responding, timed out
Oct 7 10:56:15 tide kernel: [3364891.912939] lockd: couldn't create RPC handle for lagaffe
Oct 7 10:56:15 tide kernel: [3364891.913118] rpcbind: server lagaffe not responding, timed out

the.jxc (jonathan-spiderfan) wrote :

Can't agree with you there, Eckart. I upgraded to Hardy and all my problems with NFS disappeared.

jcouper@kirby:~$ uname -a
Linux kirby 2.6.24-19-generic #1 SMP Wed Aug 20 22:56:21 UTC 2008 i686 GNU/Linux

...and hasn't hung once in months. Used to hang at least once a day under Gibbon.

Bart Swennen (bswennen) wrote :

Same here: do not agree with Eckart: we use the hardy kernel on an otherwise Gutsy installation and the problem stays away.

When booting the Gutsy kernel, it promptly pops up again (within a day).

Eckart Haug (ubuntu-syntacs) wrote :

I tried the generic kernel (as opposed to server). It worked for a couple of days, then the same again.
Until about 4 weeks ago the problem appeared sporadically, then almost every day - without
any change to the server (no automatic updates). It might depend on the configuration or certain
packages on the client. My home resides on the server. Within the time in question I installed
VirtualBox on the client. It adds a script which adds a tap device (but doesn't activate a bridge).
It might also depend on my slow server hardware (PIII-866/256MB).
I'm mounting nolock for the moment :-)), seems to work fine

Per a decision made by the Ubuntu Kernel Team, bugs will no longer be assigned to the ubuntu-kernel-team in Launchpad as part of the bug triage process. The ubuntu-kernel-team is being unassigned from this bug report. Refer to https://wiki.ubuntu.com/KernelTeamBugPolicies for more information. Thanks.

I have exactly the same problem on a Hardy server (Should I open a new bug report ?):
   * linux-image-server 2.6.24.23.25
   * nfs-kernel-server 1:1.1.2-2ubuntu2.2
   * nfs-common 1:1.1.2-2ubuntu2.2

If I reboot the server, it works for only a few minutes.

Eckart Haug (ubuntu-syntacs) wrote :

(Storm)
I had been using the nolock option since then - that means working with locking disabled,
which - of course - worked.

On 10.02. I enabled locking again to give it a try. No problems since then.
Kernels are 2.6.24-23-generic on both client and server
nfs-kernel-server and nfs-common are 1:1.1.2-2ubuntu2.2

I still don't think it's a new problem - it just shows up in very special cases,
which we don't know. Over here it disappeared as randomly as it appeared before
- and you still have it. When did it appear at your site? Which changes did you make
before?

If you post, have a look at https://wiki.ubuntu.com/KernelTeamBugPolicies
Over here, we're on our own now.

Guido Nickels (gsn) wrote :

Hi!

We're experiencing the bug on hardy here, too:

- snip -
Sep 3 11:22:57 recovery1 kernel: [68409.731835] rpcbind: server s03.hallopizza.org not responding, timed out
Sep 3 11:22:57 recovery1 kernel: [68409.731876] lockd: server s03.hallopizza.org not responding, timed out
Sep 3 11:22:57 recovery1 kernel: [68409.731895] lockd: couldn't create RPC handle for s03.hallopizza.org
Sep 3 11:23:57 recovery1 kernel: [68469.578518] rpcbind: server s03.hallopizza.org not responding, timed out
Sep 3 11:23:57 recovery1 kernel: [68469.578559] lockd: server s03.hallopizza.org not responding, timed out
Sep 3 11:23:57 recovery1 kernel: [68469.578568] lockd: couldn't create RPC handle for s03.hallopizza.org
- snap -

Versions:
linux-image-2.6.24-24-generic 2.6.24-24.59
nfs-common 1:1.1.2-2ubuntu2.2
nfs-kernel-server 1:1.1.2-2ubuntu2.2

Only a reboot helps, but not for long - and we can't disable locking, as some customers depend on it.

Please tell me if I can help with debug information.

Cheers!

Guido

Arie Skliarouk (skliarie) wrote :

We use initrd.img-2.6.24-19-openvz with a bunch of Linux clients without any problems.
Recently I tried to add a Mac OS X client and immediately noticed that the nfs-kernel-server on Linux started locking up for several seconds every minute (thus stalling NFS access for every other client), with the following message printed in the logs:
Jan 10 11:25:32 ubuntu1 kernel: [15421367.859941] rpcbind: server boaz-macbook.local not responding, timed out
Jan 10 11:25:32 ubuntu1 kernel: [15421367.859965] lockd: couldn't create RPC handle for boaz-macbook.local

I had to switch the Mac OS X client to use Samba instead.
