Comment 9 for bug 269954

Revision history for this message
Joe Kislo (joe-k12s) wrote :

We have been regression testing our production environment on Ubuntu Hardy, and I believe the root cause of an intermittent nfs failure issue we are seeing on Hardy (but not previous versions of ubuntu) is sourced from this kernel bug. It took a long time to track the issue down, as our use case is significantly more complex than the one described here, but when you boil it down, I think it's the same thing.

We are testing the 2.6.24-24-generic kernel on Ubuntu hardy. All patches are applied as of 06/04/09. The test case provided by Ari Mujunen does not work for me, and I am unable to reproduce it by those steps. I have been able to reproduce it reliably a different way:

System A: NFS Server
System B: NFS Client

System A:
mkdir tmpdir
touch tmpdir/tmpfile
tar -cvf x.tar tmpdir

System B:
stat tmpdir/tmpfile

System A:
tar -xvf x.tar

System B:
stat tmpdir/tmpfile

There's something about tar and having a subdirectory in the mix that seems to trigger the issue for me, that Ari's steps don't.

Here are the server export:
xxx client(rw,async,no_root_squash,no_subtree_check)
and the client mount:
server:/xxx /x nfs soft,intr,rsize=8192,wsize=8192,nosuid,noac,tcp,timeo=20

This is a very serious issue for us, and unlike the .XAuthority usecase, we can't just 'work around it'. Our production environment implements an Active/Active redundant NFS store at the application level. This issue will (rightly so) make our nfs layer believe the nfs server has failed, adding in arbitrary directory listing calls isn't a practical option, and doesn't reliably work either. The ls trick described above seems to MOSTLY work for getting the nfs client back alive, but not always. We have directory structures that may be nested a few levels deep, and this seems to cause further issues. I've seen our application get the kernel nfs client into a state where whole directories of files are returning 'Stale NFS file handle' errors... One time I used the ls trick to try to get the client working again, and it stopped the 'Stale NFS file handle' errors, but ls returned completely corrupted garbage (it listed the files but with alot of ??????'s for the attributes).

I have found reference to this bug here:
http://bugzilla.kernel.org/show_bug.cgi?id=12557

The but has a candidate patch attached, but I haven't tried it yet.

This seems like a fairly serious regression. The tar usecase I have provided seems like a very frequent operation over nfs, and this isn't a client race condition. The client is just toast... it doesn't come back unless you remount, or you do some trickery by finding out what's changed and trying to do directory listings of those directories.