ssh -X breaks Xauthority on NFS mounted home dir

Bug #269954 reported by Brendan Powers on 2008-09-13
Affects: Debian (status: Confirmed, importance: Unknown)
Affects: linux (Ubuntu) (importance: Undecided, assigned to: Unassigned)
Nominated for Hardy by gcc
Nominated for Jaunty by Tobias Oetiker

Bug Description

When using an NFS-mounted home directory, connecting to the NFS server with ssh and running an X11 app there causes X applications on the client to break.

Steps to reproduce the problem:
1) Have two computers (A and B) with Ubuntu 8.04 installed
2) Export /home from computer A, and mount it on computer B
3) Create a user that has the same username, home directory, and UID on both computers
4) Log into computer B as the user that was just created, then ssh into box A and run an X application (ssh -X boxa xterm)
5) In another terminal, try to run another X app on box B as the same user.

What is supposed to happen: The X app on box B runs properly
What actually happens: The X app will fail with an error like "Xlib: connection to ":0.0" refused by server" about 50% of the time

This can be fixed by running "cat ~/.Xauthority".
Also, sometimes running "ls -al ~/.Xauthority" results in a "Stale NFS file handle" error.

I assume this is because the .Xauthority file is removed and a new one is created when ssh sets up credentials on computer A. This changes the file's inode number, causing the stale file handle errors on computer B. However, this problem did not occur in Ubuntu Feisty; I haven't tried it with Gutsy. I've tried installing the xauth and libxau6 packages from Feisty, but that did not fix the problem. So I'm guessing it's an ssh issue, but I could be wrong.
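The replace-by-rename pattern suspected here can be demonstrated locally, without NFS. A minimal sketch (the filename and the copy/rename sequence are illustrative, not the exact calls ssh or xauth makes):

```shell
#!/bin/sh
# Replacing a file via copy + rename gives it a new inode.
# An NFS client that cached a handle to the old inode can then
# see ESTALE ("Stale NFS file handle") on later stat() calls.
dir=$(mktemp -d)
cd "$dir" || exit 1
touch .Xauthority
before=$(ls -i .Xauthority | awk '{print $1}')
# The pattern many programs use to "update" a file on save:
cp .Xauthority .Xauthority.new && mv .Xauthority.new .Xauthority
after=$(ls -i .Xauthority | awk '{print $1}')
echo "inode before: $before, after: $after"
```

The inode numbers printed will differ, which is exactly the change an NFS client on another machine would have to revalidate.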

Someone reported the same issue on the Ubuntu mailing list, but didn't seem to file a bug report that I could find.
https://lists.ubuntu.com/archives/ubuntu-users/2008-July/152242.html

I'm using Ubuntu 8.04.1 and also had the same problem with 8.04.0. It worked fine in Ubuntu 7.04.
Package Versions
==========
openssh-client: 1:4.7p1-8ubuntu1.2
openssh-server: 1:4.7p1-8ubuntu1.2
xauth: 1:1.0.2-2
libxau6: 1:1.0.3-2

Ari Mujunen (ari-mujunen) wrote :

I'm also running 8.04.1 with linux-image-2.6.24-19-generic (2.6.24-19.41) and I can confirm this problem.

By running 'while true; do date; ls -il .Xauthority ; sleep 1; done' on both my Ubuntu client and Debian etch NFS server (running 2.6.18-6-amd64) I can see that doing a 'ssh -X third-machine' indeed replaces the ~/.Xauthority file in my home directory: the inode number changes on the server but not on my Ubuntu desktop.

With the 'defaults' mount options in my /etc/fstab, my desktop continues to show old ~/.Xauthority inode number and stat() data for 60 seconds, then 'ls -l' starts returning 'ls: cannot access .Xauthority: Stale NFS file handle'. A command line 'stat .Xauthority' returns the same error. Doing any of 'ls .', 'cat .Xauthority >/dev/null', 'touch .Xauthority' will immediately cure this, letting 'ls -l .Xauthority' show the correct and updated info, the same as in the NFS server.

With the 'noac' mount option in my /etc/fstab the behavior is otherwise the same as in the above, but the 60 second timeout disappears and 'stat .Xauthority' starts to return 'Stale NFS file handle' immediately after 'ssh -X third-machine' has replaced the file at 'third-machine' and the NFS server.

With the 'sync' mount option in /etc/fstab the behavior is just like in the case of 'defaults' mount option.

I find it quite likely that this has actually nothing to do with the way OpenSSH does the updating of '.Xauthority' file (i.e. many other applications revise files by replacing them with a new version) but the way '~/.Xauthority' is apparently often used, by just 'stat()'ting it, leads to exposure of this problem. It looks to me more like a kernel NFS client problem.

I'm a bit puzzled why 'lstat64(".Xauthority", ...)' would fail with stale NFS handle whereas a 'open(".Xauthority", ...)' will work just fine---why the former gets out-of-date directory info and the latter gets the correct, updated one?

Ari Mujunen (ari-mujunen) wrote :

Further testing with various kernel versions seems to confirm that this bug is specific to NFS clients running a 2.6.24 kernel.

Changed in openssh:
status: New → Confirmed
Ari Mujunen (ari-mujunen) wrote :

We performed further testing with NFS clients of various kernel versions:
 - Debian etch with 2.6.18.dfsg.1-22etch2: ok
 - Ubuntu Hardy 8.04.1 with linux-image-2.6.24-19-generic 2.6.24-19.41: stat() returns ESTALE indefinitely
 - Debian etch with vanilla kernel.org 2.6.26.1: ok
 - Ubuntu Intrepid 8.10 Alpha 5 with linux-image-2.6.27-3-generic 2.6.27-3.4: ok (stale for approx. one second and then ok and updated)

So it seems to us that this bug quite likely has something to do with these changes to the kernel:
'http://kerneltrap.org/Linux/NFS_Client_Updates_for_2.6.24'. It seems to be fixed at least in 2.6.26 and 2.6.27. I wonder what would be a suitable fix for 8.04 LTS?

Ari Mujunen (ari-mujunen) wrote :

Ah, the easiest steps to reproduce:
1) Take two NFS clients, one of which is running 2.6.24-something (machine B) and the other can run any version (machine A). You can equally well use the NFS server itself as machine A, as long as machine B is running 2.6.24 NFS client.
2) On A: 'touch file'.
3) On B: 'stat file', stats print out fine.
4) On A: 'cp file another && mv another file'.
5) On B: 'stat file' results in 'stat: cannot stat `file': Stale NFS file handle' for a long time (~minutes).
6) On B: 'ls .' or 'touch file' or 'cat file >/dev/null' (open()ing the file or reading the directory containing the file) makes 'stat file' work normally again.

I found a workaround for the ssh problem: create a different Xauthority file on each host.

That can be done from the client side as follows:

First of all, add something like this to your ~/.bashrc:

if [ -n "$SSH_CLIENT" ] ; then
  export XAUTHORITY=$HOME/.Xauthority-$HOSTNAME
fi

Unfortunately, ssh calls xauth before setting up the user environment, so I use the following ~/.ssh/rc file:

if [ -n "$DISPLAY" ] ; then
  if read proto cookie ; then
    case $DISPLAY in
      localhost:*) xauth -f "$HOME/.Xauthority-$HOSTNAME" add "unix:${DISPLAY#localhost:}" "$proto" "$cookie" ;;
      *) xauth -f "$HOME/.Xauthority-$HOSTNAME" add "$DISPLAY" "$proto" "$cookie" ;;
    esac
  fi
fi
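A quick sanity check for this workaround is to mimic what the ~/.bashrc snippet does and confirm that X clients would pick up a host-specific cookie file. The SSH_CLIENT value below is a hand-set placeholder; in a real session sshd sets it:

```shell
#!/bin/sh
# Sketch: emulate the .bashrc logic above and show the resulting
# XAUTHORITY path. SSH_CLIENT is a placeholder value here.
SSH_CLIENT="192.0.2.1 51000 22"
HOSTNAME=${HOSTNAME:-$(hostname)}
if [ -n "$SSH_CLIENT" ] ; then
  XAUTHORITY=$HOME/.Xauthority-$HOSTNAME
  export XAUTHORITY
fi
echo "X clients on this host would use: $XAUTHORITY"
```

Each host then writes cookies to its own file, so no host ever replaces the shared ~/.Xauthority and no stale handle is created.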

Ari Mujunen (ari-mujunen) wrote :

My previous note about this bug being corrected in intrepid and kernel 2.6.27 was a bit premature: currently with the linux-image-2.6.27-7-generic version 2.6.27-7.16 the problem still occurs quite often. It has been improved a little, namely only the first 'stat file' results in 'stat: cannot stat `file': Stale NFS file handle'. Subsequent 'stat file' will find updated file information.

For the '~/.Xauthority' case it is still sufficient to do a simple 'ls' in '~/' directory and after that the stat info of '~/.Xauthority' is not stale anymore.

Currently the bug is harder to reproduce, since not every 'ssh -X' will result in stale NFS handles on '~/.Xauthority'. However, once it happens, it can prevent starting new X programs for a long time. Although now a second "manual" 'stat .Xauthority' resolves the stale handle, attempting to start new X programs multiple times does _not_ seem to do the same.

So the original bug found in 2.6.24 is still there but its occurrence is not as deterministic as before and it still causes major confusion to users who run into it.

Christoph Cullmann (cullmann) wrote :

I can reproduce this behaviour too; the Hardy kernel is therefore unusable for our company. Every normal editor does an unlink & move on save, which causes massive problems, for example for an Apache instance serving these files from NFS: they break after nearly every edit for some minutes :(

Steven Hirsch (snhirsch) wrote :

And here I thought I was the only one seeing this! This issue has been driving me nuts for the past two Ubuntu releases (currently running Hardy). In my case, it's more basic than problems with ssh. I have my home directory on NFS, and the desktop seems to "lose" ~/.Xauthority periodically. The symptoms are that I'll be working along and all of a sudden nothing will start! If I try, e.g., xclock at the command line, it tells me "Xlib: connection to :0.0 refused...". A simple 'cd ; ls' seems to get things going again. Until the next time.

I have Googled endlessly and can find no mention of it other than this thread of reports. Both Gutsy and Hardy have done this. I'm beginning to suspect the xauth mechanism itself rather than the kernel, but that's just a guess (if, for example, xauth previously did an open/close round on the file but now simply calls stat).

My setup is very simple and only one client machine is actually using the home directory - it's not dueling overwrites of .Xauthority.

I opened a bug report on it, but the person processing the issue simply could not grasp the issue and it's never been addressed.

Joe Kislo (joe-k12s) wrote :

We have been regression testing our production environment on Ubuntu Hardy, and I believe the root cause of an intermittent nfs failure issue we are seeing on Hardy (but not previous versions of ubuntu) is sourced from this kernel bug. It took a long time to track the issue down, as our use case is significantly more complex than the one described here, but when you boil it down, I think it's the same thing.

We are testing the 2.6.24-24-generic kernel on Ubuntu hardy. All patches are applied as of 06/04/09. The test case provided by Ari Mujunen does not work for me, and I am unable to reproduce it by those steps. I have been able to reproduce it reliably a different way:

System A: NFS Server
System B: NFS Client

System A:
mkdir tmpdir
touch tmpdir/tmpfile
tar -cvf x.tar tmpdir

System B:
stat tmpdir/tmpfile

System A:
tar -xvf x.tar

System B:
stat tmpdir/tmpfile

There's something about tar and having a subdirectory in the mix that triggers the issue for me in a way that Ari's steps don't.

Here is the server export:
xxx client(rw,async,no_root_squash,no_subtree_check)
and the client mount:
server:/xxx /x nfs soft,intr,rsize=8192,wsize=8192,nosuid,noac,tcp,timeo=20

This is a very serious issue for us, and unlike the .Xauthority use case, we can't just work around it. Our production environment implements an active/active redundant NFS store at the application level. This issue will (rightly so) make our NFS layer believe the NFS server has failed; adding arbitrary directory listing calls isn't a practical option and doesn't reliably work either. The ls trick described above seems to MOSTLY work for getting the NFS client back alive, but not always. We have directory structures that may be nested a few levels deep, and this seems to cause further issues. I've seen our application get the kernel NFS client into a state where whole directories of files return 'Stale NFS file handle' errors. One time I used the ls trick to try to get the client working again, and it stopped the 'Stale NFS file handle' errors, but ls returned completely corrupted garbage (it listed the files but with a lot of ?????? for the attributes).

I have found reference to this bug here:
http://bugzilla.kernel.org/show_bug.cgi?id=12557

The bug has a candidate patch attached, but I haven't tried it yet.

This seems like a fairly serious regression. The tar use case I have provided seems like a very frequent operation over NFS, and this isn't a client race condition. The client is just toast... it doesn't come back unless you remount, or you do some trickery by finding out what's changed and doing directory listings of those directories.

Joe Kislo (joe-k12s) wrote :

(Once a fix is found, I second the nomination to backport it to hardy)

Joe Kislo (joe-k12s) wrote :

The patch I linked to in Bugzilla did NOT solve our problem; we applied it against an Ubuntu Hardy 2.6.24-24 kernel. It may or may not have helped the original way of reproducing this error (which I could never reproduce on the Hardy kernel), but it does not help our way of reproducing the problem (with tar).

Anybody have any thoughts? This seems like a pretty serious problem.

Tobias Oetiker (tobi-oetiker) wrote :

Since the proposed patch has been integrated into the official kernel since 2.6.29 and does help some, I would love to see it backported.

http://mirror.celinuxforum.org/gitstat//commit-detail.php?commit=a71ee337b31271e701f689d544b6153b75609bc5

The bug is really annoying, especially for large shops where people regularly log in to different boxes.

It also affects svn repositories: if they are updated on different machines, they go stale on the boxes where the update did not happen.

Tobias Oetiker (tobi-oetiker) wrote :

... backported to hardy and jaunty that is.

gcc (chris+ubuntu-qwirx) wrote :

Another vote for a backport to Hardy :)

Changed in debian:
status: Unknown → Confirmed
Andreas Romer (andreas-romer) wrote :

Another vote for a backport to Hardy. We are seeing severe problems in our environment with stale NFS handles on files that are used over the network.

Johannes Becker (becker-mip) wrote :

I would also appreciate a backport very much.

Tobias Oetiker (tobi-oetiker) wrote :

I have analyzed this a bit further and figured why it was not always reproducible ... see the last comment on http://bugzilla.kernel.org/show_bug.cgi?id=12557

Hi Tobi,

The bug can hardly be described in more detail than that. I'm impressed.
Hopefully the Ubuntu kernel maintainers will be too.

Best regards,
Johannes


Changed in debian:
status: Confirmed → Fix Released
Changed in debian:
status: Fix Released → Confirmed

Brendan Powers, this bug was reported a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue? If so, could you please test for this with the latest development release of Ubuntu? ISO images are available from http://cdimage.ubuntu.com/daily-live/current/ .

If it remains an issue, could you please run the following command in the development release from a Terminal (Applications->Accessories->Terminal), as it will automatically gather and attach updated debug information to this report:

apport-collect -p linux <replace-with-bug-number>

Also, could you please test the latest upstream kernel available following https://wiki.ubuntu.com/KernelMainlineBuilds ? It will allow additional upstream developers to examine the issue. Please do not test the daily folder, but the one all the way at the bottom. Once you've tested the upstream kernel, please comment on which kernel version specifically you tested. If this bug is fixed in the mainline kernel, please add the following tags:
kernel-fixed-upstream
kernel-fixed-upstream-VERSION-NUMBER

where VERSION-NUMBER is the version number of the kernel you tested. For example:
kernel-fixed-upstream-v3.11-rc5

This can be done by clicking on the yellow circle with a black pencil icon next to the word Tags located at the bottom of the bug description. As well, please remove the tag:
needs-upstream-testing

If the mainline kernel does not fix this bug, please add the following tags:
kernel-bug-exists-upstream
kernel-bug-exists-upstream-VERSION-NUMBER

As well, please remove the tag:
needs-upstream-testing

If you are unable to test the mainline kernel, please comment as to why specifically you were unable to test it and add the following tags:
kernel-unable-to-test-upstream
kernel-unable-to-test-upstream-VERSION-NUMBER

Once testing of the upstream kernel is complete, please mark this bug's Status as Confirmed. Please let us know your results. Thank you for your understanding.

tags: added: hardy needs-kernel-logs needs-upstream-testing regression-release
Changed in linux (Ubuntu):
status: Confirmed → Incomplete