ssh -X breaks Xauthority on NFS mounted home dir

Bug #269954 reported by Brendan Powers
Affects: Debian; Status: Fix Released; Importance: Unknown
Affects: linux (Ubuntu); Status: Incomplete; Importance: Undecided; Assigned to: Unassigned
Nominated for Hardy by gcc
Nominated for Jaunty by Tobias Oetiker

Bug Description

With an NFS-mounted home directory, using ssh to connect to the NFS server and run an X11 app there causes X applications on the client to break.

Steps to reproduce the problem:
1) Have two computers (A and B) with Ubuntu 8.04 installed.
2) Export /home from computer A, and mount it on computer B.
3) Create a user that has the same username, home directory, and UID on both computers.
4) Log into computer B as the user that was just created, ssh into computer A, and run an X application (ssh -X boxa xterm).
5) In another terminal, try to run another X app on computer B as the same user.

What is supposed to happen: The X app on box B runs properly
What actually happens: The X app will fail with an error like "Xlib: connection to ":0.0" refused by server" about 50% of the time

This can be fixed by running "cat ~/.Xauthority".
Also sometimes running "ls -al ~/.Xauthority" will result in a "stale nfs file handle" error.

I assume this is because the .Xauthority file is removed and a new one is created when ssh sets up credentials on computer A. This changes the file's inode number, causing the stale file handle errors on computer B. However, this problem did not occur in Ubuntu Feisty. I haven't tried it with Gutsy. I've tried installing the xauth and libxau6 packages from Feisty, but that did not fix the problem. So I'm guessing it's an ssh issue, but I could be wrong.
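
The inode change assumed above can be observed on a local filesystem, without NFS. This is only a minimal sketch: it assumes the file is replaced by writing a new copy and renaming it over the old name, which may not be exactly what ssh and xauth do. The function names and the temporary file name are made up for illustration.

```shell
#!/bin/sh
# Print the inode number of a file ("ls -i" prints "<inode> <name>").
inode_of() {
    ls -i "$1" | awk '{print $1}'
}

# Hypothetical helper: replace a file by copy + rename and report
# whether its inode number changed. Returns 0 if it changed.
inode_changes_on_replace() {
    dir=$(mktemp -d) || return 2
    f="$dir/Xauthority-demo"
    echo cookie > "$f"
    before=$(inode_of "$f")
    # The new file is created while the old one still exists,
    # then takes over the old name.
    cp "$f" "$f.new" && mv "$f.new" "$f"
    after=$(inode_of "$f")
    echo "before=$before after=$after"
    rm -rf "$dir"
    [ "$before" != "$after" ]
}
```

Calling inode_changes_on_replace prints the two inode numbers. An NFS client that still holds a handle for the old inode is the one that would see the stale file handle.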

Someone reported the same issue on the Ubuntu mailing list, but I could not find a corresponding bug report.
https://lists.ubuntu.com/archives/ubuntu-users/2008-July/152242.html

I'm using ubuntu 8.04.1, also had the same problem with 8.04.0. It worked fine in ubuntu 7.04.
Package Versions
==========
openssh-client: 1:4.7p1-8ubuntu1.2
openssh-server: 1:4.7p1-8ubuntu1.2
xauth: 1:1.0.2-2
libxau6: 1:1.0.3-2

Ari Mujunen (ari-mujunen) wrote :

I'm also running 8.04.1 with linux-image-2.6.24-19-generic (2.6.24-19.41) and I can confirm this problem.

By running 'while true; do date; ls -il .Xauthority ; sleep 1; done' on both my Ubuntu client and Debian etch NFS server (running 2.6.18-6-amd64) I can see that doing a 'ssh -X third-machine' indeed replaces the ~/.Xauthority file in my home directory: the inode number changes on the server but not on my Ubuntu desktop.
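
A variant of the loop above that only prints when the inode number changes can make the replacement easier to spot. This is a sketch, not the command used in the test; the function name watch_inode and the poll limit are assumptions added for illustration.

```shell
#!/bin/sh
# Hypothetical helper: poll a file's inode number once per second and
# print a line each time it differs from the previous poll.
# $1 = file to watch, $2 = number of polls (default 10).
watch_inode() {
    file=$1
    polls=${2:-10}
    last=
    i=0
    while [ "$i" -lt "$polls" ]; do
        # "ls -i" prints "<inode> <name>"; on error cur stays empty.
        cur=$(ls -i "$file" 2>/dev/null | awk '{print $1}')
        if [ "$cur" != "$last" ]; then
            echo "$(date '+%H:%M:%S') inode=${cur:-unavailable}"
            last=$cur
        fi
        i=$((i + 1))
        sleep 1
    done
}
```

Running 'watch_inode ~/.Xauthority 120' on both the NFS server and the client while doing 'ssh -X third-machine' should show the change on the server side only, if the behavior is as described here.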

With the 'defaults' mount options in my /etc/fstab, my desktop continues to show old ~/.Xauthority inode number and stat() data for 60 seconds, then 'ls -l' starts returning 'ls: cannot access .Xauthority: Stale NFS file handle'. A command line 'stat .Xauthority' returns the same error. Doing any of 'ls .', 'cat .Xauthority >/dev/null', 'touch .Xauthority' will immediately cure this, letting 'ls -l .Xauthority' show the correct and updated info, the same as in the NFS server.

With the 'noac' mount option in my /etc/fstab the behavior is otherwise the same as in the above, but the 60 second timeout disappears and 'stat .Xauthority' starts to return 'Stale NFS file handle' immediately after 'ssh -X third-machine' has replaced the file at 'third-machine' and the NFS server.

With the 'sync' mount option in /etc/fstab the behavior is just like in the case of 'defaults' mount option.

I find it quite likely that this has actually nothing to do with the way OpenSSH does the updating of '.Xauthority' file (i.e. many other applications revise files by replacing them with a new version) but the way '~/.Xauthority' is apparently often used, by just 'stat()'ting it, leads to exposure of this problem. It looks to me more like a kernel NFS client problem.

I'm a bit puzzled why 'lstat64(".Xauthority", ...)' would fail with a stale NFS handle whereas 'open(".Xauthority", ...)' works just fine. Why would the former get out-of-date directory info and the latter the correct, updated info?

Ari Mujunen (ari-mujunen) wrote :

Further testing with various kernel versions seems to confirm that this bug is specific to NFS clients running a 2.6.24 kernel.

Changed in openssh:
status: New → Confirmed
Ari Mujunen (ari-mujunen) wrote :

We performed further testing with NFS clients of various kernel versions:
 - Debian etch with 2.6.18.dfsg.1-22etch2: ok
 - Ubuntu Hardy 8.04.1 with linux-image-2.6.24-19-generic 2.6.24-19.41: stat() returns ESTALE indefinitely
 - Debian etch with vanilla kernel.org 2.6.26.1: ok
 - Ubuntu Intrepid 8.10 Alpha 5 with linux-image-2.6.27-3-generic 2.6.27-3.4: ok (stale for approx. one second and then ok and updated)

So it seems to us that this bug quite likely has something to do with these changes to the kernel:
'http://kerneltrap.org/Linux/NFS_Client_Updates_for_2.6.24'. It seems to be fixed at least in 2.6.26 and 2.6.27. I wonder what would be a suitable fix for 8.04 LTS?

Ari Mujunen (ari-mujunen) wrote :

Ah, the easiest steps to reproduce:
1) Take two NFS clients, one of which is running 2.6.24-something (machine B) and the other can run any version (machine A). You can equally well use the NFS server itself as machine A, as long as machine B is running 2.6.24 NFS client.
2) On A: 'touch file'.
3) On B: 'stat file', stats print out fine.
4) On A: 'cp file another && mv another file'.
5) On B: 'stat file' results in "stat: cannot stat `file': Stale NFS file handle" for a long time (~minutes).
6) On B: 'ls .' or 'touch file' or 'cat file >/dev/null' (open()ing the file or reading the directory containing the file) makes 'stat file' work normally again.
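
Step 6 suggests a client-side stopgap: if stat fails, read the containing directory once and try again. The following is a sketch of that idea only, not a fix for the underlying problem; the function name stat_with_refresh is made up, and the retry does nothing useful for files that are genuinely missing.

```shell
#!/bin/sh
# Hypothetical wrapper: succeed if the file can be stat()ed, retrying
# once after reading the parent directory. Returns the exit status of
# the last stat call.
stat_with_refresh() {
    file=$1
    if stat "$file" >/dev/null 2>&1; then
        return 0
    fi
    # Reading the directory is what step 6 reports as clearing the
    # stale handle on the client.
    ls -a "$(dirname "$file")" >/dev/null 2>&1
    stat "$file" >/dev/null 2>&1
}
```

For example, 'stat_with_refresh ~/.Xauthority && xclock' would refresh the directory before starting an X program, assuming the directory read behaves as described in step 6.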

stef70 (stephane-chauveau-central) wrote :

I found a workaround for the ssh problem: create a different Xauthority file on each host.

That can be done from the client side as follows:

First of all, add something like this to your bashrc file:

if [ -n "$SSH_CLIENT" ] ; then
  export XAUTHORITY=$HOME/.Xauthority-$HOSTNAME
fi

Unfortunately, ssh calls xauth before setting the user environment so I use the following ~/.ssh/rc file:

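# sshd passes the X11 protocol name and cookie on stdin when forwarding is active.
# rc is run by /bin/sh, which may not set HOSTNAME, so fall back to hostname(1).
host=${HOSTNAME:-$(hostname)}
if [ -n "$DISPLAY" ] ; then
  if read proto cookie ; then
    case $DISPLAY in
      localhost:*) xauth -f "$HOME/.Xauthority-$host" add "unix:$(echo "$DISPLAY" | cut -c11-)" "$proto" "$cookie" ;;
      *) xauth -f "$HOME/.Xauthority-$host" add "$DISPLAY" "$proto" "$cookie" ;;
    esac
  fi
fi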
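
The 'cut -c11-' in the rc file strips the ten-character prefix 'localhost:' from $DISPLAY, so 'localhost:10.0' is stored as 'unix:10.0'. As a sketch, the same mapping can be isolated in a small function and checked without an X server; the function name xauth_display_name is made up for illustration and is not part of ssh or xauth.

```shell
#!/bin/sh
# Hypothetical helper: map the DISPLAY value set by sshd to the display
# name used in the xauth entry. localhost:N.S becomes unix:N.S; any
# other value is passed through unchanged.
xauth_display_name() {
    case $1 in
        localhost:*) printf 'unix:%s\n' "${1#localhost:}" ;;
        *)           printf '%s\n' "$1" ;;
    esac
}
```

With this helper, both branches of the case statement in the rc file would collapse into a single xauth call that uses "$(xauth_display_name "$DISPLAY")" as the display name.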

Ari Mujunen (ari-mujunen) wrote :

My previous note about this bug being corrected in intrepid and kernel 2.6.27 was a bit premature: currently with the linux-image-2.6.27-7-generic version 2.6.27-7.16 the problem still occurs quite often. It has been improved a little, namely only the first 'stat file' results in 'stat: cannot stat `file': Stale NFS file handle'. Subsequent 'stat file' will find updated file information.

For the '~/.Xauthority' case it is still sufficient to do a simple 'ls' in '~/' directory and after that the stat info of '~/.Xauthority' is not stale anymore.

Currently the bug is harder to reproduce, since not every 'ssh -X' will result in stale NFS handles on '~/.Xauthority'. However, once it happens, it can prevent starting new X programs for a long time. Although now a second "manual" 'stat .Xauthority' resolves the stale handle, attempting to start new X programs multiple times does _not_ seem to do the same.

So the original bug found in 2.6.24 is still there but its occurrence is not as deterministic as before and it still causes major confusion to users who run into it.

Christoph Cullmann (cullmann) wrote :

I can reproduce this behaviour too. It makes the Hardy kernel unusable for our company, because most editors unlink and rename on save. That causes massive problems, for example for an Apache serving these files from NFS: they break for some minutes after nearly every edit :(

Steven Hirsch (snhirsch) wrote :

And here I thought I was the only one seeing this! This issue has been driving me nuts for the past (2) Ubuntu releases (currently running Hardy). In my case, it's more basic than problems with ssh. I have my home directory in NFS and the desktop seems to "lose" ~/.Xauthority periodically. The symptoms are that I'll be working along and all of a sudden nothing will start! If I try, e.g., xclock at the command line, it tells me "Xlib: connection to :0.0.. refused..". A simple 'cd ; ls' seems to get things going again. Until the next time.

I have Googled endlessly and can find no mention of it other than this thread of reports. Both Gutsy and Hardy have done this. I'm beginning to suspect the xauth mechanism itself rather than the kernel, but that's just a guess (for example, if xauth previously did an open/close round on the file and now simply calls stat).

My setup is very simple and only one client machine is actually using the home directory - it's not dueling overwrites of .Xauthority.

I opened a bug report on it, but the person processing the issue simply could not grasp the issue and it's never been addressed.

Joe Kislo (joe-k12s) wrote :

We have been regression testing our production environment on Ubuntu Hardy, and I believe the root cause of an intermittent nfs failure issue we are seeing on Hardy (but not previous versions of ubuntu) is sourced from this kernel bug. It took a long time to track the issue down, as our use case is significantly more complex than the one described here, but when you boil it down, I think it's the same thing.

We are testing the 2.6.24-24-generic kernel on Ubuntu hardy. All patches are applied as of 06/04/09. The test case provided by Ari Mujunen does not work for me, and I am unable to reproduce it by those steps. I have been able to reproduce it reliably a different way:

System A: NFS Server
System B: NFS Client

System A:
mkdir tmpdir
touch tmpdir/tmpfile
tar -cvf x.tar tmpdir

System B:
stat tmpdir/tmpfile

System A:
tar -xvf x.tar

System B:
stat tmpdir/tmpfile

There's something about tar and having a subdirectory in the mix that seems to trigger the issue for me, that Ari's steps don't.

Here is the server export:
xxx client(rw,async,no_root_squash,no_subtree_check)
and the client mount:
server:/xxx /x nfs soft,intr,rsize=8192,wsize=8192,nosuid,noac,tcp,timeo=20

This is a very serious issue for us, and unlike the .Xauthority use case, we can't just 'work around it'. Our production environment implements an Active/Active redundant NFS store at the application level. This issue will (rightly so) make our NFS layer believe the NFS server has failed. Adding arbitrary directory listing calls isn't a practical option, and doesn't reliably work either. The ls trick described above seems to MOSTLY work for bringing the NFS client back to life, but not always. We have directory structures that may be nested a few levels deep, and this seems to cause further issues. I've seen our application get the kernel NFS client into a state where whole directories of files return 'Stale NFS file handle' errors... One time I used the ls trick to try to get the client working again, and it stopped the 'Stale NFS file handle' errors, but ls returned completely corrupted garbage (it listed the files but with a lot of ??????'s for the attributes).

I have found reference to this bug here:
http://bugzilla.kernel.org/show_bug.cgi?id=12557

The bug has a candidate patch attached, but I haven't tried it yet.

This seems like a fairly serious regression. The tar usecase I have provided seems like a very frequent operation over nfs, and this isn't a client race condition. The client is just toast... it doesn't come back unless you remount, or you do some trickery by finding out what's changed and trying to do directory listings of those directories.

Joe Kislo (joe-k12s) wrote :

(Once a fix is found, I second the nomination to backport it to hardy)

Joe Kislo (joe-k12s) wrote :

The patch attached to the bugzilla report I linked to did NOT solve our problem. We applied it to an Ubuntu Hardy 2.6.24-24 kernel. It may or may not have helped with the original way of reproducing this error (which I could never reproduce on the Hardy kernel), but it does not help with our way of reproducing the problem (with tar).

Anybody have any thoughts? This seems like a pretty serious problem.

Tobias Oetiker (tobi-oetiker) wrote :

Since the proposed patch has been integrated into the official kernel as of 2.6.29 and does help some, I would love to see it backported.

http://mirror.celinuxforum.org/gitstat//commit-detail.php?commit=a71ee337b31271e701f689d544b6153b75609bc5

The bug is really annoying, especially for large shops where people regularly log in to different boxes.

It also affects svn repositories: if they are updated on different machines, they go stale on the boxes where they were not updated.

Tobias Oetiker (tobi-oetiker) wrote :

... backported to hardy and jaunty that is.

gcc (chris+ubuntu-qwirx) wrote :

Another vote for a backport to Hardy :)

Changed in debian:
status: Unknown → Confirmed
Andreas Romer (andreas-romer) wrote :

Another vote for a backport to Hardy. We are seeing severe problems in our environment with stale NFS file handles on files that are used over the network.

Johannes Becker (becker-mip) wrote :

I would also appreciate a backport very much.

Tobias Oetiker (tobi-oetiker) wrote :

I have analyzed this a bit further and figured out why it was not always reproducible ... see the last comment on http://bugzilla.kernel.org/show_bug.cgi?id=12557

Johannes Becker (becker-mip) wrote : Re: [Bug 269954] Re: ssh -X breaks Xauthority on NFS mounted home dir

Hi Tobi,

One could hardly describe the bug in more detail. I am impressed.
Hopefully the kernel maintainers at Ubuntu are as well.

Best regards,
Johannes

Changed in debian:
status: Confirmed → Fix Released
Changed in debian:
status: Fix Released → Confirmed
penalvch (penalvch) wrote :

Brendan Powers, this bug was reported a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue? If so, could you please test for this with the latest development release of Ubuntu? ISO images are available from http://cdimage.ubuntu.com/daily-live/current/ .

If it remains an issue, could you please run the following command in the development release from a Terminal (Applications->Accessories->Terminal), as it will automatically gather and attach updated debug information to this report:

apport-collect -p linux <replace-with-bug-number>

Also, could you please test the latest upstream kernel available following https://wiki.ubuntu.com/KernelMainlineBuilds ? It will allow additional upstream developers to examine the issue. Please do not test the daily folder, but the one all the way at the bottom. Once you've tested the upstream kernel, please comment on which kernel version specifically you tested. If this bug is fixed in the mainline kernel, please add the following tags:
kernel-fixed-upstream
kernel-fixed-upstream-VERSION-NUMBER

where VERSION-NUMBER is the version number of the kernel you tested. For example:
kernel-fixed-upstream-v3.11-rc5

This can be done by clicking on the yellow circle with a black pencil icon next to the word Tags located at the bottom of the bug description. As well, please remove the tag:
needs-upstream-testing

If the mainline kernel does not fix this bug, please add the following tags:
kernel-bug-exists-upstream
kernel-bug-exists-upstream-VERSION-NUMBER

As well, please remove the tag:
needs-upstream-testing

If you are unable to test the mainline kernel, please comment as to why specifically you were unable to test it and add the following tags:
kernel-unable-to-test-upstream
kernel-unable-to-test-upstream-VERSION-NUMBER

Once testing of the upstream kernel is complete, please mark this bug's Status as Confirmed. Please let us know your results. Thank you for your understanding.

tags: added: hardy needs-kernel-logs needs-upstream-testing regression-release
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Changed in debian:
status: Confirmed → Fix Released