NFS shares are mounted with wrong clientaddr or not at all

Bug #1037192 reported by Nikolaus Rath on 2012-08-15
This bug affects 6 people
Affects: mountall (Ubuntu)
Importance: Medium
Assigned to: Unassigned

Bug Description

With the following fstab:

proc /proc proc nodev,noexec,nosuid 0 0
/dev/mapper/vg0-fat_client / ext4 relatime,errors=remount-ro 0 1
/dev/mapper/vg0-swap none swap sw 0 0
spitzer:/opt /opt nfs4 auto 0 0
spitzer:/home /home nfs4 auto 0 0

/opt and /home are sometimes not mounted at all, and sometimes mounted with the wrong clientaddr:

$ mount | grep nfs
rpc_pipefs on /var/lib/nfs/rpc_pipefs type rpc_pipefs (rw)
spitzer:/home on /home type nfs4 (rw,clientaddr=0.0.0.0,addr=192.168.1.2)
spitzer:/opt on /opt type nfs4 (rw,clientaddr=0.0.0.0,addr=192.168.1.2)

Since this happens on all clients (they all get the same clientaddr), this results in a frozen mount (cf. http://thread.gmane.org/gmane.linux.nfs/47780).
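
A quick way to spot affected clients is to filter the `mount` output shown above for nfs4 entries carrying clientaddr=0.0.0.0. The following is only a minimal sketch; the function name `list_bad_clientaddr` is made up for illustration, and it assumes the `mount` output format seen in this report.

```shell
# Reads `mount`-style lines on stdin and prints the mount point of every
# nfs4 entry whose option list contains clientaddr=0.0.0.0.
# Assumed field layout: DEVICE on MOUNTPOINT type FSTYPE (OPTIONS)
list_bad_clientaddr() {
    awk '$5 == "nfs4" && $6 ~ /[(,]clientaddr=0\.0\.0\.0[,)]/ { print $3 }'
}

# Typical use on a client:
#   mount | list_bad_clientaddr
```

With the output quoted in this report, this would print /home and /opt.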

The "spitzer" host is only reachable via a tinc VPN, and mounting (or remounting) manually after boot always works and results in the correct clientaddr, so I believe that mountall is attempting to mount this too early.

ProblemType: Bug
DistroRelease: Ubuntu 10.04
Package: mountall 2.15.3
ProcVersionSignature: Ubuntu 3.0.0-23.39~lucid1-server 3.0.36
Uname: Linux 3.0.0-23-server x86_64
Architecture: amd64
Date: Wed Aug 15 12:16:17 2012
ProcEnviron:
 PATH=(custom, user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: mountall

Steve Langasek (vorlon) wrote :

Thanks for the report. Can you please capture some debugging output from mountall so we can see what's happening? The most straightforward way to do this is to modify /etc/init/mountall.conf and replace this line:

      exec mountall --daemon $force_fsck $fsck_fix

with this:

      exec mountall --daemon --verbose $force_fsck $fsck_fix > /dev/mountall.log 2>&1

Be aware of course that any typos in this job file may render your system unbootable and require recovery by way of rescue mode and/or editing from the initramfs.
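
Given the risk of typos, one option is to generate the modified job file with a filter and review the result before installing it. This is a sketch only: `enable_mountall_debug` is a hypothetical helper name, and it assumes the exec line in your mountall.conf matches the one quoted above exactly.

```shell
# Reads an upstart job file on stdin and writes it to stdout with the
# "exec mountall --daemon ..." line replaced by the verbose, logging variant.
# Lines that do not match are passed through unchanged.
enable_mountall_debug() {
    sed 's|^\([[:space:]]*\)exec mountall --daemon \$force_fsck \$fsck_fix[[:space:]]*$|\1exec mountall --daemon --verbose $force_fsck $fsck_fix > /dev/mountall.log 2>\&1|'
}

# Possible use (review the diff before copying anything into /etc/init):
#   enable_mountall_debug < /etc/init/mountall.conf > /tmp/mountall.conf.new
#   diff -u /etc/init/mountall.conf /tmp/mountall.conf.new
```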

Changed in mountall (Ubuntu):
status: New → Incomplete
importance: Undecided → Medium
Steve Langasek (vorlon) wrote :

Also: is the tinc VPN initialized through /etc/network/interfaces?

Nikolaus Rath (nikratio) wrote :

No, tinc has a separate initscript in /etc/init.d. During startup, tinc executes "ifconfig $INTERFACE $IP netmask 255.255.255.0", where $INTERFACE is a tun interface created by tinc.

Will gather the debugging logs as well.

Steve Langasek (vorlon) wrote :

On Wed, Aug 15, 2012 at 05:48:13PM -0000, Nikolaus Rath wrote:
> No, tinc has a separate initscript in /etc/init.d. During startup, tinc
> executes "ifconfig $INTERFACE $IP netmask 255.255.255.0", where
> $INTERFACE is a tun interface created by tinc.

Ok. This implies it will not integrate with mountall and tell mountall to
re-attempt the NFS mounts. That's probably part of the problem here, and
will be a bug in tinc.

Nikolaus Rath (nikratio) wrote :

How would tinc need to behave for compatibility? I was assuming that udev would automatically generate a net-dev up event when the tun interface comes up, and that nothing else would be necessary.

Steve Langasek (vorlon) wrote :

On Wed, Aug 15, 2012 at 08:51:32PM -0000, Nikolaus Rath wrote:
> How would tinc need to behave for compatibility? I was assuming that
> udev would automatically generate a net-dev up event when the tun
> interface comes up, and that nothing else would be necessary.

net-device-up is generated from ifupdown hooks, not from udev. udev only
generates the net-device-added event. See /etc/network/if-up.d/upstart .

Nikolaus Rath (nikratio) wrote :

Ah, so if we manually emit a net-device-up event after the ifconfig call (or better, call the hooks in if-up.d), everything should integrate well?

Steve Langasek (vorlon) wrote :

On Wed, Aug 15, 2012 at 09:32:42PM -0000, Nikolaus Rath wrote:
> Ah, so if we manually emit a net-device-up event after the ifconfig call
> (or better, call the hooks in if-up.d), everything should integrate
> well?

At least for mountall, yes.

But the fact that you're getting 0.0.0.0 client addresses implies that the
mount is still succeeding earlier than it should; so fixing tinc is probably
not enough to fix your issue.
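
For illustration, emitting the event manually after the ifconfig call might look roughly like the sketch below. The function name `announce_vpn_up` and the `INITCTL` override (used only for dry runs) are hypothetical, and the IFACE/LOGICAL/ADDRFAM/METHOD values are assumptions modeled on ifupdown's environment; /etc/network/if-up.d/upstart is the place to check what the real hook passes. As noted above, this alone is probably not enough to fix the 0.0.0.0 clientaddr.

```shell
# Emits the upstart net-device-up event for the given interface.
# Set INITCTL=echo to print the arguments instead of calling initctl.
# ADDRFAM=inet and METHOD=manual are illustrative assumptions.
announce_vpn_up() {
    iface=$1
    ${INITCTL:-initctl} emit -n net-device-up \
        "IFACE=$iface" "LOGICAL=$iface" ADDRFAM=inet METHOD=manual
}

# In tinc's startup script, after the ifconfig call:
#   announce_vpn_up "$INTERFACE"
```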

Sarah Angelini (miriyan) wrote :

Hi Steve. I'm working with Niko on the server mountall problem. You're right that the 0.0.0.0 client address issue occurs because mountall is able to mount our NFS shares before it should. Initially upon startup, we run busybox's ipconfig (no tinc) and then rsync with the server to get system updates. After that, ifconfig is used to bring the device down and startup is to proceed as normal. It seems that some ip information must be hanging around, though. The log is recorded in mountall-0000.log

Adding an "ip addr flush" command prevents mountall from completing the mount before tinc is up. Unfortunately, it also keeps our shares from mounting at all. If while it's sitting there I issue a "mount /opt" command, then mountall seems to jump back in and continue with mounting /home. The log is mountall-flush.log and the lines at the end with "+" in front are after I've manually mounted /opt.

Adding "initctl emit -n net-device-up" to our tinc scripts prompts mountall-net.conf to send the SIGUSR1 signal. The logs show mount commands and the SIGUSR1 signal, but the network shares still don't come up unless I manually mount one. The log is mountall-net-device.log and the "+" lines are after the manual mount.

What I have gotten to work is a bit of a hack. Instead of depending on net-device-up, tinc sends out a custom signal (vpn-node-up), which mountall-net.conf is modified to listen for instead. I think the net-device-up signal is sent when the ethernet itself comes up, but before tinc is fully initialized. With the vpn-node-up signal, mountall-net only runs after tinc is up. The log for this is in mountall-hack.log. I would prefer to use the net-device-up signal properly if possible.

Please let me know what further information would be helpful.

Launchpad Janitor (janitor) wrote :

[Expired for mountall (Ubuntu) because there has been no activity for 60 days.]

Changed in mountall (Ubuntu):
status: Incomplete → Expired
Steve Langasek (vorlon) wrote :

Sorry, this shouldn't be expired out; I realize you've provided more information on this bug but I haven't gotten a chance to review it in adequate detail yet and respond to it. Setting the bug back to 'New'.

Changed in mountall (Ubuntu):
status: Expired → New
sandro dentella (sandro-e-den) wrote :

I have a related problem: an LTSP fat client can't NFS-mount /home. Mounting by hand succeeds without any problem, as does running 'mountall' from the command line, but when boot finishes /home is not mounted. I'm on Ubuntu 12.04 with mountall 2.36.

I added the same debug options as suggested to mountall.conf, logging to /dev/mountall.log, and the last part of the log reads:

montaggio di /home
local 3/3 remote 0/0 virtual 11/11 swap 0/0
local 3/3 remote 0/0 virtual 11/11 swap 0/0
mount /home [639] uscito normalmente
local 3/3 remote 0/0 virtual 11/11 swap 0/0

"uscito normalmente" means "exited normally" (and "montaggio di /home" means "mounting /home"). I wanted to check whether /home is mounted at that point, but I couldn't manage to log the output of df:

  df -h > /dev/df.log 2>&1

This neither logs anything nor creates even an empty file. Any suggestions on the origin of the problem or on further tests to run?
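
One alternative to df for this check is to look for the mount point in /proc/mounts. The sketch below uses a hypothetical helper, `is_mounted`, and leaves open where the result should be recorded during boot.

```shell
# Succeeds (exit status 0) if the given mount point appears as the second
# field of /proc/mounts-style input on stdin, and fails (status 1) otherwise.
is_mounted() {
    awk -v mp="$1" '$2 == mp { found = 1 } END { exit !found }'
}

# Possible use:
#   if is_mounted /home < /proc/mounts; then echo "/home is mounted"; fi
```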

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in mountall (Ubuntu):
status: New → Confirmed
Nick Hatch (nicholas-hatch) wrote :

I can confirm this problem on Precise 12.04.3 with mountall=2.36.4 from precise-updates.

With clientaddr=0.0.0.0, nfsstat -s showed rapidly incrementing 'setcltid' and 'setcltidconf' operations on the server, and high IO utilization on the root volume due to writing v4recovery information.

Although I am not certain about this part: less than 24 hours after updates on three clients bumped mountall to 2.36.4, our NFS server crashed after 300 days of uptime. (I don't have much data here, but it was brought down by an OOM condition, which I assume was due to a memory leak somewhere in the NFS code. This server is used exclusively for NFS, and no user-land process was to blame.)

Reverting mountall to 2.36 and rebooting restored a properly configured clientaddr, and the excessive client ID operations stopped.

The changes in mountall 2.36.1 and 2.36.2 both seem highly suspicious; see LP #643289 and LP #1078926.

ubm (edv-o) wrote :

I can also confirm this with 2.36.4, with 2.42 from quantal-updates (I couldn't go higher with 12.04.3 LTS's libc), and even with 2.36.

This can completely fry NetApp filers: clients with clientaddr=0.0.0.0, or others accessing the same filer, might have their transfer speed reduced to a few K (!) per second, or individual processes may even stall with no I/O possible at all.

This is a major showstopper for all enterprise applications.

Manually specified clientaddr mount options are a no-go for generically provisioned boxes (like cluster nodes, etc.)

There's currently no solution to this here other than dirty shell hackery in rc.local and friends...
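
As a rough idea of what such rc.local hackery could look like, the sketch below prints a remount command for every nfs4 mount that came up with clientaddr=0.0.0.0, relying on the observation earlier in this report that remounting manually after boot yields the correct clientaddr. The helper name `print_remount_commands` is hypothetical, it only prints commands rather than running them, and it has not been tested against an already frozen mount.

```shell
# Reads `mount`-style lines on stdin. For each nfs4 mount whose options
# contain clientaddr=0.0.0.0, prints the commands that would remount it
# from fstab. Nothing is executed here.
print_remount_commands() {
    while read -r _dev _on mountpoint _type fstype options; do
        [ "$fstype" = "nfs4" ] || continue
        case "$options" in
            *clientaddr=0.0.0.0*)
                printf 'umount %s && mount %s\n' "$mountpoint" "$mountpoint"
                ;;
        esac
    done
}

# Possible use once the network is known to be up:
#   mount | print_remount_commands        # review the commands first
#   mount | print_remount_commands | sh   # then run them
```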

ubm (edv-o) wrote :

.. apart from this, boxen with critical paths mounted via NFS are often unable to successfully boot at all.
These are often boxen with just one physical ethernet interface, so the algorithm for determining which address to use can't be to blame.

webrat (irc-webratz) wrote :

I've experienced this issue on machines which:
- have Ubuntu 12.04 installed
- use Network Manager with DHCP
- have a 3.2 kernel installed
- use NFSv4 with Kerberos

The error messages in syslog on the NFS server are:
~~~~~~ snip ~~~~~~
rpc.gssd[737]: ERROR: unable to resolve 0.0.0.0 to hostname: Name or service not known
rpc.gssd[737]: ERROR: failed to read service info
~~~~~~ snap ~~~~~~

Until now I could only test one box, but after I upgraded it to a 3.8 kernel (via the linux-image-generic-lts-raring package) the IP address was set properly in clientaddr.
I will test further systems this week and hope this also resolves the issue on them.
Maybe this helps others resolve the issue too, or at least find a workaround.

webrat (irc-webratz) wrote :

Seems like this was just a lucky coincidence yesterday. I tried it on another machine today, and upgrading the kernel did not help.

ubm (edv-o) wrote :

This bug is closely related to https://bugs.launchpad.net/ubuntu/+source/linux/+bug/932687

Is anyone else seeing "Stale NFS handles" and "INFO: task cp:30227 blocked for more than 120 seconds."? I have a feeling all of this is related.
