Comment 4 for bug 1167337

Steve Langasek (vorlon) wrote : Re: [Bug 1167337] Re: nfs4 mounts hang in bootup with upstart starting rpc.gssd

On Wed, Apr 17, 2013 at 09:11:56AM -0000, stef wrote:
> Where can I find these 2 upstart files?

http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/raring/nfs-utils/raring/view/head:/debian/nfs-common.gssd-mounting.upstart
http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/raring/nfs-utils/raring/view/head:/debian/nfs-common.gssd.upstart

> I have also tested a sleep of 10 sec before starting rpc.gssd, which also
> helps to solve the problem mostly.

> The problem here was rpc.gssd being started with wrong or no credentials
> (no network connection when rpc.gssd started), so the mount was denied by
> the server. I then have no /tmp/krb5cc_machine_* credential.
> In about 50% of boots the mounts work, in 50% not. This also depends on the
> hardware of the machine, as the time for the network start-up influences the
> rpc.gssd startup.
> On some machines there is almost no problem (maybe one in 30 boots), but on
> other machines 50% of boots fail.

Ok. So this doesn't sound like it's really related to rpcbind at all; and
therefore the updated upstart jobs may not have any effect.

I'm not sure why you're seeing this behavior, however. I'm using
Kerberos-authenticated NFS mounts at boot here on 13.04, and was using them
on 12.04 as well, with no problems. Many of the events can be received out
of order, but even in the "worst case" scenario, the expected sequence,
which should work, is roughly:

 - mountall tries to mount an NFS filesystem before the loopback network
   interface is up
 - the gssd job is triggered, but (I think) fails to start because there's
   no network interface
 - so the NFS mount fails
 - the loopback interface is brought up, triggering a retry of the NFS mount
 - the gssd job is triggered again, and this time it starts, though it can't
   reach the KDC
 - the NFS mount fails again because the server can't be reached
 - the "real" network interface is brought up, triggering a retry of the NFS
   mount
 - gssd is already running, so the mount attempt runs immediately
 - the kernel (and gssd) is now able to reach the server, so the mount
   succeeds

This is the expected behavior; what we need to figure out is where your
system's boot is deviating from this.
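
To make the sequence above concrete, here is a toy simulation in Python. It
is only a sketch of the retry logic as described in this comment, not how
mountall, upstart or rpc.gssd are actually implemented. All names in it
(BootState, start_gssd, try_mount, simulate, and the "eth0" interface) are
hypothetical, and the rule that gssd cannot start without any network
interface is the same guess marked "(I think)" above.

```python
from dataclasses import dataclass, field


@dataclass
class BootState:
    """Toy model of the boot state; every name here is hypothetical."""
    interfaces_up: set = field(default_factory=set)
    gssd_running: bool = False
    mounted: bool = False
    log: list = field(default_factory=list)


def start_gssd(state):
    # Assumption taken from the comment ("I think"): gssd fails to start
    # while no network interface exists at all, including loopback.
    if state.gssd_running:
        return
    if state.interfaces_up:
        state.gssd_running = True
        state.log.append("gssd started")
    else:
        state.log.append("gssd failed to start: no network interface")


def try_mount(state, real_interfaces):
    # Each mount attempt first triggers the gssd job, then tries the mount.
    if state.mounted:
        return
    start_gssd(state)
    if not state.gssd_running:
        state.log.append("mount failed: gssd not running")
    elif not (state.interfaces_up & real_interfaces):
        # Only loopback is up, so neither the server nor the KDC is reachable.
        state.log.append("mount failed: server unreachable")
    else:
        state.mounted = True
        state.log.append("mount succeeded")


def simulate(events, real_interfaces=frozenset({"eth0"})):
    """Attempt the mount once too early, then again after each interface."""
    state = BootState()
    try_mount(state, real_interfaces)
    for iface in events:
        state.interfaces_up.add(iface)
        state.log.append(f"{iface} up")
        try_mount(state, real_interfaces)
    return state


if __name__ == "__main__":
    for line in simulate(["lo", "eth0"]).log:
        print(line)
```

Running it with the "worst case" ordering (loopback first, then the real
interface) logs two failed attempts followed by a successful one. A boot
that deviates from the expected behavior would correspond to a model in
which one of these retries never happens, or in which gssd starts but ends
up in a state where later attempts still fail.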

> The problem may also arise in ignoring the _netdev option in the fstab
> file. So mounting network filesystems without network connection is not a
> very good idea at all, because it will never work.

Well, mounting _netdev filesystems should be deferred until there's at least
one network connection up, yes; but we don't defer them until "all" network
connections are up because we don't necessarily *know* when all network
connections are up. Therefore we instead use the method described above of
trying the mount again after *each* network connection comes up. This is
expected to be resilient in the face of too-early mount attempts. If it
isn't in some cases, we should fix that.