nfs4 mounts hang in bootup with upstart starting rpc.gssd

Bug #1167337 reported by stef
Affects: nfs-utils (Ubuntu)
Status: Incomplete
Importance: Undecided
Assigned to: Unassigned
Milestone: none

Bug Description

System: Ubuntu 12.04.2, 32- and 64-bit. nfs4 with krb5/LDAP authentication.
No network-manager!

An nfs4 mount results in a system hang if rpc.gssd is started before rpcbind (portmap) and the filesystem is defined as:
    nfs4 sec=krb5...
In /etc/init/gssd.conf the start condition is:
    start on (started portmap
              or mounting TYPE=nfs4 OPTIONS=*sec*krb5*)

So when the filesystem is defined as nfs4 sec=krb5,
upstart does not wait for portmap (rpcbind) and sometimes starts rpc.gssd before rpcbind. This leads to mount errors and blocks the rest of the bootup.

My workaround at the moment is to define the filesystems as:
    ... nfs vers=4,sec=krb5...

which works so far.
(I have also added
    NEED_GSSD=yes
    NEED_IDMAPD=yes
to /etc/default/nfs-common.)
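
For illustration, the failing and working forms look roughly like this in /etc/fstab (server name and mount point are placeholders, not taken from this report):

    # form reported to hang intermittently at boot
    nfsserver.example.com:/export/home  /home  nfs4  sec=krb5  0  0

    # reporter's workaround: type nfs with vers=4 in the options
    nfsserver.example.com:/export/home  /home  nfs   vers=4,sec=krb5  0  0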

Tags: krb5 nfs4 upstart
Revision history for this message
stef (update-5) wrote :

I have to apologize, but the bug is not autofs5-related.
It is only nfs4-related!

Stef

Robie Basak (racb)
affects: autofs5 (Ubuntu) → nfs-utils (Ubuntu)
Revision history for this message
Steve Langasek (vorlon) wrote :

So while there have been updates to the gssd job in later releases, these updates were to change the job so that it does not depend on portmap *at all*. The rationale is that gssd is not actually supposed to need to talk to rpcbind... and this is true of the version in 12.04 as well.

And in any case, even when there was a dependency on portmap, this was only supposed to ever have been relevant for NFSv3, not for NFSv4.

So I don't see how rpc.gssd starting before portmap was actually causing this problem for you. Could you try installing the /etc/init/gssd.conf and /etc/init/gssd-mounting.conf jobs from Ubuntu 12.10, to see if the problem persists?

Changed in nfs-utils (Ubuntu):
status: New → Incomplete
Revision history for this message
stef (update-5) wrote : Re: [Bug 1167337] Re: nfs4 mounts hang in bootup with upstart starting rpc.gssd

Where can I find these 2 upstart files?

I have also tested a sleep of 10 seconds before starting rpc.gssd, which also
mostly solves the problem.
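
For reference, that delay can be written as a pre-start stanza in /etc/init/gssd.conf, roughly like this (only a stopgap sketch; 10 seconds is simply the value I tested, and any existing pre-start script in the job would have to be merged with it):

    # crude stopgap: give the network and rpcbind a head start before gssd
    pre-start script
        sleep 10
    end script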

The problem here was that rpc.gssd started with wrong or no credentials (no
network connection yet when rpc.gssd starts), so the mount was denied by the
server. I then have no /tmp/krb5cc_machine_* credential.
In about 50% of boots the mounts work, in 50% they do not. This also depends
on the hardware of the machine, as the time the network start-up takes
influences the rpc.gssd startup.
On some machines there is no problem at all (maybe one failure in 30 boots),
but other machines fail to boot about 50% of the time.

In the failed-boot situations only a manual restart of rpc.gssd resolves the
denied mount. (Here you have to press "S" to skip the mount very quickly, as
after some time the machine ignores the "S" key and the system hangs.)

The problem may also arise from ignoring the _netdev option in the fstab file.
Mounting network filesystems without a network connection is not a good idea
at all, because it can never work.

Revision history for this message
Steve Langasek (vorlon) wrote :

On Wed, Apr 17, 2013 at 09:11:56AM -0000, stef wrote:
> Where can I find this 2 upstart files?

http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/raring/nfs-utils/raring/view/head:/debian/nfs-common.gssd-mounting.upstart
http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/raring/nfs-utils/raring/view/head:/debian/nfs-common.gssd.upstart
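
Roughly, after downloading those two files, installing them would look like this (the backup step and local filenames are illustrative; if /etc/init/gssd-mounting.conf does not exist yet on 12.04, there is nothing to back up for it):

    # keep a copy of the current 12.04 job
    sudo cp /etc/init/gssd.conf /etc/init/gssd.conf.orig
    # install the raring versions under the names upstart expects
    sudo cp nfs-common.gssd.upstart /etc/init/gssd.conf
    sudo cp nfs-common.gssd-mounting.upstart /etc/init/gssd-mounting.conf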

> I have also tested a sleep of 10 sec before starting rpc.gssd, which also
> helps to solve the problem mostly.

> The problem here was a started rpc.gssd with wrong or no credentials (missing
> network connection at start of rpc.gssd).(mount was denied by server). I than
> have no /tmp/krb5cc_machine_* credential.
> In about 50% of boots the mounts work, in 50% not. This also depends on the
> hardware of the machine, as the time for the network-start-up influences the
> rpc.gssd startup.
> On some machines there is no Problem at all (maybe one in 30 boots) but other
> machines results in 50% boot failed.

Ok. So this doesn't sound like it's really related to rpcbind at all; and
therefore the updated upstart jobs may not have any effect.

I'm not sure why you're seeing the behavior that you do, however. I'm using
kerberos-authenticated NFS mounts at boot here on 13.04, and was using them
on 12.04 as well, with no problems. Many of the events can be received out
of order, but in the "worst case" scenario, the expected sequence, which
should work, is roughly:

 - mountall tries to mount an nfs filesystem before the loopback network
   interface is up
 - the gssd job is triggered, but (I think) fails to start because there's
   no network interface
 - so the nfs mount fails
 - the loopback interface is brought up, triggering a retry of the nfs mount
 - the gssd job is triggered again, and this time it starts, though it can't
   reach the kdc
 - the nfs mount fails again because the server can't be reached
 - the "real" network interface is brought up, triggering a retry of the nfs
   mount
 - gssd is already running, so the mount attempt runs immediately
 - the kernel (and gssd) is now able to reach the server, so the mount
   succeeds

This is the expected behavior; what we need to figure out is where your
system's boot is deviating from this.
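
A few generic checks after a failed boot should show where things stop (these are ordinary diagnostics, not something specific to these packages):

    # did the gssd job and the network interfaces come up?
    sudo initctl list | grep -E 'gssd|network-interface'
    # which nfs4 mounts actually succeeded?
    mount -t nfs4
    # is the machine credential cache present?
    sudo ls -l /tmp/krb5cc_machine_*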

> The problem may also arise in ignoring the _netdev option in the fstab
> file. So mounting network filesystems without network connection is not a
> very good idea at all, because it will never work.

Well, mounting _netdev devices should be deferred until there's at least one
network connection up, yes; but we don't defer them until "all" network
connections are up because we don't necessarily *know* when all network
connections are up. Therefore we instead use the method described above of
trying the mount again after *each* network connection comes up. This is
expected to be resilient in the face of too-early mount attempts. If it
isn't in some cases, we should fix that.

Revision history for this message
stef (update-5) wrote :

Maybe an additional flag "-e" has some influence?
I use this flag to prevent gssd from blocking the entire system when a user's
krb credential has expired, i.e. to restore the old behavior.
This has been needed since openSUSE 11.4 and also for Ubuntu 12.04 (due to
some changes in the kernel, as I remember).

Revision history for this message
Steve Langasek (vorlon) wrote :

On Wed, Apr 17, 2013 at 05:06:18PM -0000, stef wrote:
> May be a additional flag "-e" has some influence?
> I use this flag to prevent gssd from blocking the entire system when an users
> krb-credential has expired to restore the old behavior.

Possible. You could test without that option to see if it affects the boot
behavior?

Please also test the upstart jobs from 13.04 (linked from my previous
comment). While these changes don't cause gssd to wait for rpcbind, they do
fix various other low-probability race conditions, one of which might be the
root cause of your problem.

Revision history for this message
stef (update-5) wrote :

Hi Steve,
I have tested your scripts several times on one problematic machine.
So far the boots have not failed. But in one test with the old setup the eth0
interface came up without an IP address, so this may also be a reason for the
failing boot. In that case the interface was up but had no IP address;
after manually running dhclient, the network was working.
By the way, I currently have this problem on my home machine as well
(without nfs/krb5 etc.).
So another possibility is that a longer delay in dhclient causes the problem
for mounting the NFS filesystems.
But in one earlier test I had a running network and a started gssd, yet no
valid credential for mounting the share. Maybe this is caused by another race
condition. Since that event I have changed the default lifetime for /tmp files
from 0 to 2 days, as one idea was that the /tmp/krb5cc_machine_* credential
was being removed during boot after the krb5 handshake....

So I will leave your scripts installed on this machine and keep an eye on the
network connection.
Is there a possibility to restart dhclient in case of an unsuccessful
inquiry?

Revision history for this message
Steve Langasek (vorlon) wrote :

> So an other possibility is, that longer delay of dhcpclient cause the problem
> for mounting the nfs-fs.

The handling of both ifupdown and network-manager is designed such that the interface should not be considered 'up' until dhclient succeeds. If 'sudo initctl list' is showing the interface as up in this case, then that's a bug; more likely, however, the interface is either still waiting for a dhcp answer or is considered failed, and bringing up the interface will bring up the NFS mounts as well. Either way, that's obviously not a bug in the NFS packages, since if the network has no address NFS can't work.

> Is there a possibility to restart dhcpclient in case of a unsuccessful
> inquiry.

"ifdown eth0; ifup eth0"

Please let us know if you find that the new version of the upstart jobs solves the nfs problem (but not the network problem, obviously). If it does, that strengthens the argument for pushing this fix as a stable release update to 12.04.

Revision history for this message
stef (update-5) wrote :

On Tuesday 23 April 2013, Steve Langasek wrote:
> > So an other possibility is, that longer delay of dhcpclient cause the
> > problem for mounting the nfs-fs.
>
> The handling of both ifupdown and network-manager is designed such that
> the interface should not be considered 'up' until dhclient succeeds. If
> 'sudo initctl list' is showing the interface as up in this case, then
> that's a bug; more likely, however, the interface is either still
> waiting for a dhcp answer or is considered failed, and bringing up the
> interface will bring up the NFS mounts as well. Either way, that's
> obviously not a bug in the NFS packages, since if the network has no
> address NFS can't work.
This problem is not on an NFS machine. At the moment I have solved it with a
Broadcom add-on card, which works more reliably than the built-in Intel card.
Maybe this is an e1000 driver issue.
>
> > Is there a possibility to restart dhcpclient in case of a unsuccessful
> > inquiry.
>
> "ifdown eth0; ifup eth0"
>
> Please let us know if you find that the new version of the upstart jobs
> solves the nfs problem (but not the network problem, obviously). If it
> does, that strengthens the argument for pushing this fix as a stable
> release update to 12.04.

So far your new upstart scripts seem to work OK.
I have not been able to reproduce the boot hang in about 10 starts, though a
warm boot was usually successful even before.
My 2 test users have also not reported any problems this week (2 cold boots
so far). I will report the "long term" results to you at the end of the week.

Revision history for this message
Steve Langasek (vorlon) wrote :

Ok, then I think we can consider this a duplicate of bug #643289. Marking this bug as a duplicate of that one, and nudging that up my priority list.

Revision history for this message
stef (update-5) wrote :

On Saturday 27 April 2013, Steve Langasek wrote:
> Ok, then I think we can consider this a duplicate of bug #643289.
> Marking this bug as a duplicate of that one, and nudging that up my
> priority list.

Looks like it.
We have had no boot problems in the last week with your patch applied!

Thank you very much for the bug fix!
Do you plan to apply the fix to the 12.04 updates?

Revision history for this message
Steve Langasek (vorlon) wrote :

On Mon, Apr 29, 2013 at 09:37:42AM -0000, stef wrote:
> Do you apply the fix to the 12.04 updates?

Yes, that's the plan. I have no ETA yet for when this will happen, but
hopefully in the next month.
