statd job starts before service is listening

Reported by Andrew Edmunds on 2010-07-28
This bug affects 4 people
Affects Status Importance Assigned to Milestone
nfs-utils (Ubuntu)

Bug Description

Binary package hint: nfs-common
nfs-common version: 1:1.2.0-4ubuntu4

On my Lucid system with latest updates, NFS filesystems sometimes fail to mount successfully at boot.

Steps to reproduce:
1. In /etc/fstab, add an entry of the form:
<nfs-server>:<share> <mountpoint> nfs _netdev 0 0
2. Reboot a number of times

Expected results:
The NFS filesystem should be mounted on every boot

Actual results:
Occasionally the NFS mount fails and messages of the following form are logged in /var/log/boot.log:
mount.nfs: rpc.statd is not running but is required for remote locking.
mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
mountall: mount <mountpoint> [<pid>] terminated with status 32

After a boot where the NFS mounts fail, the mountall process is still running on login. Running:
kill -USR1 $(pidof mountall)
will then mount the missing filesystems sucessfully.

It is also possible to observe the problem without rebooting as follows.

Steps to reproduce:
Run the following shell loop, after setting $nfs_server, $share and $mntpt to suitable values

while (( i < 100 ))
    stop statd
    start statd
    mount.nfs $nfs_server:$share $mntpt
    umount $mntpt
    (( i++ ))

Expected results:
The NFS filesystem should be mounted on every iteration of the loop

Actual results:
Occasionally the NFS mount fails and the following messages are output:

mount.nfs: rpc.statd is not running but is required for remote locking.
mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
mount.nfs: an incorrect mount option was specified
umount: <mountpoint> not mounted

The symptoms are very similar to this bug back in Karmic:

Architecture: i386
DistroRelease: Ubuntu 10.04
InstallationMedia: Mythbuntu 10.04 LTS "Lucid Lynx" - Release i386 (20100427.1)
NonfreeKernelModules: nvidia
Package: nfs-common 1:1.2.0-4ubuntu4
PackageArchitecture: i386
ProcVersionSignature: Ubuntu 2.6.32-23.37-generic-pae
Tags: lucid
Uname: Linux 2.6.32-23-generic-pae i686

Andrew Edmunds (andrew-edmunds) wrote :
Andrew Edmunds (andrew-edmunds) wrote :

I did the following to get more verbose logging:

--- etc/default/grub.20100728 2010-07-28 18:15:41.536842847 +1000
+++ etc/default/grub 2010-07-28 21:57:45.144891535 +1000
@@ -6,7 +6,7 @@
 GRUB_DISTRIBUTOR=`lsb_release -i -s 2> /dev/null || echo Debian`

 # Uncomment to disable graphical terminal (grub-pc only)

... and ...

--- etc/init/mountall.conf.20100728 2010-04-26 15:46:36.000000000 +1000
+++ etc/init/mountall.conf 2010-07-28 21:58:14.917391034 +1000
@@ -28,7 +28,7 @@
     [ -f /forcefsck ] && force_fsck="--force-fsck"
     [ "$FSCKFIX" = "yes" ] && fsck_fix="--fsck-fix"

- exec mountall --daemon $force_fsck $fsck_fix
+ exec mountall --daemon --verbose $force_fsck $fsck_fix
 end script

 post-stop script

Andrew Edmunds (andrew-edmunds) wrote :

Theory A:
A "mounting TYPE=nfs" event while statd is stopped will cause statd to be started and mountall will block untill statd is running. However a "mounting TYPE=nfs" event while statd is in (say) pre-start or spawned state is ignored and does not block mountall. Therefore it is not guaranteed that statd is in running state when mountall calls mount.nfs.

Theory B:
statd is not really ready for business when upstart marks it as "runnning". It needs a post-start script that waits while it does initialization tasks such as registering with portmap.

I have not so far dug into the source sufficiently to prove or rule out either of these.

The log that I posted seems to support theory B but I have other examples that seem more consistent with theory A.

Andrew Edmunds (andrew-edmunds) wrote :

I scanned quickly through the source of relevant packages. The following lines caught my eye (upstart-0.6.5/init/job_process.c around line 1394):

        nih_info (_("%s %s process (%d) became new process (%d)"),
                  job_name (job), process_name (process),
                  job->pid[process], (pid_t)data);

        /* We no longer care about this process, it's the child that we're
         * interested in from now on, so detach it and allow it to go about
         * its business unhindered.

The comment here is revealing. statd takes care not to let its parent process exit until the child is ready to service client requests. I think the sysvinit startup scripts depended on this. If upstart completely ignores the parent as soon as it has forked, this would introduce a race as per theory B above.

In the above output there are 5 examples where upstart has marked statd as running but the parent process still exists, which implies that statd is not ready to service client requests.

affects: mountall (Ubuntu) → nfs-utils (Ubuntu)
summary: - mountall races with statd startup
+ Race condition in statd.conf

The above patch adds a post-start script that loops until the statd parent process has exited. This is the best workaround that I can think of. It doesn't seem possible to just wait() for the process with the current implementation of upstart's "expect fork" feature.

Suggestion for a better approach: The start script should be allowed to tell upstart the PID of the process that upstart should supervise. The information could be sent (for example) via a pipe, or the traditional /var/run/<job>.pid file. Then the knowledge of how to correctly start a particular daemon could be entirely in the start script where it belongs, rather than part of it being coded into upstart. This would allow the "expect fork" and "expect daemon" features to go away.

tags: added: patch

apport information

tags: added: apport-collected
description: updated
description: updated

Tested on Maverick alpha 3 and confirmed that this bug is still present. Debdiff for Maverick attached.

I just noticed that the underlying upstart issue mentioned in #8 is already described by https://bugs.launchpad.net/upstart/+bug/530779. I won't mark this as a duplicate though - the workaround described here may be useful until the upstart bug is fixed.

Steve Langasek (vorlon) on 2010-12-23
summary: - Race condition in statd.conf
+ statd job starts before service is listening
Bryce Harrington (bryce) wrote :

Hi Andrew,

I notice there have been a number of changes since 1.2.0 which appear to improve behavior with upstart; would you mind confirming that the issue still exists with nfs-utils 1.2.4 (e.g. snag an oneiric livecd and reproduce there).

Steve Langasek (vorlon) wrote :

If the issue is still present, we should properly address it by fixing the statd startup code, not by adding a post-start script.

tags: removed: patch
Changed in nfs-utils (Ubuntu):
status: New → Confirmed
To post a comment you must log in.