Ubuntu

mountall for /var or other nfs mount races with rpc.statd

Reported by Brian J. Murrell on 2010-02-21
This bug affects 76 people
Affects              Importance  Assigned to
mountall (Ubuntu)    Undecided   Unassigned
  Lucid              Undecided   Unassigned
  Maverick           Undecided   Unassigned
  Natty              Undecided   Unassigned
nfs-utils (Ubuntu)   High        Unassigned
  Lucid              High        Unassigned
  Maverick           High        Unassigned
  Natty              High        Unassigned
portmap (Ubuntu)     Undecided   Unassigned
  Lucid              High        Unassigned
  Maverick           High        Unassigned
  Natty              Undecided   Unassigned

Bug Description

If one has /var (or /var/lib or /var/lib/nfs for that matter) on its own filesystem, the start of the statd job (statd.conf) races with the mounting of /var, because rpc.statd needs /var/lib/nfs to be available in order to work.

I am sure this is not the only occurrence of this type of problem.

A knee-jerk solution is to simply spin in statd.conf waiting for /var/lib/nfs to be available, but polling sucks, especially for something like upstart, whose whole purpose is to be an event-driven action manager.

SRU justification: NFS mounts do not start reliably on boot in lucid and maverick (depending on the filesystem layout of the client system) due to race conditions in the startup of statd. This should be fixed so users of the latest LTS can make reliable use of NFS.

Regression potential: Some systems may fail to mount NFS filesystems at boot time that didn't fail before. Some systems may hang at boot. Some systems may hang while upgrading the packages (this version or in a future SRU). I believe the natty update adequately guards against all of these possibilities, but the risk is there.

TEST CASE:
1. Configure a system with /var as a separate partition.
2. Add one or more mounts of type 'nfs' to /etc/fstab.
3. Boot the system.
4. Verify whether statd has started (status statd) and whether all NFS filesystems have been mounted.
5. Repeat 3-4 until the race condition is triggered.
6. Upgrade to the new version of portmap and nfs-common from -proposed.
7. Repeat steps 3-4 until satisfied that statd now starts reliably and all non-gss-authenticated NFSv3 filesystems mount correctly at boot time.

The whole /var thing is not really a very well thought out part of the FHS

affects: upstart (Ubuntu) → nfs-utils (Ubuntu)

On Wed, 2010-02-24 at 15:06 +0000, Scott James Remnant wrote:
> The whole /var thing is not really a very well thought out part of the
> FHS

OK. What does that mean in terms of this bug though?

statd is "start on (started portmap or mount TYPE=nfs)"
portmap is "start on (local-filesystems and net-device-up IFACE=lo)"

and the statd job tries to start portmap if it's not already running.

So the only possible race conditions I see here are if
 - mount TYPE=nfs is emitted before all the local filesystems are mounted
 - mount TYPE=nfs is emitted before lo is configured, and this causes portmap to fail
 - /var is on a network filesystem *other* than NFS (if it's on NFS, then this can't really be solved, you just get a deadlock if you try)

Can you post your fstab, so I can better understand which of these cases applies? (I'm not sure it's fixable anyway with current upstart, but at least we'll know what we're dealing with)

Changed in nfs-utils (Ubuntu):
status: New → Incomplete

On Thu, 2010-02-25 at 13:35 +0000, Steve Langasek wrote:
> statd is "start on (started portmap or mount TYPE=nfs)"
> portmap is "start on (local-filesystems and net-device-up IFACE=lo)"

I'm not terribly conversant on the state language of upstart yet, but
does the above say that statd will be started after portmap has been
started *or* an NFS mount is required and portmap will be started after
local-filesystems has been completed and interface "lo" is up?

> and the statd job tries to start portmap if it's not already running.

Yeah.

> So the only possible race conditions I see here are if
> - mount TYPE=nfs is emitted before all the local filesystems are mounted

Indeed! And I believe this is in fact the race I am running into.

> - mount TYPE=nfs is emitted before lo is configured, and this causes portmap to fail

Nope. I have debugged enough to know this is not the case.

> - /var is on a network filesystem *other* than NFS (if it's on NFS, then this can't really be solved, you just get a deadlock if you try)

Nope. /var is local.

> Can you post your fstab, so I can better understand which of these cases
> applies?

Sure:

# /etc/fstab: static file system information.
#
# file system mount point type options dump pass
/dev/rootvol/ubuntu_root / ext3 defaults 0 0
UUID=9d79e085-9980-444d-b58b-e0a49b5c2edb /boot ext3 rw,nosuid,nodev 0 2

/dev/rootvol/swap none swap sw 0 0
proc /proc proc defaults 0 0
sys /sys sysfs defaults 0 0

/dev/fd0 /mnt/floppy auto noauto,rw,sync,user,exec 0 0
/dev/cdrom /mnt/cdrom auto noauto,ro,user,exec 0 0
/dev/rootvol/ubuntu_var /var ext3 rw,nosuid,nodev 0 2
/dev/rootvol/apt_archives /var/cache/apt/archives ext3 rw,nosuid,nodev 0 2
/dev/rootvol/ubuntu_usr /usr ext3 rw,nodev 0 2
/dev/rootvol/home /home ext3 rw,nosuid,nodev 0 2
/dev/datavol/video /video xfs rw,nosuid,nodev 0 2
pc:/home/brian /autohome/brian nfs auto,exec,dev,suid,rw,bg,rsize=8192,wsize=8192 1 1
linux:/mnt/mp3/library /var/lib/mythtv/music nfs rw,noexec,nodev,nosuid,bg,rsize=8192,wsize=8192 1 1
linux:/usr/src /usr/src nfs rw,nodev,nosuid,bg,rsize=8192,wsize=8192 1 1

I think you will agree that it's the first race condition.
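
For anyone wanting to check whether their own machine has the same layout (NFS entries in fstab plus a separately mounted filesystem at or above /var/lib/nfs), something like the following might help. It is only a sketch, assuming POSIX sh and awk; the function names are made up for illustration and are not part of mountall or nfs-utils.

```shell
#!/bin/sh
# Hypothetical helpers (not part of mountall or nfs-utils).

# Print the mount points of all NFS entries in the given fstab file.
nfs_mountpoints() {
    awk '$1 !~ /^#/ && ($3 == "nfs" || $3 == "nfs4") { print $2 }' "$1"
}

# Print the non-root fstab mount point that holds /var/lib/nfs, if any.
# No output means /var/lib/nfs lives on the root filesystem.
statd_dir_mountpoint() {
    awk '$1 !~ /^#/ && NF >= 3 {
             mp = $2
             # Keep the longest mount point that is a path prefix of /var/lib/nfs.
             if (mp != "/" && index("/var/lib/nfs/", mp "/") == 1 &&
                 length(mp) > length(best))
                 best = mp
         }
         END { if (best != "") print best }' "$1"
}
```

Running `statd_dir_mountpoint /etc/fstab` against the fstab above should print /var, and `nfs_mountpoints /etc/fstab` should list the three NFS mount points. It says nothing about ordering at boot, only that the layout is one where the race is possible.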

I'm not sure exactly what "local-filesystems" signal is signalling but
assuming it really does mean "local" (i.e. directly attached block
devices) is there any reason the boolean operator in the condition for
starting statd is not "and" rather than "or"? That would ensure
that /var is mounted and portmapper is running before a statd start is
attempted. Doesn't statd require portmapper anyway?

> I'm not sure exactly what "local-filesystems" signal is signalling but
> assuming it really does mean "local" (i.e. directly attached block
> devices) is there any reason the boolean operator in the condition for
> starting statd is not "and" rather than "or"?

Because due to a bug in upstart, this would cause every NFS mount after the first one to block indefinitely, waiting for another 'started portmap' event that will never come.

Anyway, as you're aware, portmap no longer waits for local-filesystems; so that's no longer a guarantee.

Changed in nfs-utils (Ubuntu):
status: Incomplete → Triaged

Per bug 555661 is there a change in upstart .conf file dependencies that you would like me to test/try?

You said in bug 555661 comment 12:
> > So for lucid, I'm still inclined to update the statd job to 'start on
> > local-filesystems'. Possibly 'start on (local-filesystems and mounting
> > TYPE=nfs)' - if that doesn't cause NFS mount attempts after the first one to
> > deadlock in mountall/upstart. I'll have to test this and propose it as an
> > SRU if it checks out.
>
> Ah, in fact that causes a deadlock in mountall/upstart even before NFS
> mounts are attempted. So 'start on local-filesystems' is as close as we can
> probably get for lucid.

Can you clarify exactly what changes you envision for Lucid, to clear this mess up?

ody (ody-cat) wrote :

This is rather an unacceptable race condition. This will once again cause me an enormous amount of pain in upgrading my nearly 200 Ubuntu servers and desktops that all mount a substantial amount of stuff over NFS, including user home directories.

ody (ody-cat) wrote :

Changing line 6 to the following fixes the problem.

--- statd.conf 2010-04-29 14:22:27.567158573 -0700
+++ /etc/init/statd.conf 2010-04-29 14:18:56.057316910 -0700
@@ -3,7 +3,7 @@
 description "NSM status monitor"
 author "Steve Langasek <email address hidden>"

-start on (started portmap or mounting TYPE=nfs)
+start on ((started portmap and mounted MOUNTPOINT=/var) or mounting TYPE=nfs)
 stop on stopping portmap

 expect fork

On Thu, Apr 29, 2010 at 09:25:53PM -0000, ody wrote:
> Changing line 6 to the following fixes the problem.

> --- statd.conf 2010-04-29 14:22:27.567158573 -0700
> +++ /etc/init/statd.conf 2010-04-29 14:18:56.057316910 -0700
> @@ -3,7 +3,7 @@
> description "NSM status monitor"
> author "Steve Langasek <email address hidden>"
>
> -start on (started portmap or mounting TYPE=nfs)
> +start on ((started portmap and mounted MOUNTPOINT=/var) or mounting TYPE=nfs)
> stop on stopping portmap

For users with a separate /var partition, yes. For users without, it causes
statd to consistently fail to start at boot.

--
Steve Langasek Give me a lever long enough and a Free OS
Debian Developer to set it on, and I can move the world.
Ubuntu Developer http://www.debian.org/
<email address hidden> <email address hidden>

ody (ody-cat) wrote :

On 04/30/2010 05:21 AM, Steve Langasek wrote:
> For users with a separate /var partition, yes. For users without, it causes
> statd to consistently fail to start at boot.
>

Oh yeah. That patch is a total kludge/hack we put in place so we could
quickly deploy Lucid. Will look forward to the actual fix or might hack
on a better statd.conf change that doesn't break the rest of the world.

I am the sysadmin for a company that uses Ubuntu for desktops, and uses nfs heavily. I upgraded to Lucid from Karmic, and everything has been fine. A co-worker upgraded, and ran into this bug. He tried the mounted MOUNTPOINT=/var workaround. It actually seemed to make the problem worse. The first time he booted it just hung the boot process. With a reboot, it came up without the hang.

What upstart needs is more of a "can I write to this directory" option.

On May 4, 2010, at 4:14 PM, Nathan Grennan wrote:

> I am the sysadmin for a company that uses Ubuntu for desktops, and uses
> nfs heavily. I upgraded to Lucid from Karmic, and everything has been
> fine. A co-worker upgraded, and ran into this bug. He tried the mounted
> MOUNTPOINT=/var workaround. It actually seemed to make the problem
> worse. The first time he booted it just hung the boot process. With a
> reboot, it came up without the hang.
>
> What upstart needs is more of a "can I write to this directory" option.
>

I saw a similar freeze when I was hacking about and tried `mounted MOUNTPOINT=/var/run`, which is used by /etc/init/mounted-varrun.conf. This new tight integration with upstart is going to take some getting used to.

Anyone willing to try out this workaround? `/etc/init/mountall` emits local-filesystems, so if you change line 6 of statd.conf to the following, things look to come up normally. This is probably a better, saner solution than what I posted earlier.

start on ((started portmap and local-filesystems) or mounting TYPE=nfs)

John Peach (john-launchpad) wrote :

Yes, that change works for me....

start on ((started portmap and local-filesystems) or mounting TYPE=nfs)

Steve Langasek (vorlon) wrote :

This proposed workaround will cause a hang whenever portmap is restarted on a package upgrade.

Arjen Verweij (arjen-verweij) wrote :

#13 doesn't work for me. The only viable workaround I know of is listing the NFS mounts as noauto and adding them to /etc/rc.local individually
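
The noauto workaround could probably be scripted rather than listing each mount by hand in rc.local. A rough sketch, assuming the NFS entries in /etc/fstab carry the noauto option; the function names are hypothetical and this has not been tested against mountall itself:

```shell
#!/bin/sh
# Hypothetical rc.local helpers.

# Print mount points of fstab entries of type nfs that carry the noauto option.
noauto_nfs_mountpoints() {
    awk '$1 !~ /^#/ && $3 == "nfs" && ("," $4 ",") ~ /,noauto,/ { print $2 }' "$1"
}

# Mount each of them in turn; report failures but keep going.
mount_noauto_nfs() {
    noauto_nfs_mountpoints "${1:-/etc/fstab}" | while read -r mp; do
        mount "$mp" || echo "failed to mount $mp" >&2
    done
}
```

Calling `mount_noauto_nfs` from /etc/rc.local (before the "exit 0") would then pick up any NFS entry marked noauto without further edits to rc.local.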

I think the suggested fix in comment #8 is absolutely evil. The intentions are well-placed but the results of using that work-around are evil and I believe lead to the sort of issues reported in bug #543506. I still have to go to a few more machines and "undo" that change to be completely sure. I will update when I have done more testing.

I should add, that even with the suggested patch from comment #8 in place, it didn't stop rpc.statd from being started before /var was mounted, so it was not even helping in that manner.

As an alternative hack/solution to this race (until it's resolved more elegantly within upstart), could we not simply spin in the statd.conf script waiting for /var/lib/nfs to be available?

This would be most interesting to do because, in fact, I believe that /var/lib/nfs not being available when statd.conf runs is not the only issue that is causing rpc.statd startup to fail. I see failures reported by upstart, during boot, even after /var is mounted.

Let me put such a spin lock in place and see how that goes.

OK. The spin loop/lock seems to work much better than the start triggers.
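
For reference, a wait loop of this kind might look roughly like the following. This is a sketch of the idea only, not the exact loop used above; the function name and the 30-try default are assumptions, and the timeout is there so that a missing directory cannot hang the boot forever.

```shell
#!/bin/sh
# Hypothetical helper: wait until a directory exists and is writable,
# giving up after a number of one-second attempts.
wait_for_dir() {
    dir=$1
    tries=${2:-30}            # assumed default, adjust to taste
    while [ "$tries" -gt 0 ]; do
        if [ -d "$dir" ] && [ -w "$dir" ]; then
            return 0
        fi
        sleep 1
        tries=$((tries - 1))
    done
    return 1
}
```

In a job's pre-start script this could be used as `wait_for_dir /var/lib/nfs 30 || exit 1`. One caveat: if the root filesystem carries its own /var/lib/nfs underneath the /var mount point, the check can pass before /var is actually mounted.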

I still find however that mountall is trying to mount nfs filesystems before rpc.statd is started. Do we consider this an nfs-utils bug or an upstart/mountall bug?

Joshua Baergen (joshuabaergen) wrote :

> I still find however that mountall is trying to mount nfs
> filesystems before rpc.statd is started. Do we consider
> this an nfs-utils bug or an upstart/mountall bug?

But statd's job is triggered on such NFS mounts, so I think that's OK as long as mountall doesn't do it (or give up) too early, is it not?

On Thu, 2010-05-13 at 19:03 +0000, Joshua Baergen wrote:
>
> But statd's job is triggered on such NFS mounts,

If by triggered you mean that such a mount stops and waits for statd to
be started, nope.

> so I think that's OK as
> long as mountall doesn't do it (or give up) too early, is it not?

But it does do it. Eventually the mount does succeed, but it's very ugly
to race like this, hoping that a retry will succeed, and then there's the
spate of ugly "failure" messages during boot.

All in all, the race needs to be resolved, IMHO.

So, I now have a case where this race prevents a successful boot completely, 100% of the time. The only way I can get a normal boot is to manually mount an NFS filesystem in another window while mountall/upstart has stalled.

To clarify, the only way I can get this machine to boot is to boot the kernel with init=/bin/bash.

Once I have the init/bash I open a VT with "open -c 12 /bin/bash". I then exec init with "exec /sbin/init" and normal boot continues until upstart/mountall gets stuck waiting for NFS filesystems to mount, which they never do. This will wait here forever if I don't intervene.

To get the boot to continue, I switch to the bash I started on vt 12 and just mount one of the several nfs filesystems mountall is waiting for with:

# mount /autohome/brian

for example. At this point upstart resumes starting services and regular boot completes.

Happy to provide any information needed to progress this issue to resolution.

Steve Langasek (vorlon) wrote :

In this reproducible case, is this the problem that the network has been brought up before /var is mounted? I.e., if you run 'killall -USR1 mountall' /instead of/ running a mount command by hand, does the boot continue?

On Mon, 2010-05-17 at 23:09 +0000, Steve Langasek wrote:
> In this reproducible case, is this the problem that the network has been
> brought up before /var is mounted?

I have not been able to test on that platform yet, however...

> I.e., if you run 'killall -USR1
> mountall' /instead of/ running a mount command by hand, does the boot
> continue?

On another platform that I was debugging last night, this indeed did
seem to work.

Hopefully I can get to my 100% reproducible case today and will let you
know.

On Mon, 2010-05-17 at 23:09 +0000, Steve Langasek wrote:
> In this reproducible case, is this the problem that the network has been
> brought up before /var is mounted? I.e., if you run 'killall -USR1
> mountall' /instead of/ running a mount command by hand, does the boot
> continue?

Yes sir, it does! Nice catch.

Now, what's the fix? :-)

FWIW, this new 100% reproducible case is 100% reproducible for a reason, which I will get to in a minute.

So, in two cases where I have had a stalled boot, signalling mountall with USR1 has caused the boot to proceed. This seems like a race somewhere.

The reason I was able to reproduce this 100% of the time on one particular platform is that it's a PXE (i.e. netboot/netroot) system, booting from the network and mounting its entire root (which includes /usr and /var) from an NFS server. In this scenario I found that allowing ifup to configure an interface that has already been configured by the kernel during the netboot was causing the system to hang.

As a short-term solution until I could research the real, long-term solution, I decided that given that the interface was already up even before init was even started, I would just disable it in /etc/network/interfaces. That of course also prevents the "initctl emit -n net-device-up ..." in /etc/network/if-up.d/upstart from firing.

But it would only prevent it from firing for the ethernet interface. Shouldn't it still fire for lo being ifup'd, giving mountall the USR1 it needs?

Steve Langasek (vorlon) wrote :

Oh, for NFS root, there's bug #537133. It seems there are known problems still with mountall in that configuration.

On Wed, 2010-05-19 at 15:13 +0000, Steve Langasek wrote:
> Oh, for NFS root, there's bug #537133. It seems there are known
> problems still with mountall in that configuration.

Funny enough, once I stopped dhclient from resetting an interface's
address to 0.0.0.0 (and filing a bug upstream about it) and then
re-enabled the interface in /etc/network/interfaces, my netboot/nfsroot
system is working just peachy. Well apart from the ugly messages about
failing to mount (other, non /) NFS filesystems first time through.

tags: added: patch

Same problem here, /var on separate file system, patch in #13 seems to work.

Bjarne Steinsbø (bsteinsbo) wrote :

Thank you all for providing descriptions and solutions. My systems were upgraded from 9.10 where everything was working OK, and I was about to give up on Ubuntu when the systems didn't boot after upgrade.

computer 1: /var on separate file system, mounting /home from nfs, hangs consistently on boot, "fixed" by removing fstab entry and mounting /home in /etc/rc.local

computer 2: hmm, this is a special one... On one hand it is a perfectly normal dual-boot laptop, running Windows 7 and Ubuntu. On the other hand, that same Ubuntu partition is the disk of a virtual machine, so that I can run the same Ubuntu installation from within VirtualBox. Working OK when booting from BIOS, but was failing to boot from within VirtualBox until I moved the mount of a vboxsf share to rc.local.

Chelmite (steve-kelem) on 2010-06-14
description: updated
Gerb (g.roest) wrote :

having /var on a separate filesystem, patch in #13 works for me too.

Carl Nobile (cnobile1) wrote :

This seems to be the cause of both bugs 573919 and 590570.

/var on separate filesystem, remote dir mounted using autofs. #13 works for me.

So, what will be done about this (and the host of other mounting bugs) in Lucid? Lucid is an LTS release and as such should be stable for 2 years. It is not. This bug (and the host of other mounting bugs) make it not so. Yet no updates have come forth to solve the problem nor has there been any activity from Ubuntu developers.

Have Ubuntu abandoned this (and the host of other mounting bugs) as simply too difficult to deal with? If not, please, let us know what the path forward, to a stable LTS (with separate /var, and/or NFS rooting, etc.) release is.

Steve Langasek (vorlon) wrote :

As I wrote in comment #15, we don't have a viable workaround yet that doesn't introduce other hangs / failures in other scenarios. A "fix" that will break all ability to further upgrade the system is worse than the status quo, because it means security fixes can't be applied.

Until someone is able to identify a solution that doesn't have this disadvantage, there's nothing that can be done here.

> Lucid is an LTS release and as such should be stable for 2 years.

"stable" does not mean "usable for all proposed uses".

On Tue, 2010-07-27 at 15:28 +0000, Steve Langasek wrote:
> As I wrote in comment #15, we don't have a viable workaround yet that
> doesn't introduce other hangs / failures in other scenarios.

I guess the question then is whether any effort is being spent towards
such a solution.

> A "fix"
> that will break all ability to further upgrade the system is worse than
> the status quo, because it means security fixes can't be applied.

Fair enough. But leaving systems unbootable for a portion of the users
surely cannot be an acceptable solution either, yes?

> Until someone is able to identify a solution that doesn't have this
> disadvantage, there's nothing that can be done here.

Who would this "someone" be? It's sounding an awful lot like nobody is
actually working on the real root cause (and solution) of this issue and
everyone is just hoping that "somebody else" will come up with a
solution.

> "stable" does not mean "usable for all proposed uses".

So having a /var on a separate filesystem is "fringe" enough that those
users should not be able to experience a stable system? /var on its
own filesystem is the only "responsible" way to manage a system. You
can glom any other crap you want onto / but leaving /var on / to grow
until it fills up / is simply irresponsible for any use-case except
single-user machines. Is this "single-user" use case all that Ubuntu is
interested in satisfying? Multi-user/server (i.e. corporate users)
installations are not a useful user-base for Canonical?

This bug is affecting me too. It would be great to have a definitive solution.

MarkG (movieman523) wrote :

Same here: this is making my MythTV frontend unusable as about half the time it can't see the MythTV NFS directories after booting.

This used to work OK in 9.10; I upgraded it to Lucid in the hope that the newer kernel might fix USB remote bugs and instead I got an unusable network and lost most of the sound from xbmc.

I'm not an expert but I tested this:

Create a script "startstatd" and save it in "/etc/init.d".
The script:

#! /bin/sh
service statd start
mount /192.168.0.31: ... #add directives for mounting nfs shares

Then create a symlink to /etc/rcS.d, for example @S91startstatd.

It seems to be working well.

There is a mistake in my script, sorry. The slash before 192.168.0.31 should be omitted:

mount 192.168.0.31: ... #add directives for mounting nfs shares

Furthermore:

chmod 755 /etc/init.d/startstatd

Couldn't we imagine using the output of " df -P /var/lib/nfs/ | tail -n 1 | awk '{print $6}' " to adapt the "/etc/init/statd.conf" script auto-magically during the 'nfs-common' package installation (iow. having the package's "postinst" script add the 'mounted MOUNTPOINT=...' stanza when appropriate) ?
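
To make the idea concrete, the detection step could look something like this. It is only a sketch under assumptions: the function names are invented, the two stanzas are the ones already quoted in this report, and whether the generated stanza is actually safe (see the objections raised earlier about hangs and failures on other layouts) is a separate question. It also only reflects what is mounted at install time.

```shell
#!/bin/sh
# Hypothetical postinst-style helpers.

# Print the mount point of the filesystem that contains the given path.
# df -P prints one header line and then one line per filesystem;
# mount points containing spaces are not handled here.
mountpoint_of() {
    df -P "$1" | awk 'NR == 2 { print $6 }'
}

# Print a 'start on' stanza suited to where the statd state directory lives.
statd_start_stanza() {
    mp=$1
    if [ "$mp" = "/" ]; then
        echo "start on (started portmap or mounting TYPE=nfs)"
    else
        echo "start on ((started portmap and mounted MOUNTPOINT=$mp) or mounting TYPE=nfs)"
    fi
}
```

A postinst could then call `statd_start_stanza "$(mountpoint_of /var/lib/nfs)"` and substitute the result into /etc/init/statd.conf.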

See proposed attached patch.

Sorry, previous patch contained a typo. New one attached (and should also be more resilient to package updates).
I rebuilt the "nfs-common_1.2.0-4ubuntu4_amd64.deb" package and installed it with the expected auto-magic modification.

John Peach (john-launchpad) wrote :

I have to agree with Brian in #37. It's looking more and more like Canonical are not interested in the user-base. First they remove Sun Java and expect everyone to report the bugs in openjdk and now this fiasco. In all the years I have been administering UNIX/Linux systems, it has been regarded as best-practice to have /var (and, indeed, /usr) on its own filesystem. I would suggest that upstart is really not production quality and, as such, should not be used for stable releases.

I fully agree with Brian in #37 and John in #44. Ubuntu boasts on http://www.ubuntu.com/server: "Ubuntu Server delivers services reliably, predictably and economically – and it easily integrates with your existing infrastructure." This is not true with regard to this bug.

I also fully agree with Brian in #37, John in #44 and Matthias in #45.

We also have this problem; we also use a separate /var partition, as usual.

> Until someone is able to identify a solution that doesn't have this
> disadvantage, there's nothing that can be done here.
>> Lucid is an LTS release and as such should be stable for 2 years.
> "stable" does not mean "usable for all proposed uses".

I feel a little bit mocked. Until now I was very satisfied by Ubuntu, but this problem already has a special dimension!
Why did Canonical introduce Upstart when it does not work correctly in normal use?
When this problem is not solvable, Canonical should go back to the former init solution.
I gladly spend a few seconds more for (correct) booting.

And I would add to #44, #45 and #46 that Canonical cannot honestly ignore the fact that LTS releases are certainly preferred by enterprise/production environments, for obvious reasons (well, my one-penny, maybe misled, guess). Jumping to so much untested/unreliable stuff in an LTS release just makes me lose the trust I had built for Ubuntu until now.

A lame workaround is to simply add this to /etc/rc.local (before the "exit 0"):
/sbin/initctl start statd

David Mathog (mathog) wrote :

On my system /var is not in a separate partition, but I still have the infamous statd error messages on nfs mounts.

Ubuntu 10.04.1 LTS
mountall 2.15
upstart 0.6.5-6

I tried a bunch of things so far and none have worked reliably to cure it:

1. put a "sleep 2" in /etc/init/mountall-net.conf to allow time for rpc.statd to start.
2. put in two different tests in /etc/init/statd.conf to make sure statd was working
before it left that script - neither worked. One was by Andrew Edmunds from bug 610863. Then
this one, which verifies the expected ports are open:

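post-start script
       # Poll until rpcinfo lists both "status" registrations (udp and tcp).
       # Upstart runs this with /bin/sh -e, so use POSIX "=" rather than "=="
       # and keep a zero-match grep (exit status 1) from aborting the script.
       while true
       do
           pcount=$(rpcinfo -p 2>/dev/null | grep -c ' status$' || true)
           if [ "$pcount" = 2 ]
           then
             break
           else
             sleep 0.1
           fi
       done
end script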

3. added --debug to the kernel boot line in grub. With messages spewing every which way I sometimes see the
statd warning messages but in 7 tries it never came up without the NFS volumes mounted. Of course it took much longer
to boot this way, and it looks to a naive user like the system has failed, so it is not acceptable for a production system.

4. added to /etc/rc.local

service statd start
killall -SIGUSR1 mountall

Amazingly I have since rebooted a couple of times with (just) the /etc/rc.local change in place and while it helped, it sometimes came up not only with the infamous statd error messages from the failed nfs mounts, but with mountall still running! That looks to me like upstart (which perhaps should have been named "upchuck") blew up and NEVER EVEN RAN rc.local. Either that or mountall ignored the USR1 signal. Both are really appalling possibilities in software which is supposed to be used in production environments. Heck, for all I know both failures may be present.

Assuming at least one of the fixes from (2) above worked as intended then mountall is starting before the statd.conf script should have allowed it to. At least, that's the case if the post-start script must complete as a precondition of a successful statd start. If that isn't true, then what stanza should be used instead?

David Mathog (mathog) wrote :

I think it is finally working, but what a mess.

Observations:

1. An explicit nfs mount like:

  mount /mnt/server/directory

will not work from within rc.local while mountall is still running as a daemon.

2. An explicit

  killall -SIGUSR1 mountall

from within rc.local will be ignored by mountall.

3. An explicit

  killall -9 mountall

will kill mountall, but then init goes nuts and throws up one of these

 General error mounting filesystems
 A maintenance shell will now be started
 etc.

It does this even after one has already logged in on the console!

4. An explicit

  killall -SIGUSR1 mountall

from a console session will allow the NFS mount to proceed, but of course
that can only be done by root, and so it isn't a general solution for the end users.

5. An explicit

  killall -SIGUSR1 mountall

from an at job will also allow NFS mounts to complete.

6. I have no idea why mountall is hanging around and not starting the NFS connection.
This is not a race condition with statd, as it can be tested and shown to be running while mountall
is still twiddling its metaphorical fingers.

So my "final" solution, until somebody fixes mountall, is
to add to /etc/rc.local

set +e # else any failed grep causes an exit
marunning=`ps -ef | grep mountall | grep -v grep`
logger "MATHOG marunning [ $marunning ]"
set -e
if [ -n "$marunning" ]
then
  logger "MATHOG use at to kill mountall"
  at -f /etc/saf2.sh now
else
  logger "MATHOG ma not running, no kill needed"
fi

Create the at file "saf2.sh" like:

cat >/etc/saf2.sh <<'EOD' # (name isn't important; EOD is quoted so $variables and backticks are written literally)
#!/bin/bash
set +e
#allow time for rc.local to exit, this should probably be shorter
sleep 4
marunning=`ps -ef | grep mountall | grep -v grep`
logger "MATHOG2 marunning? [ $marunning ]"
logger "MATHOG2 trying sigusr1 on mountall from at job"
killall -SIGUSR1 mountall
kstat=$?
marunning=`ps -ef | grep mountall | grep -v grep`
logger "MATHOG2 kstat $kstat marunning STILL? [ $marunning ]"
set -e
EOD

When this boots the nfs/statd messages still appear on the console, but it does mount
the NFS volumes within a few seconds, and these messages show up in /var/log/messages:
Aug 25 14:48:07 saf04 logger: MATHOG marunning [ root 400 1 0 14:48 ? 00:00:00 mountall --daemon ]
Aug 25 14:48:07 saf04 logger: MATHOG use at to kill mountall
Aug 25 14:48:10 saf04 logger: MATHOG2 marunning? [ root 400 1 0 14:48 ? 00:00:00 mountall --daemon ]
Aug 25 14:48:10 saf04 logger: MATHOG2 trying sigusr1 on mountall from at job
Aug 25 14:48:10 saf04 logger: MATHOG2 kstat 0 marunning STILL? [ root 400 1 0 14:48 ? 00:00:00 mountall --daemon ]
Aug 25 14:48:10 saf04 kernel: [ 20.345629] RPC: Registered udp transport module.
Aug 25 14:48:10 saf04 kernel: [ 20.345632] RPC: Registered tcp transport module.
Aug 25 14:48:10 saf04 kernel: [ 20.345633] RPC: Registered tcp NFSv4.1 backchannel transport module.

That is, when it glitches mountall is still running in rc.local, and when that is detected the at job is sent
off and that finally triggers the NFS mount. Obviously you can dispense with the MATHOG tags, that is
just in there for my debugging.
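
As an aside, the "is mountall still running" test can be done without the ps/grep pipeline and without toggling set +e / set -e, because a command used as an `if` condition does not trip `sh -e`. A small sketch, assuming pgrep and pkill (procps) are installed; the function name is hypothetical, and this does nothing about mountall ignoring the signal when it is sent from rc.local:

```shell
#!/bin/sh
# Hypothetical helper: send a signal to a named process only if it is running.
# Returns pkill's status if the process exists, 1 if it does not.
signal_if_running() {
    name=$1
    sig=${2:-USR1}
    if pgrep -x "$name" >/dev/null 2>&1; then
        pkill "-$sig" -x "$name"
    else
        return 1
    fi
}
```

For example, `signal_if_running mountall USR1 || logger "mountall not running, no signal needed"`.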

Oh yes, there was also...


David Mathog (mathog) wrote :

After 5 more reboots the at method failed. More exploration showed that

  status statd

could show "start/running" and

  rpcinfo -p | grep status

could show both the tcp and udp ports, and a couple of wait seconds could be put in there,
and it could be run from an at command and even then

  killall -SIGUSR1 mountall

would sometimes fail to respond to the signal.
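
For what it's worth, the "both ports registered" check can be written as a small filter over the rpcinfo output. This is a sketch with a made-up function name; it assumes the usual `rpcinfo -p` column layout (program, version, protocol, port, service). As described above, passing this check did not guarantee that mountall would react to the signal.

```shell
#!/bin/sh
# Hypothetical helper: read `rpcinfo -p` output on stdin and succeed only
# if a "status" (statd) service is registered for both udp and tcp.
has_statd_udp_and_tcp() {
    awk '$NF == "status" { seen[$3] = 1 }
         END { exit !(("udp" in seen) && ("tcp" in seen)) }'
}
```

Usage would be along the lines of `rpcinfo -p | has_statd_udp_and_tcp && echo "statd registered"`.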

So I figured that perhaps statd was in a funky state too. Reset all the /etc/init files
to their original values and put in rc.local and saf2.sh (the next two attachments). Have now
booted 10 consecutive times with NFS mounts connected once I login. Of these 8 still had
the statd error messages, but at least this hack finally made them mount in the end. Here is what
shows up in the message.log when this happens:

Aug 26 09:54:30 saf04 logger: NFSGLITCH marunning [ root 391 1 0 09:54 ? 00:00:00 mountall --daemon ]
Aug 26 09:54:30 saf04 logger: NFSGLITCH use at to kill mountall
Aug 26 09:54:31 saf04 logger: NFSGLITCH2 statd: CLAIMS to be in start/running, restart it anyway
Aug 26 09:54:32 saf04 logger: NFSGLITCH2 marunning? [ root 391 1 1 09:54 ? 00:00:00 mountall --daemon ]
Aug 26 09:54:32 saf04 logger: NFSGLITCH2 trying sigusr1 on mountall from at job
Aug 26 09:54:32 saf04 kernel: [ 20.298991] RPC: Registered udp transport module.
Aug 26 09:54:32 saf04 kernel: [ 20.298994] RPC: Registered tcp transport module.
Aug 26 09:54:32 saf04 kernel: [ 20.298995] RPC: Registered tcp NFSv4.1 backchannel transport module.
Aug 26 09:54:33 saf04 logger: NFSGLITCH2 kstat 0 marunning STILL? [ ]

David Mathog (mathog) wrote :

see previous comment

David Mathog (mathog) wrote :

see previous comment

Making the changes suggested in comment #8 by ody has completely solved the problem for me.

My clients have /var mounted on a separate local partition and users home mounted over the network via NFS.

David Mathog (mathog) wrote :

The changes in #8 did not work for me. /var is not mounted, it is just part of /, so the upstart's "mounted" test cannot be applied.

The strangest thing about this whole mess is that, at least in my hands, rpc.statd can be in a state where to all outward appearances it is running normally ("status statd" shows "start/running" and "rpcinfo -p" shows both ports), yet until it is restarted (service statd stop; service statd start) mountall will never respond to a SIGUSR1 sent from within an init script, and sometimes not to one sent from an "at" job. Yet mountall will always respond to that signal when root sends it from a terminal! Moreover, in this strange state, if root enters "mount /mnt/safserver/u1" in a terminal, all NFS mounts are made, not just that one.

Rarely I also see in /var/log/boot.log

  mount.nfs: DNS resolution failed for safserver: Name or service not known

This is just wrong because nsswitch.conf has "files" first for hosts, and safserver is in /etc/hosts. Probably yet another race condition. This one is definitely not persistent: even without my fix, name resolution always works by the time I log in after a failed NFS mount.

Just discovered that the order of the lines in /etc/fstab also seems to make a difference:

This one fails the most (every one of 4 boots):

proc /proc proc nodev,noexec,nosuid 0 0
LABEL=root / ext3 errors=remount-ro 0 1
LABEL=boot /boot ext3 defaults 0 2
LABEL=swap none swap sw 0 0
/dev/fd0 /media/floppy0 auto rw,user,noauto,exec,utf8 0 0
safserver:/u4/pdb /mnt/safserver/pdb nfs ro,bg,hard,intr 0 0
safserver:/u1 /mnt/safserver/u1 nfs rw,bg,hard,intr 0 0
/dev/sda1 /mnt/windows/C ntfs-3g ro 0 0
/dev/sda6 /mnt/windows/D ntfs-3g defaults 0 0

This one fails the least (none of 4 boots):

proc /proc proc nodev,noexec,nosuid 0 0
LABEL=root / ext3 errors=remount-ro 0 1
LABEL=boot /boot ext3 defaults 0 2
LABEL=swap none swap sw 0 0
/dev/fd0 /media/floppy0 auto rw,user,noauto,exec,utf8 0 0
/dev/sda1 /mnt/windows/C ntfs-3g ro 0 0
/dev/sda6 /mnt/windows/D ntfs-3g defaults 0 0
safserver:/u4/pdb /mnt/safserver/pdb nfs ro,bg,hard,intr 0 0
safserver:/u1 /mnt/safserver/u1 nfs rw,bg,hard,intr 0 0

This one is in between (fails about 75% of the time):

proc /proc proc nodev,noexec,nosuid 0 0
LABEL=root / ext3 errors=remount-ro 0 1
LABEL=boot /boot ext3 defaults 0 2
LABEL=swap none swap sw 0 0
/dev/fd0 /media/floppy0 auto rw,user,noauto,exec,utf8 0 0
/dev/sda1 /mnt/windows/C ntfs-3g ro 0 0
safserver:/u4/pdb /mnt/safserver/pdb nfs ro,bg,hard,intr 0 0
safserver:/u1 /mnt/safserver/u1 nfs rw,bg,hard,intr 0 0
/dev/sda6 /mnt/windows/D ntfs-3g defaults 0 0

Suggests that the ntfs-3g mounts provide enough of a delay so that the race condition between statd starting and nfs mounting is overcome. (Mostly, surely in enough boots ...


Whit Blauvelt (whit-launchpad) wrote :

I strongly agree with the comments above. Against considerable management pressure to run all Linux boxes as RHEL/CentOS, I've insisted on building half our systems on Ubuntu Server, notwithstanding the really poor decision to transition from init.d to upstart. But to encounter this sort of brokenness, affecting stuff as common to professional use as keeping /var in its own partition and using nfs mounts, and then see an utter lack of commitment from Canonical about fixing this, is sad. In other places I've been in the minority claiming Ubuntu is also "enterprise ready." Now I'm kicking myself.

Solution? Please? I'm not just looking for a kludge. This needs a clean, complete, universal solution.

David Mathog (mathog) wrote :

In another limitation of upstart, or at least upstart is most likely responsible, the "recovery mode" boot hangs when in the /etc/fstab shown in #55 either /dev/sda6 is corrupt or there are network problems and the NFS mounts are not available. I guess using Mandriva spoiled me, because there booting "failsafe" resulted in the init being bright enough to start only the barest minimum of the operating system. Mandriva (and presumably RedHat and CentOS) come up when networks are broken and partitions (other than /boot and /) are screwed up. Not so Ubuntu - it tries to mount everything in /etc/fstab even when "single" is passed as a kernel option. Moreover, due to upstart's parallel nature, it isn't entirely clear at what point it is locking - it might not even be directly after the failed mount warnings, that might indirectly break something else. Perhaps there is some other Ubuntu/upstart specific kernel option to invoke a more low level boot, but if there is, that raises the question why it was not employed by the "recovery mode" boot option Ubuntu created at installation.

Dave Martin (dave-martin-arm) wrote :

particularly @Scott, @Steve

Since I've now hit this bug again, I had another read over the bug thread...

Here are some thoughts... which may be substantially wrong, but hey.

There feels like a disconnect here between the true preconditions for some jobs, and the kind of preconditions specifiable for upstart jobs in general.

A fundamental problem, if I understand the situation correctly, is that we have cases where the events (things happening dynamically during boot) are not adequate to determine whether/when upstart should consider a job startable, at least at the level of the simple boolean combinations etc. that upstart currently understands.

The startability of some jobs depends on other factors (in this case, static system configuration which the administrator expects to customise --- the fstab). If upstart is conservative and waits until _everything_ is mounted, we will fail in some cases, for example when there are NFS mounts in fstab. Alternatively, if upstart is aggressive and tries to start the statd job as soon as it is _probably_ startable, then it might fail to start, and there's not much we can do about it -- that seems to be the current behaviour.

This is a problem because upstart doesn't currently have any sensible methodology for retrying failed jobs. So, we either need a way to retry jobs at sensible times, or we would need a more expressive way to determine when jobs should be started.

Conservative approach
==================
The "conservative" approach would be this approximation (which seems to work for me):

    start on (filesystem and (started portmap or mounting TYPE=nfs))

...because "filesystem" really does mean the whole FHS tree has been mounted, and that the contents of /var are real (not just a stub mountpoint). This won't work for anyone who uses NFS for a mountpoint within the FHS (even if it's not /var and not otherwise needed for launching statd) and probably won't work if an NFS filesystem is listed in /etc/fstab (?) - but shouldn't cause extra problems when using nfsroot since the kernel's internal statd is used in that case (I think?)

Better approach?
==============
Ideally, we could write something like:

    start on (mounted-final MOUNTPOINT=/var) and (started portmap or mounting TYPE=nfs)

Where "mounted-final MOUNTPOINT=<path>" means that all necessary mounts have been done to populate <path> with its "real" FHS contents, and the boot process won't mount anything else on top.

This could be implemented in a practical way in mountall if we don't attempt to make it universal, i.e., we don't ensure that it works for every possible <path>, but we do make it work for top-level directories defined by the FHS. To emit these events, mountall must parse the whole fstab and then act appropriately on each mount:

  * When <path> is mounted:
      * emit mounted MOUNTPOINT=<path>
      * for d in {each FHS top-level dir}:
            if no explicit mount for d or a parent of d in fstab:
                emit mounted MOUNTPOINT=<d>
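
One possible reading of the outline above, sketched in shell rather than inside mountall. Everything here is an assumption for illustration: "mounted-final" is only the event name proposed in this comment, the helper names and the FHS_DIRS list are made up, and the sketch treats a directory as final once the longest fstab mountpoint at or above it has been mounted (it ignores mounts nested below that directory).

```sh
#!/bin/sh
# Sketch only: hypothetical helpers, not mountall code.

# Assumed list of FHS top-level directories to report on.
FHS_DIRS="/bin /boot /etc /home /lib /opt /sbin /srv /tmp /usr /var"

# covering_mount DIR MOUNTPOINT...
# Print the longest mountpoint that is DIR itself or an ancestor of DIR.
covering_mount() {
    dir=$1; shift
    best=""
    for mp in "$@"; do
        case "$dir/" in
            "${mp%/}/"*)
                # mp is dir or one of its ancestors; keep the longest match
                [ "${#mp}" -gt "${#best}" ] && best=$mp
                ;;
        esac
    done
    printf '%s\n' "$best"
}

# final_dirs MOUNTED MOUNTPOINT...
# Given the mountpoint that was just mounted and all fstab mountpoints,
# print one proposed event line per FHS directory that MOUNTED covers.
final_dirs() {
    mounted=$1; shift
    for d in $FHS_DIRS; do
        if [ "$(covering_mount "$d" "$@")" = "$mounted" ]; then
            printf 'mounted-final MOUNTPOINT=%s\n' "$d"
        fi
    done
}
```

With fstab mountpoints /, /boot and /var, calling "final_dirs /var / /boot /var" reports only /var, while "final_dirs / / /boot /var" reports every listed directory except /boot and /var.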

General approach
==============

The above feels a bit messy and fragile, and doesn't solve the general problem of configuration-dependent job start preconditions. So, ...


David Mathog (mathog) wrote :

Regarding #58, while I agree in principle with much of what you say, the underlying problem on my systems, as described in the 2nd paragraph of #55, is that rpc.statd can enter a state where it seems to be working, but actually isn't. (The only way out is to force rpc.statd to restart.) Consequently it is hard to imagine what sort of upstart syntax could possibly be used to avoid the subsequent failed NFS mounts. This is almost by definition an rpc.statd bug since no matter what upstart might or might not have done to start rpc.statd incorrectly (too early for instance), rpc.statd should be able to determine if it is, or isn't, working properly, and exit with a failed status in the latter case.

Dave Martin (dave-martin-arm) wrote :

Hmmm, that suggests there are two or three problems here:

    1) upstart cannot correctly identify when it is safe to start certain jobs
    2) rpc.statd may not fail cleanly if launched in an unsafe system state (maybe)
    3) mountall doesn't always respond to SIGUSR1 (maybe)

I'm not observing (2) -- but I'm not running a real server, so my observations may be simplistic. Are you in a position to debug what's going on with statd?

For (3), it's a long shot, but this might be related to an old known issue affecting responsiveness to signals which I observed in the libnih code, but never observed at runtime - see https://bugs.launchpad.net/ubuntu/+source/libnih/+bug/518921. AFAICT from the code, this issue still potentially exists.

@David: It would be interesting if you could try my fix for this to see if it makes any difference for you. (You can download updated libnih and mountall packages from my PPA https://launchpad.net/~dave-martin-arm/+archive/ppa, assuming they've finished building.)

Using "start on filesystem and ..." seems to give me a working statd in that clients don't hang, but I guess since you have NFS mounts in fstab this won't work in your case. With the default start condition, I get no statd running at all, but can manually do "start statd" and that works too. I don't seem to end up with the half-working state you saw in this configuration.

I also have a separate /var filesystem and had a problem with autofs, which was cured by the solution proposed in #13.
Is there any news about this bug?

As written in #46, we also have this intolerable boot problem.
Like others, my two colleagues and I insisted on building our systems on Ubuntu Server instead of SUSE Professional, and that decision now reflects back on us.
In the meantime I have worked on this problem for many weeks. Time was running out and I was about to lose my temper. This week, as a last try before switching to another distribution, I did yet another new installation (after countless other tries). I used no separate /var partition and changed the NFS entries in /etc/fstab to "noauto". As a consequence I couldn't use "mount -a -t nfs" for mounting these directories. As a first primitive approach I mounted them in /etc/rc.local by calling each mount command directly after a sleep period. This solution worked, but was not usable in practice. So I wrote a script which scans /etc/fstab for NFS entries and mounts the directories, or unmounts them in reverse order. Before mounting after boot there is a 20-second sleep period to be sure all necessary initializations have been done. The script (placed in /etc/init.d) is invoked via links from /etc/rc?.d.
It works well. I haven't yet tested whether this solution is compatible with a separate /var partition.
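
The fstab scan described above might look roughly like the following. This is a sketch, not the poster's actual script (which is not shown in the thread): the function names nfs_mountpoints and reverse_lines are made up for illustration, and the 20-second delay is simply taken from the description.

```sh
#!/bin/sh
# Sketch only; helper names are hypothetical.

# nfs_mountpoints FSTAB
# Print the mount point (field 2) of every nfs/nfs4 entry, in file order,
# skipping comment lines.
nfs_mountpoints() {
    awk '$1 !~ /^#/ && ($3 == "nfs" || $3 == "nfs4") { print $2 }' "$1"
}

# reverse_lines: print stdin with the line order reversed (for unmounting).
reverse_lines() {
    awk '{ line[NR] = $0 } END { for (i = NR; i > 0; i--) print line[i] }'
}

# Intended use from an init script (left commented out here):
#   start) sleep 20
#          nfs_mountpoints /etc/fstab | while read -r mp; do mount "$mp"; done ;;
#   stop)  nfs_mountpoints /etc/fstab | reverse_lines |
#              while read -r mp; do umount "$mp"; done ;;
```

Mounting by mount point alone lets mount look up the device and options in /etc/fstab, which is why the "noauto" entries can stay there.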

Maciej Puzio (maciej-puzio) wrote :

For those for whom workarounds in comments #8 and #13 do not work and who use autofs, there is a separate Upstart-induced problem related to statd: it races not only with mountall, but also with autofs. Bug 573919 has more information. Autofs users should also see bug 579858 if they wish to have their local drives cleanly unmounted on shutdown.

Jan-Marek Glogowski (jmglogow) wrote :

The attached patch changes the rpc_pipefs upstart job to start on statd and to poll in the pre-start script, for up to 120*0.5 seconds, for the directory /var/lib/nfs/rpc_pipefs to become available.

It fixes my problem booting an NFSv4 Lucid server and survives an "apt-get install --reinstall nfs-common nfs-kernel-server" of the patched nfs-utils packages.

I hope that the /var fsck / mount never takes more than 60 seconds (didn't check that yet).
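
The polling described in this comment could be sketched as below. The attached patch itself is not reproduced in the thread, so this is an assumed shape rather than its actual contents: wait_for_dir is a hypothetical name, and a fractional interval such as 0.5 relies on a sleep implementation that accepts it (GNU coreutils does).

```sh
#!/bin/sh
# Sketch only; wait_for_dir is a hypothetical helper.

# wait_for_dir DIR TRIES INTERVAL
# Check for DIR up to TRIES times, sleeping INTERVAL seconds between checks.
# Returns 0 as soon as DIR exists, non-zero if it never appears.
wait_for_dir() {
    dir=$1 tries=$2 interval=$3
    while [ "$tries" -gt 0 ]; do
        [ -d "$dir" ] && return 0
        sleep "$interval"
        tries=$((tries - 1))
    done
    [ -d "$dir" ]
}

# In a pre-start script this might be called as (commented out here):
#   wait_for_dir /var/lib/nfs/rpc_pipefs 120 0.5 || exit 1
```

With 120 tries at 0.5 seconds each, the wait is capped at roughly 60 seconds, which is the limit the comment worries about.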

Klaus Ethgen (klaus+ubuntu) wrote :

I must back #62. That is the same situation I am in.

It is not acceptable that NFS mounts sometimes work and sometimes do not. And it is not as if our setup were unusual: just a /home from NFS. The behavior of upstart is far from reliable.

Steve Langasek (vorlon) on 2010-12-23
Changed in nfs-utils (Ubuntu):
importance: Undecided → High
Changed in nfs-utils (Ubuntu Natty):
status: Triaged → In Progress
assignee: nobody → Clint Byrum (clint-fewbar)
Clint Byrum (clint-fewbar) wrote :

Alright I've pushed up branches for nfs-utils and portmap that address this issue in natty. The open merge proposal for nfs-utils is dependent on the portmap package due to the new portmap-wait and portmap-boot upstart jobs.

I've tested this on a clean natty vm with and without sleeps inserted for fsck, portmap start, and statd start, and it all seems to work no matter the race.

Note that I was able to get around looping and polling for upstart jobs to start with the technique used in portmap-wait and statd-mounting. This seems to be a reliable way to allow waiting on a job that one did not trigger the start/stop for, and may be a good candidate for an upstart feature request.

Changed in nfs-utils (Ubuntu Lucid):
status: New → Confirmed
Changed in nfs-utils (Ubuntu Maverick):
status: New → Confirmed
Changed in portmap (Ubuntu Natty):
status: New → Confirmed

On Wed, 2011-01-05 at 20:34 +0000, Clint Byrum wrote:
> Alright I've pushed up branches for nfs-utils and portmap that address
> this issue in natty. The open merge proposal for nfs-utils is dependent
> on the portmap package due to the new portmap-wait and portmap-boot
> upstart jobs.

I've yet to look at the branch to see what changes have been made but
perhaps you can tell me, is this accomplished all with modifications to
upstart scripts or other feasibly editable (i.e. text) files?

If so, I can try porting your changes back to one of my several maverick
(or lucid even -- this should get backported to lucid, which is the LTS
release, after all -- in any case, right?) systems that have problems with
statd not getting started properly.

Clint Byrum (clint-fewbar) wrote :

Yes, this is all done in upstart mods. They are fairly extensive and can potentially make the system unbootable if done wrong, so do be careful.

The fact that this is already accepted for maverick and lucid means that the fix will be backported.


This bug was fixed in the package portmap - 6.0.0-2ubuntu2

---------------
portmap (6.0.0-2ubuntu2) natty; urgency=low

  * debian/upstart renamed to debian/portmap.portmap.upstart,
    debian/portmap.portmap-boot.upstart, debian/rules: Added to set
    special ON_BOOT flag during boot, which allows statd to use an
    AND with 'started portmap ON_BOOT=y'. This version of portmap is a
    dependency of nfs-utils to fix LP: #525154
  * debian/portmap.portmap-wait.upstart: job to wait for portmap to
    finish starting. also depended on by nfs-utils.
 -- Clint Byrum <email address hidden> Wed, 05 Jan 2011 11:47:26 -0800

Changed in portmap (Ubuntu Natty):
status: Confirmed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package nfs-utils - 1:1.2.2-4ubuntu2

---------------
nfs-utils (1:1.2.2-4ubuntu2) natty; urgency=low

  * debian/nfs-common.statd.upstart,
    debian/nfs-common.statd-mounting.upstart: refactor startup to wait for
    local-filesystems. (LP: #525154)
  * debian/control: depend on portmap version that sets ON_BOOT=y and
    has the portmap-wait job.
  * debian/rules: install new statd-mounting upstart job
 -- Clint Byrum <email address hidden> Wed, 05 Jan 2011 12:27:32 -0800

Changed in nfs-utils (Ubuntu Natty):
status: In Progress → Fix Released
Steve Langasek (vorlon) on 2011-01-16
Changed in nfs-utils (Ubuntu Lucid):
status: Confirmed → Triaged
Changed in portmap (Ubuntu Lucid):
status: New → Triaged
Changed in nfs-utils (Ubuntu Maverick):
importance: Undecided → High
Changed in nfs-utils (Ubuntu Lucid):
importance: Undecided → High
Changed in portmap (Ubuntu Lucid):
importance: Undecided → High
Changed in portmap (Ubuntu Maverick):
importance: Undecided → High
status: New → Triaged
Changed in nfs-utils (Ubuntu Maverick):
status: Confirmed → Triaged
Antti Miranto (software-antti) wrote :

Fixed this by making upstart watch over the rpc.idmapd. Might be a bit cleaner.

Steve Langasek (vorlon) wrote :

You seem to be following up to the wrong bug. This bug is about statd, not idmapd.

In general we want to avoid running services in the foreground. The idmapd bug is probably fixable without resorting to such a change. We should probably fix idmapd to not fork before it's properly running, at any rate.

Steve Langasek (vorlon) on 2011-01-19
description: updated

Accepted nfs-utils into lucid-proposed, the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you in advance!

Changed in nfs-utils (Ubuntu Lucid):
status: Triaged → Fix Committed
tags: added: verification-needed
Changed in portmap (Ubuntu Lucid):
status: Triaged → Fix Committed
Martin Pitt (pitti) wrote :

Accepted portmap into lucid-proposed, the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you in advance!

I am testing on a VM, and the race is kinda hard to trigger. Mounting of /var needs to be delayed somehow, for example by an fsck. In my case, bug #579858 triggers an fsck on reboot, which makes the race happen somewhat reliably.

After updating the nfs-common and portmap packages to the ones in lucid-proposed, statd starts reliably. So, for my synthetic test case, it works perfectly. I am not going to set the bug to verification-done just yet, as I think it would be best if it were confirmed by someone with a real-world test case.

Clint Byrum (clint-fewbar) wrote :

Etienne, thanks for trying it out. I wasn't able to reliably cause statd to lose the race w/ mountall without forcing FS corruption or introducing a sleep into the fsck process. I did so by moving /sbin/fsck to /sbin/fsck.real and then using this script:

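#!/bin/sh
# Delay the real fsck to widen the race window between statd and mountall.
echo sleeping 5 seconds before calling real fsck
sleep 5
# Quote "$@" so that arguments containing spaces are passed through intact.
exec /sbin/fsck.real "$@"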

Given that this is a race condition brought on by fsck problems, the only real world examples are exactly what you had to use, caused by recovering from some other problem. I don't think your method is synthetic at all, but it would be good to get somebody else to verify in a similar manner.

Ray Nichols (ray-rdnichols) wrote :

My machine *always* starts with statd down and therefore with all NFS auto-mounting not working.

$ sudo status statd
statd stop/waiting

I'll look out for maverick-proposed packages to test.

On Fri, 2011-01-21 at 20:43 +0000, Ray Nichols wrote:
> My machine *always* starts with statd down and therefore with all NFS
> auto-mounting not working.

Indeed. Mine too. My lucid machines also.

> I'll look out for maverick-proposed packages to test.

Indeed, me too. I'd rather not diddle with my lucid machines. There is
a reason they are LTS installs. :-) Once I see happiness on
maverick-proposed I will have more confidence in diddling the lucid
machines.

b.

After Etienne and Clint said this problem is most easily reproduced by a slow fsck - I checked and I do appear to have filesystems unmounted uncleanly. So I'll investigate that as a separate issue (when I find out how to get my shutdown messages recorded to log). I'm certainly experiencing shutdowns that are very quick compared to my last install in Hardy Heron.

/var/log/boot.log:

fsck from util-linux-ng 2.17.2
...etc...
init: statd main process (920) terminated with status 1
init: statd main process ended, respawning
init: statd main process (930) terminated with status 1
...etc...
/dev/sda1: recovering journal
init: statd main process (987) terminated with status 1
init: statd main process ended, respawning
/dev/mapper/ubuntu01vg-homelv: recovering journal
init: statd main process (993) terminated with status 1
init: statd respawning too fast, stopped
/dev/mapper/ubuntu01vg-usermedialv: clean, 12/610800 files, 76473/2441216 blocks
/dev/mapper/ubuntu01vg-usrlv: recovering journal
/dev/mapper/ubuntu01vg-varlv: recovering journal
/dev/mapper/ubuntu01vg-rootlv: clean, 13556/183264 files, 148953/732160 blocks
/dev/mapper/ubuntu01vg-tmplv: recovering journal
/dev/mapper/ubuntu01vg-usrlocallv: clean, 43/60928 files, 8262/243712 blocks (check in 3 mounts)
init: ureadahead-other main process (1040) terminated with status 4
/dev/mapper/ubuntu01vg-tmplv: clean, 785/121920 files, 35321/487424 blocks
init: ureadahead-other main process (1061) terminated with status 4
init: mounted-tmp main process (1062) terminated with status 127
mountall: Event failed
...etc...
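
A quick way to check a boot log for this pattern, sketched under the assumption that the messages look exactly like the excerpt above. The helper names are hypothetical.

```sh
#!/bin/sh
# Sketch only; helper names are hypothetical.

# statd_failure_count LOGFILE
# Print how many times init logged the statd main process exiting.
statd_failure_count() {
    grep -c 'init: statd main process ([0-9]*) terminated with status' "$1"
}

# statd_respawn_limit_hit LOGFILE
# Succeed if Upstart gave up respawning statd.
statd_respawn_limit_hit() {
    grep -q 'init: statd respawning too fast, stopped' "$1"
}

# Example (commented out):
#   statd_respawn_limit_hit /var/log/boot.log && echo "statd was stopped by init"
```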

Darin Tay (dtay) wrote :

Verified here with the packages in Lucid proposed.

Not only does statd successfully start, but it does so the first time it tries, avoiding bug #697473.

Thanks for the fix!

Martin Pitt (pitti) on 2011-01-25
tags: added: verification-done
removed: verification-needed

Accepted portmap into maverick-proposed, the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you in advance!

Changed in portmap (Ubuntu Maverick):
status: Triaged → Fix Committed
tags: removed: verification-done
tags: added: verification-needed
tags: added: verification-done
Changed in nfs-utils (Ubuntu Maverick):
status: Triaged → Fix Committed
tags: removed: verification-done
Martin Pitt (pitti) wrote :

Accepted nfs-utils into maverick-proposed, the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you in advance!

tags: added: verification-done
removed: verification-needed
tags: added: verification-needed
Changed in nfs-utils (Ubuntu Natty):
assignee: Clint Byrum (clint-fewbar) → nobody

I confirm that after installing portmap and nfs-common packages from maverick-proposed I can use NFS auto-mounting with no problems at all. My /var/log/boot.log has no statd error messages now after testing with two reboots.

Thanks!

Steve Langasek (vorlon) on 2011-01-25
tags: removed: verification-needed

Ditto. The maverick-proposed packages seem to work for me too. After a
reboot rpc.statd is indeed started.

Cheers!

This bug was fixed in the package nfs-utils - 1:1.2.0-4ubuntu4.1

---------------
nfs-utils (1:1.2.0-4ubuntu4.1) lucid-proposed; urgency=low

  * debian/nfs-common.statd.upstart,
    debian/nfs-common.statd-mounting.upstart: refactor startup to wait for
    local-filesystems. (LP: #525154)
  * debian/control: depend on portmap version that sets ON_BOOT=y and
    has the portmap-wait job.
  * debian/rules: install new statd-mounting upstart job
  * debian/nfs-common.rpc_pipefs.upstart: instantiate this job separately for
    gssd and idmapd, so that the filesystem gets mounted and unmounted
    correctly even if both of gssd and idmapd aren't being run, or if one of
    the two tries to start before the filesystem is fully mounted. Though
    it may be simpler now to move this logic back into the gssd and idmapd
    jobs directly, leave that for a later date.
 -- Steve Langasek <email address hidden> Wed, 19 Jan 2011 16:28:35 -0800

Changed in nfs-utils (Ubuntu Lucid):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package portmap - 6.0.0-1ubuntu2.1

---------------
portmap (6.0.0-1ubuntu2.1) lucid-proposed; urgency=low

  * debian/upstart renamed to debian/portmap.portmap.upstart,
    debian/portmap.portmap-boot.upstart, debian/rules: Added to set
    special ON_BOOT flag during boot, which allows statd to use an
    AND with 'started portmap ON_BOOT=y'. This version of portmap is a
    dependency of nfs-utils to fix LP: #525154
  * debian/portmap.portmap-wait.upstart: job to wait for portmap to
    finish starting. also depended on by nfs-utils.
 -- Steve Langasek <email address hidden> Tue, 18 Jan 2011 15:44:43 -0800

Changed in portmap (Ubuntu Lucid):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package portmap - 6.0.0-2ubuntu1.1

---------------
portmap (6.0.0-2ubuntu1.1) maverick-proposed; urgency=low

  * debian/upstart renamed to debian/portmap.portmap.upstart,
    debian/portmap.portmap-boot.upstart, debian/rules: Added to set
    special ON_BOOT flag during boot, which allows statd to use an
    AND with 'started portmap ON_BOOT=y'. This version of portmap is a
    dependency of nfs-utils to fix LP: #525154
  * debian/portmap.portmap-wait.upstart: job to wait for portmap to
    finish starting. also depended on by nfs-utils.
 -- Steve Langasek <email address hidden> Tue, 18 Jan 2011 15:28:05 -0800

Changed in portmap (Ubuntu Maverick):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package nfs-utils - 1:1.2.2-1ubuntu1.1

---------------
nfs-utils (1:1.2.2-1ubuntu1.1) maverick-proposed; urgency=low

  * debian/nfs-common.statd.upstart,
    debian/nfs-common.statd-mounting.upstart: refactor startup to wait for
    local-filesystems. (LP: #525154)
  * debian/control: depend on portmap version that sets ON_BOOT=y and
    has the portmap-wait job.
  * debian/rules: install new statd-mounting upstart job
  * debian/nfs-common.rpc_pipefs.upstart: instantiate this job separately for
    gssd and idmapd, so that the filesystem gets mounted and unmounted
    correctly even if both of gssd and idmapd aren't being run, or if one of
    the two tries to start before the filesystem is fully mounted. Though
    it may be simpler now to move this logic back into the gssd and idmapd
    jobs directly, leave that for a later date.
 -- Steve Langasek <email address hidden> Wed, 19 Jan 2011 16:05:07 -0800

Changed in nfs-utils (Ubuntu Maverick):
status: Fix Committed → Fix Released
Alexander Achenbach (xela) wrote :

You may also refer to LP #643289 posts #2, #3, #4, which provide fixes for mountall blocking and for various start-up problems of statd as well as for idmapd/gssd/rpc_pipefs. Those fixes are based on the above fixes, but go further to make sure mountall and rpc_pipefs mounting proceed in proper sequence where needed.

Stephan Adig (sadig) wrote :

@All:

Has anyone tried installing nfs-common (in lucid, with the latest SRU bugfix) in a chroot environment?

If so, did nobody see nfs-common failing in nfs-common.postinst?

Normally, in a chroot, starting services is disabled and/or not allowed (or in some cases the startup mechanism is somehow diverted to /bin/true or /bin/false whatever).
But now, nfs-common.postinst does an invoke-rc.d statd, and this fails in a chroot environment, which means that the package in question is not configured properly.

The upstart dependency (started portmap ON_BOOT= or (local-filesystems and started portmap ON_BOOT=y)) doesn't apply inside a chroot.

One thing that needs to be done is to not start the services during installation (which means removing all the blind magic of dh_installinit), or whatever it takes to get back to the old behaviour.

Regards,
\sh

Hi Stephan, thanks for the feedback.

First off, services are always started/stopped with invoke-rc.d on
upgrade in Debian packages and, thusly, have always been started and
stopped in Ubuntu.

This is a known issue in upstart which is under active development.
Basically upstart doesn't know anything about the chroot environment, so
it doesn't read the /etc/init from the chroot.

Take a look at bug #430224 for more information.

A simple workaround to get the postinst to succeed is to edit the
postinst in /var/lib/dpkg/info/nfs-common.postinst and remove the calls
to invoke-rc.d

Steve Langasek (vorlon) wrote :

On Mon, Feb 21, 2011 at 02:10:11PM -0000, Stephan Adig wrote:

> Did anyone tried to install now nfs-common (in lucid with the latest sru
> bugfix) in a chroot environment?

> If so, did nobody see nfs-common failing in nfs-common.postinst?

> Normally, in a chroot, starting services is disabled and/or not allowed
> (or in some cases the startup mechanism is somehow diverted to /bin/true
> or /bin/false whatever). But now, nfs-common.postinst does an invoke.rc
> statd and this failes in a chroot environment which means, that the
> package in question is not configured properly.

> The upstart dependency (started portmap ON_BOOT= or (local-filesystems
> and started portmap ON_BOOT=y)) doesn't apply inside a chroot.

invoke-rc.d should not fail in a chroot environment. How have you
configured your chroot? *Any* of the standard methods for disabling
services in a chroot should have the intended effect (i.e., diverting
/sbin/initctl to /bin/true, or configuring a policy-rc.d to disallow service
starting).

If you have /sbin/initctl diverted to /bin/*false*, that is a
misconfiguration in your environment.
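
As a rough illustration of the policy-rc.d method (a minimal sketch, not taken from any package; the helper name forbid_service_starts and the example chroot path are made up here), a chroot can be told to refuse service starts like this. It assumes the usual invoke-rc.d convention that a policy-rc.d exiting with status 101 means the action is forbidden:

```shell
#!/bin/sh
# Hypothetical helper: install a policy-rc.d that forbids service starts
# inside the chroot directory given as $1. invoke-rc.d consults
# /usr/sbin/policy-rc.d and treats exit status 101 as "action forbidden".
forbid_service_starts() {
    chroot_dir=$1
    mkdir -p "$chroot_dir/usr/sbin" || return 1
    cat > "$chroot_dir/usr/sbin/policy-rc.d" <<'EOF'
#!/bin/sh
# Deny every init script action requested through invoke-rc.d.
exit 101
EOF
    chmod 755 "$chroot_dir/usr/sbin/policy-rc.d"
}
```

Calling `forbid_service_starts /srv/chroot/lucid` (a hypothetical path) before installing packages should let the postinst finish without trying to start statd; the file would need to be removed again if the tree is ever booted as a real system.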

> One thing that needs to be done is, to not start the services during
> installation (which means removing all blind magic of dh_installinit) or
> whatever it takes to come back with the old behaviour.

No, that is not a thing that needs to be done.

--
Steve Langasek Give me a lever long enough and a Free OS
Debian Developer to set it on, and I can move the world.
Ubuntu Developer http://www.debian.org/
<email address hidden> <email address hidden>

I'm using Ubuntu 10.04 as an NFS root on a diskless workstation. For now, the only working fix for this issue is putting

restart portmap
restart nfs

in a script started early from /etc/rcS.d (that is, during the 'single user' stage of init).

Excerpts from Janusz Mordarski's message of Tue Mar 29 09:55:14 UTC 2011:
> i'm using Ubuntu 10.04 as nfsroot on diskless workstation. For now, only
> working patch for this issue is putting
>
> restart portmap
> restart nfs
>
> in some script started early in /etc/rcS.d (so 'single user' mode stage
> of init)

Janusz, can you explain how you think this is related to the bug you've
commented on?

It sounds like you have a different situation, and you should open a new
report or look through some of the other ones against mountall. There
are known issues with nfs root that we haven't addressed yet, though
we'd like to and it will help if we can have enough information from
users like yourself.

If you do open another report, it would be a good idea to come back and
note the new bug # in the comments here.

Well, restarting portmap and nfs helped, and I don't want to revert to the faulty configuration now.

I was getting this message when booting my diskless workstations:

mount.nfs: rpc.statd is not running but is required for remote locking.
   Either use '-o nolock' to keep locks local, or start statd.

/root was mounted OK of course, because otherwise it wouldn't boot at all.
- The problem was with /home and other directories mounted by the mountall init scripts.
- After a complete boot, in gdm, there was no /home; I had to invoke mount -a again (while gdm was running). I solved that by using rc.local, a dirty solution ;)

The portmap and nfs services were starting, but this error message showed up anyway.

Thanks to this thread I found out that starting statd may have been racing with mounting /var over NFS in read-write mode, or something like that. So I thought that restarting the portmap and nfs services in single-user mode, before any extra NFS shares (/home) are mounted, would solve the problem, and it works fine now: no error messages during the boot process. I don't really know whether it qualifies as a new bug or not.

Excerpts from Janusz Mordarski's message of Sat Apr 09 16:46:18 UTC 2011:
> well restarting portmap and nfs helped, and i don't want now to revert
> back to faulty config.
>
> i was getting this message when booting my diskless workstations:
>
> mount.nfs: rpc.statd is not running but is required for remote locking.
> Either use '-o nolock' to keep locks local, or start statd.
>
> /root was mounted OK of course, because if not, it wouldn't boot at all
> - problem was with /home and other directories mounted by mountall init scripts
> - after complete boot, in gdm, there was no /home , i had to once again invoke mount -a (when gdm was running) - i solved it by using rc.local < dirty solution ;)
>
> portmap and nfs services were starting, but this error message was
> showing up anyway
>
> thanks to this thread i found out, that maybe starting statd was racing
> with mounting /var by nfs in read write mode or something like this, so
> i thought that restarting portmap and nfs services in single user mode,
> before any extra NFS shares are mounted (/home) will solve the problem,
> and it works fine now. no error messages during boot process. i don't
> really know if it qualifies for new bug or not.

Janusz, this bug is about statd not being available when NFS mounts are
made. This should be fixed in Lucid and Maverick. Make sure all updates
are applied. If you have customized any of the upstart jobs in /etc/init
that control statd or portmap, that may be causing problems. Look for
files with the extension '.dpkg-new', like /etc/init/statd.conf.dpkg-new;
if those exist, you will want to make sure to merge them into the live
files (like /etc/init/statd.conf).
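
A quick way to check for such leftovers might look like the sketch below. The helper name list_unmerged_conffiles is made up for illustration, and the only assumption is the '.dpkg-new' naming described above; it prints each live job file that has a pending packaged version and sends a diff of the two to stderr:

```shell
#!/bin/sh
# Hypothetical helper: list upstart job files in directory $1 (default
# /etc/init) that have an unmerged ".dpkg-new" version next to them.
list_unmerged_conffiles() {
    init_dir=${1:-/etc/init}
    for new in "$init_dir"/*.dpkg-new; do
        [ -e "$new" ] || continue      # the glob matched nothing
        live=${new%.dpkg-new}
        echo "$live"
        # Show what the packaged version would change. diff exits
        # non-zero when the files differ, which is expected here.
        diff -u "$live" "$new" >&2 || true
    done
}
```

Running `list_unmerged_conffiles` with no argument inspects /etc/init; an empty result suggests the live job files are already the packaged ones.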

Clint, similarly to Janusz, I see these exact same issues on my Natty NFS-root system, even when running with the pristine statd* and portmap* init configuration files. I don't quite understand enough about upstart to say what's wrong yet.

I just spent an afternoon chasing this down and am pretty sure a bug still remains somewhere, though I'm finding it hard to see where without complete event logging or an infinite kernel scrollback buffer.

Clint Byrum (clint-fewbar) wrote :

Christian, if you're using NFS root, you probably have an issue. But it is probably not *this* issue, as this one was not specific to nfs root configurations. It would be quite helpful if you were to raise a new bug report against nfs-utils that detailed what you expect to have happen, and what is actually happening.

On 11-06-20 07:10 PM, Clint Byrum wrote:
> Christian, if you're using NFS root, you probably have an issue. But it
> is probably not *this* issue, as this one was not specific to nfs root
> configurations. It would be quite helpful if you were to raise a new bug
> report against nfs-utils that detailed what you expect to have happen,
> and what is actually happening.

But the question that begs to be asked is why are common configurations
like NFS-root and separate /var and /usr filesystems STILL not part of
Ubuntu's standard QA processes?

These are not entirely "esoteric" configurations you know and they have
been shown to have problems in past releases so why are current QA
processes not testing for these?

You do understand that effective QA means that when a configuration is
shown to have a potential for regressions that such a configuration be
added to the battery of tests that QA runs. It's simply not effective
to identify a configuration that doesn't work, (think that you have)
fix(ed) it and simply move on and not ever test that configuration again
during a regular release cycle. That's exactly how regressions leak
into GA product. It's embarrassing.

Clint Byrum (clint-fewbar) wrote :

Excerpts from Brian J. Murrell's message of Tue Jun 21 10:18:06 UTC 2011:
> On 11-06-20 07:10 PM, Clint Byrum wrote:
> > Christian, if you're using NFS root, you probably have an issue. But it
> > is probably not *this* issue, as this one was not specific to nfs root
> > configurations. It would be quite helpful if you were to raise a new bug
> > report against nfs-utils that detailed what you expect to have happen,
> > and what is actually happening.
>
> But the question that begs to be asked is why are common configurations
> like NFS-root and separate /var and /usr filesystems STILL not part of
> Ubuntu's standard QA processes?
>
> These are not entirely "esoteric" configurations you know and they have
> been shown to have problems in past releases so why are current QA
> processes not testing for these?
>
> You do understand that effective QA means that when a configuration is
> shown to have a potential for regressions that such a configuration be
> added to the battery of tests that QA runs. It's simply not effective
> to identify a configuration that doesn't work, (think that you have)
> fix(ed) it and simply move on and not ever test that configuration again
> during a regular release cycle. That's exactly how regressions leak
> into GA product. It's embarrassing.

Brian, much of our QA is still community driven. We devote significant
resources to testing the base system, but multi-server setups like NFS
root are taxing to manually test, and more complex to automate. I'd love
to say we write a regression test for every issue we fix and run it on
every possible configuration. Clearly, we don't.

If NFS root is important to you, I would suggest that you help us out
by gathering other interested users, and putting together a blueprint
for the next UDS. Lets get automated tests setup for this configuration.

I'd support this 100%, but I don't think we can do it without some help
from the actual users of NFS root systems.

I can definitely file a new bug. I've been fighting this on and off and found that while the solution posted to this bug does not fix the problem with our diskless root, a related fix does. My statd-starting script is below; I noticed that the script is run multiple times and that the "start statd" line isn't actually synchronous, problems which the rpcinfo check solves. I'm not sure you can assume portmap is listening on localhost, but this works for me:

description "Trigger a statd run"

start on mounting TYPE=nfs
task
console output

script
    # This apparently is necessary to ensure the statd run completes;
    # it's a hack but it seems to work more reliably than anything else
    while ! rpcinfo -u localhost status; do
        start statd
        echo "Waiting for statd to show up.."
        sleep 1s
    done
end script
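
The loop above retries forever, so a statd that never registers would stall the job indefinitely. A bounded variant could be sketched as follows; wait_until is a made-up helper name, and the attempt count in the usage note is an arbitrary assumption, not a tested value:

```shell
#!/bin/sh
# Hypothetical helper: run a probe command once per second until it
# succeeds or $1 attempts have been made.
# Returns 0 as soon as the probe succeeds, 1 if every attempt failed.
wait_until() {
    attempts=$1
    shift
    while [ "$attempts" -gt 0 ]; do
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        attempts=$((attempts - 1))
        # Only sleep when another attempt will follow.
        [ "$attempts" -gt 0 ] && sleep 1
    done
    return 1
}
```

Inside a job this might be used as `wait_until 30 rpcinfo -u localhost status || echo "statd did not register in time"`, keeping the same rpcinfo probe as the script above while giving up after a fixed number of tries.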

Christian Reis (kiko) wrote :

By using the script above, I don't actually need any "start on" clause in my statd.conf file at all. I posted some stream-of-consciousness entries here:

  - http://www.async.com.br/~kiko/diary.html?date=20.06.2011
  - http://www.async.com.br/~kiko/diary.html?date=21.06.2011
  - http://www.async.com.br/~kiko/diary.html?date=22.06.2011

I've pushed the bzr branch containing my complete init setup to http://bazaar.launchpad.net/~kiko/+junk/init-diskless/files -- feel free to poach and comment.

Forest (foresto) wrote :

I'm still seeing this problem on a fully updated Natty. My nfs mount occasionally succeeds at boot, but most of the time it doesn't. After editing /etc/init/mountall.conf to log the mountall --debug output, I see these messages in the log:

mounting /myremotedir
spawn: mount -t nfs -o rw,intr,retrans=180,async,noatime,nodiratime 10.0.95.4:/myremotedir /myremotedir
mount.nfs: rpc.statd is not running but is required for remote locking.
mount.nfs: Either use '-o nolock' to keep locks local, or start statd.
mount.nfs: an incorrect mount option was specified
mountall: mount /myremotedir [896] terminated with status 32

When the mount fails, "status statd" reports that statd is running, which makes me think there is still a race condition here.
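
To pull the failing mounts out of a longer debug log, something like the following could help. This is only a sketch: the helper name failed_mounts is made up, and it assumes the log lines look exactly like the "terminated with status" line in the excerpt above:

```shell
#!/bin/sh
# Hypothetical helper: print "<mountpoint> <status>" for every mount that
# the mountall debug log $1 reports as terminated with a non-zero status.
failed_mounts() {
    # Extract the mount point and exit status from lines such as
    #   mountall: mount /myremotedir [896] terminated with status 32
    sed -n 's/^mountall: mount \(.*\) \[[0-9]*] terminated with status \([0-9]*\)$/\1 \2/p' "$1" |
        awk '$NF != 0'    # keep only non-zero exit statuses
}
```

Comparing that list against the output of `status statd` and `rpcinfo -u localhost status` taken at the same moment may help tell a job that is merely marked running from one that is actually answering RPC requests.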

I'm not sure if it's relevant, but the 10.0.95.x network is reachable via eth1, not eth0.

I see a lot of "fix released" notes on this bug report. Has the fix made it in to the Natty release repositories yet? If so, it looks to me like the fix needs some work.

Forest (foresto) wrote :

Following up on my own question: I don't see any updated mountall or nfs package in natty-proposed, and my fully-updated natty system is still failing to mount my nfs shares at boot because of a race with statd.

When I change this line in /etc/init/mountall-net.conf:

start on net-device-up

To this:

start on net-device-up or stopped statd-mounting

mountall gets called again after statd has actually started, and my nfs shares get mounted at startup. I'm attaching a patch for mountall.

On 11-07-01 02:52 PM, Forest wrote:
> Following up on my own question: I don't see any updated mountall or
> nfs package in natty-proposed, and my fully-updated natty system
> is still failing to mount my nfs shares at boot because of a race with
> statd.

My advice here would be either (a) give up on running a /var that's
separate from / and just learn to cope with a system that becomes
entirely useless at some point because something writing into /var has
filled your root filesystem or (b) switch to a different distro that
actually pays attention to "server deployment" practices wherein
separating /var from / (and /usr for that matter) is an accepted and
supported practice.

The fact that this bug has existed since Lucid (3 releases now) makes it
clear to me that Ubuntu is not at all interested in supporting server
deployments where responsible practice is to keep /var from being able
to cripple an entire system simply because it fills up.

I guess Ubuntu is targeting the desktop and if you want to deploy
servers (where you likely will have budgets for support contracts) you
need to look at a different distro.

Just my perspective having watched this bug stagnate through three releases.

Forest (foresto) on 2011-07-02
summary: - mountall for /var races with rpc.statd
+ mountall for /var or other nfs mount races with rpc.statd
Steve Langasek (vorlon) wrote :

The proposed change to mountall-net.conf here is incorrect. Please file new bug reports against the nfs-utils package for races you're seeing between statd startup and mountall calls. From some of the later comments in this bug report, it looks like the statd process isn't ready to serve requests at the time it forks; if so, that's a bug in statd that needs to be fixed there.

Changed in mountall (Ubuntu):
status: New → Invalid
Changed in mountall (Ubuntu Lucid):
status: New → Invalid
Changed in mountall (Ubuntu Maverick):
status: New → Invalid
Changed in mountall (Ubuntu Natty):
status: New → Invalid
Steve Langasek (vorlon) wrote :

Brian, in comment #84 you said that the SRUed package fixed the issue for you, but in your latest post you comment that "this bug has existed since Lucid". Did the updated package fix your issue, or did it not? If it didn't, we should reopen this bug report; up to then, you had certainly given the impression that your bug was fixed.

On 11-07-17 04:59 AM, Steve Langasek wrote:
> Brian, in comment #84 you said that the SRUed package fixed the issue
> for you, but in your latest post you comment that "this bug has existed
> since Lucid". Did the updated package fix your issue, or did it not?
> If it didn't, we should reopen this bug report; up to then, you had
> certainly given the impression that your bug was fixed.

Yeah. I would say that it has. I guess it must just be all of the
other problems in Lucid (and beyond) that have not been fixed (i.e. all
of the bugs related to /var being on its own filesystem that are still
open and dangling) that are clouding my judgment in this bug.

b.

Steve Langasek (vorlon) wrote :

On Mon, Jul 18, 2011 at 05:38:51PM -0000, Brian J. Murrell wrote:
> On 11-07-17 04:59 AM, Steve Langasek wrote:
> > Brian, in comment #84 you said that the SRUed package fixed the issue
> > for you, but in your latest post you comment that "this bug has existed
> > since Lucid". Did the updated package fix your issue, or did it not?
> > If it didn't, we should reopen this bug report; up to then, you had
> > certainly given the impression that your bug was fixed.

> Yeah. I would say that it has. I guess it must just be all of the
> other problems in Lucid (and beyond) that have not been fixed (i.e. all
> of the bugs related to /var being on its own filesystem that are still
> open and dangling) that are clouding my judgment in this bug.

Oh. Can you give me some specific bug numbers there? I wasn't aware of any
other issues with /var as a separate filesystem, and based on the
architecture I really wouldn't *expect* any bugs not related to NFS. So if
there are other problems, I'd very much like to know what they are so we can
see about getting them fixed.

On 11-07-18 02:49 PM, Steve Langasek wrote:
>
> Oh. Can you give me some specific bug numbers there?

Not at the moment I'm afraid. The number of bugs I have in my
subscribed list is just way too big to go searching right now, but off
the top of my head, there is the ureadahead bug, where people have
actually posted solutions and afaik, it's still open.

#484209

#275451

#690401

come to my mind.

Also just having a bind mount to a NFS mount in fstab makes my lucid nodes unbootable.

All in all I think we fixed/worked around 4 bugs in lucid to get them to boot.
All related to the introduction of upstart. And not all have fixes released.

Also I think at this point most people will have migrated to another distro, so don't assume
missing feedback means fixed.

ingo (ingo-steiner) wrote :

@ Andy
> Also I think at this point most people will have migrated to another distro,

so me!

It's now well over a year since Lucid was released and there is still not a single word about those issues in the "Release Notes (known issues)" here http://www.debian.org/releases/stable/amd64/release-notes/index.en.html.

As most of those bugs are obviously "by design" (upstart), they should have been fixed before release, or QC should have refused to approve Lucid as an LTS.

Steve Langasek (vorlon) wrote :

On Mon, Jul 18, 2011 at 07:05:51PM -0000, Brian J. Murrell wrote:
> On 11-07-18 02:49 PM, Steve Langasek wrote:

> > Oh. Can you give me some specific bug numbers there?

> Not at the moment I'm afraid. The number of bugs i have in my
> subscribed list is just way to big to go searching right now, but off
> the top of my head, there is the ureadahead bug, where people have
> actually posted solutions and afaik, it's still open.

This appears to be bug #523484.

To the best of my understanding, this describes a feature that is missing
when using a separate /var (the ureadahead job will run but not do anything
useful). This is certainly a bug, but not something that we are likely to
backport to lucid even when a fix becomes available. I think you would be
hard pressed to convince any of the Ubuntu developers that it has a major
impact on the usability of Ubuntu that systems with a separate /var don't
get the boot speed enhancement from ureadahead!

Also, you say that people have posted solutions; I've reviewed the bug log
and there are no solutions to the bug there. Most of the proposals would
work *only* on systems with detached /var, so are not suitable for
inclusion in the distribution; the closest thing to a fix is Clint's
proposal to add a signal handler and an additional upstart job, but that
code hasn't actually been written.

So I'm afraid I regard this bug's current prioritization as "medium" to be
correct, sorry. Patches welcome, but I think it's unlikely that this is
going to be worked on soon otherwise.

ingo (ingo-steiner) wrote :

Sorry, my link was to the Squeeze release notes, here the one for Lucid:
https://wiki.ubuntu.com/LucidLynx/ReleaseNotes#Other_known_issues

Steve Langasek (vorlon) wrote :

On Mon, Jul 18, 2011 at 07:42:54PM -0000, Andy Hauser wrote:
> #484209

Which is fixed.

> #275451

Which has nothing to do with lucid; the bug report dates back to 2008, is
marked "incomplete" in Debian, and does not discuss /var as a separate
partition. That's not an actionable bug report; I've closed it now.

> #690401

This is marked as a duplicate of the present bug report, which is fixed!

> Also just having a bind mount to a NFS mount in fstab makes my lucid
> nodes unbootable.

Is that bug #524972 (where the bind mount refers to a path which is a
symlink)?

> All in all I think we fixed/worked around 4 bugs in lucid to get them to
> boot. All related to the introduction of upstart. And not all have fixes
> released.

I'm sorry you found this to be the case. Unfortunately some of the
NFS-related problems were discovered quite late in the Lucid cycle, and some
of these bugs took quite some time to untangle once identified. But we do
take responsibility for all such critical bugs, which is precisely why I'm
checking to see if there are any such bugs that aren't on our radar.

So far, it doesn't appear that there are.

Seems like the further problems in #484209 also lead here.

So maybe only this and the bindmount thing remain.

> > Also just having a bind mount to a NFS mount in fstab makes my lucid
> > nodes unbootable.

> Is that bug #524972 (where the bind mount refers to a path which is a symlink)?

Maybe. Only that I don't remember there being a question. Anyway, answering
questions on the tty every time the cluster nodes reboot is not a solution here.
I'm not even sure what the resolution of that bug is.

As far as I remember I also had to remove ureadahead.

Anyways. I guess it's nice of you to care. Certainly a great improvement over
the responses of Scott James Remnant at the time these bug reports were
filed ...

On 11-07-18 04:27 PM, Steve Langasek wrote:
>
> This appears to be bug #523484.
>
> To the best of my understanding, this describes a feature that is missing

Uhm, not so much "missing" as "broken".

> when using a separate /var (the ureadahead job will run but not do anything
> useful).

It's not even that nice. What it will do is litter the boot with error
messages, which is annoying and distracting at best and a red herring
when there are other upstart/mountall bugs, at worst.

A first time admin of a system with a separate /var sees these errors on
boot, and even if he is lucky enough that boot succeeds, is still
concerned about the errors (if he is any good) and wastes time chasing
them down only to find that they are a bug that exists and has not and
will not be fixed. Not very classy.

> This is certainly a bug, but not something that we are likely to
> backport to lucid even when a fix becomes available.

So instead you let the above situation continue on ad infinitum,
confusing new users?

> I think you would be
> hard pressed to convince any of the Ubuntu developers that it has a major
> impact on the usability of Ubuntu that systems with a separate /var don't
> get the boot speed enhancement from ureadahead!

It's not even the boot speed enhancement that is the issue. It's the
emission of spurious errors that any good admin will have to waste time
chasing down.

> Also, you say that people have posted solutions; I've reviewed the bug log
> and there are no solutions to the bug there. Most of the proposals would
> work *only* on systems with detached /var, so are not suitable for
> inclusion in the distribution;

Surely they are a good start though, with some conditional code needing
to be added to test for that separate /var case.

> So I'm afraid I regard this bug's current prioritization as "medium" to be
> correct, sorry. Patches welcome, but I think it's unlikely that this is
> going to be worked on soon otherwise.

Exactly my point and the source of my (and others') frustration. You
guys let a bug get out into the wild by not testing a use-case that is
and/or should be very common in the server use space and now that it's
out there, you are just letting it ride.

Oliver Brakmann (obrakmann) wrote :

Hi,

this is turning into a forum discussion real fast, and really has no
place here. I suggest we take it someplace else. Ubuntu-devel, maybe?

On 2011-07-19 12:17, Brian J. Murrell wrote:
> On 11-07-18 04:27 PM, Steve Langasek wrote:
>>
>> This appears to be bug #523484.
>> To the best of my understanding, this describes a feature that is missing
> Uhm, not so much "missing" as "broken".
>> when using a separate /var (the ureadahead job will run but not do anything
>> useful).
>
> It's not even that nice. What it will do is litter the boot with error
> messages

I think "litter" is a bit strong. There's a single line that says that
ureadahead terminated with status soandso. That's hardly littering.
And that's about it with regard to the impact of that bug. I have never
seen a system not boot due to it, and I have used (and am using) Lucid
on bare-metal servers, virtual machines, desktops and road warrior
laptops, all with a separate /var.

> which is annoying and distracting at best and a red herring
> when there are other upstart/mountall bugs, at worst.

I'll agree that it is annoying, but any admin that sees the ureadahead
message when a system shows other problems and goes 'oh, ureadahead
croaked, that must be the cause of it all!' seriously needs to drop it
right then and there and go stack shelves somewhere.

>> This is certainly a bug, but not something that we are likely to
>> backport to lucid even when a fix becomes available.
>
> So instead you let the above situation continue on ad infinitum,
> confusing new users?

That's not what Steve said.

>> I think you would be
>> hard pressed to convince any of the Ubuntu developers that it has a major
>> impact on the usability of Ubuntu that systems with a separate /var don't
>> get the boot speed enhancement from ureadahead!

I agree.

> Exactly my point and the point of mine (and others') frustration. You
> guys let a bug get out into the wild by not testing a use-case that is
> and/or should be very common in the server use space and now that it's
> out there, you are just letting it ride.

Again, I agree that the message is annoying, and it would be nice if we
could get rid of it. But I would describe it as 'cosmetic' at best, and
hardly something to get frustrated about, much less fault the developers
for not giving it highest priority.

So is this seriously something to get so worked up about?

Also, you know, Ubuntu makes alphas public for a reason.

Regards,
Oliver

Steve Langasek (vorlon) wrote :

On Mon, Jul 18, 2011 at 10:56:50PM -0000, Andy Hauser wrote:
> > Is that bug #524972 (where the bind mount refers to a path which is a
> > symlink)?

> Maybe. Only that I don't remember there being a question.

That could point to a plymouth bug affecting your system.

> Anyways answering questions on the tty everytime the cluster nodes reboot
> is not a solution here.

Certainly not! But if you're affected by bug #524972, there's a
straightforward workaround - specify the real path in /etc/fstab instead of
a symlink.
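
To find which entries would need that workaround, a check along these lines might be enough. It is a sketch under assumptions: check_bind_paths is a made-up name, bind mounts are recognised only by a "bind" or "rbind" option in the fourth fstab field, and both the source and the mount point are compared with their `readlink -f` form:

```shell
#!/bin/sh
# Hypothetical helper: for each bind mount in the fstab file $1 (default
# /etc/fstab), report paths that are not already fully resolved, i.e.
# that contain a symlink somewhere.
check_bind_paths() {
    fstab=${1:-/etc/fstab}
    # Drop comments and blank lines; fields are source, target, type, options.
    grep -Ev '^[[:space:]]*(#|$)' "$fstab" |
    while read -r src target fstype opts rest; do
        case ",$opts," in
            *,bind,*|*,rbind,*) ;;
            *) continue ;;
        esac
        for p in "$src" "$target"; do
            real=$(readlink -f "$p") || continue
            if [ "$real" != "$p" ]; then
                echo "$p -> $real"
            fi
        done
    done
}
```

Each line of output names a path from fstab and the real path that could be written there instead.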

> As far as I remember I also had to remove ureadahead.

So far, the only confirmed issues with ureadahead in the final release are
cosmetic ones; if you can reproduce any problems with ureadahead causing
boot failures, we'd be interested to know about them.

I just had quite a few updates in the pipe today and it seems I ran into statd problems too. I'm not quite sure whether it's the same bug.

cat syslog | grep statd gives me this:

Aug 12 18:46:04 doubi statd-pre-start: local-filesystems started
Aug 12 18:46:05 doubi rpc.statd[1008]: Version 1.2.2 starting
Aug 12 18:46:05 doubi rpc.statd[1008]: Flags:
Aug 12 18:46:05 doubi rpc.statd[1008]: unable to register (statd, 1, udp).
Aug 12 18:46:05 doubi init: statd main process (1008) terminated with status 1
Aug 12 18:46:05 doubi init: statd main process ended, respawning
Aug 12 18:46:05 doubi statd-pre-start: local-filesystems started
Aug 12 18:46:05 doubi rpc.statd[1032]: Version 1.2.2 starting
Aug 12 18:46:05 doubi rpc.statd[1032]: Flags:
Aug 12 18:46:05 doubi rpc.statd[1032]: unable to register (statd, 1, udp).
Aug 12 18:46:05 doubi init: statd main process (1032) terminated with status 1
Aug 12 18:46:05 doubi init: statd main process ended, respawning
Aug 12 18:46:05 doubi statd-pre-start: local-filesystems started
Aug 12 18:46:05 doubi rpc.statd[1041]: Version 1.2.2 starting
Aug 12 18:46:05 doubi rpc.statd[1041]: Flags:
Aug 12 18:46:05 doubi rpc.statd[1041]: unable to register (statd, 1, udp).
Aug 12 18:46:05 doubi init: statd main process (1041) terminated with status 1
Aug 12 18:46:05 doubi init: statd main process ended, respawning
Aug 12 18:46:05 doubi statd-pre-start: local-filesystems started
Aug 12 18:46:05 doubi rpc.statd[1049]: Version 1.2.2 starting
Aug 12 18:46:05 doubi rpc.statd[1049]: Flags:
Aug 12 18:46:05 doubi rpc.statd[1049]: unable to register (statd, 1, udp).
Aug 12 18:46:05 doubi init: statd main process (1049) terminated with status 1
Aug 12 18:46:05 doubi init: statd main process ended, respawning
Aug 12 18:46:05 doubi statd-pre-start: local-filesystems started
Aug 12 18:46:05 doubi rpc.statd[1066]: Version 1.2.2 starting
Aug 12 18:46:05 doubi rpc.statd[1066]: Flags:
Aug 12 18:46:05 doubi rpc.statd[1066]: unable to register (statd, 1, udp).
Aug 12 18:46:05 doubi init: statd main process (1066) terminated with status 1
Aug 12 18:46:05 doubi init: statd main process ended, respawning
Aug 12 18:46:05 doubi statd-pre-start: local-filesystems started
Aug 12 18:46:05 doubi rpc.statd[1074]: Version 1.2.2 starting
Aug 12 18:46:05 doubi rpc.statd[1074]: Flags:
Aug 12 18:46:05 doubi rpc.statd[1074]: unable to register (statd, 1, udp).
Aug 12 18:46:05 doubi init: statd main process (1074) terminated with status 1
Aug 12 18:46:05 doubi init: statd main process ended, respawning
Aug 12 18:46:05 doubi statd-pre-start: local-filesystems started
Aug 12 18:46:05 doubi rpc.statd[1082]: Version 1.2.2 starting
Aug 12 18:46:05 doubi rpc.statd[1082]: Flags:
Aug 12 18:46:05 doubi rpc.statd[1082]: unable to register (statd, 1, udp).
Aug 12 18:46:05 doubi init: statd main process (1082) terminated with status 1
Aug 12 18:46:05 doubi init: statd main process ended, respawning
Aug 12 18:46:05 doubi statd-pre-start: local-filesystems started
Aug 12 18:46:05 doubi rpc.statd[1101]: Version 1.2.2 starting
Aug 12 18:46:05 doubi rpc.statd[1101]: Flags:

[... continues for quite some time... ]

Aug 12 18:46:05 doubi rpc.statd[1109]: unable to register (statd, 1, ...


dennis berger (z-db-b) wrote :

It seems we ran into the same problem today.
System is 11.04 natty from yesterday.

boot.log from today.

Begin: Loading essential drivers ... done.
Begin: Running /scripts/init-premount ... done.
Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done.
Begin: Running /scripts/local-premount ... done.
Begin: Running /scripts/local-bottom ... done.
done.
Begin: Running /scripts/init-bottom ... done.
fsck from util-linux-ng 2.17.2
fsck from util-linux-ng 2.17.2
fsck from util-linux-ng 2.17.2
/dev/mapper/raid1-root: sauber, 356460/3662848 Dateien, 12171413/14648320 Blöcke
/dev/sda1: sauber, 236/488640 Dateien, 118485/975872 Blöcke
/dev/mapper/raid1-profiles: sauber, 12/6553600 Dateien, 426582/26214400 Blöcke
init: portmap-wait (statd) main process (409) killed by TERM signal
init: statd-mounting main process (402) killed by TERM signal
mount.nfs: Failed to resolve server apps: Name or service not known
init: statd main process (432) terminated with status 1
init: statd main process ended, respawning
init: statd main process (446) terminated with status 1
init: statd main process ended, respawning
init: statd main process (454) terminated with status 1
init: statd main process ended, respawning
init: ureadahead-other main process (461) terminated with status 4
init: statd main process (463) terminated with status 1
init: statd main process ended, respawning
init: statd main process (471) terminated with status 1
init: statd main process ended, respawning
init: statd main process (479) terminated with status 1
init: statd main process ended, respawning
init: statd main process (487) terminated with status 1
init: statd main process ended, respawning
init: statd main process (495) terminated with status 1
init: statd main process ended, respawning
init: statd main process (503) terminated with status 1
init: statd main process ended, respawning
init: statd main process (511) terminated with status 1
init: statd main process ended, respawning
init: statd main process (533) terminated with status 1
init: statd respawning too fast, stopped
init: ureadahead-other main process (701) terminated with status 4
mountall: mount /group [433] brach mit dem Status 32 ab
init: statd-mounting main process (723) killed by TERM signal
mount.nfs: Failed to resolve server apps: Name or service not known
mountall: mount /group [760] brach mit dem Status 32 ab
init: statd main process (759) terminated with status 1
init: statd main process ended, respawning
init: statd main process (769) terminated with status 1
init: statd main process ended, respawning
init: statd main process (777) terminated with status 1
init: statd main process ended, respawning
init: statd main process (785) terminated with status 1
init: statd main process ended, respawning
init: statd main process (793) terminated with status 1
init: statd main process ended, respawning
init: statd main process (802) terminated with status 1
init: statd main process ended, respawning
init: statd main process (821) terminated with status 1
init: statd main process ended, respawning
mount.nfs: Failed to reso...


Clint Byrum (clint-fewbar) wrote :

Excerpts from dennis berger's message of Thu Oct 13 09:14:05 UTC 2011:
> It seems we ran into the same problem today.
> System is 11.04 natty from yesterday.
>
> boot.log from today.
>
> Begin: Loading essential drivers ... done.
> Begin: Running /scripts/init-premount ... done.
> Begin: Mounting root file system ... Begin: Running /scripts/local-top ... done.
> Begin: Running /scripts/local-premount ... done.
> Begin: Running /scripts/local-bottom ... done.
> done.
> Begin: Running /scripts/init-bottom ... done.
> fsck from util-linux-ng 2.17.2
> fsck from util-linux-ng 2.17.2
> fsck from util-linux-ng 2.17.2
> /dev/mapper/raid1-root: sauber, 356460/3662848 Dateien, 12171413/14648320 Blöcke
> /dev/sda1: sauber, 236/488640 Dateien, 118485/975872 Blöcke
> /dev/mapper/raid1-profiles: sauber, 12/6553600 Dateien, 426582/26214400 Blöcke
> init: portmap-wait (statd) main process (409) killed by TERM signal
> init: statd-mounting main process (402) killed by TERM signal
> mount.nfs: Failed to resolve server apps: Name or service not known

This is a slightly different problem. Here, you don't have network yet,
so sm-notify can't look up the server to inform it of the reboot. This
is a pretty tricky problem, but one that I think can be solved with some
TLC in the sequencing of statd and mounting.
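
For anyone trying to tell the two failure modes apart in their own boot.log, here is a rough sketch in Python. It only assumes the message formats visible in the excerpt quoted above; the function name summarize_boot_log and the shape of the result are made up for illustration, and other releases may word these messages differently.

```python
import re

# Patterns copied from the boot.log excerpt in this thread (assumption:
# your release logs the same wording).
RESOLVE_RE = re.compile(r"mount\.nfs: Failed to resolve server (\S+): ")
STATD_EXIT_RE = re.compile(
    r"init: statd main process \(\d+\) terminated with status \d+"
)
GAVE_UP_RE = re.compile(r"init: statd respawning too fast, stopped")


def summarize_boot_log(lines):
    """Hypothetical helper: summarize resolver failures and statd exits."""
    unresolved = set()
    statd_exits = 0
    gave_up = False
    for raw in lines:
        line = raw.rstrip("\r\n")
        # Some captures leave a literal "^M" at the end of a line.
        if line.endswith("^M"):
            line = line[:-2]
        match = RESOLVE_RE.search(line)
        if match:
            # Server name that mount.nfs could not resolve.
            unresolved.add(match.group(1))
        if STATD_EXIT_RE.search(line):
            statd_exits += 1
        if GAVE_UP_RE.search(line):
            gave_up = True
    return {
        "unresolved_servers": sorted(unresolved),
        "statd_exits": statd_exits,
        "statd_gave_up": gave_up,
    }


if __name__ == "__main__":
    import sys

    # Usage: python summarize_boot_log.py < /var/log/boot.log
    print(summarize_boot_log(sys.stdin))
```

A non-empty unresolved_servers list suggests the name-resolution problem described here rather than the /var race this bug is about, though that is only a hint and not a diagnosis.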

I'd suggest opening a new bug and linking back your comment from it,
so that we can evaluate the problem.

Also if you can try a similar setup in 11.10, the way the network comes
up has changed somewhat, and may solve the issue.

Thanks!

Johnathon (kirrus) wrote :

Have the patches actually gotten out of proposed now? We've just run headlong into this one.

Steve Langasek (vorlon) wrote :

On Thu, Feb 28, 2013 at 02:23:21PM -0000, Johnathon wrote:
> Have the patches actually gotten out of proposed now? We've just run
> headlong into this one.

They've been out of proposed for quite some time. If you're still seeing
issues, you'll need to provide more information.

I have a possibly related problem: occasional boot lockups when NFS mounts are listed in /etc/fstab. The lockups disappear when the NFS mounts are removed from /etc/fstab and mounted after the machine is up. Our /var/lib/nfs is not on an NFS share, however.

I cannot tell whether this is a duplicate, a related bug, or a different bug. Could someone please look at #1118447 and clarify whether it is related? Thanks in advance!
