retry remote devices when parent is ready after SIGUSR1

Bug #470776 reported by hmpl
This bug affects 29 people
Affects                    Status        Importance  Assigned to                      Milestone
Release Notes for Ubuntu   Won't Fix     Undecided   Unassigned
mountall (Ubuntu)          Fix Released  Medium      Scott James Remnant (Canonical)
Nominated for Karmic by Johan Walles

Bug Description

Binary package hint: mountall

Hi,
I have a problem similar to the one in https://bugs.launchpad.net/ubuntu/+source/mountall/+bug/461133: my NFS shares are not mounted at boot time. However, I don't think both problems have the same cause.
I think my problem is more closely related to https://bugs.launchpad.net/ubuntu/+source/mountall/+bug/446678, but I'm not sure.

Release: Ubuntu 9.10

This is what happens:
- The GRUB (still old GRUB) line is: /vmlinuz-2.6.31-14-generic root=/dev/mapper/system-lvgggroot ro quiet splash
- root is mounted read-only because of the "ro" in the GRUB line
- After the network is up, mountall receives SIGUSR1 (as expected)
- At this point root has not yet been remounted read-write, so the NFS shares cannot be mounted:
  usr1_handler: Received SIGUSR1 (network device up)
  try_mount: /Bignfs waiting for /
  try_mount: /Backnfs waiting for /
  usr1_handler: Received SIGUSR1 (network device up)
  try_mount: /Bignfs waiting for /
  try_mount: /Backnfs waiting for /
- After root is remounted, the NFS shares are still not mounted, because they are now waiting for SIGUSR1:
  try_mount: /Bignfs waiting for device gigakobold.daheim:udatanoback
  try_mount: /Backnfs waiting for device gigakobold.daheim:udataback
- mountall keeps running
- Sending SIGUSR1 manually gets everything mounted, and mountall exits normally

So the problem is: mountall doesn't remember that it has already received SIGUSR1 (i.e. that the network is up).
So if the NFS filesystems cannot be mounted at the moment the network-up signal arrives, later retries will not work either.

Regards
Markus

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

Your assumption is, I think, correct. When SIGUSR1 is received, mountall tries to bring up the mount but fails because the parent mount isn't ready yet. When it later brings up the parent, it hasn't remembered that it may now bring up the network mount, so it waits for SIGUSR1 again.
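
For anyone trying to follow the failure mode, here is a minimal toy model in C of the behaviour described above. It is not mountall's actual source; the struct, the function names and the flag-clearing detail are illustrative assumptions only.

  /* Toy model (NOT mountall's real code) of the reported behaviour: the
   * SIGUSR1 "network up" event is treated as a one-shot trigger instead
   * of being recorded, so a remote mount still blocked on its parent at
   * that moment is never retried later. */
  #include <stdbool.h>
  #include <stdio.h>

  struct mnt {
      const char *mountpoint;
      bool remote;        /* NFS/CIFS mount */
      bool parent_ready;  /* parent mountpoint (e.g. /) mounted read-write */
      bool net_ready;     /* have we been told the network is up? */
      bool mounted;
  };

  static void try_mount(struct mnt *m)
  {
      if (!m->parent_ready) {
          printf("try_mount: %s waiting for /\n", m->mountpoint);
          return;
      }
      if (m->remote && !m->net_ready) {
          printf("try_mount: %s waiting for device\n", m->mountpoint);
          return;
      }
      m->mounted = true;
      printf("try_mount: %s mounted\n", m->mountpoint);
  }

  /* One plausible way the bug arises: the handler only treats the signal
   * as "try now" and forgets it immediately afterwards. */
  static void usr1_handler_buggy(struct mnt *m)
  {
      printf("usr1_handler: Received SIGUSR1 (network device up)\n");
      m->net_ready = true;
      try_mount(m);
      m->net_ready = false;   /* the event is not remembered */
  }

  int main(void)
  {
      struct mnt nfs = { "/Bignfs", true, false, false, false };

      usr1_handler_buggy(&nfs);  /* root still read-only: "waiting for /"    */
      nfs.parent_ready = true;   /* root remounted read-write                */
      try_mount(&nfs);           /* still "waiting for device": stuck forever */
      return !nfs.mounted;
  }

Running this prints the same sequence seen in the reporter's debug log: the share first waits for /, then waits for its device and never comes up until another SIGUSR1 arrives.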

summary: - karmic: nfs shares not mounted on boot with ro-root
+ retry remote devices when parent is ready after SIGURS1
summary: - retry remote devices when parent is ready after SIGURS1
+ retry remote devices when parent is ready after SIGUSR1
Changed in mountall (Ubuntu):
status: New → Triaged
importance: Undecided → Medium
Changed in mountall (Ubuntu):
status: Triaged → Fix Committed
milestone: none → lucid-alpha-2
assignee: nobody → Scott James Remnant (scott)
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package mountall - 2.0

---------------
mountall (2.0) lucid; urgency=low

  [ Scott James Remnant ]
  * "mount" event changed to "mounting", to make it clear it happens
    before the filesystem is mounted. Added "mounted" event which
    happens afterwards.
  * Dropped the internal hooks, these are now better handled by Upstart
    jobs on the "mounted" event.
  * Dropped the call to restorecon for tmpfs filesystems, this can also be
    handled by an Upstart job supplied by SELinux now.
    - mounted-dev.conf replaces /dev hook, uses MAKEDEV to make devices.
    - mounted-varrun.conf replaces /var/run hook
    - mounted-tmp.conf replaces /tmp hook.
      + Hook will be run for any /tmp mountpoint. LP: #478392.
      + Switching back to using "find" fixes $TMPTIME to be in days again,
        rather than hours. LP: #482602.
  * Try and make mountpoints, though we only care about failure if the
    mountpoint is marked "optional" since otherwise the filesystem might
    make the mountpoint or something.
  * Rather than hiding the built-in mountpoints inside the code, put them
    in a new /lib/init/fstab file; that way users can copy the lines into
    /etc/fstab if they wish to override them in some interesting way.
  * Now supports multiple filesystem types listed in fstab, the whole
    comma-separated list is passed to mount and then /proc/self/mountinfo
    is reparsed to find out what mount actually did.
    * /dev will be mounted as a devtmpfs filesystem if supported by the
      kernel (which then does not need to run the /dev hook script).
  * Filesystem checks may be forced by adding force-fsck to the kernel
    command-line.
  * Exit gracefully with an error on failed system calls, don't infinite
    loop over them. LP: #469985.
  * Use plymouth for all user communication, replacing existing usplash and
    console code;
    * When plymouth is running, rather than exiting on failures, prompt the
      user as to whether to fix the problem (if possible), ignore the problem,
      ignore the mountpoint or drop to a maintenance shell. LP: #489474.
    * If plymouth is not running for whatever reason, the fallback action
      is always to start the recovery shell.
  * Adjust the set of filesystems that we wait for by default: LP: #484234.
    * Wait for all local filesystems, except those marked with the
      "nobootwait" option.
    * Wait for remote filesystems mounted as, or under, /usr or /var, and
      those marked with the "bootwait" option.
  * Always try network mount points, since we allow them to fail silently;
    SIGUSR1 now simply retries them once more. LP: #470776.
  * Don't retry devices repeatedly. LP: #480564.
  * Added manual pages for the events emitted by this tool.

  [ Johan Kiviniemi ]
  * Start all fsck instances in parallel, but set their priorities so that
    thrashing is avoided. LP: #491389.
 -- Scott James Remnant <email address hidden> Mon, 21 Dec 2009 23:09:23 +0000

Changed in mountall (Ubuntu):
status: Fix Committed → Fix Released
Revision history for this message
Patrick (patrick-ostenberg) wrote :

When is this going to be fixed in "karmic"?

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote : Re: [Bug 470776] Re: retry remote devices when parent is ready after SIGUSR1

On Thu, 2009-12-31 at 11:52 +0000, Patrick wrote:

> When is this going to be fixed in "karmic"?
>
It isn't.

Scott
--
Scott James Remnant
<email address hidden>

Revision history for this message
Dan (danser) wrote :

> > When is this going to be fixed in "karmic"?
> It isn't.

Um, but my nested NFS mounts don't work in Karmic. Or should I build a backport of mountall from Lucid for my machines?

Revision history for this message
Scott James Remnant (Canonical) (canonical-scott) wrote :

On Mon, 2010-01-04 at 18:14 +0000, Dan wrote:

> > > When is this going to be fixed in "karmic"?
> > It isn't.
>
> Um, but my nested NFS mounts don't work in Karmic. Or should I build a
> backport of mountall from Lucid for my machines?
>
A Lucid backport won't work, you'd need to backport more than just
mountall (i.e. Lucid's mountall is based on plymouth not usplash, and
Lucid's udev, etc.)

Scott
--
Scott James Remnant
<email address hidden>

Revision history for this message
Alvin (alvind) wrote :

Without a fix, a lot of servers using NFS can't even boot unattended. Isn't that important enough for a separate fix?

Revision history for this message
Jason Straight (jason-jeetkunedomaster) wrote :

My local library is set up to netboot 30 workstations, which won't boot well with karmic.

Revision history for this message
Johan Walles (walles) wrote :

Since I have to boot via a rescue shell nowadays, I filed bug 504271 requesting a fix for this issue in Karmic.

Revision history for this message
Alvin (alvind) wrote :

Is it possible that this issue also affects mounting CIFS filesystems at boot?

Revision history for this message
Johan Walles (walles) wrote :

This is a request for an SRU.

This bug prevents machines from booting if home directories are on NFS. My machine was able to boot Jaunty so this is a regression. Not being able to boot is severe.

According to https://wiki.ubuntu.com/StableReleaseUpdates, SRUs are issued:
* to fix high-impact bugs.
* for bugs which represent severe regressions

This issue should mostly affect corporate, educational and public users (like libraries). Most home users are likely not affected by this. I can't judge where your main customer base is, but for me this is a big problem.

If the SRU team says "no", I'll just shut up and try not to reboot so much for the next four months.

Revision history for this message
ianc (ian-criddle) wrote :

Well, I'm a home user, I'm affected - and the issue is a right PITA...

Karmic's inability to mount NFS drives at boot is a major regression.

So if the mechanism for solving the problem is an SRU then I vote for that solution too please.

Revision history for this message
Steve Langasek (vorlon) wrote :

I believe an SRU is unfortunately not practical here because the fix is intertwined with significant architectural changes to mountall that can't be backported to 9.10. Scott, please correct me if I'm wrong.

Revision history for this message
Dan (danser) wrote :

The attached patch is sufficient to fix this bug for me in Karmic, but is probably not a very good long-term solution. It simply sets the "ready" flag of each remote filesystem after SIGUSR1 is received, before attempting to mount them all.
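
A rough illustration of the approach Dan describes, reusing the toy struct mnt and try_mount from the sketch earlier in this report (this is not the attached patch itself, which works against mountall's real data structures):

  /* Illustrative only: mark every remote filesystem as ready when SIGUSR1
   * arrives, *before* retrying, so that a later retry triggered by the
   * parent becoming ready can still proceed. */
  static void usr1_handler_fixed(struct mnt *mounts, size_t n)
  {
      printf("usr1_handler: Received SIGUSR1 (network device up)\n");

      for (size_t i = 0; i < n; i++)
          if (mounts[i].remote)
              mounts[i].net_ready = true;   /* remembered from now on */

      for (size_t i = 0; i < n; i++)
          if (mounts[i].remote)
              try_mount(&mounts[i]);
  }

With this change, the later try_mount call in the earlier example succeeds as soon as the parent is remounted read-write, because net_ready stays set.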

Revision history for this message
Alvin (alvind) wrote :

Subscribing ubuntu-release-notes because upgrading in environments that rely on NFS might not be a good idea.

Revision history for this message
Jürgen Sauer (juergen-sauer) wrote :

This bug is a very, very critical and nasty problem (a K.O. criterion) for karmic use in commercial environments.

This is a mission-critical bug, which makes karmic, and therefore Ubuntu, unusable in corporate standard setups.

It must be fixed.

J. Sauer

Dan (danser)
tags: added: verification-needed
Revision history for this message
Russel Winder (russel) wrote :

I have a workstation with a fresh Karmic install and am seeing that all my NFS mounts are failing at boot time with this same message:

    One or more of the mounts listed in /etc/fstab cannot yet be mounted:
    . . .
    Press ESC to enter a recovery shell.

Other bug reports with the same strings are marked as duplicates of this one, hence I am reporting it here.

On trying to mount things manually in the "repair shell" I get:

    mount.nfs: DNS resolution failed for . . .

for the server for each mount.

Seems like mount is being called before DNS lookups are set up.

This problem means Ubuntu workstations just do not work. Reading the notes above, it seems that this problem is fixed in Lucid but that there is no intention to even try to fix it in Karmic. Is this actually true? Is there no fix for Karmic?

Revision history for this message
Russel Winder (russel) wrote :

Actually, I don't have eth0, which is why DNS isn't working and also why nothing on the network will work. So why is the boot sequence trying to mount NFS shares before the network interface is started?

So how do I get the eth0 interface started so that the machine can boot sensibly?

Revision history for this message
Steve Langasek (vorlon) wrote :

Russel,

This bug is not about trying to mount NFS mounts before the network is up. That's expected behavior, and does not affect the reliability of NFS mounts at boot time. If your eth0 isn't coming up, then of course NFS mounts will fail; you should determine why your network isn't being brought up.

Revision history for this message
Russel Winder (russel) wrote :

Steve,

It seems as though Network Manager is as big a crock of **** as it ever was -- at least for workstations that are either connected via eth0 or not connected at all.

I purged Network Manager and added the appropriate lines to /etc/network/interfaces. Now when I boot I still get the same message about being unable to mount NFS filestores and being told to press ESC, but after waiting a few seconds the message goes away, the boot resumes, and all the NFS mounts end up mounted as they should be.

So now I don't have a blocker problem, but I must say the overall user experience here is very poor. Either the sequencing of the various boot-time events is not right, or the error reporting by the various stages needs better management, or both.

Speaking from a position of deep ignorance about the details, it seems that an attempt is made to mount all the devices in /etc/fstab at a point in the boot sequence when only the locally connected disks can be mounted, because the network is not up yet. So until the network is up, either no attempt to mount the NFS shares should be made, or the error messages should be redirected to /dev/null. Then, when the second mount attempt is made after the network is up (assuming Network Manager has been expunged), error messages for any failures would be appropriate.

Revision history for this message
Steve Langasek (vorlon) wrote :

Network Manager is supposed to bring up wired network connections at boot time. If this isn't happening for you, then that's either a bug in network-manager or a configuration problem of some sort. I would encourage you to file a bug report on network-manager about that issue.

I agree with you that this is not a good user experience, but your experience is certainly exceptional. Most users of NFS (including myself) have seen no such problem in karmic, aside from the warning that the NFS mounts are not yet available - messages which I believe are being suppressed in lucid. Again, the *attempt* to mount the NFS filesystems early is not itself an error; this is done for compatibility with systems where the network interface is brought up in the initramfs (e.g., for systems using NFS root), the only issue is the distracting message.

Revision history for this message
Alvin (alvind) wrote :

On 11/02/2010 14:06, Steve Langasek wrote:
> Most users of NFS (including myself) have seen no such problem in karmic, aside from the warning that the NFS mounts are not yet available - messages which I believe are being suppressed in lucid.

My experience is different. Out of 10 NFS clients, most do not succeed
in mounting all NFS shares at boot time. Some hang at boot (after the
distracting warning message); most need to have their shares mounted
manually after boot. Some succeed in mounting and booting, but with the
warnings.

I experience this issue on 3 different networks, with different NFS
servers, on fresh karmic installations and on upgrades from Jaunty. You
can even test this on virtual machines.

Revision history for this message
K Mori (kmori-nospam) wrote :

I am/was also affected by this. I run a home network with a main NFS server and 4 diskless clients which need root over NFS as well as other NFS mounts. I had been able to keep this setup going since Dapper, until this past upgrade to Karmic. After struggling with this for the last 2 days I can at least complete a bootup. The main changes I had to make were:
1. Change the kernel option in my pxe setup to use nolock
2. Add "/dev/nfs / nfs defaults,nolock 0 0" to my fstab. Previously, it wasn't necessary to have my root dir listed in the fstab.
3. Add nolock to all the other nfs mount entries in the fstab.
4. Add remounts to set lock in rc.local

Revision history for this message
Rory McCann (rorymcc) wrote :

I was affected by this on my karmic box.

My workaround was to add "noauto" to the /etc/fstab line for the NFS mount, and then add "mount /path/to/mount" to /etc/rc.local. This means the directory still gets mounted at startup, but at the very end of it. This worked for my situation.
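
For anyone wanting to try the same workaround, here is a sketch with placeholder names (the server, export path and mountpoint are hypothetical; substitute your own):

  # /etc/fstab: "noauto" keeps mountall from trying the share during boot
  server.example.com:/export/data  /mnt/data  nfs  rw,noauto  0  0

  # /etc/rc.local: mount it at the very end of boot, before the final "exit 0"
  mount /mnt/data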

Revision history for this message
Brian Morton (rokclimb15) wrote :

I was running into this issue on Karmic and now I have tried Lucid Beta 1. The behavior here is a little different because the shares do mount on startup, but error messages are still displayed in the terminal.

mount.nfs: DNS resolution failed for <hostname>: Name or service not known
mountall: mount /media/mount [930] terminated with status 32

The release notes for this fix state that it should fail silently. This could be very misleading in an LTS release. Is my observed behavior intended, or is this a regression in Beta 1?

Revision history for this message
Steve Langasek (vorlon) wrote :

Brian,

The mount.nfs messages are shown if you aren't using the graphical splash, but this should also be resolved in the next upload of plymouth to lucid.

Revision history for this message
Brian Morton (rokclimb15) wrote :

Thanks for the helpful info. I'll make sure to report back after that package is updated.

Revision history for this message
Alvin (alvind) wrote :

The bug about the messages is bug 504224.

Steve, I hope you mean the reason for the messages will be gone in Lucid and not: "don't worry, we'll cover the messages with a nice looking boot splash."

Older Unix admins will have a heart attack if they see a boot splash on a server OS.

Revision history for this message
Steve Langasek (vorlon) wrote :

On Thu, Mar 25, 2010 at 03:03:50PM -0000, Alvin wrote:
> Steve, I hope you mean the reason for the messages will be gone in Lucid

No, I do not.

> and not: "don't worry, we'll cover the messages with a nice looking boot
> splash."

They're *already* covered in lucid on systems which support the boot splash.
Shortly, they will also be hidden on systems which don't support the
graphical boot splash.

If you boot without 'splash', you'll still get these messages and many
others.

> Older Unix admins will have a heart attack if they see a boot splash on
> a server OS.

I would advise anyone with such a weak heart, regardless of age, to consult
their physician. With good diet and exercise, heart attacks are
preventable.

--
Steve Langasek Give me a lever long enough and a Free OS
Debian Developer to set it on, and I can move the world.
Ubuntu Developer http://www.debian.org/
<email address hidden> <email address hidden>

Revision history for this message
Alvin (alvind) wrote :

Meanwhile we have bug 548954. I'm glad other people agree that hiding boot messages on servers is a bad idea. The bug is even marked critical, but why is this one not critical? Only because it is 'hard to fix' in Karmic?

description: updated
Revision history for this message
Loïc Minier (lool) wrote :

Is this still an issue in lucid final?

Revision history for this message
Steve Langasek (vorlon) wrote :

There aren't any known remaining reliability problems in lucid regarding stacked NFS mounts, and I don't think we'll go back now to document this in the release notes for 9.10. Marking wontfix.

Changed in ubuntu-release-notes:
status: New → Won't Fix
Revision history for this message
Michael Palmer (mp4) wrote :

Hi Steve,

Thanks for your work on this.

Just to make sure you understand (since you mentioned the problem doesn't happen for you): this bug is not just about warning messages. It actually drops you into the rescue shell, and the boot process stops there, waiting for console input.

This happens about 10% of the time per node for me (I gather it happens when some race condition occurs). This makes unattended boot of a cluster unusable. E.g., I have 20 nodes in a cluster, so almost always one or more nodes fail to boot.

I am mounting /home and /usr/local over NFS.
evoa1:/home /home nfs rw 0 0
evoa1:/usr/local /usr/local nfs rw 0 0

I tried the workaround suggested above - putting noauto in /etc/fstab and then mounting the directories in /etc/rc.local - however, users can then ssh into the nodes before their home directories are mounted, and that causes other problems. (We are running some job-queuing software, and queued jobs may try to start up as soon as a node is up.)

A workaround would be helpful, e.g. just knowing the right place to put a "sleep 30" to reduce the frequency of the race condition.

thanks,

Mike

Revision history for this message
Anton Altaparmakov (aia21) wrote :

I am using a fresh install of Ubuntu 10.04 LTS Server and have used virt-manager to set up an NFS storage repository (located on another machine).

On boot the storage fails to mount.

First it was failing with "statd not running", and I fixed that by editing /etc/init/statd.conf, making this change:

-start on (started portmap or mounting TYPE=nfs)
+start on ((started portmap and local-filesystems) or mounting TYPE=nfs)

Statd is now started but the NFS mount still does not happen at boot!

When I then launch virt-manager and click the "Play/Start" button on the Storage tab after selecting the NFS storage, it mounts the share. But I want it to be mounted automatically at boot rather than having to do it by hand each time the machine boots...

Note that the server has a static (non-DHCP) IP address, as all servers should! I don't want booting to fail because the DHCP server has gone away...

Best regards,

Anton
