NFS root device never ready

Bug #430348 reported by Tobias Wolf on 2009-09-15
This bug affects 11 people
Affects            Importance  Assigned to
mountall (Ubuntu)  High        Scott James Remnant (Canonical)
Karmic             High        Scott James Remnant (Canonical)

Bug Description

Binary package hint: mountall

I have a curious displayless server under my bed that boots via PXE netboot.
Its root filesystem is a remote nfsroot, but it has a locally attached drive for /home.

The new init system is not very clear to me, so I’m unsure how to find out what’s wrong.
Here are some symptoms:

- Boots past init-bottom scripts
- Errors out on mountall exit code 1
- Spawns root shell

- Output of mountall -v in root shell is:

mount / [1191] exited normally
mount /proc [1193] exited normally
mount /sys [1194] exited normally
mount /sys/fs/fuse/connections [1195] exited normally
mount /sys/kernel/debug [1196] exited normally
mount /sys/kernel/security [1197] exited normally
mount /dev [1198] exited normally
mount /dev/pts [1199] exited normally
mount /dev/shm [1200] exited normally
mount /var/run [1201] exited normally
mount /var/lock [1202] exited normally
mount /lib/init/rw [1203] exited normally
virtual filesystems finished
swap finished
remote finished
  (exit code is 1)

- Contents of fstab are:

# file system mount point type options dump pass
proc /proc proc defaults 0 0
/dev/nfs / nfs noatime 0 0
# /dev/sda7
LABEL=sams-home /home ext4 noatime 0 2
# /dev/sda8
LABEL=sams-data /home/towolf/data xfs noatime,user 0 2
# /dev/sda6
LABEL=sams-swap none swap sw 0 0

- Kernel boot line in pxeconfig is:

LABEL samsdiskless
  kernel root/sams/vmlinuz
  append root=/dev/nfs nfsroot=10.0.0.1:/pxe/sams,nfsvers=3,nolock,udp ip=dhcp initrd=root/sams/initrd.img rw --

I need guidance to get to the bottom of this. Why is mountall exiting with code 1 anyway?
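
For reference, the nfsroot= value in the kernel line above follows the kernel's [server:]path[,options] layout. The following is a minimal sketch of splitting such a value in POSIX shell. The function name parse_nfsroot and its output format are made up for illustration and are not part of any initramfs script; IPv6 server addresses are not handled.

```shell
#!/bin/sh
# Hypothetical helper (not from initramfs-tools or mountall):
# split an nfsroot= value of the form [server:]path[,options].
parse_nfsroot() {
    value=$1
    # Everything after the first comma is the NFS option list.
    case "$value" in
        *,*) opts=${value#*,}; value=${value%%,*} ;;
        *)   opts= ;;
    esac
    # Everything before the first colon is the server, if present.
    case "$value" in
        *:*) server=${value%%:*}; path=${value#*:} ;;
        *)   server=; path=$value ;;
    esac
    echo "server=$server path=$path options=$opts"
}

# Value taken from the pxeconfig entry in this report.
parse_nfsroot "10.0.0.1:/pxe/sams,nfsvers=3,nolock,udp"
```

For the value in this report, the options part is nfsvers=3,nolock,udp, which is what the initrd uses when it mounts the root before mountall ever runs.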

Tobias Wolf (towolf) wrote :

It got worse.

I upgraded after today’s batch of boot-related uploads and now it hangs after /scripts/init-bottom.
No further output, no root shell.

No idea if this is still mountall. No output no clue.

Unknown, if you could test 0.1.6 that would be appreciated - at least that should output to the console why it's terminating

Changed in mountall (Ubuntu):
importance: Undecided → High
status: New → Incomplete
Tobias Wolf (towolf) wrote :

Hrmpf. I was using 0.1.6. There isn’t any output unfortunately. The last line is "pera pera /scripts/init-bottom".

I’m going to try making control-alt-delete spawn a sulogin because I noticed that control-alt-delete still shuts down.
Wonder what I will be looking for, though.

Tobias Wolf (towolf) wrote :

Hm. I’m not sure what I’m doing. The sulogin hack worked. When I logged in the output of «initctl list» was:

rc stop/waiting
udev start/running, process 1111
mountall-net stop/waiting
upstart-udev-bridge start/running, process 1109
rsyslog stop/waiting
avahi-daemon stop/waiting
hwclock-save stop/waiting
dbus stop/waiting
atd stop/waiting
control-alt-delete start/running, process 1425
hwclock stop/waiting
module-init-tools stop/waiting
mountall start/running, process 1090
cron stop/waiting
rcS stop/waiting
acpid stop/waiting
hal stop/waiting
dbus-reconnect stop/waiting
rc-sysinit stop/waiting
udevtrigger stop/waiting
anacron stop/waiting
tty2 stop/waiting
udev-finish stop/waiting
rsyslog-kmsg stop/waiting
hostname stop/waiting
mountall-reboot stop/waiting
udevmonitor stop/waiting
network-interface (wmaster0) start/running
network-interface (lo) start/running
network-interface (eth0) start/running
network-interface (wlan0) start/running
mountall-shell stop/waiting
tty1 stop/waiting
networking stop/waiting
dmesg stop/waiting
procps stop/waiting

That is, mountall was running, but the local disk was not mounted at all. The ethernet and root mount already were up courtesy of the initrd of course.

So I ran «mount -a», which worked fine. Other than that I was in a pretty much uninitialized state. rsyslog, dbus, avahi-daemon etc. were not running, and I’m not sure networking is supposed to be in stopped state.

Then I ran «init 5» and the scripts in rc5.d ran, but I had to «start rsyslog» and «start dbus» manually.

Is this in some way my own fuckup? I just dist-upgraded after all?!?

Tobias Wolf (towolf) wrote :

Oh, by the way:

    $ mountall --version
    mountall 0.1.0

Shouldn’t that be 0.1.6?

Probably, but that's kinda minor

Check what "dpkg-query -W" says

From the shell, could you run "stop mountall". And then try "mountall --debug"

If you can pipe the output somewhere (e.g. to under /dev) to capture it, and attach it to this bug, that would be great

summary: - new mountall exits with code 1. only root shell using PXE netboot.
+ PXE netboot root not being handled correctly

Here you go.

- It booted to the line /scripts/init-bottom \n done
- I pressed control-alt-del to get a sulogin
- I started screen

The log is a copy of what you told me to do and some additional info.

On Thu, 2009-09-17 at 18:22 +0000, Tobias Wolf wrote:

> Here you go.
>
> - It booted to the line /scripts/init-bottom \n done
> - I pressed control-alt-del to get a sulogin
> - I started screen
>
> The log is a copy of what you told me to do and some additional info.
>
> ** Attachment added: "mountall --debug"
> http://launchpadlibrarian.net/31980649/mountall-transcript
>
If you open another console and do:

  killall -USR1 mountall

does it output more stuff? If so, attach that.

Then if that does something, try:

  status udev
  (if running, "start udevtrigger", if not running, "start udev" then
   udevtrigger)

Scott
--
Scott James Remnant
<email address hidden>

I believe that this is stuck because the root is a remote filesystem, but mountall is hardwired to believe it's a local filesystem - thus it never believes that the root is ready and never mounts it.

Worse still, if it did think the root was a remote filesystem, it would never mount it unless something brought an interface up (even lo).
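
To make the local/remote distinction above concrete, here is a minimal sketch in POSIX shell. It is not mountall's actual logic: it assumes a filesystem is "remote" purely because its fstab type appears in a small illustrative list, and the function names fs_class and classify_fstab are hypothetical. The input is a trimmed copy of the fstab from this report.

```shell
#!/bin/sh
# Hypothetical helper, not part of mountall: classify an fstab type.
# Assumption: the type lists below are illustrative, not exhaustive.
fs_class() {
    case "$1" in
        nfs|nfs4|cifs|smbfs)     echo remote ;;
        swap)                    echo swap ;;
        proc|sysfs|tmpfs|devpts) echo virtual ;;
        *)                       echo local ;;
    esac
}

# Print "mountpoint class" for each non-comment fstab line on stdin.
classify_fstab() {
    while read -r dev mnt type rest; do
        # Skip blank lines and comment lines.
        case "$dev" in ''|'#'*) continue ;; esac
        echo "$mnt $(fs_class "$type")"
    done
}

classify_fstab <<'EOF'
# file system mount point type options dump pass
proc /proc proc defaults 0 0
/dev/nfs / nfs noatime 0 0
LABEL=sams-home /home ext4 noatime 0 2
LABEL=sams-swap none swap sw 0 0
EOF
```

Under this assumption the root comes out as remote, so a mounter that hardwires "/" as local would wait for a block device that never appears.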

summary: - PXE netboot root not being handled correctly
+ NFS root device never ready

On Thu, 2009-09-17 at 19:15 +0000, Scott James Remnant wrote:

> If you open another console and do:
>
> killall -USR1 mountall
>
> does it output more stuff? If so, attach that.

I didn’t reboot anew, because that’s really cumbersome.
This is via SSH now:

    usr1_handler: Received SIGUSR1 (network device up)

> Then if that does something, try:
>
> status udev
> (if running, "start udevtrigger", if not running, "start udev" then
> udevtrigger)

    root@sams:/# killall -USR1 mountall
    root@sams:/# status udev
    udev start/running, process 28029
    root@sams:/# start udevtrigger
    ^C

 ... I had to Ctrl-C it, it was stuck. Then ...

    root@sams:/# status udev
    udev start/running, process 28029
    root@sams:/# status udevtrigger
    udevtrigger start/starting

  ... and no change in the console with mountall running.

What now?

--Tobias

Changed in mountall (Ubuntu):
status: Incomplete → In Progress
assignee: nobody → Scott James Remnant (scott)
tags: added: ubuntu-boot
Lou Ruppert (louferd) wrote :

How did you get the sulogin to work for ctrl alt del? I tried replacing the shutdown line in the conf file for upstart with 'exec sulogin' but it doesn't spawn the login. Also tried dash -i. I'm trying to debug this problem as well, but you've gotten farther along on it than I have.

On Thu, 2009-09-17 at 22:02 +0000, Lou Ruppert wrote:
> How did you get the sulogin to work for ctrl alt del?

I just looked around and made a minor edit. See the attachment.

Lou Ruppert (louferd) wrote :

In case the people debugging this don't have a diskless pxe system to work from, I wonder if the difference between what the mtab says and what the fstab says is confusing mountall. Example:

fstab:

/dev/nfs / nfs defaults 0 0

mtab:

192.168.23.2:/export/diskless/fajr on / type nfs (rw,addr=192.168.33.2)

Meanwhile, for people who need to work on their machines while this is being debugged, I got mine to start with this really awful hack, which I will delete from /etc/init once my mountall is updated:

# fake-mountall - Mount filesystems on boot
#
# This helper mounts filesystems in the correct order as the devices
# and mountpoints become available.

description "Mount filesystems on boot"

start on startup

expect daemon
task

emits local-filesystems
emits filesystem

# temporary, until we have progress indication
# and output capture (next week :p)
console output

script
    echo "Sleeping to let regular mountall do its funky thing."
    sleep 10
    initctl emit local-filesystems
    exec initctl emit filesystem
end script

Twigathy (twigathy) wrote :

I too have this problem, although in a slightly different form. I have /boot on a CompactFlash and root on an NFS server.

The entry in grub reads as follows:
title Ubuntu 9.10 kernel 2.6.31-6-generic NETBOOT
root (hd0,0)
kernel /vmlinuz-2.6.31-6-generic root=/dev/nfs nfsroot=192.168.0.30:/mnt/monkey/corona/ ip=dhcp
initrd /initrd.img-2.6.31-6-generic

Boot hangs somewhere very very early, just after init-bottom, although it's hard to see precisely where. Stops after a pile of complaints from udev (inotify_add_watch failed: bad address).

Changed in mountall (Ubuntu Karmic):
milestone: none → ubuntu-9.10-beta
Changed in mountall (Ubuntu Karmic):
status: In Progress → Fix Committed

The ubuntu-boot PPA has a new version of mountall which I hope will fix this problem. Please install it and let me know how it works out.

Tobias Wolf (towolf) wrote :

Yeah, phew. It works.

By the time I got to it a new mountall in Main had already superseded your PPA version. And despite printing out ‹apt-cache policy mountall› I didn’t realise what was going on. Sometimes you’re just struck by blindness in the morning.

So, thank you Scott!

Twigathy (twigathy) wrote :

Thanks for that fix -- I will test it once I have backed up root and report back later today.... :-)

Lou Ruppert (louferd) wrote :

The one on the ubuntu-boot PPA works great. The one in Main does not. I upgraded to the one in Main and it stalled at the same point it always has. I manually "downgraded" to the one in the PPA and it booted fine.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package mountall - 0.1.8

---------------
mountall (0.1.8) karmic; urgency=low

  [ Scott James Remnant ]
  * Further work on the fix from the previous version where the root
    filesystem would always be considered "local", retain that from the
    POV of the {virtual,local,remote}-filesystems events, but do mount
    the root straight away when it's virtual since there's no device to
    wait until it's ready. LP: #431204.
  * If a remote filesystem is already mounted and doesn't need a remount,
    don't wait for a network device to come up. LP: #430348.

  * Ignore single and double quotes in fstab device specifications, since
    mount -a used to. LP: #431064.
  * Never write mtab when mounting a mount with showthroughs (ie. /var)
    and instead update mtab once we've moved it into place
    later. LP: #434172.

  [ Kees Cook ]
  * src/mountall.c: rework nftw hooks to use a global for argument passing
    instead of using nested functions and the resulting trampolines that
    cause an executable stack. LP: #434813.
  * debian/rules: revert powerpc exception, since the cause is fixed by
    removing the nested functions.

 -- Scott James Remnant <email address hidden> Wed, 23 Sep 2009 14:19:01 -0700
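
The second changelog item above (LP: #430348) can be read as a small decision rule. The sketch below is only an illustration of that rule in POSIX shell, assuming three simplified yes/no style inputs; the function name must_wait_for_network and its arguments are hypothetical and do not appear in the mountall sources.

```shell
#!/bin/sh
# Hypothetical decision helper; names and arguments are illustrative.
#   $1: filesystem class ("remote" or anything else)
#   $2: "yes" if the filesystem is already mounted
#   $3: "yes" if it still needs a remount
# Returns 0 (true) when the mount should wait for a network device.
must_wait_for_network() {
    # Only remote filesystems ever depend on the network.
    [ "$1" = "remote" ] || return 1
    # Already mounted and no remount needed: nothing left to wait for.
    if [ "$2" = "yes" ] && [ "$3" != "yes" ]; then
        return 1
    fi
    return 0
}

# An NFS root mounted read-write by the initrd: no need to wait.
if must_wait_for_network remote yes no; then
    echo "wait for network device"
else
    echo "proceed without waiting"
fi
```

With an NFS root that the initrd has already mounted read-write (the "rw" on the kernel line in this report), this rule lets boot continue instead of blocking on a network event.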

Changed in mountall (Ubuntu Karmic):
status: Fix Committed → Fix Released
Lou Ruppert (louferd) wrote :

Problem solved. Nice work!

I have "mountall 1.0" in karmic root upgraded from jaunty, and I can't boot ltsp clients because of this bug
PS: Sorry for my english :)

Just like (http://<email address hidden>/msg37462.html):

When I switch the new chroot to nfs (still basically
according to https://help.ubuntu.com/community/UbuntuLTSP/LTSPWithoutNFS
), the thin client does boot up but then apparently cannot mount the
nfsroot, or so it seems - the last output message on the screen is from
scripts/nfs-bottom saying /root/etc/hostname can't be written because
it's a read-only filesystem, then there seems to be a call to mountall
that ends in some sort of timeout. According to the messages on the
screen from mtab, the root is already mounted on /, but it's still
trying to mount / and /tmp and fails.

I'm still seeing this issue with mountall 1.0 in Karmic on an nfsroot, just like Dolganov.
mountall freezes trying to remount root and the other NFS shares listed in fstab. That can't work,
as neither portmap nor statd has started yet.

I haven't looked closely enough at the mountall sources, but it appears to happen in the progress_timer callback; check when the root device was last checked.

raliegh (steve-ubuntu-sr-tech) wrote :

Ever since Karmic was released, I have not been able to upgrade my Jaunty root over nfs system. I get the message "inotify_add_watch failed: bad address." as others have. I've tried manually upgrading the mountall package to versions 1.8 and 2.2 with no luck. I'm running kernel 2.6.31 that I compiled with the correct options for root over nfs, and which has worked flawlessly with Jaunty.

From what I can gather, all the changes to the Linux startup in Karmic are to decrease boot time, which is nice. But how about leaving an option to revert to old behavior for those that rely on it?

Has anyone succeeded with root over nfs using Karmic yet? If so, care to share what your tweaks were?

-Steve

Harry Rickards (hrickards) wrote :

I'm getting the same problem with lucid.

I updated my nfs root Karmic install and tried again - same problem persists.

Similarly, all of my Karmic systems are having troubles with NFS mounts at boot time. Since Jaunty, IIRC, they all initially indicate that they've failed to mount nfs volumes at boot time - usually with that "hit esc to get a recovery shell" message. After waiting a few moments, the mounts come up fine and boot continues. Recently, I was surprised to notice how quickly a SystemRescue CD (on the same network as these Karmic machines) mounted an NFS volume. Do the Gentoo folks know something about the portmapper that we don't?

Jim Rees (rees) wrote :

I've just installed lucid on a diskless system. It boots from a usb key, not pxe, but has an nfs root. It hangs in or just after init-bottom, apparently because statd and portmap both exit immediately after starting. My "fix" was to disable them both then add "nolock" to the root mount options in fstab, but I would prefer to have locking turned on.

This seems related but maybe not exactly the same bug. Should I file a new bug for this? Have you tried lucid, and if so with what result?

Jim Rees (rees) wrote :

Never mind, what I'm seeing is Bug #537133.
