nfs-kernel-server fails to start after kernel upgrade

Bug #540637 reported by ViViD on 2010-03-18
This bug affects 25 people
Affects: nfs-utils (Ubuntu)        Importance: High    Assigned to: Unassigned
Affects: nfs-utils (Ubuntu Lucid)  Importance: High    Assigned to: Unassigned

Bug Description

After updating to kernel 2.6.31-20-generic-pae on my Karmic server, nfs-kernel-server fails to start and I cannot mount my network shares. I believe this is because portmap is not started before nfs-kernel-server. To remedy the situation I am forced to:

1. Log in via ssh
2. Stop nfs-kernel-server
3. Stop portmap
4. Start portmap
5. Start nfs-kernel-server

My shares are now available over my network. I have to do this every time the system is rebooted; restarts are few and far between, but there is clearly an issue with these services. If portmap were started via /etc/init.d I could fix the problem on my own, but I know nothing about upstart.

This is an Ubuntu Karmic 9.10 server with kernel 2.6.31-20-generic-pae.
Portmap is version 6.0-10ubuntu2.
Nfs-kernel-server is version 1:1.2.0-2ubuntu8.

Before this kernel version the problem was not present. If you need more information, please feel free to ask me for it.

Steve Langasek (vorlon) wrote :

Thank you for taking the time to report this bug and help to improve Ubuntu.

What kernel version were you running before this upgrade? None of the (many!) changes in the -20 uploads look very likely to be related, so it may be that there's an underlying race condition that's aggravated by timing differences in the new kernel.

And indeed there is an underlying race condition, because /etc/init/portmap.conf may run in parallel with /etc/init/rc-sysinit.conf.

The correct fix is to switch nfs-kernel-server over to upstart for lucid, to eliminate the race condition. As a workaround, you can edit /etc/init/rc-sysinit.conf on your own system, and add 'and started portmap' to the list of start conditions.
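A minimal sketch of what such an upstart job might look like (hypothetical; the job name, start conditions, and exec lines are illustrative, not the job Ubuntu actually shipped):

```
# /etc/init/nfs-kernel-server.conf -- illustrative sketch only
description "NFS kernel server"

# The key point: making portmap an explicit start condition
# eliminates the registration race with rc-sysinit.conf.
start on (started portmap and local-filesystems)
stop on stopping portmap

pre-start exec exportfs -r
exec rpc.nfsd 8
```

Supervision details (expect/respawn stanzas, stopping the nfsd kernel threads) are omitted, since rpc.nfsd only spawns kernel threads and exits.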

Changed in nfs-utils (Ubuntu):
importance: Undecided → High
status: New → Triaged
Steve Langasek (vorlon) on 2010-03-18
Changed in nfs-utils (Ubuntu Lucid):
assignee: nobody → Steve Langasek (vorlon)
ViViD (vivnet) wrote :

Prior to the current kernel, the system was using linux-image-2.6.31-16-generic-pae; it's likely I missed a version between them, as this is a server. However, after applying your suggestion to /etc/init/rc-sysinit.conf, the problem is still present, with the same resolution. Below is my original rc-sysinit.conf, followed by the new one with the changes. It is quite possible that I made this change incorrectly, which is why I am providing both.

####ORIGINAL
# rc-sysinit - System V initialisation compatibility
#
# This task runs the old System V-style system initialisation scripts,
# and enters the default runlevel when finished.

description "System V initialisation compatibility"
author "Scott James Remnant <email address hidden>"

start on filesystem and net-device-up IFACE=lo
stop on runlevel

# Default runlevel, this may be overriden on the kernel command-line
# or by faking an old /etc/inittab entry
env DEFAULT_RUNLEVEL=2

# There can be no previous runlevel here, but there might be old
# information in /var/run/utmp that we pick up, and we don't want
# that.
#
# These override that
env RUNLEVEL=
env PREVLEVEL=

task

script
    # Check for default runlevel in /etc/inittab
    if [ -r /etc/inittab ]
    then
        eval "$(sed -nre 's/^[^#][^:]*:([0-6sS]):initdefault:.*/DEFAULT_RUNLEVEL="\1";/p' /etc/inittab || true)"
    fi

    # Check kernel command-line for typical arguments
    for ARG in $(cat /proc/cmdline)
    do
        case "${ARG}" in
        -b|emergency)
            # Emergency shell
            [ -n "${FROM_SINGLE_USER_MODE}" ] || sulogin
            ;;
        [0123456sS])
            # Override runlevel
            DEFAULT_RUNLEVEL="${ARG}"
            ;;
        -s|single)
            # Single user mode
            [ -n "${FROM_SINGLE_USER_MODE}" ] || DEFAULT_RUNLEVEL=S
            ;;
        esac
    done

    # Run the system initialisation scripts
    [ -n "${FROM_SINGLE_USER_MODE}" ] || /etc/init.d/rcS

    # Switch into the default runlevel
    telinit "${DEFAULT_RUNLEVEL}"
end script

###MODIFIED
# rc-sysinit - System V initialisation compatibility
#
# This task runs the old System V-style system initialisation scripts,
# and enters the default runlevel when finished.

description "System V initialisation compatibility"
author "Scott James Remnant <email address hidden>"

start on (filesystem
   and net-device-up IFACE=lo
   and started portmap)
stop on runlevel

# Default runlevel, this may be overriden on the kernel command-line
# or by faking an old /etc/inittab entry
env DEFAULT_RUNLEVEL=2

# There can be no previous runlevel here, but there might be old
# information in /var/run/utmp that we pick up, and we don't want
# that.
#
# These override that
env RUNLEVEL=
env PREVLEVEL=

task

script
    # Check for default runlevel in /etc/inittab
    if [ -r /etc/inittab ]
    then
        eval "$(sed -nre 's/^[^#][^:]*:([0-6sS]):initdefault:.*/DEFAULT_RUNLEVEL="\1";/p' /etc/inittab || true)"
    fi

    # Check kernel command-line for typical arguments
    for ARG in $(cat /proc/cmdline)
    do
        case "${ARG}" in
        -b|emergency)
            # Emergency shell
            [ -n "${FROM_SINGLE_USER_MODE}" ] || sulogin
            ;;
        [0123456sS])
            # Override runlevel
            DEFAULT_RUNLEVEL="${ARG}"
            ;;
        -s|single)
            # Single us...


João Pinto (joaopinto) wrote :

ViViD, can you paste the output from "who -r" and "initctl list"?
There was another user reporting a similar issue; in his case other services, including cron, did not start. That was on a clean Lucid install.
According to him, "who -r" produced no output.

ViViD (vivnet) wrote :

Here is the console output of 'who -r' and 'initctl list'.

This server still suffers the same symptoms: I must manually stop and restart nfs-kernel-server and portmap after each reboot to access my network storage.

Luka Bodrozic (lbodrozic) wrote :

All, I'm having the same issue after upgrading to Lucid yesterday. One interesting bit of data... I have to run portmap stop and start separately.

So this works:
# sudo service nfs-kernel-server stop
# sudo service portmap stop
# sudo service portmap start
# sudo service nfs-kernel-server start

But this does not:
# sudo service nfs-kernel-server stop
# sudo service portmap restart
# sudo service nfs-kernel-server start

Don't know if that's significant, but I'm seeing similar results.

Also interesting... this problem is causing a Mac OS X nfs client running 10.5 to hang, but my OS X 10.6 client is running just fine, without the stop/restart commands.

Strange. Hope it's a clue.

Luka Bodrozic (lbodrozic) wrote :

Wow, I wrote this late at night and realized that last part is not very clear.

From a cold boot under Lucid, I have two machines that are nfs clients to this machine: one running OS X 10.6 and one running 10.5 (all with the latest software updates, etc.).

The 10.6 machine works right out of the blocks and gives no mention in syslog of any traffic or issues.

The 10.5 machine is the one with the problems. I have to do the above workarounds to get it to behave at all. If I try to access my nfs shares before I run the workaround (straight after booting, for example), my syslog fills with messages like this:

May 3 22:38:55 luka-beast mountd[1268]: authenticated mount request from 192.168.1.101:979 for /media/music-n-vids (/media/music-n-vids)
May 3 22:39:35 luka-beast kernel: [ 230.030039] statd: server rpc.statd not responding, timed out
May 3 22:39:35 luka-beast kernel: [ 230.030062] lockd: cannot monitor luka-laptop.local
May 3 22:40:05 luka-beast kernel: [ 260.030036] statd: server rpc.statd not responding, timed out
May 3 22:40:05 luka-beast kernel: [ 260.030058] lockd: cannot monitor luka-laptop.local
...

And eventually my nfs connection drops, until the OS or anything else touches the mount; then automount kicks in and it all starts over again.

When I run the workaround above, I see this:
May 3 22:44:05 luka-beast kernel: [ 500.031644] nfsd: last server has exited, flushing export cache
May 3 22:44:07 luka-beast kernel: [ 502.831630] svc: failed to register lockdv1 RPC service (errno 97).
May 3 22:44:07 luka-beast kernel: [ 502.832407] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
May 3 22:44:07 luka-beast kernel: [ 502.832428] NFSD: starting 90-second grace period

And then that's it. No messages from lockd, statd, or mountd. And everything runs smoothly on both machines.

Might be a clue, but it sounds more like another manifestation of the symptom.

Ville Määttä (vmaatta) wrote :

The latest kernel update, 2.6.32.22.23 server, fixed this for me.

ViViD (vivnet) wrote :

The problem, however, still exists in Karmic. I am extremely hesitant to upgrade this particular production-level machine out of concern for unforeseen circumstances.

If it is fixed in Lucid, though, upgrading to a long-term support (LTS) release sounds like a good idea.

Steve Langasek (vorlon) wrote :

This bug is not fixed in Lucid. The nfs-kernel-server startup script still needs to be converted to upstart.

Nigel Hsiung (nigelcz) wrote :

My nfs server on Lucid is not working. I don't know if it is related to this bug. I used the following setup:
sudo apt-get install nfs-kernel-server nfs-common portmap
sudo dpkg-reconfigure portmap (set to no)
sudo gedit /etc/exports (added /opt *(rw,no_root_squash,async))
sudo /etc/init.d/nfs-kernel-server restart
sudo exportfs -a
This used to work when I was using a non-PAE kernel. Now I've reinstalled with PAE. My kernel is 2.6.32-22-generic-pae. I tried the stop/start of portmap and nfs-kernel-server mentioned above, but still no luck getting it to work. My nfs client is an sh4 embedded Linux system, and it connects fine to my Fedora box.

Jun 2 10:34:27 nigel-laptop mountd[22985]: Caught signal 15, un-registering and exiting.
Jun 2 10:34:27 nigel-laptop kernel: [30781.533161] nfsd: last server has exited, flushing export cache
Jun 2 10:34:30 nigel-laptop kernel: [30784.045011] svc: failed to register lockdv1 RPC service (errno 97).
Jun 2 10:34:30 nigel-laptop kernel: [30784.046044] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
Jun 2 10:34:30 nigel-laptop kernel: [30784.046072] NFSD: starting 90-second grace period
Jun 2 10:34:37 nigel-laptop kernel: [30791.091781] Inbound IN=eth0 OUT= MAC=00:26:22:a7:2a:ef:00:50:bf:00:00:01:08:00 SRC=192.168.202.126 DST=192.168.202.26 LEN=132 TOS=0x00 PREC=0x00 TTL=255 ID=64396 DF PROTO=UDP SPT=1000 DPT=50175 LEN=112
Jun 2 10:34:52 nigel-laptop wpa_supplicant[938]: WPS-AP-AVAILABLE
Jun 2 10:35:52 nigel-laptop wpa_supplicant[938]: WPS-AP-AVAILABLE
Jun 2 10:36:15 nigel-laptop kernel: [30889.481297] Inbound IN=eth0 OUT= MAC=00:26:22:a7:2a:ef:00:50:bf:00:00:01:08:00 SRC=192.168.202.126 DST=192.168.202.26 LEN=132 TOS=0x00 PREC=0x00 TTL=255 ID=64399 DF PROTO=UDP SPT=1000 DPT=50175 LEN=112

The inbound traffic is from my nfs client, which eventually times out. I'm new to Ubuntu and would really appreciate any help. Thanks.

scram (scram69) wrote :

I am experiencing the same issue as Luka Bodrozic above with my OS X 10.5 clients. Luckily, the workaround from comment #4 fixes the issue (until the next server reboot).

$ uname -a
Linux mediaserver 2.6.32-22-generic #36-Ubuntu SMP Thu Jun 3 19:31:57 UTC 2010 x86_64 GNU/Linux

Is there a more recent kernel that fixes the issue?

As a temporary workaround I added the "nolock" keyword to the mount options on the clients involved.
This circumvents some of the timeouts without manually restarting the nfs server.
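For reference, a client-side fstab entry with "nolock" might look like this (the server name and paths are illustrative):

```
# client /etc/fstab -- hostname and mount points are made up for illustration
luka-beast:/media/music-n-vids  /mnt/music-n-vids  nfs  rw,nolock  0  0
```

Note that "nolock" disables NLM locking entirely, so applications that depend on file locking over NFS lose that protection.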

jdcharlton (j-d-charlton) wrote :

I am having a similar issue with the following server configuration:
vmlinuz-2.6.32-25-generic, nfs-kernel-server, and Kubuntu 10.04.

The workaround of restarting portmap before nfs-kernel-server fixes the problem for me. I don't see the problem with normal nfs directory access, except when I use Thunderbird on an nfs directory: Thunderbird does file locking and does not work at all until I restart portmap before nfs-kernel-server. I modified /etc/init/rc-sysinit.conf as suggested in #1 above, but I still need to restart portmap and nfs-kernel-server each time the system reboots.

My /etc/init/rc-sysinit.conf is:
# rc-sysinit - System V initialisation compatibility
#
# This task runs the old System V-style system initialisation scripts,
# and enters the default runlevel when finished.

description "System V initialisation compatibility"
author "Scott James Remnant <email address hidden>"

start on filesystem and net-device-up IFACE=lo and started portmap
stop on runlevel

# Default runlevel, this may be overriden on the kernel command-line
# or by faking an old /etc/inittab entry
env DEFAULT_RUNLEVEL=2

# There can be no previous runlevel here, but there might be old
# information in /var/run/utmp that we pick up, and we don't want
# that.
#
# These override that
env RUNLEVEL=
env PREVLEVEL=
.......

who -r and initctl list results follow below.

who -r
      run-level 2 2010-09-03 05:16

$ initctl list
alsa-mixer-save stop/waiting
avahi-daemon start/running, process 1146
mountall-net stop/waiting
nmbd start/running, process 1154
rc stop/waiting
rpc_pipefs start/running
rsyslog start/running, process 1105
screen-cleanup stop/waiting
tty4 start/running, process 1209
udev start/running, process 442
upstart-udev-bridge start/running, process 413
ureadahead-other stop/waiting
apport stop/waiting
console-setup stop/waiting
hwclock-save stop/waiting
irqbalance stop/waiting
plymouth-log stop/waiting
smbd start/running, process 1102
tty5 start/running, process 1213
statd start/running, process 3032
atd start/running, process 1233
dbus start/running, process 1125
failsafe-x stop/waiting
plymouth stop/waiting
portmap start/running, process 3019
ssh start/running, process 1124
control-alt-delete stop/waiting
hwclock stop/waiting
network-manager start/running, process 1166
usplash stop/waiting
module-init-tools stop/waiting
cron start/running, process 1234
mountall stop/waiting
acpid start/running, process 1229
plymouth-stop stop/waiting
rcS stop/waiting
ufw start/running
mounted-varrun stop/waiting
rc-sysinit stop/waiting
anacron stop/waiting
tty2 start/running, process 1224
udevtrigger stop/waiting
mounted-dev stop/waiting
tty3 start/running, process 1225
udev-finish stop/waiting
cryptdisks-udev stop/waiting
hostname stop/waiting
kdm start/running, process 1131
mountall-reboot stop/waiting
mysql start/running, process 1280
gssd stop/waiting
mountall-shell stop/waiting
mounted-tmp stop/waiting
network-interface (lo) start/running
network-interface (eth3) start/running
plymouth-splash stop/waiting
tty1 start/running, process 1656
udevmonitor stop/waiting
cryptdisks-enable stop/waiting
dmesg stop/waiting
network-interface-security s...


Timo Suoranta (timo-suoranta) wrote :

Could this be related?

http://forum.linode.com/viewtopic.php?t=5549

I ran into this issue, and the suggested modification

1. nano +67 /etc/init.d/nfs-kernel-server
2. Comment out this line: "if [ -f /proc/kallsyms ] && ! grep -qE ' nfsd_serv ' /proc/kallsyms; then"
3. Replace with this line: "if [ -f /proc/kallsyms ] && ! grep -qE 'init_nf(sd| )' /proc/kallsyms; then"
4. Save the changes
5. NFS will now start correctly

worked for me.
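As a small illustration of why that pattern change matters: the old check greps /proc/kallsyms for the symbol ' nfsd_serv ', which (per the linked forum thread) no longer appears on the affected kernels, while the new pattern also matches 'init_nfsd'. The kallsyms line below is made up for demonstration:

```shell
# A made-up /proc/kallsyms line for a kernel with nfsd built in
kallsyms_sample="c0123456 t init_nfsd"

# Old check from /etc/init.d/nfs-kernel-server: looks for ' nfsd_serv '
if printf '%s\n' "$kallsyms_sample" | grep -qE ' nfsd_serv '; then old=match; else old=nomatch; fi

# Replacement check: matches 'init_nfsd' (or 'init_nf ')
if printf '%s\n' "$kallsyms_sample" | grep -qE 'init_nf(sd| )'; then new=match; else new=nomatch; fi

echo "old=$old new=$new"
```

Against this sample line the old pattern finds nothing, so the init script would wrongly conclude nfsd is unavailable; the new pattern matches.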

Steve Langasek (vorlon) on 2012-03-14
Changed in nfs-utils (Ubuntu Lucid):
assignee: Steve Langasek (vorlon) → nobody
Changed in nfs-utils (Ubuntu):
assignee: Steve Langasek (vorlon) → nobody
Rolf Leggewie (r0lf) wrote :

Lucid has reached the end of its life and is no longer receiving updates. Marking the Lucid task for this ticket as "Won't Fix".

Changed in nfs-utils (Ubuntu Lucid):
status: Triaged → Won't Fix