nfs v3 locking fails - rpc-statd not started after minor upgrade

Bug #1956787 reported by Charles Hedrick
This bug affects 2 people
Affects: nfs-utils (Ubuntu)
Status: Fix Released
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

We just upgraded our nfs servers from 5.4.0-72 to 5.4.0-92.

We are using NFS v3 in a large computer science department, with several hundred clients.

Lots of applications are now failing because locks don't work.

/var/log/syslog on the client reports lockd not responding. tcpdump shows that the client tries to connect to lockd but there is no response to the TCP SYN.

netstat on the server shows some connections open, with lots of bytes in the receive queue.
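
One way to check from a client whether the server's lock-related services are registered with rpcbind is to look at `rpcinfo -p <server>` output for the `status` (rpc.statd) and `nlockmgr` (lockd) entries. The sketch below is an illustration added for that purpose, not part of the original report; `missing_rpc_services` is a hypothetical helper that reads rpcinfo output on stdin, and `nfs-server.example` is a placeholder hostname.

```shell
#!/bin/sh
# Hypothetical helper: reads `rpcinfo -p` output on stdin and prints each of
# the named RPC services that does NOT appear in the last (service) column.
missing_rpc_services() {
    listing=$(cat)
    for svc in "$@"; do
        if ! printf '%s\n' "$listing" |
            awk -v s="$svc" '$NF == s { found = 1 } END { exit !found }'; then
            printf '%s\n' "$svc"
        fi
    done
}

# Usage (placeholder hostname):
#   rpcinfo -p nfs-server.example | missing_rpc_services status nlockmgr
```

Any name printed is a service that rpcbind on the server does not list, which would be a hint that the corresponding daemon is not running.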

The problem probably occurs only with lots of clients. I believe any typical test would work fine.

The problem is new. That is, it didn't happen in 5.4.0-72.

ProblemType: Bug
DistroRelease: Ubuntu 20.04
Package: kernel-common (not installed)
ProcVersionSignature: Ubuntu 5.4.0-92.103-generic 5.4.157
Uname: Linux 5.4.0-92-generic x86_64
NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
ApportVersion: 2.20.11-0ubuntu27.21
Architecture: amd64
CasperMD5CheckResult: pass
Date: Fri Jan 7 13:07:34 2022
InstallationDate: Installed on 2020-10-14 (450 days ago)
InstallationMedia: Ubuntu-Server 20.04.1 LTS "Focal Fossa" - Release amd64 (20200731)
SourcePackage: kernel-package
UpgradeStatus: No upgrade log present (probably fresh install)
modified.conffile..etc.default.apport: [modified]
mtime.conffile..etc.default.apport: 2020-10-14T13:53:33.474467

Revision history for this message
Charles Hedrick (hedrick) wrote :

Also, "lslocks" on the server shows no locks associated with lockd, though there are locks associated with nfsd. (nfsd would be locks from NFS 4 I believe.)

summary: - nfs v3 locking fails
+ nfs v3 locking fails - 5.4.0-92 regression
Revision history for this message
Charles Hedrick (hedrick) wrote : Re: nfs v3 locking fails - 5.4.0-92 regression

The problem occurs on 4 different servers.
It does not occur on a server running CentOS 7 updated to kernel 5.4.170.
This implies that the problem is most likely an Ubuntu patch (or it was fixed upstream between 5.4.157 and 5.4.170).

Revision history for this message
Charles Hedrick (hedrick) wrote :

It has nothing to do with the kernel. For some reason rpc.statd isn't getting started. The systemd unit file looks fine, but the daemon isn't running. Doing "systemctl start rpc-statd" and "systemctl enable rpc-statd" fixes it. The other parts of nfs-utils seem fine.
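
The workaround described in this comment can be sketched as a small shell function. `ensure_statd` is a hypothetical name; the function assumes a systemd host with the `rpc-statd.service` unit installed and enough privileges to manage units (root or sudo).

```shell
#!/bin/sh
# Hypothetical helper: enable and start rpc-statd only if it is not already
# running. `systemctl is-active --quiet` exits 0 when the unit is active.
ensure_statd() {
    if systemctl is-active --quiet rpc-statd.service; then
        echo "rpc-statd already active"
    else
        # `enable --now` enables the unit for future boots and starts it now.
        systemctl enable --now rpc-statd.service &&
            echo "rpc-statd enabled and started"
    fi
}

# Usage (as root): ensure_statd
```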

summary: - nfs v3 locking fails - 5.4.0-92 regression
+ nfs v3 locking fails - rpc-statd not started after minor upgrade
affects: kernel-package (Ubuntu) → nfs-utils (Ubuntu)
Revision history for this message
Andreas Hasenack (ahasenack) wrote :

So if you reboot, rpc.statd will not be started? If it's failing, can you locate that failure in the logs and show us? Also please report the version of the package and the config files in /etc/default/* related to nfs.

Changed in nfs-utils (Ubuntu):
status: New → Incomplete
Revision history for this message
Charles Hedrick (hedrick) wrote :

Right. It's not failing. It's not running at all.

systemctl status rpc-statd
● rpc-statd.service - NFS status monitor for NFSv2/3 locking.
     Loaded: loaded (/lib/systemd/system/rpc-statd.service; disabled; vendor preset: enabled)
     Active: inactive (dead)

Journalctl doesn't show that nfs-utils ran. As far as I can tell, that's the only thing that would activate it.

I note that there's an [Install] section in rpc-statd.service. That implies that it's intended to be enabled. It's not. If you enable it, everything works. I suspect the upgrade was supposed to enable it but didn't.

rpc.statd version 1.3.3

ls -l /sbin/rpc.statd
-rwxr-xr-x 1 root root 89016 May 24 2021 /sbin/rpc.statd

ls -lc /sbin/rpc.statd
-rwxr-xr-x 1 root root 89016 Jan 10 15:09 /sbin/rpc.statd

It looks like it changed in the last update. It was "apt upgrade"

/etc/default/nfs-common
STATDOPTS=
NEED_GSSD=yes

/etc/default/nfs-kernel-server
RPCNFSDCOUNT=8
RPCNFSDPRIORITY=0
RPCMOUNTDOPTS="--manage-gids"
NEED_SVCGSSD="yes"
RPCSVCGSSDOPTS=""

/etc/default/rpcbind
OPTIONS=""
OPTIONS="-w"

/etc/systemd/system/rpc-gssd.service.d
gss-krb.conf
[Service]
Environment=KRB5_CONFIG=/etc/krb5.conf:/etc/krb5.conf.gssd

time.conf
[Service]
Environment=GSSDARGS=-t600

The KRB5_CONFIG is to configure
[plugins]
  ccselect = {
     module = nfs:/usr/lib/ccselect_nfs.so
  }
ccselect_nfs.so selects the user's primary principal, so that if they've done kinit as user.admin to do privileged stuff, NFS still uses their default principal.

Note that this was all working before the upgrade. The upgrade was
2022-01-10 15:09:15

Looks like the system was installed July 31, 2020, and no upgrades other than unattended upgrade since then.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

NFS comprises many services (especially with NFSv3), and they differ depending on whether v3 or v4 is used. The packaging tries to make the right choices, and jumps through many hoops to do so.

In summary, you should have nfs-server.service enabled on the server, and nfs-client.target (target!) on the client, and the correct services should come up after boot. You can even do "systemctl restart nfs-server" on the server and the correct services should be restarted in the right order.

The client target has this note:

 nfs-client.target
    If enabled, daemons needs for an nfs client are enabled.
    This does *not* include rpc.statd. the rpc-statd.service unit
    is started by /usr/sbin/start-statd which mount.nfs will run
    if statd is needed.

I wonder if that script isn't working, or not being called in your case for some reason.

Revision history for this message
Charles Hedrick (hedrick) wrote :

Statd is used by both client and server. I think that note is for the client usage.

Our servers don't necessarily do any NFS3 client mounts, so the client start wouldn't happen. Is it possible that at some point I mounted something via NFS3 and that's the only reason statd was running? I can't prove that that isn't true.

I tried enabling nfs-server and that didn't help.

My solution has been to enable rpc-statd explicitly. That works.

The reason I reported the problem is that it had previously worked automatically, and it took me quite a while to figure out why after the upgrade NFS 3 wasn't working on the server. I might not be alone in this.

It looks like it's supposed to be started by /etc/init.d/nfs-common, but I'm pretty sure that isn't started except in /etc/rcS.d/, which wouldn't normally happen. I put a statement in nfs-common to create a file in /var/log, and it didn't happen, so I don't believe nfs-common ran.

I suspect statd should be started as part of nfs-server, but it doesn't seem to be happening. Unless you assume that people are just using nfs 4 and want to require manual intervention to support 3. I wouldn't expect that. Particularly since the symptoms are subtle. If you try an NFS 3 mount it works. Things don't start failing until someone tries locking. The most common case is probably firefox, thunderbird, etc, which lock their profiles.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

For the server, the nfs-server.service service should suffice:

 nfs-server.service
    If enabled, nfs service is started together with dependencies
    such as mountd, statd, rpc.idmapd
    This is a "service" file rather than a "target" (which is the
    normal grouping construct) so that
        systemctl start nfs-server
    can work (if no type is given, ".service" is assumed).

That being said, I quickly tried on a focal vm and after enabling nfs-server.service, and even rebooting, statd isn't running. I'll dig in.

Changed in nfs-utils (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Charles Hedrick (hedrick) wrote :

I conjecture that the problem occurred when moving from init scripts to systemd. It's pretty clear that the default in the nfs-common init script was to start statd. I conjecture that when converting to systemd, someone forgot to put a Wants= in nfs-server. It's got an After= but not a Wants=. Writing the unit file with an [Install] section implies an explicit enable, but you probably don't want that. You probably want nfs-server to start it.

It wouldn't be easy to start it the first time a client mounts via NFS 3, without a kernel upcall, so I think nfs-common was right to default to using it.

But it looks like nfs-common is a vestige of the init script days and isn't used except in single user.

For what it's worth, CentOS 7 has a nearly identical unit file for rpc-statd, except it's missing the [Install] section (presumably because it's intended to be invoked by other things and not explicitly enabled). There are Wants for autofs and nfs-server.

I think adding Wants to at least nfs-server makes sense.
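
As an illustration of this suggestion, a local drop-in can add the dependency without editing the packaged unit. This is a sketch of an administrator-side workaround, not the packaging fix; the drop-in file name `statd.conf` and the helper `write_statd_dropin` are made up for the example, and the unit directory is a parameter so the function can be tried against a scratch directory first.

```shell
#!/bin/sh
# Hypothetical helper: write a drop-in that makes nfs-server.service pull in
# rpc-statd.service. $1 is the systemd unit directory
# (assumed default: /etc/systemd/system).
write_statd_dropin() {
    unit_dir=${1:-/etc/systemd/system}
    dropin_dir="$unit_dir/nfs-server.service.d"
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/statd.conf" <<'EOF'
[Unit]
Wants=rpc-statd.service
EOF
}

# Usage (as root):
#   write_statd_dropin && systemctl daemon-reload && systemctl restart nfs-server
```

A drop-in keeps the change separate from the file shipped by the package, so a later package upgrade should not overwrite it.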

(That is, on CentOS 7, rpc-statd.service doesn't have an [Install] section.)

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

Still digging into some history, found this patch in the package:

debian/patches/27-systemd-enable-with-systemctl-statd.patch
Description: Let sysadmins enable/disable statd services
 As the admin was able to control under upstart the statd services with
 NEED_STATD in default conffiles, mirror this funcationality under systemd
 by letting the user systemctl enable/disable statd services.
Author: Didier Roche <email address hidden>
Bug-Ubuntu: https://launchpad.net/bugs/1428486

Which removes the Wants for rpc-statd in nfs-server.service:
--- a/systemd/nfs-server.service
+++ b/systemd/nfs-server.service
@@ -4,8 +4,7 @@ DefaultDependencies=no
 Requires= network.target proc-fs-nfsd.mount
 Requires= nfs-mountd.service
 Wants=rpcbind.socket
-Wants=rpc-statd.service nfs-idmapd.service
-Wants=rpc-statd-notify.service
+Wants=nfs-idmapd.service

Apparently starting statd or not was controlled by a NEED_STATD var in /etc/default, and that is gone. To not always start statd (because it's not needed in nfsv4 I guess), they removed it from Wants, and let it be controlled via its unit file. That patch is from 2015.

Looks like the final paragraph of your comment #8 was spot on.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

I think the conclusion here is that on a focal server, if you expect to serve non-nfsv4 clients, you need to enable rpc-statd manually with systemctl, and we should document it in the server guide (https://ubuntu.com/server/docs/service-nfs).

Unless there is a trivial way to change this that for sure won't impact other scenarios, I'm wary of touching the systemd unit files in such a fashion on an LTS release, for fear of introducing other bugs or regressions, especially because this behavior was specifically introduced by a debian/ubuntu patch.

On the flip side, the reasons for the patch might no longer exist nowadays, so I think it's valid to revisit this for the upcoming LTS release, 22.04. In fact, quickly looking at the nfs-utils package in debian/experimental shows they apparently dropped this patch already:

nfs-utils (1:2.5.4-1~exp5) experimental; urgency=medium
...
  * Drop "Let sysadmins enable/disable statd services"
...
 -- Salvatore Bonaccorso <email address hidden> Tue, 14 Sep 2021 09:48:58 +0200

So that's my plan:
- document that rpc-statd might have to be manually enabled (note that even a focal nfs client will default to nfsv4.2, not requiring statd on the server nor the client)
- close this bug for focal
- see what we can do for jammy (22.04)

Revision history for this message
Charles Hedrick (hedrick) wrote :

That's a sensible approach, but there are loads of web pages telling people how to set up NFS, and they all claim that on current distributions you no longer need to enable individual services. That's not true for Ubuntu 20.

Please do make sure it's fixed in 22.04.

The original reasoning had a hole in it: with the scripts, there were three states: on, off, and default. Default was on. With the patch the default is off. I see no way in systemd to duplicate the way the scripts worked purely within systemd.
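
The three-state behaviour described here can be modelled roughly as follows. This is a sketch of the logic as described in this comment, not the actual init script; `statd_wanted` is a hypothetical function, and `NEED_STATD` is the /etc/default variable mentioned earlier in the thread.

```shell
#!/bin/sh
# Hypothetical model of the old behaviour: an explicit "no" turns statd off,
# an explicit "yes" turns it on, and anything else (unset or empty) falls
# through to the default, which is on.
statd_wanted() {
    case "${1:-}" in
        no) return 1 ;;
        *)  return 0 ;;
    esac
}

# Usage: statd_wanted "$NEED_STATD" && echo "start rpc.statd"
```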

This problem is particularly insidious. First, the symptoms aren't obvious. It took us a couple of days to figure out what was going on. Second, the problem doesn't occur if you mount anything by NFS3. So things worked in testing, but failed the first time we rebooted in production.

Revision history for this message
Charles Hedrick (hedrick) wrote :

Also, can I interest you in 1918312 for 22.04? Everyone agrees it's a security issue. It has a well-known 2-line fix (our version is 4 lines), which we've been using in production for over a year.

Revision history for this message
Andreas Hasenack (ahasenack) wrote :

In Jammy, rpc.statd is always started by default. It can of course be disabled or masked as usual with systemctl.

Changed in nfs-utils (Ubuntu):
status: Confirmed → Fix Released