nfsd gets unresponsive after some hours of operation

Bug #2062568 reported by Olle Liljenzin
This bug affects 4 people
Affects: nfs-utils (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

I installed the 24.04 Beta on two test machines that were running 22.04 without issues before. One of them exports two volumes that are mounted by the other machine, which primarily uses them as secondary storage for ccache.

After the machines have been up for a couple of hours (this has happened twice since yesterday evening), nfsd on the machine exporting the volumes appears to hang on something.

From dmesg on the server (repeated a few times):

[11183.290548] INFO: task nfsd:1419 blocked for more than 1228 seconds.
[11183.290558] Not tainted 6.8.0-22-generic #22-Ubuntu
[11183.290563] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11183.290582] task:nfsd state:D stack:0 pid:1419 tgid:1419 ppid:2 flags:0x00004000
[11183.290587] Call Trace:
[11183.290602] <TASK>
[11183.290606] __schedule+0x27c/0x6b0
[11183.290612] schedule+0x33/0x110
[11183.290615] schedule_timeout+0x157/0x170
[11183.290619] wait_for_completion+0x88/0x150
[11183.290623] __flush_workqueue+0x140/0x3e0
[11183.290629] nfsd4_probe_callback_sync+0x1a/0x30 [nfsd]
[11183.290689] nfsd4_destroy_session+0x186/0x260 [nfsd]
[11183.290744] nfsd4_proc_compound+0x3af/0x770 [nfsd]
[11183.290798] nfsd_dispatch+0xd4/0x220 [nfsd]
[11183.290851] svc_process_common+0x44d/0x710 [sunrpc]
[11183.290924] ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
[11183.290976] svc_process+0x132/0x1b0 [sunrpc]
[11183.291041] svc_handle_xprt+0x4d3/0x5d0 [sunrpc]
[11183.291105] svc_recv+0x18b/0x2e0 [sunrpc]
[11183.291168] ? __pfx_nfsd+0x10/0x10 [nfsd]
[11183.291220] nfsd+0x8b/0xe0 [nfsd]
[11183.291270] kthread+0xef/0x120
[11183.291274] ? __pfx_kthread+0x10/0x10
[11183.291276] ret_from_fork+0x44/0x70
[11183.291279] ? __pfx_kthread+0x10/0x10
[11183.291281] ret_from_fork_asm+0x1b/0x30
[11183.291286] </TASK>

From dmesg on the client (repeated a number of times):
[ 6596.911785] RPC: Could not send backchannel reply error: -110
[ 6596.972490] RPC: Could not send backchannel reply error: -110
[ 6837.281307] RPC: Could not send backchannel reply error: -110

ProblemType: Bug
DistroRelease: Ubuntu 24.04
Package: nfs-kernel-server 1:2.6.4-3ubuntu5
ProcVersionSignature: Ubuntu 6.8.0-22.22-generic 6.8.1
Uname: Linux 6.8.0-22-generic x86_64
.etc.request-key.d.id_resolver.conf: create id_resolver * * /usr/sbin/nfsidmap -t 600 %k %d
ApportVersion: 2.28.1-0ubuntu1
Architecture: amd64
CasperMD5CheckResult: pass
Date: Fri Apr 19 14:10:25 2024
InstallationDate: Installed on 2024-04-16 (3 days ago)
InstallationMedia: Ubuntu-Server 24.04 LTS "Noble Numbat" - Beta amd64 (20240410.1)
NFSMounts:

NFSv4Mounts:

ProcEnviron:
 LANG=en_US.UTF-8
 PATH=(custom, no user)
 SHELL=/bin/bash
 TERM=xterm-256color
 XDG_RUNTIME_DIR=<set>
SourcePackage: nfs-utils
UpgradeStatus: No upgrade log present (probably fresh install)
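
For anyone hitting the same hang, a minimal sketch of generic commands that can capture the blocked nfsd state before rebooting (sysrq may need enabling first; nothing here is specific to this bug):

```bash
# Allow sysrq if it is restricted (policy is site-specific)
echo 1 | sudo tee /proc/sys/kernel/sysrq
# Dump all tasks in uninterruptible (D) state to the kernel log
echo w | sudo tee /proc/sysrq-trigger
# Or read the stuck thread's kernel stack directly
# (1419 is the nfsd pid from the hung-task report above)
sudo cat /proc/1419/stack
# Optionally enable verbose nfsd/sunrpc debug logging while the hang is live
sudo rpcdebug -m nfsd -s all
sudo rpcdebug -m rpc -s all    # clear again later with -c
```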

Revision history for this message
Olle Liljenzin (liljenzin) wrote :
description: updated
Revision history for this message
Jeff (jeff09) wrote :

This kind of issue appeared for me with Ubuntu 23.10. The server mostly uses an HDD for bulk storage, and its not exactly powerful CPU is also busy running WireGuard to secure the NFS connection.

I mention the performance details because I have a feeling they matter. A likewise not exactly high-performance client connecting over 1 Gb/s only very occasionally caused this problem, but with a 10 Gb/s connection the issue appeared significantly more often. A higher-performance setup using a 2.5 Gb/s connection triggered this bug within a couple of days of being set up.
The lockup always seems to occur under heavy NFS usage, suspiciously mostly when there is both reading and writing going on. At least I don't recall it happening with reading only, though I'm not confident enough to say it never happened with a write-only load.

I found this bug report by searching for the client error message; the server-side trace differs because of the different kernel version:
```
[300146.046666] INFO: task nfsd:1426 blocked for more than 241 seconds.
[300146.046732] Not tainted 6.5.0-27-generic #28~22.04.1-Ubuntu
[300146.046770] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[300146.046813] task:nfsd state:D stack:0 pid:1426 ppid:2 flags:0x00004000
[300146.046827] Call Trace:
[300146.046832] <TASK>
[300146.046839] __schedule+0x2cb/0x750
[300146.046860] schedule+0x63/0x110
[300146.046870] schedule_timeout+0x157/0x170
[300146.046881] wait_for_completion+0x88/0x150
[300146.046894] __flush_workqueue+0x140/0x3e0
[300146.046908] nfsd4_probe_callback_sync+0x1a/0x30 [nfsd]
[300146.047074] nfsd4_destroy_session+0x193/0x260 [nfsd]
[300146.047219] nfsd4_proc_compound+0x3b7/0x770 [nfsd]
[300146.047365] nfsd_dispatch+0xbf/0x1d0 [nfsd]
[300146.047497] svc_process_common+0x420/0x6e0 [sunrpc]
[300146.047695] ? __pfx_read_tsc+0x10/0x10
[300146.047706] ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
[300146.047848] ? __pfx_nfsd+0x10/0x10 [nfsd]
[300146.047977] svc_process+0x132/0x1b0 [sunrpc]
[300146.048157] nfsd+0xdc/0x1c0 [nfsd]
[300146.048287] kthread+0xf2/0x120
[300146.048299] ? __pfx_kthread+0x10/0x10
[300146.048310] ret_from_fork+0x47/0x70
[300146.048321] ? __pfx_kthread+0x10/0x10
[300146.048331] ret_from_fork_asm+0x1b/0x30
[300146.048341] </TASK>
```

This seems to match, although the lockups I experienced previously may have been somewhat different.
I mostly remember the client complaining about the server not responding rather than the message shown here, and the server call trace used to include btrfs, which made me suspect the problem might be specific to that filesystem. The issue was always with NFS, though; nothing else locked up despite several other sources of heavy I/O.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nfs-utils (Ubuntu):
status: New → Confirmed
Revision history for this message
Jeff (jeff09) wrote :

Ran into this again just hours after commenting, by attempting to unpack a large archive file. Apparently the new setup, a higher-performance host with more network bandwidth, is simply too much for NFS to remain usable with this issue present.
Reading alone still seems to be fine, though.

Torturing 50% of the memory with memtester didn't reveal any problems, and having the server unpack the files itself produces the same mixed I/O with no issues so far, progressing well beyond where I could get over NFS, so it doesn't look like an HDD issue either.

The info dumping described at the link below seems potentially useful for the next catch, so linking it here:
https://www.spinics.net/lists/linux-nfs/msg97486.html

At some point I may try to reproduce the issue with writing alone to see whether the mixed workload is really required for the freeze, but I'm not sure when I'll get to it: nfsd can't simply be restarted, and the mandatory reboot to get it going again is only feasible when all the other tasks running on the host can be interrupted.

Revision history for this message
Jeff (jeff09) wrote :

Oh fun: while trying to add multiple attachments I ended up finding a bug report from 2007 complaining that apport can do this while the web interface remains limited, so I guess I'll get a bit spammy.

Ran into the same hang as usual again, even while avoiding heavy utilization. I'm not seeing anything too interesting in the gathered info, and apparently the kernel log buffer is not large enough to hold a complete task list dump, but most of the information seems to be there at least.

Will try running with RPCNFSDCOUNT=1 for some time in case this is a silly deadlock; it's definitely easier to give that a try than to downgrade to Ubuntu 22.04, which worked well.
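
For anyone wanting to try the same single-thread experiment, a rough sketch of where the nfsd thread count lives on Ubuntu; the exact file depends on the nfs-utils version, so treat both paths below as assumptions to check locally:

```bash
# Older packaging: /etc/default/nfs-kernel-server
#   RPCNFSDCOUNT=1
# Newer nfs-utils (e.g. 2.6.x on 24.04): /etc/nfs.conf
#   [nfsd]
#   threads=1
sudo systemctl restart nfs-server
# Verify how many nfsd threads are actually running
cat /proc/fs/nfsd/threads
```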

Revision history for this message
GuoqingJiang (guoqingjiang) wrote :

IIUC, on the NFS server side both 23.10 (6.5 series) and 24.04 (6.8 series) show a similar issue, but 22.04 (probably a 5.15.x kernel) was OK.

And what is the kernel version on the NFS client? Has it changed, or has it stayed on one particular version?

Anyway, I guess the most efficient way forward is to bisect which commit caused the issue. Thanks!

Revision history for this message
Jeff (jeff09) wrote :

The tricky part for me is that the client was changing regularly, so I can't confidently say when the errors started appearing. It is very suspicious that high load is needed (a new host and higher network bandwidth made the issue more frequent) and that uploading to the server is the trigger: pure downloading doesn't seem to be a problem even when cached data is sent at full bandwidth for minutes.

I moved the server to 24.04, but I've also moved some I/O-heavy tasks onto it so there is less need for uploading. The client was on 23.10, and I'm still holding back on upgrading it for a few more weeks.

I can't say a whole lot about the current situation, since I'm not uploading much anymore in order to avoid the issue, but I did run into a hang a few days ago. I didn't have time to debug it, and the server didn't want to restart gracefully, so I ended up hard rebooting.
I believe it was the first occurrence since moving the I/O-heavy tasks; I wanted to upload a few hundred GiB of data back to the server, data that had been downloaded from there a while ago without problems. Otherwise, light I/O doesn't seem to hit this problem: the occasional backup to the server is fine, but that rarely saturates the network and likely fits completely into the page cache almost every time.

A few hopefully helpful points for reproducing the problem:
- As mentioned multiple times, downloading alone seems to be unaffected; uploading is what should be stressed. I suspect that either no simultaneous download is needed at all, or casual filesystem browsing is a good enough extra load.
- A fast client with high bandwidth is key. I ran into this issue a couple of times with an older host on 1 Gb/s, but a new fast host with 2.5 Gb/s made the issue appear significantly more frequently.
- It likely doesn't matter how the link gets saturated, but I either processed files cached on the server (mixed R/W) or uploaded cached files (a fast SSD should be fine too), meaning the bottleneck was always the network, at least while the caches were large enough.
- Files were large, so there was no stopping to fiddle with metadata as there would be with small files, and the page cache was often exhausted. The target was a single HDD most of the time, which often meant writes started blocking (a 100-ish MiB/s HDD catching up with close to 250 MiB/s of incoming data), occasionally making the hosts freeze, since the kernel's background I/O handling is still poor and we just pretend the issue is gone because SSDs are fast enough not to hit it. The page-cache-draining freezes may be good at exposing race conditions. A rough load sketch follows this list.
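
A rough sketch of the upload-heavy load described above; the paths and sizes are placeholders, not the original workload:

```bash
# Placeholders: MOUNT is the NFS mount on the client, SRC is a local
# directory of large files that fits in the client's page cache.
MOUNT=/mnt/nfs
SRC=/var/tmp/bigfiles
# Warm the client cache so the network, not the local disk, is the bottleneck,
# then push everything to the server.
cat "$SRC"/* > /dev/null
cp -v "$SRC"/* "$MOUNT"/
# Or generate data on the fly instead of copying pre-existing files.
dd if=/dev/zero of="$MOUNT/stress.bin" bs=1M count=20480 status=progress
```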

It may be more efficient to start by looking at what causes the "RPC: Could not send backchannel reply error: -110" log spam, which might be related. The lockup can take a long time to catch, whereas that kernel message shows up quite frequently.
Even now I have plenty of those lines without experiencing issues, and I'm not even uploading much, mostly just downloading large files.

Some extra info which may or may not matter:
- The server hardware is quite weak, with an old 4-core Broadwell CPU, possibly helping to expose race conditions
- All file systems are Btrfs with noatime,discard=async,...


Revision history for this message
yuhldr (yuh) wrote (last edit ):

I encountered the same problem. After several days of testing, the problem can be reproduced 100% of the time. On Ubuntu 24.04, a 10 Gb/s optical fiber connection is used between the login node and the computing node. The computing node uses NFS to mount the /home of the login node, and the entire system is managed with slurm.

The login node submits jobs that do a large amount of reading and writing on /home. Each program reads a local txt file of about 10 GB in size with python-numpy, then splits it into multiple small files of 100 MB and saves them as npy files.

I submit 252 similar programs at once and run them simultaneously. Within one hour, the NFS service on the login node gets stuck. At that point the nfs-server service on the login node cannot be restarted, and the login node cannot ssh to the computing node. Restarting the computing node does not make the problem go away, but simply restarting the login node does: ssh is restored and the computing node automatically reconnects to NFS successfully.

```bash
root 1548 0.0 0.0 5632 1792 ? Ss 18:19 0:00 /usr/sbin/nfsdcld
root 2347 4.6 0.0 0 0 ? D 18:19 8:04 [nfsd]
root 53326 0.0 0.0 0 0 ? D 20:00 0:00 [kworker/u112:2+nfsd4_callbacks]
root 68918 0.0 0.0 2704 1792 ? Is 20:47 0:00 /usr/sbin/rpc.nfsd 0
root 74448 0.0 0.0 9436 2240 pts/6 S+ 21:11 0:00 grep --color=auto --ex
```

```log
Jun 23 20:48:52 icpcs systemd[1]: nfs-server.service: Stopping timed out. Terminating.
Jun 23 20:49:10 icpcs sudo[69464]: root : TTY=pts/6 ; PWD=/root ; USER=root ; COMMAND=/usr/bin/systemctl status nfs-server.service
Jun 23 20:50:23 icpcs systemd[1]: nfs-server.service: State 'stop-sigterm' timed out. Killing.
Jun 23 20:50:23 icpcs systemd[1]: nfs-server.service: Killing process 68918 (rpc.nfsd) with signal SIGKILL.
Jun 23 20:50:27 icpcs kernel: INFO: task nfsd:2347 blocked for more than 1105 seconds.
Jun 23 20:50:27 icpcs kernel: task:nfsd state:D stack:0 pid:2347 tgid:2347 ppid:2 flags:0x00004000
Jun 23 20:50:27 icpcs kernel: nfsd4_probe_callback_sync+0x1a/0x30 [nfsd]
Jun 23 20:50:27 icpcs kernel: nfsd4_destroy_session+0x186/0x260 [nfsd]
Jun 23 20:50:27 icpcs kernel: nfsd4_proc_compound+0x3af/0x770 [nfsd]
Jun 23 20:50:27 icpcs kernel: nfsd_dispatch+0xd4/0x220 [nfsd]
Jun 23 20:50:27 icpcs kernel: ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
Jun 23 20:50:27 icpcs kernel: ? __pfx_nfsd+0x10/0x10 [nfsd]
Jun 23 20:50:27 icpcs kernel: nfsd+0x8b/0xe0 [nfsd]
```
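
In case someone wants to retry this reproducer without slurm, a rough shell approximation; the path, job count, and file sizes are placeholders, not the original numpy jobs:

```bash
# Many concurrent writers, each producing a series of ~100 MB files on the
# NFS-mounted home, loosely mimicking the described npy output.
NFS_DIR=/home/testuser/nfs-stress   # placeholder path on the NFS mount
mkdir -p "$NFS_DIR"
for job in $(seq 1 64); do
  (
    for chunk in $(seq 1 100); do
      dd if=/dev/urandom of="$NFS_DIR/job${job}_chunk${chunk}.npy" \
         bs=1M count=100 conv=fsync status=none
    done
  ) &
done
wait
```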

Revision history for this message
Ricardo Cruz (ric8cruz) wrote :

For the benefit of others... Like #10, we also have a cluster with this issue. As a workaround, we are using version 3 of the NFS protocol (`nfsvers=3` in `/etc/fstab`), which so far seems to have eliminated the problem for us.
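
For reference, a hedged example of what such an fstab entry might look like; the server name, export path, mount point, and extra options are placeholders:

```bash
# /etc/fstab on the clients (placeholder names and options)
server.example.com:/export/data  /mnt/data  nfs  nfsvers=3,hard,noatime  0  0
```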
