nfsd gets unresponsive after some hours of operation

Bug #2062568 reported by Olle Liljenzin
This bug affects 4 people
Affects: nfs-utils (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned

Bug Description

I installed the 24.04 Beta on two test machines that were running 22.04 without issues before. One of them exports two volumes that are mounted by the other machine, which primarily uses them as secondary storage for ccache.

After the machines have been up for a couple of hours (this has happened twice since yesterday evening), nfsd on the machine exporting the volumes appears to hang on something.

From dmesg on the server (repeated a few times):

[11183.290548] INFO: task nfsd:1419 blocked for more than 1228 seconds.
[11183.290558] Not tainted 6.8.0-22-generic #22-Ubuntu
[11183.290563] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11183.290582] task:nfsd state:D stack:0 pid:1419 tgid:1419 ppid:2 flags:0x00004000
[11183.290587] Call Trace:
[11183.290602] <TASK>
[11183.290606] __schedule+0x27c/0x6b0
[11183.290612] schedule+0x33/0x110
[11183.290615] schedule_timeout+0x157/0x170
[11183.290619] wait_for_completion+0x88/0x150
[11183.290623] __flush_workqueue+0x140/0x3e0
[11183.290629] nfsd4_probe_callback_sync+0x1a/0x30 [nfsd]
[11183.290689] nfsd4_destroy_session+0x186/0x260 [nfsd]
[11183.290744] nfsd4_proc_compound+0x3af/0x770 [nfsd]
[11183.290798] nfsd_dispatch+0xd4/0x220 [nfsd]
[11183.290851] svc_process_common+0x44d/0x710 [sunrpc]
[11183.290924] ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
[11183.290976] svc_process+0x132/0x1b0 [sunrpc]
[11183.291041] svc_handle_xprt+0x4d3/0x5d0 [sunrpc]
[11183.291105] svc_recv+0x18b/0x2e0 [sunrpc]
[11183.291168] ? __pfx_nfsd+0x10/0x10 [nfsd]
[11183.291220] nfsd+0x8b/0xe0 [nfsd]
[11183.291270] kthread+0xef/0x120
[11183.291274] ? __pfx_kthread+0x10/0x10
[11183.291276] ret_from_fork+0x44/0x70
[11183.291279] ? __pfx_kthread+0x10/0x10
[11183.291281] ret_from_fork_asm+0x1b/0x30
[11183.291286] </TASK>

From dmesg on the client (repeated a number of times):
[ 6596.911785] RPC: Could not send backchannel reply error: -110
[ 6596.972490] RPC: Could not send backchannel reply error: -110
[ 6837.281307] RPC: Could not send backchannel reply error: -110

ProblemType: Bug
DistroRelease: Ubuntu 24.04
Package: nfs-kernel-server 1:2.6.4-3ubuntu5
ProcVersionSignature: Ubuntu 6.8.0-22.22-generic 6.8.1
Uname: Linux 6.8.0-22-generic x86_64
.etc.request-key.d.id_resolver.conf: create id_resolver * * /usr/sbin/nfsidmap -t 600 %k %d
ApportVersion: 2.28.1-0ubuntu1
Architecture: amd64
CasperMD5CheckResult: pass
Date: Fri Apr 19 14:10:25 2024
InstallationDate: Installed on 2024-04-16 (3 days ago)
InstallationMedia: Ubuntu-Server 24.04 LTS "Noble Numbat" - Beta amd64 (20240410.1)
NFSMounts:

NFSv4Mounts:

ProcEnviron:
 LANG=en_US.UTF-8
 PATH=(custom, no user)
 SHELL=/bin/bash
 TERM=xterm-256color
 XDG_RUNTIME_DIR=<set>
SourcePackage: nfs-utils
UpgradeStatus: No upgrade log present (probably fresh install)
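
For anyone hitting the same hang, a minimal sketch of generic commands that can capture the blocked nfsd state before rebooting (sysrq may need enabling first; nothing here is specific to this bug):

```bash
# Allow sysrq if it is restricted (policy is site-specific)
echo 1 | sudo tee /proc/sys/kernel/sysrq
# Dump all tasks in uninterruptible (D) state to the kernel log
echo w | sudo tee /proc/sysrq-trigger
# Or read the stuck thread's kernel stack directly
# (1419 is the nfsd pid from the hung-task report above)
sudo cat /proc/1419/stack
# Optionally enable verbose nfsd/sunrpc debug logging while the hang is live
sudo rpcdebug -m nfsd -s all
sudo rpcdebug -m rpc -s all    # clear again later with -c
```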

Revision history for this message
Olle Liljenzin (liljenzin) wrote :
description: updated
Revision history for this message
Jeff (jeff09) wrote :

This kind of issue appeared for me with Ubuntu 23.10. The server mostly uses an HDD for bulk storage, and its not exactly powerful CPU is also busy running WireGuard to secure the NFS connection.

I mention the performance details because I have a feeling they matter. A likewise not exactly high-performance client connecting over 1 Gb/s only very occasionally caused this problem, but with a 10 Gb/s connection the issue appeared significantly more often. A higher-performance setup using a 2.5 Gb/s connection triggered this bug within a couple of days of being set up.
The lockup always seems to occur under heavy NFS usage, suspiciously mostly when there is both reading and writing going on. At least I don't recall it happening with reading only, though I'm not confident enough to say it never happened with a write-only load.

I found this bug report by searching for the client error message; the server-side trace differs because of the different kernel version:
```
[300146.046666] INFO: task nfsd:1426 blocked for more than 241 seconds.
[300146.046732] Not tainted 6.5.0-27-generic #28~22.04.1-Ubuntu
[300146.046770] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[300146.046813] task:nfsd state:D stack:0 pid:1426 ppid:2 flags:0x00004000
[300146.046827] Call Trace:
[300146.046832] <TASK>
[300146.046839] __schedule+0x2cb/0x750
[300146.046860] schedule+0x63/0x110
[300146.046870] schedule_timeout+0x157/0x170
[300146.046881] wait_for_completion+0x88/0x150
[300146.046894] __flush_workqueue+0x140/0x3e0
[300146.046908] nfsd4_probe_callback_sync+0x1a/0x30 [nfsd]
[300146.047074] nfsd4_destroy_session+0x193/0x260 [nfsd]
[300146.047219] nfsd4_proc_compound+0x3b7/0x770 [nfsd]
[300146.047365] nfsd_dispatch+0xbf/0x1d0 [nfsd]
[300146.047497] svc_process_common+0x420/0x6e0 [sunrpc]
[300146.047695] ? __pfx_read_tsc+0x10/0x10
[300146.047706] ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
[300146.047848] ? __pfx_nfsd+0x10/0x10 [nfsd]
[300146.047977] svc_process+0x132/0x1b0 [sunrpc]
[300146.048157] nfsd+0xdc/0x1c0 [nfsd]
[300146.048287] kthread+0xf2/0x120
[300146.048299] ? __pfx_kthread+0x10/0x10
[300146.048310] ret_from_fork+0x47/0x70
[300146.048321] ? __pfx_kthread+0x10/0x10
[300146.048331] ret_from_fork_asm+0x1b/0x30
[300146.048341] </TASK>
```

This seems to match, although the lockups I experienced previously may have been somewhat different.
I mostly remember the client complaining about the server not responding rather than the message shown here, and the server call trace used to include btrfs, which made me suspect the problem might be specific to that filesystem. The issue was always with NFS, though; nothing else locked up despite several other sources of heavy I/O.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nfs-utils (Ubuntu):
status: New → Confirmed
Revision history for this message
Jeff (jeff09) wrote :

Ran into this again just hours after commenting, by attempting to unpack a large archive file. Apparently the new setup, a higher-performance host with more network bandwidth, is simply too much for NFS to remain usable with this issue present.
Reading alone still seems to be fine, though.

Torturing 50% of the memory with memtester didn't reveal any problems, and having the server unpack the files itself produces the same mixed I/O with no issues so far, progressing well beyond where I could get over NFS, so it doesn't look like an HDD issue either.

The info dumping described at the link below seems potentially useful for the next catch, so linking it here:
https://www.spinics.net/lists/linux-nfs/msg97486.html

At some point I may try to reproduce the issue with writing alone to see whether the mixed workload is really required for the freeze, but I'm not sure when I'll get to it: nfsd can't simply be restarted, and the mandatory reboot to get it going again is only feasible when all the other tasks running on the host can be interrupted.

Revision history for this message
Jeff (jeff09) wrote :

Oh fun: while trying to add multiple attachments I ended up finding a bug report from 2007 complaining that apport can do this while the web interface remains limited, so I guess I'll get a bit spammy.

Ran into the same hang as usual again, even while avoiding heavy utilization. I'm not seeing anything too interesting in the gathered info, and apparently the kernel log buffer is not large enough to hold a complete task list dump, but most of the information seems to be there at least.

Will try running with RPCNFSDCOUNT=1 for some time in case this is a silly deadlock; it's definitely easier to give that a try than to downgrade to Ubuntu 22.04, which worked well.
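
For anyone wanting to try the same single-thread experiment, a rough sketch of where the nfsd thread count lives on Ubuntu; the exact file depends on the nfs-utils version, so treat both paths below as assumptions to check locally:

```bash
# Older packaging: /etc/default/nfs-kernel-server
#   RPCNFSDCOUNT=1
# Newer nfs-utils (e.g. 2.6.x on 24.04): /etc/nfs.conf
#   [nfsd]
#   threads=1
sudo systemctl restart nfs-server
# Verify how many nfsd threads are actually running
cat /proc/fs/nfsd/threads
```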

Revision history for this message
GuoqingJiang (guoqingjiang) wrote :

IIUC, on the NFS server side both 23.10 (6.5 series) and 24.04 (6.8 series) show a similar issue, but 22.04 (probably a 5.15.x kernel) was OK.

And what is the kernel version on the NFS client? Has it changed, or has it stayed on one particular version?

Anyway, I guess the most efficient way forward is to bisect which commit caused the issue. Thanks!

Revision history for this message
Jeff (jeff09) wrote :

The tricky part for me is that the client was changing regularly, so I can't confidently say when the errors started appearing. It is very suspicious that high load is needed (a new host and higher network bandwidth made the issue more frequent) and that uploading to the server is the trigger: pure downloading doesn't seem to be a problem even when cached data is sent at full bandwidth for minutes.

I moved the server to 24.04, but I've also moved some I/O-heavy tasks onto it so there is less need for uploading. The client was on 23.10, and I'm still holding back on upgrading it for a few more weeks.

I can't say a whole lot about the current situation, since I'm not uploading much anymore in order to avoid the issue, but I did run into a hang a few days ago. I didn't have time to debug it, and the server didn't want to restart gracefully, so I ended up hard rebooting.
I believe it was the first occurrence since moving the I/O-heavy tasks; I wanted to upload a few hundred GiB of data back to the server, data that had been downloaded from there a while ago without problems. Otherwise, light I/O doesn't seem to hit this problem: the occasional backup to the server is fine, but that rarely saturates the network and likely fits completely into the page cache almost every time.

A few hopefully helpful points for reproducing the problem:
- As mentioned multiple times, downloading alone seems to be unaffected; uploading is what should be stressed. I suspect that either no simultaneous download is needed at all, or casual filesystem browsing is a good enough extra load.
- A fast client with high bandwidth is key. I ran into this issue a couple of times with an older host on 1 Gb/s, but a new fast host with 2.5 Gb/s made the issue appear significantly more frequently.
- It likely doesn't matter how the link gets saturated, but I either processed files cached on the server (mixed R/W) or uploaded cached files (a fast SSD should be fine too), meaning the bottleneck was always the network, at least while the caches were large enough.
- Files were large, so there was no stopping to fiddle with metadata as there would be with small files, and the page cache was often exhausted. The target was a single HDD most of the time, which often meant writes started blocking (a 100-ish MiB/s HDD catching up with close to 250 MiB/s of incoming data), occasionally making the hosts freeze, since the kernel's background I/O handling is still poor and we just pretend the issue is gone because SSDs are fast enough not to hit it. The page-cache-draining freezes may be good at exposing race conditions. A rough load sketch follows this list.
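
A rough sketch of the upload-heavy load described above; the paths and sizes are placeholders, not the original workload:

```bash
# Placeholders: MOUNT is the NFS mount on the client, SRC is a local
# directory of large files that fits in the client's page cache.
MOUNT=/mnt/nfs
SRC=/var/tmp/bigfiles
# Warm the client cache so the network, not the local disk, is the bottleneck,
# then push everything to the server.
cat "$SRC"/* > /dev/null
cp -v "$SRC"/* "$MOUNT"/
# Or generate data on the fly instead of copying pre-existing files.
dd if=/dev/zero of="$MOUNT/stress.bin" bs=1M count=20480 status=progress
```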

It may be more efficient to start by looking at what causes the "RPC: Could not send backchannel reply error: -110" log spam, which might be related. The lockup can take a long time to catch, whereas that kernel message shows up quite frequently.
Even now I have plenty of those lines without experiencing issues, and I'm not even uploading much, mostly just downloading large files.

Some extra info which may or may not matter:
- The server hardware is quite weak, with an old 4-core Broadwell CPU, possibly helping to expose race conditions
- All file systems are Btrfs with noatime,discard=async,...


Revision history for this message
yuhldr (yuh) wrote (last edit ):

I encountered the same problem. After several days of testing, the problem can be reproduced 100% of the time. On Ubuntu 24.04, a 10 Gb/s optical fiber connection is used between the login node and the computing node. The computing node uses NFS to mount the /home of the login node, and the entire system is managed with slurm.

The login node submits jobs that do a large amount of reading and writing on /home. Each program reads a local txt file of about 10 GB in size with python-numpy, then splits it into multiple small files of 100 MB and saves them as npy files.

I submit 252 similar programs at once and run them simultaneously. Within one hour, the NFS service on the login node gets stuck. At that point the nfs-server service on the login node cannot be restarted, and the login node cannot ssh to the computing node. Restarting the computing node does not make the problem go away, but simply restarting the login node does: ssh is restored and the computing node automatically reconnects to NFS successfully.

```bash
root 1548 0.0 0.0 5632 1792 ? Ss 18:19 0:00 /usr/sbin/nfsdcld
root 2347 4.6 0.0 0 0 ? D 18:19 8:04 [nfsd]
root 53326 0.0 0.0 0 0 ? D 20:00 0:00 [kworker/u112:2+nfsd4_callbacks]
root 68918 0.0 0.0 2704 1792 ? Is 20:47 0:00 /usr/sbin/rpc.nfsd 0
root 74448 0.0 0.0 9436 2240 pts/6 S+ 21:11 0:00 grep --color=auto --ex
```

```log
Jun 23 20:48:52 icpcs systemd[1]: nfs-server.service: Stopping timed out. Terminating.
Jun 23 20:49:10 icpcs sudo[69464]: root : TTY=pts/6 ; PWD=/root ; USER=root ; COMMAND=/usr/bin/systemctl status nfs-server.service
Jun 23 20:50:23 icpcs systemd[1]: nfs-server.service: State 'stop-sigterm' timed out. Killing.
Jun 23 20:50:23 icpcs systemd[1]: nfs-server.service: Killing process 68918 (rpc.nfsd) with signal SIGKILL.
Jun 23 20:50:27 icpcs kernel: INFO: task nfsd:2347 blocked for more than 1105 seconds.
Jun 23 20:50:27 icpcs kernel: task:nfsd state:D stack:0 pid:2347 tgid:2347 ppid:2 flags:0x00004000
Jun 23 20:50:27 icpcs kernel: nfsd4_probe_callback_sync+0x1a/0x30 [nfsd]
Jun 23 20:50:27 icpcs kernel: nfsd4_destroy_session+0x186/0x260 [nfsd]
Jun 23 20:50:27 icpcs kernel: nfsd4_proc_compound+0x3af/0x770 [nfsd]
Jun 23 20:50:27 icpcs kernel: nfsd_dispatch+0xd4/0x220 [nfsd]
Jun 23 20:50:27 icpcs kernel: ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
Jun 23 20:50:27 icpcs kernel: ? __pfx_nfsd+0x10/0x10 [nfsd]
Jun 23 20:50:27 icpcs kernel: nfsd+0x8b/0xe0 [nfsd]
```
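
In case someone wants to retry this reproducer without slurm, a rough shell approximation; the path, job count, and file sizes are placeholders, not the original numpy jobs:

```bash
# Many concurrent writers, each producing a series of ~100 MB files on the
# NFS-mounted home, loosely mimicking the described npy output.
NFS_DIR=/home/testuser/nfs-stress   # placeholder path on the NFS mount
mkdir -p "$NFS_DIR"
for job in $(seq 1 64); do
  (
    for chunk in $(seq 1 100); do
      dd if=/dev/urandom of="$NFS_DIR/job${job}_chunk${chunk}.npy" \
         bs=1M count=100 conv=fsync status=none
    done
  ) &
done
wait
```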

Revision history for this message
Ricardo Cruz (ric8cruz) wrote :

For the benefit of others... Like #10, we also have a cluster with this issue. As a workaround, we are using version 3 of the NFS protocol (`nfsvers=3` in `/etc/fstab`), which so far seems to have eliminated the problem for us.
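
For reference, a hedged example of what such an fstab entry might look like; the server name, export path, mount point, and extra options are placeholders:

```bash
# /etc/fstab on the clients (placeholder names and options)
server.example.com:/export/data  /mnt/data  nfs  nfsvers=3,hard,noatime  0  0
```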
