nfsd becomes unresponsive after some hours of operation

Bug #2062568 reported by Olle Liljenzin
This bug affects 2 people
Affects: nfs-utils (Ubuntu)
Status: Confirmed
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

I installed the 24.04 Beta on two test machines that were running 22.04 without issues before. One of them exports two volumes that are mounted by the other machine, which primarily uses them as secondary storage for ccache.
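For context, a setup like this would typically be exported with a couple of lines in /etc/exports; the paths and client subnet in this sketch are hypothetical, not taken from the report:

```
# Hypothetical /etc/exports entries for two exported volumes (as root;
# paths and subnet are illustrative only)
cat >> /etc/exports <<'EOF'
/srv/ccache1  192.168.1.0/24(rw,sync,no_subtree_check)
/srv/ccache2  192.168.1.0/24(rw,sync,no_subtree_check)
EOF
exportfs -ra   # apply the export list without restarting nfsd
```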

After being up for a couple of hours (this has happened twice since yesterday evening), nfsd on the machine exporting the volumes seems to hang on something.

From dmesg on the server (repeated a few times):

[11183.290548] INFO: task nfsd:1419 blocked for more than 1228 seconds.
[11183.290558] Not tainted 6.8.0-22-generic #22-Ubuntu
[11183.290563] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[11183.290582] task:nfsd state:D stack:0 pid:1419 tgid:1419 ppid:2 flags:0x00004000
[11183.290587] Call Trace:
[11183.290602] <TASK>
[11183.290606] __schedule+0x27c/0x6b0
[11183.290612] schedule+0x33/0x110
[11183.290615] schedule_timeout+0x157/0x170
[11183.290619] wait_for_completion+0x88/0x150
[11183.290623] __flush_workqueue+0x140/0x3e0
[11183.290629] nfsd4_probe_callback_sync+0x1a/0x30 [nfsd]
[11183.290689] nfsd4_destroy_session+0x186/0x260 [nfsd]
[11183.290744] nfsd4_proc_compound+0x3af/0x770 [nfsd]
[11183.290798] nfsd_dispatch+0xd4/0x220 [nfsd]
[11183.290851] svc_process_common+0x44d/0x710 [sunrpc]
[11183.290924] ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
[11183.290976] svc_process+0x132/0x1b0 [sunrpc]
[11183.291041] svc_handle_xprt+0x4d3/0x5d0 [sunrpc]
[11183.291105] svc_recv+0x18b/0x2e0 [sunrpc]
[11183.291168] ? __pfx_nfsd+0x10/0x10 [nfsd]
[11183.291220] nfsd+0x8b/0xe0 [nfsd]
[11183.291270] kthread+0xef/0x120
[11183.291274] ? __pfx_kthread+0x10/0x10
[11183.291276] ret_from_fork+0x44/0x70
[11183.291279] ? __pfx_kthread+0x10/0x10
[11183.291281] ret_from_fork_asm+0x1b/0x30
[11183.291286] </TASK>
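The trace shows nfsd4_destroy_session blocked in nfsd4_probe_callback_sync, waiting for the callback workqueue to flush. While the server is in this state, a bit more context can usually be captured without rebooting; a sketch, using pid 1419 from the trace above:

```
# Kernel stack of the blocked nfsd thread (pid taken from the hung-task report)
cat /proc/1419/stack

# List all tasks stuck in uninterruptible sleep (state D)
ps -eo pid,stat,comm | awk '$2 ~ /^D/'
```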

From dmesg on the client (repeated a number of times):
[ 6596.911785] RPC: Could not send backchannel reply error: -110
[ 6596.972490] RPC: Could not send backchannel reply error: -110
[ 6837.281307] RPC: Could not send backchannel reply error: -110
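Error -110 is -ETIMEDOUT, i.e. the client's attempt to send the backchannel reply to the server timed out. This can be confirmed against the kernel's errno definitions (assuming linux-libc-dev is installed):

```
# ETIMEDOUT is errno 110 on Linux ("Connection timed out")
grep ETIMEDOUT /usr/include/asm-generic/errno.h
```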

ProblemType: Bug
DistroRelease: Ubuntu 24.04
Package: nfs-kernel-server 1:2.6.4-3ubuntu5
ProcVersionSignature: Ubuntu 6.8.0-22.22-generic 6.8.1
Uname: Linux 6.8.0-22-generic x86_64
.etc.request-key.d.id_resolver.conf: create id_resolver * * /usr/sbin/nfsidmap -t 600 %k %d
ApportVersion: 2.28.1-0ubuntu1
Architecture: amd64
CasperMD5CheckResult: pass
Date: Fri Apr 19 14:10:25 2024
InstallationDate: Installed on 2024-04-16 (3 days ago)
InstallationMedia: Ubuntu-Server 24.04 LTS "Noble Numbat" - Beta amd64 (20240410.1)
NFSMounts:

NFSv4Mounts:

ProcEnviron:
 LANG=en_US.UTF-8
 PATH=(custom, no user)
 SHELL=/bin/bash
 TERM=xterm-256color
 XDG_RUNTIME_DIR=<set>
SourcePackage: nfs-utils
UpgradeStatus: No upgrade log present (probably fresh install)

Jeff (jeff09) wrote:

This kind of issue appeared for me with Ubuntu 23.10. The server mostly uses an HDD for bulk storage, and its not exactly powerful CPU is also occupied running WireGuard to secure the NFS connection.

I mention the performance details because I have a feeling they matter. A likewise not exactly high-performance client connecting over 1 Gb/s caused this problem only very occasionally; over a 10 Gb/s connection the issue appeared significantly more often, and a higher-performance setup using a 2.5 Gb/s connection triggered this bug within a couple of days of being set up.
The lockup always seems to occur under heavy NFS usage, suspiciously mostly when there is both reading and writing going on. At least I don't recall it happening with reading only, though I can't confidently state that it never happened under a write-only load.

I found this bug report via the client error message; the server-side trace differs due to the different kernel version:
```
[300146.046666] INFO: task nfsd:1426 blocked for more than 241 seconds.
[300146.046732] Not tainted 6.5.0-27-generic #28~22.04.1-Ubuntu
[300146.046770] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[300146.046813] task:nfsd state:D stack:0 pid:1426 ppid:2 flags:0x00004000
[300146.046827] Call Trace:
[300146.046832] <TASK>
[300146.046839] __schedule+0x2cb/0x750
[300146.046860] schedule+0x63/0x110
[300146.046870] schedule_timeout+0x157/0x170
[300146.046881] wait_for_completion+0x88/0x150
[300146.046894] __flush_workqueue+0x140/0x3e0
[300146.046908] nfsd4_probe_callback_sync+0x1a/0x30 [nfsd]
[300146.047074] nfsd4_destroy_session+0x193/0x260 [nfsd]
[300146.047219] nfsd4_proc_compound+0x3b7/0x770 [nfsd]
[300146.047365] nfsd_dispatch+0xbf/0x1d0 [nfsd]
[300146.047497] svc_process_common+0x420/0x6e0 [sunrpc]
[300146.047695] ? __pfx_read_tsc+0x10/0x10
[300146.047706] ? __pfx_nfsd_dispatch+0x10/0x10 [nfsd]
[300146.047848] ? __pfx_nfsd+0x10/0x10 [nfsd]
[300146.047977] svc_process+0x132/0x1b0 [sunrpc]
[300146.048157] nfsd+0xdc/0x1c0 [nfsd]
[300146.048287] kthread+0xf2/0x120
[300146.048299] ? __pfx_kthread+0x10/0x10
[300146.048310] ret_from_fork+0x47/0x70
[300146.048321] ? __pfx_kthread+0x10/0x10
[300146.048331] ret_from_fork_asm+0x1b/0x30
[300146.048341] </TASK>
```

This seems to match, though the lockups I experienced previously may have been somewhat different.
I mostly remember the client complaining about the server not responding rather than the message presented here, and the server call trace used to have btrfs in it, which made me suspect the problem might be exclusive to that filesystem. The issue was always with NFS, though: nothing else locked up despite some other sources of heavy I/O.

Launchpad Janitor (janitor) wrote:

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in nfs-utils (Ubuntu):
status: New → Confirmed
Jeff (jeff09) wrote:

Ran into this again just hours after commenting, while attempting to unpack a large archive file. Apparently the new setup of a higher-performance host with more network bandwidth is simply too overwhelming for NFS to be usable with this issue present.
Reading alone still seems to be fine, though.

Torturing 50% of the memory with memtester didn't reveal any problems, and having the server do the unpacking itself produces the same mixed I/O with no issues so far, progressing well beyond the point I could reach over NFS, so it doesn't look like an HDD issue either.

The info dumping described here seems potentially useful for the next catch, so linking it here:
https://www.spinics.net/lists/linux-nfs/msg97486.html
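For reference, the on-demand equivalent of that kind of info dump is the magic-sysrq task-state dump; a sketch, assuming root on the server:

```
# Allow all sysrq functions (Ubuntu's default mask excludes the task dump)
sysctl -w kernel.sysrq=1

# Dump only blocked (state D) tasks, which is usually what matters here
echo w > /proc/sysrq-trigger

# Or dump every task; if the dump overflows the ring buffer, boot with a
# larger one (e.g. log_buf_len=4M on the kernel command line) or stream it:
echo t > /proc/sysrq-trigger
dmesg --follow
```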

At some point I may try to reproduce the issue with writing alone, to see whether the mixed workload is really required for the freeze. I'm not sure when I'll get to it: nfsd can't simply be restarted, and the mandatory reboot to get it going again is only feasible when all the other tasks running on the host can be interrupted.
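A write-only load of the kind mentioned could be generated with something as simple as the following; /mnt/nfs is a hypothetical mount point, not a path from this report:

```
# Sustained write-only load over the NFS mount (hypothetical path)
dd if=/dev/zero of=/mnt/nfs/writetest bs=1M count=10240 oflag=direct

# Mixed load for comparison: read one file while writing another on the
# same mount
dd if=/mnt/nfs/writetest of=/mnt/nfs/copy bs=1M
```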

Jeff (jeff09) wrote:

Oh fun: while trying to add multiple attachments, I ended up finding a bug report from 2007 complaining about apport being able to do this while the web interface is limited, so I guess I'll get a bit spammy.

Ran into the same thing as usual again, even while avoiding heavy utilization. I'm not seeing anything too interesting in the gathered info, and apparently the kernel log buffer is not large enough to hold a complete task-list dump, but most of the information at least seems to be there.

I will try running with RPCNFSDCOUNT=1 for some time in case this is a silly deadlock; it's definitely easier to try than downgrading to Ubuntu 22.04, which worked well.
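For anyone wanting to try the same: on Debian/Ubuntu installs that still use the init defaults file, that setting lives in /etc/default/nfs-kernel-server, while newer nfs-utils reads the thread count from /etc/nfs.conf instead. A sketch:

```
# Older setups: single nfsd thread via the defaults file
sed -i 's/^RPCNFSDCOUNT=.*/RPCNFSDCOUNT=1/' /etc/default/nfs-kernel-server

# nfs.conf-based setups: the equivalent is
#   [nfsd]
#   threads=1

systemctl restart nfs-kernel-server
```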
