NFS latency spikes during openat syscalls associated with test_stateid RPC calls

Bug #2046572 reported by Aled Cuda
This bug affects 1 person
Affects: linux (Ubuntu)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

About once a week the NFS mounts on our EDA servers (the NFS clients) slow to a crawl for a large number of workloads. strace reveals that openat system calls for files in the mounts take between 0.1 and 0.3 seconds; this can be seen directly by touching any file in the home directory, which normally takes 0.001-0.002 seconds. This causes obvious problems: for example, an IPython instance in a conda environment installed on NFS takes around 2-3 minutes to start.
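
For reference, the per-open latency can also be measured without strace. Below is a minimal sketch in Python that times open/close cycles; the path ~/testfile is a placeholder for any file inside the affected mount.

    import os
    import time

    # Placeholder path; substitute any file inside the affected NFS mount.
    PATH = os.path.expanduser("~/testfile")

    # Open/close the file repeatedly (os.open ends up as an openat syscall).
    # On a healthy client each cycle takes ~1-2 ms; on an affected client
    # the same cycle takes 100-300 ms.
    samples = []
    for _ in range(20):
        start = time.monotonic()
        fd = os.open(PATH, os.O_RDONLY | os.O_CREAT, 0o644)
        os.close(fd)
        samples.append(time.monotonic() - start)

    print("min %.4fs  max %.4fs  avg %.4fs" %
          (min(samples), max(samples), sum(samples) / len(samples)))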

I originally thought that this was a manifestation of #2009325, but we are on 5.15.0-89 and have seen this problem across several kernel versions.

Additionally, we don't see access RPC calls, but we do see loads of test_stateid calls (see the sketch after this list):
    - bursts of roughly 300 calls every couple of seconds while idle
    - 380,452 test_stateid calls while an IPython instance is starting (zero access calls in the same period; at idle we see 10,267 test_stateid calls over the same ~2:30 window, again with no access calls)
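
A minimal sketch of how such per-op totals can be read on the client from /proc/self/mountstats (the same counters the mountstats tool reports), using the mount point from the description:

    # Per-op RPC counter readout for one NFS mount, parsed from
    # /proc/self/mountstats; MOUNT is the mount point from the description.
    MOUNT = "/home/cudaa"

    counts = {}
    in_mount = in_ops = False
    with open("/proc/self/mountstats") as f:
        for line in f:
            if line.startswith("device "):
                in_mount = (" mounted on %s with " % MOUNT) in line
                in_ops = False
            elif in_mount and line.strip() == "per-op statistics":
                in_ops = True
            elif in_mount and in_ops and ":" in line:
                op, rest = line.strip().split(":", 1)
                fields = rest.split()
                if fields:
                    counts[op] = int(fields[0])  # first field is the op count

    for op in ("ACCESS", "TEST_STATEID"):
        print(op, counts.get(op, 0))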

We're not using Kerberos, just IP-based authentication; our mount options are:

exoserver.example.com:/volume1/homes/cudaa on /home/cudaa type nfs4 (rw,relatime,vers=4.1,rsize=131072,wsize=131072,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=xxx.xxx.xxx.219,local_lock=none,addr=xxx.xxx.xxx.146)

We see this on servers with both linux-image-5.15.0-89-generic and linux-headers-5.15.0-83-generic, and have seen it across a few kernel versions (we have the HWE stack installed, if that matters). I could upgrade them to -91, but the issue clears after a reboot and we don't have a reliable way to reproduce it, so I'm going to keep one affected machine online for debugging.
---
ProblemType: Bug
ApportVersion: 2.20.11-0ubuntu27.27
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: gdm 3334 F.... pulseaudio
CasperMD5CheckResult: skip
DistroRelease: Ubuntu 20.04
InstallationDate: Installed on 2022-04-05 (618 days ago)
InstallationMedia: Ubuntu 20.04.4 LTS "Focal Fossa" - Release amd64 (20220223)
IwConfig:
 lo no wireless extensions.

 enp69s0 no wireless extensions.

 enp68s0 no wireless extensions.
MachineType: ASUS System Product Name
NonfreeKernelModules: nvidia_modeset nvidia
Package: linux (not installed)
ProcEnviron:
 TERM=screen
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/zsh
ProcFB: 0 EFI VGA
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-5.15.0-89-generic root=UUID=ec5ca229-1bd1-461a-98fd-d0f4eba0dd26 ro fsck.mode=force fsck.repair=yes
ProcVersionSignature: Ubuntu 5.15.0-89.99~20.04.1-generic 5.15.126
PulseList: Error: command ['pacmd', 'list'] failed with exit code 1: No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions:
 linux-restricted-modules-5.15.0-89-generic N/A
 linux-backports-modules-5.15.0-89-generic N/A
 linux-firmware 1.187.39
RfKill:

Tags: focal
Uname: Linux 5.15.0-89-generic x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups: N/A
_MarkForUpload: True
dmi.bios.date: 03/04/2022
dmi.bios.release: 16.3
dmi.bios.vendor: American Megatrends Inc.
dmi.bios.version: 1603
dmi.board.asset.tag: Default string
dmi.board.name: ROG ZENITH II EXTREME
dmi.board.vendor: ASUSTeK COMPUTER INC.
dmi.board.version: Rev 1.xx
dmi.chassis.asset.tag: Default string
dmi.chassis.type: 3
dmi.chassis.vendor: Default string
dmi.chassis.version: Default string
dmi.modalias: dmi:bvnAmericanMegatrendsInc.:bvr1603:bd03/04/2022:br16.3:svnASUS:pnSystemProductName:pvrSystemVersion:rvnASUSTeKCOMPUTERINC.:rnROGZENITHIIEXTREME:rvrRev1.xx:cvnDefaultstring:ct3:cvrDefaultstring:skuSKU:
dmi.product.family: To be filled by O.E.M.
dmi.product.name: System Product Name
dmi.product.sku: SKU
dmi.product.version: System Version
dmi.sys.vendor: ASUS

Revision history for this message
Aled Cuda (aledvirgil) wrote : apport information attachments

AlsaInfo.txt, CRDA.txt, CurrentDmesg.txt, Lspci.txt, Lspci-vt.txt, Lsusb.txt, Lsusb-t.txt, Lsusb-v.txt, ProcCpuinfo.txt, ProcCpuinfoMinimal.txt, ProcInterrupts.txt, ProcModules.txt, UdevDb.txt, WifiSyslog.txt, acpidump.txt

tags: added: apport-collected focal
description: updated

Revision history for this message
Aled Cuda (aledvirgil) wrote :

I have tried setting nfs_fasc through sysfs (after the module is loaded) and remounting with "mount -o remount", but that does not clear the issue.
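
For reference, the nfs module's parameters are exposed under /sys/module/nfs/parameters/. The sketch below lists them with their current values; the write at the end is commented out and uses a placeholder name, not a claim about which parameter is meant above.

    import os

    PARAM_DIR = "/sys/module/nfs/parameters"

    # List every parameter the loaded nfs module exposes, with its current value.
    for name in sorted(os.listdir(PARAM_DIR)):
        path = os.path.join(PARAM_DIR, name)
        try:
            with open(path) as f:
                value = f.read().strip()
        except OSError:
            value = "<unreadable>"
        print(name, "=", value)

    # Writing a new value requires root; "some_parameter" is a placeholder for
    # whichever tunable is actually meant above.
    # with open(os.path.join(PARAM_DIR, "some_parameter"), "w") as f:
    #     f.write("1")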

Revision history for this message
Aled Cuda (aledvirgil) wrote :

The server in question is a Synology NAS running 'DSM 7.2-64570 Update 3', but this at least on the surface appears to be a client-side bug: the same mounts on other machines (with the same options) are fine, and it can be cleared by a client reboot but not by a remount.
