Release 5.4.0-1097 Floods NFS server with 50-600 metadata requests per second per server
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
linux (Ubuntu) |
Confirmed
|
Undecided
|
Unassigned | ||
linux-aws (Ubuntu) |
Confirmed
|
Undecided
|
Unassigned | ||
linux-aws-5.4 (Ubuntu) |
Confirmed
|
Undecided
|
Unassigned | ||
linux-signed-aws-5.4 (Ubuntu) |
Confirmed
|
Undecided
|
Unassigned |
Bug Description
We are observing a massive uptick in metadata requests to an NFSv4.1 (AWS EFS) network drive that is linked to a one or more web servers. This started happening about a week ago across a number of stacks on AWS. This issue ended up exhausting all the burst credits we had accrued and caused a production outage.
Diagnostics...
`nfsiostat` shows 50-600 ops/s on the servers where it is happening.
`nfsstat -m` shows these mount flags: `Flags: rw,relatime,
`nfstrace --mode=live --verbose=2` shows that the same operations are happening repeatedly on 2-5 filehandles;
```
CALL [ operations: 4 tag: void minor version: 1
[ SEQUENCE(53) [ sessionid: 0xeab1201a93b66
[ PUTFH(22) [ object: 839c9918169aed4
[ ACCESS(3) [ READ LOOKUP MODIFY EXTEND DELETE ]
[ GETATTR(9) [ mask: 0x18300000 CHANGE SIZE TIME_METADATA TIME_MODIFY ] ]
REPLY [ operations: 4 status: OK tag: void
[ SEQUENCE(53) [ status: OK session: 0xeab1201a93b66
[ PUTFH(22) [ status: OK ]
[ ACCESS(3) [ status: OK supported: READ LOOKUP MODIFY EXTEND DELETE access: READ LOOKUP MODIFY EXTEND DELETE ]
[ GETATTR(9) [ status: OK mask: 0x18300000 CHANGE SIZE TIME_METADATA TIME_MODIFY ] ]
```
`lsof -N` shows nothing using files from the NFS drive.
Unmounting and remounting the disk stops the issue temporarily
I also enabled rpcdebugging via `rpcdebug -m nfs -s all` and a snippet of the output is below;
```
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213426] NFS: nfs_update_
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213428] NFS: permission(
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213430] NFS: nfs_lookup_
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213432] --> nfs41_call_
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213433] --> nfs4_alloc_slot used_slots=0000 highest_
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213433] <-- nfs4_alloc_slot used_slots=0001 highest_used=0 slotid=0
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213436] encode_sequence: sessionid=
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213842] decode_attr_type: type=00
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213843] decode_attr_change: change attribute=2153
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213844] decode_attr_size: file size=71680
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213845] decode_attr_fsid: fsid=(0x0/0x0)
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213845] decode_attr_fileid: fileid=0
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213846] decode_
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213846] decode_attr_mode: file mode=00
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213847] decode_attr_nlink: nlink=1
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213848] decode_attr_rdev: rdev=(0x0:0x0)
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213848] decode_
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213849] decode_
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213849] decode_
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213850] decode_
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213851] decode_
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213851] decode_
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213852] decode_
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213853] --> nfs4_alloc_slot used_slots=0001 highest_used=0 max_slots=64
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213854] <-- nfs4_alloc_slot used_slots=0003 highest_used=1 slotid=1
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213855] nfs4_free_slot: slotid 1 highest_used_slotid 0
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213855] nfs41_sequence_
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213856] nfs4_free_slot: slotid 0 highest_used_slotid 4294967295
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213860] NFS: nfs_update_
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213862] NFS: permission(
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213864] NFS: nfs_lookup_
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.213867] NFS: dentry_
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.214000] NFS: permission(
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.214003] --> nfs41_call_
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.214004] --> nfs4_alloc_slot used_slots=0000 highest_
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.214005] <-- nfs4_alloc_slot used_slots=0001 highest_used=0 slotid=0
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.214009] encode_sequence: sessionid=
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.214472] decode_attr_type: type=00
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.214473] decode_attr_change: change attribute=1626
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.214473] decode_attr_size: file size=38912
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.214474] decode_attr_fsid: fsid=(0x0/0x0)
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.214475] decode_attr_fileid: fileid=0
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.214475] decode_
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.214476] decode_attr_mode: file mode=00
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.214476] decode_attr_nlink: nlink=1
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.214477] decode_attr_rdev: rdev=(0x0:0x0)
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.214477] decode_
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.214478] decode_
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.214479] decode_
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.214479] decode_
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.214480] decode_
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.214480] decode_
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.214481] decode_
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.214483] --> nfs4_alloc_slot used_slots=0001 highest_used=0 max_slots=64
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.214483] <-- nfs4_alloc_slot used_slots=0003 highest_used=1 slotid=1
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.214484] nfs4_free_slot: slotid 1 highest_used_slotid 0
Mar 17 10:20:41 ip-10-1-2-84 kernel: [ 3897.214485] nfs41_sequence_
```
This loops repeatedly at a rate roughly consistent with ops/s output from `nfsiostat`.
I tried rolling back to `5.4.0-1096-aws`, updating grub and rebooting and the issue no longer occurs. I see the issue again once rolling forward to `5.4.0-1097-aws`, updating grub and rebooting as soon as the web server loads anything from the NFS mount.
Output from `ubuntu-bug linux`
```
ProblemType: Bug
ApportVersion: 2.20.9-0ubuntu7.28
Architecture: amd64
Date: Fri Mar 17 16:29:14 2023
Dependencies:
adduser 3.116ubuntu1
apt 1.6.14
apt-utils 1.6.14
busybox-initramfs 1:1.27.2-2ubuntu3.4
ca-certificates 20211016ubuntu0
coreutils 8.28-1ubuntu1
cpio 2.12+dfsg-
debconf 1.5.66ubuntu1
debconf-i18n 1.5.66ubuntu1
dmsetup 2:1.02.
dpkg 1.19.0.5ubuntu2.4
fdisk 2.31.1-0.4ubuntu3.7
gcc-8-base 8.4.0-1ubuntu1~
gettext-base 0.19.8.1-6ubuntu0.3
gpgv 2.2.4-1ubuntu1.6
grub-common 2.02-2ubuntu8.25
grub-gfxpayloa
grub-pc 2.02-2ubuntu8.25
grub-pc-bin 2.02-2ubuntu8.25
grub2-common 2.02-2ubuntu8.25
init-system-
initramfs-tools 0.130ubuntu3.13
initramfs-
initramfs-
klibc-utils 2.0.4-9ubuntu2.2
kmod 24-1ubuntu3.5
libacl1 2.2.52-3build1
libapt-inst2.0 1.6.14
libapt-pkg5.0 1.6.14
libattr1 1:2.4.47-2build1
libaudit-common 1:2.8.2-1ubuntu1.1
libaudit1 1:2.8.2-1ubuntu1.1
libblkid1 2.31.1-0.4ubuntu3.7
libbz2-1.0 1.0.6-8.1ubuntu0.2
libc6 2.27-3ubuntu1.6
libcap-ng0 0.7.7-3.1
libdb5.3 5.3.28-
libdevmapper1.02.1 2:1.02.
libfdisk1 2.31.1-0.4ubuntu3.7
libffi6 3.2.1-8
libfreetype6 2.8.1-2ubuntu2.2
libfuse2 2.9.7-1ubuntu1
libgcc1 1:8.4.0-
libgcrypt20 1.8.1-4ubuntu1.3
libgmp10 2:6.1.2+
libgnutls30 3.5.18-1ubuntu1.6
libgpg-error0 1.27-6
libgpm2 1.20.7-5
libhogweed4 3.4.1-0ubuntu0.
libidn2-0 2.3.0-1+
libklibc 2.0.4-9ubuntu2.2
libkmod2 24-1ubuntu3.5
liblocale-
liblz4-1 0.0~r131-2ubuntu3.1
liblzma5 5.2.2-1.3ubuntu0.1
libmount1 2.31.1-0.4ubuntu3.7
libncurses5 6.1-1ubuntu1.18.04
libncursesw5 6.1-1ubuntu1.18.04
libnettle6 3.4.1-0ubuntu0.
libp11-kit0 0.23.9-2ubuntu0.1
libpam-modules 1.1.8-3.
libpam-modules-bin 1.1.8-3.
libpam0g 1.1.8-3.
libpcre3 2:8.44-
libpng16-16 1.6.34-
libprocps6 2:3.3.12-3ubuntu1.2
libseccomp2 2.5.1-1ubuntu1~
libselinux1 2.7-2build2
libsemanage-common 2.7-2build2
libsemanage1 2.7-2build2
libsepol1 2.7-1ubuntu0.1
libsmartcols1 2.31.1-0.4ubuntu3.7
libssl1.1 1.1.1-1ubuntu2.
libstdc++6 8.4.0-1ubuntu1~
libsystemd0 237-3ubuntu10.57
libtasn1-6 4.13-2
libtext-
libtext-iconv-perl 1.7-5build6
libtext-
libtinfo5 6.1-1ubuntu1.18.04
libudev1 237-3ubuntu10.57
libunistring2 0.9.9-0ubuntu2
libuuid1 2.31.1-0.4ubuntu3.7
libzstd1 1.4.8+dfsg-
linux-base 4.5ubuntu1.7
linux-
lsb-base 9.20170808ubuntu1
openssl 1.1.1-1ubuntu2.
os-prober 1.74ubuntu1
passwd 1:4.5-1ubuntu2.5
perl-base 5.26.1-6ubuntu0.6
procps 2:3.3.12-3ubuntu1.2
psmisc 23.1-1ubuntu0.1
sensible-utils 0.0.12
tar 1.29b-2ubuntu0.4
ubuntu-keyring 2018.09.
ucf 3.0038
udev 237-3ubuntu10.57
util-linux 2.31.1-0.4ubuntu3.7
uuid-runtime 2.31.1-0.4ubuntu3.7
zlib1g 1:1.2.11.
DistroRelease: Ubuntu 18.04
Ec2AMI: ami-0479a9a73c3
Ec2AMIManifest: (unknown)
Ec2Availability
Ec2InstanceType: t2.medium
Ec2Kernel: unavailable
Ec2Ramdisk: unavailable
Package: linux-image-
PackageArchitec
ProcCpuinfoMinimal:
processor : 1
vendor_id : GenuineIntel
cpu family : 6
model : 79
model name : Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
stepping : 1
microcode : 0xb000040
cpu MHz : 2299.922
cache size : 46080 KB
physical id : 0
siblings : 2
core id : 1
cpu cores : 2
apicid : 2
initial apicid : 2
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx rdtscp lm constant_tsc rep_good nopl xtopology cpuid pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm cpuid_fault invpcid_single pti fsgsbase bmi1 avx2 smep bmi2 erms invpcid xsaveopt
bugs : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs itlb_multihit mmio_stale_data
bogomips : 4600.03
clflush size : 64
cache_alignment : 64
address sizes : 46 bits physical, 48 bits virtual
power management:
ProcEnviron:
TERM=xterm-
PATH=(custom, no user)
XDG_RUNTIME_
LANG=C.UTF-8
SHELL=/bin/bash
ProcVersionSign
SourcePackage: linux-signed-
Tags: third-party-
Uname: Linux 5.4.0-1097-aws x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
_MarkForUpload: True
```
We also need:
1) The release of Ubuntu you are using, via 'lsb_release -rd' or System -> About Ubuntu
Description: Ubuntu 18.04.6 LTS
Release: 18.04
2) The version of the package you are using, via 'apt-cache policy pkgname' or by checking in Software Center
linux-aws:
Installed: 5.4.0.1097.75
Candidate: 5.4.0.1097.75
Version table:
*** 5.4.0.1097.75 500
500 http://
500 http://
100 /var/lib/
4.15.0.1007.7 500
500 http://
3) What you expected to happen
The OS to not hose the NFS server with metadata requests until something breaks
4) What happened instead
The OS hosed the NFS server with metadata requests until it caused a production outage
---
ProblemType: Bug
AlsaDevices: Error: command ['ls', '-l', '/dev/snd/'] failed with exit code 2: ls: cannot access '/dev/snd/': No such file or directory
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
ApportVersion: 2.20.9-0ubuntu7.28
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
CRDA: N/A
DistroRelease: Ubuntu 18.04
Ec2AMI: ami-0b76c3b150c
Ec2AMIManifest: (unknown)
Ec2Availability
Ec2InstanceType: t2.medium
Ec2Kernel: unavailable
Ec2Ramdisk: unavailable
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
Lsusb: Error: command ['lsusb'] failed with exit code 1:
MachineType: Xen HVM domU
Package: linux (not installed)
PciMultimedia:
ProcEnviron:
TERM=xterm-
PATH=(custom, no user)
XDG_RUNTIME_
LANG=C.UTF-8
SHELL=/bin/bash
ProcFB:
ProcKernelCmdLine: BOOT_IMAGE=
ProcVersionSign
RelatedPackageV
linux-
linux-
linux-firmware N/A
RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
Tags: bionic ec2-images
Uname: Linux 5.4.0-1097-aws x86_64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:
_MarkForUpload: True
dmi.bios.date: 08/24/2006
dmi.bios.vendor: Xen
dmi.bios.version: 4.11.amazon
dmi.chassis.type: 1
dmi.chassis.vendor: Xen
dmi.modalias: dmi:bvnXen:
dmi.product.name: HVM domU
dmi.product.
dmi.sys.vendor: Xen
description: | updated |
Changed in linux-aws (Ubuntu): | |
status: | New → Confirmed |
Changed in linux-aws-5.4 (Ubuntu): | |
status: | New → Confirmed |
Changed in linux-signed-aws-5.4 (Ubuntu): | |
status: | New → Confirmed |
This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:
apport-collect 2012020
and then change the status of the bug to 'Confirmed'.
If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.
This change has been made by an automated script, maintained by the Ubuntu Kernel Team.