General Protection fault in inotify (fixed upstream)

Bug #1771075 reported by KJ Tsanaktsidis on 2018-05-14
This bug affects 1 person
Affects: linux (Ubuntu)
  Status tracked in Cosmic
  Cosmic: Importance Medium, assigned to Joseph Salisbury
Affects: linux-gcp (Ubuntu)
  Importance Medium, assigned to Joseph Salisbury

Bug Description

We've run into an issue where upgrading the kernel from the 4.10 series to the 4.13 series on Ubuntu 16.04 hosts that make heavy use of inotify causes kernel panics and lockups in inotify-related code. Our particular use case hit these at a rate of roughly one every 30 minutes when serving production traffic. Unfortunately, I have been unable to reproduce the issue so far in a simulated load-testing environment.

When the issue occurs, we get dmesg entries like "BUG: soft lockup - CPU#0 stuck for 22s!" or "General protection fault: 0000 [#1] SMP PTI". In the soft lockup case, the host stays up but all I/O operations stall indefinitely (e.g. typing "sync" into the console hangs forever). In the protection fault case, the system reboots. I've attached dmesg output from the two cases to this bug report.
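
A minimal Python sketch of how saved dmesg output could be checked for the two signatures quoted above. The names `SIGNATURES` and `classify_dmesg` are made up for illustration, and the patterns are based only on the two messages in this report, so other variants of these messages may not match.

```python
import re

# Patterns derived from the two messages quoted in this report.
# Matching is case-insensitive so capitalization differences do not matter.
SIGNATURES = {
    "soft_lockup": re.compile(r"BUG: soft lockup - CPU#\d+ stuck for \d+s!", re.IGNORECASE),
    "gpf": re.compile(r"general protection fault", re.IGNORECASE),
}


def classify_dmesg(text):
    """Return the sorted names of the signatures found in dmesg text."""
    found = set()
    for line in text.splitlines():
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                found.add(name)
    return sorted(found)
```

It could be fed the contents of a saved log, for example `classify_dmesg(open("dmesg.txt").read())`, where `dmesg.txt` is a hypothetical file name.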

We have noticed the issue with the following kernels:
- linux-image-4.13.0-1013-gcp
- linux-image-4.13.0-1015-gcp
- linux-image-4.13.0-36-generic

We did _not_ have the issue with
- linux-image-4.10.0-32-generic
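
For anyone triaging a fleet against the lists above, a minimal Python sketch follows. It assumes the release string has the form printed by `uname -r`. The names `REPORTED_AFFECTED`, `REPORTED_UNAFFECTED`, `kernel_series` and `triage` are made up for illustration, and flagging other 4.13 builds as suspect is an assumption, since the report only confirms the builds listed.

```python
import re

# Builds named in this report (package names minus the "linux-image-" prefix).
REPORTED_AFFECTED = {"4.13.0-1013-gcp", "4.13.0-1015-gcp", "4.13.0-36-generic"}
REPORTED_UNAFFECTED = {"4.10.0-32-generic"}


def kernel_series(release):
    """Return (major, minor) parsed from a release string such as '4.13.0-36-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: {!r}".format(release))
    return int(match.group(1)), int(match.group(2))


def triage(release):
    """Classify a kernel release relative to the builds listed in the report."""
    if release in REPORTED_AFFECTED:
        return "reported affected"
    if release in REPORTED_UNAFFECTED:
        return "reported unaffected"
    if kernel_series(release) == (4, 13):
        # Assumption: other 4.13 builds may share the bug; the report does not confirm this.
        return "same series as affected builds"
    return "unknown"
```

On a live host the input could come from `platform.release()` in the standard library.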

I've submitted this bug report from a system which should be configured identically to the production hosts that were having the issue (the affected hosts were immediately rolled back to 4.10).

This bug appears to have been fixed upstream as of 4.17-rc3 in this commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=d90a10e2444ba5a351fa695917258ff4c5709fa5

I would guess that perhaps this patch should be backported into both the 4.13 HWE and GCP Ubuntu kernel series?

Thanks,
KJ

ProblemType: Bug
DistroRelease: Ubuntu 16.04
Package: linux-image-4.13.0-1013-gcp 4.13.0-1013.17
ProcVersionSignature: Ubuntu 4.13.0-1013.17-gcp 4.13.16
Uname: Linux 4.13.0-1013-gcp x86_64
ApportVersion: 2.20.1-0ubuntu2.16
Architecture: amd64
Date: Mon May 14 07:58:29 2018
ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
SourcePackage: linux-gcp
UpgradeStatus: No upgrade log present (probably fresh install)
---
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 May 10 07:57 seq
 crw-rw---- 1 root audio 116, 33 May 10 07:57 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.20.1-0ubuntu2.16
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: N/A
DistroRelease: Ubuntu 16.04
IwConfig: Error: [Errno 2] No such file or directory
Lsusb: Error: command ['lsusb'] failed with exit code 1:
MachineType: Google Google Compute Engine
Package: linux (not installed)
PciMultimedia:

ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.10.0-32-generic root=UUID=73ea38ed-7fcd-4871-8afa-17d36f4e4bfc ro scsi_mod.use_blk_mq=Y console=ttyS0
ProcVersionSignature: Ubuntu 4.10.0-32.36~16.04.1-generic 4.10.17
RelatedPackageVersions:
 linux-restricted-modules-4.10.0-32-generic N/A
 linux-backports-modules-4.10.0-32-generic N/A
 linux-firmware N/A
RfKill: Error: [Errno 2] No such file or directory
Tags: xenial uec-images
Uname: Linux 4.10.0-32-generic x86_64
UnreportableReason: The report belongs to a package that is not installed.
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

WifiSyslog:

_MarkForUpload: False
dmi.bios.date: 01/01/2011
dmi.bios.vendor: Google
dmi.bios.version: Google
dmi.board.asset.tag: 98BEC19B-1DEB-1A9F-1146-C6E4D8577ADB
dmi.board.name: Google Compute Engine
dmi.board.vendor: Google
dmi.chassis.type: 1
dmi.chassis.vendor: Google
dmi.modalias: dmi:bvnGoogle:bvrGoogle:bd01/01/2011:svnGoogle:pnGoogleComputeEngine:pvr:rvnGoogle:rnGoogleComputeEngine:rvr:cvnGoogle:ct1:cvr:
dmi.product.name: Google Compute Engine
dmi.sys.vendor: Google

KJ Tsanaktsidis (ktsanaktsidis) wrote :
affects: linux-gcp (Ubuntu) → linux (Ubuntu)

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1771075

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete

apport information

tags: added: apport-collected
description: updated

I've uploaded the apport information from a host that had the issue. Note that I had to downgrade the kernel back to 4.10 because of this issue, so that is the kernel reflected in the apport information.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu Artful):
status: New → Triaged
importance: Undecided → Medium
Changed in linux (Ubuntu):
importance: Undecided → Medium
Changed in linux (Ubuntu Artful):
assignee: nobody → Joseph Salisbury (jsalisbury)
status: Triaged → In Progress
Changed in linux (Ubuntu):
status: Confirmed → In Progress
Changed in linux (Ubuntu Bionic):
status: New → In Progress
importance: Undecided → Medium
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Cosmic):
assignee: nobody → Joseph Salisbury (jsalisbury)
Joseph Salisbury (jsalisbury) wrote :

I built Artful and Bionic test kernels with commit d90a10e2444ba5a351fa695917258ff4c5709fa5. The test kernels can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1771075

Can you test this kernel and see if it resolves this bug?

Note about installing test kernels:
• If the test kernel is prior to 4.15 (Bionic), you need to install the linux-image and linux-image-extra .deb packages.
• If the test kernel is 4.15 (Bionic) or newer, you need to install the linux-image-unsigned, linux-modules and linux-modules-extra .deb packages.
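
The rule in the two bullets above can be sketched in Python as follows. The function name `required_package_prefixes` is made up for illustration, and the sketch assumes the version string starts with `major.minor`, as in `4.13.0`.

```python
def required_package_prefixes(kernel_version):
    """Return the .deb package name prefixes to install for a test kernel version."""
    major, minor = (int(part) for part in kernel_version.split(".")[:2])
    if (major, minor) < (4, 15):
        # Test kernels prior to 4.15 (Bionic)
        return ["linux-image", "linux-image-extra"]
    # Test kernels at 4.15 (Bionic) or newer
    return ["linux-image-unsigned", "linux-modules", "linux-modules-extra"]
```

For the Artful (4.13) test kernel this yields the two linux-image packages, and for the Bionic (4.15) test kernel the three newer ones.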

Thanks in advance!

Awesome, thanks for getting back to me so quickly! Unfortunately we're in a change freeze at the moment - I'll try and get an exemption approved to test this but I may have to wait until Monday to try it out.

Is there anything special I have to do to boot your test kernels on Xenial other than installing the debs with dpkg, setting GRUB_DEFAULT to point to the test kernel, running update-grub, and rebooting?

Joseph Salisbury (jsalisbury) wrote :

To install the kernel, just use dpkg like you said. Instead of changing GRUB_DEFAULT, you could also manually select the test kernel from the GRUB menu on boot. The GRUB menu can be accessed by holding the SHIFT key on boot up, after the BIOS information is displayed.
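
For hosts without console access, the GRUB_DEFAULT route mentioned in the question could be scripted roughly as below. This is only a sketch under stated assumptions: `set_grub_default` and `install_steps` are made-up helper names, the menu entry string must be copied from the host's own /boot/grub/grub.cfg rather than guessed, and the commands are returned as lists rather than executed.

```python
import re


def set_grub_default(config_text, entry):
    """Return the text of /etc/default/grub with GRUB_DEFAULT set to the given entry."""
    new_line = 'GRUB_DEFAULT="{}"'.format(entry)
    pattern = re.compile(r"^GRUB_DEFAULT=.*$", re.MULTILINE)
    if pattern.search(config_text):
        # Replace the existing setting; a callable avoids backslash escapes in the entry.
        return pattern.sub(lambda match: new_line, config_text, count=1)
    # No existing setting: append one, keeping the file newline-terminated.
    separator = "" if config_text.endswith("\n") or not config_text else "\n"
    return config_text + separator + new_line + "\n"


def install_steps(deb_paths):
    """Return the commands (as argv lists) to install the debs, regenerate GRUB, and reboot."""
    return [["dpkg", "-i", *deb_paths], ["update-grub"], ["reboot"]]
```

Each list from `install_steps` could be passed to `subprocess.run` as root once the rewritten configuration has been saved.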

Good news - I got the chance to test this in our production environment today for about 7 hours and saw no issues whatsoever. Given that this workload was previously triggering the issue pretty reliably within about half an hour, I think this fix has done the trick. Thanks a bunch for your help!

What are the next steps here?

Joseph Salisbury (jsalisbury) wrote :

I will submit an SRU request to have that commit included in the affected Ubuntu kernels. Then the fix will be available in the next set of kernel updates.

Cool! I noticed you've marked the bug as affecting Artful, Bionic and Cosmic but it also affects Xenial with the 4.13 HWE kernels; should that be marked here as well?

Joseph Salisbury (jsalisbury) wrote :

Any fix that goes into Artful is also applied to the 4.13 HWE kernel in Xenial. The fix will get into the HWE kernel that way.

Changed in linux (Ubuntu Cosmic):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Bionic):
status: In Progress → Fix Committed
Joseph Salisbury (jsalisbury) wrote :

The commit to fix this bug was added to Artful and Bionic via bug 1765564, so I'll remove those bug tasks. I did, however, submit a request to have this commit added to Cosmic.

Changed in linux (Ubuntu Artful):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Cosmic):
status: Fix Committed → In Progress
no longer affects: linux (Ubuntu Artful)
no longer affects: linux (Ubuntu Bionic)

Yup, I think this is a dupe of that. I noticed that bug was filed against linux-azure; do I need to file a corresponding bug against linux-gcp to get the patch sent there as well?

Joseph Salisbury (jsalisbury) wrote :

Thanks for the heads up, Joshua!

No need to open a separate bug for linux-gcp. I added that package to this bug. Bug 1765564 does not cover Cosmic or linux-gcp, so we can use this bug for those two.

no longer affects: linux-gcp (Ubuntu Cosmic)
Changed in linux-gcp (Ubuntu):
importance: Undecided → Medium
status: New → In Progress
assignee: nobody → Joseph Salisbury (jsalisbury)

Thanks, much appreciated!
