4.15 kernel hard lockup about once a week

Bug #1799497 reported by Stéphane Graber on 2018-10-23
This bug affects 1 person
Affects           Status  Importance  Assigned to  Milestone
linux (Ubuntu)            High        Unassigned
  Bionic                  High        Unassigned

Bug Description

My main server has been running into hard lockups about once a week ever since I switched to the 4.15 Ubuntu 18.04 kernel.

When this happens, nothing is printed to the console; it's effectively stuck showing a login prompt. The system is running with panic=1 on the cmdline but isn't rebooting, so the kernel isn't even processing this as a kernel panic.

As this felt like a potential hardware issue, I had my hosting provider give me a completely different system: different motherboard, different CPU, different RAM and different storage. I installed 18.04 on that system and moved my data over, and a week later I hit the issue again.

We've since also had an LXD user report similar symptoms, also on varying hardware:
  https://github.com/lxc/lxd/issues/5197

My system doesn't have a lot of memory pressure, with about 50% of its memory available:

root@vorash:~# free -m
              total        used        free      shared  buff/cache   available
Mem:          31819       17574         402         513       13842       13292
Swap:         15909        2687       13222

I will now try to increase console logging as much as possible on the system, in the hope that next time it hangs we can get a better idea of what happened, but I'm not too hopeful given the complete silence on the console when this occurs.
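(As an illustration of the kind of boot-time logging changes meant here, a sketch only and not the exact configuration used on this host: the console= entries match the cmdline recorded further down in ProcKernelCmdLine, while ignore_loglevel and log_buf_len are illustrative additions.)

  # Sketch: push every kernel message to the serial console and enlarge the log buffer
  # (edit /etc/default/grub, then run update-grub and reboot)
  GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8 ignore_loglevel log_buf_len=16M"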

System is currently on:
  Linux vorash 4.15.0-36-generic #39-Ubuntu SMP Mon Sep 24 16:19:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

But I've seen this since the 4.15 GA kernel, so it's not a recent regression.
---
ProblemType: Bug
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Oct 23 16:12 seq
 crw-rw---- 1 root audio 116, 33 Oct 23 16:12 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
ApportVersion: 2.20.9-0ubuntu7.4
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
AudioDevicesInUse:
 Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1: Cannot stat file /proc/22822/fd/10: Permission denied
 Cannot stat file /proc/22831/fd/10: Permission denied
DistroRelease: Ubuntu 18.04
HibernationDevice:
 RESUME=none
 CRYPTSETUP=n
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
Lsusb:
 Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 001 Device 002: ID 046b:ff10 American Megatrends, Inc. Virtual Keyboard and Mouse
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: Intel Corporation S1200SP
NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
Package: linux (not installed)
PciMultimedia:

ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 mgadrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.15.0-38-generic root=UUID=575c878a-0be6-4806-9c83-28f67aedea65 ro biosdevname=0 net.ifnames=0 panic=1 verbose console=tty0 console=ttyS0,115200n8
ProcVersionSignature: Ubuntu 4.15.0-38.41-generic 4.15.18
RelatedPackageVersions:
 linux-restricted-modules-4.15.0-38-generic N/A
 linux-backports-modules-4.15.0-38-generic N/A
 linux-firmware 1.173.1
RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
Tags: bionic
Uname: Linux 4.15.0-38-generic x86_64
UnreportableReason: This report is about a package that is not installed.
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: False
dmi.bios.date: 01/25/2018
dmi.bios.vendor: Intel Corporation
dmi.bios.version: S1200SP.86B.03.01.1029.012520180838
dmi.board.asset.tag: Base Board Asset Tag
dmi.board.name: S1200SP
dmi.board.vendor: Intel Corporation
dmi.board.version: H57532-271
dmi.chassis.asset.tag: ....................
dmi.chassis.type: 23
dmi.chassis.vendor: ...............................
dmi.chassis.version: ..................
dmi.modalias: dmi:bvnIntelCorporation:bvrS1200SP.86B.03.01.1029.012520180838:bd01/25/2018:svnIntelCorporation:pnS1200SP:pvr....................:rvnIntelCorporation:rnS1200SP:rvrH57532-271:cvn...............................:ct23:cvr..................:
dmi.product.family: Family
dmi.product.name: S1200SP
dmi.product.version: ....................
dmi.sys.vendor: Intel Corporation

Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu Bionic):
importance: Undecided → High
tags: added: bionic kernel-key

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1799497

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu Bionic):
status: New → Incomplete
Joseph Salisbury (jsalisbury) wrote :

Would you be able to test some kernels? A bisect can be done if we can identify what kernel version introduced this issue.

Stéphane Graber (stgraber) wrote :

Well, kind of: this is a production server running a lot of publicly visible services, so I can run test kernels on it so long as they don't regress system security.

There's also the unfortunate problem that it usually takes over a week for the issue to show up, and that my last known good kernel was the latest 4.4 kernel from xenial...

Stéphane Graber (stgraber) wrote :

Oh, and whatever kernel I boot needs to have support for ZFS 0.7, or I won't be able to read my drives.
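(As a sketch of how that can be checked up front, assuming the usual Ubuntu module layout: modinfo can report the ZFS module version both for the running kernel and for any other installed kernel, so a candidate kernel can be vetted before rebooting into it.)

  # Sketch: check which ZFS module version an installed kernel ships
  modinfo -F version zfs                          # module for the currently running kernel
  modinfo -k 4.15.0-38-generic -F version zfs     # module for another installed kernel version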

tags: added: apport-collected
description: updated

apport information (attachments collected via apport-collect)

Stéphane Graber (stgraber) wrote :

Note that I've deleted the WifiSyslog and CurrentDmesg attachments as they're not relevant (they cover the current boot) and included information that I'd rather not have exposed publicly.

Luis Rodriguez (laragones) wrote :

Hello, I submitted the report against LXD since that is the only thing actively running on the server, as Stéphane mentioned on https://github.com/lxc/lxd/issues/5197

I also thought it might be a hardware issue, but since upgrading to 18.04 in May I have experienced this on a variety of hardware, and even though I thought it might be an upgrade issue, that does not seem to be the case either.

I also thought it was memory related. It now occurs, as Stéphane mentions, around once a week, but in my case spread across different servers. The last server where it happened hadn't had any issue for maybe the last two months and was not that loaded in terms of memory, but it does seem more frequent on servers that are actively used in terms of both memory and CPU.

It doesn't happen on blade hosts that only have 2-4 LXD containers and 4GB of RAM; it has only happened on HP and Dell servers with 16GB, 24GB, 48GB and 128GB of RAM that carry a little more load (a minimum of 6 containers, up to 20).

At least I am not alone, but I have no clue how to recreate or address this issue (since the logs provide no information either).

I could also try some kernels. As Stéphane mentioned, it didn't happen on 4.4; it only started happening with the 18.04 GA kernel (as he also mentions). I have been constantly upgrading the kernel to no avail, so it seems the problem was introduced before the current version.

Strangely and thankfully, it doesn't happen on my main production servers (except for a crash on one of them yesterday). It happens mostly on development servers that are actively used (the developers are not happy).

Stefan Bader (smb) wrote :

To add a bit more detail (maybe unrelated, but with so little evidence everything helps): when those lockups happen, is the server at least pingable? Another idea would be, as long as those servers are accessible enough, to check whether sysrq combinations are still handled. Though I fear that, at least for Stéphane, the server is hosted somewhere else with probably only ssh (maybe IPMI) access. But if that were possible and working, one could prepare kdump and enable the sysrq crash combo, roughly along the lines sketched below.
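(As a rough sketch of that preparation on Ubuntu, assuming the stock linux-crashdump/kdump-tools packages; the final line deliberately crashes the machine and is only for verifying the dump path on a throwaway host.)

  # Sketch: enable all sysrq functions and set up kdump to capture a vmcore
  apt install linux-crashdump                # pulls in kdump-tools and the crash kernel setup
  echo "kernel.sysrq = 1" > /etc/sysctl.d/99-sysrq.conf
  sysctl --system
  kdump-config show                          # after a reboot, confirm the crash kernel is loaded
  echo c > /proc/sysrq-trigger               # test only: triggers a crash and should produce a dump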

Otherwise, and that again is probably only possible for Luis if his devel servers do not need ZFS, it would help to see how various mainline kernels between 4.4 and 4.15 are doing, and in parallel have some "canary" running the latest update. IIRC the one just released had a large portion of upstream stable pulled in.
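(For completeness, a sketch of how such a mainline test install typically goes; exact file names are omitted on purpose, and these vanilla builds are presumably why this only suits hosts that do not need Ubuntu's ZFS modules.)

  # Sketch: install a mainline build for testing
  # 1. Download the linux-image and linux-headers .deb files for the wanted version from
  #    https://kernel.ubuntu.com/~kernel-ppa/mainline/
  # 2. Install and reboot into the test kernel:
  dpkg -i linux-image-*.deb linux-headers-*.deb
  reboot        # pick the test kernel from GRUB and let it run until the usual lockup window has passed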

Stéphane Graber (stgraber) wrote :

The server doesn't respond to pings when locked up.

I do have IPMI and console redirection going for my server and have enabled all sysrq now, though it's unclear whether I can send those through the BMC yet (just typing them would obviously send them to my laptop...).

I've set up a debug console going both to the screen and to IPMI, raised the kernel log level to 9, set up the NMI watchdog, enabled panic on oops and panic on hard lockup, and disabled reboot on panic, so maybe I'll get lucky with the next hang and get some output on the console, though that'd be a first...
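(Spelled out as a sketch, the runtime knobs described there map roughly onto the following sysctls; the values are illustrative rather than a record of the exact settings used.)

  # Sketch of the described settings (illustrative values)
  sysctl -w kernel.printk="9 4 1 9"      # raise the console log level
  sysctl -w kernel.nmi_watchdog=1        # enable the NMI watchdog
  sysctl -w kernel.panic_on_oops=1       # turn any oops into a panic
  sysctl -w kernel.hardlockup_panic=1    # panic when the hard lockup detector fires
  sysctl -w kernel.panic=0               # do not reboot on panic, keep the console output on screen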

Stéphane Graber (stgraber) wrote :

It just happened again, though the machine wouldn't reboot at all afterwards, leading the hosting provider to go for a motherboard replacement, so I guess better luck next week with debugging this.

Luis Rodriguez (laragones) wrote :

In my case it hasn't happened again, although I removed the zram-config package from the host servers (I think this is the only software difference from 16.04 to 18.04 that I added). I would like to either rule out or confirm that it has an effect on the issue.

Stéphane Graber (stgraber) wrote :

Oh, I am also using zram-config on the affected machine.

Stefan Bader (smb) wrote :

Darn, I wanted to reply earlier. So maybe, at least for Luis, who sounds like he has multiple servers in a test environment, it would be possible to run two otherwise identical servers and only remove zram-config on one. Then one locking up and the other not would be fairly good proof.
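(A minimal sketch of that comparison, assuming stock Ubuntu package and device names.)

  # Sketch: A/B test for zram involvement on two otherwise identical hosts
  # Host A: leave zram-config installed and running.
  # Host B: remove zram and reboot so no zram swap devices remain.
  apt purge zram-config
  reboot
  swapon --show      # after the reboot there should be no /dev/zram* entries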

Luis Rodriguez (laragones) wrote :

Correct. I would like to give it some more time to see whether it stays quiet. So far so good, no lockups; I haven't had to restart any server in a week and a half.

I'll try to prepare the same setup on another server with zram-config to see if it happens again on that particular server.

Luis Rodriguez (laragones) wrote :

Got a host locked up with no zram-config installed. Same behaviour: no log information, can't even type in the console, no ssh, no ping. Also, none of the LXD containers respond to ping either.

Stefan Bader (smb) wrote :

Darn, it would have been too good if it had only happened with zram. :( It sounds a bit like all CPUs quickly getting caught in a spinlock deadlock, and I am not sure right now which path to use for debugging. One option is turning on lock debugging in the kernel, though that often changes timing in ways that prevent the issue from happening again. Another is to hope it's not a hardware driver issue and try to reproduce it in a VM, but even if that is possible, there were issues with the crash tools and kernels having certain address space randomization enabled. And that was even before Meltdown/Spectre hit us.
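(For reference, "lock debugging" here would typically mean a test kernel built with options along these lines; a sketch only, with the caveat from the comment that the added overhead can shift timing enough to hide the bug.)

  # Sketch: lock and lockup debugging options for a custom test kernel
  CONFIG_PROVE_LOCKING=y        # lockdep: reports lock ordering problems before they deadlock
  CONFIG_DEBUG_SPINLOCK=y
  CONFIG_DEBUG_LOCK_ALLOC=y
  CONFIG_DEBUG_ATOMIC_SLEEP=y
  CONFIG_LOCKUP_DETECTOR=y      # soft/hard lockup detectors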

tags: added: kernel-da-key
removed: kernel-key