4.15 kernel hard lockup about once a week

Bug #1799497 reported by Stéphane Graber on 2018-10-23
This bug affects 9 people

Affects                Importance  Assigned to
linux (Ubuntu)         High        Colin Ian King
linux (Ubuntu Bionic)  High        Unassigned

Bug Description

== SRU Justification ==

When using zram (as installed and configured by the zram-config package),
systems can lock up after about a week of use. This occurs because of
a hang on a lock in zram.

== Test Case ==

Run stress-ng --brk 0 --stack 0 in a Bionic amd64 server VM with 1 GB of
memory, 16 CPU threads and zram-config installed. Without the fix the
kernel will hang in a spinlock after 1-2 hours of run time. With the fix,
the hang does not occur. Testing shows that with the fix, 5 x 16 CPU hours
of stress testing with stress-ng works fine without the lockup occurring.
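The test case above can be sketched as a small script. This is a sketch only: it assumes stress-ng and zram-config are available in the Bionic VM, and it deliberately drives the machine toward the lockup, so run it only in a disposable VM.

```shell
#!/bin/sh
# Reproducer sketch for the zram lockup. Run inside a disposable
# Bionic amd64 VM with 1 GB RAM and 16 CPU threads -- NOT a real server.

# zram-config sets up zram swap devices at boot.
apt-get install -y zram-config stress-ng
swapon --show   # confirm the zram swap devices are active

# 0 workers per stressor means one worker per CPU thread; without the
# fix the kernel hangs in a spinlock after 1-2 hours of this.
stress-ng --brk 0 --stack 0 --timeout 2h --metrics-brief
```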

== The fix ==

Upstream commit c4d6c4cc7bfd ("zram: correct flag name of ZRAM_ACCESS") as
a prerequisite followed by a minor context wiggle backport of the fix with
commit 3c9959e02547 ("zram: fix lockdep warning of free block handling").
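For reference, the backport order described above would look roughly like this. A sketch only: the hashes are the upstream ones quoted in this description, and since the second pick needed a minor context wiggle, expect to resolve a small conflict.

```shell
# Apply the prerequisite first, then the fix (upstream hashes from above).
git cherry-pick -x c4d6c4cc7bfd   # zram: correct flag name of ZRAM_ACCESS
git cherry-pick -x 3c9959e02547   # zram: fix lockdep warning of free block handling
```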

== Regression Potential ==

This touches the zram locking, so the core zram driver is affected. However
the fixes are backports from 5.0, so the fixes have had a fair amount of
testing in later kernels.

My main server has been running into hard lockups about once a week ever since I switched to the 4.15 Ubuntu 18.04 kernel.

When this happens, nothing is printed to the console; it's effectively stuck showing a login prompt. The system is running with panic=1 on the cmdline but isn't rebooting, so the kernel isn't even processing this as a kernel panic.

As this felt like a potential hardware issue, I had my hosting provider give me a completely different system: different motherboard, different CPU, different RAM and different storage. I installed that system on 18.04 and moved my data over; a week later, I hit the issue again.

We've since also had a LXD user report similar symptoms, also on varying hardware:
  https://github.com/lxc/lxd/issues/5197

My system doesn't have a lot of memory pressure with about 50% of free memory:

root@vorash:~# free -m
              total        used        free      shared  buff/cache   available
Mem:          31819       17574         402         513       13842       13292
Swap:         15909        2687       13222

I will now try to increase console logging as much as possible on the system, in the hopes that next time it hangs we can get a better idea of what happened. But I'm not too hopeful, given the complete silence on the console when this occurs.

System is currently on:
  Linux vorash 4.15.0-36-generic #39-Ubuntu SMP Mon Sep 24 16:19:09 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

But I've seen this since the GA kernel on 4.15 so it's not a recent regression.
---
ProblemType: Bug
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Oct 23 16:12 seq
 crw-rw---- 1 root audio 116, 33 Oct 23 16:12 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
ApportVersion: 2.20.9-0ubuntu7.4
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
AudioDevicesInUse:
 Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1: Cannot stat file /proc/22822/fd/10: Permission denied
 Cannot stat file /proc/22831/fd/10: Permission denied
DistroRelease: Ubuntu 18.04
HibernationDevice:
 RESUME=none
 CRYPTSETUP=n
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
Lsusb:
 Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 001 Device 002: ID 046b:ff10 American Megatrends, Inc. Virtual Keyboard and Mouse
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
MachineType: Intel Corporation S1200SP
NonfreeKernelModules: zfs zunicode zavl icp zcommon znvpair
Package: linux (not installed)
PciMultimedia:

ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 mgadrmfb
ProcKernelCmdLine: BOOT_IMAGE=/boot/vmlinuz-4.15.0-38-generic root=UUID=575c878a-0be6-4806-9c83-28f67aedea65 ro biosdevname=0 net.ifnames=0 panic=1 verbose console=tty0 console=ttyS0,115200n8
ProcVersionSignature: Ubuntu 4.15.0-38.41-generic 4.15.18
RelatedPackageVersions:
 linux-restricted-modules-4.15.0-38-generic N/A
 linux-backports-modules-4.15.0-38-generic N/A
 linux-firmware 1.173.1
RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
Tags: bionic
Uname: Linux 4.15.0-38-generic x86_64
UnreportableReason: This report is about a package that is not installed.
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: False
dmi.bios.date: 01/25/2018
dmi.bios.vendor: Intel Corporation
dmi.bios.version: S1200SP.86B.03.01.1029.012520180838
dmi.board.asset.tag: Base Board Asset Tag
dmi.board.name: S1200SP
dmi.board.vendor: Intel Corporation
dmi.board.version: H57532-271
dmi.chassis.asset.tag: ....................
dmi.chassis.type: 23
dmi.chassis.vendor: ...............................
dmi.chassis.version: ..................
dmi.modalias: dmi:bvnIntelCorporation:bvrS1200SP.86B.03.01.1029.012520180838:bd01/25/2018:svnIntelCorporation:pnS1200SP:pvr....................:rvnIntelCorporation:rnS1200SP:rvrH57532-271:cvn...............................:ct23:cvr..................:
dmi.product.family: Family
dmi.product.name: S1200SP
dmi.product.version: ....................
dmi.sys.vendor: Intel Corporation


Changed in linux (Ubuntu):
importance: Undecided → High
Changed in linux (Ubuntu Bionic):
importance: Undecided → High
tags: added: bionic kernel-key

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1799497

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu Bionic):
status: New → Incomplete
Joseph Salisbury (jsalisbury) wrote :

Would you be able to test some kernels? A bisect can be done if we can identify what kernel version introduced this issue.

Stéphane Graber (stgraber) wrote :

Well, kinda, this is a production server running a lot of publicly visible services, so I can run test kernels on it so long as they don't regress system security.

There's also the unfortunate problem that it takes over a week for me to see the problem in most cases and that my last known good kernel was the latest 4.4 kernel from xenial...

Stéphane Graber (stgraber) wrote :

Oh and whatever kernel I boot needs to have support for ZFS 0.7 or I won't be able to read my drives.

tags: added: apport-collected
description: updated

apport information


Stéphane Graber (stgraber) wrote :

Note that I've deleted the wifisyslog and currentdmesg as they're not relevant (current boot) and included information that I'd rather not have exposed publicly.

Luis Rodriguez (laragones) wrote :

Hello, I submitted the report on LXD since that is the only thing I have installed on the server that is actively running, as Stéphane mentioned on https://github.com/lxc/lxd/issues/5197

I also thought it may be a hardware issue, but since upgrading to 18.04 in May I have experienced this on a variety of hardware, and even though I thought it may be an upgrade issue, that is also not the case.

I also thought it was memory related, since it now occurs, as Stéphane mentions, around once a week, but in my case on different servers. The last server where it happened didn't have any issue for maybe the last two months and was not that loaded in terms of memory, but it seems more frequent on servers that are actively used in terms of both memory and CPU.

It doesn't happen on blade hosts that only have 2-4 LXD containers and 4GB of RAM; it has only happened on HP and Dell servers with 16GB, 24GB, 48GB and 128GB of RAM that have a little more load (minimum 6 containers, up to 20).

At least I am not alone, but I have no clue how to recreate or address this issue (since the logs also provide no information).

I could also try some kernels. As Stéphane mentioned, it didn't happen on 4.4; it only started happening on the GA kernel of 18.04. I have been constantly upgrading the kernel to no avail. So it seems it could have been introduced before.

Strangely and thankfully it doesn't happen on my main production servers (except yesterday's crash on one of them). Mostly on development servers that are actively used (developers are not happy).

Stefan Bader (smb) wrote :

To add a bit more detail (maybe unrelated, but with so little evidence everything helps): when those lockups happen, is the server at least pingable? Another idea would be, as long as those servers are accessible enough, to see whether sysrq combinations are still handled. Though I fear at least for Stéphane that server is somewhere else, with probably only ssh (maybe ipmi) access. But if that was possible and working, maybe one could prepare kdump and enable the sysrq crashing combo.

Otherwise, and that again is probably only possible for Luis if his devel servers do not need zfs, it would help to see how various mainline kernels between 4.4 and 4.15 are doing. And in parallel have some "canary" using the latest update. IIRC the one just released had a large portion of upstream stable pulled in.

Stéphane Graber (stgraber) wrote :

The server doesn't respond to pings when locked up.

I do have IPMI and console redirection going for my server and have enabled all sysrq now though it's unclear whether I can send those through the BMC yet (as just typing them would obviously send them to my laptop...).

I've set up a debug console both to screen and to IPMI, raised the kernel log level to 9, set up the NMI watchdog, enabled panic on oops and panic on hard lockup, and disabled reboot on panic, so maybe I'll get lucky with the next hang and get some output on the console, though that'd be a first...
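A sketch of one way to apply the settings Stéphane describes, using the standard kernel sysctls (some of these also exist as boot parameters; exact values here are assumptions, not his actual config):

```shell
# Log everything to the console (first field is console_loglevel).
sysctl -w kernel.printk="9 4 1 9"
# Enable all sysrq functions so a crash dump can be triggered over serial/IPMI.
sysctl -w kernel.sysrq=1
# Turn on the NMI watchdog and panic on oops / hard lockup...
sysctl -w kernel.nmi_watchdog=1
sysctl -w kernel.panic_on_oops=1
sysctl -w kernel.hardlockup_panic=1
# ...but do not auto-reboot on panic, so the output stays on the console.
sysctl -w kernel.panic=0
```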

Stéphane Graber (stgraber) wrote :

Just happened again, though the machine wouldn't reboot at all afterwards, leading to the hosting provider going for a motherboard replacement, so I guess better luck next week with debugging this.

Luis Rodriguez (laragones) wrote :

In my case it hasn't happened again, although I removed the zram-config package from the host servers (I think this is the only difference in software from 16.04 to 18.04 that I added). I would like to either discard or confirm that it has an effect on the issue.

Stéphane Graber (stgraber) wrote :

Oh, I am also using zram-config on the affected machine.

Stefan Bader (smb) wrote :

Darn, wanted to reply earlier. So maybe at least for Luis, who sounds like he has multiple servers in a test environment, it would be possible to run two otherwise identical servers and only remove zram-config on one. Then one locking up and the other not would be quite good proof.

Luis Rodriguez (laragones) wrote :

Correct. I would like to give it some more time to see if it doesn't happen. So far so good, no lockups; I haven't had to restart any server in a week and a half.

I'll try to prepare the same setup on another server with zram-config to see if it happens again on that particular server.

Luis Rodriguez (laragones) wrote :

Got a hard lockup with no zram-config installed. Same behaviour: no log information, can't even type on the console, no ssh, no ping. Also, none of the LXD containers ping either.

Stefan Bader (smb) wrote :

Darn, would have been too good if it had only happened with zram. :( Sounds a bit like quickly catching all CPUs in a spin-dead-lock. And I am not sure right now which path to use for debugging. Try turning on lock debugging in the kernel, though that often changes timing in ways that prevent the issue from happening again. Or hope it's not a hardware driver and try to reproduce in a VM, but even if that is possible, there were issues with the crash tools and kernels having certain address space randomization enabled. And that was even before Meltdown/Spectre hit us.

tags: added: kernel-da-key
removed: kernel-key
Luis Rodriguez (laragones) wrote :

OK, it has been quite a while with no lockups. I had one after the zram-config package was removed, but no other lockups since then.

The kernel version is 4.15.0-33 to -38 on different servers. I am going to update the servers to the latest version, reboot, and wait a little longer.

Then I am going to install zram-config back on certain servers to see if it shows up again.

Brad Figg (brad-figg) on 2019-07-24
tags: added: cscc
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in zram-config (Ubuntu Bionic):
status: New → Confirmed
Changed in zram-config (Ubuntu):
status: New → Confirmed
David Roberts (david.roberts) wrote :

Other reports of similar issues with zram on kernel 4.15:

- https://bbs.archlinux.org/viewtopic.php?id=234951
- https://<email address hidden>/T/

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu Bionic):
status: Incomplete → Confirmed
Changed in linux (Ubuntu):
assignee: nobody → Colin Ian King (colin-king)
Colin Ian King (colin-king) wrote :

It would be useful to know whether you have made any specific zram config changes, and if so, what your current config is, just to help with debugging this issue.

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Changed in zram-config (Ubuntu):
status: Confirmed → Incomplete
Colin Ian King (colin-king) wrote :

I'm assuming the defaults are being used for the moment; this means 50% of total memory is used in total, distributed across the number of CPUs, as defined in /usr/bin/init-zram-swapping.
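The default sizing can be sketched as follows. This is an approximation of what /usr/bin/init-zram-swapping is described to do (half of RAM split evenly across one zram device per CPU); the actual script may differ in details.

```shell
# Approximate the zram-config defaults: 50% of RAM as zram swap,
# split evenly across one zram device per CPU.
totalmem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
ncpu=$(nproc)
per_device_kb=$(( totalmem_kb / 2 / ncpu ))
echo "CPUs: ${ncpu}  per-device zram size: ${per_device_kb} kB"
```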

Colin Ian King (colin-king) wrote :

Can reproduce this with stress-ng exercising a high memory pressure scenario using:
stress-ng --brk 0 -v --aiol 0

Luis Rodriguez (laragones) wrote :

Hi. I had to remove zram-config from my production servers long ago;
since then I don't have the issue. I was using LXD containers a lot on the
hosts, with different kinds of usage, but I don't have any other setup at
the moment.


Colin Ian King (colin-king) wrote :

After quite a bit of experimentation I found that I can reproduce the bug if I have zram *and* swap on the filesystem enabled while exercising the brk and aiol stressors (to cause lots of I/O). Eventually the system grinds to a halt, we lose interactivity, and we eventually get lockups as follows:
[ 2012.040006] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [stress-ng-brk:1632]
[ 2012.040922] Modules linked in: zram(E) kvm_intel(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) pcbc(E) aesni_intel(E) aes_x86_64(E) crypto_simd(E) glue_helper(E) cryptd(E) psmouse(E) input_leds(E) floppy(E) virtio_scsi(E) serio_raw(E) i2c_piix4(E) mac_hid(E) pata_acpi(E) qemu_fw_cfg(E) 9pnet_virtio(E) 9p(E) 9pnet(E) fscache(E)
[ 2012.044655] CPU: 2 PID: 1632 Comm: stress-ng-brk Tainted: G EL 4.15.18 #1
[ 2012.045581] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
[ 2012.046555] RIP: 0010:__raw_callee_save___pv_queued_spin_unlock+0x10/0x17
[ 2012.047340] RSP: 0018:ffffb73382083718 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff11
[ 2012.048238] RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000002
[ 2012.049078] RDX: 0000000000000000 RSI: ffff9d327c2f6918 RDI: ffffffffa3269978
[ 2012.049909] RBP: ffffb73382083720 R08: ffff9d327c2f6918 R09: ffff9d327c0a5328
[ 2012.050746] R10: ffff9d327c1e2310 R11: ffff9d327c1e2328 R12: ffff9d327c2f6800
[ 2012.051574] R13: ffff9d327c1e2328 R14: ffff9d327c1e2310 R15: ffff9d327c1e2200
[ 2012.052436] FS: 00007f89f2ccd740(0000) GS:ffff9d327f280000(0000) knlGS:0000000000000000
[ 2012.053382] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2012.054058] CR2: 00007f1350a8dd90 CR3: 00000000311a4004 CR4: 0000000000160ee0
[ 2012.054889] Call Trace:
[ 2012.055192] get_swap_pages+0x193/0x360
[ 2012.055652] get_swap_page+0x13f/0x1e0
[ 2012.056123] add_to_swap+0x14/0x70
[ 2012.056530] shrink_page_list+0x81d/0xbc0
[ 2012.057013] shrink_inactive_list+0x242/0x590
[ 2012.057523] shrink_node_memcg+0x364/0x770
[ 2012.058012] shrink_node+0xf7/0x300
[ 2012.058432] ? shrink_node+0xf7/0x300
[ 2012.058863] do_try_to_free_pages+0xc9/0x330
[ 2012.059368] try_to_free_pages+0xee/0x1b0
[ 2012.059842] __alloc_pages_slowpath+0x3fc/0xe00
[ 2012.060424] __alloc_pages_nodemask+0x29a/0x2c0
[ 2012.060963] alloc_pages_vma+0x88/0x1f0
[ 2012.061414] __handle_mm_fault+0x8b7/0x12e0
[ 2012.061909] handle_mm_fault+0xb1/0x210
[ 2012.062375] __do_page_fault+0x281/0x4b0
[ 2012.062848] do_page_fault+0x2e/0xe0
[ 2012.063274] ? async_page_fault+0x2f/0x50
[ 2012.063751] do_async_page_fault+0x51/0x80
[ 2012.064262] async_page_fault+0x45/0x50
[ 2012.064719] RIP: 0033:0x55ec1997bd0a
[ 2012.065147] RSP: 002b:00007ffeacd21600 EFLAGS: 00010246
[ 2012.065754] RAX: 000055ec28601000 RBX: 0000000000000005 RCX: 00007f89f2de956b
[ 2012.066580] RDX: 000055ec28601000 RSI: 00007ffeacd216d0 RDI: 000055ec28602000
[ 2012.067410] RBP: 00007ffeacd216c0 R08: 0000000000000000 R09: 00007f89f3d0c2f0
[ 2012.068290] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 2012.069129] R13: 0000000000000002 R14: 0000000000000001 R15: 00007ffeacd216d0
[ 2012.069965] Code: 50 41 51 41...


Luis Rodriguez (laragones) wrote :

It sounds like what I was getting.


Colin Ian King (colin-king) wrote :

Captured the hard lockup in the following:

(gdb) stepi
0xffffffff8c4e29e5 in ?? ()
=> 0xffffffff8c4e29e5: eb ec jmp 0xffffffff8c4e29d3
(gdb) stepi
0xffffffff8c4e29d3 in ?? ()
=> 0xffffffff8c4e29d3: 8b 07 mov (%rdi),%eax
(gdb) stepi
0xffffffff8c4e29d5 in ?? ()
=> 0xffffffff8c4e29d5: 85 c0 test %eax,%eax
(gdb) stepi
0xffffffff8c4e29d7 in ?? ()
=> 0xffffffff8c4e29d7: 75 0a jne 0xffffffff8c4e29e3
(gdb) stepi
0xffffffff8c4e29e3 in ?? ()
=> 0xffffffff8c4e29e3: f3 90 pause
(gdb) stepi
0xffffffff8c4e29e5 in ?? ()
=> 0xffffffff8c4e29e5: eb ec jmp 0xffffffff8c4e29d3

This maps to:

ffffffff810e29c0 <native_queued_spin_lock_slowpath>:
....
ffffffff810e29d3: 8b 07 mov (%rdi),%eax
ffffffff810e29d5: 85 c0 test %eax,%eax
ffffffff810e29d7: 75 0a jne ffffffff810e29e3 <native_queued_spin_lock_slowpath+0x23>
ffffffff810e29d9: f0 0f b1 17 lock cmpxchg %edx,(%rdi)
ffffffff810e29dd: 85 c0 test %eax,%eax
ffffffff810e29df: 75 f2 jne ffffffff810e29d3 <native_queued_spin_lock_slowpath+0x13>
ffffffff810e29e1: 5d pop %rbp
ffffffff810e29e2: c3 retq
ffffffff810e29e3: f3 90 pause
ffffffff810e29e5: eb ec jmp ffffffff810e29d3 <native_queued_spin_lock_slowpath+0x13>

Colin Ian King (colin-king) wrote :

Couple more notes:

1. Disabled file-based swap on /swapfile - can still reproduce the issue
2. Used partition-based swap on a 2nd disk - can still reproduce the issue

Colin Ian King (colin-king) wrote :

Running without a swapfile or zswap, with just the stress-ng brk and stack stressors and NO file I/O, can also lock up the system.

Colin Ian King (colin-king) wrote :

Cornered this to zram, and not an issue with mm or I/O. Figured out that 3 hours of soak testing on each bisect step is the only reliable way to do a bisect. Bisecting between 4.20 and 5.0 finally cornered the issue, and hence the commits required to fix this.
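The bisect described here is a reverse bisect (hunting the commit that *fixes* the hang rather than the one that introduced it), which can be sketched like so. The soak-test step is hypothetical, and each step needs the ~3-hour soak mentioned above.

```shell
# Reverse bisect: we want the commit that FIXES the hang, so define
# custom terms instead of good/bad.
git bisect start --term-old=broken --term-new=fixed
git bisect fixed v5.0     # the hang no longer reproduces here
git bisect broken v4.20   # the hang reproduces here
# At each step: build and boot the kernel, soak it with the stress-ng
# reproducer for ~3 hours, then mark the result:
#   git bisect fixed    # no hang after the soak
#   git bisect broken   # the machine hung
```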

description: updated
Changed in linux (Ubuntu Bionic):
status: Confirmed → Fix Committed

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-bionic' to 'verification-done-bionic'. If the problem still exists, change the tag 'verification-needed-bionic' to 'verification-failed-bionic'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-bionic
Colin Ian King (colin-king) wrote :

Soak tested the -proposed kernel for 2 hours with no hang occurring. Verified OK.

tags: added: verification-done-bionic
removed: verification-needed-bionic
no longer affects: zram-config (Ubuntu)
no longer affects: zram-config (Ubuntu Bionic)
Changed in linux (Ubuntu):
status: Incomplete → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.15.0-91.92

---------------
linux (4.15.0-91.92) bionic; urgency=medium

  * bionic/linux: 4.15.0-91.92 -proposed tracker (LP: #1865109)

  * CVE-2020-2732
    - KVM: x86: emulate RDPID
    - KVM: nVMX: Don't emulate instructions in guest mode
    - KVM: nVMX: Refactor IO bitmap checks into helper function
    - KVM: nVMX: Check IO instruction VM-exit conditions

linux (4.15.0-90.91) bionic; urgency=medium

  * bionic/linux: 4.15.0-90.91 -proposed tracker (LP: #1864753)

  * dkms artifacts may expire from the pool (LP: #1850958)
    - [Packaging] autoreconstruct -- manage executable debian files
    - [packaging] handle downloads from the librarian better

linux (4.15.0-90.90) bionic; urgency=medium

  * bionic/linux: 4.15.0-90.90 -proposed tracker (LP: #1864753)

  * vm-segv from ubuntu_stress_smoke_test failed on B (LP: #1864063)
    - Revert "apparmor: don't try to replace stale label in ptrace access check"

linux (4.15.0-89.89) bionic; urgency=medium

  * bionic/linux: 4.15.0-89.89 -proposed tracker (LP: #1863350)

  * [SRU][B/OEM-B] Fix multitouch support on some devices (LP: #1862567)
    - HID: core: move the dynamic quirks handling in core
    - HID: quirks: move the list of special devices into a quirk
    - HID: core: move the list of ignored devices in hid-quirks.c
    - HID: core: remove the absolute need of hid_have_special_driver[]

  * [linux] Patch to prevent possible data corruption (LP: #1848739)
    - blk-mq: silence false positive warnings in hctx_unlock()

  * Add bpftool to linux-tools-common (LP: #1774815)
    - tools/bpftool: fix bpftool build with bintutils >= 2.9
    - bpftool: make libbfd optional
    - [Debian] Remove binutils-dev build dependency
    - [Debian] package bpftool in linux-tools-common

  * Root can lift kernel lockdown via USB/IP (LP: #1861238)
    - Revert "UBUNTU: SAUCE: (efi-lockdown) Add a SysRq option to lift kernel
      lockdown"

  * [Bionic] i915 incomplete fix for CVE-2019-14615 (LP: #1862840) //
    CVE-2020-8832
    - drm/i915: Use same test for eviction and submitting kernel context
    - drm/i915: Define an engine class enum for the uABI
    - drm/i915: Force the switch to the i915->kernel_context
    - drm/i915: Move GT powersaving init to i915_gem_init()
    - drm/i915: Move intel_init_clock_gating() to i915_gem_init()
    - drm/i915: Inline intel_modeset_gem_init()
    - drm/i915: Mark the context state as dirty/written
    - drm/i915: Record the default hw state after reset upon load

  * Bionic update: upstream stable patchset 2020-02-12 (LP: #1863019)
    - xfs: Sanity check flags of Q_XQUOTARM call
    - mfd: intel-lpss: Add default I2C device properties for Gemini Lake
    - powerpc/archrandom: fix arch_get_random_seed_int()
    - tipc: fix wrong timeout input for tipc_wait_for_cond()
    - mt7601u: fix bbp version check in mt7601u_wait_bbp_ready
    - crypto: sun4i-ss - fix big endian issues
    - drm/sti: do not remove the drm_bridge that was never added
    - drm/virtio: fix bounds check in virtio_gpu_cmd_get_capset()
    - ALSA: hda: fix unused variable warning
    - apparmor: don't try to replace stale label in ptrace access chec...

Changed in linux (Ubuntu Bionic):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu):
status: Fix Committed → Fix Released