ThunderX: soft lockup on 4.8+ kernels

Bug #1672521 reported by Alexandru Avadanii on 2017-03-13
This bug affects 2 people
Affects                  Importance  Assigned to
linux (Ubuntu)           High        Unassigned
linux (Ubuntu) Yakkety   High        Unassigned
linux (Ubuntu) Zesty     High        Unassigned

Bug Description

I have been trying to find an easy way to reproduce this for days.
We initially observed it in OPNFV Armband, when we tried to upgrade the kernel on our Ubuntu Xenial installation to linux-image-generic-hwe-16.04 (4.8).

In our environment, this was easily triggered on compute nodes when launching multiple VMs (we initially suspected OVS, QEMU, etc.).
However, to rule out anything specific to our setup, we looked for a simple way to reproduce it on all the ThunderX nodes we have access to, and we finally found one:

$ apt-get install stress-ng
$ stress-ng --hdd 1024
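
For a time-bounded run, the same stressor can be limited and the kernel log checked afterwards. A minimal sketch (the --timeout flag and the dmesg filter are additions for illustration, not part of the original reproducer):

$ sudo stress-ng --hdd 1024 --timeout 600s
$ dmesg | grep -iE 'soft lockup|blocked for more than'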

We tested different FW versions, provided by both the chip and board manufacturers, and with all of them the result is 100% reproducible, leading to a kernel Oops [1]:
[ 726.070531] INFO: task kworker/0:1:312 blocked for more than 120 seconds.
[ 726.077908] Tainted: G W I 4.8.0-41-generic #44~16.04.1-Ubuntu
[ 726.085850] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 726.094383] kworker/0:1 D ffff0000080861bc 0 312 2 0x00000000
[ 726.094401] Workqueue: events vmstat_shepherd
[ 726.094404] Call trace:
[ 726.094411] [<ffff0000080861bc>] __switch_to+0x94/0xa8
[ 726.094418] [<ffff0000089854f4>] __schedule+0x224/0x718
[ 726.094421] [<ffff000008985a20>] schedule+0x38/0x98
[ 726.094425] [<ffff000008985d84>] schedule_preempt_disabled+0x14/0x20
[ 726.094428] [<ffff000008987644>] __mutex_lock_slowpath+0xd4/0x168
[ 726.094431] [<ffff000008987730>] mutex_lock+0x58/0x70
[ 726.094437] [<ffff0000080c552c>] get_online_cpus+0x44/0x70
[ 726.094440] [<ffff00000820ca24>] vmstat_shepherd+0x3c/0xe8
[ 726.094446] [<ffff0000080e1c60>] process_one_work+0x150/0x478
[ 726.094449] [<ffff0000080e1fd8>] worker_thread+0x50/0x4b8
[ 726.094453] [<ffff0000080e8eac>] kthread+0xec/0x100
[ 726.094456] [<ffff000008083690>] ret_from_fork+0x10/0x40
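
The trace shows the vmstat_shepherd worker blocked in mutex_lock() under get_online_cpus(), i.e. waiting on the CPU hotplug lock, which is what trips the 120-second hung-task detector. When it happens, extra state can be captured via the standard sysrq and sysctl knobs (a debugging sketch, not taken from the report):

$ sudo sysctl kernel.sysrq=1                # enable all sysrq functions
$ echo l | sudo tee /proc/sysrq-trigger     # backtraces of all active CPUs
$ echo w | sudo tee /proc/sysrq-trigger     # dump blocked (D-state) tasks
$ sudo sysctl kernel.hung_task_panic=1      # optional: panic on the next hung task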

Over the last few days, I tested all 4.8-* kernels and 4.10 (zesty backport); the soft lockup happens with every one of them.
On the other hand, 4.4.0-45-generic seems to work perfectly fine under normal conditions (probably newer 4.4.0-* kernels too, but due to a regression in the ethernet drivers after 4.4.0-45 we can't easily test those), yet running stress-ng leads to the same oops.

[1] http://paste.ubuntu.com/24172516/
---
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Mar 13 19:27 seq
 crw-rw---- 1 root audio 116, 33 Mar 13 19:27 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.20.1-0ubuntu2.5
Architecture: arm64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
DistroRelease: Ubuntu 16.04
IwConfig: Error: [Errno 2] No such file or directory
MachineType: GIGABYTE R120-T30
Package: linux (not installed)
PciMultimedia:

ProcEnviron:
 TERM=vt220
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 astdrmfb
ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-4.8.0-41-generic root=/dev/mapper/os-root ro console=tty0 console=ttyS0,115200 console=ttyAMA0,115200 net.ifnames=1 biosdevname=0 rootdelay=90 nomodeset quiet splash vt.handoff=7
ProcVersionSignature: Ubuntu 4.8.0-41.44~16.04.1-generic 4.8.17
RelatedPackageVersions:
 linux-restricted-modules-4.8.0-41-generic N/A
 linux-backports-modules-4.8.0-41-generic N/A
 linux-firmware 1.157.8
RfKill: Error: [Errno 2] No such file or directory
Tags: xenial
Uname: Linux 4.8.0-41-generic aarch64
UpgradeStatus: No upgrade log present (probably fresh install)
UserGroups:

_MarkForUpload: True
dmi.bios.date: 11/22/2016
dmi.bios.vendor: GIGABYTE
dmi.bios.version: T22
dmi.board.asset.tag: 01234567890123456789AB
dmi.board.name: MT30-GS0
dmi.board.vendor: GIGABYTE
dmi.board.version: 01234567
dmi.chassis.asset.tag: 01234567890123456789AB
dmi.chassis.type: 17
dmi.chassis.vendor: GIGABYTE
dmi.chassis.version: 01234567
dmi.modalias: dmi:bvnGIGABYTE:bvrT22:bd11/22/2016:svnGIGABYTE:pnR120-T30:pvr0100:rvnGIGABYTE:rnMT30-GS0:rvr01234567:cvnGIGABYTE:ct17:cvr01234567:
dmi.product.name: R120-T30
dmi.product.version: 0100
dmi.sys.vendor: GIGABYTE

apport information

tags: added: apport-collected xenial
description: updated

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds. Please test the latest v4.11 kernel [0].

If this bug is fixed in the mainline kernel, please add the following tag: 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.11-rc2
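
Installing a mainline build usually amounts to grabbing the arm64 .debs from that directory and installing them. A rough sketch (exact package file names vary per build, so check the directory listing first):

$ wget -r -l1 -nd -A '*arm64.deb' http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.11-rc2/
$ sudo dpkg -i linux-image-*arm64.deb
$ sudo reboot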

Changed in linux (Ubuntu):
importance: Undecided → High
Joseph Salisbury (jsalisbury) wrote :

If the mainline kernel still exhibits the bug, we can perform a kernel bisect to identify what commit introduced the regression.

If the mainline kernel fixes the bug, we can perform a "Reverse" bisect to identify the fix.
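
In rough strokes, a bisect between the last known-good and first known-bad releases would look like this (a sketch; the actual good/bad points come from testing):

$ git clone git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
$ cd linux
$ git bisect start
$ git bisect bad v4.8       # first series observed to lock up
$ git bisect good v4.4      # last series that behaved under normal load
  (build and boot each suggested commit, run the stress-ng reproducer, then:)
$ git bisect good           # or: git bisect bad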

Changed in linux (Ubuntu Yakkety):
status: New → Triaged
Changed in linux (Ubuntu Zesty):
status: Confirmed → Triaged
Changed in linux (Ubuntu Yakkety):
importance: Undecided → High
tags: added: kernel-da-key needs-bisect yakkety zesty

Hi,
I tried out 4.11-rc1 a few days ago. Unfortunately, the board did not boot properly from the start: the ThunderX networking drivers failed to allocate MSI-X/MSI interrupts, and polling on some registers also failed.

So, with 4.11-rc1, at least one networking interface never came online due to unmapped interrupts/failed polling, but unloading `nicpf` and reloading it seemed to work (networking worked after this). After that, the soft lockup happened, but I can't be sure I didn't mess something else up.

Let me try this again and get back to you with proper logs, but off the top of my head, things got worse with 4.11-rc1.

Thanks,
Alex

Ciprian Barbu (ciprian-barbu) wrote :

Hi,

The same bug happened again on a similar board with T27 firmware, but this time running kernel 4.4.0-45-generic. I'm attaching a log from the serial console (with debug info from the FW). I can't attach more because the kernel hung.

So far 4.4.0-45-generic had been stable in our lab; this happened for no obvious reason.

/ciprian

Ciprian Barbu (ciprian-barbu) wrote :

Just one addition: the log above contains dmesg output too. The task that hung was systemd; it might be related to some VMs from the previous boot being restarted automatically, but that still doesn't explain the crash.

Rebooting the node again with 4.4 did not result in a kernel crash.

4.11-rc1 console log attached.
Board firmware is the latest available on Gigabyte's site (T31).

1. Install 4.11-rc1 (`make modules_install install`) and reboot
2. Observe networking driver issues in boot log
   Dmesg: 4.11-rc1_dmesg_on_clean_boot.log [3]
3. Try `ping google.com`; as expected, it does not work
4. `modprobe -r nicpf` (leads to multiple oopses in dmesg)
    Console log: 4.11-rc1_modprobe_r_nicpf_output.log [1]
    Dmesg: 4.11-rc1_dmesg_after_modprobe_r_nicpf.log [2]
5. `modprobe nicpf` (this usually works, and afterwards the network is up and running; not sure whether ALL interfaces are OK, as not all of them are connected). However, this time it led to a soft lockup (see the full logs attached here, and the condensed sketch after the links below).

[1] http://paste.ubuntu.com/24178311/
[2] http://paste.ubuntu.com/24178312/
[3] http://paste.ubuntu.com/24178313/
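
The reload workaround from steps 4-5, condensed (a sketch; nicpf is the ThunderX NIC physical-function driver, and the checks afterwards are illustrative):

$ sudo modprobe -r nicpf && sudo modprobe nicpf
$ ip -br link               # check which interfaces came back up
$ dmesg | tail -n 50        # look for fresh oopses or lockups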

dann frazier (dannf) wrote :

@Alexandru: do you have a console log of a system hitting the issue w/ the VM use case? Soft lockups are a fairly generic failure mode, and it would not surprise me if stress-ng was triggering a different issue than the VM case, but both emitting soft lockups.

Hi, Dann,
First of all, I think the bug title is misleading, as this issue happens on all the kernels we tested (4.4.0-45..66, 4.8.0-x, 4.10.0-x, etc.).

To be fair, we haven't hit this exact bug in practice (or at least I don't think we have); i.e., without running stress-ng, 4.4.0-x never crashed.

The VM use case turned out to be a different bug [1], triggered 100% by AAVMF + vhost.

Let me know if I can provide anything else.
I consider this particular bug minor compared to AAVMF + vhost [1]; if we don't poke it with stress-ng, everything works well.

Thanks,
Alex

[1] https://bugs.launchpad.net/ubuntu/+source/edk2/+bug/1673564

This bug was nominated against a series that is no longer supported, i.e. Yakkety. The bug task representing the Yakkety nomination is being closed as Won't Fix.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu Yakkety):
status: Triaged → Won't Fix