Ubuntu 16.04.03: "NMI watchdog: BUG: soft lockup" occurs while running stress-ng on PowerNV machine.

Bug #1693566 reported by bugproxy
Affects          Status      Importance  Assigned to       Milestone
linux (Ubuntu)   Incomplete  Medium      Joseph Salisbury
Zesty            Incomplete  Medium      Joseph Salisbury

Bug Description

== Comment: #0 - PAVITHRA R. PRAKASH <email address hidden> - 2017-05-17 05:55:38 ==
--- Problem description ----

Ubuntu 16.04.03: "NMI watchdog: BUG: soft lockup" occurs while running stress-ng on NV machine.

--- Steps to recreate------

1. Install ubuntu16.04.03.
2. Run "stress-ng -a 0".

Logs:
====

[ 2660.437087] INFO: rcu_sched self-detected stall on CPU
[ 2660.437111] 22-...: (5247 ticks this GP) idle=e19/140000000000001/0 softirq=905/905 fqs=2380
[ 2660.437114] (t=5251 jiffies g=95606 c=95605 q=2545946)
[ 2660.437750] 24-...: (5250 ticks this GP) idle=0b7/140000000000001/0 softirq=5805/5805 fqs=2380
[ 2660.437859]
[ 2664.172796] NMI watchdog: BUG: soft lockup - CPU#22 stuck for 23s! [stress-ng-mmap:3509]
[ 2664.172808] NMI watchdog: BUG: soft lockup - CPU#24 stuck for 23s! [stress-ng-mrema:3536]
[ 2674.848037] NMI watchdog: BUG: soft lockup - CPU#5 stuck for 33s! [stress-ng-fork:3381]
[ 2676.172894] NMI watchdog: BUG: soft lockup - CPU#30 stuck for 22s! [kswapd0:992]
[ 2680.336844] NMI watchdog: BUG: soft lockup - CPU#98 stuck for 23s! [stress-ng-clock:5099]
[ 2686.140931] NMI watchdog: BUG: soft lockup - CPU#16 stuck for 39s! [stress-ng-clone:3366]
[ 2686.987192] xhci_hcd 0003:09:00.0: HC died; cleaning up
[ 2686.987212] usb 1-3-port3: cannot reset (err = -108)
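
For anyone triaging a longer console log, the soft lockup lines above can be summarized with a small script. This is only a sketch: the regular expression is an assumption based on the message format shown in this report, the function name `parse_soft_lockups` is made up for illustration, and the sample text is copied from the log above.

```python
import re

# Assumed message format, taken from the lines in this report:
# [ 2664.172796] NMI watchdog: BUG: soft lockup - CPU#22 stuck for 23s! [stress-ng-mmap:3509]
LOCKUP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NMI watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! \[(?P<comm>.+):(?P<pid>\d+)\]"
)


def parse_soft_lockups(text):
    """Return one dict per soft lockup message found in the given log text."""
    events = []
    for line in text.splitlines():
        m = LOCKUP_RE.search(line)
        if m:
            events.append({
                "ts": float(m.group("ts")),      # seconds since boot
                "cpu": int(m.group("cpu")),
                "secs": int(m.group("secs")),    # reported stuck duration
                "comm": m.group("comm"),         # task name
                "pid": int(m.group("pid")),
            })
    return events


SAMPLE = """\
[ 2660.437087] INFO: rcu_sched self-detected stall on CPU
[ 2664.172796] NMI watchdog: BUG: soft lockup - CPU#22 stuck for 23s! [stress-ng-mmap:3509]
[ 2664.172808] NMI watchdog: BUG: soft lockup - CPU#24 stuck for 23s! [stress-ng-mrema:3536]
[ 2674.848037] NMI watchdog: BUG: soft lockup - CPU#5 stuck for 33s! [stress-ng-fork:3381]
[ 2676.172894] NMI watchdog: BUG: soft lockup - CPU#30 stuck for 22s! [kswapd0:992]
[ 2680.336844] NMI watchdog: BUG: soft lockup - CPU#98 stuck for 23s! [stress-ng-clock:5099]
[ 2686.140931] NMI watchdog: BUG: soft lockup - CPU#16 stuck for 39s! [stress-ng-clone:3366]
[ 2686.987192] xhci_hcd 0003:09:00.0: HC died; cleaning up
"""

events = parse_soft_lockups(SAMPLE)
cpus = sorted({e["cpu"] for e in events})
longest = max(events, key=lambda e: e["secs"])

print(len(events), "soft lockups on CPUs", cpus)
print("longest:", longest["comm"], "on CPU", longest["cpu"], "for", longest["secs"], "s")
```

On a real system the input would more likely come from the saved dmesg attachment than from an inline string.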

After a few hours the machine becomes completely unresponsive:

[pavithra@localhost ~]$ ping 9.47.69.255
PING 9.47.69.255 (9.47.69.255) 56(84) bytes of data.
^C
--- 9.47.69.255 ping statistics ---
12 packets transmitted, 0 received, 100% packet loss, time 11000ms

Thanks,
Pavithra

== Comment: #6 - VIPIN K. PARASHAR <email address hidden> - 2017-05-25 03:32:26 ==
ubuntu@ltc-firep2:~$ hostname -i
9.47.69.255
ubuntu@ltc-firep2:~$ uname -a
Linux ltc-firep2 4.10.0-21-generic #23~16.04.1-Ubuntu SMP Tue May 2 12:54:57 UTC 2017 ppc64le ppc64le ppc64le GNU/Linux
ubuntu@ltc-firep2:~$ cat /etc/os-release
NAME="Ubuntu"
VERSION="16.04.2 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.2 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
ubuntu@ltc-firep2:~$ tail /proc/cpuinfo
processor : 159
cpu : POWER8 (raw), altivec supported
clock : 2061.000000MHz
revision : 2.0 (pvr 004d 0200)

timebase : 512000000
platform : PowerNV
model : 8335-GTA
machine : PowerNV 8335-GTA
firmware : OPAL
ubuntu@ltc-firep2:~$

== Comment: #11 - VIPIN K. PARASHAR <email address hidden> - 2017-05-25 06:27:12 ==
System Memory stats
==============
ubuntu@ltc-firep2:~$ numactl -H
available: 2 nodes (0,8)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
node 0 size: 61321 MB
node 0 free: 60297 MB
node 8 cpus: 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159
node 8 size: 65303 MB
node 8 free: 64923 MB
node distances:
node 0 8
  0: 10 40
  8: 40 10
ubuntu@ltc-firep2:~$ free -h
              total used free shared buff/cache available
Mem: 123G 534M 122G 20M 868M 121G
Swap: 37G 0B 37G
ubuntu@ltc-firep2:~$ sudo sysctl vm | grep free
vm.min_free_kbytes = 360448
ubuntu@ltc-firep2:~$

The host has 123 GB of memory spread across two NUMA nodes (0 and 8).
Swap is configured at 37 GB and vm.min_free_kbytes is set to 360448 kB (roughly 360 MB).
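
The figures above can be cross-checked from the raw numactl and sysctl output. The following is a rough sketch, not an official tool: the helper name `node_memory` is hypothetical, the sample text is trimmed from the transcript above, and treating numactl's "MB" as MiB is an assumption.

```python
import re

# Matches the per-node "size" and "free" lines printed by `numactl -H`.
NODE_RE = re.compile(r"^node (\d+) (size|free): (\d+) MB$")


def node_memory(numactl_output):
    """Map node id -> {"size": MB, "free": MB} from `numactl -H` text."""
    nodes = {}
    for line in numactl_output.splitlines():
        m = NODE_RE.match(line.strip())
        if m:
            node, field, mb = int(m.group(1)), m.group(2), int(m.group(3))
            nodes.setdefault(node, {})[field] = mb
    return nodes


SAMPLE = """\
available: 2 nodes (0,8)
node 0 size: 61321 MB
node 0 free: 60297 MB
node 8 size: 65303 MB
node 8 free: 64923 MB
"""

nodes = node_memory(SAMPLE)
total_mb = sum(n["size"] for n in nodes.values())

# Value reported by `sysctl vm | grep free` in the transcript above.
min_free_kbytes = 360448
min_free_mib = min_free_kbytes / 1024

print("total memory:", total_mb, "MB, about", round(total_mb / 1024, 1), "GB")
print("vm.min_free_kbytes:", min_free_mib, "MiB")
print("reserve as share of total: %.2f%%" % (100 * min_free_mib / total_mb))
```

The two node sizes add up to the 123G total that free -h reports, and the reserve is well under one percent of memory.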

== Comment: #13 - VIPIN K. PARASHAR <email address hidden> - 2017-05-25 07:25:35 ==

[ 280.494345] NMI watchdog: BUG: soft lockup - CPU#5 stuck for 23s! [stress-ng-mmap:4172]
[ 280.495250] CPU: 5 PID: 4172 Comm: stress-ng-mmap Not tainted 4.10.0-21-generic #23~16.04.1-Ubuntu
[ 280.495262] task: c000000fe318c600 task.stack: c000000fc0d7c000
[ 280.495271] NIP: c0000000001a3248 LR: c0000000001a3204 CTR: c0000000000871f0
[ 280.495285] REGS: c000000fc0d7f7d0 TRAP: 0901 Not tainted (4.10.0-21-generic)
[ 280.495299] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>
[ 280.495408] CR: 44424444 XER: 20000000
[ 280.495416] CFAR: c0000000001a3250 SOFTE: 1
[ 280.495624] NIP [c0000000001a3248] smp_call_function_many+0x358/0x3f0
[ 280.495636] LR [c0000000001a3204] smp_call_function_many+0x314/0x3f0
[ 280.495645] Call Trace:
[ 280.495660] [c000000fc0d7fa50] [c0000000001a31e4] smp_call_function_many+0x2f4/0x3f0 (unreliable)
[ 280.495697] [c000000fc0d7fac0] [c0000000001a3430] kick_all_cpus_sync+0x40/0x50
[ 280.495726] [c000000fc0d7fae0] [c000000000069728] hash__pmdp_huge_get_and_clear+0xa8/0xf0
[ 280.495742] [c000000fc0d7fb10] [c00000000032b600] change_huge_pmd+0x210/0x2d0
[ 280.495762] [c000000fc0d7fb80] [c0000000002df638] change_protection_range+0xb38/0xe60
[ 280.495789] [c000000fc0d7fcc0] [c00000000030994c] change_prot_numa+0x3c/0xc0
[ 280.495815] [c000000fc0d7fcf0] [c00000000012e854] task_numa_work+0x2d4/0x3f0
[ 280.495844] [c000000fc0d7fdb0] [c00000000010f330] task_work_run+0x140/0x1a0
[ 280.495868] [c000000fc0d7fe00] [c00000000001db04] do_notify_resume+0xe4/0xf0
[ 280.495885] [c000000fc0d7fe30] [c00000000000b744] ret_from_except_lite+0x70/0x74
[ 280.495909] Instruction dump:
[ 280.495925] 3d020003 78691f24 39480fe0 7d2a482a e95d0000 7d4a4a14 812a0018 792707e1
[ 280.496022] 4182001c 60420000 7c210b78 7c421378 <812a0018> 792807e1 4082fff0 7c2004ac

[ 636.509312] NMI watchdog: BUG: soft lockup - CPU#29 stuck for 22s! [stress-ng-mrema:4205]
[ 636.510076] CPU: 29 PID: 4205 Comm: stress-ng-mrema Tainted: G L 4.10.0-21-generic #23~16.04.1-Ubuntu
[ 636.510090] task: c000000fdef86e00 task.stack: c000000fdd074000
[ 636.510104] NIP: c0000000001a3244 LR: c0000000001a3204 CTR: c0000000000871f0
[ 636.510136] REGS: c000000fdd077760 TRAP: 0901 Tainted: G L (4.10.0-21-generic)
[ 636.510146] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE>
[ 636.510302] CR: 44484824 XER: 20000000
[ 636.510319] CFAR: c0000000001a3250 SOFTE: 1
[ 636.510620] NIP [c0000000001a3244] smp_call_function_many+0x354/0x3f0
[ 636.510647] LR [c0000000001a3204] smp_call_function_many+0x314/0x3f0
[ 636.510658] Call Trace:
[ 636.510676] [c000000fdd0779e0] [c0000000001a31e4] smp_call_function_many+0x2f4/0x3f0 (unreliable)
[ 636.510759] [c000000fdd077a50] [c0000000001a3430] kick_all_cpus_sync+0x40/0x50
[ 636.510791] [c000000fdd077a70] [c00000000006f350] pmdp_invalidate+0x80/0xc0
[ 636.510820] [c000000fdd077aa0] [c000000000327d7c] __split_huge_pmd_locked+0x5bc/0xaa0
[ 636.510842] [c000000fdd077b60] [c00000000032b834] __split_huge_pmd+0x174/0x280
[ 636.510876] [c000000fdd077bc0] [c00000000032bc04] vma_adjust_trans_huge+0x134/0x1a0
[ 636.510909] [c000000fdd077c10] [c0000000002da1e4] __vma_adjust+0x114/0x8e0
[ 636.510932] [c000000fdd077cf0] [c0000000002dac2c] __split_vma.isra.5+0x27c/0x2a0
[ 636.510969] [c000000fdd077d40] [c0000000002dbb34] do_munmap+0x134/0x480
[ 636.510991] [c000000fdd077db0] [c0000000002e1550] SyS_mremap+0x1f0/0x550
[ 636.511029] [c000000fdd077e30] [c00000000000b184] system_call+0x38/0xe0
[ 636.511048] Instruction dump:
[ 636.511065] 409dfda4 3d020003 78691f24 39480fe0 7d2a482a e95d0000 7d4a4a14 812a0018
[ 636.511184] 792707e1 4182001c 60420000 7c210b78 <7c421378> 812a0018 792807e1 4082fff0

Even after increasing vm.min_free_kbytes to 2 GB, soft lockups and hangs are still
seen when running the stress-ng tool. This appears to be a kernel issue.
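
Both traces above end up spinning in the same place. A quick way to confirm that for a larger set of traces is to extract the symbol names from the "Call Trace" frames and compare the innermost ones. This is a sketch under assumptions: the frame regex is inferred from the ppc64le trace format in this report, the helper names are invented for illustration, and the sample contains only the first few frames of each trace.

```python
import re

# A call-trace frame in this report looks like:
# [  280.495697] [c000000fc0d7fac0] [c0000000001a3430] kick_all_cpus_sync+0x40/0x50
# i.e. timestamp, stack pointer, return address, then symbol+offset/size.
FRAME_RE = re.compile(
    r"\]\s+\[[0-9a-f]{16}\]\s+\[[0-9a-f]{16}\]\s+"
    r"(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def trace_symbols(text):
    """Return the symbol names of the call-trace frames, innermost first."""
    return [m.group("sym") for m in map(FRAME_RE.search, text.splitlines()) if m]


def common_leading_frames(a, b):
    """Return the frames two traces share, starting from the innermost."""
    common = []
    for x, y in zip(a, b):
        if x != y:
            break
        common.append(x)
    return common


TRACE_MMAP = """\
[ 280.495624] NIP [c0000000001a3248] smp_call_function_many+0x358/0x3f0
[ 280.495660] [c000000fc0d7fa50] [c0000000001a31e4] smp_call_function_many+0x2f4/0x3f0 (unreliable)
[ 280.495697] [c000000fc0d7fac0] [c0000000001a3430] kick_all_cpus_sync+0x40/0x50
[ 280.495726] [c000000fc0d7fae0] [c000000000069728] hash__pmdp_huge_get_and_clear+0xa8/0xf0
[ 280.495742] [c000000fc0d7fb10] [c00000000032b600] change_huge_pmd+0x210/0x2d0
"""

TRACE_MREMAP = """\
[ 636.510676] [c000000fdd0779e0] [c0000000001a31e4] smp_call_function_many+0x2f4/0x3f0 (unreliable)
[ 636.510759] [c000000fdd077a50] [c0000000001a3430] kick_all_cpus_sync+0x40/0x50
[ 636.510791] [c000000fdd077a70] [c00000000006f350] pmdp_invalidate+0x80/0xc0
[ 636.510820] [c000000fdd077aa0] [c000000000327d7c] __split_huge_pmd_locked+0x5bc/0xaa0
"""

frames_mmap = trace_symbols(TRACE_MMAP)
frames_mremap = trace_symbols(TRACE_MREMAP)
shared = common_leading_frames(frames_mmap, frames_mremap)

print("shared innermost frames:", shared)
```

In both cases the stuck CPU is waiting in smp_call_function_many(), reached through kick_all_cpus_sync() from a transparent huge page PMD operation.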

bugproxy (bugproxy) wrote : console log

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-154759 severity-high targetmilestone-inin16043
bugproxy (bugproxy) wrote : Kernel logs - dmesg

Default Comment by Bridge

Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1693566/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
affects: ubuntu → linux (Ubuntu)
Joseph Salisbury (jsalisbury) wrote :

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.12 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.12-rc2

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: New → Triaged
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2017-07-10 04:44 EDT-------
$ git log 3d3efb68c19e539f -1
commit 3d3efb68c19e539f0535c93a5258c1299270215f
Author: Paul Mackerras <email address hidden>
Date: Tue Jun 6 14:35:30 2017 +1000

KVM: PPC: Book3S HV: Ignore timebase offset on POWER9 DD1
POWER9 DD1 has an erratum where writing to the TBU40 register, which
is used to apply an offset to the timebase, can cause the timebase to
lose counts. This results in the timebase on some CPUs getting out of
sync with other CPUs, which then results in misbehaviour of the
timekeeping code.
To work around the problem, we make KVM ignore the timebase offset for
all guests on POWER9 DD1 machines. This means that live migration
cannot be supported on POWER9 DD1 machines.
Cc: <email address hidden> # v4.10+
Signed-off-by: Paul Mackerras <email address hidden>
$

$ git tag --contains 3d3efb68c19e539f
v4.12
v4.12-rc7
$

Hello Canonical,

Commit 3d3efb68c19e539f fixes this issue. It is available in v4.12-rc7 and later kernels.
Please include it in the Ubuntu 16.04.3 kernel.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-10-06 21:45 EDT-------
Hello Canonical,

Any update on this one?

no longer affects: linux (Ubuntu Vivid)
Changed in linux (Ubuntu Zesty):
status: New → Triaged
importance: Undecided → Medium
Changed in linux (Ubuntu):
status: Triaged → In Progress
Changed in linux (Ubuntu Zesty):
status: Triaged → In Progress
Changed in linux (Ubuntu):
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu Zesty):
assignee: nobody → Joseph Salisbury (jsalisbury)
Joseph Salisbury (jsalisbury) wrote :

I built a Zesty test kernel with a cherry-pick of commit 3d3efb68c19e539f. The test kernel can be downloaded from:

http://kernel.ubuntu.com/~jsalisbury/lp1693566/

Can you see if this kernel resolves this bug?

Changed in linux (Ubuntu Zesty):
status: In Progress → Incomplete
Changed in linux (Ubuntu):
status: In Progress → Incomplete