[Ubuntu 16.10] NMI watchdog and soft lockup while running htx memory tests in kernel 4.8.0-17-generic

Bug #1649513 reported by bugproxy
This bug affects 1 person
Affects                            Status    Importance  Assigned to                              Milestone
The Ubuntu-power-systems project   Invalid   High        Unassigned
linux (Ubuntu)                     Invalid   High        Ubuntu on IBM Power Systems Bug Triage

Bug Description

Issue:
--------------
An NMI watchdog soft lockup ("BUG: soft lockup") is reported when the HTX memory test is run on Ubuntu 16.10.

Environment:
--------------------------
Arch : ppc64le
Platform : Ubuntu KVM Guest
Host : Ubuntu 16.10 [kernel 4.8.0-17]
Guest : Ubuntu 16.10 [kernel 4.8.0-17]

Steps To Reproduce:
-----------------------------------

1 - Install an Ubuntu KVM guest, then install the htx package in the guest from this link:
http://ausgsa.ibm.com/projects/h/htx/public_html/htxonly/htxubuntu-413.deb

2 - Run the HTX mdt.mem test.

3 - The system hits a soft lockup, as shown below:

dmesg output:
[60287.590335] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 1141s! [hxemem64:23468]
[60287.590572] Modules linked in: vmx_crypto ip_tables x_tables autofs4 ibmvscsi crc32c_vpmsum
[60287.590585] CPU: 3 PID: 23468 Comm: hxemem64 Tainted: G L 4.8.0-17-generic #19-Ubuntu
[60287.590587] task: c0000012a0971e00 task.stack: c0000012a2d40000
[60287.590589] NIP: c000000000015004 LR: c000000000015004 CTR: c000000000165e90
[60287.590591] REGS: c0000012a2d439a0 TRAP: 0901 Tainted: G L (4.8.0-17-generic)
[60287.590592] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 48004244 XER: 00000000
[60287.590603] CFAR: c000000000165890 SOFTE: 1
               GPR00: c000000000165f9c c0000012a2d43c20 c0000000014e5e00 0000000000000900
               GPR04: 0000000000000000 0000000000000008 0000000100e4d61a 0000000000000000
               GPR08: 0000000000000000 0000000000000006 0000000100e4d619 c0000012bfee3130
               GPR12: 00003fffae6cdc70 00003fffae436900
[60287.590627] NIP [c000000000015004] arch_local_irq_restore+0x74/0x90
[60287.590630] LR [c000000000015004] arch_local_irq_restore+0x74/0x90
[60287.590631] Call Trace:
[60287.590634] [c0000012a2d43c20] [c0000012bfeccd80] 0xc0000012bfeccd80 (unreliable)
[60287.590639] [c0000012a2d43c40] [c000000000165f9c] run_timer_softirq+0x10c/0x230
[60287.590644] [c0000012a2d43ce0] [c000000000b94adc] __do_softirq+0x18c/0x3fc
[60287.590648] [c0000012a2d43de0] [c0000000000d5828] irq_exit+0xc8/0x100
[60287.590653] [c0000012a2d43e00] [c000000000024810] timer_interrupt+0xa0/0xe0
[60287.590657] [c0000012a2d43e30] [c000000000002814] decrementer_common+0x114/0x180
[60287.590659] Instruction dump:
[60287.590662] 994d023a 2fa30000 409e0024 e92d0020 61298000 7d210164 38210020 e8010010
[60287.590670] 7c0803a6 4e800020 60420000 4bfed259 <60000000> 4bffffe4 60420000 e92d0020
[63127.581494] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 339s! [hxemem64:23467]
[63127.629682] Modules linked in: vmx_crypto ip_tables x_tables autofs4 ibmvscsi crc32c_vpmsum
[63127.629699] CPU: 2 PID: 23467 Comm: hxemem64 Tainted: G L 4.8.0-17-generic #19-Ubuntu
[63127.629701] task: c0000012a0965800 task.stack: c0000012a2d58000
[63127.629703] NIP: 0000000010011e60 LR: 000000001000ec6c CTR: 0000000000f33196
[63127.629706] REGS: c0000012a2d5bea0 TRAP: 0901 Tainted: G L (4.8.0-17-generic)
[63127.629707] MSR: 800000010000d033 <SF,EE,PR,ME,IR,DR,RI,LE,TM[E]> CR: 42004482 XER: 00000000
[63127.629719] CFAR: 0000000010011e68 SOFTE: 1
               GPR00: 000000001000e854 00003fffadc2e540 0000000010047f00 000000000000000d
               GPR04: 0000000002000000 00003ff5a8000000 5a5a5a5a5a5a5a5a 00003ff5b0667348
               GPR08: 0000000000000000 000000001006c8e0 000000001006ca04 fffffffffffff001
               GPR12: 00003fffae6cdc70 00003fffadc36900
[63127.629740] NIP [0000000010011e60] 0x10011e60
[63127.629742] LR [000000001000ec6c] 0x1000ec6c
[63127.629743] Call Trace:
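
To compare the soft lockup reports across a long dmesg log, the header lines can be extracted with a small script. The following is only a sketch: the regular expression is written against the two "soft lockup" lines shown above, and the names `SOFT_LOCKUP`, `parse_soft_lockups` and `SAMPLE` are illustrative, not part of HTX or the kernel.

```python
import re

# Matches header lines of the form seen in this report, e.g.
# [60287.590335] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 1141s! [hxemem64:23468]
SOFT_LOCKUP = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NMI watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def parse_soft_lockups(text):
    """Return one dict per soft lockup header line found in dmesg text."""
    events = []
    for m in SOFT_LOCKUP.finditer(text):
        events.append({
            "timestamp": float(m.group("ts")),   # seconds since boot
            "cpu": int(m.group("cpu")),
            "stuck_seconds": int(m.group("secs")),
            "comm": m.group("comm"),             # task name
            "pid": int(m.group("pid")),
        })
    return events


# The two header lines copied from the dmesg output in this report.
SAMPLE = """\
[60287.590335] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 1141s! [hxemem64:23468]
[63127.581494] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 339s! [hxemem64:23467]
"""

for event in parse_soft_lockups(SAMPLE):
    print(event)
```

Both reports name hxemem64 tasks (PIDs 23468 and 23467) on different CPUs, roughly 2840 seconds apart by the dmesg timestamps.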

== Comment: #3 - Santhosh G <email address hidden> - 2016-09-28 02:17:29 ==
Memory Info :

root@ubuntu:~# cat /proc/meminfo
MemTotal: 78539776 kB
MemFree: 72219392 kB
MemAvailable: 77217088 kB
Buffers: 212544 kB
Cached: 5249088 kB
SwapCached: 0 kB
Active: 1440832 kB
Inactive: 4107264 kB
Active(anon): 93888 kB
Inactive(anon): 8640 kB
Active(file): 1346944 kB
Inactive(file): 4098624 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 3443648 kB
SwapFree: 3443648 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 87296 kB
Mapped: 30400 kB
Shmem: 16128 kB
Slab: 381440 kB
SReclaimable: 295872 kB
SUnreclaim: 85568 kB
KernelStack: 2176 kB
PageTables: 2048 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 42639808 kB
Committed_AS: 224768 kB
VmallocTotal: 8589934592 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 9
HugePages_Free: 9
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 16384 kB

free -h :
              total        used        free      shared  buff/cache   available
Mem:            74G        545M         68G         15M        5.5G         73G
Swap:          3.3G          0B        3.3G
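
The /proc/meminfo values above are in kB, while free -h prints rounded-down GiB figures. A rough sketch of how the two relate is given below; `parse_meminfo`, `kib_to_gib` and `SAMPLE` are illustrative names, and the sample only repeats a few of the fields listed above.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo style text into a dict of integer values.

    Values are returned as printed (kB for sizes, a plain count for the
    HugePages_* counters); the unit suffix is dropped.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            fields[name.strip()] = int(parts[0])
    return fields


def kib_to_gib(kib):
    """Convert a kB (KiB) value from /proc/meminfo to GiB."""
    return kib / (1024 * 1024)


# A few fields copied from the meminfo output in this report.
SAMPLE = """\
MemTotal: 78539776 kB
MemFree: 72219392 kB
MemAvailable: 77217088 kB
HugePages_Total: 9
Hugepagesize: 16384 kB
"""

info = parse_meminfo(SAMPLE)
for key in ("MemTotal", "MemFree", "MemAvailable"):
    print(f"{key}: {kib_to_gib(info[key]):.2f} GiB")

# Memory reserved for huge pages, in kB.
print("hugepage pool kB:", info["HugePages_Total"] * info["Hugepagesize"])
```

With these inputs MemTotal comes to about 74.9 GiB, which free -h shows as 74G, and the nine 16384 kB huge pages account for 147456 kB.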

== Comment: #5 - Santhosh G <email address hidden> - 2016-09-29 02:49:49 ==
(In reply to comment #4)
> Hi Santhosh,
> After how long are you seeing this error ?
> Can you share the output by:
> 1) start the mdt.mem tests.
> 2) While the tests are running what is the output of 'free -h' ?
> 3) Attach /tmp/htxerr
>
> Thank you.

Hi Vaishnavi,

I ran the test for more than 12 hours and am not sure exactly when the lockup occurs.

Before starting the tests,

free -h :
              total        used        free      shared  buff/cache   available
Mem:            74G        528M         68G         15M        5.5G         73G
Swap:          3.3G          0B        3.3G

After running the tests for more than 10 min :

              total        used        free      shared  buff/cache   available
Mem:            74G        570M         20G         48G         53G         25G
Swap:          3.3G          0B        3.3G

The memory usage gradually increases.

Not sure exactly at which point the lockup occurs.

And /tmp/htxerror is empty.
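
Since it is unclear at which point the lockup occurs, one option is to log memory figures periodically and line them up against the dmesg timestamps afterwards. The sketch below is an assumption about how that could be done, not something that was run for this report; the log path /tmp/meminfo-samples.log, the 60 second interval, the chosen fields, and the function names are all hypothetical.

```python
import time

# Fields of interest; chosen for illustration only.
FIELDS = ("MemFree", "MemAvailable", "Shmem", "SwapFree")


def sample_line(meminfo_text, timestamp):
    """Format one log line from /proc/meminfo text and a UNIX timestamp."""
    values = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts and name.strip() in FIELDS:
            values[name.strip()] = parts[0]
    # Missing fields are logged as '?' rather than dropped.
    cols = " ".join(f"{name}={values.get(name, '?')}kB" for name in FIELDS)
    return f"{timestamp:.0f} {cols}"


def log_forever(path="/tmp/meminfo-samples.log", interval=60):
    """Append one sample every `interval` seconds until interrupted.

    The path and interval are hypothetical defaults.
    """
    while True:
        with open("/proc/meminfo") as src:
            line = sample_line(src.read(), time.time())
        # Reopen in append mode each time so earlier samples are already
        # written out if the guest stops responding.
        with open(path, "a") as out:
            out.write(line + "\n")
        time.sleep(interval)
```

Calling log_forever() in the guest before starting mdt.mem would leave a timestamped trail; the last line written before a soft lockup message gives an approximate memory state at that time.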

== Comment: #7 - Vaishnavi Bhat <email address hidden> - 2016-09-30 04:03:23 ==
Hi Santhosh ,

While running mdt.mem, we see that about 60% of memory is used and free swap is reduced to 0B.
              total        used        free      shared  buff/cache   available
Mem:            74G        570M         20G         48G         53G         25G
Swap:          3.3G          0B        3.3G

Top output
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 1860 root      38  18 48.484g 0.046t 0.046t S 318.1 63.5   4865:53 hxemem64

Also, dmesg shows traces of OOM and soft lockup with hxemem.

Can you please try increasing the vm.min_free_kbytes value and see if it shows any improvement? I would suggest starting with double the current value.
Current value :
$ sysctl -n vm.min_free_kbytes
180224
New value:
$ sysctl -w vm.min_free_kbytes=<new value>

Thank you.
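
The doubling suggested above can be sketched as follows. This is a minimal illustration: it only computes the new value and prints the sysctl command instead of applying it, and the function names are hypothetical. On a live system the current value could be read from /proc/sys/vm/min_free_kbytes instead of the literal used here.

```python
def doubled_min_free_kbytes(current_text):
    """Return twice the value read from vm.min_free_kbytes (in kB)."""
    return int(current_text.strip()) * 2


def sysctl_command(new_value):
    """Build the command line that would apply the new value (needs root)."""
    return f"sysctl -w vm.min_free_kbytes={new_value}"


# Current value as reported in this comment.
current = "180224\n"
new_value = doubled_min_free_kbytes(current)
print(sysctl_command(new_value))
```

For the current value of 180224 this prints a command setting vm.min_free_kbytes to 360448.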

== Comment: #10 - Vaishnavi Bhat <email address hidden> - 2016-10-20 04:06:20 ==
(In reply to comment #9)
> Hi Vaishnavi,
>
> I am able to reproduce this issue even in 4.8.0-22-generic
>
> o/p:
> sysctl -n vm.min_free_kbytes
> 360448
>
> Please take a look into the issue.
>
> Thanks.

Thanks for the confirmation. The issue is still reproduced with
sysctl -n vm.min_free_kbytes
360448

Thank you.

Revision history for this message
bugproxy (bugproxy) wrote : Attached /var/log/syslog

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-146905 severity-high targetmilestone-inin---
bugproxy (bugproxy) wrote : sosreport

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → linux (Ubuntu)
Manoj Iyer (manjo) wrote :

IBM needs to identify a patch that fixes this issue. We do not have a good mechanism to reproduce the bug.

Changed in linux (Ubuntu):
status: New → Incomplete
bugproxy (bugproxy)
tags: added: targetmilestone-inin1704
removed: targetmilestone-inin---
bugproxy (bugproxy) wrote : dmesg log

------- Comment (attachment only) From <email address hidden> 2017-02-21 03:59 EDT-------

Manoj Iyer (manjo)
Changed in ubuntu-power-systems:
status: New → Incomplete
Manoj Iyer (manjo)
tags: added: ubuntu-16.10
Manoj Iyer (manjo)
Changed in ubuntu-power-systems:
importance: Undecided → High
Changed in linux (Ubuntu):
importance: Undecided → High
assignee: Taco Screen team (taco-screen-team) → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
Manoj Iyer (manjo) wrote :

Please note that release 16.10 is no longer supported. If this problem is still valid for 16.04.3 or the latest development release, please open a separate bug.

Changed in linux (Ubuntu):
status: Incomplete → Invalid
Changed in ubuntu-power-systems:
status: Incomplete → Invalid