[Ubuntu 16.10] NMI watchdog and soft lockup while running htx memory tests in kernel 4.8.0-17-generic
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
The Ubuntu-power-systems project | Invalid | High | Unassigned |
linux (Ubuntu) | Invalid | High | Ubuntu on IBM Power Systems Bug Triage |
Bug Description
Issue:
--------------
An NMI watchdog bug and soft lockup occur when the HTX memory test is run on Ubuntu 16.10.
Environment:
-------
Arch : ppc64le
Platform : Ubuntu KVM Guest
Host : Ubuntu 16.10 [kernel 4.8.0-17]
Guest : Ubuntu 16.10 [kernel 4.8.0-17]
Steps To Reproduce:
-------
1 - Install an Ubuntu KVM guest, then install in the guest the htx package obtained from the link:
http://
2 - Run the HTX mdt.mem test.
3 - The system hits the soft lockup shown below:
dmesg output:
[60287.590335] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 1141s! [hxemem64:23468]
[60287.590572] Modules linked in: vmx_crypto ip_tables x_tables autofs4 ibmvscsi crc32c_vpmsum
[60287.590585] CPU: 3 PID: 23468 Comm: hxemem64 Tainted: G L 4.8.0-17-generic #19-Ubuntu
[60287.590587] task: c0000012a0971e00 task.stack: c0000012a2d40000
[60287.590589] NIP: c000000000015004 LR: c000000000015004 CTR: c000000000165e90
[60287.590591] REGS: c0000012a2d439a0 TRAP: 0901 Tainted: G L (4.8.0-17-generic)
[60287.590592] MSR: 8000000000009033 <SF,EE,
[60287.590603] CFAR: c000000000165890 SOFTE: 1
[60287.590627] NIP [c000000000015004] arch_local_
[60287.590630] LR [c000000000015004] arch_local_
[60287.590631] Call Trace:
[60287.590634] [c0000012a2d43c20] [c0000012bfeccd80] 0xc0000012bfeccd80 (unreliable)
[60287.590639] [c0000012a2d43c40] [c000000000165f9c] run_timer_
[60287.590644] [c0000012a2d43ce0] [c000000000b94adc] __do_softirq+
[60287.590648] [c0000012a2d43de0] [c0000000000d5828] irq_exit+0xc8/0x100
[60287.590653] [c0000012a2d43e00] [c000000000024810] timer_interrupt
[60287.590657] [c0000012a2d43e30] [c000000000002814] decrementer_
[60287.590659] Instruction dump:
[60287.590662] 994d023a 2fa30000 409e0024 e92d0020 61298000 7d210164 38210020 e8010010
[60287.590670] 7c0803a6 4e800020 60420000 4bfed259 <60000000> 4bffffe4 60420000 e92d0020
[63127.581494] NMI watchdog: BUG: soft lockup - CPU#2 stuck for 339s! [hxemem64:23467]
[63127.629682] Modules linked in: vmx_crypto ip_tables x_tables autofs4 ibmvscsi crc32c_vpmsum
[63127.629699] CPU: 2 PID: 23467 Comm: hxemem64 Tainted: G L 4.8.0-17-generic #19-Ubuntu
[63127.629701] task: c0000012a0965800 task.stack: c0000012a2d58000
[63127.629703] NIP: 0000000010011e60 LR: 000000001000ec6c CTR: 0000000000f33196
[63127.629706] REGS: c0000012a2d5bea0 TRAP: 0901 Tainted: G L (4.8.0-17-generic)
[63127.629707] MSR: 800000010000d033 <SF,EE,
[63127.629719] CFAR: 0000000010011e68 SOFTE: 1
[63127.629740] NIP [0000000010011e60] 0x10011e60
[63127.629742] LR [000000001000ec6c] 0x1000ec6c
[63127.629743] Call Trace:
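To line these lockups up with other logs (for example, memory samples taken during the run), the watchdog lines can be pulled out of dmesg with a small filter. This is a minimal Python sketch, assuming the message format shown in the log above (`NMI watchdog: BUG: soft lockup - CPU#N stuck for Ns! [comm:pid]`); the function and type names are hypothetical.

```python
import re
from typing import List, NamedTuple

# Assumes the 4.8-era message format seen in this report, e.g.
# "[60287.590335] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 1141s! [hxemem64:23468]"
SOFT_LOCKUP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NMI watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


class SoftLockup(NamedTuple):
    timestamp: float    # dmesg prefix, roughly seconds since boot
    cpu: int            # CPU reported as stuck
    stuck_seconds: int  # how long the watchdog says it was stuck
    comm: str           # task name, e.g. hxemem64
    pid: int            # task PID


def parse_soft_lockups(dmesg_text: str) -> List[SoftLockup]:
    """Return one SoftLockup per soft-lockup watchdog line in dmesg_text."""
    events = []
    for line in dmesg_text.splitlines():
        m = SOFT_LOCKUP_RE.search(line)
        if m:
            events.append(SoftLockup(
                timestamp=float(m["ts"]),
                cpu=int(m["cpu"]),
                stuck_seconds=int(m["secs"]),
                comm=m["comm"],
                pid=int(m["pid"]),
            ))
    return events
```

Fed the log above, it yields two events: CPU 3 (hxemem64, PID 23468, stuck 1141 s) and CPU 2 (hxemem64, PID 23467, stuck 339 s).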
== Comment: #3 - Santhosh G <email address hidden> - 2016-09-28 02:17:29 ==
Memory Info :
root@ubuntu:~# cat /proc/meminfo
MemTotal: 78539776 kB
MemFree: 72219392 kB
MemAvailable: 77217088 kB
Buffers: 212544 kB
Cached: 5249088 kB
SwapCached: 0 kB
Active: 1440832 kB
Inactive: 4107264 kB
Active(anon): 93888 kB
Inactive(anon): 8640 kB
Active(file): 1346944 kB
Inactive(file): 4098624 kB
Unevictable: 0 kB
Mlocked: 0 kB
SwapTotal: 3443648 kB
SwapFree: 3443648 kB
Dirty: 0 kB
Writeback: 0 kB
AnonPages: 87296 kB
Mapped: 30400 kB
Shmem: 16128 kB
Slab: 381440 kB
SReclaimable: 295872 kB
SUnreclaim: 85568 kB
KernelStack: 2176 kB
PageTables: 2048 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 42639808 kB
Committed_AS: 224768 kB
VmallocTotal: 8589934592 kB
VmallocUsed: 0 kB
VmallocChunk: 0 kB
HardwareCorrupted: 0 kB
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
ShmemPmdMapped: 0 kB
CmaTotal: 0 kB
CmaFree: 0 kB
HugePages_Total: 9
HugePages_Free: 9
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 16384 kB
free -h :
total used free shared buff/cache available
Mem: 74G 545M 68G 15M 5.5G 73G
Swap: 3.3G 0B 3.3G
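The meminfo dump above is easier to compare across runs once it is parsed into numbers. A minimal Python sketch (helper names are hypothetical): values with a kB unit stay in kB, and unitless counters such as HugePages_Total stay plain integers. With the values above, the static huge page pool is 9 × 16384 kB = 147456 kB (144 MiB), a small fraction of the 78539776 kB MemTotal.

```python
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Parse /proc/meminfo-style text into {field: value}.

    Fields with a 'kB' unit are returned in kB; unitless counters
    (e.g. HugePages_Total) are returned as plain integers.
    """
    info = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip anything that is not "Key: value [kB]"
        key, rest = line.split(":", 1)
        fields = rest.split()
        if fields:
            info[key.strip()] = int(fields[0])
    return info


def hugepage_pool_kb(info: Dict[str, int]) -> int:
    """Memory reserved for the static huge page pool, in kB."""
    return info["HugePages_Total"] * info["Hugepagesize"]


def read_meminfo(path: str = "/proc/meminfo") -> Dict[str, int]:
    """Read and parse the live /proc/meminfo (or a saved copy)."""
    with open(path) as f:
        return parse_meminfo(f.read())
```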
== Comment: #5 - Santhosh G <email address hidden> - 2016-09-29 02:49:49 ==
(In reply to comment #4)
> Hi Santhosh,
> After how long are you seeing this error ?
> Can you share the output by:
> 1) start the mdt.mem tests.
> 2) While the tests are running what is the output of 'free -h' ?
> 3) Attach /tmp/htxerr
>
> Thank you.
Hi Vaishnavi,
I have run the test for more than 12 hours and am not sure exactly when the lockup occurs.
Before starting the tests:
free -h :
total used free shared buff/cache available
Mem: 74G 528M 68G 15M 5.5G 73G
Swap: 3.3G 0B 3.3G
After running the tests for more than 10 minutes:
total used free shared buff/cache available
Mem: 74G 570M 20G 48G 53G 25G
Swap: 3.3G 0B 3.3G
The memory usage gradually increases.
I am not sure exactly at which point the lockup occurs.
Also, /tmp/htxerr is empty.
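Since the exact point of the lockup is unknown, one option is to log memory counters periodically while mdt.mem runs and compare them with the dmesg timestamps afterward. A minimal Python sketch, not part of HTX: the sampled fields, interval and output path are assumptions chosen for illustration. It records /proc/uptime because dmesg timestamps are roughly seconds since boot; the two should line up only approximately.

```python
import os
import time
from typing import Dict, Iterable

# Counters to record; chosen for illustration (Shmem backs the 48G "shared" column above).
FIELDS = ("MemFree", "MemAvailable", "Shmem", "SwapFree")


def read_fields(meminfo_text: str, fields: Iterable[str] = FIELDS) -> Dict[str, int]:
    """Pick selected kB values out of /proc/meminfo-style text."""
    wanted = set(fields)
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if key in wanted and parts:
            values[key] = int(parts[0])
    return values


def uptime_seconds(path: str = "/proc/uptime") -> float:
    """Seconds since boot; approximately comparable to dmesg timestamps."""
    with open(path) as f:
        return float(f.read().split()[0])


def monitor(out_path: str, interval: float = 60.0) -> None:
    """Append one CSV row every `interval` seconds until interrupted (Ctrl-C)."""
    new_file = not os.path.exists(out_path) or os.path.getsize(out_path) == 0
    with open(out_path, "a") as out:
        if new_file:
            out.write("uptime_s," + ",".join(f"{k}_kB" for k in FIELDS) + "\n")
        while True:
            with open("/proc/meminfo") as f:
                vals = read_fields(f.read())
            row = [f"{uptime_seconds():.0f}"] + [str(vals.get(k, "")) for k in FIELDS]
            out.write(",".join(row) + "\n")
            out.flush()
            os.fsync(out.fileno())  # so rows written so far survive a hang or forced reboot
            time.sleep(interval)
```

For example, `monitor("/root/htx-meminfo.csv")` (hypothetical path) can run in a separate session for the duration of the test; the uptime column can then be matched against the `[60287.590335]`-style dmesg prefixes.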
== Comment: #7 - Vaishnavi Bhat <email address hidden> - 2016-09-30 04:03:23 ==
Hi Santhosh,
While running mdt.mem, we see that about 60% of memory is used and free swap is reduced to 0B.
total used free shared buff/cache available
Mem: 74G 570M 20G 48G 53G 25G
Swap: 3.3G 0B 3.3G
top output:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1860 root 38 18 48.484g 0.046t 0.046t S 318.1 63.5 4865:53 hxemem64
The dmesg output also shows OOM traces and soft lockups involving hxemem.
Can you please try increasing the vm.min_free_kbytes value and see whether it helps? I would suggest starting with double the current value.
Current value:
$ sysctl -n vm.min_free_kbytes
180224
New value:
$sysctl -w vm.min_
Thank you.
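The suggested change can also be scripted. A minimal Python sketch (helper names are hypothetical): reading /proc/sys/vm/min_free_kbytes returns the same value as `sysctl -n vm.min_free_kbytes`, and writing it as root has the same effect as `sysctl -w`. The change does not persist across reboots; a file under /etc/sysctl.d/ is the usual place for that.

```python
MIN_FREE_PATH = "/proc/sys/vm/min_free_kbytes"


def read_min_free_kbytes(path: str = MIN_FREE_PATH) -> int:
    """Current value, same as `sysctl -n vm.min_free_kbytes`."""
    with open(path) as f:
        return int(f.read().strip())


def suggested_min_free_kbytes(current: int, factor: int = 2) -> int:
    """Starting point suggested above: double the current value."""
    return current * factor


def set_min_free_kbytes(value: int, path: str = MIN_FREE_PATH) -> None:
    """Same effect as `sysctl -w vm.min_free_kbytes=<value>`; needs root, not persistent."""
    with open(path, "w") as f:
        f.write(f"{value}\n")
```

Doubling the current 180224 gives 360448, the value reported in comment #10.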
== Comment: #10 - Vaishnavi Bhat <email address hidden> - 2016-10-20 04:06:20 ==
(In reply to comment #9)
> Hi Vaishnavi,
>
> I am able to reproduce this issue even in 4.8.0-22-generic
>
> o/p:
> sysctl -n vm.min_free_kbytes
> 360448
>
> Please, take a look in to the issue.
>
> Thanks.
Thanks for the confirmation; the issue still reproduces with
sysctl -n vm.min_free_kbytes
360448
Thank you.
tags: added: targetmilestone-inin1704; removed: targetmilestone-inin---
Changed in ubuntu-power-systems: status New → Incomplete
tags: added: ubuntu-16.10
Changed in ubuntu-power-systems: importance Undecided → High
Changed in linux (Ubuntu): importance Undecided → High
Changed in linux (Ubuntu): assignee Taco Screen team (taco-screen-team) → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)