[Ubuntu 16.10] - System crashes and gives out call traces when libhugetlbfs test suite is run.

Bug #1632458 reported by bugproxy
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Triaged
Importance: High
Assigned to: Canonical Kernel Team
Milestone: (none listed)

Bug Description

== Comment: #0 - Santhosh G <email address hidden> - 2016-09-27 01:55:00 ==
Issue:
The kernel is unable to handle a paging request when the heapshrink test case from the libhugetlbfs suite is run.

Environment:
arch - ppc64le
ubuntu kvm guest

Host related Info:
Kernel:
-----------------
uname -a
Linux ltc-haba1 4.8.0-17-generic #19-Ubuntu SMP Sun Sep 25 06:35:40 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

Memory:
--------------------
root@ltc-haba1:~# free -h
              total        used        free      shared  buff/cache   available
Mem:           255G         65G        187G         22M        1.9G        188G
Swap:          225G          0B        225G

Hugepages configured:
----------------------------------------
root@ltc-haba1:~# cat /proc/meminfo | grep -i Huge
AnonHugePages: 81920 kB
ShmemHugePages: 0 kB
HugePages_Total: 4096
HugePages_Free: 3584
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 16384 kB

Guest Related Info:
--------------------------------------
Kernel:
-------------------------
root@ubuntu:~/libhugetlbfs# uname -a
Linux ubuntu 4.8.0-17-generic #19-Ubuntu SMP Sun Sep 25 06:35:40 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

Memory:
---------------------------------
root@ubuntu:~/libhugetlbfs# free -h
              total        used        free      shared  buff/cache   available
Mem:           8.0G        133M        7.7G         15M        132M        7.5G
Swap:          3.3G          0B        3.3G

Hugepages configured:
-------------------------------------------
root@ubuntu:~/libhugetlbfs# cat /proc/meminfo | grep -i Huge
AnonHugePages: 0 kB
ShmemHugePages: 0 kB
HugePages_Total: 256
HugePages_Free: 256
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 16384 kB
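
For readers who want the pool sizes rather than raw page counts, the sketch below converts the HugePages fields quoted above into GiB. The function name `hugepage_summary` and the two sample strings are illustrative only and are not part of the original report; the numbers are copied from the host and guest `/proc/meminfo` output.

```python
# Values copied from the host and guest /proc/meminfo output in this report.
HOST_MEMINFO = """\
HugePages_Total: 4096
HugePages_Free: 3584
Hugepagesize: 16384 kB
"""

GUEST_MEMINFO = """\
HugePages_Total: 256
HugePages_Free: 256
Hugepagesize: 16384 kB
"""


def hugepage_summary(meminfo_text):
    """Return the huge page pool size and the amount in use, both in GiB.

    Hypothetical helper for illustration; it only understands
    'Key: value [unit]' lines like those in /proc/meminfo.
    """
    fields = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[key.strip()] = int(parts[0])

    page_kb = fields["Hugepagesize"]      # size of one huge page in kB
    total = fields["HugePages_Total"]     # pages in the pool
    free = fields["HugePages_Free"]       # pages not yet allocated
    kb_per_gib = 1024 * 1024
    return {
        "pool_gib": total * page_kb / kb_per_gib,
        "in_use_gib": (total - free) * page_kb / kb_per_gib,
    }


if __name__ == "__main__":
    print("host :", hugepage_summary(HOST_MEMINFO))
    print("guest:", hugepage_summary(GUEST_MEMINFO))
```

With the figures above, the host pool works out to 64 GiB with 8 GiB in use, and the guest pool to 4 GiB with nothing in use. The 8 GiB in use on the host is the same size as the guest's total memory, which is consistent with the guest being backed by huge pages.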

Steps to reproduce:
1 - Install an Ubuntu KVM guest with hugepage memory backing.
2 - git clone the latest libhugetlbfs from https://github.com/libhugetlbfs/libhugetlbfs.git
3 - Configure hugepages in the guest and run make check.
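
The guest-side steps could be scripted roughly as follows. This is only a sketch under assumptions: the report does not say how the huge pages were configured, so the `sysctl -w vm.nr_hugepages=...` call, the default of 256 pages (taken from the guest's HugePages_Total), the explicit build step, and the helper names `reproduction_commands` and `run_all` are all illustrative. Any hugetlbfs mount the test suite may need is not covered.

```python
import subprocess

# URL as given in the reproduction steps.
REPO_URL = "https://github.com/libhugetlbfs/libhugetlbfs.git"


def reproduction_commands(nr_hugepages=256, workdir="libhugetlbfs"):
    """Build the list of commands for steps 2 and 3 (hypothetical helper).

    Nothing is executed here; the caller decides whether to run them.
    """
    return [
        ["git", "clone", REPO_URL, workdir],
        # Assumption: huge pages are reserved through the vm.nr_hugepages sysctl.
        ["sysctl", "-w", f"vm.nr_hugepages={nr_hugepages}"],
        ["make", "-C", workdir],
        ["make", "-C", workdir, "check"],
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure (needs root)."""
    for argv in commands:
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Print the commands rather than running them, since the last one
    # is expected to crash an affected guest.
    for argv in reproduction_commands():
        print(" ".join(argv))
```

Calling `run_all(reproduction_commands())` as root inside the guest would execute the steps; on an affected kernel the guest is expected to oops during `make check`.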

xmon is configured on the system.
The system prints call traces and drops into the xmon console:

HUGETLB_VERBOSE=1 HUGETLB_MORECORE=yes heap-overflow (16M: 64): [ 281.735713] Unable to handle kernel paging request for data at address 0x4200000000328e38
[ 281.735804] Faulting instruction address: 0xc00000000027b410
cpu 0x1: Vector: 300 (Data Access) at [c0000001fa8c3730]
    pc: c00000000027b410: shrink_active_list+0x300/0x4d0
    lr: c00000000027b3f4: shrink_active_list+0x2e4/0x4d0
    sp: c0000001fa8c39b0
   msr: 800000010280b033
   dar: 4200000000328e38
 dsisr: 42000000
  current = 0xc0000001fa8adc00
  paca = 0xc00000000fb80900 softe: 0 irq_happened: 0x01
    pid = 50, comm = kswapd0
Linux version 4.8.0-17-generic (buildd@bos01-ppc64el-025) (gcc version 6.2.0 20160914 (Ubuntu 6.2.0-3ubuntu15) ) #19-Ubuntu SMP Sun Sep 25 06:35:40 UTC 2016 (Ubuntu 4.8.0-17.19-generic 4.8.0-rc7)
enter ? for help
[c0000001fa8c3aa0] c00000000027bbdc shrink_node_memcg+0x5fc/0x800
[c0000001fa8c3bc0] c00000000027bf0c shrink_node+0x12c/0x3f0
[c0000001fa8c3c80] c00000000027d500 kswapd+0x460/0x990
[c0000001fa8c3d80] c0000000000fd120 kthread+0x110/0x130
[c0000001fa8c3e30] c0000000000098f0 ret_from_kernel_thread+0x5c/0x6c

xmon logs:

1:mon> e
cpu 0x1: Vector: 300 (Data Access) at [c0000001fa8e7730]
    pc: c00000000027b410: shrink_active_list+0x300/0x4d0
    lr: c00000000027b3f4: shrink_active_list+0x2e4/0x4d0
    sp: c0000001fa8e79b0
   msr: 800000010280b033
   dar: 42000000000c58d0
 dsisr: 42000000
  current = 0xc0000001fa8a0000
  paca = 0xc00000000fb80900 softe: 0 irq_happened: 0x01
    pid = 50, comm = kswapd0
Linux version 4.8.0-17-generic (buildd@bos01-ppc64el-025) (gcc version 6.2.0 20160914 (Ubuntu 6.2.0-3ubuntu15) ) #19-Ubuntu SMP Sun Sep 25 06:35:40 UTC 2016 (Ubuntu 4.8.0-17.19-generic 4.8.0-rc7)

1:mon> r
R00 = c00000000027b3f4 R16 = c0000001fffcfe00
R01 = c0000001fa8e79b0 R17 = 000000000000010a
R02 = c0000000014e5e00 R18 = 42000000000cbdd0
R03 = 0000000000000001 R19 = c0000001fffc6300
R04 = 0000000000000005 R20 = c0000001fa8e79e0
R05 = 0000000000000000 R21 = c0000001fe144800
R06 = f0000000003bc9a0 R22 = 0000000000000001
R07 = 00000001fee30000 R23 = 0000000000000005
R08 = 000000000000002a R24 = 000000000000207d
R09 = 0000000000000000 R25 = 0000000000000100
R10 = c000000001034e86 R26 = 0000000000000200
R11 = 0000000000000000 R27 = c0000001fa8e79d0
R12 = 0000000000002200 R28 = c0000001fa8e7ca0
R13 = c00000000fb80900 R29 = 0000000000000040
R14 = f000000000380000 R30 = c0000001fe144800
R15 = f000000000380020 R31 = c0000001fa8e79f0
pc = c00000000027b410 shrink_active_list+0x300/0x4d0
cfar= c0000000000b47a4 kvmppc_call_hv_entry+0x130/0x134
lr = c00000000027b3f4 shrink_active_list+0x2e4/0x4d0
msr = 800000010280b033 cr = 24022222
ctr = c0000000002ba900 xer = 0000000020000000 trap = 300
dar = 42000000000c58d0 dsisr = 42000000

1:mon> t
[c0000001fa8e7aa0] c00000000027bc70 shrink_node_memcg+0x690/0x800
[c0000001fa8e7bc0] c00000000027bf0c shrink_node+0x12c/0x3f0
[c0000001fa8e7c80] c00000000027d500 kswapd+0x460/0x990
[c0000001fa8e7d80] c0000000000fd120 kthread+0x110/0x130
[c0000001fa8e7e30] c0000000000098f0 ret_from_kernel_thread+0x5c/0x6c
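
As a reading aid for the xmon output above, the sketch below decodes the `dsisr` value and classifies the `dar`, `sp` and `R06` addresses by their top nibble. The DSISR bit values (0x40000000 for "no translation found", 0x02000000 for "store") and the region table (0xc linear mapping, 0xd vmalloc, 0xf vmemmap) are my understanding of the 64-bit hash MMU layout in kernels of this era; treat them as assumptions and check them against the kernel headers. The helper names are hypothetical.

```python
# Assumed DSISR bits (verify against arch/powerpc headers for your kernel).
DSISR_NOHPTE = 0x40000000   # no translation found for the address
DSISR_ISSTORE = 0x02000000  # the faulting access was a store

# Assumed kernel region ids, taken from the top 4 bits of the address.
KERNEL_REGIONS = {
    0xC: "kernel linear mapping",
    0xD: "vmalloc",
    0xF: "vmemmap",
}


def region_of(address):
    """Name the kernel region an address falls in, if any (hypothetical helper)."""
    return KERNEL_REGIONS.get(address >> 60, "not in a listed kernel region")


def describe_fault(dar, dsisr):
    """Summarise a data access fault from its DAR and DSISR values."""
    return {
        "store": bool(dsisr & DSISR_ISSTORE),
        "no_translation": bool(dsisr & DSISR_NOHPTE),
        "region": region_of(dar),
    }


if __name__ == "__main__":
    # dar and dsisr from the xmon 'e' output.
    print(describe_fault(0x42000000000C58D0, 0x42000000))
    # For contrast: sp and R06 from the same dump.
    print(region_of(0xC0000001FA8E79B0))
    print(region_of(0xF0000000003BC9A0))
```

Under these assumptions the fault is a store to an address with no translation, and the faulting address does not fall in any of the listed kernel regions, whereas the stack pointer and R06 do.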

== Comment: #2 - Santhosh G <email address hidden> - 2016-09-27 04:28:02 ==
Something similar to this issue is observed when the mm tests in LTP are run.

Call Traces Output:
oom01 0 TINFO [ 2577.866629] Unable to handle kernel paging request for data at address 0x42000000004311d0
[ 2577.866759] Faulting instruction address: 0xc00000000027b410
[ 2577.866846] Oops: Kernel access of bad area, sig: 11 [#1]
[ 2577.866911] SMP NR_CPUS=2048 NUMA pSeries
[ 2577.866980] Modules linked in: vmx_crypto ip_tables x_tables autofs4 ibmvscsi crc32c_vpmsum
[ 2577.867152] CPU: 119 PID: 116856 Comm: oom01 Not tainted 4.8.0-17-generic #19-Ubuntu
[ 2577.867252] task: c000000db5d56000 task.stack: c00000031a898000
[ 2577.867334] NIP: c00000000027b410 LR: c00000000027b3f4 CTR: 0000000000000006
[ 2577.867433] REGS: c00000031a89b3e0 TRAP: 0300 Not tainted (4.8.0-17-generic)
[ 2577.867531] MSR: 800000010280b033 <SF,VEC,VSX,EE,FP,ME,IR,DR,RI,LE,TM[E]> CR: 28422222 XER: 20000000
[ 2577.867864] CFAR: c0000000000b477c DAR: 42000000004311d0 DSISR: 42000000 SOFTE: 0
GPR00: c00000000027b3f4 c00000031a89b660 c0000000014e5e00 0000000000000001
GPR04: 0000000000000005 0000000000000000 f000000000252960 0000000de7db0000
GPR08: 000000000000007d 0000000000000000 c000000001034e86 0000000000000000
GPR12: 0000000000002200 c00000000fbc2f00 f000000001ec8000 f000000001ec8020
GPR16: c000000defb93e00 0000000000000111 42000000004376d0 c000000defb8a300
GPR20: c00000031a89b690 c000000dee0a4800 0000000000000001 0000000000000005
GPR24: 0000000000023657 0000000000000100 0000000000000200 c00000031a89b680
GPR28: c00000031a89ba00 0000000000000040 c000000dee0a4800 c00000031a89b6a0
[ 2577.869185] NIP [c00000000027b410] shrink_active_list+0x300/0x4d0
[ 2577.869268] LR [c00000000027b3f4] shrink_active_list+0x2e4/0x4d0
[ 2577.869349] Call Trace:
[ 2577.869385] [c00000031a89b660] [c00000000027b3f4] shrink_active_list+0x2e4/0x4d0 (unreliable)
[ 2577.869518] [c00000031a89b750] [c00000000027bc70] shrink_node_memcg+0x690/0x800
[ 2577.869633] [c00000031a89b870] [c00000000027bf0c] shrink_node+0x12c/0x3f0
[ 2577.869733] [c00000031a89b930] [c00000000027c308] do_try_to_free_pages+0x138/0x480
[ 2577.869849] [c00000031a89b9e0] [c00000000027c74c] try_to_free_pages+0xfc/0x270
[ 2577.869963] [c00000031a89ba70] [c000000000264afc] __alloc_pages_nodemask+0x72c/0xee0
[ 2577.870081] [c00000031a89bc30] [c0000000002e1758] alloc_pages_vma+0x108/0x360
[ 2577.870181] [c00000031a89bcc0] [c0000000002ac5d4] handle_mm_fault+0x1024/0x14e0
[ 2577.870299] [c00000031a89bd80] [c000000000b90d50] do_page_fault+0x350/0x7d0
[ 2577.870435] [c00000031a89be30] [c000000000008948] handle_page_fault+0x10/0x30
[ 2577.870532] Instruction dump:
[ 2577.870578] 4bffbc19 7cb100d0 7ee4bb78 7e639b78 4800dbf9 60000000 892d023c 2f890000
[ 2577.870716] 409e01a4 7c2004ac 39200000 38600001 <91329b00> 4bd99b85 60000000 7fe3fb78
[ 2577.870845] ---[ end trace b2b062e289b7708f ]---
[ 2577.873701]
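
The word in angle brackets in the instruction dump, `<91329b00>`, is the faulting instruction. The sketch below decodes it as a D-form instruction and recomputes the effective address from GPR18 in the register dump; the helper name `decode_dform` is hypothetical, and the field layout (6-bit opcode, 5-bit RS, 5-bit RA, signed 16-bit displacement) follows the Power ISA D-form encoding as I understand it.

```python
STW_OPCODE = 36  # primary opcode of 'stw' (store word) in the Power ISA


def decode_dform(word):
    """Split a 32-bit D-form instruction into (opcode, rs, ra, displacement)."""
    opcode = word >> 26
    rs = (word >> 21) & 0x1F
    ra = (word >> 16) & 0x1F
    disp = word & 0xFFFF
    if disp & 0x8000:          # sign-extend the 16-bit displacement
        disp -= 0x10000
    return opcode, rs, ra, disp


def effective_address(base, disp):
    """Compute base + displacement as a 64-bit address (valid when RA != 0)."""
    return (base + disp) & 0xFFFFFFFFFFFFFFFF


if __name__ == "__main__":
    opcode, rs, ra, disp = decode_dform(0x91329B00)
    print(f"opcode={opcode} rs=r{rs} ra=r{ra} disp={disp}")

    # GPR18 from the oom01 oops and R18 from the xmon register dump.
    for r18 in (0x42000000004376D0, 0x42000000000CBDD0):
        print(hex(effective_address(r18, disp)))
```

The word decodes to `stw r9, -25856(r18)`, and adding the displacement to the two logged R18 values gives 0x42000000004311d0 and 0x42000000000c58d0, which match the DAR values in the oom01 oops and the xmon dump. In both crashes the bad address therefore comes from the value held in r18.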

== Comment: #3 - Chandan Kumar <email address hidden> - 2016-09-27 05:18:41 ==

== Comment: #13 - Laurent Dufour <email address hidden> - 2016-10-04 11:51:59 ==

== Comment: #14 - Laurent Dufour <email address hidden> - 2016-10-05 04:18:52 ==

== Comment: #15 - Laurent Dufour <email address hidden> - 2016-10-05 05:12:41 ==

== Comment: #17 - Luciano Chavez <email address hidden> - 2016-10-05 15:40:06 ==

== Comment: #22 - Richard M. Scheller <email address hidden> - 2016-10-06 22:21:26 ==
(In reply to comment #21)
> Patched ubuntu kernel packages based on 4.8.0-19.21 are available here:
> http://www.lab.toulouse-stg.fr.ibm.com/~laurent/BZ146511/
>
> laurent@test1:~$ uname -v
> #21+bz146511 SMP Thu Oct 6 16:37:38 CEST 2016
>
> Please give a try.

I have run with this patched kernel on four guests on my Ubuntu 16.10 KVM host. Three of my guests are NOT backed by huge pages. The fourth guest is backed by huge pages. All four of these guests have PCI passthrough adapters.

All four of these guests crashed and rebooted within a few hours with out-of-memory errors, both with the standard Ubuntu 4.8.0-19 kernel and with this patched kernel.

There are five other guests on the same host system which do not have PCI passthrough adapters. None of these guests are reproducing the out-of-memory errors, despite running the same test suites.

bugproxy (bugproxy) wrote : xmon log

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-146776 severity-critical targetmilestone-inin---
bugproxy (bugproxy) wrote : XMON logs from PowerVM lpar

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → linux (Ubuntu)
bugproxy (bugproxy)
tags: added: targetmilestone-inin1610
removed: targetmilestone-inin---
Andrew Cloke (andrew-cloke) wrote :

Question to IBM: have you made any progress towards identifying a patch to address this issue?

Changed in linux (Ubuntu):
status: New → Incomplete
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

Closed as a duplicate of LP1628976

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-02-27 13:11 EDT-------
(In reply to comment #33)
> Question to IBM: have you made any progress towards identifying a patch to
> address this issue?

Yep fixed by https://patchwork.kernel.org/patch/9364805/

Michael Hohnbaum (hohnbaum) wrote : Re: [Bug 1632458] Comment bridged from LTC Bugzilla

Leann,

This one is now ready for the Kernel Team to evaluate.

                     Michael

On 02/27/2017 10:19 AM, bugproxy wrote:
> ------- Comment From <email address hidden> 2017-02-27 13:11 EDT-------
> (In reply to comment #33)
>> Question to IBM: have you made any progress towards identifying a patch to
>> address this issue?
> Yep fixed by https://patchwork.kernel.org/patch/9364805/
>

--
Michael Hohnbaum
OIL Program Manager
Power (ppc64el) Development Project Manager
Canonical, Ltd.

Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
status: Incomplete → Triaged
Seth Forshee (sforshee) wrote :

Comment #4 says that this is a duplicate of LP #1628976, and that appears to be true. The fix identified here was applied for that bug and was released several months ago.

Closing this bug as a duplicate of that one.
