ISST-LTE: Ubuntu 16.04 LPAR has Kernel panic when running base, IO, NFS, TCP tests together

Bug #1546442 reported by bugproxy
This bug affects 1 person
Affects: linux (Ubuntu)
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

== Comment: #0 - YUECHANG E. MEI - 2016-02-10 16:58:26 ==
---Problem Description---
Our LPAR, conelp1, has Ubuntu 16.04 installed and uses a Houston adapter for network and SAN disks.

When we ran the base, IO, and TCP tests on conelp1, the LPAR either stopped the tests on its own or hung after the tests had run for 10+ hours.

I then increased min_free_kbytes and started the base, IO, NFS, and TCP tests. The kernel panicked after the ST tests had run for about an hour.

root@conelp1:~# echo 365536 > /proc/sys/vm/min_free_kbytes
root@conelp1:~# cat /proc/sys/vm/min_free_kbytes
365536
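
For repeated test runs, the same adjustment can be scripted. The following is a minimal Python sketch, assuming it runs as root; the helper name `set_min_free_kbytes` and its `path` parameter are illustrative and were not part of the original test setup.

```python
from pathlib import Path

# Standard procfs location of the tunable used in this report.
MIN_FREE_KBYTES = Path("/proc/sys/vm/min_free_kbytes")


def set_min_free_kbytes(value_kb: int, path: Path = MIN_FREE_KBYTES) -> int:
    """Write a new min_free_kbytes value and return what is read back.

    The path is a parameter (an assumption made for this sketch) so the
    helper can be exercised against an ordinary file without root access.
    """
    if value_kb <= 0:
        raise ValueError("min_free_kbytes must be a positive integer")
    path.write_text(f"{value_kb}\n")
    return int(path.read_text().strip())


if __name__ == "__main__":
    # Equivalent to: echo 365536 > /proc/sys/vm/min_free_kbytes; cat ...
    print(set_min_free_kbytes(365536))
```

Reading the value back after the write mirrors the `cat` check above and confirms that the kernel accepted the setting.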

 root@conelp1:~# [ 1682.274535] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b
[ 1682.274535]
[ 1682.274567] CPU: 11 PID: 1 Comm: systemd Not tainted 4.4.0-2-generic #16-Ubuntu
[ 1682.274573] Call Trace:
[ 1682.274583] [c00000017ad83a60] [c000000000ad5ec0] dump_stack+0x90/0xbc (unreliable)
[ 1682.274593] [c00000017ad83a90] [c000000000ad2140] panic+0x100/0x2c8
[ 1682.274601] [c00000017ad83b20] [c0000000000bce18] do_exit+0xbe8/0xbf0
[ 1682.274608] [c00000017ad83be0] [c0000000000bcf04] do_group_exit+0x64/0x100
[ 1682.274616] [c00000017ad83c20] [c0000000000ce23c] get_signal+0x52c/0x770
[ 1682.274626] [c00000017ad83d10] [c0000000000173d4] do_signal+0x54/0x2b0
[ 1682.274633] [c00000017ad83e00] [c00000000001782c] do_notify_resume+0xbc/0xd0
[ 1682.274641] [c00000017ad83e30] [c000000000009838] ret_from_except_lite+0x64/0x68
[ 1682.284440] ---[ end Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b
[ 1682.284440]

I tried to collect a dump the first time it crashed, but the attempt failed. conelp1 is now in xmon.
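
The exitcode value in the panic line above can be decoded by hand. The sketch below assumes the value follows the usual wait-status layout for a process killed by a signal (low seven bits hold the signal number, bit 0x80 is the core-dump flag); the function name `decode_exit_code` is illustrative.

```python
import signal
from typing import Tuple


def decode_exit_code(exitcode: int) -> Tuple[int, bool]:
    """Split an exit code into (signal number, core-dump flag).

    Assumes the wait-status layout for a signal death: the low 7 bits
    are the signal number and bit 0x80 marks a core dump.
    """
    signo = exitcode & 0x7F
    core_dumped = bool(exitcode & 0x80)
    return signo, core_dumped


if __name__ == "__main__":
    signo, core_dumped = decode_exit_code(0x0000008B)
    # On Linux, signal 11 is SIGSEGV.
    print(signo, signal.Signals(signo).name, core_dumped)
```

Under that assumption, 0x8b decodes to signal 11 with the core-dump flag set, which suggests that init (systemd) died from a segmentation fault rather than being killed externally.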

---uname output---
4.4.0-2-generic

Machine Type = EUH Alpine 8408-E8E

---Debugger Data---

16:mon> r
R00 = c0000000000439a8 R16 = c000000000f69c28
R01 = c00000017ff27cf0 R17 = 0000000000000001
R02 = c000000001583800 R18 = c00000017adfc000
R03 = 1600000000000000 R19 = 0000000000000008
R04 = 0000000000000000 R20 = c00000017adfc080
R05 = 0000000000000000 R21 = 0000000000000001
R06 = 0000000000000000 R22 = 0000000000000002
R07 = 0000000000000000 R23 = 0000000000000010
R08 = c0000000015beaa8 R24 = c00000017e2c0800
R09 = 0000000000000000 R25 = 0000000000000000
R10 = 0000000000000000 R26 = 0000000000000000
R11 = 0000000000000002 R27 = 0000000000000001
R12 = c000000000043980 R28 = 0000000000000000
R13 = c000000007afd100 R29 = c000000000043980
R14 = c000000000b03a18 R30 = 0000000000000000
R15 = c000000000f7d670 R31 = 0000000000000000
pc = b8390400000000c0
cfar= 000000000000011c
lr = c0000000000439a8 stop_this_cpu+0x28/0x40
msr = 1000000000000080 cr = 28002882
ctr = c000000000043980 xer = 0000000000000000 trap = 100
16:mon> e
cpu 0x16: Vector: 100 (System Reset) at [c00000017ff27a70]
    pc: b8390400000000c0
    lr: c0000000000439a8: stop_this_cpu+0x28/0x40
    sp: c00000017ff27cf0
   msr: 1000000000000080
  current = 0xc00000017ad69370
  paca = 0xc000000007afd100 softe: 0 irq_happened: 0x0d
    pid = 0, comm = swapper/22
16:mon> d
0000000000000000 **************** **************** | |
16:mon> t
[c00000017ff27cf0] c0000000000439a8 stop_this_cpu+0x28/0x40 (unreliable)
[c00000017ff27d10] c000000000168510 flush_smp_call_function_queue+0x120/0x1d0
[c00000017ff27d90] c000000000044568 smp_ipi_demux+0x98/0x100
[c00000017ff27dd0] c00000000006c494 icp_hv_ipi_action+0x64/0xd0
[c00000017ff27e40] c00000000012f370 handle_irq_event_percpu+0x90/0x2b0
[c00000017ff27f00] c0000000001351e4 handle_percpu_irq+0x84/0xd0
[c00000017ff27f30] c00000000012e564 generic_handle_irq+0x54/0x80
[c00000017ff27f60] c000000000011300 __do_irq+0x80/0x190
[c00000017ff27f90] c000000000024760 call_do_irq+0x14/0x24
[c00000017adff9f0] c0000000000114a8 do_IRQ+0x98/0x140
[c00000017adffa40] c000000000002594 hardware_interrupt_common+0x114/0x180
--- Exception: 501 (Hardware Interrupt) at c000000000085d5c plpar_hcall_norets+0x1c/0x28
[link register ] c0000000008f61e4 check_and_cede_processor+0x34/0x50
[c00000017adffd30] c0000000008f61d0 check_and_cede_processor+0x20/0x50 (unreliable)
[c00000017adffd90] c0000000008f6274 dedicated_cede_loop+0x74/0x190
[c00000017adffdd0] c0000000008f3460 cpuidle_enter_state+0x160/0x3c0
[c00000017adffe30] c000000000118c18 call_cpuidle+0x78/0xd0
[c00000017adffe70] c000000000118fac cpu_startup_entry+0x33c/0x450
[c00000017adfff30] c00000000004559c start_secondary+0x33c/0x360
[c00000017adfff90] c000000000008b6c start_secondary_prolog+0x10/0x14

---Steps to Reproduce---
1. Install Ubuntu 16.04 in an LPAR (using SAN disks) with multipath disabled (because of bug 136777).
2. Mount the kte folder and set up the ST (base, IO, NFS, and TCP) tests.
3. Increase min_free_kbytes to 365536 (echo 365536 > /proc/sys/vm/min_free_kbytes).
4. Start the ST (base, IO, NFS, and TCP) tests.
5. Leave the console open; the LPAR crashes after about an hour.

Stack trace output:
no

Oops output:
 no

System Dump Info:
  The system was configured to capture a dump, however a dump was not produced.

== Comment: #3 - PAWAN K. SINGH - 2016-02-11 09:47:09 ==
As per the log description, this looks like an energy management issue:

[c00000017adffa40] c000000000002594 hardware_interrupt_common+0x114/0x180
--- Exception: 501 (Hardware Interrupt) at c000000000085d5c plpar_hcall_norets+0x1c/0x28
[link register ] c0000000008f61e4 check_and_cede_processor+0x34/0x50
[c00000017adffd30] c0000000008f61d0 check_and_cede_processor+0x20/0x50 (unreliable)

Adding Shreyas and Shilpa for further assistance.

Thanks,

== Comment: #6 - YUECHANG E. MEI - 2016-02-12 18:25:54 ==

We are still trying to recreate the issue. However, the ST tests running on the LPARs (conelp1 and conelp2) were stopped by the LPARs, and the same messages were printed before the tests stopped. I am not sure whether this is related to the kernel panic problem; please advise if we need to open a new bug. Thank you!

....
[Fri Feb 12 16:59:03 2016] systemd[1]: Starting NFS server and services...
[Fri Feb 12 16:59:03 2016] systemd[1]: Started D-Bus System Message Bus.
[Fri Feb 12 16:59:03 2016] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.PolicyKit1': Device or resource busy
[Fri Feb 12 16:59:03 2016] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.timedate1': Device or resource busy
[Fri Feb 12 16:59:03 2016] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.locale1': Device or resource busy
[Fri Feb 12 16:59:03 2016] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.Accounts': Device or resource busy
[Fri Feb 12 16:59:03 2016] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.login1': Device or resource busy
[Fri Feb 12 16:59:03 2016] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.hostname1': Device or resource busy

(For the full messages, please see the attached file.)

== Comment: #7 - Manjunatha H R - 2016-02-15 01:38:17 ==
The same issue is seen on roselp1: the LPAR crashed and dropped into xmon.

d:mon> t
[link register ] c000000000ad2300 panic+0x2c0/0x2c8
[c000000276003a90] c000000000ad22b8 panic+0x278/0x2c8 (unreliable)
[c000000276003b20] c0000000000bce18 do_exit+0xbe8/0xbf0
[c000000276003be0] c0000000000bcf04 do_group_exit+0x64/0x100
[c000000276003c20] c0000000000ce23c get_signal+0x52c/0x770
[c000000276003d10] c0000000000173d4 do_signal+0x54/0x2b0
[c000000276003e00] c00000000001782c do_notify_resume+0xbc/0xd0
[c000000276003e30] c000000000009838 ret_from_except_lite+0x64/0x68
--- Exception: 0 at 0101010101010100

d:mon> e
cpu 0xd: Vector: 100 (System Reset) at [c000000276003810]
    pc: 04f00100000000c0
    lr: c000000000ad2300: panic+0x2c0/0x2c8
    sp: c000000276003a90
   msr: 1000000200000080
  current = 0xc000000277f40000
  paca = 0xc000000007af7b80 softe: 0 irq_happened: 0x01
    pid = 1, comm = systemd
d:mon>

== Comment: #10 - Shreyas B. Prabhu - 2016-02-15 02:11:26 ==
(In reply to comment #3)
> As per the log description it looks like energy management issue like:-
>
> [c00000017adffa40] c000000000002594 hardware_interrupt_common+0x114/0x180
> --- Exception: 501 (Hardware Interrupt) at c000000000085d5c
> plpar_hcall_norets+0x1c/0x28
> [link register ] c0000000008f61e4 check_and_cede_processor+0x34/0x50
> [c00000017adffd30] c0000000008f61d0 check_and_cede_processor+0x20/0x50
> (unreliable)
>
> adding Shreyas and Shilpa for further assistance
>
> Thanks ,

From the logs reported by Yuechang E. Mei and Manjunatha, the crash seems to be due to unhandled exceptions in systemd that lead to a systemd crash (note "Attempted to kill init!" in the logs). Since the LPARs were configured to enter xmon upon a crash, all the other CPUs would have received interrupts to bring them into xmon. Any CPU that was idle when it received this interrupt will have cpuidle functions in its call stack.
So it's unlikely that this bug is related to cpuidle; it is more likely a systemd bug.
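
The distinction drawn here (idle CPUs show cpuidle frames, the panicking CPU does not) can be checked mechanically across all xmon backtraces. This is a rough Python sketch; the marker list and the function name `is_idle_stack` are assumptions chosen for illustration, using frame names that appear in the traces in this report.

```python
# Function names that appear in the idle-CPU backtraces in this report.
IDLE_MARKERS = (
    "check_and_cede_processor",
    "dedicated_cede_loop",
    "cpuidle_enter_state",
)


def is_idle_stack(backtrace: str) -> bool:
    """Return True if an xmon backtrace contains any cpuidle frame."""
    return any(marker in backtrace for marker in IDLE_MARKERS)


# Frames copied from the xmon output earlier in this report.
CPU_0X16 = """\
[c00000017ff27cf0] c0000000000439a8 stop_this_cpu+0x28/0x40 (unreliable)
[c00000017adffd90] c0000000008f6274 dedicated_cede_loop+0x74/0x190
[c00000017adffdd0] c0000000008f3460 cpuidle_enter_state+0x160/0x3c0
"""
CPU_0XD = """\
[c000000276003a90] c000000000ad22b8 panic+0x278/0x2c8 (unreliable)
[c000000276003b20] c0000000000bce18 do_exit+0xbe8/0xbf0
[c000000276003be0] c0000000000bcf04 do_group_exit+0x64/0x100
"""

if __name__ == "__main__":
    for name, trace in (("cpu 0x16", CPU_0X16), ("cpu 0xd", CPU_0XD)):
        print(name, "idle" if is_idle_stack(trace) else "active")
```

A CPU flagged as idle by this check was most likely just a bystander when the system reset arrived, so its stack says little about the root cause.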

== Comment: #11 - Manjunatha H R - 2016-02-15 02:43:51 ==
It looks like this issue needs to be addressed by a systemd expert. LPARs that hit this bug show one of two symptoms: a crash, or a test abort (all systemd services restart).

1. In case of a crash, the trace shows:
---------------------------------
[ 1682.274567] CPU: 11 PID: 1 Comm: systemd Not tainted 4.4.0-2-generic #16-Ubuntu
[ 1682.274573] Call Trace:

2. In case of a test abort, dmesg shows:
-------------------------
[220784.781637] systemd[1]: systemd-journald.service: Failed with result 'signal'.
[220784.791216] systemd[1]: lvm2-lvmetad.service: Main process exited, code=killed, status=9/KILL
[220784.791684] systemd[1]: lvm2-lvmetad.service: Unit entered failed state.
[220784.791730] systemd[1]: lvm2-lvmetad.service: Failed with result 'signal'.
[220784.791856] systemd[1]: systemd-udevd.service: Main process exited, code=killed, status=9/KILL
[220784.792159] systemd[1]: systemd-udevd.service: Unit entered failed state.
[220784.792181] systemd[1]: systemd-udevd.service: Failed with result 'signal'.
[220784.792275] systemd[1]: accounts-daemon.service: Main process exited, code=killed, status=9/KILL
[220784.792610] systemd[1]: accounts-daemon.service: Unit entered failed state.
[220784.792624] systemd[1]: accounts-daemon.service: Failed with result 'signal'.
[220784.792750] systemd[1]: dbus.service: Main process exited, code=killed, status=9/KILL
[220784.830737] systemd[1]: systemd-logind.service: Main process exited, code=killed, status=9/KILL
[220784.831167] systemd[1]: systemd-logind.service: Unit entered failed state.
[220784.831223] systemd[1]: systemd-logind.service: Failed with result 'signal'.
[220784.831342] systemd[1]: iprdump.service: Main process exited, code=killed, status=9/KILL
[220784.831749] systemd[1]: iprdump.service: Unit entered failed state.
[220784.831774] systemd[1]: iprdump.service: Failed with result 'signal'.
[220784.831863] systemd[1]: cron.service: Main process exited, code=killed, status=9/KILL
[220784.832216] systemd[1]: cron.service: Unit entered failed state.
[220784.832230] systemd[1]: cron.service: Failed with result 'signal'.
[220784.832305] systemd[1]: rtas_errd.service: Main process exited, code=killed, status=9/KILL
[220784.832759] systemd[1]: rtas_errd.service: Unit entered failed state.
[220784.832797] systemd[1]: rtas_errd.service: Failed with result 'signal'.
[220784.832910] systemd[1]: polkitd.service: Main process exited, code=killed, status=9/KILL
[220784.833269] systemd[1]: polkitd.service: Unit entered failed state.
[220784.833284] systemd[1]: polkitd.service: Failed with result 'signal'.
[220784.833400] systemd[1]: ssh.service: Main process exited, code=killed, status=9/KILL
[220784.833660] systemd[1]: ssh.service: Unit entered failed state.
[220784.833673] systemd[1]: ssh.service: Failed with result 'signal'.
[220784.833757] systemd[1]: iprupdate.service: Main process exited, code=killed, status=9/KILL
[220784.834088] systemd[1]: iprupdate.service: Unit entered failed state.
[220784.834102] systemd[1]: iprupdate.service: Failed with result 'signal'.
[220784.834173] systemd[1]: iprinit.service: Main process exited, code=killed, status=9/KILL
[220784.834503] systemd[1]: iprinit.service: Unit entered failed state.
[220784.834516] systemd[1]: iprinit.service: Failed with result 'signal'.
[220784.834647] systemd[1]: postgresql@9.5-main.service: Main process exited, code=killed, status=9/KILL
[220784.842900] systemd[1]: staf.service: Main process exited, code=killed, status=9/KILL
[220784.844557] systemd[1]: rsyslog.service: Main process exited, code=killed, status=9/KILL
[220784.844903] systemd[1]: rsyslog.service: Unit entered failed state.
[220784.844930] systemd[1]: rsyslog.service: Failed with result 'signal'.
[220784.852532] systemd[1]: dbus.service: Unit entered failed state.
[220784.852565] systemd[1]: dbus.service: Failed with result 'signal'.
[220784.852987] systemd[1]: staf.service: Unit entered failed state.
[220784.853005] systemd[1]: staf.service: Failed with result 'signal'.
[220784.855176] systemd[1]: <email address hidden>: Service has no hold-off time, scheduling restart.
[220784.855473] systemd[1]: systemd-journald.service: Service has no hold-off time, scheduling restart.
[220784.855504] systemd[1]: systemd-logind.service: Service has no hold-off time, scheduling restart.
[220784.856557] systemd[1]: systemd-udevd.service: Service has no hold-off time, scheduling restart.
[220784.893452] systemd[1]: lvm2-lvmetad.service: Service hold-off time over, scheduling restart.
[220784.932383] systemd[1]: Stopped LVM2 metadata daemon.
[220784.945979] systemd[1]: Started LVM2 metadata daemon.
[220784.946532] systemd[1]: Stopped udev Kernel Device Manager.
[220784.948518] systemd[1]: Starting udev Kernel Device Manager...
[220784.948724] systemd[1]: Stopped Login Service.
[220784.951208] systemd[1]: Starting Login Service...
[220784.951649] systemd[1]: Stopped Flush Journal to Persistent Storage.
[220784.951690] systemd[1]: Stopping Flush Journal to Persistent Storage...
[220784.951872] systemd[1]: Stopped Journal Service.
[220784.952634] systemd[1]: Starting Journal Service...
[220784.952734] systemd[1]: Stopped Getty on tty1.
[220784.954067] systemd[1]: Started Getty on tty1.
[220784.960680] systemd[1]: rsyslog.service: Service hold-off time over, scheduling restart.
[220784.960951] systemd[1]: <email address hidden>: Service hold-off time over, scheduling restart.
[220784.961186] systemd[1]: ssh.service: Service hold-off time over, scheduling restart.
[220784.972830] systemd-journald[162446]: File /run/log/journal/01ce52241c3846edade41f23178de962/system.journal corrupted or uncleanly shut down, renaming and replacing.
[220784.975464] systemd[1]: Started D-Bus System Message Bus.
[220784.980448] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.login1': Device or resource busy
[220784.980457] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.Accounts': Device or resource busy
[220784.980463] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.PolicyKit1': Device or resource busy
[220784.980470] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.timedate1': Device or resource busy
[220784.980476] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.locale1': Device or resource busy
[220784.980482] systemd[1]: Failed to subscribe to NameOwnerChanged signal for 'org.freedesktop.hostname1': Device or resource busy
[220785.011703] systemd[1]: Stopped OpenBSD Secure Shell server.
[220785.013617] systemd[1]: Starting OpenBSD Secure Shell server...
[220785.013748] systemd[1]: Stopped Serial Getty on hvc0.
[220785.014853] systemd[1]: Started Serial Getty on hvc0.
[220785.015193] systemd[1]: Stopped System Logging Service.
[220785.020295] systemd[1]: Starting System Logging Service...
[220785.020555] systemd[1]: Started Journal Service.
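
A log of this size is easier to compare across LPARs when reduced to the list of units whose main process was killed. The following Python sketch does that with a regular expression modeled on the systemd messages above; the names `KILLED_RE` and `killed_units` are illustrative.

```python
import re

# Matches lines such as:
#   systemd[1]: dbus.service: Main process exited, code=killed, status=9/KILL
KILLED_RE = re.compile(
    r"systemd\[1\]: (?P<unit>\S+): Main process exited, "
    r"code=killed, status=(?P<signo>\d+)/(?P<signame>\w+)"
)


def killed_units(dmesg_text: str):
    """Return (unit, signal number) pairs for every killed main process."""
    return [(m["unit"], int(m["signo"])) for m in KILLED_RE.finditer(dmesg_text)]


# A few lines copied from the dmesg excerpt above.
SAMPLE = """\
[220784.791216] systemd[1]: lvm2-lvmetad.service: Main process exited, code=killed, status=9/KILL
[220784.791684] systemd[1]: lvm2-lvmetad.service: Unit entered failed state.
[220784.791856] systemd[1]: systemd-udevd.service: Main process exited, code=killed, status=9/KILL
[220784.792750] systemd[1]: dbus.service: Main process exited, code=killed, status=9/KILL
"""

if __name__ == "__main__":
    for unit, signo in killed_units(SAMPLE):
        print(f"{unit}: killed by signal {signo}")
```

In the excerpt every affected service reports status=9/KILL within a few milliseconds, which points to a single event killing all of them at once rather than independent failures.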

LPARs affected:
--------------------
1. Crashed LPARs so far: conelp1, roselp1, pinelp3

2. LPARs with aborted tests so far: conelp2, roselp2, pinelp3, and conelp1

roselp1 is left for debugging this issue. Also, please let us know if a separate bug is required to handle the crash and test abort cases separately.

Thanks,
Manju

== Comment: #13 - Ping Tian Han - 2016-02-16 01:48:37 ==
pinelp3 hit this problem again with 4.4.0-4-generic. systemd was killed and all ST tests stopped.

== Comment: #16 - Ping Tian Han - 2016-02-17 02:03:57 ==
systemd was killed after running stress tests for about 5 hours on pinelp1. Kernel version: 4.4.0-4-generic.

== Comment: #17 - PAWAN K. SINGH - 2016-02-17 02:15:08 ==
From the previous comments, this looks like a systemd issue, so we are mirroring the bug to make the distro aware of it.

Thanks,

Revision history for this message
bugproxy (bugproxy) wrote : conelp2_dmesg_output

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-136929 severity-critical targetmilestone-inin---
Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
Revision history for this message
Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1546442/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-02-17 15:44 EDT-------
top - 14:37:22 up 1:00, 1 user, load average: 0.65, 1.00, 0.91
Tasks: 545 total, 2 running, 543 sleeping, 0 stopped, 0 zombie
%Cpu(s): 0.0 us, 0.3 sy, 0.0 ni, 94.0 id, 5.6 wa, 0.0 hi, 0.0 si, 0.0 st
KiB Mem : 10310272 total, 39680 free, 9790784 used, 479808 buff/cache
KiB Swap: 4204416 total, 1902848 free, 2301568 used. 85696 avail Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31218 root 20 0 11.518g 8.976g 1216 R 11.2 91.3 0:01.93 swapping01 <<<<<<<<<<<<<<<<<<<<<<<<<<<<<
5061 root 20 0 12736 3264 2496 R 0.6 0.0 0:06.02 top
1 root 20 0 11840 2304 2304 S 0.0 0.0 0:02.70 systemd
2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd
3 root 20 0 0 0 0 S 0.0 0.0 0:00.02 ksoftirqd/0
4 root 20 0 0 0 0 S 0.0 0.0 0:00.02 kworker/0:0
5 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/0:+
7 root 20 0 0 0 0 S 0.0 0.0 0:00.58 rcu_sched
8 root 20 0 0 0 0 S 0.0 0.0 0:00.00 rcu_bh
9 root rt 0 0 0 0 S 0.0 0.0 0:00.00 migration/0
10 root rt 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/0
11 root rt 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/1
12 root rt 0 0 0 0 S 0.0 0.0 0:00.00 migration/1
13 root 20 0 0 0 0 S 0.0 0.0 0:00.00 ksoftirqd/1
15 root 0 -20 0 0 0 S 0.0 0.0 0:00.00 kworker/1:+
16 root rt 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/2
17 root rt 0 0 0 0 S 0.0 0.0 0:00.00 migration/2
[ 3646.092945] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b
[ 3646.092945]
[ 3646.092983] CPU: 3 PID: 1 Comm: systemd Not tainted 4.4.0-4-generic #19-Ubuntu
[ 3646.092989] Call Trace:
[ 3646.092997] [c00000027ad83a60] [c000000000ad8aa0] dump_stack+0x90/0xbc (unreliable)
[ 3646.093005] [c00000027ad83a90] [c000000000ad4d20] panic+0x100/0x2c8
[ 3646.093012] [c00000027ad83b20] [c0000000000bce18] do_exit+0xbe8/0xbf0
[ 3646.093019] [c00000027ad83be0] [c0000000000bcf04] do_group_exit+0x64/0x100
[ 3646.093025] [c00000027ad83c20] [c0000000000ce23c] get_signal+0x52c/0x770
[ 3646.093032] [c00000027ad83d10] [c0000000000173c4] do_signal+0x54/0x2b0
[ 3646.093038] [c00000027ad83e00] [c00000000001781c] do_notify_resume+0xbc/0xd0
[ 3646.093045] [c00000027ad83e30] [c000000000009838] ret_from_except_lite+0x64/0x68
[ 3646.093053] Sending IPI to other CPUs
[ 3646.094114] IPI complete

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-02-17 16:06 EDT-------
nvm.. can't reproduce the panic with /testcases/ltp/testcases/bin/swapping01

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-02-17 16:15 EDT-------
conelp1 hit a kernel panic again, but I could not force it to enter xmon this time.

These new error messages (bad frame in setup_rt_frame) were printed when it crashed this time:

[11769.476647] Process 16338(waitpid02) has RLIMIT_CORE set to 1
[11769.476694] Aborting core

root@conelp1:~#
root@conelp1:~# [13095.548735] systemd[1]: unhandled signal 11 at 0101010101009039 nip 00000000551129e4 lr 00000000551a4578 code 30001
[13095.619591] systemd[1]: unhandled signal 11 at 0101010101010100 nip 0101010101010100 lr 000000005512cab0 code 30001
[13095.619627] systemd[1]: unhandled signal 11 at 0101010101010100 nip 0101010101010100 lr 000000005512cab0 code 30001
[13095.619637] systemd[1]: unhandled signal 11 at 0101010101010100 nip 0101010101010100 lr 000000005512cab0 code 30001
[13095.619646] systemd[1]: unhandled signal 11 at 0101010101010100 nip 0101010101010100 lr 000000005512cab0 code 30001
[13095.619655] systemd[1]: unhandled signal 11 at 0101010101010100 nip 0101010101010100 lr 000000005512cab0 code 30001
[13095.619664] systemd[1]: unhandled signal 11 at 0101010101010100 nip 0101010101010100 lr 000000005512cab0 code 30001
[13095.619673] systemd[1]: unhandled signal 11 at 0101010101010100 nip 0101010101010100 lr 000000005512cab0 code 30001
[13095.619682] systemd[1]: unhandled signal 11 at 0101010101010100 nip 0101010101010100 lr 000000005512cab0 code 30001
[13095.619691] systemd[1]: unhandled signal 11 at 0101010101010100 nip 0101010101010100 lr 000000005512cab0 code 30001
[13095.624120] systemd[1]: bad frame in setup_rt_frame: 00003fffc6f3fc70 nip 0101010101010100 lr 000000005512cab0
[13097.032289] Thread-40[9785]: bad frame in setup_rt_frame: 00003fff4e59fb50 nip 0101010101010100 lr 00003fff777c1434
[13101.282166] _exception: 1764 callbacks suppressed
[13101.282196] locktests.sh[17286]: unhandl...
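
The fault and nip values in these messages stand out: 0101010101010100 is the byte 0x01 repeated, which does not look like a valid code address. The Python sketch below parses one such line (with the 80-column console wrap spaces removed) and tests for that pattern. The regular expression and the helper `looks_like_byte_fill` are illustrative assumptions; the check only detects the pattern and says nothing about what wrote it.

```python
import re

# Modeled on the "unhandled signal" console lines in this comment.
SIGNAL_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: unhandled signal (?P<signo>\d+) "
    r"at (?P<addr>[0-9a-f]+) nip (?P<nip>[0-9a-f]+) lr (?P<lr>[0-9a-f]+)"
)


def looks_like_byte_fill(value: int) -> bool:
    """True if the upper seven bytes of a 64-bit value are one repeated
    non-zero byte (the lowest byte is ignored because the low bits of an
    instruction pointer may have been masked off)."""
    data = value.to_bytes(8, "big")
    return data[0] != 0 and len(set(data[:7])) == 1


LINE = (
    "[13095.619591] systemd[1]: unhandled signal 11 at 0101010101010100 "
    "nip 0101010101010100 lr 000000005512cab0 code 30001"
)

if __name__ == "__main__":
    m = SIGNAL_RE.search(LINE)
    nip = int(m["nip"], 16)
    lr = int(m["lr"], 16)
    print(m["comm"], m["pid"], "signal", m["signo"])
    print("nip is a repeated-byte pattern:", looks_like_byte_fill(nip))
    print("lr is a repeated-byte pattern:", looks_like_byte_fill(lr))
```

If the pattern check fires for nip but not for lr, as it does for the lines here, one plausible reading is that saved user-space state was overwritten with a fill pattern before systemd resumed; confirming that would need a dump.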

Revision history for this message
bugproxy (bugproxy) wrote : error messages were printed from conelp1 when it hung

------- Comment (attachment only) From <email address hidden> 2016-02-17 19:08 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment (attachment only) From <email address hidden> 2016-02-17 19:08 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-02-18 16:23 EDT-------
8:mon> t
[c00000009d637b50] 0000000000000054 (unreliable)
[c00000009d637b70] c000000000acd67c _raw_spin_unlock_irqrestore+0x4c/0xb0
[c00000009d637ba0] c0000000000cc700 force_sig_info+0x110/0x140
[c00000009d637bf0] c000000000020d80 _exception+0xd0/0x1e0
[c00000009d637d80] c000000000acfd84 do_page_fault+0x624/0x7f0
[c00000009d637e30] c000000000008664 handle_page_fault+0x10/0x30
--- Exception: 301 (Data Access) at 0000000010141b58
SP (3fffceda9480) is in userspace

8:mon> e
cpu 0x8: Vector: 501 (Hardware Interrupt) at [c00000009d6378d0]
pc: c000000000010a04: arch_local_irq_restore+0x74/0x90
lr: c000000000010a04: arch_local_irq_restore+0x74/0x90
sp: c00000009d637b50
msr: 8000000000009033
current = 0xc000000093449370
paca = 0xc000000007af4c00 softe: 0 irq_happened: 0x01
pid = 25278, comm = ntwk_files01
###/testcases/tcp/ltp/tcp_cmds/ntwk_files/ntwk_files01

*************************************************************************

10:mon> t
[link register ] c000000000304d28 iput+0x88/0x320
[c00000017543bc40] 0000000000000000 (unreliable)
[c00000017543bc90] c00000000031fe0c sync_inodes_sb+0x1bc/0x2b0
[c00000017543bd50] c000000000328248 sync_inodes_one_sb+0x38/0x50
[c00000017543bd80] c0000000002e2be8 iterate_supers+0x1b8/0x200
[c00000017543bdf0] c000000000328828 sys_sync+0x58/0xf0
[c00000017543be30] c000000000009204 system_call+0x38/0xb4
--- Exception: c01 (System Call) at 00003fff99f7d5e8
SP (3fff89ffe6d0) is in userspace

10:mon> e
cpu 0x10: Vector: 501 (Hardware Interrupt) at [c00000017543b9c0]
pc: c00000000056a8b4: _atomic_dec_and_lock+0x14/0xe0
lr: c000000000304d28: iput+0x88/0x320
sp: c00000017543bc40
msr: 8000000100009033
current = 0xc0000000e7093bf0
paca = 0xc000000007af9800 softe: 0 irq_happened: 0x01
pid = 16707, comm = make_tree
#/testcases/ltp/testcases/bin/make_tree
# PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
# 5489 root 20 0 593152 1536 896 S 230.6 0.0 1:39.58 make_tree

*************************************************************************

0:mon> t
[link register ] c0000000008f8cc4 check_and_cede_processor+0x34/0x50
[c000000001593ce0] c0000000008f8cb0 check_and_cede_processor+0x20/0x50 (unreliable)
[c000000001593d40] c0000000008f8d54 dedicated_cede_loop+0x74/0x190
[c000000001593d80] c0000000008f5f40 cpuidle_enter_state+0x160/0x3c0
[c000000001593de0] c000000000118c58 call_cpuidle+0x78/0xd0
[c000000001593e20] c000000000118fec cpu_startup_entry+0x33c/0x450
[c000000001593ee0] c00000000000bdcc rest_init+0xac/0xc0
[c000000001593f00] c000000000e83f5c start_kernel+0x53c/0x558
[c000000001593f90] c000000000008c6c start_here_common+0x20/0xa8

0:mon> e
cpu 0x0: Vector: 100 (System Reset) at [c000000001593a60]
pc: 5c5d0800000000c0
lr: c0000000008f8cc4: check_and_cede_processor+0x34/0x50
sp: c000000001593ce0
msr: 1000000000000080
current = 0xc000000001535d40
paca = 0xc000000007af0000 softe: 0 irq_happened: 0x01
pid = 0, comm = swapper/0

*************************************************************************

1:mon> t
[link register ] c0000000000912cc pseries_mach_cpu_die+0x1ec/0...

Revision history for this message
bugproxy (bugproxy) wrote : make_tree test case

------- Comment on attachment From <email address hidden> 2016-02-18 16:31 EDT-------

Anyone see a memory leak here? Valgrind is not installed.

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-02-19 11:13 EDT-------
Any update here? We have multiple systems hitting this bug and this is impacting our testing.

Revision history for this message
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-02-19 18:14 EDT-------
So, is systemd being killed by the OOM killer?
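
One way to answer this from the attached dmesg files is to search for OOM killer messages. The Python sketch below does so; the two message patterns are written from memory of kernels of this era and should be treated as assumptions, and the OOM lines in the sample text are hypothetical, not taken from this report.

```python
import re

# Assumed OOM killer message fragments for 4.x kernels.
OOM_RE = re.compile(r"invoked oom-killer|Out of memory: Kill process \d+ \([^)]+\)")


def oom_events(dmesg_text: str):
    """Return every dmesg line that looks like an OOM killer message."""
    return [line for line in dmesg_text.splitlines() if OOM_RE.search(line)]


# The first two lines are HYPOTHETICAL examples of OOM output; the last
# line is copied from this report and should not match.
SAMPLE = """\
[ 100.000000] swapping01 invoked oom-killer: gfp_mask=0x0, order=0, oom_score_adj=0
[ 100.000100] Out of memory: Kill process 1234 (swapping01) score 900 or sacrifice child
[220784.792750] systemd[1]: dbus.service: Main process exited, code=killed, status=9/KILL
"""

if __name__ == "__main__":
    events = oom_events(SAMPLE)
    print(f"{len(events)} OOM-related line(s) found")
    for line in events:
        print(line)
```

An empty result on the real logs would argue against the OOM killer as the cause, although messages lost when journald itself was killed could hide such events.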

bugproxy (bugproxy)
tags: removed: bot-comment bugnameltc-136929 severity-critical
Revision history for this message
bugproxy (bugproxy) wrote : sosreport from conelp2 before ST test stop

------- Comment (attachment only) From <email address hidden> 2016-02-24 14:35 EDT-------

tags: added: bugnameltc-136929 severity-critical targetmilestone-inin1604
removed: targetmilestone-inin---
Revision history for this message
bugproxy (bugproxy) wrote : sosreport from conelp2 after ST test stop

------- Comment (attachment only) From <email address hidden> 2016-02-24 14:36 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote : dmesg from conelp2 2/24/2016

------- Comment on attachment From <email address hidden> 2016-02-24 14:43 EDT-------

The ST tests stopped after a 4-hour run on conelp2.

root@conelp2:~# uname -a
Linux conelp2 4.4.0-6-generic #21-Ubuntu SMP Tue Feb 16 20:31:37 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

root@conelp2:~# free -h
              total        used        free      shared  buff/cache   available
Mem:           9.8G        379M        8.0G         54M        1.4G        8.1G
Swap:           58G        167M         57G

Revision history for this message
bugproxy (bugproxy) wrote : dmesg from conelp1 before test started

------- Comment (attachment only) From <email address hidden> 2016-02-25 11:41 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote : sosreport from conelp1 before we start ST test

------- Comment (attachment only) From <email address hidden> 2016-02-25 11:43 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote : cpu stacks from xmon

------- Comment (attachment only) From <email address hidden> 2016-02-25 16:37 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote : sosreport from pinelp3

------- Comment (attachment only) From <email address hidden> 2016-02-25 20:43 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote : console output when pinelp3 systemd services stopped

------- Comment on attachment From <email address hidden> 2016-03-01 21:11 EDT-------

The stress tests ran for about 20 minutes and then all stopped on pinelp3. This is the console output.

Revision history for this message
bugproxy (bugproxy) wrote : sosreport from conelp1 after start ST test for 5 minutes

------- Comment (attachment only) From <email address hidden> 2016-03-02 13:02 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote : sosreport from conelp1 after ST tests were stopped

------- Comment (attachment only) From <email address hidden> 2016-03-02 13:03 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote : sosreport from conelp1 before start ST test kernel test only

------- Comment (attachment only) From <email address hidden> 2016-03-04 14:32 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote : sosreport from conelp1 after ST test stop kernel test only

------- Comment (attachment only) From <email address hidden> 2016-03-04 14:34 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote : dmesg from conelp1 kernel test only

------- Comment on attachment From <email address hidden> 2016-03-04 14:42 EDT-------

In addition, I have increased the memory from 6 GB to 24 GB, and conelp1 still hits this problem.

root@conelp1:~# free -h
              total        used        free      shared  buff/cache   available
Mem:            24G        472M        5.9G         45M         18G         23G
Swap:          1.7G        1.8M        1.7G

For system access info, please refer to above comments.

Revision history for this message
bugproxy (bugproxy) wrote : conelp1 console output

------- Comment (attachment only) From <email address hidden> 2016-03-04 15:45 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote : cpu stacks from xmon

------- Comment (attachment only) From <email address hidden> 2016-02-25 16:37 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote : syslog from pinelp3

------- Comment on attachment From <email address hidden> 2016-03-08 19:05 EDT-------

The tests stopped around Tue Mar 8 02:09:38 2016 (Chicago time).

[Tue Mar 8 02:09:38 2016] systemd[1]: systemd-journald.service: Failed with result 'signal'.
[Tue Mar 8 02:09:38 2016] systemd[1]: lvm2-lvmetad.service: Main process exited, code=killed, status=9/KILL
[Tue Mar 8 02:09:38 2016] systemd[1]: lvm2-lvmetad.service: Unit entered failed state.
[Tue Mar 8 02:09:38 2016] systemd[1]: lvm2-lvmetad.service: Failed with result 'signal'.
[Tue Mar 8 02:09:38 2016] systemd[1]: systemd-udevd.service: Main process exited, code=killed, status=9/KILL
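For anyone triaging a longer log of the same shape, here is a minimal sketch that pulls out the units whose main process was killed by a signal. It assumes the exact line format shown in the excerpt above; the names KILLED_RE and killed_units are hypothetical helpers written for this illustration and are not part of any tool used in this bug.

```python
import re

# Assumption: lines look like the excerpt above, for example
# "[Tue Mar 8 02:09:38 2016] systemd[1]: lvm2-lvmetad.service: Main process exited, code=killed, status=9/KILL"
KILLED_RE = re.compile(
    r"^\[(?P<when>[^\]]+)\]\s+systemd\[1\]:\s+(?P<unit>\S+):\s+"
    r"Main process exited, code=killed, status=(?P<signo>\d+)/(?P<signame>\w+)"
)


def killed_units(lines):
    """Return (timestamp, unit, signal name) for every 'code=killed' line."""
    hits = []
    for line in lines:
        match = KILLED_RE.match(line.strip())
        if match:
            hits.append((match["when"], match["unit"], match["signame"]))
    return hits


if __name__ == "__main__":
    # Two lines copied from the excerpt above; only the second one matches.
    sample = [
        "[Tue Mar 8 02:09:38 2016] systemd[1]: lvm2-lvmetad.service: Failed with result 'signal'.",
        "[Tue Mar 8 02:09:38 2016] systemd[1]: systemd-udevd.service: Main process exited, code=killed, status=9/KILL",
    ]
    for when, unit, signame in killed_units(sample):
        print(when, unit, signame)
```

Lines such as "Failed with result 'signal'" are skipped on purpose, since they do not name the signal; only the "Main process exited" lines carry the status=9/KILL detail.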

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-03-09 13:39 EDT-------
*** Bug 138543 has been marked as a duplicate of this bug. ***

Revision history for this message
bugproxy (bugproxy) wrote : make_tree test case

------- Comment on attachment From <email address hidden> 2016-02-18 16:31 EDT-------

Does anyone see a memory leak here? Valgrind is not installed.

Revision history for this message
bugproxy (bugproxy) wrote : console output conelp1 crashed on 03/11

------- Comment (attachment only) From <email address hidden> 2016-03-11 14:03 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote : sosreport from conelp2 after ST test stop

------- Comment (attachment only) From <email address hidden> 2016-02-24 14:36 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote : dmesg from conelp2 2/24/2016

------- Comment on attachment From <email address hidden> 2016-02-24 14:43 EDT-------

The ST tests stopped after a 4-hour run on conelp2.

root@conelp2:~# uname -a
Linux conelp2 4.4.0-6-generic #21-Ubuntu SMP Tue Feb 16 20:31:37 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

root@conelp2:~# free -h
              total        used        free      shared  buff/cache   available
Mem:           9.8G        379M        8.0G         54M        1.4G        8.1G
Swap:           58G        167M         57G

Revision history for this message
bugproxy (bugproxy) wrote : dmesg from conelp1 before test started

------- Comment (attachment only) From <email address hidden> 2016-02-25 11:41 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote : conelp2 console log 0324

------- Comment (attachment only) From <email address hidden> 2016-03-24 12:20 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote : console output from conelp2

------- Comment (attachment only) From <email address hidden> 2016-03-21 23:09 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote : sosreport from conelp2 before start ST test

------- Comment (attachment only) From <email address hidden> 2016-03-16 12:23 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote : sosreport from conelp2 after ST test stop

------- Comment (attachment only) From <email address hidden> 2016-03-16 12:25 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote : console log from conelp2

------- Comment (attachment only) From <email address hidden> 2016-03-16 12:27 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote : conelp2 crash console log

------- Comment (attachment only) From <email address hidden> 2016-03-17 16:34 EDT-------

Revision history for this message
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-04-04 11:32 EDT-------
Hi Kevin,

Since conelp2 is also hitting bug 136788, can you finish debugging soon? Then I can try the patched kernel and verify it.

Thanks.
Erin

Revision history for this message
bugproxy (bugproxy) wrote : console output from conelp2

Default Comment by Bridge

Revision history for this message
bugproxy (bugproxy) wrote : conelp2 console log 0324

Default Comment by Bridge

Steve Langasek (vorlon)
affects: ubuntu → linux (Ubuntu)
Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → nobody