in Ubuntu16.10: Hit on Call traces and system goes down when transactional memory tests are running in 32TB Brazos system

Bug #1606786 reported by bugproxy
This bug affects 1 person
Affects          Status        Importance  Assigned to  Milestone
linux (Ubuntu)   Fix Released  Undecided   Unassigned
  Xenial         Fix Released  Undecided   Tim Gardner
  Yakkety        Fix Released  Undecided   Unassigned

Bug Description

== Comment: #0 - Praveen K. Pandey <email address hidden> - 2016-07-16 14:21:39 ==
---Problem Description---
In Ubuntu 16.10, call traces and an unrecoverable exception occur when transactional memory tests are executed on a 32TB system.

Steps to reproduce:

1- Download the TM tests from http://ozlabs.au.ibm.com/~mikey/tm-test-le.tar.gz

2- Extract the archive, then build and run the tests: tar -xzf tm-test-le.tar.gz; cd tm-test-le; make clean; make; ./runtests.sh

The system prints the call traces below and goes down (SSH is not possible and the console hangs).

LOG:

Running ./htm_demo
Starting 6144 worker threads, 50000 loops
------------------------------------------------------
------------------------------------------------------
[ 438.064162] CFAR: c000000000008d18 SOFTE: 0
PACATMSCRATCH: 8000000200009033
GPR00: c000000000111028 c00005f82e36f4b0 c0000000015b5d00 0000000000000000
GPR04: c000000000b28863 0000000000000001 c000000001755d00 0000000100008748
GPR08: 00000000000004f0 00000000f5257d14 000001fab9800000 0000000000000005
GPR12: 0000000000000500 c00000000bc9dd00 00003ffcf5fa0000 0000000000000100
GPR16: 0000000000000000 c00001faba7898e8 0000000000000001 c00003f9865bf804
GPR20: 0000000000000003 c0000000001638b0 0000000000000001 000000661a113a10
GPR24: 0000000000000000 c00005f82e9f2060 000001fab9800000 c0000000015eaa60
GPR28: c000000000f9e900 000000000000009e c0000000015eaa60 c00001faba79e900
[ 438.064303] NIP [c0000000000a1ba4] kvmppc_interrupt_hv+0x28/0x15c
[ 438.064315] LR [c000000000111028] trigger_load_balance+0x58/0x340
[ 438.064323] Call Trace:
[ 438.064330] [c00005f82e36f4b0] [c000000000111028] trigger_load_balance+0x58/0x340 (unreliable)
[ 438.064346] [c00005f82e36f4f0] [c0000000000f9ee4] scheduler_tick+0x104/0x180
[ 438.064360] [c00005f82e36f550] [c00000000014c128] update_process_times+0x78/0xa0
[ 438.064378] [c00005f82e36f580] [c000000000163818] tick_sched_handle.isra.6+0x48/0xe0
[ 438.064392] [c00005f82e36f5c0] [c000000000163914] tick_sched_timer+0x64/0xd0
[ 438.064404] [c00005f82e36f600] [c00000000014cbd4] __hrtimer_run_queues+0x124/0x450
[ 438.064418] [c00005f82e36f690] [c00000000014dbfc] hrtimer_interrupt+0xec/0x2c0
[ 438.064431] [c00005f82e36f750] [c00000000001f5bc] __timer_interrupt+0x8c/0x290
[ 438.064444] [c00005f82e36f7a0] [c00000000001f970] timer_interrupt+0xa0/0xe0
[ 438.064456] [c00005f82e36f7d0] [c000000000002714] decrementer_common+0x114/0x180
[ 438.064471] --- interrupt: 901 at _raw_spin_lock_irqsave+0xac/0x130
[ 438.064471] LR = _raw_spin_lock_irqsave+0x9c/0x130
[ 438.064486] [c00005f82e36fac0] [000000000000009e] 0x9e (unreliable)
[ 438.064499] [c00005f82e36fb00] [c00000000010fe0c] load_balance+0x80c/0xa90
[ 438.064511] [c00005f82e36fc40] [c00000000011036c] rebalance_domains+0x2dc/0x3b0
[ 438.064525] [c00005f82e36fcf0] [c0000000000beb98] __do_softirq+0x188/0x3e0
[ 438.064537] [c00005f82e36fde0] [c0000000000bf068] irq_exit+0xc8/0x100
[ 438.064549] [c00005f82e36fe00] [c00000000001f974] timer_interrupt+0xa4/0xe0
[ 438.064561] [c00005f82e36fe30] [c000000000002714] decrementer_common+0x114/0x180
[ 438.064572] Instruction dump:
[ 438.064578] 7c892378 48000160 f92d0850 892d0858 2c090004 418210b4 2c090001 e92d0850
[ 438.064598] 4182f2f8 39200004 992d0858 e92d0860 <f8090c18> f8290c20 f8490c28 f8690c30
[ 438.064621] ---[ end trace a8961db98dfe068b ]---
[ 438.072474]
[ 440.072525] Kernel panic - not syncing: Fatal exception in interrupt
[ 440.073966] Unrecoverable exception 4100 at c0000000000a1ba4
[ 440.073992] Oops: Unrecoverable exception, sig: 6 [#3]
[ 440.074001] SMP NR_CPUS=2048 NUMA pSeries
[ 440.074013] Modules linked in: pseries_rng btrfs xor raid6_pq rtc_generic sunrpc autofs4 ses enclosure ipr
[ 440.074052] CPU: 291 PID: 404588 Comm: htm_demo Tainted: G D L 4.4.0-30-generic #49-Ubuntu
[ 440.074067] task: c00005f82da42a20 ti: c00005f83031c000 task.ti: c00005f83031c000
[ 440.074079] NIP: c0000000000a1ba4 LR: c000000000aedfdc CTR: 0000000000000000
[ 440.074091] REGS: c00005f83031f750 TRAP: 4100 Tainted: G D L (4.4.0-30-generic)
[ 440.074104] MSR: 8000000200001031 <SF,ME,IR,DR,LE> CR: 48004884 XER: 00000000
[ 440.074130] CFAR: c000000000008d18 SOFTE: 1
PACATMSCRATCH: 8000000200009033
GPR00: c000000000118d14 c00005f83031f9d0 c0000000015b5d00 0000000000000001
GPR04: c00005f83031fad8 0000000000000082 0000000000000000 f00000017e1bff80
GPR08: 0000000000000000 00000000eac0c6e6 0000000000000000 c00005f83031c000
GPR12: 0000000000000500 c00000000bcecc80 00003ffe0ffa0000 0000000000000000
GPR16: 0000000000000003 0000000000000322 0000000010002570 c00005f832062068
GPR20: c00005f857610080 0000000000000000 c00005f832062000 c00005f85754c080
GPR24: 0000000000000054 f00000017e1bff80 0000000000000000 c000000000ae97b0
GPR28: c00005f9afce5f40 0000000000000000 0000000000000001 c00005f9afce5f40
[ 440.074292] NIP [c0000000000a1ba4] kvmppc_interrupt_hv+0x28/0x15c
[ 440.074306] LR [c000000000aedfdc] _raw_spin_lock_irqsave+0x9c/0x130
[ 440.074315] Call Trace:
[ 440.074324] [c00005f83031f9d0] [c00005f83031fa10] 0xc00005f83031fa10 (unreliable)
[ 440.074341] [c00005f83031fa10] [c000000000118d14] finish_wait+0x54/0xb0
[ 440.074360] [c00005f83031fa50] [c000000000ae90bc] __wait_on_bit+0xac/0x170
[ 440.074380] [c00005f83031faa0] [c00000000022f070] wait_on_page_bit_killable+0xf0/0x110
[ 440.074396] [c00005f83031fb10] [c00000000022f18c] __lock_page_or_retry+0xfc/0x120
[ 440.074411] [c00005f83031fb50] [c00000000022f4fc] filemap_fault+0x34c/0x500
[ 440.074426] [c00005f83031fbd0] [c0000000003b23f0] ext4_filemap_fault+0x50/0x80
[ 440.074447] [c00005f83031fc10] [c00000000026e944] __do_fault+0x84/0x160
[ 440.074461] [c00005f83031fcb0] [c000000000274578] handle_mm_fault+0xd78/0x1980
[ 440.074476] [c00005f83031fd80] [c000000000af04f4] do_page_fault+0x354/0x7f0
[ 440.074494] [c00005f83031fe30] [c000000000008664] handle_page_fault+0x10/0x30
[ 440.074505] Instruction dump:
[ 440.074513] 7c892378 48000160 f92d0850 892d0858 2c090004 418210b4 2c090001 e92d0850
[ 440.074537] 4182f2f8 39200004 992d0858 e92d0860 <f8090c18> f8290c20 f8490c28 f8690c30
[ 440.074564] ---[ end trace a8961db98dfe068c ]---
[ 440.076808] Unrecoverable exception 4100 at c0000000000a1ba4
[ 440.083422] pstore: pstore dump routine blocked in Panic path, may corrupt error record
[ 440.083424]
[ 440.083428] Oops: Unrecoverable exception, sig: 6 [#4]
[ 440.083432] SMP NR_CPUS=2048 NUMA pSeries
[ 440.083450] Modules linked in: pseries_rng btrfs xor raid6_pq rtc_generic sunrpc autofs4 ses enclosure ipr
[ 440.083457] CPU: 1370 PID: 405427 Comm: htm_demo Tainted: G D L 4.4.0-30-generic #49-Ubuntu
[ 440.083460] task: c000194911f30b50 ti: c000194912214000 task.ti: c000194912214000
[ 440.083462] NIP: c0000000000a1ba4 LR: c000000000aedfdc CTR: 0000000000000000
[ 440.083464] REGS: c000194912217750 TRAP: 4100 Tainted: G D L (4.4.0-30-generic)
[ 440.083473] MSR: 8000000200001031 <SF,ME,IR,DR,LE> CR: 48004884 XER: 20000000
[ 440.083503] CFAR: c000000000008d18 SOFTE: 1
[ 440.083503] PACATMSCRATCH: 8000000200009033
[ 440.083503] GPR00: c000000000118b28 c0001949122179d0 c0000000015b5d00 0000000000000001
[ 440.083503] GPR04: c000194912217ad8 0000000000000082 0000000000000082 0000000000000002
[ 440.083503] GPR08: c000000000ae5d00 00000000f5257d14 0000000000000000 c000194912214000
[ 440.083503] GPR12: 0000000000000500 c00000000bf6d700 00003ffc6efa0000 0000000000000000
[ 440.083503] GPR16: 0000000000000003 0000000000000664 0000000010002570 c00005f832062068
[ 440.083503] GPR20: c00005f857610080 0000000000000000 c00005f832062000 c00005f85754c080
[ 440.083503] GPR24: 0000000000000054 f00000017e1bff80 c00005f87fa62ac8 c000000000ae97b0
[ 440.083503] GPR28: c00005f9afce5f40 0000000000000000 0000000000000001 c00005f9afce5f40
[ 440.083514] NIP [c0000000000a1ba4] kvmppc_interrupt_hv+0x28/0x15c
[ 440.083520] LR [c000000000aedfdc] _raw_spin_lock_irqsave+0x9c/0x130
[ 440.083521] Call Trace:
[ 440.083527] [c0001949122179d0] [0000000200000001] 0x200000001 (unreliable)
[ 440.083533] [c000194912217a10] [c000000000118b28] prepare_to_wait+0x48/0xf0
[ 440.083542] [c000194912217a50] [c000000000ae9070] __wait_on_bit+0x60/0x170
[ 440.083548] [c000194912217aa0] [c00000000022f070] wait_on_page_bit_killable+0xf0/0x110
[ 440.083553] [c000194912217b10] [c00000000022f18c] __lock_page_or_retry+0xfc/0x120
[ 440.083557] [c000194912217b50] [c00000000022f4fc] filemap_fault+0x34c/0x500
[ 440.083561] [c000194912217bd0] [c0000000003b23f0] ext4_filemap_fault+0x50/0x80
[ 440.083567] [c000194912217c10] [c00000000026e944] __do_fault+0x84/0x160
[ 440.083571] [c000194912217cb0] [c000000000274578] handle_mm_fault+0xd78/0x1980
[ 440.083575] [c000194912217d80] [c000000000af04f4] do_page_fault+0x354/0x7f0
[ 440.083584] [c000194912217e30] [c000000000008664] handle_page_fault+0x10/0x30
[ 440.083597] Instruction dump:
[ 440.083603] 7c892378 48000160 f92d0850 892d0858 2c090004 418210b4 2c090001 e92d0850
[ 440.083607] 4182f2f8 39200004 992d0858 e92d0860 <f8090c18> f8290c20 f8490c28 f8690c30
[ 440.083611] ---[ end trace a8961db98dfe068d ]---
[ 440.091723]
[ 440.091985] ---[ end Kernel panic - not syncing: Fatal exception in interrupt

Regards
Praveen

== Comment: #2 - Michael Neuling <email address hidden> - 2016-07-25 22:56:05 ==
This is the same issue as bz142887, and since it was an upstream bug it is going to hit everywhere. You'll need these two patches:

https://git.kernel.org/powerpc/c/6bcb80143e792becfd2b9cc6a339ce523e4e2219
https://git.kernel.org/powerpc/c/190ce8693c23eae09ba5f303a83bf2fbeb6478b1

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-143823 severity-critical targetmilestone-inin1610
Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → kernel-package (Ubuntu)
Gary Gaydos (gmgaydos)
affects: kernel-package (Ubuntu) → linux (Ubuntu)
Tim Gardner (timg-tpi) wrote :

Merged in 4.8

Changed in linux (Ubuntu Yakkety):
assignee: Taco Screen team (taco-screen-team) → nobody
status: New → Fix Released
Tim Gardner (timg-tpi) wrote :
Changed in linux (Ubuntu Xenial):
assignee: nobody → Tim Gardner (timg-tpi)
status: New → In Progress
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Stefan Bader (smb) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'.

If verification is not done within 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-08-18 02:11 EDT-------
(In reply to comment #7)
> This bug is awaiting verification that the kernel in -proposed solves the
> problem. Please test the kernel and update this bug with the results. If the
> problem is solved, change the tag 'verification-needed-xenial' to
> 'verification-done-xenial'.
>
> If verification is not done by 5 working days from today, this fix will be
> dropped from the source code, and this bug will be closed.
>
> See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to
> enable and use -proposed. Thank you!

Hi Canonical,

Thanks for the fix. I need some help with the -proposed build.

1- Added the line "deb http://archive.ubuntu.com/ubuntu/ yakkety-proposed restricted main multiverse universe" to /etc/apt/sources.list
2- Added the file /etc/apt/preferences.d/proposed-updates

However, I am still not getting any packages from -proposed. Please help me with this.

LOG:

root@ltc-system:~# apt-get update
Get:1 http://us.ports.ubuntu.com/ubuntu-ports yakkety InRelease [247 kB]
Get:2 http://archive.ubuntu.com/ubuntu yakkety-proposed InRelease [95.7 kB]
Hit:3 http://ports.ubuntu.com/ubuntu-ports yakkety-security InRelease
Hit:4 http://us.ports.ubuntu.com/ubuntu-ports yakkety-updates InRelease
Hit:5 http://us.ports.ubuntu.com/ubuntu-ports yakkety-backports InRelease
Get:6 http://us.ports.ubuntu.com/ubuntu-ports yakkety/universe ppc64el Packages [7,509 kB]
Ign:7 http://archive.ubuntu.com/ubuntu yakkety-proposed/main ppc64el Packages
Get:8 http://archive.ubuntu.com/ubuntu yakkety-proposed/main Translation-en [63.7 kB]
Ign:9 http://archive.ubuntu.com/ubuntu yakkety-proposed/multiverse ppc64el Packages
Get:10 http://archive.ubuntu.com/ubuntu yakkety-proposed/multiverse Translation-en [3,732 B]
Ign:11 http://archive.ubuntu.com/ubuntu yakkety-proposed/universe ppc64el Packages
Get:12 http://archive.ubuntu.com/ubuntu yakkety-proposed/universe Translation-en [274 kB]
Ign:7 http://archive.ubuntu.com/ubuntu yakkety-proposed/main ppc64el Packages
Ign:9 http://archive.ubuntu.com/ubuntu yakkety-proposed/multiverse ppc64el Packages
Ign:11 http://archive.ubuntu.com/ubuntu yakkety-proposed/universe ppc64el Packages
Ign:7 http://archive.ubuntu.com/ubuntu yakkety-proposed/main ppc64el Packages
Ign:9 http://archive.ubuntu.com/ubuntu yakkety-proposed/multiverse ppc64el Packages
Ign:11 http://archive.ubuntu.com/ubuntu yakkety-proposed/universe ppc64el Packages
Err:7 http://archive.ubuntu.com/ubuntu yakkety-proposed/main ppc64el Packages
404 Not Found [IP: 91.189.88.161 80]
Ign:9 http://archive.ubuntu.com/ubuntu yakkety-proposed/multiverse ppc64el Packages
Ign:11 http://archive.ubuntu.com/ubuntu yakkety-proposed/universe ppc64el Packages
Fetched 8,193 kB in 7s (1,125 kB/s)
Reading package lists... Done
E: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/yakkety-proposed/main/binary-ppc64el/Packages 404 Not Found [IP: 91.189.88.161 80]
E: Some index files failed to download. They have been ignored, or old ones used instead.

root@ltc-system:~# apt-get dist-upgrade
Reading package lists... Done
Building dependency tree
Reading state information... Done
Calculating upgrade... Done
0 upgraded, 0 newly ...


bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-08-18 10:53 EDT-------
Hi Praveen,

Please see

https://wiki.ubuntu.com/ppc64el/CommonQuestions#How_to_enable_the_-proposed_repository_in_ubuntu

The repository for ppc64el is different from the one for x86.

Thanks, Gary
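On ppc64el, the Ubuntu archive is served from ports.ubuntu.com/ubuntu-ports (as the working yakkety, yakkety-updates, yakkety-security and yakkety-backports entries in the apt-get log above show), so a yakkety-proposed line pointing at archive.ubuntu.com fails with 404 Not Found for the binary-ppc64el indexes. A minimal sketch of a corrected entry, assuming Ubuntu 16.10 (yakkety) on ppc64el and the same components as the original line:

```
# /etc/apt/sources.list, or a file under /etc/apt/sources.list.d/
# ppc64el packages live in the ports archive, not archive.ubuntu.com
deb http://ports.ubuntu.com/ubuntu-ports yakkety-proposed main restricted universe multiverse
```

After replacing the archive.ubuntu.com line with this one, rerunning apt-get update should fetch the yakkety-proposed ppc64el Packages indexes without the 404 error.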

bugproxy (bugproxy) wrote : dmesg output

------- Comment (attachment only) From <email address hidden> 2016-08-22 03:16 EDT-------

bugproxy (bugproxy)
tags: added: verification-done
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-36.55

---------------
linux (4.4.0-36.55) xenial; urgency=low

  [ Stefan Bader ]

  * Release Tracking Bug
    - LP: #1612305

  * I2C touchpad does not work on AMD platform (LP: #1612006)
    - SAUCE: pinctrl/amd: Remove the default de-bounce time

  * CVE-2016-5696
    - tcp: make challenge acks less predictable

linux (4.4.0-35.54) xenial; urgency=low

  [ Stefan Bader ]

  * Release Tracking Bug
    - LP: #1611215

  * [i915_bpo] Sync with v4.7 (LP: #1609742)
    - SAUCE: i915_bpo: Sync with v4.7

  * s390/cio: fix reset of channel measurement block (LP: #1609415)
    - s390/cio: allow to reset channel measurement block

  * in Ubuntu16.10: Hit on Call traces and system goes down when transactional
    memory tests are running in 32TB Brazos system (LP: #1606786)
    - powerpc/tm: Avoid SLB faults in treclaim/trecheckpoint when RI=0
    - powerpc/tm: Fix stack pointer corruption in __tm_recheckpoint()

  * Power Menu does not display after press the Power Button (LP: #1609204)
    - intel-vbtn: new driver for Intel Virtual Button
    - [config] enable CONFIG_INTEL_VBTN=m

  * OptiPlex 7450 AIO hangs when rebooting (LP: #1608762)
    - x86/reboot: Add Dell Optiplex 7450 AIO reboot quirk

  * virtualbox+usb 3.0 breaks boot, -28 kernel works (LP: #1604058)
    - SAUCE: xhci: Fix soft lockup in xhci_pci_probe path when XHCI_STATE_HALTED

  * linux-kernel: Freeing IRQ from IRQ context (LP: #1597908)
    - block: defer timeouts to a workqueue

  * Tunnel offload indications not stripped from encapsulated packets, causing
    performance overhead (LP: #1602755)
    - tunnels: Remove encapsulation offloads on decap.

  * lm-sensors is throwing "ERROR: Can't get value of subfeature temp1_input:
    I/O error" for be2net driver (LP: #1607387)
    - be2net: perform temperature query in adapter regardless of its interface
      state

  * Dell dock MAC Address pass through doesn't work in Ubuntu (LP: #1579984)
    - r8152: Add support for setting pass through MAC address on RTL8153-AD

  * vmxnet3 LRO IPv6 performance issues (stalling TCP) (LP: #1605494)
    - Driver: Vmxnet3: set CHECKSUM_UNNECESSARY for IPv6 packets

  * ISST-LTE:pVM:monklp5:Ubuntu16.04.1:system crashed at
    lpfc_sli4_scmd_to_wqidx_distr (LP: #1597974)
    - SAUCE: lpfc: fix oops in lpfc_sli4_scmd_to_wqidx_distr() from
      lpfc_send_taskmgmt()

  * Backport cxlflash shutdown patch to Xenial SRU (LP: #1605405)
    - SAUCE: cxlflash: Verify problem state area is mapped before notifying
      shutdown

  * Xenial update to v4.4.16 stable release (LP: #1607404)
    - mac80211: fix fast_tx header alignment
    - mac80211: mesh: flush mesh paths unconditionally
    - mac80211_hwsim: Add missing check for HWSIM_ATTR_SIGNAL
    - mac80211: Fix mesh estab_plinks counting in STA removal case
    - EDAC, sb_edac: Fix rank lookup on Broadwell
    - IB/cm: Fix a recently introduced locking bug
    - IB/mlx4: Properly initialize GRH TClass and FlowLabel in AHs
    - powerpc/pseries: Fix IBM_ARCH_VEC_NRCORES_OFFSET since POWER8NVL was added
    - powerpc/tm: Always reclaim in start_thread() for exec() class syscalls
    - usb: dwc2: fix reg...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
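Whether a given Xenial machine already runs a kernel with these fixes can be read from the ABI number in its kernel release string (the 30 in the 4.4.0-30-generic kernel that crashed above). The POSIX shell sketch below is illustrative only: the function name tm_fix_status is hypothetical, and the ABI threshold of 35 is an assumption taken from the changelog, where the two powerpc/tm patches appear under linux (4.4.0-35.54) and were released to Xenial in 4.4.0-36.55.

```shell
#!/bin/sh
# Hypothetical helper (not part of any Ubuntu tool): classify a kernel
# release string, as printed by `uname -r`, for the LP: #1606786 TM fixes.
# Assumption: Xenial 4.4.0 kernels with ABI >= 35 carry the two powerpc/tm
# patches, since the changelog lists them under linux (4.4.0-35.54).

tm_fix_status() {
    release="$1"            # e.g. "4.4.0-30-generic"
    base="${release%%-*}"   # upstream part before the first dash: "4.4.0"
    rest="${release#*-}"    # text after the first dash: "30-generic"
    abi="${rest%%-*}"       # ABI number: "30"

    # Only the Xenial 4.4.0 series is covered; anything else is "unknown".
    if [ "$base" != "4.4.0" ]; then
        echo "unknown"
        return 0
    fi

    # Reject release strings without a purely numeric ABI field.
    case "$abi" in
        ''|*[!0-9]*) echo "unknown"; return 0 ;;
    esac

    if [ "$abi" -ge 35 ]; then
        echo "fixed"
    else
        echo "not fixed"
    fi
}

# Classify the currently running kernel.
tm_fix_status "$(uname -r)"
```

For example, tm_fix_status 4.4.0-30-generic (the kernel in the crash log) prints "not fixed". Only the ABI field is compared here for brevity; a full package version comparison would need something like dpkg --compare-versions.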
Tim Gardner (timg-tpi) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'.

If verification is not done within 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-10-04 04:27 EDT-------
Hi

Verified this bug on Ubuntu 16.10 (4.8 kernel); it works fine.

LOG:

root@ltc-brazos1:~/tm-test-le# ./runtests.sh
Running ./htm_fork
PASSED!
Running ./htm_vmxcopy
PASSED!
Running ./htm_dscr
Starting, 10000 loops
PASSED!
Running ./htm_dscr
Starting, 10000 loops
PASSED!
Running ./htm_tar
Starting, 10000 loops
PASSED!
Running ./htm_fpunavailable
Thread doing stuff!
Entering FP function
Transaction done, ret = 0, TEXASR 0x11c000001,
TFIAR 0x100029e9 f = 1337.000000.
FAIL, wrong number of aborts.
Done.
Test failed code:0 ./htm_fpunavailable
root@ltc-brazos1:~/tm-test-le#

Regards
Praveen
