[LTC-Test] - NMI watchdog Bug and call traces when trinity is executed.

Bug #1602524 reported by bugproxy
Affects          Status         Importance  Assigned to            Milestone
linux (Ubuntu)   Fix Released   High        Canonical Kernel Team
Xenial           Fix Released   Undecided   Tim Gardner
Yakkety          Fix Released   High        Canonical Kernel Team

Bug Description

== Comment: #0 - Santhosh G ==
Problem Statement:
An NMI watchdog bug and call traces occur when trinity is executed.

Environment:
P8 PowerVM Lpar

uname o/p:
uname -a
Linux tuleta4u-lp5 4.4.0-11-generic #26-Ubuntu SMP Sat Mar 5 14:21:51 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

Steps to reproduce:

1) Install Ubuntu 16.04 in a PowerVM LPAR.
2) Download trinity-1.5 and build it: ./configure.sh; make; make install
3) Execute trinity as
   './trinity --dangerous'

The test runs for more than one hour and trinity gets killed with call traces:

[19744.229979] NMI watchdog: BUG: soft lockup - CPU#3 stuck for 21s! [trinity-c3:26544]
[19744.229991] Modules linked in: hidp hid bnep rfcomm l2tp_ppp l2tp_netlink l2tp_core ip6_udp_tunnel udp_tunnel af_key mpls_router llc2 nfnetlink dn_rtmsg xfrm_user xfrm_algo can_raw crypto_user can_bcm cmtp kernelcapi scsi_transport_iscsi sctp libcrc32c nfc af_alg caif_socket caif phonet af_rxrpc bluetooth can pppoe pppox irda crc_ccitt atm appletalk ipx p8023 p8022 psnap llc pseries_rng rtc_generic autofs4 ibmvscsi ibmveth
[19744.230024] CPU: 3 PID: 26544 Comm: trinity-c3 Not tainted 4.4.0-11-generic #26-Ubuntu
[19744.230026] task: c00000000ae87e60 ti: c00000000ae24000 task.ti: c00000000ae24000
[19744.230028] NIP: c0000000003fac78 LR: c0000000003fabfc CTR: c00000000039ef10
[19744.230029] REGS: c00000000ae27980 TRAP: 0901 Not tainted (4.4.0-11-generic)
[19744.230030] MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24004444 XER: 20000000
[19744.230035] CFAR: c0000000003fae6c SOFTE: 1
               GPR00: c0000000003fabfc c00000000ae27c00 c0000000015a3b00 c0000000f7f03ba8
               GPR04: 000000000e02adcb c00000000ae27cb0 0000000000000000 0000000000000000
               GPR08: 8000000000000000 0000000000000000 c0000000ef886000 c000000000af0870
               GPR12: 0000000024004444 c00000000e7f1c80
[19744.230045] NIP [c0000000003fac78] ext4_es_lookup_extent+0xc8/0x2c0
[19744.230047] LR [c0000000003fabfc] ext4_es_lookup_extent+0x4c/0x2c0
[19744.230048] Call Trace:
[19744.230050] [c00000000ae27c00] [c0000000003fabfc] ext4_es_lookup_extent+0x4c/0x2c0 (unreliable)
[19744.230053] [c00000000ae27c50] [c0000000003a6f18] ext4_map_blocks+0x78/0x610
[19744.230055] [c00000000ae27d10] [c00000000039f14c] ext4_llseek+0x23c/0x3f0
[19744.230057] [c00000000ae27de0] [c0000000002e02a8] SyS_lseek+0xe8/0x130
[19744.230060] [c00000000ae27e30] [c000000000009204] system_call+0x38/0xb4
[19744.230061] Instruction dump:
[19744.230062] 2fa90000 409effec e93e0028 3b800000 e9490458 e92a0440 39290001 f92a0440
[19744.230065] 7c2004ac 7d20d828 3129ffff 7d20d92d <40c2fff4> 60000000 7f83e378 38210050
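
For triage, the first line of the trace above already carries the CPU number, the stall duration, and the task name and PID. A minimal sketch for pulling those fields out of a dmesg line is shown here. The function name parse_soft_lockup and the regular expression are illustrative assumptions, written only against the line format seen in this report, not a general dmesg parser.

```python
import re

# Pattern for the soft-lockup report line as it appears in the log above.
# Assumption: other kernel versions may word this line differently.
SOFT_LOCKUP_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+NMI watchdog: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)


def parse_soft_lockup(line):
    """Return a dict describing a soft-lockup line, or None if it does not match."""
    m = SOFT_LOCKUP_RE.search(line)
    if m is None:
        return None
    return {
        "timestamp": float(m.group("ts")),
        "cpu": int(m.group("cpu")),
        "stuck_seconds": int(m.group("secs")),
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
    }


if __name__ == "__main__":
    sample = (
        "[19744.229979] NMI watchdog: BUG: soft lockup - "
        "CPU#3 stuck for 21s! [trinity-c3:26544]"
    )
    print(parse_soft_lockup(sample))
```

Feeding the full dmesg output through this line by line and keeping the non-None results gives a quick list of which tasks stalled and on which CPUs.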

== Comment: #8 - Santhosh G ==

Tried the scenario as given in https://bugzilla.linux.ibm.com/show_bug.cgi?id=128126#c26
-----
# Create a 624GiB file; Mostly filled with holes though
$ dd if=/dev/zero of=file-0.bin bs=1M count=1 seek=598382
# Invoke lseek with SEEK_DATA option starting with file offset 0
while [ 1 ]; do xfs_io -f -c "seek -d 0" file-0.bin; done
----
and I was able to hit the issue in 16.04.1
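
The reproducer above comes down to a single lseek(fd, 0, SEEK_DATA) call on a file whose only data sits behind a very large hole, which matches the SyS_lseek and ext4_llseek frames in the traces. Here is a scaled-down sketch of the same access pattern, assuming Linux and Python 3. The helper names make_sparse_file and first_data_offset are hypothetical, and the 64 MiB hole is far smaller than the seek=598382 used above, so this is only meant to show the call sequence and is not expected to trigger the lockup.

```python
import os
import tempfile

MIB = 1024 * 1024


def make_sparse_file(path, hole_mib, data_mib=1):
    """Write data_mib MiB of zeros after a hole of hole_mib MiB (like dd with seek=)."""
    with open(path, "wb") as f:
        f.seek(hole_mib * MIB)            # skip ahead without writing: leaves a hole
        f.write(b"\0" * (data_mib * MIB))  # explicitly written data at the end


def first_data_offset(path):
    """Return the offset of the first data region at or after offset 0."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Same request as 'xfs_io -c "seek -d 0"': SEEK_DATA starting at offset 0.
        return os.lseek(fd, 0, os.SEEK_DATA)
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "file-0.bin")
        make_sparse_file(path, hole_mib=64)
        print("apparent size:", os.path.getsize(path))
        print("first data offset:", first_data_offset(path))
```

On a filesystem that tracks holes, the reported offset should sit at or near the end of the hole. A filesystem without hole support may simply report offset 0, since it treats the whole file as data.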

kernel version:
4.4.0-28-generic

dmesg o/p:

[ 1197.994822] 40-...: (5249 ticks this GP) idle=975/140000000000001/0 softirq=7812/7812 fqs=5251
[ 1197.995071] (t=5251 jiffies g=29144 c=29143 q=3418)
[ 1197.995115] Task dump for CPU 40:
[ 1197.995117] xfs_io R running task 0 3601 3489 0x00040004
[ 1197.995121] Call Trace:
[ 1197.995126] [c000003c7c8675b0] [c0000000000fbc00] sched_show_task+0xe0/0x180 (unreliable)
[ 1197.995131] [c000003c7c867620] [c00000000013eb74] rcu_dump_cpu_stacks+0xe4/0x150
[ 1197.995134] [c000003c7c867670] [c0000000001442a4] rcu_check_callbacks+0x6b4/0x9b0
[ 1197.995136] [c000003c7c8677a0] [c00000000014c108] update_process_times+0x58/0xa0
[ 1197.995140] [c000003c7c8677d0] [c000000000163818] tick_sched_handle.isra.6+0x48/0xe0
[ 1197.995143] [c000003c7c867810] [c000000000163914] tick_sched_timer+0x64/0xd0
[ 1197.995146] [c000003c7c867850] [c00000000014cbd4] __hrtimer_run_queues+0x124/0x450
[ 1197.995148] [c000003c7c8678e0] [c00000000014dbfc] hrtimer_interrupt+0xec/0x2c0
[ 1197.995152] [c000003c7c8679a0] [c00000000001f5bc] __timer_interrupt+0x8c/0x290
[ 1197.995154] [c000003c7c8679f0] [c00000000001f970] timer_interrupt+0xa0/0xe0
[ 1197.995157] [c000003c7c867a20] [c000000000002714] decrementer_common+0x114/0x180
[ 1197.995163] --- interrupt: 901 at ext4_es_find_delayed_extent_range+0x20/0x2b0
                   LR = ext4_llseek+0x268/0x3f0
[ 1197.995166] [c000003c7c867d10] [c0000000003a170c] ext4_llseek+0x23c/0x3f0 (unreliable)
[ 1197.995170] [c000003c7c867de0] [c0000000002e1f08] SyS_lseek+0xe8/0x130
[ 1197.995173] [c000003c7c867e30] [c000000000009204] system_call+0x38/0xb4

=====

The call traces do not occur when the same test is run on a kernel with the patch applied.

bugproxy (bugproxy) wrote : Attaching full dmesg logs

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-138650 severity-high targetmilestone-inin16041
bugproxy (bugproxy) wrote : Backported patches to fix the Ext4 soft lockup issues

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → kernel-package (Ubuntu)
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-07-13 04:31 EDT-------
Canonical,

Please consider the attached backported patch which fixes this issue.

Thanks,
Chandan.

affects: kernel-package (Ubuntu) → linux (Ubuntu)
Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
status: New → Triaged
Tim Gardner (timg-tpi) wrote :
Changed in linux (Ubuntu Yakkety):
status: Triaged → Fix Released
Changed in linux (Ubuntu Xenial):
assignee: nobody → Tim Gardner (timg-tpi)
status: New → In Progress
Tim Gardner (timg-tpi) wrote :
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Stefan Bader (smb) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-08-17 06:38 EDT-------
### External Comment ###

Enabled -proposed as described at https://wiki.ubuntu.com/Testing/EnableProposed
Updated the kernel from 4.4.0-34-generic (the default version in 16.04) to 4.4.0-36-generic.

Then ran the commands below for more than 1 hour:

# Create a 624GiB file; Mostly filled with holes though
$ dd if=/dev/zero of=file-0.bin bs=1M count=1 seek=598382
# Invoke lseek with SEEK_DATA option starting with file offset 0
while [ 1 ]; do xfs_io -f -c "seek -d 0" file-0.bin; done

I was not able to hit the issue.

Note:
I was able to reproduce the issue with the default 16.04 kernel, 4.4.0-34-generic.
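
The two data points from this verification (4.4.0-34 reproduces the lockup, 4.4.0-36 does not) can be turned into a quick check of the running kernel. This is a sketch under stated assumptions: the function name fix_status and the three result labels are hypothetical, the check only applies to Xenial 4.4.0-N kernels, ABI 35 is reported as unknown because it was not tested in this bug, and treating every ABI at or below 34 as affected is an extrapolation from the kernels reported here (-11, -28, -34).

```python
import platform
import re

FIXED_ABI = 36     # 4.4.0-36 was verified as fixed in this report
AFFECTED_ABI = 34  # 4.4.0-34 still reproduced the lockup


def fix_status(release):
    """Classify a kernel release string such as '4.4.0-36-generic'.

    Returns 'fixed', 'affected' or 'unknown'. Only the Xenial 4.4.0 series
    is classified; anything else is 'unknown'.
    """
    m = re.match(r"4\.4\.0-(\d+)-", release)
    if m is None:
        return "unknown"
    abi = int(m.group(1))
    if abi >= FIXED_ABI:
        return "fixed"
    if abi <= AFFECTED_ABI:
        return "affected"
    return "unknown"  # ABI 35: not tested in this bug


if __name__ == "__main__":
    release = platform.release()  # same string as 'uname -r'
    print(release, "->", fix_status(release))
```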

bugproxy (bugproxy)
tags: added: verification-done-xenial
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-36.55

---------------
linux (4.4.0-36.55) xenial; urgency=low

  [ Stefan Bader ]

  * Release Tracking Bug
    - LP: #1612305

  * I2C touchpad does not work on AMD platform (LP: #1612006)
    - SAUCE: pinctrl/amd: Remove the default de-bounce time

  * CVE-2016-5696
    - tcp: make challenge acks less predictable

linux (4.4.0-35.54) xenial; urgency=low

  [ Stefan Bader ]

  * Release Tracking Bug
    - LP: #1611215

  * [i915_bpo] Sync with v4.7 (LP: #1609742)
    - SAUCE: i915_bpo: Sync with v4.7

  * s390/cio: fix reset of channel measurement block (LP: #1609415)
    - s390/cio: allow to reset channel measurement block

  * in Ubuntu16.10: Hit on Call traces and system goes down when transactional
    memory tests are running in 32TB Brazos system (LP: #1606786)
    - powerpc/tm: Avoid SLB faults in treclaim/trecheckpoint when RI=0
    - powerpc/tm: Fix stack pointer corruption in __tm_recheckpoint()

  * Power Menu does not display after press the Power Button (LP: #1609204)
    - intel-vbtn: new driver for Intel Virtual Button
    - [config] enable CONFIG_INTEL_VBTN=m

  * OptiPlex 7450 AIO hangs when rebooting (LP: #1608762)
    - x86/reboot: Add Dell Optiplex 7450 AIO reboot quirk

  * virtualbox+usb 3.0 breaks boot, -28 kernel works (LP: #1604058)
    - SAUCE: xhci: Fix soft lockup in xhci_pci_probe path when XHCI_STATE_HALTED

  * linux-kernel: Freeing IRQ from IRQ context (LP: #1597908)
    - block: defer timeouts to a workqueue

  * Tunnel offload indications not stripped from encapsulated packets, causing
    performance overhead (LP: #1602755)
    - tunnels: Remove encapsulation offloads on decap.

  * lm-sensors is throwing "ERROR: Can't get value of subfeature temp1_input:
    I/O error" for be2net driver (LP: #1607387)
    - be2net: perform temperature query in adapter regardless of its interface
      state

  * Dell dock MAC Address pass through doesn't work in Ubuntu (LP: #1579984)
    - r8152: Add support for setting pass through MAC address on RTL8153-AD

  * vmxnet3 LRO IPv6 performance issues (stalling TCP) (LP: #1605494)
    - Driver: Vmxnet3: set CHECKSUM_UNNECESSARY for IPv6 packets

  * ISST-LTE:pVM:monklp5:Ubuntu16.04.1:system crashed at
    lpfc_sli4_scmd_to_wqidx_distr (LP: #1597974)
    - SAUCE: lpfc: fix oops in lpfc_sli4_scmd_to_wqidx_distr() from
      lpfc_send_taskmgmt()

  * Backport cxlflash shutdown patch to Xenial SRU (LP: #1605405)
    - SAUCE: cxlflash: Verify problem state area is mapped before notifying
      shutdown

  * Xenial update to v4.4.16 stable release (LP: #1607404)
    - mac80211: fix fast_tx header alignment
    - mac80211: mesh: flush mesh paths unconditionally
    - mac80211_hwsim: Add missing check for HWSIM_ATTR_SIGNAL
    - mac80211: Fix mesh estab_plinks counting in STA removal case
    - EDAC, sb_edac: Fix rank lookup on Broadwell
    - IB/cm: Fix a recently introduced locking bug
    - IB/mlx4: Properly initialize GRH TClass and FlowLabel in AHs
    - powerpc/pseries: Fix IBM_ARCH_VEC_NRCORES_OFFSET since POWER8NVL was added
    - powerpc/tm: Always reclaim in start_thread() for exec() class syscalls
    - usb: dwc2: fix reg...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released