ISST-LTE:pKVM311:lotg5:Ubuntu16041:lotg5 crashed @ writeback_sb_inodes+0x30c/0x590

Bug #1614565 reported by bugproxy
Affects         Status        Importance  Assigned to  Milestone
--------------  ------------  ----------  -----------  ---------
linux (Ubuntu)  Fix Released  Undecided   Unassigned
Xenial          Fix Released  Undecided   Tim Gardner
Yakkety         Fix Released  Undecided   Unassigned

Bug Description

== Comment: #0 - PRIYA M. A <email address hidden> - 2016-06-17 10:01:28 ==
Problem Description:
================
- lotg5 crashed at writeback_sb_inodes+0x30c/0x590

Steps to re-create:
==============
- Install lotg5 with Ubuntu16041(4.4.0-24-generic)
- Start the regression tests in lotg5
Logs:
====
root@lotg5:~# show.report.py
HOSTNAME KERNEL VERSION DISTRO INFO
-------- ---------------- -----------
lotg5 4.4.0-24-generic Ubuntu 16.04 LTS \n \l

######## Current Time: Fri Jun 17 01:10:46 2016 ########
Job-ID FOCUS Start-Time Duration Function
------ ----- ---------- -------- --------
1 BASE 20160614-05:50:19 67.0 hr(s) 20.0 min(s) Test
2 IO 20160614-05:50:26 67.0 hr(s) 20.0 min(s) IO_Focus
3 NFS 20160614-06:24:35 66.0 hr(s) 46.0 min(s) DistributeFS_Testing
4 TCP 20160614-06:32:03 66.0 hr(s) 38.0 min(s) networkTest2lotg3

FOCUS BASE IO NFS TCP SUM
TOTAL 48647 1825 517 82690 133679
FAIL 5028 0 0 24 5052
PASS 43619 1825 517 82666 128627
(%) (89%) (100%) (100%) (99%) (96%)

DLPAR is not tested!
root@lotg5:~#
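
As a quick sanity check on the report above, the per-focus counts can be re-added and the pass rates recomputed. This is only a sketch: the numbers are copied from the table, and the truncating integer percentage is an assumption about how show.report.py rounds, not something the tool documents here.

```python
# Per-focus counts copied from the show.report.py table above.
results = {
    "BASE": {"total": 48647, "fail": 5028, "pass": 43619},
    "IO":   {"total": 1825,  "fail": 0,    "pass": 1825},
    "NFS":  {"total": 517,   "fail": 0,    "pass": 517},
    "TCP":  {"total": 82690, "fail": 24,   "pass": 82666},
}


def pass_percent(row):
    """Integer pass rate, truncated (assumed to match the report's rounding)."""
    return row["pass"] * 100 // row["total"]


def summed(key):
    """Add one column (total, fail or pass) across all focus areas."""
    return sum(row[key] for row in results.values())


summary = {key: summed(key) for key in ("total", "fail", "pass")}

for focus, row in results.items():
    # Every test is either a pass or a fail, so the columns must add up.
    assert row["fail"] + row["pass"] == row["total"]
    print(f"{focus:5s} {pass_percent(row)}%")
print(f"SUM   {pass_percent(summary)}%")
```

The recomputed SUM column agrees with the one printed in the report, so the 5052 failures are all accounted for by BASE and TCP.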

- After 65+ hr of execution, lotg5 crashed with the following call traces
Logs:
====
[root@lotkvm ~]# virsh console lotg5
Connected to domain lotg5
Escape character is ^]

0:mon> c
cpus stopped: 0x0 0x4 0x8 0xc
0:mon> d
0000000000000000 **************** **************** | |
0:mon> e
cpu 0x0: Vector: 300 (Data Access) at [c0000000c4f4b620]
    pc: c000000000323720: locked_inode_to_wb_and_lock_list+0x50/0x290
    lr: c000000000326dbc: writeback_sb_inodes+0x30c/0x590
    sp: c0000000c4f4b8a0
   msr: 8000000100009033
   dar: 0
 dsisr: 40000000
  current = 0xc00000017191cf60
  paca = 0xc000000007b40000 softe: 0 irq_happened: 0x01
    pid = 5792, comm = kworker/u32:5
0:mon> t
[c0000000c4f4b900] c000000000326dbc writeback_sb_inodes+0x30c/0x590
[c0000000c4f4ba10] c000000000327124 __writeback_inodes_wb+0xe4/0x150
[c0000000c4f4ba70] c00000000032758c wb_writeback+0x30c/0x450
[c0000000c4f4bb40] c00000000032803c wb_workfn+0x14c/0x570
[c0000000c4f4bc50] c0000000000dd1d0 process_one_work+0x1e0/0x5a0
[c0000000c4f4bce0] c0000000000dd724 worker_thread+0x194/0x680
[c0000000c4f4bd80] c0000000000e61e0 kthread+0x110/0x130
[c0000000c4f4be30] c000000000009538 ret_from_kernel_thread+0x5c/0xa4
--- Exception: 0 at 0000000000000000
0:mon>
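
For anyone post-processing the xmon backtrace above, here is a minimal sketch that splits each frame into symbol, offset and function size, and derives the function start address. The regular expression and the helper name parse_frames are illustrative assumptions, not part of xmon or any existing tool; the frame format is inferred only from the lines shown in this comment.

```python
import re

# One frame of xmon `t` output, as seen above:
#   [stack pointer] return address symbol+0xoffset/0xsize
FRAME_RE = re.compile(
    r"\[(?P<sp>[0-9a-f]+)\]\s+(?P<addr>[0-9a-f]+)\s+"
    r"(?P<symbol>[\w.]+)\+0x(?P<offset>[0-9a-f]+)/0x(?P<size>[0-9a-f]+)"
)


def parse_frames(text):
    """Return one dict per backtrace frame found in `text`."""
    frames = []
    for line in text.splitlines():
        m = FRAME_RE.search(line)
        if m is None:
            continue  # skip prompts, exception markers, blank lines
        addr = int(m["addr"], 16)
        offset = int(m["offset"], 16)
        frames.append({
            "symbol": m["symbol"],
            "addr": addr,
            "offset": offset,
            "size": int(m["size"], 16),
            "start": addr - offset,  # address of the function's first instruction
        })
    return frames


TRACE = """\
[c0000000c4f4b900] c000000000326dbc writeback_sb_inodes+0x30c/0x590
[c0000000c4f4ba10] c000000000327124 __writeback_inodes_wb+0xe4/0x150
[c0000000c4f4ba70] c00000000032758c wb_writeback+0x30c/0x450
--- Exception: 0 at 0000000000000000
"""

for frame in parse_frames(TRACE):
    print(f"{frame['symbol']:24s} start={frame['start']:#x} +{frame['offset']:#x}")
```

The derived start address plus the offset can then be handed to a disassembler or addr2line against the matching vmlinux, assuming one with debug symbols is available.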

== Comment: #4 - Chandan Kumar <email address hidden> - 2016-06-20 06:23:33 ==
dmesg log:
-------------
[251403.003999] EXT4-fs (loop0): mounted filesystem without journal. Opts: (null)
[251403.471118] Unable to handle kernel paging request for data at address 0x00000000
[251403.473391] Faulting instruction address: 0xc000000000323720 << ---- PC
-------------

0:mon> di c000000000323720
c000000000323720 e93f0000 ld r9,0(r31)
// [R31 = 0000000000000000, trying to de-reference null address]
c000000000323724 39290050 addi r9,r9,80
c000000000323728 7fbf4840 cmpld cr7,r31,r9
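
To double-check the reading of the faulting instruction, the two instruction words can be decoded by hand. The sketch below handles only the two encodings needed here (ld as DS-form with primary opcode 58, addi as D-form with primary opcode 14); the function names are made up for illustration, and the register value R31 = 0 is taken from the annotation above.

```python
def sign_extend(value, bits):
    """Interpret the low `bits` bits of `value` as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def decode(word):
    """Decode a 32-bit Power instruction word; only ld and addi are handled."""
    opcode = word >> 26
    rt = (word >> 21) & 0x1F
    ra = (word >> 16) & 0x1F
    if opcode == 58 and (word & 0x3) == 0:
        # ld RT,DS(RA): the displacement is DS with two zero bits appended.
        return ("ld", rt, ra, sign_extend(word & 0xFFFC, 16))
    if opcode == 14:
        # addi RT,RA,SI
        return ("addi", rt, ra, sign_extend(word & 0xFFFF, 16))
    raise ValueError(f"unsupported instruction word {word:#010x}")


def load_address(insn, regs):
    """Effective address of a decoded ld; RA = 0 means a literal zero base."""
    _, _, ra, disp = insn
    base = 0 if ra == 0 else regs[ra]
    return (base + disp) & 0xFFFFFFFFFFFFFFFF


faulting = decode(0xE93F0000)
print(faulting)
print(hex(load_address(faulting, {31: 0})))
print(decode(0x39290050))
```

With R31 = 0 the load targets address 0, which is consistent with the "dar: 0" value and the "data at address 0x00000000" line in dmesg.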

====

Dominic,

Can you please take a look and assign this to a suitable developer?

Thanks,
Chandan

== Comment: #6 - Laurent Dufour <email address hidden> - 2016-06-20 13:03:15 ==
It looks like inode->i_wb was cleared while waiting for IO to be dropped in writeback_sb_inodes().

That needs to be double-checked...
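
The suspected pattern can be modelled in a few lines. This is a toy, single-threaded model of the hypothesis only: the class and function names are invented, it is not kernel code, and the "checked" variant is just a contrast case, not a description of how the kernel or any posted patch deals with the problem.

```python
class Wb:
    """Stand-in for a writeback context; it only carries a name."""
    def __init__(self, name):
        self.name = name


class Inode:
    """Stand-in for an inode with its i_wb association."""
    def __init__(self, wb):
        self.i_wb = wb


def detach(inode):
    """Hypothetical concurrent path that clears the association."""
    inode.i_wb = None


def unchecked_writeback(inode, while_waiting):
    """Wait, then use i_wb without re-validating it."""
    while_waiting(inode)        # window in which another path may run
    return inode.i_wb.name      # blows up if i_wb was cleared meanwhile


def checked_writeback(inode, while_waiting):
    """Same flow, but re-read i_wb after the wait and tolerate None."""
    while_waiting(inode)
    wb = inode.i_wb
    return None if wb is None else wb.name


try:
    unchecked_writeback(Inode(Wb("bdev")), detach)
except AttributeError as exc:
    # Python's analogue of the NULL dereference seen in the trace.
    print("unchecked path failed:", exc)

print("checked path returned:", checked_writeback(Inode(Wb("bdev")), detach))
```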

== Comment: #10 - Laurent Dufour <email address hidden> - 2016-06-21 05:11:35 ==
That seems to be an already known issue raised by commit 43d1c0eb7e11 "block: detach bdev inode from its wb in __blkdev_put()".

A patch has been posted to LKML, but the discussion about it is still ongoing:
https://patchwork.kernel.org/patch/9184495/
https://lkml.org/lkml/2016/6/17/676

== Comment: #13 - Laurent Dufour <email address hidden> - 2016-06-22 03:29:00 ==
It appears that the right way to fix that would be https://patchwork.kernel.org/patch/9187409/.

I can build a patched Ubuntu kernel on your node so that you can restart the tests.
Do you agree?

== Comment: #14 - PRIYA M. A <email address hidden> - 2016-06-22 03:44:00 ==
Sure, Laurent. lotg5 is being installed. I will update this bug once the installation is complete so that you can apply the patch on lotg5, and then I will start the tests on it.

== Comment: #16 - Laurent Dufour <email address hidden> - 2016-06-22 06:21:05 ==
root@lotg5:~# uname -a
Linux lotg5 4.4.0-24-generic #43+ldu SMP Wed Jun 22 03:24:05 CDT 2016 ppc64le ppc64le ppc64le GNU/Linux

The patched kernel (#43+ldu) is installed in place of the Ubuntu one and is running on lotg5.
Please give it a try...

== Comment: #19 - PRIYA M. A <email address hidden> - 2016-06-29 02:33:54 ==
- Issue is not seen at lotg5

== Comment: #21 - Laurent Dufour <email address hidden> - 2016-07-12 12:01:00 ==
(In reply to comment #20)
> (In reply to comment #19)
> > - Issue is not seen at lotg5
>
> Can we close this bug then?

I would prefer waiting for the patch mentioned in comment #13 to be accepted upstream.
I'll update this bug once this is done.

== Comment: #22 - Laurent Dufour <email address hidden> - 2016-07-25 08:00:20 ==
I asked on the mailing list why the patch mentioned in comment #13 is not yet upstream.
I'll update the bug once I get a reply.

== Comment: #23 - Laurent Dufour <email address hidden> - 2016-07-26 10:27:34 ==
The patch has been applied to the linux-fsdevel tree and is on its way into 4.8.
I think this can now be closed.

== Comment: #24 - Laurent Dufour <email address hidden> - 2016-07-26 10:30:14 ==
For the record: https://patchwork.kernel.org/patch/9247955/

== Comment: #29 - Laurent Dufour <email address hidden> - 2016-08-18 09:14:41 ==
The patch is now part of the kernel 4.8-rc1.
It would have to be backported to 16.04.

== Comment: #31 - Laurent Dufour <email address hidden> - 2016-08-18 09:16:25 ==
Requesting mirroring to get kernel commit dc5ff2b1d66f21c27a4c37236636dff6946437e4 backported to the Ubuntu kernel.

bugproxy (bugproxy) wrote : dmesg, var_log_msgs log, dmesg and qemu logs of lotg5 at lotkvm

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-142781 severity-critical targetmilestone-inin---
bugproxy (bugproxy) wrote : sosreport of lotkvm

Default Comment by Bridge

bugproxy (bugproxy) wrote : xmon log

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → linux (Ubuntu)
Tim Gardner (timg-tpi) wrote :

git describe --contains dc5ff2b1d66f21c27a4c37236636dff6946437e4
v4.8-rc1~22^2~13
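
For readers unfamiliar with the notation, the describe output above uses standard git revision suffixes: ~N walks N first-parent generations back, and ^N selects the Nth parent of a merge. A small sketch that splits such a string into the tag and the walk from it (the helper name parse_contains is an assumption for illustration, not a git API):

```python
import re


def parse_contains(desc):
    """Split 'tag~a^b~c' into the tag name and a list of (operator, count) steps."""
    m = re.fullmatch(r"(?P<tag>[^~^]+)(?P<path>(?:[~^]\d*)*)", desc)
    if m is None:
        raise ValueError(f"unexpected describe output: {desc!r}")
    # A bare '~' or '^' without a number means 1 in git's revision syntax.
    steps = [(op, int(n) if n else 1)
             for op, n in re.findall(r"([~^])(\d*)", m["path"])]
    return m["tag"], steps


tag, steps = parse_contains("v4.8-rc1~22^2~13")
print(tag)
for op, count in steps:
    if op == "~":
        print(f"  go back {count} first-parent commit(s)")
    else:
        print(f"  take parent number {count} of that merge")
```

The part that matters for backporting is the tag: the commit is contained in v4.8-rc1, so any kernel based on 4.8 or later already has it.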

Changed in linux (Ubuntu Yakkety):
assignee: Taco Screen team (taco-screen-team) → nobody
status: New → Fix Released
Tim Gardner (timg-tpi) wrote :
Changed in linux (Ubuntu Xenial):
assignee: nobody → Tim Gardner (timg-tpi)
status: New → In Progress
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Tim Gardner (timg-tpi) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
Sandhya Venugopala (vsandhya) wrote :

Verification done on Xenial

bugproxy (bugproxy)
tags: added: targetmilestone-inin16041
removed: targetmilestone-inin---
bugproxy (bugproxy) wrote : chig5 logs

------- Comment (attachment only) From <email address hidden> 2016-09-19 06:39 EDT-------

Tim Gardner (timg-tpi)
tags: added: verification-done-xenial
removed: verification-needed-xenial
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-38.57

---------------
linux (4.4.0-38.57) xenial; urgency=low

  [ Tim Gardner ]

  * Release Tracking Bug
    - LP: #1620658

  * CIFS client: access problems after updating to kernel 4.4.0-29-generic
    (LP: #1612135)
    - Revert "UBUNTU: SAUCE: (namespace) Bypass sget() capability check for nfs"
    - fs: Call d_automount with the filesystems creds

  * apt-key add fails in overlayfs (LP: #1618572)
    - SAUCE: overlayfs: fix regression in whiteout detection

linux (4.4.0-37.56) xenial; urgency=low

  [ Tim Gardner ]

  * Release Tracking Bug
    - LP: #1618040

  * [Feature] Instruction decoder support for new SKX instructions- AVX512
    (LP: #1591655)
    - x86/insn: perf tools: Fix vcvtph2ps instruction decoding
    - x86/insn: Add AVX-512 support to the instruction decoder
    - perf tools: Add AVX-512 support to the instruction decoder used by Intel PT
    - perf tools: Add AVX-512 instructions to the new instructions test

  * [Ubuntu 16.04] FCoE Lun not visible in OS with inbox driver - Issue with
    ioremap() call on 32bit kernel (LP: #1608652)
    - lpfc: Correct issue with ioremap() call on 32bit kernel

  * [Feature] turbostat support for Skylake-SP server (LP: #1591802)
    - tools/power turbostat: decode more CPUID fields
    - tools/power turbostat: CPUID(0x16) leaf shows base, max, and bus frequency
    - tools/power turbostat: decode HWP registers
    - tools/power turbostat: Decode MSR_MISC_PWR_MGMT
    - tools/power turbostat: allow sub-sec intervals
    - tools/power turbostat: Intel Xeon x200: fix erroneous bclk value
    - tools/power turbostat: Intel Xeon x200: fix turbo-ratio decoding
    - tools/power turbostat: re-name "%Busy" field to "Busy%"
    - tools/power turbostat: add --out option for saving output in a file
    - tools/power turbostat: fix compiler warnings
    - tools/power turbostat: make fewer systems calls
    - tools/power turbostat: show IRQs per CPU
    - tools/power turbostat: show GFXMHz
    - tools/power turbostat: show GFX%rc6
    - tools/power turbostat: detect and work around syscall jitter
    - tools/power turbostat: indicate SMX and SGX support
    - tools/power turbostat: call __cpuid() instead of __get_cpuid()
    - tools/power turbostat: correct output for MSR_NHM_SNB_PKG_CST_CFG_CTL dump
    - tools/power turbostat: bugfix: TDP MSRs print bits fixing
    - tools/power turbostat: SGX state should print only if --debug
    - tools/power turbostat: print IRTL MSRs
    - tools/power turbostat: initial BXT support
    - tools/power turbostat: decode BXT TSC frequency via CPUID
    - tools/power turbostat: initial SKX support

  * [BYT] display hotplug doesn't work on console (LP: #1616894)
    - drm/i915/vlv: Make intel_crt_reset() per-encoder
    - drm/i915/vlv: Reset the ADPA in vlv_display_power_well_init()
    - drm/i915/vlv: Disable HPD in valleyview_crt_detect_hotplug()
    - drm/i915: Enable polling when we don't have hpd

  * [Feature]intel_idle enabling on Broxton-P (LP: #1520446)
    - intel_idle: add BXT support

  * [Feature] EDAC: Update driver for SKX-SP (LP: #1591815)
    - [Config] CONFIG_EDAC_SKX=m
    - EDAC, skx_edac: Ad...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-09-28 14:06 EDT-------
(In reply to comment #48)
>
> As the issue created in ubuntu14.04.5 (Trusty Tahr), a separate Bug: 146487
> has been opened to track the issue.
>
> Thanks,
> Lata

From bug 146487, this is still being seen in the trusty 4.4.0-38 kernel. Can we get this fix rolled into Trusty also?

bugproxy (bugproxy) wrote : panic dmesg from chig5/Trusty

------- Comment (attachment only) From <email address hidden> 2016-09-28 14:08 EDT-------

bugproxy (bugproxy) wrote : xmon log

Default Comment by Bridge

Tim Gardner (timg-tpi) wrote :

In response to comment #10, commit dc5ff2b1d66f21c27a4c37236636dff6946437e4 ('writeback: Write dirty times for WB_SYNC_ALL writeback') was backported to Ubuntu-4.4.0-37.56. If this issue is still seen in 4.4.0-38, then why would it be appropriate to backport that commit to Trusty? (Besides, it doesn't backport to a 3.13 kernel.)
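
The version argument above can be made mechanical. This is a simplified sketch: it compares only the base version and the ABI number, ignores the upload number after the last dot, and does not implement full Debian version ordering. The first-fixed version is the one stated in this comment; the helper names are assumptions.

```python
import re

# First Xenial kernel carrying the backport, per the comment above.
FIX_FIRST_IN = "4.4.0-37.56"


def parse_abi(version):
    """'4.4.0-38.57' or '4.4.0-38-generic' -> ((4, 4, 0), 38)."""
    m = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", version)
    if m is None:
        raise ValueError(f"unrecognised kernel version: {version!r}")
    return (int(m[1]), int(m[2]), int(m[3])), int(m[4])


def includes_fix(version, first_fixed=FIX_FIRST_IN):
    """True if `version` is in the same base series and at or past the fixed ABI."""
    base, abi = parse_abi(version)
    fixed_base, fixed_abi = parse_abi(first_fixed)
    return base == fixed_base and abi >= fixed_abi


for kernel in ("4.4.0-24-generic", "4.4.0-37.56", "4.4.0-38.57"):
    print(kernel, includes_fix(kernel))
```

By this check a 4.4.0-38 kernel already contains the commit, so a crash on that kernel would have to have a different cause.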

bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-10-28 11:21 EDT-------
What's the next step for this bug?

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-11-07 17:40 EDT-------
Lata informed me that she was not able to reproduce this bug. Closing.
