Ubuntu 17.04: Guest crashed @writeback_sb_inodes+0x310/0x590

Bug #1702998 reported by bugproxy
This bug affects 1 person
Affects                           Status     Importance  Assigned to            Milestone
The Ubuntu-power-systems project  Won't Fix  High        Canonical Kernel Team
linux (Ubuntu)                    Won't Fix  High        Unassigned
Zesty                             Won't Fix  High        Unassigned

Bug Description

== Comment: #0 - Lata Kuntal <email address hidden> - 2017-03-03 00:50:54 ==
An Ubuntu 17.04 guest dropped into xmon after crashing at writeback_sb_inodes+0x310/0x590.
The guest has an XFS rootfs and an NPIV disk. It crashed after 30+ hours of BASE and NFS stress testing.

Crash logs
=======
root@guskvm:~# virsh console gusg1 --force
Connected to domain gusg1
Escape character is ^]

0:mon>
0:mon> t
[c0000000a4bc7940] c00000000036f790 writeback_sb_inodes+0x310/0x590
[c0000000a4bc7a50] c00000000036faf4 __writeback_inodes_wb+0xe4/0x150
[c0000000a4bc7ab0] c00000000036ff1c wb_writeback+0x2cc/0x440
[c0000000a4bc7b80] c000000000370c30 wb_workfn+0x150/0x560
[c0000000a4bc7c90] c0000000000ed8c0 process_one_work+0x2b0/0x5a0
[c0000000a4bc7d20] c0000000000edc58 worker_thread+0xa8/0x650
[c0000000a4bc7dc0] c0000000000f67b4 kthread+0x154/0x1a0
[c0000000a4bc7e30] c00000000000b4e8 ret_from_kernel_thread+0x5c/0x74
0:mon> r
R00 = c00000000036f790 R16 = c0000000eca70300
R01 = c0000000a4bc78e0 R17 = c0000000f7035240
R02 = c00000000143c900 R18 = 0000000000000000
R03 = c0000000f7035150 R19 = 0000000000000000
R04 = 0000000000000019 R20 = c0000000a4bc4000
R05 = 0000000000000100 R21 = ffffffffffffff7f
R06 = 0000000000000000 R22 = c00000000433d758
R07 = 0000000000000000 R23 = c00000000433d738
R08 = 0000000000034995 R24 = 0000000000000000
R09 = 0000000000000000 R25 = 0000000000000000
R10 = 0000000080000000 R26 = c0000000f70351d8
R11 = c0000000a4bc7a40 R27 = 0000000000000000
R12 = 0000000000002200 R28 = 0000000000000001
R13 = c00000000fb80000 R29 = c00000000433d728
R14 = 0000000000000000 R30 = c0000000f7035150
R15 = c0000000f70351d8 R31 = 0000000000000000
pc = c00000000036c120 locked_inode_to_wb_and_lock_list+0x50/0x290
cfar= c0000000000b2a14 kvmppc_save_tm+0x168/0x16c
lr = c00000000036f790 writeback_sb_inodes+0x310/0x590
msr = 8000000000009033 cr = 24002482
ctr = c000000000381e30 xer = 0000000000000000 trap = 300
dar = 0000000000000000 dsisr = 40000000
0:mon> e
cpu 0x0: Vector: 300 (Data Access) at [c0000000a4bc7660]
    pc: c00000000036c120: locked_inode_to_wb_and_lock_list+0x50/0x290
    lr: c00000000036f790: writeback_sb_inodes+0x310/0x590
    sp: c0000000a4bc78e0
   msr: 8000000000009033
   dar: 0
 dsisr: 40000000
  current = 0xc0000000fbe96000
  paca = 0xc00000000fb80000 softe: 0 irq_happened: 0x01
    pid = 17305, comm = kworker/u16:0
Linux version 4.10.0-8-generic (buildd@bos01-ppc64el-001) (gcc version 6.3.0 20161229 (Ubuntu 6.3.0-2ubuntu1) ) #10-Ubuntu SMP Mon Feb 13 14:00:06 UTC 2017 (Ubuntu 4.10.0-8.10-generic 4.10.0-rc8)
0:mon> d
0000000000000000 **************** **************** | |
0:mon>

Host and guest kernel build
=====================
4.10.0-8-generic

OPAL firmware version
----------------------------------------
  T side : FW860.20 (SV860_078)
  Boot side : FW860.20 (SV860_078)

== Comment: #4 - VIPIN K. PARASHAR <email address hidden> - 2017-03-03 02:55:20 ==
[140071.761707] Adding 153536k swap on /dev/loop0. Priority:-2 extents:1 across:153536k FS
[140072.153143] Adding 153472k swap on /dev/loop0. Priority:-2 extents:1 across:153472k FS
[140072.441833] Unable to handle kernel paging request for data at address 0x00000000
[140072.442064] Faulting instruction address: 0xc00000000036c120
0:mon>
0:mon> S
msr = 8000000000001033 sprg0 = 0000000000000000
pvr = 00000000004b0201 sprg1 = c00000000fb80000
dec = 00000000b56746ff sprg2 = c00000000fb80000
sp = c0000000a4bc7100 sprg3 = 0000000000000000
toc = c00000000143c900 dar = 0000000000000400
srr0 = 000000000008c59c srr1 = 0000000000001033 dsisr = 40000000
dscr = 0000000000000000 ppr = 0000000000000000 pir = 00000030
dpdes = 0000000000000000 tir = 0000000000000000 cir = 00000000
fscr = 0000000000000180 tar = 0000000000000000 pspb = 00000000
mmcr0 = 0000000080000000 mmcr1 = 0000000000000000 mmcr2 = 0000000000000000
pmc1 = 00000000 pmc2 = 00000000 pmc3 = 00000000 pmc4 = 00000000
mmcra = 0000000000000000 siar = 0000000000000000 pmc5 = b9ad0e28
sdar = 0000000000000000 sier = 0000000000000000 pmc6 = 7f0fdfbe
ebbhr = 0000000000000000 ebbrr = 0000000000000000 bescr = 0000000000000000
0:mon>

The crash is due to the kernel hitting a DSI (data storage interrupt) while executing the locked_inode_to_wb_and_lock_list() routine.
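
For readers less familiar with the xmon fields: trap 300 is the data storage interrupt, dar holds the faulting address, and dsisr describes the kind of fault. The following is a minimal sketch of decoding the dsisr value reported here (0x40000000). The bit values are quoted from memory of arch/powerpc/include/asm/reg.h and should be checked against the kernel tree in use. `struct dsi_info`, `decode_dsisr()`, and `check_reported_dsisr()` are hypothetical helpers written for this note, not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit values as recalled from arch/powerpc/include/asm/reg.h (assumption). */
#define DSISR_NOHPTE    0x40000000u /* no translation found for the address */
#define DSISR_PROTFAULT 0x08000000u /* access violated page protection */
#define DSISR_ISSTORE   0x02000000u /* faulting access was a store */

/* Hypothetical helper type for this sketch, not a kernel structure. */
struct dsi_info {
    bool no_translation;
    bool protection_fault;
    bool is_store;
};

/* Split a DSISR value into the three flags of interest. */
static struct dsi_info decode_dsisr(uint32_t dsisr)
{
    struct dsi_info info = {
        .no_translation   = (dsisr & DSISR_NOHPTE) != 0,
        .protection_fault = (dsisr & DSISR_PROTFAULT) != 0,
        .is_store         = (dsisr & DSISR_ISSTORE) != 0,
    };
    return info;
}

/* The value from this report: an unmapped address accessed by a load. */
static void check_reported_dsisr(void)
{
    struct dsi_info info = decode_dsisr(0x40000000u);
    assert(info.no_translation && !info.protection_fault && !info.is_store);
}
```

Under these assumptions, dsisr = 40000000 together with dar = 0 reads as a load from the unmapped address 0, which is the usual signature of a NULL pointer dereference.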

== Comment: #8 - VIPIN K. PARASHAR <email address hidden> - 2017-03-03 05:07:03 ==
It's crashing in fs/fs-writeback.c:

static struct bdi_writeback *
locked_inode_to_wb_and_lock_list(struct inode *inode)
        __releases(&inode->i_lock)
        __acquires(&wb->list_lock)
{
        while (true) {
                struct bdi_writeback *wb = inode_to_wb(inode);

                /*
                 * inode_to_wb() association is protected by both
                 * @inode->i_lock and @wb->list_lock but list_lock nests
                 * outside i_lock. Drop i_lock and verify that the
                 * association hasn't changed after acquiring list_lock.
                 */
                wb_get(wb);     /* <----------- crash occurs here */
                spin_unlock(&inode->i_lock);
                spin_lock(&wb->list_lock);
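
A fault address of exactly 0 at the marked call is what a NULL `wb` would produce if the first thing `wb_get()` does is read a member located at offset 0. The sketch below models that in user space. It assumes that `wb_get()` starts by reading `wb->bdi` and that `bdi` is the first member of `struct bdi_writeback`. Both are recollections of the 4.10 sources, not something shown in this report. `struct wb_model` and the helper functions are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct backing_dev_info;            /* opaque in this model */

/* Model of the start of struct bdi_writeback; the layout is an assumption. */
struct wb_model {
    struct backing_dev_info *bdi;   /* parent bdi, assumed to be the first member */
    unsigned long state;
};

/* Address that a read of wb->bdi would touch for a given wb pointer value. */
static uintptr_t bdi_load_address(uintptr_t wb)
{
    return wb + offsetof(struct wb_model, bdi);
}

/* With wb == NULL the load targets address 0, matching "dar: 0" in xmon. */
static void check_null_wb_faults_at_zero(void)
{
    assert(bdi_load_address(0) == 0);
}
```

If the member were at a non-zero offset, the faulting address would be that small offset instead of 0, so the observed dar is consistent with, but does not by itself prove, a NULL `wb`.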

bugproxy (bugproxy) wrote : host(dmesg,var/log/syslog) guest(xmon > dl o/p)

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-152231 severity-critical targetmilestone-inin1710
bugproxy (bugproxy) wrote : Guest dmesg logs - xmon

Default Comment by Bridge

bugproxy (bugproxy) wrote : Guest - xmon data

Default Comment by Bridge

bugproxy (bugproxy) wrote : Host - sosreport

Default Comment by Bridge

Changed in ubuntu:
assignee: nobody → Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage)
affects: ubuntu → linux (Ubuntu)
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
assignee: nobody → Canonical Kernel Team (canonical-kernel-team)
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2017-07-07 17:18 EDT-------
This is the same issue being debugged under LTC bug 149014 / LP1659111.

tags: added: kernel-da-key
Joseph Salisbury (jsalisbury) wrote :

Is there a way to reproduce this bug, or was it a one-time event? Is there a known patch available for this bug?

Changed in linux (Ubuntu):
importance: Undecided → High
status: New → Triaged
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: New → Triaged
importance: Undecided → High
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-07-18 01:04 EDT-------
I believe the following two commits to the "writeback" code are required to fix this bug:

commit 03e262798884b0a5f948b17433afd80606cb3497
Author: Jan Kara <email address hidden>
Date: Thu Mar 23 01:36:53 2017 +0100

block: Fix bdi assignment to bdev inode when racing with disk delete

When disk->fops->open() in __blkdev_get() returns -ERESTARTSYS, we
restart the process of opening the block device. However we forget to
switch bdev->bd_bdi back to noop_backing_dev_info and as a result bdev
inode will be pointing to a stale bdi. Fix the problem by setting
bdev->bd_bdi later when __blkdev_get() is already guaranteed to succeed.
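
The ordering problem described in this commit message can be sketched in user space as follows. This is a simplified model, not the kernel code. `struct bdi_model`, `struct bdev_model`, `get_old()`, `get_fixed()`, and `MODEL_ERESTARTSYS` are hypothetical names introduced here, and the open result is passed in as a parameter instead of calling a real driver.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_ERESTARTSYS 512   /* kernel-internal errno value, defined here for the model */

struct bdi_model { int id; };
struct bdev_model { struct bdi_model *bd_bdi; };

static struct bdi_model noop_bdi = { 0 };   /* stands in for noop_backing_dev_info */

/* Old ordering: publish the disk's bdi first, then attempt the open. */
static int get_old(struct bdev_model *bdev, struct bdi_model *disk_bdi, int open_ret)
{
    assert(bdev != NULL);
    bdev->bd_bdi = disk_bdi;
    if (open_ret != 0)
        return open_ret;        /* bd_bdi is left pointing at disk_bdi */
    return 0;
}

/* Fixed ordering: publish the bdi only once the open can no longer fail. */
static int get_fixed(struct bdev_model *bdev, struct bdi_model *disk_bdi, int open_ret)
{
    assert(bdev != NULL);
    if (open_ret != 0)
        return open_ret;        /* bd_bdi is untouched on failure */
    bdev->bd_bdi = disk_bdi;
    return 0;
}
```

In the model, a failed `get_old()` leaves `bd_bdi` pointing at a bdi that may go away with the disk, while a failed `get_fixed()` leaves it at the no-op bdi, which mirrors the stale-pointer window the commit closes.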

commit f759741d9d913eb57784a94b9bca78b376fc26a9
Author: Jan Kara <email address hidden>
Date: Thu Mar 23 01:37:00 2017 +0100

block: Fix oops in locked_inode_to_wb_and_lock_list()

When block device is closed, we call inode_detach_wb() in __blkdev_put()
which sets inode->i_wb to NULL. That is contrary to expectations that
inode->i_wb stays valid once set during the whole inode's lifetime and
leads to oops in wb_get() in locked_inode_to_wb_and_lock_list() because
inode_to_wb() returned NULL.

The reason why we called inode_detach_wb() is not valid anymore though.
BDI is guaranteed to stay along until we call bdi_put() from
bdev_evict_inode() so we can postpone calling inode_detach_wb() to that
moment.

Also add a warning to catch if someone uses inode_detach_wb() in a
dangerous way.
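
The lifetime change described in this commit message can be illustrated with a small user-space model. It is only a sketch under stated assumptions: `struct wb_model`, `struct inode_model`, and the `put_old()`, `put_new()`, `evict_new()`, and `writeback_get_wb()` functions are hypothetical stand-ins for the block device close path, the inode eviction path, and the writeback worker.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct wb_model { int refs; };
struct inode_model { struct wb_model *i_wb; };

/* Stand-in for inode_detach_wb(): drop the inode's wb association. */
static void detach_wb(struct inode_model *inode)
{
    inode->i_wb = NULL;
}

/* Old behaviour: closing the block device detaches the wb immediately. */
static void put_old(struct inode_model *inode)
{
    detach_wb(inode);
}

/* New behaviour: closing leaves i_wb alone; eviction detaches it. */
static void put_new(struct inode_model *inode)
{
    (void)inode;
}

static void evict_new(struct inode_model *inode)
{
    detach_wb(inode);
}

/* Stand-in for the writeback path: taking a reference needs a non-NULL i_wb. */
static bool writeback_get_wb(struct inode_model *inode)
{
    assert(inode != NULL);
    if (inode->i_wb == NULL)
        return false;           /* the kernel would oops at this point */
    inode->i_wb->refs++;
    return true;
}
```

With the old ordering, a writeback pass that runs after close but before eviction finds `i_wb == NULL`. With the new ordering, `i_wb` stays valid for the whole lifetime of the inode, which is the expectation the commit message refers to.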

Manoj Iyer (manjo)
tags: added: triage-g
Manoj Iyer (manjo)
tags: added: triage-r
removed: triage-g
Manoj Iyer (manjo)
Changed in linux (Ubuntu):
assignee: Ubuntu on IBM Power Systems Bug Triage (ubuntu-power-triage) → Canonical Kernel Team (canonical-kernel-team)
Changed in linux (Ubuntu Zesty):
status: New → In Progress
importance: Undecided → High
assignee: nobody → Joseph Salisbury (jsalisbury)
Changed in linux (Ubuntu):
assignee: Canonical Kernel Team (canonical-kernel-team) → Joseph Salisbury (jsalisbury)
status: Triaged → In Progress
Joseph Salisbury (jsalisbury) wrote :

I'll build a test kernel and post a link to it shortly.

Joseph Salisbury (jsalisbury) wrote :

I built a Zesty test kernel with the two requested commits, 03e262798884b0a5f948b17433afd80606cb3497 and f759741d9d913eb57784a94b9bca78b376fc26a9. They also required a prerequisite commit to build properly: b1d2dc5659b41741f5a29b2ade76ffb4e5bb13d8.

The test kernel can be downloaded from:
http://kernel.ubuntu.com/~jsalisbury/lp1702998/

Can you test this kernel and see if it resolves this bug?

Changed in ubuntu-power-systems:
status: Triaged → Incomplete
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-09-11 14:23 EDT-------
Lata, Chandan,

Canonical created a special kernel with this fix. They need us to test it before integrating the patch in the kernel. Could you please test it and let them know the result of this one-off kernel?

Manoj Iyer (manjo)
tags: added: triage-g
removed: triage-r
Frank Heimes (fheimes)
Changed in ubuntu-power-systems:
status: Incomplete → In Progress
status: In Progress → Incomplete
Manoj Iyer (manjo) wrote :

IBM, could you please provide test results for the kernel mentioned in comment #9?

Manoj Iyer (manjo) wrote :

Zesty reached end of life on Jan 13 2018. Please re-test with Artful and reopen this bug if you are able to reproduce it.

Changed in linux (Ubuntu Zesty):
status: In Progress → Won't Fix
Manoj Iyer (manjo) wrote :

It appears these patches should be available in 4.13 or later versions of the kernel. Please retest with Artful and re-open this bug if you are able to reproduce this issue.

Changed in linux (Ubuntu):
status: In Progress → Won't Fix
Changed in ubuntu-power-systems:
status: Incomplete → Won't Fix