ISST-LTE: system dropped into xmon at pcibios_release_device+0x5c/0x80 during running dlpar test on monklp3

Bug #1618151 reported by bugproxy on 2016-08-29
Affects          Importance   Assigned to
linux (Ubuntu)   Medium       Unassigned
Trusty           Medium       Tim Gardner
Xenial           Medium       Tim Gardner
Yakkety          Medium       Unassigned

Bug Description

monklp3 is installed with Ubuntu 14.04.4. The system crashed and dropped into xmon at pcibios_release_device+0x5c/0x80 while running the DLPAR test (CPU, MEM and SLOT).

The output from vterm:

[ 1333.379900] lpfc 0005:a0:00.1: 2:1303 Link Up Event x1 received Data: x1 x1 x20 x0 x0 x0 0
[ 1334.522315] Unable to handle kernel paging request for instruction fetch
[ 1334.522340] Faulting instruction address: 0x2f30613a35303030
cpu 0x42: Vector: 400 (Instruction Access) at [c0000002b35fedd0]
    pc: 2f30613a35303030
    lr: c000000000047a9c: pcibios_release_device+0x5c/0x80
    sp: c0000002b35ff050
   msr: 8000000140009033
  current = 0xc0000002b35290d0
  paca = 0xc000000007b07300 softe: 0 irq_happened: 0x01
    pid = 5756, comm = multipathd
enter ? for help
[link register ] c000000000047a9c pcibios_release_device+0x5c/0x80
[c0000002b35ff050] c000000000047a78 pcibios_release_device+0x38/0x80 (unreliable)
[c0000002b35ff080] c000000000585ed4 pci_release_dev+0x84/0xd0
[c0000002b35ff0b0] c00000000066f210 device_release+0x60/0xf0
[c0000002b35ff130] c000000000532a44 kobject_cleanup+0xd4/0x240
[c0000002b35ff1b0] c00000000066f8a4 put_device+0x34/0x50
[c0000002b35ff1e0] c0000000006f3f78 scsi_host_dev_release+0x118/0x180
[c0000002b35ff220] c00000000066f210 device_release+0x60/0xf0
[c0000002b35ff2a0] c000000000532a44 kobject_cleanup+0xd4/0x240
[c0000002b35ff320] c00000000066f8a4 put_device+0x34/0x50
[c0000002b35ff350] d00000000533096c fc_rport_dev_release+0x2c/0x50 [scsi_transport_fc]
[c0000002b35ff380] c00000000066f210 device_release+0x60/0xf0
[c0000002b35ff400] c000000000532a44 kobject_cleanup+0xd4/0x240
[c0000002b35ff480] c00000000066f8a4 put_device+0x34/0x50
[c0000002b35ff4b0] c000000000700550 scsi_target_dev_release+0x40/0x60
[c0000002b35ff4e0] c00000000066f210 device_release+0x60/0xf0
[c0000002b35ff560] c000000000532a44 kobject_cleanup+0xd4/0x240
[c0000002b35ff5e0] c00000000066f8a4 put_device+0x34/0x50
[c0000002b35ff610] c000000000704e68 scsi_device_dev_release_usercontext+0x178/0x1b0
[c0000002b35ff670] c0000000000d69d4 execute_in_process_context+0xa4/0xd0
[c0000002b35ff6a0] c000000000704cd4 scsi_device_dev_release+0x34/0x50
[c0000002b35ff6d0] c00000000066f210 device_release+0x60/0xf0
[c0000002b35ff750] c000000000532a44 kobject_cleanup+0xd4/0x240
[c0000002b35ff7d0] c00000000066f8a4 put_device+0x34/0x50
[c0000002b35ff800] c0000000006f1a60 scsi_device_put+0x40/0x60
[c0000002b35ff830] c000000000716568 scsi_disk_put+0x58/0x90
[c0000002b35ff870] c000000000311398 __blkdev_put+0x278/0x2e0
[c0000002b35ff8f0] c0000000008861bc dm_put_table_device+0xcc/0x140
[c0000002b35ff930] c00000000088b12c dm_put_device+0x9c/0x130
[c0000002b35ff9b0] d000000005f31864 free_priority_group+0xe4/0x140 [dm_multipath]
[c0000002b35ffa10] d000000005f31944 free_multipath+0x84/0xf0 [dm_multipath]
[c0000002b35ffa60] c00000000088c310 dm_table_destroy+0xb0/0x1a0
[c0000002b35ffaf0] c000000000891b9c dev_suspend+0x14c/0x330
[c0000002b35ffb30] c000000000892a8c ctl_ioctl+0x1cc/0x380
[c0000002b35ffd10] c000000000892c78 dm_ctl_ioctl+0x38/0x50
[c0000002b35ffd40] c0000000002d7380 do_vfs_ioctl+0x4f0/0x7c0
[c0000002b35ffde0] c0000000002d7724 SyS_ioctl+0xd4/0xf0
[c0000002b35ffe30] c000000000009204 system_call+0x38/0xb4
--- Exception: c01 (System Call) at 00003fff88031480
SP (3fff87addb20) is in userspace

42:mon> e
cpu 0x42: Vector: 400 (Instruction Access) at [c0000002b35fedd0]
    pc: 2f30613a35303030
    lr: c000000000047a9c: pcibios_release_device+0x5c/0x80
    sp: c0000002b35ff050
   msr: 8000000140009033
  current = 0xc0000002b35290d0
  paca = 0xc000000007b07300 softe: 0 irq_happened: 0x01
    pid = 5756, comm = multipathd
42:mon> t
[link register ] c000000000047a9c pcibios_release_device+0x5c/0x80
[c0000002b35ff050] c000000000047a78 pcibios_release_device+0x38/0x80 (unreliable)
[c0000002b35ff080] c000000000585ed4 pci_release_dev+0x84/0xd0
[c0000002b35ff0b0] c00000000066f210 device_release+0x60/0xf0
[c0000002b35ff130] c000000000532a44 kobject_cleanup+0xd4/0x240
[c0000002b35ff1b0] c00000000066f8a4 put_device+0x34/0x50
[c0000002b35ff1e0] c0000000006f3f78 scsi_host_dev_release+0x118/0x180
[c0000002b35ff220] c00000000066f210 device_release+0x60/0xf0
[c0000002b35ff2a0] c000000000532a44 kobject_cleanup+0xd4/0x240
[c0000002b35ff320] c00000000066f8a4 put_device+0x34/0x50
[c0000002b35ff350] d00000000533096c fc_rport_dev_release+0x2c/0x50 [scsi_transport_fc]
[c0000002b35ff380] c00000000066f210 device_release+0x60/0xf0
[c0000002b35ff400] c000000000532a44 kobject_cleanup+0xd4/0x240
[c0000002b35ff480] c00000000066f8a4 put_device+0x34/0x50
[c0000002b35ff4b0] c000000000700550 scsi_target_dev_release+0x40/0x60
[c0000002b35ff4e0] c00000000066f210 device_release+0x60/0xf0
[c0000002b35ff560] c000000000532a44 kobject_cleanup+0xd4/0x240
[c0000002b35ff5e0] c00000000066f8a4 put_device+0x34/0x50
[c0000002b35ff610] c000000000704e68 scsi_device_dev_release_usercontext+0x178/0x1b0
[c0000002b35ff670] c0000000000d69d4 execute_in_process_context+0xa4/0xd0
[c0000002b35ff6a0] c000000000704cd4 scsi_device_dev_release+0x34/0x50
[c0000002b35ff6d0] c00000000066f210 device_release+0x60/0xf0
[c0000002b35ff750] c000000000532a44 kobject_cleanup+0xd4/0x240
[c0000002b35ff7d0] c00000000066f8a4 put_device+0x34/0x50
[c0000002b35ff800] c0000000006f1a60 scsi_device_put+0x40/0x60
[c0000002b35ff830] c000000000716568 scsi_disk_put+0x58/0x90
[c0000002b35ff870] c000000000311398 __blkdev_put+0x278/0x2e0
[c0000002b35ff8f0] c0000000008861bc dm_put_table_device+0xcc/0x140
[c0000002b35ff930] c00000000088b12c dm_put_device+0x9c/0x130
[c0000002b35ff9b0] d000000005f31864 free_priority_group+0xe4/0x140 [dm_multipath]
[c0000002b35ffa10] d000000005f31944 free_multipath+0x84/0xf0 [dm_multipath]
[c0000002b35ffa60] c00000000088c310 dm_table_destroy+0xb0/0x1a0
[c0000002b35ffaf0] c000000000891b9c dev_suspend+0x14c/0x330
[c0000002b35ffb30] c000000000892a8c ctl_ioctl+0x1cc/0x380
[c0000002b35ffd10] c000000000892c78 dm_ctl_ioctl+0x38/0x50
[c0000002b35ffd40] c0000000002d7380 do_vfs_ioctl+0x4f0/0x7c0
[c0000002b35ffde0] c0000000002d7724 SyS_ioctl+0xd4/0xf0
[c0000002b35ffe30] c000000000009204 system_call+0x38/0xb4
--- Exception: c01 (System Call) at 00003fff88031480
SP (3fff87addb20) is in userspace
42:mon> r
R00 = c000000000047a78 R16 = 00003fff8824eea8
R01 = c0000002b35ff050 R17 = 00003fff8824eea8
R02 = c0000000014fdf00 R18 = 00003fff8824eea8
R03 = c00000028118b000 R19 = 00003fff8824eea8
R04 = 0000000000000001 R20 = 00003fff880f043c
R05 = c0000002d2052cc0 R21 = c000000001460d90
R06 = c00000000003ad24 R22 = 0000000000000000
R07 = 0000000080000000 R23 = 0000000000000001
R08 = 0000000000000337 R24 = 0000000000000083
R09 = 2f30613a35303030 R25 = c0000002a2d60800
R10 = 0000000000000000 R26 = c00000028c268c28
R11 = c0000002f3b6d300 R27 = 0000000000100100
R12 = 2f30613a35303030 R28 = c0000000014700c0
R13 = c000000007b07300 R29 = c0000002a2f6ba00
R14 = 00003fff8824eea8 R30 = c0000002fe05dc00
R15 = 00003fff8824eea8 R31 = c00000028118b000
pc = 2f30613a35303030
cfar= c000000000008468 slb_miss_realmode+0x50/0x78
lr = c000000000047a9c pcibios_release_device+0x5c/0x80
msr = 8000000140009033 cr = 28008484
ctr = 2f30613a35303030 xer = 0000000000000000 trap = 400
42:mon>
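
A note on the faulting address: pc, ctr, R09 and R12 all hold 2f30613a35303030, and every byte of that value is printable ASCII, which is unusual for a code address. The following is a minimal sketch, not part of the original report, that decodes the value in both byte orders. It assumes the value was loaded from memory as a single 64-bit word; the report does not state the endianness of the install, so both readings are shown. The helper name as_ascii is illustrative only.

```python
# Faulting instruction address copied from the xmon output above.
FAULT_ADDR = 0x2F30613A35303030


def as_ascii(value: int, byteorder: str) -> str:
    """Interpret a 64-bit value as 8 ASCII bytes laid out in the given byte order."""
    return value.to_bytes(8, byteorder).decode("ascii")


# Memory contents if the word was loaded on a little-endian system (assumption).
little = as_ascii(FAULT_ADDR, "little")
# Memory contents if the word was loaded on a big-endian system (assumption).
big = as_ascii(FAULT_ADDR, "big")

print("little-endian reading:", little)
print("big-endian reading:   ", big)
```

Read as little-endian, the bytes spell a fragment that matches the PCI address prefix 0005:a0 seen in the lpfc message at the top of the log. That would be consistent with the function pointer having been read from memory that was already freed and reused for a string, although the log alone does not prove it.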

Hi Canonical,

Can you please include this patch for 16.04.x and 14.04.x?

It's a very contained fix -- impacts only powerpc, pseries DLPAR (hotplug remove) of PHBs -- and it fixes a crash.

It's present in this pull request from Ben H. to Linus for linux 4.8 [1], and the commit is here [2].

Thanks!

Links:
[1] https://lkml.org/lkml/2016/8/29/7
"[GIT PULL] Please pull powerpc/linux.git powerpc-4.8-4 tag"
[2] https://git.kernel.org/cgit/linux/kernel/git/powerpc/linux.git/commit/arch/powerpc/kernel?h=powerpc-4.8-4&id=2dd9c11b9d4dfbd6c070eab7b81197f65e82f1a0
"powerpc/pseries: use pci_host_bridge.release_fn() to kfree(phb)"

bugproxy (bugproxy) wrote : xmon log

Default Comment by Bridge

tags: added: architecture-ppc64 bugnameltc-136042 severity-high targetmilestone-inin14045
Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → linux (Ubuntu)
Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Canonical Kernel Team (canonical-kernel-team)
Tim Gardner (timg-tpi) wrote :
Changed in linux (Ubuntu Trusty):
assignee: nobody → Tim Gardner (timg-tpi)
status: New → In Progress
Changed in linux (Ubuntu Xenial):
assignee: nobody → Tim Gardner (timg-tpi)
status: New → In Progress
Tim Gardner (timg-tpi) wrote :

Merged in 4.8

Changed in linux (Ubuntu Yakkety):
assignee: Canonical Kernel Team (canonical-kernel-team) → nobody
status: New → Fix Released
Tim Gardner (timg-tpi) on 2016-09-01
Changed in linux (Ubuntu Trusty):
status: In Progress → Fix Committed
Tim Gardner (timg-tpi) wrote :

Whoops.

Changed in linux (Ubuntu Trusty):
status: Fix Committed → In Progress
Tim Gardner (timg-tpi) on 2016-09-01
Changed in linux (Ubuntu Trusty):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Xenial):
status: In Progress → Fix Committed
Changed in linux (Ubuntu Trusty):
importance: Undecided → Medium
Changed in linux (Ubuntu Xenial):
importance: Undecided → Medium
Changed in linux (Ubuntu Yakkety):
importance: Undecided → Medium
tags: added: kernel-da-key
Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial

------- Comment From <email address hidden> 2016-09-27 02:59 EDT-------
(In reply to comment #44)
> This bug is awaiting verification that the kernel in -proposed solves the
> problem. Please test the kernel and update this bug with the results. If the
> problem is solved, change the tag 'verification-needed-xenial' to
> 'verification-done-xenial'.
>
> If verification is not done by 5 working days from today, this fix will be
> dropped from the source code, and this bug will be closed.
>
> See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to
> enable and use -proposed. Thank you!

Hi,
It has been a long time since this bug was opened, and the monklp3 setup that hit this bug is no longer available to us. We'll reopen this bug if we hit the issue again. Thanks.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-09-27 14:58 EDT-------
I'll verify this.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-09-27 19:58 EDT-------
This is verified successfully.
Updating tags.

The controller release function is only called when the last reference to one of its devices (e.g., disks) is dropped.
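
The statement above can be illustrated with a toy model. This is a minimal sketch in Python, not the kernel code: the class and function names (Controller, HostBridge, free_controller_deferred) are hypothetical, and the real kernel holds these references through a chain of parent devices rather than directly on the bridge. It only shows the general pattern of freeing the controller from a release callback that runs when the reference count reaches zero.

```python
class Controller:
    """Hypothetical stand-in for the PCI host controller (PHB) structure."""

    def __init__(self, domain: int) -> None:
        self.domain = domain
        self.freed = False


class HostBridge:
    """Toy refcounted object that calls a release callback on the last put."""

    def __init__(self, controller: Controller, release_fn) -> None:
        self.controller = controller
        self.release_fn = release_fn
        self.refcount = 1  # initial reference, held by the bus itself

    def get(self) -> "HostBridge":
        self.refcount += 1
        return self

    def put(self) -> None:
        self.refcount -= 1
        if self.refcount == 0:
            # Only now is it safe to free the controller.
            self.release_fn(self.controller)


def free_controller_deferred(controller: Controller) -> None:
    """Illustrative release callback: marks the controller as freed."""
    controller.freed = True


phb = Controller(domain=33)
bridge = HostBridge(phb, free_controller_deferred)

# Two open disks each pin the bridge (simplified: direct references).
disk_a = bridge.get()
disk_b = bridge.get()

bridge.put()               # hotplug remove drops the bus reference
after_remove = phb.freed   # still referenced by both disks

disk_a.put()               # first disk closed
after_first = phb.freed

disk_b.put()               # last disk closed, release callback runs
after_last = phb.freed

print(after_remove, after_first, after_last)
```

In this model the controller outlives the hotplug remove for as long as any disk still holds a reference, which is the behaviour the commands below check on the real system.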

# echo 'func pcibios_free_controller_deferred +pf' > /sys/kernel/debug/dynamic_debug/control

# lspci | grep Fibre
0021:01:00.0 Fibre Channel: Emulex Corporation Lancer-X: LightPulse Fibre Channel Host Adapter (rev 30)
0021:01:00.1 Fibre Channel: Emulex Corporation Lancer-X: LightPulse Fibre Channel Host Adapter (rev 30)

# ls -ld /sys/block/sd* | grep -m1 0021:01:00.0
<...> /sys/block/sdav -> ../devices/pci0021:01/0021:01:00.0/<...>

# ls -ld /sys/block/sd* | grep -m1 0021:01:00.1
<...> /sys/block/sdaa -> ../devices/pci0021:01/0021:01:00.1/<...>

# cat >/dev/sdav & pid1=$!
# cat >/dev/sdaa & pid2=$!

# dmesg -c >/dev/null

# drmgr -w 5 -d 1 -c phb -s 'PHB 33' -r
Validating PHB DLPAR capability...yes.

# dmesg -c | tail -n3
[11629.473248] iommu: Removing device 0021:01:00.0 from group 1
[11639.480792] pci_bus 0021:01: busn_res: [bus 01-ff] is released
[11639.480874] rpadlpar_io: slot PHB 33 removed

# kill -9 $pid1
# dmesg -c
[11682.255067] scsi 5:0:0:1: rdac: Detached

# kill -9 $pid2
# dmesg -c
[11695.380191] scsi 4:0:1:2: rdac: Detached
[11695.380359] pcibios_free_controller_deferred: domain 33, dynamic 1
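
To tie the last message back to the adapter, note that lspci prints the PCI domain in hexadecimal (0021), while the drmgr slot name uses 33. The sketch below, which is an illustration and not part of the original verification, checks that the two agree. It assumes the debug message prints the domain in decimal; the variable names are illustrative.

```python
import re

# Lines copied from the transcript above.
lspci_line = (
    "0021:01:00.0 Fibre Channel: Emulex Corporation Lancer-X: "
    "LightPulse Fibre Channel Host Adapter (rev 30)"
)
debug_line = "[11695.380359] pcibios_free_controller_deferred: domain 33, dynamic 1"

# lspci shows the PCI domain as the first hexadecimal field.
pci_domain = int(lspci_line.split(":", 1)[0], 16)

# Assumption: the debug message reports the domain as a decimal number.
freed_domain = int(re.search(r"domain (\d+)", debug_line).group(1))

print(pci_domain, freed_domain, pci_domain == freed_domain)
```

Under that assumption, the controller released after the second kill is the one for domain 0021, the PHB that was removed with drmgr.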

tags: added: verification-done-xenial
removed: verification-needed-xenial
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-09-28 10:10 EDT-------
*** Bug 143294 has been marked as a duplicate of this bug. ***

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-42.62

---------------
linux (4.4.0-42.62) xenial; urgency=low

  * Fix GRO recursion overflow for tunneling protocols (LP: #1631287)
    - tunnels: Don't apply GRO to multiple layers of encapsulation.
    - gro: Allow tunnel stacking in the case of FOU/GUE

  * CVE-2016-7039
    - SAUCE: net: add recursion limit to GRO

linux (4.4.0-41.61) xenial; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1628204

  * nvme drive probe failure (LP: #1626894)
    - (fix) NVMe: Don't unmap controller registers on reset

linux (4.4.0-40.60) xenial; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1627074

  * Permission denied in CIFS with kernel 4.4.0-38 (LP: #1626112)
    - Fix memory leaks in cifs_do_mount()
    - Compare prepaths when comparing superblocks
    - SAUCE: Fix regression which breaks DFS mounting

  * Backlight does not change when adjust it higher than 50% after S3
    (LP: #1625932)
    - SAUCE: i915_bpo: drm/i915/backlight: setup and cache pwm alternate
      increment value
    - SAUCE: i915_bpo: drm/i915/backlight: setup backlight pwm alternate
      increment on backlight enable

linux (4.4.0-39.59) xenial; urgency=low

  [ Joseph Salisbury ]

  * Release Tracking Bug
    - LP: #1625303

  * thunder: chip errata w/ multiple CQEs for a TSO packet (LP: #1624569)
    - net: thunderx: Fix for issues with multiple CQEs posted for a TSO packet

  * thunder: faulty TSO padding (LP: #1623627)
    - net: thunderx: Fix for HW issue while padding TSO packet

  * CVE-2016-6828
    - tcp: fix use after free in tcp_xmit_retransmit_queue()

  * Sennheiser Officerunner - cannot get freq at ep 0x83 (LP: #1622763)
    - SAUCE: (no-up) ALSA: usb-audio: Add quirk for sennheiser officerunner

  * Backport E3 Skylake Support in ie31200_edac to Xenial (LP: #1619766)
    - EDAC, ie31200_edac: Add Skylake support

  * Ubuntu 16.04 - Full EEH Recovery Support for NVMe devices (LP: #1602724)
    - SAUCE: nvme: Don't suspend admin queue that wasn't created

  * ISST-LTE:pNV: system ben is hung during ST (nvme) (LP: #1620317)
    - blk-mq: Allow timeouts to run while queue is freezing
    - blk-mq: improve warning for running a queue on the wrong CPU
    - blk-mq: don't overwrite rq->mq_ctx

  * lsattr 32bit does not work on 64bit kernel (Inappropriate ioctl error)
    (LP: #1619918)
    - btrfs: bugfix: handle FS_IOC32_{GETFLAGS, SETFLAGS, GETVERSION} in
      btrfs_ioctl

  * radeon: monitor connected to onboard VGA doesn't work with Xenial
    (LP: #1600092)
    - drm/radeon/dp: add back special handling for NUTMEG

  * initramfs includes qle driver, but not firmware (LP: #1623187)
    - qed: add MODULE_FIRMWARE()

  * [Hyper-V] Rebase Hyper-V to 4.7.2 (stable) (LP: #1616677)
    - hv_netvsc: Implement support for VF drivers on Hyper-V
    - hv_netvsc: Fix the list processing for network change event
    - Drivers: hv: vmbus: Introduce functions for estimating room in the ring
      buffer
    - Drivers: hv: vmbus: Use READ_ONCE() to read variables that are volatile
    - Drivers: hv: vmbus: Export the vmbus_set_event() API
    - lcoking/barriers, arch: Use smp barriers...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released