Kernel OOPS during DLPAR operation with Fibre Channel adapter

Bug #1486180 reported by bugproxy
This bug affects 1 person
Affects          Status         Importance   Assigned to   Milestone
linux (Ubuntu)   Fix Released   Undecided    Tim Gardner
Wily             Fix Released   Undecided    Tim Gardner
Xenial           Fix Released   Undecided    Tim Gardner

Bug Description

-- Problem Description --
Kernel OOPS during DLPAR operation with Fibre Channel adapter

---uname output---
4.1.0-1-generic

---Additional Hardware Info---
Fibre Channel: Emulex Corporation Saturn-X: LightPulse Fibre Channel Host Adapter (rev 03)

Machine Type = POWER8

---Steps to Reproduce---
1) Install Ubuntu 15.10 on a PowerVM LPAR.
2) Configure and start the rtas_errd daemon.
3) Via the HMC, try to add a Fibre Channel adapter using dynamic partitioning.
 During the operation, the following Oops message is observed.

Oops output:

 !!! 00E0806 Fcode, Copyright (c) 2000-2012 Emulex !!! Version 3.10x2

!!! 00E0806 Fcode, Copyright (c) 2000-2012 Emulex !!! Version 3.10x2
[ 8696.808703] PCI host bridge /pci@800000020000020 ranges:
[ 8696.808708] MEM 0x0003ff8400000000..0x0003ff847effffff -> 0x0000000080000000
[ 8696.808716] PCI: I/O resource not set for host bridge /pci@800000020000020 (domain 1)
[ 8696.808761] PCI host bridge to bus 0001:01
[ 8696.808765] pci_bus 0001:01: root bus resource [mem 0x3ff8400000000-0x3ff847effffff] (bus address [0x80000000-0xfeffffff])
[ 8696.808768] pci_bus 0001:01: root bus resource [bus 01-ff]
[ 8696.897390] rpaphp: Slot [U78C7.001.RCH0042-P1-C8] registered
[ 8696.897395] rpadlpar_io: slot PHB 32 added
[ 8696.972155] Emulex LightPulse Fibre Channel SCSI driver 10.5.0.0.
[ 8696.972157] Copyright(c) 2004-2015 Emulex. All rights reserved.
[ 8696.972438] lpfc 0001:01:00.1: enabling device (0140 -> 0142)
[ 8696.976145] Unable to handle kernel paging request for data at address 0x0000000c
[ 8696.976174] Faulting instruction address: 0xc000000000084cc4
[ 8696.976182] Oops: Kernel access of bad area, sig: 11 [#1]
[ 8696.976188] SMP NR_CPUS=2048 NUMA pSeries
[ 8696.976196] Modules linked in: lpfc(+) scsi_transport_fc rpadlpar_io rpaphp rtc_generic pseries_rng autofs4
[ 8696.976220] CPU: 3 PID: 1426 Comm: systemd-udevd Not tainted 4.1.0-1-generic #1~dogfoodv1-Ubuntu
[ 8696.976230] task: c0000003857737e0 ti: c0000000fd08c000 task.ti: c0000000fd08c000
[ 8696.976239] NIP: c000000000084cc4 LR: c000000000084ca8 CTR: 0000000000000000
[ 8696.976247] REGS: c0000000fd08f0f0 TRAP: 0300 Not tainted (4.1.0-1-generic)
[ 8696.976255] MSR: 8000000100009033 <SF,EE,ME,IR,DR,RI,LE> CR: 82228888 XER: 20000000
[ 8696.976278] CFAR: c000000000008468 DAR: 000000000000000c DSISR: 40000000 SOFTE: 1
               GPR00: c000000000084ca8 c0000000fd08f370 c0000000014bda00 0000000000000000
               GPR04: 0000000000000001 c0000000fd08f408 0000000000000003 d000000002c31e60
               GPR08: c0000000013bda00 0000000000000000 c0000003873e6b80 d000000002ca7c98
               GPR12: 0000000000008800 c00000000e831b00 d0000000029421f8 00003ffff8ca4522
               GPR16: c0000000fd08fdc0 c0000000fd08fe04 d000000002941878 c0000000fc8054c0
               GPR20: d000000002380000 d000000002380000 d000000002ccff90 0000000000000000
               GPR24: c00000000165074c c00000038e17e000 c0000000013b5e00 c00000038e17e000
               GPR28: c0000000013b5e28 c00000000a590600 c0000000013b5df0 c0000000013b5e20
[ 8696.976396] NIP [c000000000084cc4] enable_ddw+0x254/0x7b0
[ 8696.976405] LR [c000000000084ca8] enable_ddw+0x238/0x7b0
[ 8696.976411] Call Trace:
[ 8696.976419] [c0000000fd08f370] [c000000000084ca8] enable_ddw+0x238/0x7b0 (unreliable)
[ 8696.976431] [c0000000fd08f4b0] [c0000000000866d8] dma_set_mask_pSeriesLP+0x218/0x2a0
[ 8696.976444] [c0000000fd08f540] [c000000000023528] dma_set_mask+0x58/0xa0
[ 8696.976474] [c0000000fd08f570] [d000000002c71280] lpfc_pci_probe_one+0xb0/0xc50 [lpfc]
[ 8696.976486] [c0000000fd08f610] [c0000000005987fc] local_pci_probe+0x6c/0x140
[ 8696.976497] [c0000000fd08f6a0] [c000000000598a28] pci_device_probe+0x158/0x1e0
[ 8696.976510] [c0000000fd08f700] [c00000000067b744] driver_probe_device+0x1c4/0x5a0
[ 8696.976522] [c0000000fd08f790] [c00000000067bcdc] __driver_attach+0x11c/0x120
[ 8696.976533] [c0000000fd08f7d0] [c00000000067854c] bus_for_each_dev+0x9c/0x110
[ 8696.976544] [c0000000fd08f820] [c00000000067adbc] driver_attach+0x3c/0x60
[ 8696.976555] [c0000000fd08f850] [c00000000067a768] bus_add_driver+0x208/0x320
[ 8696.976565] [c0000000fd08f8e0] [c00000000067c99c] driver_register+0x9c/0x180
[ 8696.976576] [c0000000fd08f950] [c0000000005978ec] __pci_register_driver+0x6c/0x90
[ 8696.976604] [c0000000fd08f990] [d000000002ca7848] lpfc_init+0x17c/0x1d8 [lpfc]
[ 8696.976617] [c0000000fd08fa20] [c00000000000b42c] do_one_initcall+0x12c/0x280
[ 8696.976628] [c0000000fd08faf0] [c000000000a6c7c8] do_init_module+0x98/0x238
[ 8696.976640] [c0000000fd08fb80] [c000000000163fa4] load_module+0x1354/0x14d0
[ 8696.976651] [c0000000fd08fd50] [c0000000001643d0] SyS_finit_module+0xc0/0x120
[ 8696.976662] [c0000000fd08fe30] [c0000000000091fc] system_call+0x38/0xb4
[ 8696.976669] Instruction dump:
[ 8696.976675] 7fa3eb78 38842388 38a10090 38c00003 488345a5 60000000 2fa30000 409e0170
[ 8696.976694] 2fb90000 419e0438 ea7902f0 e93902e8 <83e9000c> 81490008 2f9f0000 7bff0020
[ 8696.976716] ---[ end trace c6f99bed0288dc0c ]---
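
The faulting word in the instruction dump can be decoded by hand to confirm that this is a NULL pointer dereference. The sketch below is a minimal decoder, assuming the dump prints each word as its instruction value and that the standard Power ISA D-form and DS-form field layouts apply. It handles only the two load opcodes that appear around the fault; the function names are illustrative.

```python
# Minimal decoder for the two PowerPC load encodings seen in the dump.
# Only primary opcodes 32 (lwz, D-form) and 58 (ld, DS-form with XO=0)
# are handled; everything else is reported as unknown.

def sign_extend16(value):
    """Interpret a 16-bit field as a signed displacement."""
    return value - 0x10000 if value & 0x8000 else value

def decode_load(word):
    opcode = (word >> 26) & 0x3F   # primary opcode (top 6 bits)
    rt = (word >> 21) & 0x1F       # target register
    ra = (word >> 16) & 0x1F       # base register
    low16 = word & 0xFFFF
    if opcode == 32:               # lwz RT, D(RA)
        return ("lwz", rt, ra, sign_extend16(low16))
    if opcode == 58 and (low16 & 0x3) == 0:   # ld RT, DS(RA)
        return ("ld", rt, ra, sign_extend16(low16 & 0xFFFC))
    return ("unknown", rt, ra, None)

faulting = decode_load(0x83E9000C)   # word marked <...> in the dump
previous = decode_load(0xE93902E8)   # word immediately before it

gpr9 = 0x0000000000000000            # GPR09 from the register dump above
effective_address = gpr9 + faulting[3]

print(faulting, previous, hex(effective_address))
```

Under these assumptions the faulting instruction is a 32-bit load at offset 12 from r9, and r9 was loaded by the preceding instruction from offset 0x2e8 of r25. With GPR09 equal to zero, the effective address is 0xc, which matches the DAR value reported in the Oops.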

The DLPAR operation completes successfully. After the operation, lspci displays the following:

# lspci
0001:01:00.0 Fibre Channel: Emulex Corporation Saturn-X: LightPulse Fibre Channel Host Adapter (rev 03)
0001:01:00.1 Fibre Channel: Emulex Corporation Saturn-X: LightPulse Fibre Channel Host Adapter (rev 03)
#

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-127497 severity-critical targetmilestone-inin1510
Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1486180/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
Luciano Chavez (lnx1138)
affects: ubuntu → linux (Ubuntu)
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2015-12-07 21:56 EDT-------
Quick update/information: I wasn't able to perform DLPAR in the way described in the bug. Normally I use a command-line tool called "drmgr", but even with this tool I couldn't perform DLPAR in the usual way.

I was able to add the adapter to the LPAR via the HMC with the partition powered off. Once it booted, I could see the adapter, and in this scenario I could perform DLPAR using the "drmgr" tool without issues, i.e., the bug wasn't reproduced in this case.

The problem is that if the adapter is not present as "required" in the LPAR configuration, i.e., if the adapter is not present at partition boot time, the DLPAR facility does not work for me. I got messages like

-"Dynamic reconfiguration is not supported for connector type slots on this system" ;
-"Validating PHB DLPAR capability...yes.
There are no DR capable slots on this system. Could not find drc index for 32, unable to add thePHB."

in partition console (when using "drmgr" command) or the message

"RMC network connection to the source partition is not present"

in HMC, when trying DLPAR via web interface.

Sachin, do you know what's going on? Can you help me perform the correct DLPAR operation to reproduce the bug?

BTW, Murilo (my co-worker) suggested that the machine's firmware level is too outdated - should we upgrade it?

Cheers,

Guilherme

bugproxy (bugproxy) wrote : boot log with lpfc

Default Comment by Bridge

bugproxy (bugproxy) wrote : dmesg log captured after DLPAR operation

Default Comment by Bridge

bugproxy (bugproxy) wrote : Patch

Default Comment by Bridge

Luciano Chavez (lnx1138)
Changed in linux (Ubuntu):
assignee: nobody → Taco Screen team (taco-screen-team)
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-01-11 10:34 EDT-------
Status update:

The root cause was found, and a patch is provided.
The problem happens when DLPAR of a PCI device is done in an LPAR with no PCI devices present at boot time. When DDW is being enabled (in function query_ddw() specifically), a NULL pointer dereference happens because a member of struct eeh_dev is NULL.

This happens because EEH is not initialized correctly: PCI devices are not probed as expected, so the eeh_dev struct is never initialized.

Commit 89a51df5ab1d ("powerpc/eeh: Fix crash in eeh_add_device_early() on Cell") added a check to function eeh_add_device_early() to avoid an oops on the Cell architecture. This function is used to probe PCI devices in hotplug/DLPAR operations. The check is performed by evaluating the return value of the eeh_enable() function.

The issue then happens because, since there is no PCI device at boot time, EEH is not enabled and this check fails in eeh_add_device_early(). Our patch changes the way the arch checking is done, so this bug no longer happens.
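
The failure mode described above can be illustrated with a toy model. This is a sketch in Python, not kernel code: the class names, the `pe_config_addr` field, and the set of supported architectures are hypothetical, and the exact condition used by the patch is not quoted in this report, so the "fixed" guard is only an assumed stand-in for an architecture-based check.

```python
# Toy model only: names mirror the report's vocabulary, but this is not
# kernel code, and the "fixed" guard is an assumed stand-in for the patch.

SUPPORTED_ARCHS = {"pseries"}        # assumption made for illustration

class PciDev:
    def __init__(self, name):
        self.name = name
        self.eeh_dev = None          # stays None until the device is probed

class Platform:
    def __init__(self, arch, pci_devices_at_boot):
        self.arch = arch
        # In this model EEH is only "enabled" if a device was seen at boot.
        self.eeh_enabled = pci_devices_at_boot > 0

def add_device_early_old(platform, dev):
    """Guard as described in the report: bail out unless EEH is enabled."""
    if not platform.eeh_enabled:
        return
    dev.eeh_dev = {"pe_config_addr": 0x10000}   # hypothetical placeholder

def add_device_early_fixed(platform, dev):
    """Assumed fix: decide by architecture, not by the enabled flag."""
    if platform.arch not in SUPPORTED_ARCHS:
        return
    dev.eeh_dev = {"pe_config_addr": 0x10000}   # hypothetical placeholder

def query_ddw(dev):
    """Stand-in for the DDW query; fails if EEH state was never set up."""
    return dev.eeh_dev["pe_config_addr"]        # TypeError if eeh_dev is None

lpar = Platform("pseries", pci_devices_at_boot=0)
old_dev = PciDev("0001:01:00.1")
new_dev = PciDev("0001:01:00.1")
add_device_early_old(lpar, old_dev)      # returns early, eeh_dev stays None
add_device_early_fixed(lpar, new_dev)    # probes the hot-added device
```

In this model, calling `query_ddw(old_dev)` raises a `TypeError`, which plays the role of the NULL dereference, while `query_ddw(new_dev)` succeeds. A platform created with `arch="cell"` is still skipped by the second guard, so the protection added by the earlier commit is preserved.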

The patch was submitted upstream. I don't know the exact procedure regarding Canonical; I think we should wait for upstream acceptance and then ask Canonical to add the patch to Ubuntu's 14.04.4/15.10/16.04 kernels.
The patch description provides a few more details about the issue and the proposed solution.

Link to patch on ppc-dev list: https://lists.ozlabs.org/pipermail/linuxppc-dev/2016-January/137695.html

Thanks Shryia for all the help provided.
Cheers,

Guilherme

Breno Leitão (breno-leitao) wrote :

We want to have the same fix backported to the 14.04 release. I understand that cherry-picking this patch into kernel 4.2 would automatically solve the problem on both releases (15.10 and 14.04.4), right?

Tim Gardner (timg-tpi) wrote :
Changed in linux (Ubuntu Wily):
assignee: nobody → Tim Gardner (timg-tpi)
status: New → In Progress
Changed in linux (Ubuntu Xenial):
assignee: Taco Screen team (taco-screen-team) → Tim Gardner (timg-tpi)
status: New → Fix Committed
bugproxy (bugproxy) wrote : boot log with lpfc

Default Comment by Bridge

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.3.0-6.17

---------------
linux (4.3.0-6.17) xenial; urgency=low

  [ Tim Gardner ]

  * Release Tracking Bug
    - LP: #1532958

  [ Eric Dumazet ]

  * SAUCE: (noup) net: fix IP early demux races
    - LP: #1526946

  [ Guilherme G. Piccoli ]

  * SAUCE: powerpc/eeh: Validate arch in eeh_add_device_early()
    - LP: #1486180

  [ Hui Wang ]

  * [Config] CONFIG_I2C_DESIGNWARE_BAYTRAIL=y, CONFIG_IOSF_MBI=y
    - LP: #1527096

  [ Jann Horn ]

  * ptrace: being capable wrt a process requires mapped uids/gids
    - LP: #1527374

  [ Serge Hallyn ]

  * SAUCE: add a sysctl to disable unprivileged user namespace unsharing

  [ Tim Gardner ]

  * [Config] CONFIG_ZONE_DEVICE=y for amd64
  * [Config] CONFIG_VIRTIO_BLK=y, CONFIG_VIRTIO_NET=y for s390
    - LP: #1532886

  [ Upstream Kernel Changes ]

  * rhashtable: Fix walker list corruption
    - LP: #1526811
  * rhashtable: Kill harmless RCU warning in rhashtable_walk_init
    - LP: #1526811
  * ovl: fix permission checking for setattr
    - LP: #1528904
    - CVE-2015-8660

 -- Tim Gardner <email address hidden> Thu, 17 Dec 2015 05:34:47 -0700

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Brad Figg (brad-figg)
Changed in linux (Ubuntu Wily):
status: In Progress → Fix Committed
bugproxy (bugproxy)
tags: added: severity-high targetmilestone-inin14044
removed: severity-critical targetmilestone-inin1510
Luis Henriques (henrix) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-wily' to 'verification-done-wily'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-wily
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-01-29 02:24 EDT-------
Verified the fix for Ubuntu Xenial 16.04.

uname output : 4.3.0-7-generic

dmesg : After adding FC Adapter

root@alp2:~# dmesg -c
[55673.527853] PCI host bridge /pci@800000020000020 ranges:
[55673.527859] MEM 0x00003fc400000000..0x00003fc47effffff -> 0x0000000080000000
[55673.527861] MEM 0x0000308000000000..0x0000308fffffffff -> 0x0003d08000000000
[55673.532672] PCI: I/O resource not set for host bridge /pci@800000020000020 (domain 4)
[55673.532730] PCI host bridge to bus 0004:01
[55673.532737] pci_bus 0004:01: root bus resource [mem 0x3fc400000000-0x3fc47effffff] (bus address [0x80000000-0xfeffffff])
[55673.532740] pci_bus 0004:01: root bus resource [mem 0x308000000000-0x308fffffffff] (bus address [0x3d08000000000-0x3d08fffffffff])
[55673.532743] pci_bus 0004:01: root bus resource [bus 01-ff]
[55673.616152] iommu: Adding device 0004:01:00.1 to group 0
[55673.616542] iommu: Adding device 0004:01:00.0 to group 0
[55673.617583] lpfc 0004:01:00.1: enabling device (0140 -> 0142)
[55673.619762] lpfc 0004:01:00.1: ibm,query-pe-dma-windows(53) 10000 8000000 20000020 returned 0
[55673.621344] lpfc 0004:01:00.1: ibm,create-pe-dma-window(54) 10000 8000000 20000020 10 24 returned 0 (liobn = 0x70000020 starting addr = 8000000 0)
[55673.709324] lpfc 0004:01:00.1: Using 64-bit direct DMA at offset 800000000000000
[55673.709860] scsi host5: Emulex LPe12000 PCIe Fibre Channel Adapter on PCI bus 01 device 01 irq 507
[55675.855373] lpfc 0004:01:00.0: enabling device (0140 -> 0142)
[55675.857444] lpfc 0004:01:00.0: Using 64-bit direct DMA at offset 800000000000000
[55675.857944] scsi host6: Emulex LPe12000 PCIe Fibre Channel Adapter on PCI bus 01 device 00 irq 508
[55678.007129] rpaphp: Slot [U78C7.001.RCH0042-P1-C8] registered
[55678.007133] rpadlpar_io: slot PHB 32 added
[55678.361113] lpfc 0004:01:00.0: 1:1303 Link Up Event x1 received Data: x1 x1 x20 x1 x0 x0 0
[55678.361126] lpfc 0004:01:00.0: 1:1309 Link Up Event npiv not supported in loop topology
[55678.362164] lpfc 0004:01:00.0: 1:(0):2858 FLOGI failure Status:x3/x18 TMO:x0
[55678.363126] lpfc 0004:01:00.0: 1:(0):2858 FLOGI failure Status:x3/x18 TMO:x0
[55678.364042] lpfc 0004:01:00.0: 1:(0):2858 FLOGI failure Status:x3/x18 TMO:x0
[55678.364046] lpfc 0004:01:00.0: 1:(0):0100 FLOGI failure Status:x3/x18 TMO:x0

Call trace is not seen.
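
A check like this can be automated by scanning captured dmesg text for the marker strings that appeared in the original Oops. The sketch below assumes the log is available as a plain string; the function name and the choice of markers are illustrative, and the sample lines are copied from this report.

```python
import re

# Marker strings taken from the Oops earlier in this report.
OOPS_MARKERS = (
    "Unable to handle kernel paging request",
    "Oops:",
    "Call Trace:",
)

# Matches a leading dmesg timestamp such as "[ 8696.976145] ".
TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\]\s*")

def find_oops_lines(dmesg_text):
    """Return dmesg lines (timestamps stripped) that contain an Oops marker."""
    hits = []
    for line in dmesg_text.splitlines():
        body = TIMESTAMP.sub("", line)
        if any(marker in body for marker in OOPS_MARKERS):
            hits.append(body)
    return hits

good = """[55673.617583] lpfc 0004:01:00.1: enabling device (0140 -> 0142)
[55673.709324] lpfc 0004:01:00.1: Using 64-bit direct DMA at offset 800000000000000"""

bad = """[ 8696.976145] Unable to handle kernel paging request for data at address 0x0000000c
[ 8696.976182] Oops: Kernel access of bad area, sig: 11 [#1]"""

print(find_oops_lines(good))
print(len(find_oops_lines(bad)))
```

An empty result for the post-fix log corresponds to "Call trace is not seen". The marker list is deliberately small; it would need extending to catch other kinds of kernel failure.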

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-01-29 04:46 EDT-------
Also verified the same on Ubuntu 14.04.4

uname : 4.2.0-27-generic

Call trace is not seen.

bugproxy (bugproxy)
tags: added: verification-done
removed: verification-needed-wily
bugproxy (bugproxy)
tags: added: verification-done-wily
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.2.0-27.32

---------------
linux (4.2.0-27.32) wily; urgency=low

  [ Brad Figg ]

  * Release Tracking Bug
    - LP: #1536867

  [ Andy Whitcroft ]

  * SAUCE: (no-up) add compat_uts_machine= kernel command line override
    - LP: #1520627

  [ Colin Ian King ]

  * SAUCE: (no-up) ACPI / tables: Add acpi_force_32bit_fadt_addr option to
    force 32 bit FADT addresses
    - LP: #1529381

  [ Eric Dumazet ]

  * SAUCE: (no-up) udp: properly support MSG_PEEK with truncated buffers
    - LP: #1527902

  [ Guilherme G. Piccoli ]

  * SAUCE: powerpc/eeh: Validate arch in eeh_add_device_early()
    - LP: #1486180

  [ Tim Gardner ]

  * SAUCE: (no-up) Revert "[SCSI] libiscsi: Reduce locking contention in
    fast path"
    - LP: #1517142
  * [Config] Add DRM ast driver to udeb installer image
    - LP: #1514711

  [ Upstream Kernel Changes ]

  * net/mlx5e: Re-eanble client vlan TX acceleration
    - LP: #1533249
  * net/mlx5e: Fix LSO vlan insertion
    - LP: #1533249
  * net/mlx5e: Fix inline header size calculation
    - LP: #1533249
  * net: usb: cdc_ncm: Adding Dell DW5812 LTE Verizon Mobile Broadband Card
    - LP: #1533118
  * net: usb: cdc_ncm: Adding Dell DW5813 LTE AT&T Mobile Broadband Card
    - LP: #1533118
  * powerpc/eeh: Fix recursive fenced PHB on Broadcom shiner adapter
    - LP: #1532942

linux (4.2.0-26.31) wily; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1535795
  * Merged back Ubuntu-4.2.0-25.30

 -- Brad Figg <email address hidden> Thu, 21 Jan 2016 18:44:37 -0800

Changed in linux (Ubuntu Wily):
status: Fix Committed → Fix Released
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2016-02-12 01:54 EDT-------
The fix works fine; it was verified on 4.4.0-4-generic (Ubuntu 16.04).
