drmgr failed to remove i/o slot

Bug #1587295 reported by bugproxy
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Triaged
Importance: High
Assigned to: Canonical Kernel Team

Bug Description

== Comment: #0 - Minh Nguyen <email address hidden> - 2015-12-04 10:01:38 ==
---Problem Description---
While using drmgr to remove an I/O slot, we encountered a failure:
>pvmctl IOSlot detach --drc-names U78C9.001.WZS005Z-P1-C3 -p id=1
[PVME0105FF05-0187] Command /usr/sbin/pvmdrmgr drmgr -c phb -s 'PHB 41' -r returned 255. Additional messages: /usr/sbin/pvmdrmgr drmgr -c phb -s 'PHB 41' -r
Validating PHB DLPAR capability...yes.
Isolation failed for 20000029 with -9001
Valid outstanding translations exist.

/var/log/syslog showed:

Dec 3 15:07:22 yc00sp-neo kernel: [ 395.877784] rpaphp: RPA HOT Plug PCI Controller Driver version: 0.1
Dec 3 15:07:22 yc00sp-neo kernel: [ 395.878122] rpaphp: Slot [U78C9.001.WZS005Z-P1-C3] registered
Dec 3 15:07:23 yc00sp-neo kernel: [ 396.625406] iommu: Removing device 0001:01:00.0 from group 1
Dec 3 15:07:24 yc00sp-neo kernel: [ 397.293386] iommu: Removing device 0001:01:00.1 from group 1
Dec 3 15:07:34 yc00sp-neo kernel: [ 407.298765] pci_bus 0001:01: busn_res: [bus 01-ff] is released
Dec 3 15:07:34 yc00sp-neo kernel: [ 407.298844] rpadlpar_io: slot PHB 41 removed

/var/log/drmgr showed:

retrieving hotplug nodes
Could not find DRC property group in path: /proc/device-tree/pci@80000002000001b.
hp adapter status for U78C9.001.WZS005Z-P1-C3 is 1
setting hp adapter status to UNCONFIG adapter for U78C9.001.WZS005Z-P1-C3
hp adapter status for U78C9.001.WZS005Z-P1-C3 is 2
Removing device-tree node /proc/device-tree/pci@800000020000029/ethernet@0,1
Removing device-tree node /proc/device-tree/pci@800000020000029/ethernet@0
HPDEV: /sys/bus/pci/devices/0000:50:00.0
       /pci@80000002000001b/usb@0
performing kernel op for PHB 41, file is /sys/bus/pci/slots/control/remove_slot
Removing device-tree node /proc/device-tree/pci@800000020000029
Removing device-tree node /proc/device-tree/interrupt-controller@800000025000029
Releasing drc index 0x20000029
get-sensor for 20000029: 0, 1
Setting isolation state to 'isolate'
Isolation failed for 20000029 with -9001
Valid outstanding translations exist.
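
For anyone triaging similar logs, the failing DRC index and return code can be pulled out of the drmgr output mechanically. The following is only a minimal sketch: the function name `find_isolation_failures` is hypothetical, and it assumes the failure line always has the form shown above, with the DRC index printed in hex without a `0x` prefix.

```python
import re

# Matches the drmgr failure line, e.g. "Isolation failed for 20000029 with -9001".
ISOLATION_RE = re.compile(r"Isolation failed for ([0-9a-fA-F]+) with (-?\d+)")


def find_isolation_failures(log_text):
    """Return (drc_index, return_code) pairs for each isolation failure line."""
    failures = []
    for line in log_text.splitlines():
        match = ISOLATION_RE.search(line)
        if match:
            # Assumption: the index is hexadecimal, as in "drc index 0x20000029".
            failures.append((int(match.group(1), 16), int(match.group(2))))
    return failures


# Excerpt copied from the /var/log/drmgr output above.
sample = """\
Releasing drc index 0x20000029
get-sensor for 20000029: 0, 1
Setting isolation state to 'isolate'
Isolation failed for 20000029 with -9001
Valid outstanding translations exist.
"""

for drc_index, rc in find_isolation_failures(sample):
    print(f"DRC index 0x{drc_index:08x} failed isolation with return code {rc}")
```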

The slot has a 10 Gigabit Ethernet-SFP+ SR PCI-E adapter.

Contact Information = Minh Nguyen (<email address hidden>) Jeremy Arnold (<email address hidden>)

---uname output---
Linux yc00sp-neo 4.2.0-16-generic #19-Ubuntu SMP Thu Oct 8 14:49:47 UTC 2015 ppc64le ppc64le ppc64le GNU/Linux

Machine Type = 8286-42A

---Debugger---
A debugger is not configured

---Steps to Reproduce---
 Run the command:

pvmctl IOSlot detach --drc-names U78C9.001.WZS005Z-P1-C3 -p id=1

Userspace tool common name: gdb

The userspace tool has the following bit modes: 64bit

Userspace rpm: powerpc-ibm-utils

Userspace tool obtained from project website: na

*Additional Instructions for Minh Nguyen (<email address hidden>) Jeremy Arnold (<email address hidden>) :
-Post a private note with access information to the machine that the bug is occurring on.
-Attach ltrace and strace of userspace application.

== Comment: #7 - Carol L. Soto <email address hidden> - 2016-02-08 16:15:57 ==
I looked through /var/log/kern.log.4 (I put a copy in /tmp/kern.log.4) and I see this:
Dec 3 15:00:51 yc00sp-neo kernel: [ 4.762738] ibmvmc: sethmcid: Set HMC ID: "neo 1"
Dec 3 15:00:51 yc00sp-neo kernel: [ 4.817873] DCCP: Activated CCID 2 (TCP-like)
Dec 3 15:07:22 yc00sp-neo kernel: [ 395.877784] rpaphp: RPA HOT Plug PCI Controller Driver version: 0.1
Dec 3 15:07:22 yc00sp-neo kernel: [ 395.878122] rpaphp: Slot [U78C9.001.WZS005Z-P1-C3] registered
Dec 3 15:07:23 yc00sp-neo kernel: [ 396.625406] iommu: Removing device 0001:01:00.0 from group 1
Dec 3 15:07:24 yc00sp-neo kernel: [ 397.293386] iommu: Removing device 0001:01:00.1 from group 1
Dec 3 15:07:34 yc00sp-neo kernel: [ 407.298765] pci_bus 0001:01: busn_res: [bus 01-ff] is released
Dec 3 15:07:34 yc00sp-neo kernel: [ 407.298844] rpadlpar_io: slot PHB 41 removed

However, I do not see any Mellanox traces, only be2net traces, and be2net is a different device.

== Comment: #15 - Douglas Miller <email address hidden> - 2016-02-18 13:49:26 ==
Looking around the system, I notice that 'lspci' shows no (ethernet) device. I looked at the kernel and the module 'be2net' was still loaded, but had zero dependents. I ran "rmmod be2net" and the module was removed without error. I then ran the pvmctl remove command and it appeared to succeed:

root@cs-tul6-neo:~# pvmctl IOSlot detach --drc-names U78CB.001.WZS00D0-P1-C6 -p id=1
[PVME0105FF05-0187] Command /usr/sbin/pvmdrmgr drmgr -c phb -s 'PHB 24' -r returned 3. Additional messages: /usr/sbin/pvmdrmgr drmgr -c phb -s 'PHB 24' -r
Validating PHB DLPAR capability...yes.
root@cs-tul6-neo:~#

and pvmctl io list does not show the device any more.
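
The "loaded but zero dependents" check described above can also be done by reading /proc/modules, where each line lists the module name, size, use count, and users. A rough sketch follows; the helper name `module_refcount` is hypothetical, and the sample line (including its size and address) is illustrative, not taken from the affected machine.

```python
def module_refcount(proc_modules_text, name):
    """Return (use_count, users) for a loaded module, or None if it is not listed.

    Expects text in /proc/modules format:
    name size use_count users state address
    """
    for line in proc_modules_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[0] == name:
            # The users field is "-" when empty, else a comma-separated list
            # that may end with a trailing comma.
            users = [] if fields[3] == "-" else [u for u in fields[3].split(",") if u]
            return int(fields[2]), users
    return None


# Illustrative line only; the size and address are made up.
sample = "be2net 151552 0 - Live 0x0000000000000000\n"

print(module_refcount(sample, "be2net"))
```

On a live system the input would come from `open("/proc/modules").read()`. A use count of 0 with no users matches the state in which "rmmod be2net" succeeded here.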

== Comment: #27 - Douglas Miller <email address hidden> - 2016-02-25 12:30:26 ==
With a point in the right direction from Alexey, I think I've found the problem. The adapter->pcicfg is either derived from the existing map of adapter->db or mapped anew depending on circumstances. However, no record is kept of which was done, and at remove time no attempt is made to release the map. The following debug output from be2net shows the problem:

[ 81.383949] be2net 0000:01:00.0: be2net version is 10.6.0.3debug
[ 81.383953] be2net : be_probe() entered
[ 81.384531] be2net 0000:01:00.0: Using 64-bit direct DMA at offset 800000000000000
[ 81.384715] be2net 0000:01:00.0: PCIe error reporting enabled
[ 81.384779] be2net : d000080080200000 = pci_iomap(csr)
[ 81.384780] be2net : d000080080240000 = pci_iomap(db)
[ 81.384782] be2net : d0000800801e4000 = pci_iomap(pcicfg)
[ 81.562417] be2net 0000:01:00.0: adapter not in advanced mode
[ 81.714383] be2net 0000:01:00.0: FW config: function_mode=0x2003, function_caps=0xf
[ 81.778370] be2net 0000:01:00.0: Max: txqs 16, rxqs 5, rss 4, eqs 16, vfs 0
[ 81.778373] be2net 0000:01:00.0: Max: uc-macs 30, mc-macs 64, vlans 64
[ 81.780257] be2net 0000:01:00.0: enabled 4 MSI-x vector(s) for NIC
[ 82.066316] be2net 0000:01:00.0: created 4 TX queue(s)
[ 82.146293] be2net 0000:01:00.0: created 5 RX queue(s)
[ 82.281405] be2net 0000:01:00.0: FW version is 4.4.180.7
[ 82.282109] be2net 0000:01:00.0: HW Flow control - TX:1 RX:1
[ 82.283251] be2net 0000:01:00.0: Emulex OneConnect(be3): PF port 0
[ 82.283253] be2net : be_probe() left
[ 82.283263] be2net 0000:01:00.1: be2net version is 10.6.0.3debug
[ 82.283264] be2net : be_probe() entered
[ 82.283769] be2net 0000:01:00.1: Using 64-bit direct DMA at offset 800000000000000
[ 82.283952] be2net 0000:01:00.1: PCIe error reporting enabled
[ 82.284743] be2net : d0000800802c0000 = pci_iomap(csr)
[ 82.284745] be2net : d000080080300000 = pci_iomap(db)
[ 82.284747] be2net : d0000800802a0000 = pci_iomap(pcicfg)
[ 82.286982] be2net 0000:01:00.0 enp1s0f0: renamed from eth2
[ 82.462224] be2net 0000:01:00.1: adapter not in advanced mode
[ 82.614194] be2net 0000:01:00.1: FW config: function_mode=0x2003, function_caps=0xf
[ 82.678188] be2net 0000:01:00.1: Max: txqs 16, rxqs 5, rss 4, eqs 16, vfs 0
[ 82.678191] be2net 0000:01:00.1: Max: uc-macs 30, mc-macs 64, vlans 64
[ 82.680083] be2net 0000:01:00.1: enabled 4 MSI-x vector(s) for NIC
[ 82.962129] be2net 0000:01:00.1: created 4 TX queue(s)
[ 83.042104] be2net 0000:01:00.1: created 5 RX queue(s)
[ 83.121652] be2net 0000:01:00.1: FW version is 4.4.180.7
[ 83.122356] be2net 0000:01:00.1: HW Flow control - TX:1 RX:1
[ 83.123492] be2net 0000:01:00.1: Emulex OneConnect(be3): PF port 1
[ 83.123493] be2net : be_probe() left
[ 83.125255] be2net 0000:01:00.1 enp1s0f1: renamed from eth2
[ 165.196825] be2net : be_remove() entered
[ 165.585166] be2net : pci_iounmap(d000080080200000)
[ 165.585172] be2net : pci_iounmap(d000080080240000)
[ 165.585423] be2net : be_remove() left
[ 165.585638] be2net : be_remove() entered
[ 165.981157] be2net : pci_iounmap(d0000800802c0000)
[ 165.981163] be2net : pci_iounmap(d000080080300000)
[ 165.981415] be2net : be_remove() left
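
The mismatch in the log above can be checked by pairing each pci_iomap() address with a later pci_iounmap(). This is a small sketch that assumes the line format of the debug build used here (those "be2net :" trace lines are not printed by the stock driver); the name `leaked_mappings` is hypothetical.

```python
import re

# Line formats from the debug build's trace output shown above.
MAP_RE = re.compile(r"be2net : ([0-9a-f]+) = pci_iomap\((\w+)\)")
UNMAP_RE = re.compile(r"be2net : pci_iounmap\(([0-9a-f]+)\)")


def leaked_mappings(log_text):
    """Return {address: region} for mappings that were never unmapped."""
    outstanding = {}
    for line in log_text.splitlines():
        mapped = MAP_RE.search(line)
        if mapped:
            outstanding[mapped.group(1)] = mapped.group(2)
            continue
        unmapped = UNMAP_RE.search(line)
        if unmapped:
            outstanding.pop(unmapped.group(1), None)
    return outstanding


# Map and unmap lines copied from the log above.
sample = """\
[ 81.384779] be2net : d000080080200000 = pci_iomap(csr)
[ 81.384780] be2net : d000080080240000 = pci_iomap(db)
[ 81.384782] be2net : d0000800801e4000 = pci_iomap(pcicfg)
[ 82.284743] be2net : d0000800802c0000 = pci_iomap(csr)
[ 82.284745] be2net : d000080080300000 = pci_iomap(db)
[ 82.284747] be2net : d0000800802a0000 = pci_iomap(pcicfg)
[ 165.585166] be2net : pci_iounmap(d000080080200000)
[ 165.585172] be2net : pci_iounmap(d000080080240000)
[ 165.981157] be2net : pci_iounmap(d0000800802c0000)
[ 165.981163] be2net : pci_iounmap(d000080080300000)
"""

for address, region in leaked_mappings(sample).items():
    print(f"{region} mapping at {address} was never unmapped")
```

Run against the excerpt, this reports the two pcicfg mappings (one per PCI function), while the csr and db mappings are released in be_remove().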

Since the fix is more than simply adding a (unconditional) call to pci_iounmap(), we probably need to get Emulex involved to see how they want to fix this.

As an experiment, I added code to track the condition and do the unmap. However, the remove still fails with the same error message, even though the pcicfg mapping is now removed. So, there may still be other resources - or else this was not the cause of the error.
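
To make the "track the condition and do the unmap" idea concrete, here is a toy model in Python. It is not kernel code and does not claim to match the experimental patch or any upstream change; `FakeBus`, `Adapter`, the `pcicfg_mapped` flag, and the 0x100 offset are all hypothetical stand-ins, and the csr mapping is left out for brevity.

```python
class FakeBus:
    """Stand-in for the kernel's iomap bookkeeping; tracks live mappings."""

    def __init__(self):
        self.live = set()
        self._next = 0x1000

    def iomap(self, region):
        address = self._next
        self._next += 0x1000
        self.live.add(address)
        return address

    def iounmap(self, address):
        self.live.remove(address)


class Adapter:
    """Toy adapter whose pcicfg is either its own mapping or derived from db."""

    def __init__(self, bus, pcicfg_has_own_bar):
        self.bus = bus
        self.db = bus.iomap("db")
        if pcicfg_has_own_bar:
            self.pcicfg = bus.iomap("pcicfg")
            self.pcicfg_mapped = True   # remember that we own this mapping
        else:
            self.pcicfg = self.db + 0x100  # hypothetical offset into db
            self.pcicfg_mapped = False  # derived pointer, must not be unmapped

    def remove(self):
        self.bus.iounmap(self.db)
        if self.pcicfg_mapped:
            self.bus.iounmap(self.pcicfg)


for own_bar in (True, False):
    bus = FakeBus()
    Adapter(bus, own_bar).remove()
    print(f"pcicfg_has_own_bar={own_bar}: {len(bus.live)} mapping(s) left")
```

Recording at probe time which path was taken is what lets remove() release the mapping in one case and skip it in the other. As noted above, doing this did not by itself make the slot removal succeed.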

== Comment: #28 - Douglas Miller <email address hidden> - 2016-02-25 12:44:35 ==
Jesse ran the f/w debug again, got this:

Failed with the same return code: looks like two page table entries in there for 21010018
                                                                H S
                                                                V V C R T G B S L H W I M G N E UT P PS SS K
                                                                a a h e a r l p p n pi p ai ei e
                          Vpn RealAddr l l g f g p t V g dm gz gz y
==RA=0003FF8200000000==================================================================================================
HPTE 80000020FEDA4700 0013D349C0080120 Phy 8003FF8200100000 X X X X X X X X 000 NAU 64K 1T 00
HPTE 80000020FEDA4D00 0013D349C0080060 Phy 8003FF8200100000 X X X X X X X X 000 NAU 64K 1T 00
=======================================================================================================================
The bold are the virtual page numbers that are still registered

So, what I found does not appear to have been the HPTEs that are causing the problem - even though it does appear to be a bug in be2net. Back to hunting down these addresses.

== Comment: #37 - Douglas Miller <email address hidden> - 2016-03-08 07:56:14 ==
The fix is now in kernel.org origin/master commit a69bf3c5b49ef488970c74e26ba0ec12f08491c2

== Comment: #39 - Douglas Miller <email address hidden> - 2016-03-30 15:00:56 ==
I'm not sure what the correct state is. I think I saw notes on another bugzilla asking Canonical to update 15.10, so I wonder what this bug is for. Should it be changed to FIXED awaiting a new kernel from Canonical?

== Comment: #42 - Douglas Miller <email address hidden> - 2016-05-26 16:27:49 ==
This needs to be mirrored to Canonical so they can pull the commit from kernel.org.

== Comment: #43 - Douglas Miller <email address hidden> - 2016-05-26 16:29:10 ==
 kernel.org origin/master commit a69bf3c5b49ef488970c74e26ba0ec12f08491c2 needs to be pulled into Ubuntu 16.04.1

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-133845 severity-critical targetmilestone-inin1604
Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
Ubuntu Foundations Team Bug Bot (crichton) wrote :

Thank you for taking the time to report this bug and helping to make Ubuntu better. It seems that your bug report is not filed about a specific source package though, rather it is just filed against Ubuntu in general. It is important that bug reports be filed about source packages so that people interested in the package can find the bugs about it. You can find some hints about determining what package your bug might be about at https://wiki.ubuntu.com/Bugs/FindRightPackage. You might also ask for help in the #ubuntu-bugs irc channel on Freenode.

To change the source package that this bug is filed about visit https://bugs.launchpad.net/ubuntu/+bug/1587295/+editstatus and add the package name in the text box next to the word Package.

[This is an automated message. I apologize if it reached you inappropriately; please just reply to this message indicating so.]

tags: added: bot-comment
Luciano Chavez (lnx1138)
affects: ubuntu → linux (Ubuntu)
Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
status: New → Triaged