STCOP810:Firestone: frsfp6 EEH on Bluefin does not recover with Ubuntu

Bug #1502982 reported by bugproxy on 2015-10-05
This bug affects 1 person
Affects          | Status | Importance | Assigned to      | Milestone
linux (Ubuntu)   |        | Undecided  | Taco Screen team |
Vivid            |        | Undecided  | Tim Gardner      |
Wily             |        | Undecided  | Taco Screen team |

Bug Description

Problem:
==========
Test Case Execution Record:

95613: EEH_Firestone_Ubuntu 14.04.03_Bluefin_Standalone on frsfp6

Error Injection Method: err_injct_inboundA

Step 1. Start HTX (I used mdt.hdbuster & only ran htx on bluefin disks)
Step 2. Inject EEH error

bluefin is in slot P1-C4 (PCI0004)

 echo 0x8000000000000000 > /sys/kernel/debug/powerpc/PCI0004/err_injct_inboundA; sleep 1; echo 0x0 > /sys/kernel/debug/powerpc/PCI0004/err_injct_inboundA

Expected Result: Adapter/SAN disks to recover and htx still run

Actual Result: The adapter did not recover. EEH errors recurred continuously until the limit of 6 within 1 hour was reached.
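
The injection in Step 2 can also be scripted. The following is a minimal Python sketch, not part of the original test case: the function name `inject_inbound_a` and its parameters are hypothetical, while the debugfs path and the two values written are the ones from the command above. Writing to debugfs normally requires root.

```python
import time
from pathlib import Path

# debugfs directory used by the shell one-liner in the report
DEBUG_ROOT = Path("/sys/kernel/debug/powerpc")


def inject_inbound_a(phb, root=DEBUG_ROOT, hold=1.0):
    """Arm, hold, then clear the inbound error-injection trigger for one PHB.

    `phb` is the PHB directory name, e.g. "PCI0004" for slot P1-C4.
    This helper is illustrative only; it mirrors the echo/sleep/echo sequence.
    """
    node = Path(root) / phb / "err_injct_inboundA"
    node.write_text("0x8000000000000000\n")  # arm the injection
    time.sleep(hold)                          # same pause as `sleep 1`
    node.write_text("0x0\n")                  # disarm
    return node
```

Calling `inject_inbound_a("PCI0004")` as root would repeat the sequence used in this test. The `root` parameter exists only so the helper can be exercised against a scratch directory.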

Two patches are involved: one for the skiboot firmware and one for Linux. The Linux patch is already upstream but is missing from the Ubuntu distro kernel (at least in 15.04). The skiboot patch has been merged upstream.

c7192a4 PHB3: Fix wrong PE number in error injection (skiboot)
2aa5cf9 powerpc/eeh: Fix missed PE#0 on P7IOC (linux)

If I'm correct, this bug needs to be mirrored so that the Linux patch (commit 2aa5cf9) can be backported to the Ubuntu distro. With the patch backported to Ubuntu 15.04, EEH works fine on a Broadcom adapter (not exactly the one on which the bug was initially reported):

root@fstn2-p1:/# dmesg | grep EEH
[ 0.216919] EEH: PowerNV platform initialized
[ 0.570606] EEH: devices created
[ 1.302482] EEH: PCI Enhanced I/O Error Handling Enabled
[ 90.566761] EEH: PHB location: Slot1
[ 90.567503] EEH: Frozen PHB#4-PE#0 detected
[ 90.567673] EEH: PE location: Slot1, PHB location: Slot1
[ 90.567930] EEH: Detected PCI bus error on PHB#4-PE#0
[ 90.567935] EEH: This PCI device has failed 1 times in the last hour
[ 90.567937] EEH: Notify device drivers to shutdown
[ 90.567985] EEH: Collect temporary log
[ 90.568971] EEH: Reset without hotplug activity
[ 94.585540] EEH: Notify device drivers the completion of reset
[ 94.585934] EEH: Notify device driver to resume
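
When checking many runs, the same log can be summarized mechanically. The sketch below is an illustration only: the function name `summarize_eeh` and the returned keys are hypothetical, and the regular expressions are written against the message wording shown in the log above, which may differ in other kernel versions.

```python
import re

# Matches dmesg lines such as "[ 90.567503] EEH: Frozen PHB#4-PE#0 detected"
EEH_LINE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+EEH:\s+(?P<msg>.*)$")
FROZEN = re.compile(r"Frozen PHB#(?P<phb>\w+)-PE#(?P<pe>\w+) detected")
FAILED = re.compile(r"has failed (?P<n>\d+) times in the last hour")


def summarize_eeh(lines):
    """Collect frozen PEs, the highest failure count, and whether a resume was logged."""
    summary = {"frozen": [], "max_failures": 0, "resumed": False}
    for line in lines:
        m = EEH_LINE.match(line.strip())
        if not m:
            continue  # not an EEH message
        msg = m.group("msg")
        frozen = FROZEN.search(msg)
        if frozen:
            # Keep the numbers as strings; the report does not state their base.
            summary["frozen"].append((frozen.group("phb"), frozen.group("pe")))
        failed = FAILED.search(msg)
        if failed:
            summary["max_failures"] = max(summary["max_failures"], int(failed.group("n")))
        if "Notify device driver to resume" in msg:
            summary["resumed"] = True
    return summary
```

Fed the output of `dmesg | grep EEH`, a successful recovery like the one above would show one frozen PE and `resumed` set to true, while a failing run would show a growing failure count and no resume.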

----

The story behind this bug is as follows. Without commit 2aa5cf9 ("powerpc/eeh: Fix missed PE#0 on P7IOC"), PE#0 is regarded as an invalid PE. When the kernel sees the frozen PE#0, it clears the frozen state, dumps the PHB diag-data, and then tries to recover the PE. While the PE is being reset, the driver, which was not completely stopped by error_detected(), accesses the MMIO space and causes another (recursive) EEH error. Eventually, the EEH recovery fails. During the PE reset, the I/O path for the PE should be frozen and MMIO accesses during that period should be dropped to avoid recursive EEH errors.

Default Comment by Bridge

tags: added: architecture-ppc64le bugnameltc-131243 severity-high targetmilestone-inin14043

Luciano Chavez (lnx1138) on 2015-10-05
affects: ubuntu → linux (Ubuntu)
Changed in linux (Ubuntu):
assignee: nobody → Taco Screen team (taco-screen-team)

Tim Gardner (timg-tpi) on 2015-10-05
Changed in linux (Ubuntu Vivid):
status: New → Fix Released
assignee: nobody → Tim Gardner (timg-tpi)
status: Fix Released → In Progress
Changed in linux (Ubuntu Wily):
status: New → Fix Released

Brad Figg (brad-figg) on 2015-10-07
Changed in linux (Ubuntu Vivid):
status: In Progress → Fix Committed

------- Comment From <email address hidden> 2015-10-07 22:20 EDT-------
Hello Tim,

This may have been asked before, but does the patch being committed to Vivid also mean that it is automatically picked up for an upcoming linux-lts-vivid kernel for Trusty (14.04)?

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2015-10-07 23:51 EDT-------
*** Bug 128309 has been marked as a duplicate of this bug. ***

Tim Gardner (timg-tpi) wrote :

chavez - yes, LTS kernels are derived from their associated master repository. Any patches committed to the master will also eventually show up in the LTS.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2015-10-09 13:55 EDT-------
==== State: Verify by: anitrap on 09 October 2015 08:45:31 ====

Sorry about password confusion...So unless something has changed, I thought we were not supposed to put lab passwords in defect.

You can request access to the passwords from https://pcajet.austin.ibm.com and then follow directions to see current password (I put that in seq 1 and that is why I never put password in defect).

Directions to check passwords (from https://pcajet.austin.ibm.com) :
The Lab Test Passwords are now accessible only through the auto or manual install web apps. For example, from the manual install web app, enter your email address, check the Lab Passwords checkbox and then click on Submit.

Also, you can always go to sol console...

I'll send a note out with password. Thanks for help and will have system available.

On Fri, Oct 09, 2015 at 02:01:18PM -0000, bugproxy wrote:
> ------- Comment From <email address hidden> 2015-10-09 13:55 EDT-------
> ==== State: Verify by: anitrap on 09 October 2015 08:45:31 ====

> Sorry about password confusion...So unless something has changed, I
> thought we were not supposed to put lab passwords in defect.

Your comments here are being mirrored to Ubuntu's public bug tracker, so
yes, please don't send us any lab passwords.

------- Comment From <email address hidden> 2015-10-12 13:36 EDT-------
There is another related fix that needs backporting as well, so there are two patches in total:

commit 2aa5cf9 ("powerpc/eeh: Fix missed PE#0 on P7IOC")
commit 433185d2 ("powerpc/eeh: Fix PE#0 check in eeh_add_to_parent_pe()")

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2015-10-14 17:35 EDT-------
==== State: Verify by: trantow on 14 October 2015 12:26:47 ====

Need to understand from Seq 19 & 20
commit 2aa5cf9 ("powerpc/eeh: Fix missed PE#0 on P7IOC")
commit 433185d2 ("powerpc/eeh: Fix PE#0 check in eeh_add_to_parent_pe()")

Are those both Skiboot fixes that need to be in the 1539H driver?
or were we supposed to already have one or more of them in 1539G?

Assuming these are NOT Ubuntu fixes?
Is there also a set of Ubuntu 14.04.3 3.19.x kernel updates, and are these already planned for a 10/17 release?

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2015-10-14 21:15 EDT-------
==== State: Verify by: trantow on 14 October 2015 16:08:58 ====

Are three patches required?
one for skiboot firmware (in OPAL) ... delivered in 5.1.5 -> op810 1539G ?
two Linux ones, and when available for 14.04.3 kernel 3.19 ?

1) c7192a4 PHB3: Fix wrong PE number in error injection (skiboot)
2) 2aa5cf9 powerpc/eeh: Fix missed PE#0 on P7IOC (linux)
.... later add ...
3) 433185d2 ("powerpc/eeh: Fix PE#0 check in eeh_add_to_parent_pe()") (Linux)

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2015-10-14 21:34 EDT-------
Yes, all 3 patches above are required. Otherwise, error recovery can't work on PE#0, and error injection to PE#0 simply fails. This particular bug only tracks the backporting of the 2 Linux patches from upstream to the Ubuntu distro.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2015-10-15 04:58 EDT-------
Hi Canonical, those two patches are critical for EEH to work properly. If possible, could you please include them in the next release cycle, which I was told is 10/17?

If any assistance is needed, please let me know.

Tim Gardner (timg-tpi) wrote :

Submitted "[PATCH Vivid SRU] powerpc/eeh: Fix PE#0 check in eeh_add_to_parent_pe()" to the k-team list.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2015-10-22 14:54 EDT-------
==== State: Verify by: anitrap on 22 October 2015 09:38:53 ====

Gavin added a patched kernel to my system and EEH on bluefin worked. I'm waiting for the official fix before verifying.

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2015-10-22 20:35 EDT-------
==== State: Verify by: lieder on 22 October 2015 15:30:54 ====

#=#=# 2015-10-22 15:30:42 (CDT) #=#=#
New Fix_Potential = [P810.00D]
#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#=#

Brad Figg (brad-figg) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-vivid' to 'verification-done-vivid'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-vivid
tags: added: verification-done-vivid
removed: verification-needed-vivid
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.19.0-32.37

---------------
linux (3.19.0-32.37) vivid; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1508381

  [ Joseph Salisbury ]

  * SAUCE: storvsc: use small sg_tablesize on x86
    - LP: #1495983

  [ Phidias Chiang ]

  * SAUCE: dma: dw_dmac: Workaround for stop probing on HP X360 laptop v2
    - LP: #1501580

  [ Tim Gardner ]

  * [Config] Add MMC modules sufficient for net booting
    - LP: #1502772

  [ Upstream Kernel Changes ]

  * USB: whiteheat: fix potential null-deref at probe
    - LP: #1478826
    - CVE-2015-5257
  * dcache: Handle escaped paths in prepend_path
    - LP: #1441108
    - CVE-2015-2925
  * vfs: Test for and handle paths that are unreachable from their mnt_root
    - LP: #1441108
    - CVE-2015-2925
  * hv_netvsc: Add support to set MTU reservation from guest side
    - LP: #1494431
  * hv_netvsc: Add close of RNDIS filter into change mtu call
    - LP: #1494431
  * powerpc/eeh: Fix missed PE#0 on P7IOC
    - LP: #1502982
  * powerpc/powernv: display reason for Malfunction Alert HMI.
    - LP: #1482343
  * powerpc/powernv: Pull all HMI events before panic.
    - LP: #1482343
  * powerpc/powernv: Invoke opal_cec_reboot2() on unrecoverable machine
    check errors.
    - LP: #1482343
  * powerpc/powernv: Invoke opal_cec_reboot2() on unrecoverable HMI.
    - LP: #1482343
  * powerpc/eeh: Fix PE#0 check in eeh_add_to_parent_pe()
    - LP: #1502982
  * HID: i2c-hid: The interrupt should be level sensitive v2
    - LP: #1501187
  * HID: i2c-hid: Add support for ACPI GPIO interrupts v2
    - LP: #1501187

 -- Luis Henriques <email address hidden> Wed, 21 Oct 2015 10:30:13 +0100
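
To confirm which changes in an upload belong to a given bug, the changelog text can be filtered by its `LP:` references. The following is a rough sketch under the assumption that entries look like the ones above (a `*` title line, optional wrapped title lines, then `- LP: #N` lines); the function name `entries_for_bug` is hypothetical.

```python
import re

LP_REF = re.compile(r"- LP: #(\d+)")


def entries_for_bug(changelog, bug):
    """Return the titles of changelog entries that reference Launchpad bug `bug`."""
    entries = []
    current = None
    for raw in changelog.splitlines():
        line = raw.strip()
        if line.startswith("* "):
            # A new entry begins at every "* " line.
            current = {"title": line[2:], "bugs": set()}
            entries.append(current)
        elif not line or line.startswith(("[", "--")):
            # Blank lines, "[ Author ]" headers, and the sign-off end an entry.
            current = None
        elif current is not None:
            ref = LP_REF.fullmatch(line)
            if ref:
                current["bugs"].add(int(ref.group(1)))
            elif not line.startswith("-"):
                # Wrapped continuation of the entry title.
                current["title"] += " " + line
    return [e["title"] for e in entries if bug in e["bugs"]]
```

Applied to the 3.19.0-32.37 changelog above with bug number 1502982, this should list the two powerpc/eeh entries.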

Changed in linux (Ubuntu Vivid):
status: Fix Committed → Fix Released