PowerNV: PCI Slot is invalid after fencedPHB Error injection

Bug #1652018 reported by bugproxy
This bug affects 1 person
Affects        | Status       | Importance | Assigned to                   | Milestone
linux (Ubuntu) | In Progress  | High       | Thadeu Lima de Souza Cascardo |
Xenial         | Fix Released | Undecided  | Unassigned                    |
Yakkety        | Fix Released | Undecided  | Unassigned                    |

Bug Description

== Comment: #0 - Pridhiviraj Paidipeddi <email address hidden> - 2016-12-21 01:16:41 ==
---Problem Description---
The PCI slot is left in an invalid state after the fencedPHB error injection test.

Contact Information = <email address hidden>

---uname output---
Linux brigstrat1p1 4.4.0-57-generic #78-Ubuntu SMP Fri Dec 9 23:46:13 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

Machine Type = PowerNV CSE-829U

---Debugger---
A debugger is not configured

---Steps to Reproduce---
1. Boot the system to runtime.
2. Inject a fencedPHB error:
   echo 0x8000000000000000 > /sys/kernel/debug/powerpc/PCI0002/err_injct_outbound
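
The injection step above can be wrapped in a small helper so that a wrong PHB name or an unmounted debugfs fails loudly instead of silently. This is only a sketch: the function name `inject_fenced_phb` and its optional second argument (an alternative debugfs root, useful for dry runs) are hypothetical; the file name and mask are taken from step 2 of this report.

```shell
# Sketch only: inject_fenced_phb is a hypothetical helper, not an existing tool.
# Usage: inject_fenced_phb <phb-dir> [debugfs-root]
#   e.g. inject_fenced_phb PCI0002
inject_fenced_phb() {
    phb_dir=$1
    root=${2:-/sys/kernel/debug/powerpc}
    target="$root/$phb_dir/err_injct_outbound"

    # Refuse to continue when the injection file is missing or not writable
    # (wrong PHB name, debugfs not mounted, or not a PowerNV system).
    if [ ! -w "$target" ]; then
        echo "cannot write $target" >&2
        return 1
    fi

    # Same mask as in the reproduction step of this report.
    echo 0x8000000000000000 > "$target"
}
```

On a real system this needs root and will trigger EEH recovery on every device behind the PHB, so it should only be run on a test machine.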

dmesg:
[42725.641368] EEH: PHB#2 failure detected, location: N/A
[42725.641450] CPU: 8 PID: 898 Comm: kworker/u320:1 Not tainted 4.4.0-57-generic #78-Ubuntu
[42725.641461] Workqueue: i40e i40e_service_task [i40e]
[42725.641464] Call Trace:
[42725.641469] [c00000000407f9e0] [c000000000b13b4c] dump_stack+0xb0/0xf0 (unreliable)
[42725.641474] [c00000000407fa20] [c0000000000376e0] eeh_dev_check_failure+0x200/0x580
[42725.641477] [c00000000407fac0] [c000000000037ae4] eeh_check_failure+0x84/0xd0
[42725.641485] [c00000000407fb00] [d000000035845710] i40e_service_task+0x17b0/0x1a30 [i40e]
[42725.641489] [c00000000407fc50] [c0000000000dde10] process_one_work+0x1e0/0x5a0
[42725.641492] [c00000000407fce0] [c0000000000de364] worker_thread+0x194/0x680
[42725.641496] [c00000000407fd80] [c0000000000e6e60] kthread+0x110/0x130
[42725.641499] [c00000000407fe30] [c000000000009538] ret_from_kernel_thread+0x5c/0xa4
[42725.641509] EEH: Detected error on PHB#2
[42725.641514] EEH: This PCI device has failed 1 times in the last hour
[42725.641516] EEH: Notify device drivers to shutdown
[42725.641523] i40e 0002:01:00.0: i40e_pci_error_detected: error 2
[42725.641907] i40e 0002:01:00.0: VSI seid 396 Tx ring 0 disable timeout
[42725.642144] i40e 0002:01:00.0: VSI seid 396 Rx ring 0 disable timeout
[42725.666205] i40e 0002:01:00.1: i40e_pci_error_detected: error 2
[42725.666499] i40e 0002:01:00.2: i40e_pci_error_detected: error 2
[42725.666533] i40e 0002:01:00.0: ARQ event error -32
[42725.666601] i40e 0002:01:00.3: i40e_pci_error_detected: error 2
[42725.666700] EEH: Collect temporary log
[42725.666702] PHB3 PHB#2 Diag-data (Version: 1)
[42725.666703] brdgCtl: 0000ffff
[42725.666704] UtlSts: 00100000 00000000 00000000
[42725.666706] RootSts: ffffffff ffffffff ffffffff ffffffff 0000ffff
[42725.666707] RootErrSts: ffffffff ffffffff ffffffff
[42725.666708] RootErrLog: ffffffff ffffffff ffffffff ffffffff
[42725.666709] RootErrLog1: ffffffff 0000000000000000 0000000000000000
[42725.666711] nFir: 0000808000000000 0030006e00000000 0000800000000000
[42725.666712] PhbSts: 0000001800000000 0000001800000000
[42725.666713] Lem: 8000020000800000 42498e367f502eae 8000000000000000
[42725.666715] OutErr: 8000002000000000 8000000000000000 120800600003fffe 402002a800000000
[42725.666716] InBErr: 0000000040000000 0000000040000000 0000080000000000 000c10c010010000
[42725.666718] EEH: Reset without hotplug activity
[42730.052455] EEH: Notify device drivers the completion of reset
[42730.053334] EEH: Notify device driver to resume
[42730.184457] i40e 0002:01:00.0 enP2p1s0f0: NIC Link is Down
[42731.568230] i40e 0002:01:00.0 enP2p1s0f0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
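
The kernel log above shows the EEH sequence completing from the kernel's point of view (failure detected, reset, drivers resumed), which is why the problem is only visible in the OPAL log. A rough way to confirm the kernel-side half is to count the two bracketing messages. This is a sketch that assumes the message wording shown in this report; the function name `eeh_recovery_summary` is hypothetical.

```shell
# Sketch only: reads kernel log text on stdin and compares how many PHB
# failures were detected with how many times drivers were told to resume.
# The patterns assume the exact message wording shown in this report.
eeh_recovery_summary() {
    awk '
        /EEH: PHB#[0-9a-fA-F]+ failure detected/ { detected++ }
        /EEH: Notify device driver to resume/    { resumed++ }
        END {
            printf "detected=%d resumed=%d\n", detected, resumed
            # Succeed only if at least one failure was seen and each
            # detected failure has a matching resume notification.
            exit (detected > 0 && resumed >= detected) ? 0 : 1
        }'
}
```

Typical use would be `dmesg | eeh_recovery_summary`. A zero exit status only says the kernel finished its recovery path; it says nothing about the slot state recorded by OPAL.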

OPAL LOG:
[42990.475630456,7] PHB#0002: CRESET: Starts
[42990.482717333,7] PHB#0002: CRESET: No pending transactions
[42991.023963215,7] PHB#0002: CRESET: Reinitialization
[42991.023964143,7] PHB#0002: Initializing PHB...
[42991.075167078,7] PHB#0002: Core revision 0xa30005
[42991.075171529,7] PHB#0002: Default system config: 0x421100fc30000000
[42991.075172655,7] PHB#0002: New system config : 0x421000fc30000000
[42991.075174000,7] PHB#0002: PHB_RESET is 0x2000000000000000
[42991.075410938,7] PHB#0002: Waiting for DLP PG reset to complete...
[42991.083713914,7] PHB#0002: Initialization complete
[42991.136599535,7] PHB#0002: FRESET: Starts
[42991.136600954,7] PHB#0002: FRESET: Prepare for link down
[42991.136602933,7] PHB#0002: FRESET: Assert
[42992.138625290,7] PHB#0002: FRESET: Deassert
[42993.140657592,7] PHB#0002: LINK: Start polling
[42993.193893558,7] PHB#0002: LINK: Electrical link detected
[42993.247138072,7] PHB#0002: LINK: Link is up
[42993.247174237,3] PCI-SLOT-0000000000000002 Invalid state 00000000
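
The last OPAL line above is the actual symptom. A minimal sketch for spotting it after a test run follows, assuming the OPAL message log is readable at /sys/firmware/opal/msglog (the usual location on PowerNV) and that the message keeps the wording shown here; the function name `find_invalid_slot_state` is hypothetical.

```shell
# Sketch only: prints any "PCI-SLOT-... Invalid state" lines found in the
# OPAL message log. Exit status is 0 if the symptom is present and non-zero
# otherwise (grep semantics).
# Usage: find_invalid_slot_state [log-file]
find_invalid_slot_state() {
    log=${1:-/sys/firmware/opal/msglog}
    grep -E 'PCI-SLOT-[0-9A-Fa-f]+ Invalid state' "$log"
}
```

For example, `find_invalid_slot_state && echo "bug reproduced"` after the injection; a saved copy of the log can be passed as the first argument.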

== Comment: #2 - VIPIN K. PARASHAR <email address hidden> - 2016-12-22 04:57:28 ==

$ git log fbce44d0ed42e465317 -1
commit fbce44d0ed42e4653172376f4dfeaa5710f06a27
Author: Gavin Shan <email address hidden>
Date: Fri Jun 24 16:44:19 2016 +1000

    powerpc/powernv: Call opal_pci_poll() if needed

    When issuing PHB reset, OPAL API opal_pci_poll() is called to drive
    the state machine in OPAL forward. However, we needn't always call
    the function under some circumstances like reset deassert.

    This avoids calling opal_pci_poll() when OPAL_SUCCESS is returned
    from opal_pci_reset(). Except the overhead introduced by additional
    one unnecessary OPAL call, I didn't run into real issue because of
    this.

    Reported-by: Pridhiviraj Paidipeddi <email address hidden>
    Signed-off-by: Gavin Shan <email address hidden>
    Signed-off-by: Michael Ellerman <email address hidden>

$ git tag --contains fbce44d0e
v4.9
v4.9-rc1
v4.9-rc2
v4.9-rc3
v4.9-rc4
v4.9-rc5
v4.9-rc6
v4.9-rc7
v4.9-rc8
$

This issue is fixed by commit fbce44d0ed4, which is available in mainline from kernel v4.9-rc1 onward.

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-150063 severity-high targetmilestone-inin16041
Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → linux (Ubuntu)
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2016-12-23 07:01 EDT-------
Hello Canonical,

Please include commit fbce44d0ed4 in the kernel to fix this issue.

Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
status: New → Triaged
bugproxy (bugproxy)
tags: removed: bugnameltc-150063 severity-high
Changed in linux (Ubuntu):
assignee: Canonical Kernel Team (canonical-kernel-team) → Thadeu Lima de Souza Cascardo (cascardo)
status: Triaged → In Progress
Tim Gardner (timg-tpi) wrote :
Luis Henriques (henrix)
Changed in linux (Ubuntu Xenial):
status: New → Fix Committed
Changed in linux (Ubuntu Yakkety):
status: New → Fix Committed
bugproxy (bugproxy)
tags: added: bugnameltc-150063 severity-high
John Donnelly (jpdonnelly) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
John Donnelly (jpdonnelly) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-yakkety' to 'verification-done-yakkety'. If the problem still exists, change the tag 'verification-needed-yakkety' to 'verification-failed-yakkety'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-yakkety
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-62.83

---------------
linux (4.4.0-62.83) xenial; urgency=low

  [ Thadeu Lima de Souza Cascardo ]

  * Release Tracking Bug
    - LP: #1657430

  * Backport DP MST fixes to i915 (LP: #1657353)
    - SAUCE: i915_bpo: Fix DP link rate math
    - SAUCE: i915_bpo: Validate mode against max. link data rate for DP MST

  * Ubuntu xenial - 4.4.0-59-generic i3 I/O performance issue (LP: #1657281)
    - blk-mq: really fix plug list flushing for nomerge queues

linux (4.4.0-61.82) xenial; urgency=low

  [ Thadeu Lima de Souza Cascardo ]

  * Release Tracking Bug
    - LP: #1656810

  * Xen MSI setup code incorrectly re-uses cached pirq (LP: #1656381)
    - SAUCE: xen: do not re-use pirq number cached in pci device msi msg data

  * nvme drive probe failure (LP: #1626894)
    - nvme: revert NVMe: only setup MSIX once

linux (4.4.0-60.81) xenial; urgency=low

  [ John Donnelly ]

  * Release Tracking Bug
    - LP: #1656084

  * Couldn't emulate instruction 0x7813427c (LP: #1634129)
    - KVM: PPC: Book3S PR: Fix illegal opcode emulation

  * perf: 24x7: Eliminate domain name suffix in event names (LP: #1560482)
    - powerpc/perf/hv-24x7: Fix usage with chip events.
    - powerpc/perf/hv-24x7: Display change in counter values
    - powerpc/perf/hv-24x7: Display domain indices in sysfs
    - powerpc/perf/24x7: Eliminate domain suffix in event names

  * i386 ftrace tests hang on ADT testing (LP: #1655040)
    - ftrace/x86_32: Set ftrace_stub to weak to prevent gcc from using short jumps
      to it

  * VMX module autoloading if available (LP: #1651322)
    - powerpc: Add module autoloading based on CPU features
    - crypto: vmx - Convert to CPU feature based module autoloading

  * ACPI probe support for AD5592/3 configurable multi-channel converter
    (LP: #1654497)
    - SAUCE: iio: dac: ad5592r: Add ACPI support
    - SAUCE: iio: dac: ad5593r: Add ACPI support

  * Xenial update to v4.4.40 stable release (LP: #1654602)
    - btrfs: limit async_work allocation and worker func duration
    - Btrfs: fix tree search logic when replaying directory entry deletes
    - btrfs: store and load values of stripes_min/stripes_max in balance status
      item
    - Btrfs: fix qgroup rescan worker initialization
    - USB: serial: option: add support for Telit LE922A PIDs 0x1040, 0x1041
    - USB: serial: option: add dlink dwm-158
    - USB: serial: kl5kusb105: fix open error path
    - USB: cdc-acm: add device id for GW Instek AFG-125
    - usb: hub: Fix auto-remount of safely removed or ejected USB-3 devices
    - usb: gadget: f_uac2: fix error handling at afunc_bind
    - usb: gadget: composite: correctly initialize ep->maxpacket
    - USB: UHCI: report non-PME wakeup signalling for Intel hardware
    - ALSA: usb-audio: Add QuickCam Communicate Deluxe/S7500 to
      volume_control_quirks
    - ALSA: hiface: Fix M2Tech hiFace driver sampling rate change
    - ALSA: hda/ca0132 - Add quirk for Alienware 15 R2 2016
    - ALSA: hda - ignore the assoc and seq when comparing pin configurations
    - ALSA: hda - fix headset-mic problem on a Dell laptop
    - ALSA: hda - Gate the mic jack on HP Z1 Gen3 AiO
    - ALSA: hd...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.8.0-37.39

---------------
linux (4.8.0-37.39) yakkety; urgency=low

  [ Thadeu Lima de Souza Cascardo ]

  * Release Tracking Bug
    - LP: #1659381

  * Mouse cursor invisible or does not move (LP: #1646574)
    - drm/nouveau/disp/nv50-: split chid into chid.ctrl and chid.user
    - drm/nouveau/disp/nv50-: specify ctrl/user separately when constructing
      classes
    - drm/nouveau/disp/gp102: fix cursor/overlay immediate channel indices

 -- Benjamin M Romer <email address hidden> Wed, 25 Jan 2017 16:12:02 -0200

Changed in linux (Ubuntu Yakkety):
status: Fix Committed → Fix Released
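
With both series released (linux 4.4.0-62.83 for Xenial and 4.8.0-37.39 for Yakkety, per the messages above), a quick sanity check is to compare the running kernel's release string against the first fixed ABI. This is a sketch under assumptions: `version_ge` is a hypothetical helper, and it relies on `sort -V` from GNU coreutils.

```shell
# Sketch only: version_ge is a hypothetical helper.
# True when version $1 sorts at or after version $2 (GNU sort -V ordering).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example use on a Xenial system (not executed here):
#   version_ge "$(uname -r)" 4.4.0-62 && echo "kernel should include the fix"
```

The comparison is only meaningful within one series: a 4.8.0-x Yakkety kernel has to be compared against 4.8.0-37, not against the Xenial number.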
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-04-03 01:23 EDT-------
Tested the fix on the latest kernel; the issue is fixed. No invalid PCI slot state is reported in the OPAL logs.

[ 103.449048801,7] PHB#0000: CRESET: Starts
[ 103.456298692,7] PHB#0000: CRESET: No pending transactions
[ 103.509546415,7] PHB#0000: CRESET: Reinitialization
[ 103.509547391,7] PHB#0000: Initializing PHB...
[ 104.048749632,7] PHB#0000: Core revision 0xa30005
[ 104.048753727,7] PHB#0000: Default system config: 0x441100fc30000000
[ 104.048754789,7] PHB#0000: New system config : 0x441000fc30000000
[ 104.048756109,7] PHB#0000: PHB_RESET is 0x2000000000000000
[ 104.048974703,7] PHB#0000: Waiting for DLP PG reset to complete...
[ 104.057293884,7] PHB#0000: Initialization complete
[ 104.110186799,7] PHB#0000: FRESET: Starts
[ 104.110187743,7] PHB#0000: FRESET: Prepare for link down
[ 104.110189115,7] PHB#0000: FRESET: Assert
[ 105.112237435,7] PHB#0000: FRESET: Deassert
[ 106.114285244,7] PHB#0000: LINK: Start polling
[ 106.167530144,7] PHB#0000: LINK: Electrical link detected
[ 106.220778015,7] PHB#0000: LINK: Link is up

root@ltc-test-hab02:~# uname -a
Linux ltc-test-hab02 4.4.0-71-generic #92-Ubuntu SMP Fri Mar 24 13:00:23 UTC 2017 ppc64le ppc64le ppc64le GNU/Linux
root@ltc-test-hab02:~# cat /etc/os-release
NAME="Ubuntu"
VERSION="16.04.2 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.2 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
root@ltc-test-hab02:~#

tags: added: targetmilestone-inin16042
removed: targetmilestone-inin16041 verification-needed-xenial verification-needed-yakkety