PowerNV: PCI Slot is invalid after fencedPHB Error injection

Bug #1652018 reported by bugproxy on 2016-12-22
This bug affects 1 person
Affects          Importance   Assigned to
linux (Ubuntu)   High         Thadeu Lima de Souza Cascardo
Xenial           Undecided    Unassigned
Yakkety          Undecided    Unassigned

Bug Description

== Comment: #0 - Pridhiviraj Paidipeddi <email address hidden> - 2016-12-21 01:16:41 ==
---Problem Description---
The PCI slot is in an invalid state after the fencedPHB error injection test.

Contact Information = <email address hidden>

---uname output---
Linux brigstrat1p1 4.4.0-57-generic #78-Ubuntu SMP Fri Dec 9 23:46:13 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux

Machine Type = PowerNV CSE-829U

---Debugger---
A debugger is not configured

---Steps to Reproduce---
1. Boot the system to runtime.
2. Inject the fencedPHB error:
   echo 0x8000000000000000 > /sys/kernel/debug/powerpc/PCI0002/err_injct_outbound
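
The injection step can be wrapped so that it fails cleanly on machines where the debugfs file is missing. The sketch below is only an illustration: the function names are hypothetical, and the path pattern ("PCI" followed by the PHB number as four hex digits) is an assumption inferred from the PCI0002 path used above for PHB#2.

```shell
# Build the debugfs error-injection path for a PHB number.
# Assumption: the directory is "PCI" plus the PHB number as four hex
# digits, matching the PCI0002 path used in this report for PHB#2.
phb_inject_path() {
    printf '/sys/kernel/debug/powerpc/PCI%04x/err_injct_outbound\n' "$1"
}

# Hypothetical wrapper: write the fenced-PHB value from this report,
# but only if the injection file exists and is writable.
inject_fenced_phb() {
    path=$(phb_inject_path "$1")
    if [ ! -w "$path" ]; then
        echo "cannot write $path (need root, debugfs mounted, PowerNV)" >&2
        return 1
    fi
    echo 0x8000000000000000 > "$path"
}

# Example (run as root on the affected machine):
#   inject_fenced_phb 2
```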

dmesg:
[42725.641368] EEH: PHB#2 failure detected, location: N/A
[42725.641450] CPU: 8 PID: 898 Comm: kworker/u320:1 Not tainted 4.4.0-57-generic #78-Ubuntu
[42725.641461] Workqueue: i40e i40e_service_task [i40e]
[42725.641464] Call Trace:
[42725.641469] [c00000000407f9e0] [c000000000b13b4c] dump_stack+0xb0/0xf0 (unreliable)
[42725.641474] [c00000000407fa20] [c0000000000376e0] eeh_dev_check_failure+0x200/0x580
[42725.641477] [c00000000407fac0] [c000000000037ae4] eeh_check_failure+0x84/0xd0
[42725.641485] [c00000000407fb00] [d000000035845710] i40e_service_task+0x17b0/0x1a30 [i40e]
[42725.641489] [c00000000407fc50] [c0000000000dde10] process_one_work+0x1e0/0x5a0
[42725.641492] [c00000000407fce0] [c0000000000de364] worker_thread+0x194/0x680
[42725.641496] [c00000000407fd80] [c0000000000e6e60] kthread+0x110/0x130
[42725.641499] [c00000000407fe30] [c000000000009538] ret_from_kernel_thread+0x5c/0xa4
[42725.641509] EEH: Detected error on PHB#2
[42725.641514] EEH: This PCI device has failed 1 times in the last hour
[42725.641516] EEH: Notify device drivers to shutdown
[42725.641523] i40e 0002:01:00.0: i40e_pci_error_detected: error 2
[42725.641907] i40e 0002:01:00.0: VSI seid 396 Tx ring 0 disable timeout
[42725.642144] i40e 0002:01:00.0: VSI seid 396 Rx ring 0 disable timeout
[42725.666205] i40e 0002:01:00.1: i40e_pci_error_detected: error 2
[42725.666499] i40e 0002:01:00.2: i40e_pci_error_detected: error 2
[42725.666533] i40e 0002:01:00.0: ARQ event error -32
[42725.666601] i40e 0002:01:00.3: i40e_pci_error_detected: error 2
[42725.666700] EEH: Collect temporary log
[42725.666702] PHB3 PHB#2 Diag-data (Version: 1)
[42725.666703] brdgCtl: 0000ffff
[42725.666704] UtlSts: 00100000 00000000 00000000
[42725.666706] RootSts: ffffffff ffffffff ffffffff ffffffff 0000ffff
[42725.666707] RootErrSts: ffffffff ffffffff ffffffff
[42725.666708] RootErrLog: ffffffff ffffffff ffffffff ffffffff
[42725.666709] RootErrLog1: ffffffff 0000000000000000 0000000000000000
[42725.666711] nFir: 0000808000000000 0030006e00000000 0000800000000000
[42725.666712] PhbSts: 0000001800000000 0000001800000000
[42725.666713] Lem: 8000020000800000 42498e367f502eae 8000000000000000
[42725.666715] OutErr: 8000002000000000 8000000000000000 120800600003fffe 402002a800000000
[42725.666716] InBErr: 0000000040000000 0000000040000000 0000080000000000 000c10c010010000
[42725.666718] EEH: Reset without hotplug activity
[42730.052455] EEH: Notify device drivers the completion of reset
[42730.053334] EEH: Notify device driver to resume
[42730.184457] i40e 0002:01:00.0 enP2p1s0f0: NIC Link is Down
[42731.568230] i40e 0002:01:00.0 enP2p1s0f0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
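
The dmesg output above shows the kernel side of the recovery completing: an error is detected on the PHB and the drivers are later told to resume. A rough way to check for that sequence in saved dmesg text is sketched below. The function name is hypothetical, and the check only looks for the two message strings seen in this report.

```shell
# Read dmesg text on stdin. Succeed if an "EEH: Detected error on PHB"
# line is later followed by "EEH: Notify device driver to resume".
eeh_recovered() {
    awk '
        /EEH: Detected error on PHB/ { seen = 1; resumed = 0 }
        seen && /EEH: Notify device driver to resume/ { resumed = 1 }
        END { exit !(seen && resumed) }
    '
}

# Example:
#   dmesg | eeh_recovered && echo "EEH recovery completed"
```

Note that this says nothing about the firmware side, which is where the problem in this report shows up.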

OPAL LOG:
[42990.475630456,7] PHB#0002: CRESET: Starts
[42990.482717333,7] PHB#0002: CRESET: No pending transactions
[42991.023963215,7] PHB#0002: CRESET: Reinitialization
[42991.023964143,7] PHB#0002: Initializing PHB...
[42991.075167078,7] PHB#0002: Core revision 0xa30005
[42991.075171529,7] PHB#0002: Default system config: 0x421100fc30000000
[42991.075172655,7] PHB#0002: New system config : 0x421000fc30000000
[42991.075174000,7] PHB#0002: PHB_RESET is 0x2000000000000000
[42991.075410938,7] PHB#0002: Waiting for DLP PG reset to complete...
[42991.083713914,7] PHB#0002: Initialization complete
[42991.136599535,7] PHB#0002: FRESET: Starts
[42991.136600954,7] PHB#0002: FRESET: Prepare for link down
[42991.136602933,7] PHB#0002: FRESET: Assert
[42992.138625290,7] PHB#0002: FRESET: Deassert
[42993.140657592,7] PHB#0002: LINK: Start polling
[42993.193893558,7] PHB#0002: LINK: Electrical link detected
[42993.247138072,7] PHB#0002: LINK: Link is up
[42993.247174237,3] PCI-SLOT-0000000000000002 Invalid state 00000000
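
The failure signature is the last line of the OPAL log above. A small filter for it might look like the sketch below. The function name is hypothetical, and the example assumes the OPAL message log is readable at /sys/firmware/opal/msglog on the test machine.

```shell
# Read OPAL log text on stdin and print any PCI slot "Invalid state"
# messages. Exit status is 0 if at least one was found, 1 otherwise.
find_invalid_slot_state() {
    grep -E 'PCI-SLOT-[0-9A-Fa-f]+ Invalid state'
}

# Example (assumed log location):
#   find_invalid_slot_state < /sys/firmware/opal/msglog
```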

== Comment: #2 - VIPIN K. PARASHAR <email address hidden> - 2016-12-22 04:57:28 ==

$ git log fbce44d0ed42e465317 -1
commit fbce44d0ed42e4653172376f4dfeaa5710f06a27
Author: Gavin Shan <email address hidden>
Date: Fri Jun 24 16:44:19 2016 +1000

    powerpc/powernv: Call opal_pci_poll() if needed

    When issuing PHB reset, OPAL API opal_pci_poll() is called to drive
    the state machine in OPAL forward. However, we needn't always call
    the function under some circumstances like reset deassert.

    This avoids calling opal_pci_poll() when OPAL_SUCCESS is returned
    from opal_pci_reset(). Except the overhead introduced by additional
    one unnecessary OPAL call, I didn't run into real issue because of
    this.

    Reported-by: Pridhiviraj Paidipeddi <email address hidden>
    Signed-off-by: Gavin Shan <email address hidden>
    Signed-off-by: Michael Ellerman <email address hidden>

$ git tag --contains fbce44d0e
v4.9
v4.9-rc1
v4.9-rc2
v4.9-rc3
v4.9-rc4
v4.9-rc5
v4.9-rc6
v4.9-rc7
v4.9-rc8
$

This issue is fixed by commit fbce44d0ed4, which is available in kernel version 4.9.

bugproxy (bugproxy) on 2016-12-22
tags: added: architecture-ppc64le bugnameltc-150063 severity-high targetmilestone-inin16041
Changed in ubuntu:
assignee: nobody → Taco Screen team (taco-screen-team)
affects: ubuntu → linux (Ubuntu)

------- Comment From <email address hidden> 2016-12-23 07:01 EDT-------
Hello Canonical,

Please include commit fbce44d0ed4 in the kernel to fix this issue.

Changed in linux (Ubuntu):
assignee: Taco Screen team (taco-screen-team) → Canonical Kernel Team (canonical-kernel-team)
importance: Undecided → High
status: New → Triaged
bugproxy (bugproxy) on 2017-01-03
tags: removed: bugnameltc-150063 severity-high
Changed in linux (Ubuntu):
assignee: Canonical Kernel Team (canonical-kernel-team) → Thadeu Lima de Souza Cascardo (cascardo)
status: Triaged → In Progress
Luis Henriques (henrix) on 2017-01-06
Changed in linux (Ubuntu Xenial):
status: New → Fix Committed
Changed in linux (Ubuntu Yakkety):
status: New → Fix Committed
bugproxy (bugproxy) on 2017-01-09
tags: added: bugnameltc-150063 severity-high
John Donnelly (jpdonnelly) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-xenial' to 'verification-done-xenial'. If the problem still exists, change the tag 'verification-needed-xenial' to 'verification-failed-xenial'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-xenial
John Donnelly (jpdonnelly) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-yakkety' to 'verification-done-yakkety'. If the problem still exists, change the tag 'verification-needed-yakkety' to 'verification-failed-yakkety'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-yakkety
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-62.83

---------------
linux (4.4.0-62.83) xenial; urgency=low

  [ Thadeu Lima de Souza Cascardo ]

  * Release Tracking Bug
    - LP: #1657430

  * Backport DP MST fixes to i915 (LP: #1657353)
    - SAUCE: i915_bpo: Fix DP link rate math
    - SAUCE: i915_bpo: Validate mode against max. link data rate for DP MST

  * Ubuntu xenial - 4.4.0-59-generic i3 I/O performance issue (LP: #1657281)
    - blk-mq: really fix plug list flushing for nomerge queues

linux (4.4.0-61.82) xenial; urgency=low

  [ Thadeu Lima de Souza Cascardo ]

  * Release Tracking Bug
    - LP: #1656810

  * Xen MSI setup code incorrectly re-uses cached pirq (LP: #1656381)
    - SAUCE: xen: do not re-use pirq number cached in pci device msi msg data

  * nvme drive probe failure (LP: #1626894)
    - nvme: revert NVMe: only setup MSIX once

linux (4.4.0-60.81) xenial; urgency=low

  [ John Donnelly ]

  * Release Tracking Bug
    - LP: #1656084

  * Couldn't emulate instruction 0x7813427c (LP: #1634129)
    - KVM: PPC: Book3S PR: Fix illegal opcode emulation

  * perf: 24x7: Eliminate domain name suffix in event names (LP: #1560482)
    - powerpc/perf/hv-24x7: Fix usage with chip events.
    - powerpc/perf/hv-24x7: Display change in counter values
    - powerpc/perf/hv-24x7: Display domain indices in sysfs
    - powerpc/perf/24x7: Eliminate domain suffix in event names

  * i386 ftrace tests hang on ADT testing (LP: #1655040)
    - ftrace/x86_32: Set ftrace_stub to weak to prevent gcc from using short jumps
      to it

  * VMX module autoloading if available (LP: #1651322)
    - powerpc: Add module autoloading based on CPU features
    - crypto: vmx - Convert to CPU feature based module autoloading

  * ACPI probe support for AD5592/3 configurable multi-channel converter
    (LP: #1654497)
    - SAUCE: iio: dac: ad5592r: Add ACPI support
    - SAUCE: iio: dac: ad5593r: Add ACPI support

  * Xenial update to v4.4.40 stable release (LP: #1654602)
    - btrfs: limit async_work allocation and worker func duration
    - Btrfs: fix tree search logic when replaying directory entry deletes
    - btrfs: store and load values of stripes_min/stripes_max in balance status
      item
    - Btrfs: fix qgroup rescan worker initialization
    - USB: serial: option: add support for Telit LE922A PIDs 0x1040, 0x1041
    - USB: serial: option: add dlink dwm-158
    - USB: serial: kl5kusb105: fix open error path
    - USB: cdc-acm: add device id for GW Instek AFG-125
    - usb: hub: Fix auto-remount of safely removed or ejected USB-3 devices
    - usb: gadget: f_uac2: fix error handling at afunc_bind
    - usb: gadget: composite: correctly initialize ep->maxpacket
    - USB: UHCI: report non-PME wakeup signalling for Intel hardware
    - ALSA: usb-audio: Add QuickCam Communicate Deluxe/S7500 to
      volume_control_quirks
    - ALSA: hiface: Fix M2Tech hiFace driver sampling rate change
    - ALSA: hda/ca0132 - Add quirk for Alienware 15 R2 2016
    - ALSA: hda - ignore the assoc and seq when comparing pin configurations
    - ALSA: hda - fix headset-mic problem on a Dell laptop
    - ALSA: hda - Gate the mic jack on HP Z1 Gen3 AiO
    - ALSA: hd...

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.8.0-37.39

---------------
linux (4.8.0-37.39) yakkety; urgency=low

  [ Thadeu Lima de Souza Cascardo ]

  * Release Tracking Bug
    - LP: #1659381

  * Mouse cursor invisible or does not move (LP: #1646574)
    - drm/nouveau/disp/nv50-: split chid into chid.ctrl and chid.user
    - drm/nouveau/disp/nv50-: specify ctrl/user separately when constructing
      classes
    - drm/nouveau/disp/gp102: fix cursor/overlay immediate channel indices

 -- Benjamin M Romer <email address hidden> Wed, 25 Jan 2017 16:12:02 -0200

Changed in linux (Ubuntu Yakkety):
status: Fix Committed → Fix Released
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2017-04-03 01:23 EDT-------
Tested the fix on the latest kernel; the issue is fixed. No invalid PCI slot state appears in the OPAL logs.

[ 103.449048801,7] PHB#0000: CRESET: Starts
[ 103.456298692,7] PHB#0000: CRESET: No pending transactions
[ 103.509546415,7] PHB#0000: CRESET: Reinitialization
[ 103.509547391,7] PHB#0000: Initializing PHB...
[ 104.048749632,7] PHB#0000: Core revision 0xa30005
[ 104.048753727,7] PHB#0000: Default system config: 0x441100fc30000000
[ 104.048754789,7] PHB#0000: New system config : 0x441000fc30000000
[ 104.048756109,7] PHB#0000: PHB_RESET is 0x2000000000000000
[ 104.048974703,7] PHB#0000: Waiting for DLP PG reset to complete...
[ 104.057293884,7] PHB#0000: Initialization complete
[ 104.110186799,7] PHB#0000: FRESET: Starts
[ 104.110187743,7] PHB#0000: FRESET: Prepare for link down
[ 104.110189115,7] PHB#0000: FRESET: Assert
[ 105.112237435,7] PHB#0000: FRESET: Deassert
[ 106.114285244,7] PHB#0000: LINK: Start polling
[ 106.167530144,7] PHB#0000: LINK: Electrical link detected
[ 106.220778015,7] PHB#0000: LINK: Link is up

root@ltc-test-hab02:~# uname -a
Linux ltc-test-hab02 4.4.0-71-generic #92-Ubuntu SMP Fri Mar 24 13:00:23 UTC 2017 ppc64le ppc64le ppc64le GNU/Linux
root@ltc-test-hab02:~# cat /etc/os-release
NAME="Ubuntu"
VERSION="16.04.2 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.2 LTS"
VERSION_ID="16.04"
HOME_URL="http://www.ubuntu.com/"
SUPPORT_URL="http://help.ubuntu.com/"
BUG_REPORT_URL="http://bugs.launchpad.net/ubuntu/"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
root@ltc-test-hab02:~#
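
For Xenial, the fix is reported above as released in linux 4.4.0-62.83, and the kernel verified here is 4.4.0-71. A quick check that a running kernel is at least that ABI version could be sketched as follows. The function names are hypothetical, and the comparison assumes `sort -V` from GNU coreutils is available.

```shell
# Strip the flavour suffix from a `uname -r` string,
# e.g. 4.4.0-71-generic -> 4.4.0-71
kernel_abi() {
    printf '%s\n' "$1" | sed 's/-[a-z].*$//'
}

# Succeed if version $1 is greater than or equal to version $2.
# Relies on version sort (`sort -V`) from GNU coreutils.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example:
#   version_ge "$(kernel_abi "$(uname -r)")" 4.4.0-62 && echo "fix should be present"
```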

tags: added: targetmilestone-inin16042
removed: targetmilestone-inin16041 verification-needed-xenial verification-needed-yakkety