PowerNV: PCI Slot is invalid after fencedPHB Error injection
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
linux (Ubuntu) | In Progress | High | Thadeu Lima de Souza Cascardo |
Xenial | Fix Released | Undecided | Unassigned |
Yakkety | Fix Released | Undecided | Unassigned |
Bug Description
== Comment: #0 - Pridhiviraj Paidipeddi <email address hidden> - 2016-12-21 01:16:41 ==
---Problem Description---
The PCI slot is left in an invalid state after the fencedPHB error injection test.
Contact Information = <email address hidden>
---uname output---
Linux brigstrat1p1 4.4.0-57-generic #78-Ubuntu SMP Fri Dec 9 23:46:13 UTC 2016 ppc64le ppc64le ppc64le GNU/Linux
Machine Type = PowerNV CSE-829U
---Debugger---
A debugger is not configured
---Steps to Reproduce---
1. Boot the system to runtime.
2. Inject fencedPHB Error.
echo 0x8000000000000000 > /sys/kernel/
dmesg:
[42725.641368] EEH: PHB#2 failure detected, location: N/A
[42725.641450] CPU: 8 PID: 898 Comm: kworker/u320:1 Not tainted 4.4.0-57-generic #78-Ubuntu
[42725.641461] Workqueue: i40e i40e_service_task [i40e]
[42725.641464] Call Trace:
[42725.641469] [c00000000407f9e0] [c000000000b13b4c] dump_stack+
[42725.641474] [c00000000407fa20] [c0000000000376e0] eeh_dev_
[42725.641477] [c00000000407fac0] [c000000000037ae4] eeh_check_
[42725.641485] [c00000000407fb00] [d000000035845710] i40e_service_
[42725.641489] [c00000000407fc50] [c0000000000dde10] process_
[42725.641492] [c00000000407fce0] [c0000000000de364] worker_
[42725.641496] [c00000000407fd80] [c0000000000e6e60] kthread+0x110/0x130
[42725.641499] [c00000000407fe30] [c000000000009538] ret_from_
[42725.641509] EEH: Detected error on PHB#2
[42725.641514] EEH: This PCI device has failed 1 times in the last hour
[42725.641516] EEH: Notify device drivers to shutdown
[42725.641523] i40e 0002:01:00.0: i40e_pci_
[42725.641907] i40e 0002:01:00.0: VSI seid 396 Tx ring 0 disable timeout
[42725.642144] i40e 0002:01:00.0: VSI seid 396 Rx ring 0 disable timeout
[42725.666205] i40e 0002:01:00.1: i40e_pci_
[42725.666499] i40e 0002:01:00.2: i40e_pci_
[42725.666533] i40e 0002:01:00.0: ARQ event error -32
[42725.666601] i40e 0002:01:00.3: i40e_pci_
[42725.666700] EEH: Collect temporary log
[42725.666702] PHB3 PHB#2 Diag-data (Version: 1)
[42725.666703] brdgCtl: 0000ffff
[42725.666704] UtlSts: 00100000 00000000 00000000
[42725.666706] RootSts: ffffffff ffffffff ffffffff ffffffff 0000ffff
[42725.666707] RootErrSts: ffffffff ffffffff ffffffff
[42725.666708] RootErrLog: ffffffff ffffffff ffffffff ffffffff
[42725.666709] RootErrLog1: ffffffff 0000000000000000 0000000000000000
[42725.666711] nFir: 0000808000000000 0030006e00000000 0000800000000000
[42725.666712] PhbSts: 0000001800000000 0000001800000000
[42725.666713] Lem: 8000020000800000 42498e367f502eae 8000000000000000
[42725.666715] OutErr: 8000002000000000 8000000000000000 120800600003fffe 402002a800000000
[42725.666716] InBErr: 0000000040000000 0000000040000000 0000080000000000 000c10c010010000
[42725.666718] EEH: Reset without hotplug activity
[42730.052455] EEH: Notify device drivers the completion of reset
[42730.053334] EEH: Notify device driver to resume
[42730.184457] i40e 0002:01:00.0 enP2p1s0f0: NIC Link is Down
[42731.568230] i40e 0002:01:00.0 enP2p1s0f0: NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
OPAL LOG:
[42990.475630456,7] PHB#0002: CRESET: Starts
[42990.482717333,7] PHB#0002: CRESET: No pending transactions
[42991.023963215,7] PHB#0002: CRESET: Reinitialization
[42991.023964143,7] PHB#0002: Initializing PHB...
[42991.075167078,7] PHB#0002: Core revision 0xa30005
[42991.075171529,7] PHB#0002: Default system config: 0x421100fc30000000
[42991.075172655,7] PHB#0002: New system config : 0x421000fc30000000
[42991.075174000,7] PHB#0002: PHB_RESET is 0x2000000000000000
[42991.075410938,7] PHB#0002: Waiting for DLP PG reset to complete...
[42991.083713914,7] PHB#0002: Initialization complete
[42991.136599535,7] PHB#0002: FRESET: Starts
[42991.136600954,7] PHB#0002: FRESET: Prepare for link down
[42991.136602933,7] PHB#0002: FRESET: Assert
[42992.138625290,7] PHB#0002: FRESET: Deassert
[42993.140657592,7] PHB#0002: LINK: Start polling
[42993.193893558,7] PHB#0002: LINK: Electrical link detected
[42993.247138072,7] PHB#0002: LINK: Link is up
[42993.247174237,3] PCI-SLOT-
== Comment: #2 - VIPIN K. PARASHAR <email address hidden> - 2016-12-22 04:57:28 ==
$ git log fbce44d0ed42e465317 -1
commit fbce44d0ed42e46
Author: Gavin Shan <email address hidden>
Date: Fri Jun 24 16:44:19 2016 +1000
powerpc/
When issuing PHB reset, OPAL API opal_pci_poll() is called to drive
the state machine in OPAL forward. However, we needn't always call
the function under some circumstances like reset deassert.
This avoids calling opal_pci_poll() when OPAL_SUCCESS is returned
from opal_pci_reset(). Except the overhead introduced by additional
one unnecessary OPAL call, I didn't run into real issue because of
this.
Reported-by: Pridhiviraj Paidipeddi <email address hidden>
Signed-off-by: Gavin Shan <email address hidden>
Signed-off-by: Michael Ellerman <email address hidden>
$ git tag --contains fbce44d0e
v4.9
v4.9-rc1
v4.9-rc2
v4.9-rc3
v4.9-rc4
v4.9-rc5
v4.9-rc6
v4.9-rc7
v4.9-rc8
$
This issue is fixed by commit fbce44d0ed4, which is available from kernel version 4.9 onward.
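The control flow described in that commit message can be illustrated with a small user-space sketch. This is not the kernel code: `mock_phb`, `mock_pci_reset()`, `mock_pci_poll()` and `phb_reset_step()` are hypothetical stand-ins for the OPAL calls named in the commit. It also assumes that a positive return value means "not finished, poll again", and that zero (`OPAL_SUCCESS`) or a negative value means the reset call has completed or failed.

```c
#include <stdint.h>

#define OPAL_SUCCESS 0

/* Hypothetical stand-in for a PHB; records how often poll is called. */
struct mock_phb {
    int64_t reset_rc;   /* value the mock reset call will return */
    int     polls_left; /* number of polls that still report "busy" */
    int     poll_calls; /* counter: how many times poll was invoked */
};

/* Stand-in for opal_pci_reset(): just returns the configured value. */
static int64_t mock_pci_reset(struct mock_phb *phb)
{
    return phb->reset_rc;
}

/* Stand-in for opal_pci_poll(): returns a positive value while busy. */
static int64_t mock_pci_poll(struct mock_phb *phb)
{
    phb->poll_calls++;
    if (phb->polls_left > 0) {
        phb->polls_left--;
        return 1; /* assumed meaning: still in progress, poll again */
    }
    return OPAL_SUCCESS;
}

/*
 * Issue the reset, then poll only when the reset call itself asked
 * for it (positive return). If the reset returned OPAL_SUCCESS or an
 * error, the poll function is never called.
 */
static int64_t phb_reset_step(struct mock_phb *phb)
{
    int64_t rc = mock_pci_reset(phb);

    while (rc > 0)
        rc = mock_pci_poll(phb);

    return rc;
}
```

With `reset_rc` set to `OPAL_SUCCESS`, `phb_reset_step()` returns immediately and `poll_calls` stays at zero, which is the behaviour the commit message describes for cases such as reset deassert.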
tags: | added: architecture-ppc64le bugnameltc-150063 severity-high targetmilestone-inin16041 |
Changed in ubuntu: | |
assignee: | nobody → Taco Screen team (taco-screen-team) |
affects: | ubuntu → linux (Ubuntu) |
Changed in linux (Ubuntu): | |
assignee: | Taco Screen team (taco-screen-team) → Canonical Kernel Team (canonical-kernel-team) |
importance: | Undecided → High |
status: | New → Triaged |
tags: | removed: bugnameltc-150063 severity-high |
Changed in linux (Ubuntu): | |
assignee: | Canonical Kernel Team (canonical-kernel-team) → Thadeu Lima de Souza Cascardo (cascardo) |
status: | Triaged → In Progress |
Changed in linux (Ubuntu Xenial): | |
status: | New → Fix Committed |
Changed in linux (Ubuntu Yakkety): | |
status: | New → Fix Committed |
tags: | added: bugnameltc-150063 severity-high |
------- Comment From <email address hidden> 2016-12-23 07:01 EDT-------
Hello Canonical,
Please include commit fbce44d0ed4 in the kernel to fix this issue.