Unable to put offline CPU back online on Bionic/B-hwe-edge P9

Bug #1827335 reported by Po-Hsu Lin
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
The Ubuntu-power-systems project | Invalid | High | bugproxy |
linux (Ubuntu) | Invalid | Undecided | Unassigned |

Bug Description

You will see these CPUs in offline state after boot.

ubuntu@baltar:~$ cat /sys/devices/system/cpu/offline
156-159

ubuntu@baltar:~$ echo 1 | sudo tee /sys/devices/system/cpu/cpu156/online
1
tee: /sys/devices/system/cpu/cpu156/online: Invalid argument
ubuntu@baltar:~$ echo 1 | sudo tee /sys/devices/system/cpu/cpu157/online
1
tee: /sys/devices/system/cpu/cpu157/online: Invalid argument
ubuntu@baltar:~$ echo 1 | sudo tee /sys/devices/system/cpu/cpu158/online
1
tee: /sys/devices/system/cpu/cpu158/online: Invalid argument
ubuntu@baltar:~$ echo 1 | sudo tee /sys/devices/system/cpu/cpu159/online
1
tee: /sys/devices/system/cpu/cpu159/online: Invalid argument
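
The same attempt can be scripted. Below is a minimal Python sketch, assuming it runs as root on the affected node. `try_online` and its `sysfs_root` parameter are illustrative names, not an existing tool. It writes `1` to `cpuN/online` and returns the symbolic errno name on failure, so the "Invalid argument" shown above would be reported as EINVAL.

```python
import errno
import os


def try_online(cpu, sysfs_root="/sys/devices/system/cpu"):
    """Write "1" to cpu<N>/online.

    Returns None on success, or the symbolic errno name (e.g. "EINVAL")
    if the open or write fails.
    """
    path = os.path.join(sysfs_root, f"cpu{cpu}", "online")
    try:
        # The write may only be flushed on close, so keep the whole
        # "with" block inside the try to catch errors raised at close.
        with open(path, "w") as f:
            f.write("1")
    except OSError as exc:
        return errno.errorcode.get(exc.errno, str(exc.errno))
    return None
```

For example, `[(cpu, try_online(cpu)) for cpu in range(156, 160)]` would collect one result per CPU in the offline range reported above.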

ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: linux-image-5.0.0-14-generic 5.0.0-14.15~18.04.1+signed1
ProcVersionSignature: Ubuntu 5.0.0-14.15~18.04.1-generic 5.0.6
Uname: Linux 5.0.0-14-generic ppc64le
ApportVersion: 2.20.9-0ubuntu7.6
Architecture: ppc64el
Date: Thu May 2 06:46:36 2019
ProcLoadAvg: 0.00 0.00 0.00 1/1298 6496
ProcSwaps:
 Filename Type Size Used Priority
 /swap.img file 8388544 0 -2
ProcVersion: Linux version 5.0.0-14-generic (buildd@bos02-ppc64el-014) (gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3)) #15~18.04.1-Ubuntu SMP Thu Apr 25 18:55:27 UTC 2019
SourcePackage: linux-signed-hwe-edge
UpgradeStatus: No upgrade log present (probably fresh install)
VarLogDump_list: total 0
cpu_cores: Number of cores present = 40
cpu_coreson: Number of cores online = 39
cpu_dscr: DSCR is 16
cpu_freq:
 min: 2.862 GHz (cpu 79)
 max: 2.945 GHz (cpu 81)
 avg: 2.903 GHz
cpu_runmode:
 Could not retrieve current diagnostics mode,
 No kernel interface to firmware
cpu_smt: SMT=4

Po-Hsu Lin (cypressyew) wrote :
Po-Hsu Lin (cypressyew)
description: updated
Po-Hsu Lin (cypressyew) wrote :

This issue can be reproduced with 4.15 Bionic as well.

However, I can't see any CPU in offline state for a 4.15 Bionic P8 node after boot.

affects: linux-signed-hwe-edge (Ubuntu) → linux (Ubuntu)
summary: - Unable to put offline CPU back online on B-hwe-edge P9
+ Unable to put offline CPU back online on Bionic P9
summary: - Unable to put offline CPU back online on Bionic P9
+ Unable to put offline CPU back online on Bionic/B-hwe-edge P9
description: updated
description: updated
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1827335

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Manoj Iyer (manjo)
Changed in ubuntu-power-systems:
importance: Undecided → High
assignee: nobody → bugproxy (bugproxy)
Manoj Iyer (manjo) wrote :

The PNOR firmware on our Bostons is backlevel and needs to be upgraded. We have production and development level hardware, and when we performed a firmware upgrade we ran into issues related to secure boot. I will work with IBM (Michael) to get these systems upgraded to the latest firmware levels.

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-177392 severity-high targetmilestone-inin---
tags: added: bugnameltc-177393
removed: bugnameltc-177392
bugproxy (bugproxy)
tags: added: bugnameltc-177392
removed: bugnameltc-177393
Frank Heimes (fheimes) wrote :

Waiting for a firmware update from IBM - hence set to Incomplete.

Changed in ubuntu-power-systems:
status: New → Incomplete
Manoj Iyer (manjo) wrote :

upgraded the firmware on dradis to P9DSU20190404_IBM_prod_sign.pnor and tested with bionic and disco and the issue does not reproduce. Marking this bug as fix-committed, and if you are able to reproduce this again please re-open this bug.

Changed in ubuntu-power-systems:
status: Incomplete → Fix Committed
Changed in linux (Ubuntu):
status: Incomplete → Fix Committed
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2019-05-08 06:52 EDT-------
(In reply to comment #17)
> upgraded the firmware on dradis to P9DSU20190404_IBM_prod_sign.pnor and
> tested with bionic and disco and the issue does not reproduce. Marking this
> bug as fix-committed, and if you are able to reproduce this again please
> re-open this bug.

As I understand it, the fix is in firmware and no fix was made in Linux (bionic/disco).
Should this bug therefore be rejected as "not a bug" from the Linux point of view, since no
Linux fix was made here? Please advise.

Andrew Cloke (andrew-cloke) wrote :

Po-Hsu Lin (cypressyew), are you able to confirm that the issue can no longer be reproduced now that the firmware has been updated to the latest GA version?

Po-Hsu Lin (cypressyew) wrote :

Hello,

This issue is still affecting node "baltar"; I can't put CPUs 156-159 online:
ubuntu@baltar:~$ echo 1 | sudo tee /sys/devices/system/cpu/cpu156/online
1
tee: /sys/devices/system/cpu/cpu156/online: Invalid argument

There is an interesting difference between "baltar" and "dradis":

ubuntu@baltar:~$ cat /sys/devices/system/cpu/possible
0-159
ubuntu@baltar:~$ cat /sys/devices/system/cpu/present
0-155

ubuntu@dradis:~$ cat /sys/devices/system/cpu/possible
0-159
ubuntu@dradis:~$ cat /sys/devices/system/cpu/present
0-159

Thanks
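
The difference above can be computed directly. This is a minimal Python sketch, assuming the sysfs files use the comma-and-dash CPU list format shown in the transcripts; `parse_cpu_list` and `not_present` are illustrative helper names. CPUs that are possible but not present are the ones the kernel has no hardware for, which would be consistent with the "Invalid argument" error on baltar.

```python
def parse_cpu_list(text):
    """Expand a sysfs CPU list such as "0-3,8" into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty file or trailing comma
        lo, _, hi = part.partition("-")
        # A single number like "8" has no upper bound, so reuse lo.
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus


def not_present(possible, present):
    """Return the sorted CPUs listed in `possible` but missing from `present`."""
    return sorted(parse_cpu_list(possible) - parse_cpu_list(present))


# Values copied from the transcripts above.
baltar_missing = not_present("0-159", "0-155")
dradis_missing = not_present("0-159", "0-159")
```

With the values above, `baltar_missing` holds CPUs 156 through 159 and `dradis_missing` is empty.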

Changed in ubuntu-power-systems:
status: Fix Committed → Confirmed
Changed in linux (Ubuntu):
status: Fix Committed → Confirmed
Manoj Iyer (manjo) wrote :

Po-Hsu Lin, could you please tell me the release and the kernel version used?

Manoj Iyer (manjo) wrote :

== From the BMC web UI ==

Sensor Readings => Select a sensor type category: All Sensors:
CPU Core Func 48 Processor disabled

== In dmesg ==
[ 0.000000] CPU maps initialized for 4 threads per core

[ 0.021044] smp: Bringing up secondary CPUs ...
[ 0.704642] smp: Brought up 2 nodes, 156 CPUs
[ 0.704709] numa: Node 0 CPUs: 0-79
[ 0.704773] numa: Node 8 CPUs: 80-155

== Result ==
It looks like exactly 4 threads are missing (CPUs 156-159), which corresponds to the disabled core reported by firmware. The BMC logs do not give me any reason why the core was disabled (maybe a bad core?). Maybe someone @IBM could tell me whether there is a firmware command to get that information?
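
The arithmetic behind this result, as a small Python sketch. The constants are taken from the dmesg lines and the sysfs output earlier in this report. The `core_index` helper is illustrative and assumes logical CPU numbers are assigned contiguously per core, which matches the ranges shown here but is not guaranteed in general.

```python
THREADS_PER_CORE = 4  # "CPU maps initialized for 4 threads per core"
POSSIBLE_CPUS = 160   # /sys/devices/system/cpu/possible is 0-159
BROUGHT_UP = 156      # "smp: Brought up 2 nodes, 156 CPUs"

missing_threads = POSSIBLE_CPUS - BROUGHT_UP
missing_cores = missing_threads // THREADS_PER_CORE


def core_index(cpu, threads_per_core=THREADS_PER_CORE):
    """Zero-based core index for a logical CPU (assumes contiguous numbering)."""
    return cpu // threads_per_core


# All four missing threads fall on the same zero-based core index.
missing_core_indexes = {core_index(cpu) for cpu in range(156, 160)}
```

One missing core also agrees with the apport data in the description (40 cores present, 39 online).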

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2019-05-10 17:33 EDT-------
From the opal-utils package can you try: sudo opal-gard list? I think if there's something there you can use opal-gard show to get more details.

bugproxy (bugproxy)
tags: added: bugnameltc-177393
removed: bugnameltc-177392
Manoj Iyer (manjo) wrote :

ubuntu@baltar:~$ sudo opal-gard list
 ID | Error | Type | Path
-----------------------------------------------------------------------
 00000001 | 90000012 | Predictive | /Sys0/Node0/Proc1/EQ5/EX1/Core1
=======================================================================

ubuntu@baltar:~$ sudo opal-gard show 00000001
Record ID: 0x00000001
========================
Error ID: 0x90000012
Error Type: Predictive (0xe6)
Path Type: physical
>Sys, Instance #0
 >Node, Instance #0
  >Proc, Instance #1
   >EQ, Instance #5
    >EX, Instance #1
     >Core, Instance #1
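
To check several nodes for gard records, the `opal-gard list` table can be parsed. This is a minimal Python sketch that assumes the four-column, pipe-separated layout shown in the transcript above; other opal-utils versions may print a different layout. `parse_gard_list` is an illustrative name, and `SAMPLE` is the output copied from baltar.

```python
def parse_gard_list(output):
    """Parse `opal-gard list` text into a list of record dicts.

    Assumes rows look like "ID | Error | Type | Path"; separator lines
    and the header row are skipped.
    """
    records = []
    for line in output.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 4 or fields[0] == "ID":
            continue  # rule line, blank line, or header
        records.append(dict(zip(("id", "error", "type", "path"), fields)))
    return records


SAMPLE = """\
 ID | Error | Type | Path
-----------------------------------------------------------------------
 00000001 | 90000012 | Predictive | /Sys0/Node0/Proc1/EQ5/EX1/Core1
=======================================================================
"""

records = parse_gard_list(SAMPLE)
```

An empty result would mean no component is garded on that node, under the same layout assumption.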

bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2019-05-13 08:59 EDT-------
*** Bug 177393 has been marked as a duplicate of this bug. ***

tags: added: bugnameltc-177392
removed: bugnameltc-177393
Andrew Cloke (andrew-cloke) wrote :

Status update: we believe that the off-lined cores are being intentionally forced off-line by the f/w after f/w tests determined that some cores are faulty. Waiting on IBM to confirm this interpretation.

Marking as "Incomplete" while awaiting this confirmation.

Changed in ubuntu-power-systems:
status: Confirmed → Incomplete
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Manoj Iyer (manjo) wrote :

Ran plc -b bmc on baltar and the files generated are included in this tar file.

Mike Ranweiler (mranweil) wrote :

This isn't a kernel issue. I don't have an answer as to why the core is being forced offline, but it's not a kernel bug.

Manoj Iyer (manjo) wrote :

Based on my debugging as well as confirmation from IBM this appears to be a hardware issue that triggers the firmware to offline faulty cores. This is not a kernel issue so closing this bug as invalid.

Changed in ubuntu-power-systems:
status: Incomplete → Invalid
Changed in linux (Ubuntu):
status: Incomplete → Invalid
bugproxy (bugproxy) wrote :

------- Comment From <email address hidden> 2019-06-25 01:36 EDT-------
Closing this bug as it's not a kernel issue.

Po-Hsu Lin (cypressyew) wrote :

Hello,

So does this mean we have something inside this box that needs to be replaced?
