CPU hard lockup when turning CPU back online on Bionic P9

Bug #1827343 reported by Po-Hsu Lin
This bug affects 1 person
Affects: The Ubuntu-power-systems project
Status: Fix Released
Importance: High
Assigned to: bugproxy

Bug Description

Found on another Boston Power9 box "dradis".

Steps to reproduce:
1. Check online CPUs
     $ cat /sys/devices/system/cpu/online
     0-159
2. Take one CPU offline via CPU hotplug:
     $ echo 0 | sudo tee /sys/devices/system/cpu/cpu159/online
     0
3. Check dmesg, you should see:
     [ 410.890106] IRQ 174: no longer affine to CPU159
4. Put that CPU back online and check dmesg again:
     $ echo 1 | sudo tee /sys/devices/system/cpu/cpu159/online

The system then reports a CPU hard lockup:
[ 410.890106] IRQ 174: no longer affine to CPU159
[ 421.168052] Watchdog CPU:128 Hard LOCKUP
[ 421.168054] Modules linked in: joydev input_leds mac_hid idt_89hpesx ipmi_powernv opal_prd ipmi_devintf ibmpowernv ofpart at24 cmdlinepart uio_pdrv_genirq uio powernv_flash mtd ipmi_msghandler vmx_crypto sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear ses enclosure scsi_transport_sas uas usb_storage ast hid_generic i2c_algo_bit ttm drm_kms_helper usbhid syscopyarea sysfillrect sysimgblt hid fb_sys_fops crct10dif_vpmsum crc32c_vpmsum drm i40e aacraid
[ 421.168108] CPU: 128 PID: 778 Comm: watchdog/128 Not tainted 4.15.0-48-generic #51-Ubuntu
[ 421.168109] NIP: c000000000d082e8 LR: c00000000016c3b0 CTR: c000000000ac5d80
[ 421.168111] REGS: c00000003f9ffd80 TRAP: 0900 Not tainted (4.15.0-48-generic)
[ 421.168112] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24000484 XER: 00000000
[ 421.168118] CFAR: c00000000016c3ac SOFTE: 0
               GPR00: c00000000016c3b0 c000200e55743af0 c0000000016eb400 c000200e614a8f20
               GPR04: 000000000000088c c000200e614a4360 c0000000fd6879b8 0000000000000008
               GPR08: 000000000054cd92 00000000389fd980 0000000080000080 0000000000000005
               GPR12: c000000000ac5d80 c00000000fad8000 c00000000013e648 c000000ff90e9640
               GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
               GPR20: 0000000000000000 0000000000000000 c000200e55852b80 0000200e602d0000
               GPR24: c000200e614a8580 0000000000000000 c000000000d01ff0 c000200e614a6fb8
               GPR28: c000200e614a8880 000000000000088c c000200e614a8580 c000200e614a8f20
[ 421.168142] NIP [c000000000d082e8] _raw_spin_lock+0x28/0xe0
[ 421.168146] LR [c00000000016c3b0] update_curr_rt+0x1d0/0x3f0
[ 421.168147] Call Trace:
[ 421.168150] [c000200e55743af0] [c00000000171dd78] __per_cpu_offset+0x0/0x4000 (unreliable)
[ 421.168154] [c000200e55743b20] [c00000000016c2b0] update_curr_rt+0xd0/0x3f0
[ 421.168156] [c000200e55743bb0] [c00000000016c7bc] dequeue_task_rt+0x3c/0xf0
[ 421.168159] [c000200e55743bf0] [c00000000014e9b0] deactivate_task+0xb0/0x160
[ 421.168161] [c000200e55743c70] [c000000000d0187c] __schedule+0x3bc/0xaf0
[ 421.168164] [c000200e55743d40] [c000000000d01ff0] schedule+0x40/0xc0
[ 421.168167] [c000200e55743d60] [c000000000144bd4] smpboot_thread_fn+0x284/0x290
[ 421.168169] [c000200e55743dc0] [c00000000013e7e8] kthread+0x1a8/0x1b0
[ 421.168172] [c000200e55743e30] [c00000000000b658] ret_from_kernel_thread+0x5c/0x84
[ 421.168173] Instruction dump:
[ 421.168175] 7c0803a6 4bffff98 3c4c009e 38423140 7c0802a6 60000000 fbe1fff8 f821ffd1
[ 421.168179] 7c7f1b78 39400000 994d028d 814d0008 <7d201829> 2c090000 40c20010 7d40192d

But the CPU is actually back online:
$ cat /sys/devices/system/cpu/online
0-159
$ cat /sys/devices/system/cpu/cpu159/online
1
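
The online list above can be checked against the topology reported further down (40 cores at SMT=4, so 160 hardware threads). A minimal sketch, assuming a POSIX shell; `count_cpus` is a hypothetical helper name, not part of any existing tool:

```shell
# Count the CPUs in a sysfs-style range list such as "0-159" or "0-3,8,10-11".
# count_cpus is an illustrative helper, not an existing utility.
count_cpus() {
    list=$1
    total=0
    old_ifs=$IFS
    IFS=,
    for part in $list; do
        case $part in
            # A range "lo-hi" contributes hi - lo + 1 CPUs.
            *-*) lo=${part%-*}; hi=${part#*-}; total=$((total + hi - lo + 1)) ;;
            # A single CPU number contributes one.
            *)   total=$((total + 1)) ;;
        esac
    done
    IFS=$old_ifs
    echo "$total"
}
```

For example, `count_cpus "$(cat /sys/devices/system/cpu/online)"` prints 160 for the list `0-159`, and 159 for `0-158` while cpu159 is offline.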

ProblemType: Bug
DistroRelease: Ubuntu 18.04
Package: linux-image-4.15.0-48-generic 4.15.0-48.51
ProcVersionSignature: Ubuntu 4.15.0-48.51-generic 4.15.18
Uname: Linux 4.15.0-48-generic ppc64le
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 May 2 08:06 seq
 crw-rw---- 1 root audio 116, 33 May 2 08:06 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
ApportVersion: 2.20.9-0ubuntu7.6
Architecture: ppc64el
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
Date: Thu May 2 08:16:36 2019
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
PciMultimedia:

ProcFB: 0 astdrmfb
ProcKernelCmdLine: root=UUID=82644f4d-d7cb-4abf-b5e9-d8b5644f77dd ro console=hvc0
ProcLoadAvg: 0.02 0.42 0.37 1/1332 5365
ProcLocks:
 1: FLOCK ADVISORY WRITE 4460 00:17:336 0 EOF
 2: POSIX ADVISORY WRITE 3994 00:17:604 0 EOF
 3: FLOCK ADVISORY WRITE 3913 00:17:586 0 EOF
 4: POSIX ADVISORY WRITE 3984 00:17:620 0 EOF
 5: POSIX ADVISORY WRITE 1816 00:17:356 0 EOF
ProcSwaps:
 Filename Type Size Used Priority
 /swap.img file 8388544 0 -2
ProcVersion: Linux version 4.15.0-48-generic (buildd@bos02-ppc64el-010) (gcc version 7.3.0 (Ubuntu 7.3.0-16ubuntu3)) #51-Ubuntu SMP Wed Apr 3 08:26:19 UTC 2019
RelatedPackageVersions:
 linux-restricted-modules-4.15.0-48-generic N/A
 linux-backports-modules-4.15.0-48-generic N/A
 linux-firmware 1.173.5
RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
VarLogDump_list: total 0
cpu_cores: Number of cores present = 40
cpu_coreson: Number of cores online = 40
cpu_dscr: DSCR is 16
cpu_freq:
 min: 2.862 GHz (cpu 159)
 max: 2.862 GHz (cpu 81)
 avg: 2.862 GHz
cpu_runmode:
 Could not retrieve current diagnostics mode,
 No kernel interface to firmware
cpu_smt: SMT=4

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Po-Hsu Lin (cypressyew) wrote :

Tried again on the very same node after redeploying the system with Bionic.
I can't reproduce this issue.

I also tried the "cpu-on-off-test.sh" script from the kernel selftest tools with the "-a" flag, and can't reproduce it that way either.

Manoj Iyer (manjo)
Changed in ubuntu-power-systems:
importance: Undecided → High
assignee: nobody → bugproxy (bugproxy)
Manoj Iyer (manjo) wrote :

The PNOR firmware on our Boston systems is backlevel and needs to be upgraded. We have both production-level and development-level hardware, and when we performed a firmware upgrade we ran into issues related to secure boot. I will work with IBM (Michael) to get these systems upgraded to the latest firmware levels.

bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-177391 severity-high targetmilestone-inin---
Manoj Iyer (manjo) wrote :

Waiting on firmware (pnor) from IBM.

Changed in ubuntu-power-systems:
status: New → Incomplete
Changed in linux (Ubuntu):
importance: Undecided → High
Manoj Iyer (manjo) wrote :

Upgraded the firmware on dradis to P9DSU20190404_IBM_prod_sign.pnor and tested with Bionic and Disco; the issue does not reproduce. Marking this bug as Fix Committed. If you are able to reproduce it again, please re-open this bug.

Changed in ubuntu-power-systems:
status: Incomplete → Fix Committed
Changed in linux (Ubuntu):
status: Confirmed → Fix Committed
bugproxy (bugproxy) wrote : Comment bridged from LTC Bugzilla

------- Comment From <email address hidden> 2019-05-09 02:51 EDT-------
(In reply to comment #25)
> upgraded the firmware on dradis to P9DSU20190404_IBM_prod_sign.pnor and
> tested with bionic and disco and the issue does not reproduce. Marking this
> bug as fix-committed, and if you are able to reproduce this again please
> re-open this bug.

As I understand it, the fix is in firmware and no fix was dropped into Linux
(bionic/disco). Should this bug therefore be rejected as "not a bug" from the
Linux point of view, since no Linux fix is involved? Please advise.

Andrew Cloke (andrew-cloke) wrote :

Po-Hsu Lin (cypressyew), are you able to verify that the firmware update has indeed resolved the issue?

Po-Hsu Lin (cypressyew)
no longer affects: linux (Ubuntu)
Po-Hsu Lin (cypressyew) wrote :

Hello,
I can no longer reproduce this CPU hard lockup issue on node "dradis".

Tested with a tweaked cpu-on-off-test.sh script with the -a flag to offline/online all available CPUs 10 times, and to offline/online cpu159 100 times.

I will remove the linux project from this bug.

Thanks!
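
The repeated offline/online cycling of a single CPU described in the comment above can be sketched roughly as follows. This is not the cpu-on-off-test.sh script itself, only an illustration under stated assumptions: `cycle_cpu` and `has_hard_lockup` are hypothetical helper names, and `CPU_SYSFS` is a hypothetical override so the loop can be dry-run against a scratch directory instead of the real sysfs tree.

```shell
# Sketch: toggle one CPU offline/online repeatedly, then scan the kernel log.
# CPU_SYSFS is an illustrative override; by default it points at the real sysfs path.
CPU_SYSFS=${CPU_SYSFS:-/sys/devices/system/cpu}

# cycle_cpu CPU ROUNDS: write 0 then 1 to the CPU's "online" file ROUNDS times.
# Needs root when run against the real sysfs tree.
cycle_cpu() {
    cpu=$1
    rounds=$2
    ctl="$CPU_SYSFS/cpu$cpu/online"
    i=0
    while [ "$i" -lt "$rounds" ]; do
        echo 0 > "$ctl" || return 1
        echo 1 > "$ctl" || return 1
        i=$((i + 1))
    done
    # Succeed only if the CPU ends up online again.
    [ "$(cat "$ctl")" = 1 ]
}

# has_hard_lockup: read kernel log text on stdin and succeed if it contains
# a watchdog message of the form seen in this report.
has_hard_lockup() {
    grep -q 'Watchdog CPU:[0-9]* Hard LOCKUP'
}
```

A run matching the test above would be `cycle_cpu 159 100` as root, followed by `dmesg | has_hard_lockup && echo "hard lockup seen"`.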

Changed in ubuntu-power-systems:
status: Fix Committed → Fix Released