cpuhotplug04 in cpuhotplug from ubuntu_ltp cause test hang on J-5.15 P9

Bug #1979179 reported by Po-Hsu Lin
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
The Ubuntu-power-systems project
In Progress
Medium
Unassigned
ubuntu-kernel-tests
In Progress
Undecided
Unassigned

Bug Description

Issue found on Jammy 5.15.0-40.43 Power 9 node "baltar"

The cpuhotplug04 test will hang with:

Test output:
<<<test_start>>>
tag=cpuhotplug04 stime=1655706629
cmdline="cpuhotplug04.sh -l 1"
contacts=""
analysis=exit
<<<test_output>>>
incrementing stop
Name: cpuhotplug04
Date: Mon Jun 20 06:30:29 UTC 2022
Desc: Does it prevent us from offlining the last CPU?

dmesg output:
[ 342.099325] LTP: starting cpuhotplug04 (cpuhotplug04.sh -l 1)
[ 343.109144] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #20!!!
[ 343.336137] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #20!!!
[ 343.571276] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #20!!!
[ 343.787207] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #20!!!
[ 344.019402] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #20!!!
[ 344.255955] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #20!!!
[ 344.495940] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #20!!!
[ 344.659877] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #20!!!
[ 344.927993] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #20!!!
[ 345.124037] NOHZ tick-stop error: Non-RCU local softirq work is pending, handler #20!!!
[ 346.583794] EEH: Recovering PHB#3-PE#1fd
[ 346.583840] EEH: PE location: Onboard SAS, PHB location: N/A
[ 346.583885] EEH: Frozen PHB#3-PE#1fd detected
[ 346.583927] EEH: Call Trace:
[ 346.583950] EEH: [c0000000000554ec] __eeh_send_failure_event+0x6c/0x120
[ 346.583989] EEH: [c00000000004e0b4] eeh_dev_check_failure.part.0+0x254/0x630
[ 346.584015] EEH: [c000000000aba988] __pci_write_msi_msg+0x2a8/0x350
[ 346.584048] EEH: [c000000000245308] msi_domain_set_affinity+0xe8/0x100
[ 346.584081] EEH: [c0000000002354cc] irq_do_set_affinity+0x7c/0x230
[ 346.584104] EEH: [c0000000002446a4] irq_migrate_all_off_this_cpu+0x334/0x3a0
[ 346.584138] EEH: [c0000000000d79a4] xive_smp_disable_cpu+0x34/0xd0
[ 346.584160] EEH: [c0000000000eaaa8] pnv_smp_cpu_disable+0x88/0x130
[ 346.584192] EEH: [c0000000000644f4] __cpu_disable+0x54/0xb0
[ 346.584213] EEH: [c0000000001788cc] take_cpu_down+0x4c/0x110
[ 346.584236] EEH: [c0000000002db254] multi_cpu_stop+0xe4/0x240
[ 346.584273] EEH: [c0000000002daa84] cpu_stopper_thread+0x164/0x2c0
[ 346.584298] EEH: [c0000000001bc2e0] smpboot_thread_fn+0x1f0/0x260
[ 346.584340] EEH: [c0000000001b5774] kthread+0x184/0x190
[ 346.584378] EEH: [c00000000000cf64] ret_from_kernel_thread+0x5c/0x64
[ 346.584417] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
[ 346.584420] EEH: Notify device drivers to shutdown
[ 346.584423] EEH: Beginning: 'error_detected(IO frozen)'
[ 346.584427] PCI 0003:01:00.0#01fd: EEH: Invoking aacraid->error_detected(IO frozen)
[ 346.584434] aacraid 0003:01:00.0: aacraid: PCI error detected 2
[ 346.622048] sd 0:2:0:0: [sda] tag#764 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
[ 346.622067] sd 0:2:0:0: [sda] tag#764 CDB: Write(10) 2a 00 95 69 92 80 00 00 10 00
[ 346.622075] blk_update_request: I/O error, dev sda, sector 2506723968 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[ 346.622123] EXT4-fs warning (device sda2): ext4_end_bio:342: I/O error 10 writing to inode 38276474 starting block 313340498)
[ 346.622134] Buffer I/O error on device sda2, logical block 313338192
[ 346.622167] Buffer I/O error on device sda2, logical block 313338193
[ 346.622213] sd 0:2:0:0: [sda] tag#765 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
[ 346.622220] sd 0:2:0:0: [sda] tag#765 CDB: Write(10) 2a 00 95 69 cc e8 00 00 08 00
[ 346.622226] blk_update_request: I/O error, dev sda, sector 2506738920 op 0x1:(WRITE) flags 0x0 phys_seg 1 prio class 0
[ 346.622278] EXT4-fs warning (device sda2): ext4_end_bio:342: I/O error 10 writing to inode 38276475 starting block 313342366)
[ 346.622289] Buffer I/O error on device sda2, logical block 313340061
[ 346.622329] sd 0:2:0:0: [sda] tag#766 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
[ 346.622336] sd 0:2:0:0: [sda] tag#766 CDB: Write(10) 2a 00 6e 80 49 00 00 00 08 00
[ 346.622341] blk_update_request: I/O error, dev sda, sector 1853901056 op 0x1:(WRITE) flags 0x103000 phys_seg 1 prio class 0
[ 346.622415] Buffer I/O error on dev sda2, logical block 231735328, lost async page write
[ 346.622461] EXT4-fs error (device sda2): ext4_check_bdev_write_error:217: comm rs:main Q:Reg: Error while async write back metadata
[ 346.622591] sd 0:2:0:0: [sda] tag#513 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
[ 346.622602] sd 0:2:0:0: [sda] tag#513 CDB: Synchronize Cache(10) 35 00 00 00 00 00 00 00 00 00
[ 346.622613] blk_update_request: I/O error, dev sda, sector 1950670200 op 0x1:(WRITE) flags 0x20800 phys_seg 1 prio class 0
[ 346.622694] Aborting journal on device sda2-8.
[ 346.623901] PCI 0003:01:00.0#01fd: EEH: aacraid driver reports: 'need reset'
[ 346.623910] EEH: Finished:'error_detected(IO frozen)' with aggregate recovery state:'need reset'
[ 346.623919] EEH: Collect temporary log
[ 346.623947] EEH: of node=0003:01:00.0
[ 346.623954] EEH: PCI device/vendor: 028d9005
[ 346.623960] EEH: PCI cmd/status register: 00100146
[ 346.623964] EEH: PCI-E capabilities and status follow:
[ 346.623979] EEH: PCI-E 00: 00020010 000081a2 00002950 00437083
[ 346.623991] EEH: PCI-E 10: 10830000 00000000 00000000 00000000
[ 346.623995] EEH: PCI-E 20: 00000000
[ 346.623998] EEH: PCI-E AER capability register set follows:
[ 346.624013] EEH: PCI-E AER 00: 30020001 00000000 00400000 00462030
[ 346.624025] EEH: PCI-E AER 10: 00000000 0000e000 000001e0 00000000
[ 346.624037] EEH: PCI-E AER 20: 00000000 00000000 00000000 00000000
[ 346.624045] EEH: PCI-E AER 30: 00000000 00000000
[ 346.624048] PHB4 PHB#3 Diag-data (Version: 1)
[ 346.624052] brdgCtl: 00000002
[ 346.624055] RootSts: 00060040 00402000 a0830008 00100107 00000800
[ 346.624061] PhbSts: 0000001c00000000 0000001c00000000
[ 346.624065] Lem: 0000000100000080 0000000000000000 0000000000000080
[ 346.624069] PhbErr: 0000028000000000 0000020000000000 2148000098000240 a008400000000000
[ 346.624075] RxeTceErr: 2000000000000000 2000000000000000 40000000000001fd 0000000000000000
[ 346.624081] PblErr: 0000000000020000 0000000000020000 0000000000000000 0000000000000000
[ 346.624085] RegbErr: 0000004000000000 0000004000000000 8800003c00000000 0000000007011000
[ 346.624094] PE[1fd] A/B: 8000b03800000000 8000000000000000
[ 346.624098] EEH: Reset without hotplug activity
[ 346.625998] EXT4-fs error (device sda2): ext4_journal_check_start:83: comm systemd-journal: Detected aborted journal
[ 346.626020] EXT4-fs error (device sda2): ext4_journal_check_start:83: comm rs:main Q:Reg: Detected aborted journal
[ 349.134491] EXT4-fs error (device sda2): ext4_journal_check_start:83: comm ext4lazyinit: Detected aborted journal
[ 359.548470] EEH: Beginning: 'slot_reset'
[ 359.548480] PCI 0003:01:00.0#01fd: EEH: Invoking aacraid->slot_reset()
[ 359.548486] aacraid 0003:01:00.0: aacraid: PCI error - slot reset
[ 359.615158] PCI 0003:01:00.0#01fd: EEH: aacraid driver reports: 'recovered'
[ 359.615162] EEH: Finished:'slot_reset' with aggregate recovery state:'recovered'
[ 359.615163] EEH: Notify device driver to resume
[ 359.615164] EEH: Beginning: 'resume'
[ 359.615172] PCI 0003:01:00.0#01fd: EEH: Invoking aacraid->resume()
[ 474.477185] EXT4-fs error (device sda2): ext4_journal_check_start:83: comm snapd: Detected aborted journal
[ 484.561619] INFO: task kworker/17:1:1020 blocked for more than 120 seconds.
[ 484.561643] Not tainted 5.15.0-40-generic #43-Ubuntu
[ 484.561664] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 484.561679] task:kworker/17:1 state:D stack: 0 pid: 1020 ppid: 2 flags:0x00000800
[ 484.561686] Workqueue: events cpuset_hotplug_workfn
[ 484.561694] Call Trace:
[ 484.561695] [c0000000198c3350] [c0000000198c3420] 0xc0000000198c3420 (unreliable)
[ 484.561703] [c0000000198c3540] [c00000000001ddf4] __switch_to+0x134/0x230
[ 484.561711] [c0000000198c35a0] [c0000000012a8c38] __schedule+0x228/0x6f0
[ 484.561718] [c0000000198c3660] [c0000000012a916c] schedule+0x6c/0x130
[ 484.561724] [c0000000198c3690] [c000000000220250] rwsem_down_write_slowpath+0x340/0x6a0
[ 484.561731] [c0000000198c3770] [c000000000535084] mpol_rebind_mm+0x44/0x120
[ 484.561737] [c0000000198c37b0] [c0000000002cfb7c] update_tasks_nodemask+0xfc/0x210
[ 484.561743] [c0000000198c3870] [c0000000002d4df8] cpuset_hotplug_update_tasks+0x548/0x610
[ 484.561749] [c0000000198c3900] [c0000000002d5114] cpuset_hotplug_workfn+0x254/0x7d0
[ 484.561754] [c0000000198c3c70] [c0000000001a8634] process_one_work+0x2f4/0x5e0
[ 484.561760] [c0000000198c3d10] [c0000000001a8f08] worker_thread+0xa8/0x610
[ 484.561765] [c0000000198c3da0] [c0000000001b5774] kthread+0x184/0x190
[ 484.561772] [c0000000198c3e10] [c00000000000cf64] ret_from_kernel_thread+0x5c/0x64
[ 484.561792] INFO: task jbd2/sda2-8:1767 blocked for more than 120 seconds.
[ 484.561816] Not tainted 5.15.0-40-generic #43-Ubuntu
[ 484.561837] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 484.561860] task:jbd2/sda2-8 state:D stack: 0 pid: 1767 ppid: 2 flags:0x00000800
[ 484.561866] Call Trace:
[ 484.561867] [c00000005640b540] [c00000005630ad00] 0xc00000005630ad00 (unreliable)
[ 484.561873] [c00000005640b730] [c00000000001ddf4] __switch_to+0x134/0x230
[ 484.561879] [c00000005640b790] [c0000000012a8c38] __schedule+0x228/0x6f0
[ 484.561886] [c00000005640b850] [c0000000012a916c] schedule+0x6c/0x130
[ 484.561891] [c00000005640b880] [c0000000012a92b4] io_schedule+0x54/0x80
[ 484.561897] [c00000005640b8b0] [c0000000012a9ec8] bit_wait_io+0x28/0xa0
[ 484.561903] [c00000005640b8e0] [c0000000012a9898] __wait_on_bit+0x98/0x160
[ 484.561909] [c00000005640b930] [c0000000012a9a00] out_of_line_wait_on_bit+0xa0/0xd0
[ 484.561916] [c00000005640b9a0] [c000000000619c90] __wait_on_buffer+0x50/0x70
[ 484.561923] [c00000005640b9d0] [c0000000007a2ac8] jbd2_write_superblock+0x228/0x330
[ 484.561930] [c00000005640ba80] [c0000000007a2894] jbd2_journal_abort+0x134/0x140
[ 484.561936] [c00000005640bb00] [c000000000795d14] jbd2_journal_commit_transaction+0x1294/0x1f20
[ 484.561943] [c00000005640bd00] [c00000000079f36c] kjournald2+0xfc/0x370
[ 484.561948] [c00000005640bda0] [c0000000001b5774] kthread+0x184/0x190
[ 484.561955] [c00000005640be10] [c00000000000cf64] ret_from_kernel_thread+0x5c/0x64

Past test result:
  * This test was not executed (or not successfully executed on this node with 5.15.0-28.29 ~ 5.15.0-36.37)

I can reproduce it manually with 5.15.0-39

Steps to reproduce:
  sudo apt install -y automake bison build-essential byacc flex git keyutils libacl1-dev libaio-dev libcap-dev libmm-dev libnuma-dev libsctp-dev libselinux1-dev libssl-dev libtirpc-dev pkg-config quota xfslibs-dev xfsprogs gcc
  git clone --depth=1 https://github.com/linux-test-project/ltp.git
  cd ltp
  make autotools
  ./configure
  sudo make install
  echo "cpuhotplug04 cpuhotplug04.sh -l 1" > /tmp/target
  sudo /opt/ltp/runltp -f /tmp/target

Note that this issue does not exist on Impish 5.13.0 with this P9 node.

Po-Hsu Lin (cypressyew)
tags: added: 5.15 jammy ppc64el sru-20220530 ubuntu-ltp
Po-Hsu Lin (cypressyew)
description: updated
Po-Hsu Lin (cypressyew)
description: updated
Revision history for this message
Po-Hsu Lin (cypressyew) wrote :
Download full text (8.4 KiB)

I can see this issue with 5.19-rc3 as well:

[ 199.281238] LTP: starting cpuhotplug04 (cpuhotplug04.sh -l 1)
[ 203.753796] EEH: Recovering PHB#3-PE#1fd
[ 203.753813] EEH: PE location: Onboard SAS, PHB location: N/A
[ 203.753817] EEH: Frozen PHB#3-PE#1fd detected
[ 203.753821] EEH: Call Trace:
[ 203.753822] EEH: [c00000000005596c] __eeh_send_failure_event+0x6c/0x120
[ 203.753840] EEH: [c00000000004e5b4] eeh_dev_check_failure.part.0+0x254/0x650
[ 203.753851] EEH: [c000000000ad9d58] __pci_write_msi_msg+0x2a8/0x350
[ 203.753858] EEH: [c000000000ada2e8] pci_msi_domain_write_msg+0x38/0x50
[ 203.753863] EEH: [c000000000241398] msi_domain_set_affinity+0xe8/0x100
[ 203.753869] EEH: [c000000000230c04] irq_do_set_affinity+0x1e4/0x2a0
[ 203.753874] EEH: [c000000000240414] irq_migrate_all_off_this_cpu+0x324/0x390
[ 203.753879] EEH: [c0000000000d3f24] xive_smp_disable_cpu+0x34/0xd0
[ 203.753883] EEH: [c0000000000e5eb8] pnv_smp_cpu_disable+0x88/0x120
[ 203.753888] EEH: [c000000000063c84] __cpu_disable+0x54/0xb0
[ 203.753892] EEH: [c000000000170c4c] take_cpu_down+0x4c/0x110
[ 203.753896] EEH: [c0000000002da8d4] multi_cpu_stop+0xe4/0x230
[ 203.753902] EEH: [c0000000002d9ee4] cpu_stopper_thread+0x164/0x2c0
[ 203.753907] EEH: [c0000000001b4e90] smpboot_thread_fn+0x1f0/0x260
[ 203.753911] EEH: [c0000000001adaa4] kthread+0x124/0x130
[ 203.753916] EEH: [c00000000000ce54] ret_from_kernel_thread+0x5c/0x64
[ 203.753920] EEH: This PCI device has failed 1 times in the last hour and will be permanently disabled after 5 failures.
[ 203.753922] EEH: Notify device drivers to shutdown
[ 203.753925] EEH: Beginning: 'error_detected(IO frozen)'
[ 203.753928] PCI 0003:01:00.0#01fd: EEH: Invoking aacraid->error_detected(IO frozen)
[ 203.753934] aacraid 0003:01:00.0: aacraid: PCI error detected 2
[ 203.774865] sd 0:2:0:0: [sda] tag#404 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
[ 203.774884] sd 0:2:0:0: [sda] tag#404 CDB: Write(10) 2a 00 23 00 ad 00 00 02 00 00
[ 203.774890] I/O error, dev sda, sector 587246848 op 0x1:(WRITE) flags 0x4800 phys_seg 4 prio class 0
[ 203.774903] sd 0:2:0:0: [sda] tag#405 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
[ 203.774909] sd 0:2:0:0: [sda] tag#405 CDB: Write(10) 2a 00 23 00 af 00 00 02 00 00
[ 203.774912] I/O error, dev sda, sector 587247360 op 0x1:(WRITE) flags 0x4800 phys_seg 4 prio class 0
[ 203.774933] sd 0:2:0:0: [sda] tag#406 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
[ 203.774938] sd 0:2:0:0: [sda] tag#406 CDB: Write(10) 2a 00 23 00 b1 00 00 02 00 00
[ 203.774940] I/O error, dev sda, sector 587247872 op 0x1:(WRITE) flags 0x4800 phys_seg 4 prio class 0
[ 203.774948] sd 0:2:0:0: [sda] tag#407 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
[ 203.774952] sd 0:2:0:0: [sda] tag#407 CDB: Write(10) 2a 00 23 00 b3 00 00 02 00 00
[ 203.774954] I/O error, dev sda, sector 587248384 op 0x1:(WRITE) flags 0x4800 phys_seg 4 prio class 0
[ 203.774962] sd 0:2:0:0: [sda] tag#408 FAILED Result: hostbyte=DID_NO_CONNECT driverbyte=DRIVER_OK cmd_age=0s
[ 203.774965] sd 0:2:0:0: [sda] tag#408 CDB: Write(10) 2a 00 23...

Read more...

description: updated
Po-Hsu Lin (cypressyew)
tags: added: sru-20220620
bugproxy (bugproxy)
tags: added: architecture-ppc64le bugnameltc-198904 severity-medium targetmilestone-inin---
Po-Hsu Lin (cypressyew)
tags: added: sru-20230612
Frank Heimes (fheimes)
Changed in ubuntu-kernel-tests:
status: New → In Progress
Changed in ubuntu-power-systems:
status: New → In Progress
importance: Undecided → Medium
tags: added: 6.2 lunar
To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.