P8 node modoc will reboot automatically when running the sru_misc test suite

Bug #1867155 reported by Po-Hsu Lin
This bug affects 1 person
Affects              Status     Importance  Assigned to  Milestone
ubuntu-kernel-tests  Triaged    Undecided   Unassigned
linux (Ubuntu)       Confirmed  Undecided   Unassigned

Bug Description

Out of 5 attempts, 4 hung around the following test in the net sub-category of ubuntu_kernel_selftests:
 # selftests: net: reuseport_bpf_cpu

First attempt:
23:21:32 DEBUG| [stdout] ok 2 selftests: net: reuseport_bpf_cpu
23:21:32 DEBUG| [stdout] # selftests: net: reuseport_bpf_numa
23:21:32 DEBUG| [stdout] # ---- IPv4 UDP ----
(hang here)

Second attempt:
10:17:35 DEBUG| [stdout] ok 1 selftests: net: reuseport_bpf
10:17:35 DEBUG| [stdout] # selftests: net: reuseport_bpf_cpu
10:17:35 DEBUG| [stdout] # ---- IPv4 UDP ----
10:17:35 DEBUG| [stdout] # send cpu 0, receive socket 0
(lines skipped)
10:17:35 DEBUG| [stdout] # send cpu 159, receive socket 159
10:17:35 DEBUG| [stdout] # ---- IPv6 TCP ----
(hang here)

Third attempt failed because of test timeout:
12:46:16 DEBUG| [stdout] # [FAIL]
12:46:16 DEBUG| [stdout] # --------------------
12:46:16 DEBUG| [stdout] # running psock_tpacket test
12:46:16 DEBUG| [stdout] # --------------------
13:14:13 INFO | Timer expired (1800 sec.), nuking pid 161853

Fourth attempt:
07:41:51 DEBUG| [stdout] # selftests: net: reuseport_bpf_cpu
07:41:51 DEBUG| [stdout] # ---- IPv4 UDP ----
07:41:51 DEBUG| [stdout] # send cpu 0, receive socket 0
(lines skipped)
07:41:51 DEBUG| [stdout] # send cpu 159, receive socket 159
07:41:51 DEBUG| [stdout] # ---- IPv6 UDP ----
07:41:51 DEBUG| [stdout] # send cpu 0, receive socket 0
07:41:51 DEBUG| [stdout] # send cpu 1, receive socket 1
(lines skipped)
07:41:51 DEBUG| [stdout] # send cpu 157, receive socket 157
07:41:51 DEBUG| [stdout] # send cpu 159, receive socket 159
07:41:51 DEBUG| [stdout] # ---- IPv4 TCP ----
(test hang here)

Fifth attempt:
04:29:17 DEBUG| [stdout] ok 1 selftests: net: reuseport_bpf
04:29:17 DEBUG| [stdout] # selftests: net: reuseport_bpf_cpu
04:29:17 DEBUG| [stdout] # ---- IPv4 UDP ----
04:29:17 DEBUG| [stdout] # send cpu 0, receive socket 0
(lines skipped)
04:29:17 DEBUG| [stdout] # send cpu 159, receive socket 159
04:29:17 DEBUG| [stdout] # ---- IPv6 UDP ----
04:29:17 DEBUG| [stdout] # send cpu 0, receive socket 0
(lines skipped)
04:29:17 DEBUG| [stdout] # send cpu 159, receive socket 159
04:29:17 DEBUG| [stdout] # ---- IPv4 TCP ----
04:29:17 DEBUG| [stdout] # send cpu 0, receive socket 0
(lines skipped)
04:29:17 DEBUG| [stdout] # send cpu 15, receive socket 15
(test hang here)
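
To compare where each attempt stopped, the last selftest and the last "----" section header before the output ends can be pulled out of the autotest log. The following is only a minimal sketch: the function name last_progress is hypothetical, and it assumes the "HH:MM:SS DEBUG| [stdout] ..." line format shown in the excerpts above.

```python
import re

# Matches autotest lines such as
# "07:41:51 DEBUG| [stdout] # selftests: net: reuseport_bpf_cpu"
STDOUT_RE = re.compile(r"^\d{2}:\d{2}:\d{2} DEBUG\| \[stdout\] (.*)$")


def last_progress(log_text):
    """Return (test, section, last_payload) seen before the log stops."""
    test = section = last = None
    for raw in log_text.splitlines():
        match = STDOUT_RE.match(raw.strip())
        if not match:
            continue
        payload = match.group(1)
        last = payload
        if payload.startswith("# selftests: "):
            # A new selftest started, so any earlier section no longer applies.
            test = payload[len("# selftests: "):]
            section = None
        elif payload.startswith("# ---- ") and payload.endswith(" ----"):
            section = payload[len("# ---- "):-len(" ----")]
    return test, section, last


fourth_attempt = """\
07:41:51 DEBUG| [stdout] # selftests: net: reuseport_bpf_cpu
07:41:51 DEBUG| [stdout] # ---- IPv4 UDP ----
07:41:51 DEBUG| [stdout] # send cpu 159, receive socket 159
07:41:51 DEBUG| [stdout] # ---- IPv4 TCP ----
"""
print(last_progress(fourth_attempt))
```

Run against the tail of each attempt's log, this gives one (test, section) pair per attempt, which makes it easier to see whether the hang always lands in the same reuseport test.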

I tried to run tests in this sru-misc suite in the following order:
        'hwclock',
        'ubuntu_bpf',
        'ubuntu_bpf_jit',
        'ubuntu_kernel_selftests',
        'ubuntu_lxc',
        'ubuntu_seccomp',
        'ubuntu_unionmount_ovlfs',
        'ubuntu_cts_kernel',
        'ubuntu_kvm_unit_tests',
one by one on this node, but I could not reproduce the issue that way.

I tried watching dmesg while this happens, but nothing is logged there; the system just reboots on its own, silently.

This is what you can see from syslog after reboot:
Mar 12 04:27:39 modoc kernel: [ 536.668305] Injecting error (-12) to MEM_GOING_OFFLINE
Mar 12 04:27:39 modoc kernel: [ 536.684547] Injecting error (-12) to MEM_GOING_OFFLINE
Mar 12 04:27:39 modoc kernel: [ 536.700907] Injecting error (-12) to MEM_GOING_OFFLINE
Mar 12 04:27:39 modoc kernel: [ 536.717246] Injecting error (-12) to MEM_GOING_OFFLINE
Mar 12 04:27:39 modoc kernel: [ 536.719288] page:c00c000000c4f000 refcount:1 mapcount:0 mapping:c000000f8cfe0fd1 index:0x7611c3e
Mar 12 04:27:39 modoc kernel: [ 536.719289] anon
Mar 12 04:27:39 modoc kernel: [ 536.719291] flags: 0x3ffff800080024(uptodate|active|swapbacked)
Mar 12 04:27:39 modoc kernel: [ 536.719294] raw: 003ffff800080024 5deadbeef0000100 5deadbeef0000122 c000000f8cfe0fd1
Mar 12 04:27:39 modoc kernel: [ 536.719295] raw: 0000000007611c3e 0000000000000000 00000001ffffffff c000000fcfd1c000
Mar 12 04:27:39 modoc kernel: [ 536.719296] page dumped because: unmovable page
Mar 12 04:27:39 modoc kernel: [ 536.719296] page->mem_cgroup:c000000fcfd1c000
Mar 12 04:27:39 modoc kernel: [ 536.735465] Injecting error (-12) to MEM_GOING_OFFLINE
Mar 12 04:27:39 modoc kernel: [ 536.751848] Injecting error (-12) to MEM_GOING_OFFLINE
Mar 12 04:27:39 modoc kernel: [ 536.768210] Injecting error (-12) to MEM_GOING_OFFLINE
Mar 12 04:27:39 modoc kernel: [ 536.784450] Injecting error (-12) to MEM_GOING_OFFLINE
Mar 12 04:27:39 modoc kernel: [ 536.800756] Injecting error (-12) to MEM_GOING_OFFLINE
Mar 12 04:27:39 modoc kernel: [ 536.817006] Injecting error (-12) to MEM_GOING_OFFLINE
Mar 12 04:27:39 modoc kernel: [ 536.833133] Injecting error (-12) to MEM_GOING_OFFLINE
Mar 12 04:27:39 modoc kernel: [ 536.849205] Injecting error (-12) to MEM_GOING_OFFLINE
Mar 12 04:27:39 modoc kernel: [ 536.865448] Injecting error (-12) to MEM_GOING_OFFLINE
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@Mar 12 04:35:41 modoc systemd[1]: Starting Flush Journal to Persistent Storage...
Mar 12 04:35:41 modoc kernel: [ 0.000000] hash-mmu: Page sizes from device-tree:
Mar 12 04:35:41 modoc kernel: [ 0.000000] hash-mmu: base_shift=12: shift=12, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=0
Mar 12 04:35:41 modoc kernel: [ 0.000000] hash-mmu: base_shift=12: shift=16, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=7
Mar 12 04:35:41 modoc kernel: [ 0.000000] hash-mmu: base_shift=12: shift=24, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=56
Mar 12 04:35:41 modoc systemd[1]: Started udev Kernel Device Manager.

In the log above, the line of "^@" characters marks the reboot. It looks like the memory-hotplug test was running at the time.
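
A run of NUL bytes like this can be located automatically so that the last message before the reboot is easy to find. The sketch below is an assumption-laden helper, not part of the test suite: the name find_reboot_gaps and the threshold of 8 are made up for illustration, and it accepts both real NUL bytes and the "^@" caret notation that pagers display.

```python
import re


def find_reboot_gaps(lines, min_run=8):
    """Return (line_number, previous_line, text_after_nuls) per NUL run."""
    # Accept raw NUL bytes as well as the "^@" notation used when pasting.
    pattern = re.compile(r"(?:\x00|\^@){%d,}" % min_run)
    gaps = []
    previous = None
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        match = pattern.search(line)
        if match:
            gaps.append((number, previous, line[match.end():]))
        previous = line
    return gaps


sample = [
    "Mar 12 04:27:39 modoc kernel: [ 536.865448] Injecting error (-12) to MEM_GOING_OFFLINE\n",
    "\x00" * 20 + "Mar 12 04:35:41 modoc systemd[1]: Starting Flush Journal to Persistent Storage...\n",
]
for number, before, after in find_reboot_gaps(sample):
    print(number, "| last before:", before, "| first after:", after)
```

On a real system the input could come from something like open("/var/log/syslog", errors="replace"), assuming the default Ubuntu syslog location.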

Maybe we need to use IPMI to see if there is anything on the console.

ProblemType: Bug
DistroRelease: Ubuntu 19.10
Package: linux-image-5.3.0-42-generic 5.3.0-42.34
ProcVersionSignature: Ubuntu 5.3.0-42.34-generic 5.3.18
Uname: Linux 5.3.0-42-generic ppc64le
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Mar 12 04:33 seq
 crw-rw---- 1 root audio 116, 33 Mar 12 04:33 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
ApportVersion: 2.20.11-0ubuntu8.5
Architecture: ppc64el
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
Date: Thu Mar 12 09:42:24 2020
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
Lsusb:
 Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
PciMultimedia:

ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 LANG=C.UTF-8
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: root=UUID=b2a867ce-7813-4785-8861-4e7de2ac39b4 ro console=hvc0
ProcLoadAvg: 0.07 0.02 0.00 1/1461 86637
ProcLocks:
 1: POSIX ADVISORY WRITE 3799 00:18:841 0 EOF
 2: POSIX ADVISORY WRITE 3526 00:18:743 0 EOF
 3: FLOCK ADVISORY WRITE 3720 00:18:837 0 EOF
ProcSwaps:
 Filename Type Size Used Priority
 /swap.img file 8388544 0 -2
ProcVersion: Linux version 5.3.0-42-generic (buildd@bos02-ppc64el-006) (gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2)) #34-Ubuntu SMP Fri Feb 28 05:49:17 UTC 2020
RelatedPackageVersions:
 linux-restricted-modules-5.3.0-42-generic N/A
 linux-backports-modules-5.3.0-42-generic N/A
 linux-firmware 1.183.4
RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
VarLogDump_list: total 0
cpu_cores: Number of cores present = 20
cpu_coreson: Number of cores online = 20
cpu_dscr: DSCR is 0
cpu_freq:
 min: 3.694 GHz (cpu 159)
 max: 3.695 GHz (cpu 1)
 avg: 3.694 GHz
cpu_runmode:
 Could not retrieve current diagnostics mode,
 No kernel interface to firmware
cpu_smt: SMT=8

Revision history for this message
Po-Hsu Lin (cypressyew) wrote :
tags: added: sru-20200217
summary: - P8 node modoc will reboot when running the sru_misc
+ P8 node modoc will reboot automatically when running the sru_misc test
+ suite
description: updated
Po-Hsu Lin (cypressyew)
description: updated
Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Po-Hsu Lin (cypressyew)
tags: added: kqa-blocker
removed: sru-20200217
Po-Hsu Lin (cypressyew)
description: updated
Po-Hsu Lin (cypressyew) wrote :

Please find the syslog attached.

Po-Hsu Lin (cypressyew) wrote :

I can see this message on a freshly deployed modoc:
[ 396.856011] ipr 0001:08:00.0: 9076: Configuration error, missing remote IOA
[ 396.856042] ipr 0001:08:00.0: Attached Adapter not discovered within allotted time [PRC: 17101541]
[ 396.856051] ipr 0001:08:00.0: Remote IOA VPID/SN: 00000000
[ 396.856058] ipr 0001:08:00.0: Remote IOA WWN: 0000000000000000
[ 396.856065] ipr: 00000000: 01000000 00000100 FC220000 08400000
[ 396.856072] ipr: 00000010: 4C494344 49424D20 20202020 35374438
[ 396.856078] ipr: 00000020: 30303153 4953494F 41202020 00000000
[ 396.856084] ipr: 00000030: 30303431 57303132 50050760 5EC33900
[ 396.856091] ipr: 00000040: 00000001 00000000 02020101 000000FF

With 5.3.0-40

Po-Hsu Lin (cypressyew) wrote :

I managed to catch this issue with IPMI console:

[ 673.975988] BUG: Unable to handle kernel instruction fetch
[ 673.976017] Faulting instruction address: 0x7fe000087fe00008
[ 673.976025] Oops: Kernel access of bad area, sig: 11 [#1]
[ 673.976032] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
[ 673.976040] Modules linked in: binfmt_misc dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua leds_powernv ipmi_powernv ipmi_devintf ipmi_msghandler ibmpowernv uio_pdrv_genirq vmx_crypto uio powernv_rng powernv_op_panel sch_fq_codel ip_tables x_tables autofs4 ses enclosure scsi_transport_sas btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_vpmsum crc32c_vpmsum tg3 ipr [last unloaded: notifier_error_inject]
[ 673.976101] CPU: 0 PID: 76667 Comm: reuseport_bpf_c Not tainted 5.3.0-42-generic #34-Ubuntu
[ 673.976109] NIP: 7fe000087fe00008 LR: c000000000ca9d10 CTR: 7fe000087fe00008
[ 673.976117] REGS: c000000ffffcf580 TRAP: 0480 Not tainted (5.3.0-42-generic)
[ 673.976124] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24002488 XER: 20000000
[ 673.976135] CFAR: c000000000ca9d0c IRQMASK: 0
[ 673.976135] GPR00: c000000000e4094c c000000ffffcf810 c0000000019c9000 c000000a2208b6e0
[ 673.976135] GPR04: c008000007c70038 c000000a2208b6e0 0000000000000028 000000012e2e5506
[ 673.976135] GPR08: 00000000a2be9417 0000000000000000 0000000000000000 0000000000000000
[ 673.976135] GPR12: 7fe000087fe00008 c000000001d60000 0000000000000003 0000000000000101
[ 673.976135] GPR16: c000000003fe00e8 00000000000022b8 0000000000002788 c0000000018fb600
[ 673.976135] GPR20: 000000000000000a 0000000000000000 00000000000022b8 0000000000000001
[ 673.976135] GPR24: 0000000000000000 0000000000000028 00000000000000a0 c000000003fe00e8
[ 673.976135] GPR28: c008000007c70000 00000000147b1819 c000000a2208b6e0 c000001e2d8b0000
[ 673.976195] NIP [7fe000087fe00008] 0x7fe000087fe00008
[ 673.976206] LR [c000000000ca9d10] reuseport_select_sock+0x100/0x400
[ 673.976212] Call Trace:
[ 673.976216] [c000000ffffcf810] [c000000fe9daf0f0] 0xc000000fe9daf0f0 (unreliable)
[ 673.976226] [c000000ffffcf8b0] [c000000000e4094c] inet6_lhash2_lookup+0x1ec/0x220
[ 673.976234] [c000000ffffcf930] [c000000000e40c80] inet6_lookup_listener+0x300/0x3f0
[ 673.976244] [c000000ffffcf9d0] [c000000000e1ec38] tcp_v6_rcv+0x7d8/0xe10
[ 673.976252] [c000000ffffcfb00] [c000000000dd7e70] ip6_protocol_deliver_rcu+0x110/0x6e0
[ 673.976261] [c000000ffffcfb80] [c000000000dd846c] ip6_input_finish+0x2c/0x40
[ 673.976269] [c000000ffffcfba0] [c000000000dd8560] ip6_input+0xe0/0xf0
[ 673.976276] [c000000ffffcfc10] [c000000000dd7008] ip6_rcv_finish+0xa8/0xe0
[ 673.976283] [c000000ffffcfc40] [c000000000dd7b90] ipv6_rcv+0x100/0x110
[ 673.976291] [c000000ffffcfcc0] [c000000000c6d040] __netif_receive_skb_one_core+0x70/0xb0
[ 673.976299] [c000000ffffcfd00] [c000000000c6ee70] process_backlog+0xd0/0x230
[ 673.976307] [c000000ffffcfd70] [c000000000c6db88] net_rx_action+0x1e8/0x510
[ 673.976315] [c000000ffffcfe90] [c000000000e9c8a0] __do_softirq+0x160/0x3f4
[ 673.976324] [c000000ffffcff...
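
In the register dump above, NIP and CTR hold the same value, 0x7fe000087fe00008, which lies outside the 0xc000... range where the kernel addresses in this trace live. The sketch below pulls NIP, LR and CTR out of the oops text and splits such a value into 32-bit words. The helper names are hypothetical. The field decode follows the PowerPC X-form layout, under which 0x7fe00008 reads as "tw 31,0,0" (the trap instruction); this is only an observation on the numbers, and the report itself does not state a cause.

```python
import re

# Matches "NIP: <16 hex digits>", "LR: ..." and "CTR: ..." in the oops header.
REG_RE = re.compile(r"\b(NIP|LR|CTR):\s*([0-9a-f]{16})\b")

KERNEL_BASE = 0xC000000000000000  # kernel addresses in this trace start here


def parse_regs(oops_text):
    """Return {'NIP': int, 'LR': int, 'CTR': int} from the oops header."""
    return {name: int(value, 16) for name, value in REG_RE.findall(oops_text)}


def split_words(addr):
    """Split a 64-bit value into its high and low 32-bit words."""
    return addr >> 32, addr & 0xFFFFFFFF


def decode_xform(word):
    """Decode the X-form fields of a 32-bit PowerPC instruction word."""
    return {
        "opcode": word >> 26,
        "to": (word >> 21) & 0x1F,
        "ra": (word >> 16) & 0x1F,
        "rb": (word >> 11) & 0x1F,
        "xo": (word >> 1) & 0x3FF,
    }


oops = "[ 673.976109] NIP: 7fe000087fe00008 LR: c000000000ca9d10 CTR: 7fe000087fe00008"
regs = parse_regs(oops)
for name, value in regs.items():
    print(name, hex(value), "kernel address" if value >= KERNEL_BASE else "not a kernel address")
high, low = split_words(regs["NIP"])
print(hex(high), hex(low), decode_xform(low))
```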


Sean Feole (sfeole)
Changed in ubuntu-kernel-tests:
status: New → Triaged
Po-Hsu Lin (cypressyew) wrote :

I can reproduce this with 5.3.0-40-generic, so removing the kqa-blocker tag.

Po-Hsu Lin (cypressyew)
tags: removed: kqa-blocker
Po-Hsu Lin (cypressyew) wrote :

This issue is striking us again with 5.3.0-60.54 on modoc.

tags: added: sru-20200608
Po-Hsu Lin (cypressyew) wrote :

As node modoc is no longer accessible, and I managed to reproduce this on nodes entei and dryden in bug 1909286, I will mark this one as a duplicate.
