P8 node modoc will reboot automatically when running the sru_misc test suite

Bug #1867155 reported by Po-Hsu Lin on 2020-03-12
This bug affects 1 person
Affects: ubuntu-kernel-tests | Importance: Undecided | Assigned to: Unassigned
Affects: linux (Ubuntu) | Importance: Undecided | Assigned to: Unassigned

Bug Description

Out of 5 attempts, 4 hung around the following test in the net sub-category of ubuntu_kernel_selftests:
 # selftests: net: reuseport_bpf_cpu

First attempt:
23:21:32 DEBUG| [stdout] ok 2 selftests: net: reuseport_bpf_cpu
23:21:32 DEBUG| [stdout] # selftests: net: reuseport_bpf_numa
23:21:32 DEBUG| [stdout] # ---- IPv4 UDP ----
(hang here)

Second attempt:
10:17:35 DEBUG| [stdout] ok 1 selftests: net: reuseport_bpf
10:17:35 DEBUG| [stdout] # selftests: net: reuseport_bpf_cpu
10:17:35 DEBUG| [stdout] # ---- IPv4 UDP ----
10:17:35 DEBUG| [stdout] # send cpu 0, receive socket 0
(lines skipped)
10:17:35 DEBUG| [stdout] # send cpu 159, receive socket 159
10:17:35 DEBUG| [stdout] # ---- IPv6 TCP ----
(hang here)

Third attempt failed because of test timeout:
12:46:16 DEBUG| [stdout] # [FAIL]
12:46:16 DEBUG| [stdout] # --------------------
12:46:16 DEBUG| [stdout] # running psock_tpacket test
12:46:16 DEBUG| [stdout] # --------------------
13:14:13 INFO | Timer expired (1800 sec.), nuking pid 161853

Fourth attempt:
07:41:51 DEBUG| [stdout] # selftests: net: reuseport_bpf_cpu
07:41:51 DEBUG| [stdout] # ---- IPv4 UDP ----
07:41:51 DEBUG| [stdout] # send cpu 0, receive socket 0
(lines skipped)
07:41:51 DEBUG| [stdout] # send cpu 159, receive socket 159
07:41:51 DEBUG| [stdout] # ---- IPv6 UDP ----
07:41:51 DEBUG| [stdout] # send cpu 0, receive socket 0
07:41:51 DEBUG| [stdout] # send cpu 1, receive socket 1
(lines skipped)
07:41:51 DEBUG| [stdout] # send cpu 157, receive socket 157
07:41:51 DEBUG| [stdout] # send cpu 159, receive socket 159
07:41:51 DEBUG| [stdout] # ---- IPv4 TCP ----
(test hang here)

Fifth attempt:
04:29:17 DEBUG| [stdout] ok 1 selftests: net: reuseport_bpf
04:29:17 DEBUG| [stdout] # selftests: net: reuseport_bpf_cpu
04:29:17 DEBUG| [stdout] # ---- IPv4 UDP ----
04:29:17 DEBUG| [stdout] # send cpu 0, receive socket 0
(lines skipped)
04:29:17 DEBUG| [stdout] # send cpu 159, receive socket 159
04:29:17 DEBUG| [stdout] # ---- IPv6 UDP ----
04:29:17 DEBUG| [stdout] # send cpu 0, receive socket 0
(lines skipped)
04:29:17 DEBUG| [stdout] # send cpu 159, receive socket 159
04:29:17 DEBUG| [stdout] # ---- IPv4 TCP ----
04:29:17 DEBUG| [stdout] # send cpu 0, receive socket 0
(lines skipped)
04:29:17 DEBUG| [stdout] # send cpu 15, receive socket 15
(test hang here)
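
The hang point in each attempt above can be read off mechanically from the autotest log. The following is a minimal Python sketch, not part of the test suite: the function name `last_progress` and both regular expressions are assumptions derived only from the log lines quoted in this report.

```python
import re

# Assumed patterns, based on the log excerpts quoted in this report.
TEST_RE = re.compile(r"# selftests: (\S+): (\S+)")   # e.g. "# selftests: net: reuseport_bpf_cpu"
SECTION_RE = re.compile(r"# ---- (.+?) ----")        # e.g. "# ---- IPv4 TCP ----"


def last_progress(log_text):
    """Return (test, section) for the last test and section header seen in the log.

    section is None if the last test printed no section header before the log ends.
    """
    test = section = None
    for line in log_text.splitlines():
        m = TEST_RE.search(line)
        if m:
            # A new test started, so any earlier section header no longer applies.
            test, section = m.group(2), None
            continue
        m = SECTION_RE.search(line)
        if m:
            section = m.group(1)
    return test, section
```

Fed the tail of a log captured before the reboot, it reports which selftest and which protocol section were in progress when the output stopped.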

I tried running the tests in this sru_misc suite one by one on this node, in the following order:
        'hwclock',
        'ubuntu_bpf',
        'ubuntu_bpf_jit',
        'ubuntu_kernel_selftests',
        'ubuntu_lxc',
        'ubuntu_seccomp',
        'ubuntu_unionmount_ovlfs',
        'ubuntu_cts_kernel',
        'ubuntu_kvm_unit_tests',
Run this way, I can't reproduce the issue.

I tried to watch dmesg when this happens, but there is no information there; the system simply reboots automatically and silently.

This is what you can see from syslog after reboot:
Mar 12 04:27:39 modoc kernel: [ 536.668305] Injecting error (-12) to MEM_GOING_OFFLINE
Mar 12 04:27:39 modoc kernel: [ 536.684547] Injecting error (-12) to MEM_GOING_OFFLINE
Mar 12 04:27:39 modoc kernel: [ 536.700907] Injecting error (-12) to MEM_GOING_OFFLINE
Mar 12 04:27:39 modoc kernel: [ 536.717246] Injecting error (-12) to MEM_GOING_OFFLINE
Mar 12 04:27:39 modoc kernel: [ 536.719288] page:c00c000000c4f000 refcount:1 mapcount:0 mapping:c000000f8cfe0fd1 index:0x7611c3e
Mar 12 04:27:39 modoc kernel: [ 536.719289] anon
Mar 12 04:27:39 modoc kernel: [ 536.719291] flags: 0x3ffff800080024(uptodate|active|swapbacked)
Mar 12 04:27:39 modoc kernel: [ 536.719294] raw: 003ffff800080024 5deadbeef0000100 5deadbeef0000122 c000000f8cfe0fd1
Mar 12 04:27:39 modoc kernel: [ 536.719295] raw: 0000000007611c3e 0000000000000000 00000001ffffffff c000000fcfd1c000
Mar 12 04:27:39 modoc kernel: [ 536.719296] page dumped because: unmovable page
Mar 12 04:27:39 modoc kernel: [ 536.719296] page->mem_cgroup:c000000fcfd1c000
Mar 12 04:27:39 modoc kernel: [ 536.735465] Injecting error (-12) to MEM_GOING_OFFLINE
Mar 12 04:27:39 modoc kernel: [ 536.751848] Injecting error (-12) to MEM_GOING_OFFLINE
Mar 12 04:27:39 modoc kernel: [ 536.768210] Injecting error (-12) to MEM_GOING_OFFLINE
Mar 12 04:27:39 modoc kernel: [ 536.784450] Injecting error (-12) to MEM_GOING_OFFLINE
Mar 12 04:27:39 modoc kernel: [ 536.800756] Injecting error (-12) to MEM_GOING_OFFLINE
Mar 12 04:27:39 modoc kernel: [ 536.817006] Injecting error (-12) to MEM_GOING_OFFLINE
Mar 12 04:27:39 modoc kernel: [ 536.833133] Injecting error (-12) to MEM_GOING_OFFLINE
Mar 12 04:27:39 modoc kernel: [ 536.849205] Injecting error (-12) to MEM_GOING_OFFLINE
Mar 12 04:27:39 modoc kernel: [ 536.865448] Injecting error (-12) to MEM_GOING_OFFLINE
^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@^@Mar 12 04:35:41 modoc systemd[1]: Starting Flush Journal to Persistent Storage...
Mar 12 04:35:41 modoc kernel: [ 0.000000] hash-mmu: Page sizes from device-tree:
Mar 12 04:35:41 modoc kernel: [ 0.000000] hash-mmu: base_shift=12: shift=12, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=0
Mar 12 04:35:41 modoc kernel: [ 0.000000] hash-mmu: base_shift=12: shift=16, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=7
Mar 12 04:35:41 modoc kernel: [ 0.000000] hash-mmu: base_shift=12: shift=24, sllp=0x0000, avpnm=0x00000000, tlbiel=1, penc=56
Mar 12 04:35:41 modoc systemd[1]: Started udev Kernel Device Manager.

From the log above, the line of "^@" characters (NUL bytes) marks the reboot. It looks like the memory-hotplug test was running at that point.
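
A run of "^@" is how pagers and editors display NUL bytes, so the same check can be scripted. The following is a minimal Python sketch for locating such runs in a syslog file and showing what was logged just before and after them. The function name `find_nul_gaps` and the minimum run length of 8 bytes are assumptions chosen for illustration.

```python
import re

# A run of at least 8 NUL bytes; the threshold is an arbitrary assumption.
NUL_RUN = re.compile(rb"\x00{8,}")


def find_nul_gaps(data, context=2):
    """Return a list of (lines_before, first_line_after) for each NUL-byte run.

    data is the raw syslog content as bytes; context is how many preceding
    lines to keep for each run.
    """
    gaps = []
    for match in NUL_RUN.finditer(data):
        before = data[:match.start()].decode("utf-8", "replace").splitlines()
        after = data[match.end():].decode("utf-8", "replace").splitlines()
        gaps.append((before[-context:], after[0] if after else ""))
    return gaps
```

For example, `find_nul_gaps(open("/var/log/syslog", "rb").read())` would list the last messages written before each such gap, assuming the log lives at that path.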

Maybe we need to use IPMI to see if there is anything on the console.

ProblemType: Bug
DistroRelease: Ubuntu 19.10
Package: linux-image-5.3.0-42-generic 5.3.0-42.34
ProcVersionSignature: Ubuntu 5.3.0-42.34-generic 5.3.18
Uname: Linux 5.3.0-42-generic ppc64le
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Mar 12 04:33 seq
 crw-rw---- 1 root audio 116, 33 Mar 12 04:33 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
ApportVersion: 2.20.11-0ubuntu8.5
Architecture: ppc64el
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
Date: Thu Mar 12 09:42:24 2020
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
Lsusb:
 Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
PciMultimedia:

ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 LANG=C.UTF-8
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: root=UUID=b2a867ce-7813-4785-8861-4e7de2ac39b4 ro console=hvc0
ProcLoadAvg: 0.07 0.02 0.00 1/1461 86637
ProcLocks:
 1: POSIX ADVISORY WRITE 3799 00:18:841 0 EOF
 2: POSIX ADVISORY WRITE 3526 00:18:743 0 EOF
 3: FLOCK ADVISORY WRITE 3720 00:18:837 0 EOF
ProcSwaps:
 Filename Type Size Used Priority
 /swap.img file 8388544 0 -2
ProcVersion: Linux version 5.3.0-42-generic (buildd@bos02-ppc64el-006) (gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2)) #34-Ubuntu SMP Fri Feb 28 05:49:17 UTC 2020
RelatedPackageVersions:
 linux-restricted-modules-5.3.0-42-generic N/A
 linux-backports-modules-5.3.0-42-generic N/A
 linux-firmware 1.183.4
RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
VarLogDump_list: total 0
cpu_cores: Number of cores present = 20
cpu_coreson: Number of cores online = 20
cpu_dscr: DSCR is 0
cpu_freq:
 min: 3.694 GHz (cpu 159)
 max: 3.695 GHz (cpu 1)
 avg: 3.694 GHz
cpu_runmode:
 Could not retrieve current diagnostics mode,
 No kernel interface to firmware
cpu_smt: SMT=8

Po-Hsu Lin (cypressyew) wrote :
tags: added: sru-20200217
summary: - P8 node modoc will reboot when running the sru_misc
+ P8 node modoc will reboot automatically when running the sru_misc test
+ suite
description: updated
Po-Hsu Lin (cypressyew) on 2020-03-12
description: updated

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Po-Hsu Lin (cypressyew) on 2020-03-12
tags: added: kqa-blocker
removed: sru-20200217
Po-Hsu Lin (cypressyew) on 2020-03-12
description: updated
Po-Hsu Lin (cypressyew) wrote :

Please find the syslog attached.

Po-Hsu Lin (cypressyew) wrote :

I can see this message on a freshly deployed modoc:
[ 396.856011] ipr 0001:08:00.0: 9076: Configuration error, missing remote IOA
[ 396.856042] ipr 0001:08:00.0: Attached Adapter not discovered within allotted time [PRC: 17101541]
[ 396.856051] ipr 0001:08:00.0: Remote IOA VPID/SN: 00000000
[ 396.856058] ipr 0001:08:00.0: Remote IOA WWN: 0000000000000000
[ 396.856065] ipr: 00000000: 01000000 00000100 FC220000 08400000
[ 396.856072] ipr: 00000010: 4C494344 49424D20 20202020 35374438
[ 396.856078] ipr: 00000020: 30303153 4953494F 41202020 00000000
[ 396.856084] ipr: 00000030: 30303431 57303132 50050760 5EC33900
[ 396.856091] ipr: 00000040: 00000001 00000000 02020101 000000FF

This was with 5.3.0-40.

Po-Hsu Lin (cypressyew) wrote :

I managed to catch this issue with IPMI console:

[ 673.975988] BUG: Unable to handle kernel instruction fetch
[ 673.976017] Faulting instruction address: 0x7fe000087fe00008
[ 673.976025] Oops: Kernel access of bad area, sig: 11 [#1]
[ 673.976032] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
[ 673.976040] Modules linked in: binfmt_misc dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua leds_powernv ipmi_powernv ipmi_devintf ipmi_msghandler ibmpowernv uio_pdrv_genirq vmx_crypto uio powernv_rng powernv_op_panel sch_fq_codel ip_tables x_tables autofs4 ses enclosure scsi_transport_sas btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_vpmsum crc32c_vpmsum tg3 ipr [last unloaded: notifier_error_inject]
[ 673.976101] CPU: 0 PID: 76667 Comm: reuseport_bpf_c Not tainted 5.3.0-42-generic #34-Ubuntu
[ 673.976109] NIP: 7fe000087fe00008 LR: c000000000ca9d10 CTR: 7fe000087fe00008
[ 673.976117] REGS: c000000ffffcf580 TRAP: 0480 Not tainted (5.3.0-42-generic)
[ 673.976124] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24002488 XER: 20000000
[ 673.976135] CFAR: c000000000ca9d0c IRQMASK: 0
[ 673.976135] GPR00: c000000000e4094c c000000ffffcf810 c0000000019c9000 c000000a2208b6e0
[ 673.976135] GPR04: c008000007c70038 c000000a2208b6e0 0000000000000028 000000012e2e5506
[ 673.976135] GPR08: 00000000a2be9417 0000000000000000 0000000000000000 0000000000000000
[ 673.976135] GPR12: 7fe000087fe00008 c000000001d60000 0000000000000003 0000000000000101
[ 673.976135] GPR16: c000000003fe00e8 00000000000022b8 0000000000002788 c0000000018fb600
[ 673.976135] GPR20: 000000000000000a 0000000000000000 00000000000022b8 0000000000000001
[ 673.976135] GPR24: 0000000000000000 0000000000000028 00000000000000a0 c000000003fe00e8
[ 673.976135] GPR28: c008000007c70000 00000000147b1819 c000000a2208b6e0 c000001e2d8b0000
[ 673.976195] NIP [7fe000087fe00008] 0x7fe000087fe00008
[ 673.976206] LR [c000000000ca9d10] reuseport_select_sock+0x100/0x400
[ 673.976212] Call Trace:
[ 673.976216] [c000000ffffcf810] [c000000fe9daf0f0] 0xc000000fe9daf0f0 (unreliable)
[ 673.976226] [c000000ffffcf8b0] [c000000000e4094c] inet6_lhash2_lookup+0x1ec/0x220
[ 673.976234] [c000000ffffcf930] [c000000000e40c80] inet6_lookup_listener+0x300/0x3f0
[ 673.976244] [c000000ffffcf9d0] [c000000000e1ec38] tcp_v6_rcv+0x7d8/0xe10
[ 673.976252] [c000000ffffcfb00] [c000000000dd7e70] ip6_protocol_deliver_rcu+0x110/0x6e0
[ 673.976261] [c000000ffffcfb80] [c000000000dd846c] ip6_input_finish+0x2c/0x40
[ 673.976269] [c000000ffffcfba0] [c000000000dd8560] ip6_input+0xe0/0xf0
[ 673.976276] [c000000ffffcfc10] [c000000000dd7008] ip6_rcv_finish+0xa8/0xe0
[ 673.976283] [c000000ffffcfc40] [c000000000dd7b90] ipv6_rcv+0x100/0x110
[ 673.976291] [c000000ffffcfcc0] [c000000000c6d040] __netif_receive_skb_one_core+0x70/0xb0
[ 673.976299] [c000000ffffcfd00] [c000000000c6ee70] process_backlog+0xd0/0x230
[ 673.976307] [c000000ffffcfd70] [c000000000c6db88] net_rx_action+0x1e8/0x510
[ 673.976315] [c000000ffffcfe90] [c000000000e9c8a0] __do_softirq+0x160/0x3f4
[ 673.976324] [c000000ffffcff...


Sean Feole (sfeole) on 2020-03-12
Changed in ubuntu-kernel-tests:
status: New → Triaged
Po-Hsu Lin (cypressyew) wrote :

I can reproduce this with 5.3.0-40-generic, so removing the kqa-blocker tag.

Po-Hsu Lin (cypressyew) on 2020-03-13
tags: removed: kqa-blocker