Kernel oops and NULL pointer dereference while running ubuntu_kernel_selftests on Eoan Power8

Bug #1869032 reported by Po-Hsu Lin on 2020-03-25
This bug affects 1 person
Affects               Importance   Assigned to
ubuntu-kernel-tests   Undecided    Unassigned
linux (Ubuntu)        Undecided    Unassigned

Bug Description

Issue found on P8 node modoc with Eoan (5.3.0-43.36)
(Note that this test has passed with P9 node baltar without any traces in syslog)

The ubuntu_kernel_selftests hangs:
15:16:53 DEBUG| [stdout] ok 4 selftests: net: reuseport_dualstack
15:16:53 DEBUG| [stdout] # selftests: net: reuseaddr_conflict
15:16:53 DEBUG| [stdout] # Opening 127.0.0.1:9999
15:16:53 DEBUG| [stdout] # Opening INADDR_ANY:9999
15:16:53 DEBUG| [stdout] # bind: Address already in use
15:16:53 DEBUG| [stdout] # Opening in6addr_any:9999
15:16:53 DEBUG| [stdout] # Opening INADDR_ANY:9999
15:16:53 DEBUG| [stdout] # bind: Address already in use
15:16:53 DEBUG| [stdout] # Opening INADDR_ANY:9999 after closing ipv6 socket
15:16:53 DEBUG| [stdout] # bind: Address already in use
15:16:53 DEBUG| [stdout] # Successok 5 selftests: net: reuseaddr_conflict
15:16:53 DEBUG| [stdout] # selftests: net: tls
15:16:56 DEBUG| [stdout] # tls.c:967:tls.mutliproc_sendpage_even:Expected status (4) == 0 (0)
15:17:26 DEBUG| [stdout] # Alarm clock
15:45:20 INFO | Timer expired (1800 sec.), nuking pid 33351
(And test continues)

It looks like it's the selftests: net: tls that's causing this issue.
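The attribution above can be checked mechanically against a saved copy of the autotest log. The following is only a sketch: it assumes the log format shown in the excerpt ("# selftests: <suite>: <test>" headers and a "Timer expired" line), and the function name last_test_before_timeout is hypothetical.

```shell
# Sketch: report which selftest was still running when the autotest timer fired.
# Usage: last_test_before_timeout LOGFILE
last_test_before_timeout() {
    awk '
        # Remember the most recent test header, e.g. "net: tls".
        /# selftests: / { sub(/.*# selftests: /, ""); current = $0 }
        # When the harness gives up, print the test that was running and stop.
        /Timer expired/ { print current; exit }
    ' "$1"
}
```

Run against the excerpt above, this should print "net: tls", which matches the manual reading of the log.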

If you ssh to the node, the following traces can be found in dmesg:
 Injecting error (-12) to MEM_GOING_OFFLINE
 Injecting error (-12) to MEM_GOING_OFFLINE
 Injecting error (-12) to MEM_GOING_OFFLINE
 Oops: Exception in kernel mode, sig: 4 [#1]
 LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
 Modules linked in: tls binfmt_misc dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ipmi_powernv uio_pdrv_genirq ipmi_devintf ipmi_msghandler uio powernv_rng ibmpowernv vmx_crypto leds_powernv powernv_op_panel sch_fq_codel ip_tables x_tables autofs4 ses enclosure scsi_transport_sas btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_vpmsum crc32c_vpmsum tg3 ipr [last unloaded: notifier_error_inject]
 CPU: 18 PID: 36045 Comm: tls Not tainted 5.3.0-43-generic #36-Ubuntu
 NIP: c00800000a4d6a40 LR: c00800000a4d6a40 CTR: c000000000179270
 REGS: c000000fc97837a0 TRAP: 0e40 Not tainted (5.3.0-43-generic)
 MSR: 900000000288b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 28002862 XER: 20000000
 CFAR: c00000000000dfc4 IRQMASK: 0
 GPR00: c00800000a4d6a40 c000000fc9783a30 c0000000019d9000 0000000000000000
 GPR04: c000000f5afc0000 0000000000000000 c000000fc97839b8 0000000000000000
 GPR08: c000000f5afc0000 0000000000000000 0000000000000000 c000000ff9a13780
 GPR12: 0000000088002462 c000000ffffeb380 0000000000000000 0000000000000000
 GPR16: 0000000000000000 00000ac875961368 00000ac875960d38 00000ac875960d90
 GPR20: 00007ffffc6b4ef0 c000000e3996dc48 0000000000000000 0000000000000000
 GPR24: c0000000004a9d70 0000000000000000 000000000000ea60 0000000000000000
 GPR28: c00c00000398c740 c00000000ab32e70 0000000000000000 c000000f7ecebd00
 NIP [c00800000a4d6a40] tls_sw_sendpage+0x68/0x100 [tls]
 LR [c00800000a4d6a40] tls_sw_sendpage+0x68/0x100 [tls]
 Call Trace:
 [c000000fc9783a30] [c00800000a4d6a40] tls_sw_sendpage+0x68/0x100 [tls] (unreliable)
 [c000000fc9783a80] [c000000000d7698c] inet_sendpage+0x8c/0x140
 [c000000fc9783ad0] [c000000000c33fd8] kernel_sendpage+0x38/0x70
 [c000000fc9783af0] [c000000000c34044] sock_sendpage+0x34/0x50
 [c000000fc9783b10] [c0000000004a9e8c] pipe_to_sendpage+0x7c/0xf0
 [c000000fc9783b40] [c0000000004ab3d4] __splice_from_pipe+0x164/0x280
 [c000000fc9783ba0] [c0000000004ad994] splice_from_pipe+0x74/0xc0
 [c000000fc9783c20] [c0000000004a9dbc] direct_splice_actor+0x4c/0xa0
 [c000000fc9783c40] [c0000000004aad14] splice_direct_to_actor+0x2a4/0x3c0
 [c000000fc9783cc0] [c0000000004aaee4] do_splice_direct+0xb4/0x130
 [c000000fc9783d30] [c0000000004565c4] do_sendfile+0x234/0x4a0
 [c000000fc9783dd0] [c000000000456b70] sys_sendfile64+0x160/0x170
 [c000000fc9783e20] [c00000000000b388] system_call+0x5c/0x70
 Instruction dump:
 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
 00000000 00000000 00000000 00000000 <00000000> 00000000 00000000 00000000
 ---[ end trace 8961ea39a6f2dd08 ]---

 BUG: Kernel NULL pointer dereference at 0x00000000
 Faulting instruction address: 0xc00000000020bd74
 Oops: Kernel access of bad area, sig: 11 [#2]
 LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
 Modules linked in: tls binfmt_misc dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua ipmi_powernv uio_pdrv_genirq ipmi_devintf ipmi_msghandler uio powernv_rng ibmpowernv vmx_crypto leds_powernv powernv_op_panel sch_fq_codel ip_tables x_tables autofs4 ses enclosure scsi_transport_sas btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_vpmsum crc32c_vpmsum tg3 ipr [last unloaded: notifier_error_inject]
 CPU: 10 PID: 38095 Comm: modprobe Tainted: G D 5.3.0-43-generic #36-Ubuntu
 NIP: c00000000020bd74 LR: c0000000007a24f4 CTR: c00000000020bd40
 REGS: c000000fb0ad3580 TRAP: 0300 Tainted: G D (5.3.0-43-generic)
 MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 88222482 XER: 00000000
 CFAR: c00000000000dfc4 DAR: 0000000000000000 DSISR: 40000000 IRQMASK: 0
 GPR00: c0000000007a24f4 c000000fb0ad3810 c0000000019d9000 c00800000c3b9a39
 GPR04: 0000000000000000 0000000000000003 0000000000000010 5f5f636667383032
 GPR08: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
 GPR12: c00000000020bd40 c000000fffff4780 c000000fb0ad3d70 0000000000000001
 GPR16: 000001fde4b4eeb0 0000000000000000 000001fde4b2cfb8 0000000000000000
 GPR20: 0000000000000000 c00800000c217370 c000000fb0ad3c30 c000000fb0ad3aa0
 GPR24: 0000000000000002 c00800000c3b9a39 c00800000a4de188 0000000000000010
 GPR28: c00000000020bd40 0000000000000001 c00800000a4de198 0000000000000001
 NIP [c00000000020bd74] cmp_name+0x34/0x190
 LR [c0000000007a24f4] bsearch+0x84/0x110
 Call Trace:
 [c000000fb0ad3810] [c000000fb0ad3d70] 0xc000000fb0ad3d70 (unreliable)
 [c000000fb0ad3830] [c00000000004ecc8] apply_relocate_add+0x698/0xda0
 [c000000fb0ad3890] [c00000000020c418] find_exported_symbol_in_section+0x58/0x170
 [c000000fb0ad3920] [c00000000020e308] each_symbol_section.part.0+0x188/0x270
 [c000000fb0ad3a40] [c00000000020e64c] find_symbol+0x5c/0x100
 [c000000fb0ad3af0] [c000000000214218] load_module+0x1408/0x1a20
 [c000000fb0ad3d00] [c000000000214b38] __do_sys_finit_module+0xc8/0x150
 [c000000fb0ad3e20] [c00000000000b388] system_call+0x5c/0x70
 Instruction dump:
 3842d2c0 7c0802a6 60000000 f821ffe1 78690520 e8840008 2c290fc0 40800140
 78890520 2c290fc0 40800134 7ce01c28 <7d002428> 39200000 7cea43f8 7ce94bf8
 ---[ end trace 8961ea39a6f2dd09 ]---

Please see the attachment for the complete syslog.
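When going through a long syslog, it can help to reduce each oops to a single line. The sketch below pairs every "Oops:" header with the symbol on the following "NIP [...]" line; it assumes the trace layout shown above, and the helper name oops_summary is hypothetical.

```shell
# Sketch: summarise kernel oopses found in a saved dmesg or syslog file.
# Usage: oops_summary LOGFILE
oops_summary() {
    awk '
        # Keep the text after "Oops: ", e.g. "Exception in kernel mode, sig: 4 [#1]".
        /Oops: / { oops = $0; sub(/.*Oops: /, "", oops) }
        # The symbolised line looks like "NIP [addr] symbol+0x68/0x100 [module]".
        # (The register line "NIP: addr LR: ..." does not match this pattern.)
        /NIP \[[0-9a-f]+\] / {
            nip = $0
            sub(/.*NIP \[[0-9a-f]+\] /, "", nip)
            if (oops != "") { print oops " -> " nip; oops = "" }
        }
    ' "$1"
}
```

For the two traces above this should yield one line for tls_sw_sendpage (sig: 4, in the tls test process) and one for cmp_name (sig: 11, in modprobe).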

ProblemType: Bug
DistroRelease: Ubuntu 19.10
Package: linux-image-5.3.0-43-generic 5.3.0-43.36
ProcVersionSignature: Ubuntu 5.3.0-43.36-generic 5.3.18
Uname: Linux 5.3.0-43-generic ppc64le
AlsaDevices:
 total 0
 crw-rw---- 1 root audio 116, 1 Mar 25 15:09 seq
 crw-rw---- 1 root audio 116, 33 Mar 25 15:09 timer
AplayDevices: Error: [Errno 2] No such file or directory: 'aplay': 'aplay'
ApportVersion: 2.20.11-0ubuntu8.7
Architecture: ppc64el
ArecordDevices: Error: [Errno 2] No such file or directory: 'arecord': 'arecord'
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: command ['iw', 'reg', 'get'] failed with exit code 1: nl80211 not found.
Date: Wed Mar 25 15:40:27 2020
IwConfig: Error: [Errno 2] No such file or directory: 'iwconfig': 'iwconfig'
Lsusb:
 Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
 Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
PciMultimedia:

ProcEnviron:
 TERM=xterm-256color
 PATH=(custom, no user)
 LANG=C.UTF-8
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: root=UUID=b2a867ce-7813-4785-8861-4e7de2ac39b4 ro console=hvc0
ProcLoadAvg: 5.13 5.02 4.07 1/1493 37887
ProcLocks:
 1: POSIX ADVISORY WRITE 3475 00:18:752 0 EOF
 2: POSIX ADVISORY WRITE 3689 00:18:851 0 EOF
 3: FLOCK ADVISORY WRITE 3656 00:18:820 0 EOF
ProcSwaps:
 Filename Type Size Used Priority
 /swap.img file 8388544 0 -2
ProcVersion: Linux version 5.3.0-43-generic (buildd@bos02-ppc64el-012) (gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2)) #36-Ubuntu SMP Mon Mar 16 13:26:20 UTC 2020
RelatedPackageVersions:
 linux-restricted-modules-5.3.0-43-generic N/A
 linux-backports-modules-5.3.0-43-generic N/A
 linux-firmware 1.183.5
RfKill: Error: [Errno 2] No such file or directory: 'rfkill': 'rfkill'
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
VarLogDump_list: total 0
cpu_cores: Number of cores present = 20
cpu_coreson: Number of cores online = 20
cpu_dscr: DSCR is 0
cpu_freq:
 min: 3.694 GHz (cpu 159)
 max: 3.695 GHz (cpu 1)
 avg: 3.695 GHz
cpu_runmode:
 Could not retrieve current diagnostics mode,
 No kernel interface to firmware
cpu_smt: SMT=8

Po-Hsu Lin (cypressyew) wrote :
description: updated
Po-Hsu Lin (cypressyew) wrote :

This is what I saw on this Eoan P8 node modoc on the last cycle:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1867155

That issue prevented the test from finishing.

tags: added: 5.3 kqa-blocker ubuntu-kernel-selftests
tags: added: sru-20200316

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Po-Hsu Lin (cypressyew) wrote :

A quick search for the keyword "tls" in the Eoan tree turned up this commit:
commit 299dfeeb7a216fe4dcfdd6ad0461ea93db72d389
Author: Jakub Kicinski <email address hidden>
Date: Fri Jan 10 04:38:32 2020 -0800

    net/tls: fix async operation

    BugLink: https://bugs.launchpad.net/bugs/1864710

    commit db885e66d268884dc72967279b7e84f522556abc upstream.

    Mallesham reports the TLS with async accelerator was broken by
    commit d10523d0b3d7 ("net/tls: free the record on encryption error")
    because encryption can return -EINPROGRESS in such setups, which
    should not be treated as an error.

    The error is also present in the BPF path (likely copied from there).

    Reported-by: Mallesham Jatharakonda <email address hidden>
    Fixes: d3b18ad31f93 ("tls: add bpf support to sk_msg handling")
    Fixes: d10523d0b3d7 ("net/tls: free the record on encryption error")
    Signed-off-by: Jakub Kicinski <email address hidden>
    Reviewed-by: Simon Horman <email address hidden>
    Signed-off-by: David S. Miller <email address hidden>
    Signed-off-by: Greg Kroah-Hartman <email address hidden>
    Signed-off-by: Kamal Mostafa <email address hidden>
    Signed-off-by: Khalid Elmously <email address hidden>

$ git tag --contains 299dfeeb7a216fe4dcfdd6ad0461ea93db72d389
Ubuntu-5.3.0-43.35
Ubuntu-5.3.0-43.36

Need to check if this is the cause.

Po-Hsu Lin (cypressyew) wrote :

I can manually run the net suite of kselftest on modoc with 5.3.0-43-generic by running the following command in the Eoan tree:
sudo make run_tests TARGETS=net

The test can finish without tripping this issue.

Also, I can see a "[ 28.249600] ipr 0001:08:00.0: 8150: Permanent IOA failure" message in the boot dmesg. I'm not sure whether this indicates a hardware issue:

https://www.ibm.com/support/knowledgecenter/TI0003N/p8ebk/urc_tables.htm

Po-Hsu Lin (cypressyew) wrote :

I can run this test to completion manually with the autotest framework, locally on the affected node modoc with 5.3.0-43:
AUTOTEST_PATH=/home/ubuntu/autotest sudo -E autotest/client/autotest-local --verbose autotest/client/tests/ubuntu_kernel_selftests/control

And the test suite can finish without tripping this issue.

But during the test I noticed another error, "missing remote IOA", in dmesg:
[ 353.854103] test_bpf: Summary: 378 PASSED, 0 FAILED, [366/366 JIT'ed]
[ 353.854127] test_bpf: test_skb_segment: success in skb_segment!
[ 359.982427] u32 classifier
[ 359.982431] input device check on
[ 359.982432] Actions configured
[ 360.023690] gre: GRE over IPv4 demultiplexor driver
[ 360.027718] ip_gre: GRE over IPv4 tunneling driver
[ 360.231910] ip6_gre: GRE over IPv6 tunneling driver
[ 361.139317] bridge: filtering via arp/ip/ip6tables is no longer available by default. Update your scripts to load br_netfilter if you need this.
[ 361.141793] test-br0: port 1(test-dummy0) entered blocking state
[ 361.141796] test-br0: port 1(test-dummy0) entered disabled state
[ 361.141929] device test-dummy0 entered promiscuous mode
[ 361.143982] test-br0: port 1(test-dummy0) entered blocking state
[ 361.143984] test-br0: port 1(test-dummy0) entered forwarding state
[ 361.166931] 8021q: 802.1Q VLAN Support v1.8
[ 361.276750] device test-dummy0 left promiscuous mode
[ 361.276826] test-br0: port 1(test-dummy0) entered disabled state
[ 363.318680] MACsec IEEE 802.1AE
[ 363.401018] Initializing XFRM netlink socket
[ 365.457161] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
[ 365.462267] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
[ 365.464936] IPv6: ADDRCONF(NETDEV_CHANGE): veth0: link becomes ready
[ 365.550548] bpfilter: Loaded bpfilter_umh pid 20653
[ 366.415447] IPv6: ADDRCONF(NETDEV_CHANGE): eth1: link becomes ready
[ 397.070039] ipr 0001:08:00.0: 9076: Configuration error, missing remote IOA
[ 397.070068] ipr 0001:08:00.0: Attached Adapter not discovered within allotted time [PRC: 17101541]
[ 397.070077] ipr 0001:08:00.0: Remote IOA VPID/SN: 00000000
[ 397.070084] ipr 0001:08:00.0: Remote IOA WWN: 0000000000000000

Maybe the issue only shows up in combination with other tests in the sru-misc test suite, which contains the following tests, executed in this order:
        'hwclock',
        'libhugetlbfs',
        'ubuntu_bpf_jit',
        'ubuntu_kernel_selftests',
        'ubuntu_lxc',
        'ubuntu_seccomp',
        'ubuntu_unionmount_ovlfs',
        'ubuntu_cts_kernel',
        'ubuntu_kvm_unit_tests',

Po-Hsu Lin (cypressyew) wrote :

The Jenkins job for sru-misc has completed successfully on node modoc with 5.3.0-43 without any hang. (I didn't have a chance to check the syslog, but since it did not hang I guess it's fine.)

I've also run the net/tls test from the selftests 100 times on modoc with 5.3.0-43; it passed without an oops:
  for i in $(seq 1 100); do sudo ./tls; done

Furthermore, I ran the whole net test suite 100 times; it also passed without an oops:
  for i in $(seq 1 100); do echo "====== cycle $i ======" | sudo tee /dev/kmsg; sudo make run_tests TARGETS=net; done
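The loops above rely on checking for an oops by hand afterwards. A possible variant, sketched here under assumptions, stops at the first cycle in which a new "Oops:" line appears in a kernel log file. The function name repeat_until_oops is hypothetical, and the log path used in the usage note (/var/log/kern.log) is an assumption about where kernel messages are written on the node.

```shell
# Sketch: run a command repeatedly and stop as soon as the kernel log gains an oops.
# Usage: repeat_until_oops COUNT LOGFILE CMD [ARGS...]
repeat_until_oops() {
    local count=$1 log=$2
    shift 2
    local before after i
    for i in $(seq 1 "$count"); do
        # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
        before=$(grep -c 'Oops: ' "$log" || true)
        "$@" || echo "cycle $i: command exited with status $?" >&2
        after=$(grep -c 'Oops: ' "$log" || true)
        if [ "$after" -gt "$before" ]; then
            echo "oops detected in cycle $i"
            return 1
        fi
    done
    echo "completed $count cycles without oops"
}
```

For example, `repeat_until_oops 100 /var/log/kern.log sudo ./tls` would mirror the first loop above, assuming the log file exists and is readable by the invoking user.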
