ubuntu_kernel_selftest will be interrupted with the reuseport_bpf_cpu / reuseport_bpf_numa test in net (BUG: Unable to handle kernel instruction fetch (NULL pointer?))

Bug #1909286 reported by Po-Hsu Lin
This bug affects 2 people
Affects                            Status     Importance  Assigned to  Milestone
The Ubuntu-power-systems project   Confirmed  Undecided   Unassigned
ubuntu-kernel-tests                Confirmed  Undecided   Unassigned
linux (Ubuntu)                     Confirmed  Undecided   Unassigned
  Focal                            Confirmed  Undecided   Unassigned
  Hirsute                          Confirmed  Undecided   Unassigned

Bug Description

Issue found with 5.4.0-59.65~18.04.1 on P8 nodes entei and dryden.

It looks like ubuntu_kernel_selftests gets interrupted while running reuseport_bpf_cpu / reuseport_bpf_numa in the net category.

The Jenkins test log shows that the test didn't finish.

For node entei:
07:37:44 DEBUG| [stdout] # selftests: net: reuseport_bpf_cpu
07:37:44 DEBUG| [stdout] # ---- IPv4 UDP ----
07:37:44 DEBUG| [stdout] # send cpu 0, receive socket 0
07:37:44 DEBUG| [stdout] # send cpu 1, receive socket 1
07:37:44 DEBUG| [stdout] # send cpu 2, receive socket 2
...
07:37:44 DEBUG| [stdout] # send cpu 123, receive socket 123
07:37:44 DEBUG| [stdout] # send cpu 125, receive socket 125
07:37:44 DEBUG| [stdout] # send cpu 127, receive socket 127
07:37:44 DEBUG| [stdout] # ---- IPv6 UDP ----
07:37:44 DEBUG| [stdout] # send cpu 0, receive socket 0
07:37:44 DEBUG| [stdout] # send cpu 1, receive socket 1
07:37:44 DEBUG| [stdout] # send cpu 2, receive socket 2
....
07:37:44 DEBUG| [stdout] # send cpu 123, receive socket 123
07:37:44 DEBUG| [stdout] # send cpu 125, receive socket 125
07:37:44 DEBUG| [stdout] # send cpu 127, receive socket 127
07:37:44 DEBUG| [stdout] # ---- IPv4 TCP ----
+ ARCHIVE=/var/lib/jenkins/jobs/sru-misc__B-hwe-5.4_ppc64el-generic__using_entei__for_kernel/builds/1/archive
+ scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o LogLevel=quiet -r ubuntu@entei:kernel-test-results /var/lib/jenkins/jobs/sru-misc__B-hwe-5.4_ppc64el-generic__using_entei__for_kernel/builds/1/archive

For node dryden:
08:22:09 DEBUG| [stdout] ok 2 selftests: net: reuseport_bpf_cpu
08:22:09 DEBUG| [stdout] # selftests: net: reuseport_bpf_numa
08:22:09 DEBUG| [stdout] # ---- IPv4 UDP ----
+ ARCHIVE=/var/lib/jenkins/jobs/sru-misc__B-hwe-5.4_ppc64el-generic__using_dryden__for_kernel/builds/2/archive
+ scp -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o LogLevel=quiet -r ubuntu@dryden:kernel-test-results /var/lib/jenkins/jobs/sru-misc__B-hwe-5.4_ppc64el-generic__using_dryden__for_kernel/builds/2/archive
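
One way to spot this kind of interruption in a saved log is to look for a test header that has no matching result line. The following is a minimal Python sketch, assuming the kselftest output format seen in the excerpts above (`# selftests: <collection>: <test>` when a test starts, `ok N selftests: ...` or `not ok N selftests: ...` when it ends). The helper name `unfinished_tests` is hypothetical and not part of any test tooling.

```python
import re

# Header printed before a test starts, e.g. "# selftests: net: reuseport_bpf_cpu"
START = re.compile(r"# selftests: (\S+): (\S+)\s*$")
# Result line printed when a test ends, e.g. "ok 2 selftests: net: reuseport_bpf_cpu"
RESULT = re.compile(r"\b(?:not )?ok \d+ selftests: (\S+): (\S+)")


def unfinished_tests(log_lines):
    """Return (collection, test) pairs that started but never reported a result."""
    pending = []
    for line in log_lines:
        result = RESULT.search(line)
        if result:
            key = (result.group(1), result.group(2))
            if key in pending:
                pending.remove(key)
            continue
        start = START.search(line)
        if start:
            pending.append((start.group(1), start.group(2)))
    return pending


dryden = [
    "08:22:09 DEBUG| [stdout] ok 2 selftests: net: reuseport_bpf_cpu",
    "08:22:09 DEBUG| [stdout] # selftests: net: reuseport_bpf_numa",
    "08:22:09 DEBUG| [stdout] # ---- IPv4 UDP ----",
]
print(unfinished_tests(dryden))
```

For the dryden excerpt this reports reuseport_bpf_numa as the test that was running when the log stopped.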

This issue does not exist in Focal 5.4

Tracing back through older Bionic 5.4 kernels:
* 5.4.0-58.64~18.04.1 - not tested
* 5.4.0-57.63~18.04.1 - this issue does not appear because the bpf test build blocks the build of the net test
* 5.4.0-56.62~18.04.1 - missing test result as node modoc is broken
* 5.4.0-52.57~18.04.1 - no such issue with node modoc

Po-Hsu Lin (cypressyew)
tags: added: 5.4 bionic kqa-blocker ppc64el sru-20201130 ubuntu-kernel-selftests
Po-Hsu Lin (cypressyew) wrote :

Tried this again on Jenkins; the issue can be reproduced.
This is similar to bug 1867155: the node reboots when it hits this issue.

Po-Hsu Lin (cypressyew) wrote :

More test results:
* This issue can be reproduced with 5.4.0-59.65~18.04.1 and 5.4.0-58.64~18.04.1, and even with 5.4.0-52.57~18.04.1 on node entei.
* This issue cannot be reproduced when you run these two tests directly with "sudo ./reuseport_bpf_cpu" and "sudo ./reuseport_bpf_numa"
* This issue cannot be reproduced when you run the whole net test suite with "sudo make run_tests TARGETS=net" in tools/testing/selftests/ of the source tree.
* This issue can be reproduced when you run the whole ubuntu_kernel_selftest suite with "AUTOTEST_PATH=/home/ubuntu/autotest sudo -E autotest/client/autotest-local --verbose autotest/client/tests/ubuntu_kernel_selftests/control", so it looks like one of the tests that run before the net test ('setup','breakpoints','cpu-hotplug','efivarfs','memfd','memory-hotplug','mount') causes the net test to crash.
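
To narrow down which earlier test is responsible, each suspect could be run on its own, followed by the net tests. The Python sketch below only generates the command pairs; it assumes that the `make run_tests TARGETS=...` form quoted above works for each test name and that a make-based run triggers the crash the same way the autotest run does, neither of which is confirmed here. 'setup' is left out on the assumption that it is a preparation step rather than a selftest target. The name `candidate_runs` is hypothetical.

```python
# Tests that run before 'net', as listed above ('setup' omitted, see the note).
EARLIER = ["breakpoints", "cpu-hotplug", "efivarfs", "memfd", "memory-hotplug", "mount"]


def candidate_runs(earlier=EARLIER, target="net"):
    """Yield one (suspect command, target command) pair per earlier test."""
    for name in earlier:
        yield (
            f"sudo make run_tests TARGETS={name}",
            f"sudo make run_tests TARGETS={target}",
        )


# Commands are meant to be run from tools/testing/selftests/ of the source tree.
for first, second in candidate_runs():
    print(f"{first}; {second}")
```

Since the node reboots when the issue hits, each pair would need a fresh boot before the next one is tried.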

And I managed to get the kernel panic log when this happens:

entei login: [ 3062.665272] BUG: Unable to handle kernel instruction fetch
[ 3062.665461] Faulting instruction address: 0x7fe000087fe00008
[ 3062.665507] Oops: Kernel access of bad area, sig: 11 [#1]
[ 3062.665566] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
[ 3062.665705] Modules linked in: sch_fq dccp_ipv6 dccp_ipv4 dccp ip6table_filter xt_conntrack nf_conntrack nf_defrag_ipv4 ip6_tables nf_defrag_ipv6 ip_vti ip6_vti fou6 sit ipip tunnel4 geneve act_mirred cls_basic esp6 authenc echainiv xt_policy iptable_filter bpfilter veth esp4_offload esp4 xfrm_user xfrm_algo macsec fou vxlan ip6_udp_tunnel udp_tunnel vrf 8021q garp mrp bridge stp llc ip6_gre ip6_tunnel tunnel6 ip_gre ip_tunnel gre cls_u32 sch_htb dummy tls binfmt_misc input_leds joydev mac_hid ofpart cmdlinepart powernv_flash at24 ipmi_powernv ipmi_devintf mtd ipmi_msghandler uio_pdrv_genirq uio opal_prd powernv_rng vmx_crypto sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid ast drm_vram_helper i2c_algo_bit ttm drm_kms_helper crct10dif_vpmsum
[ 3062.665793] crc32c_vpmsum syscopyarea sysfillrect sysimgblt fb_sys_fops drm hid ahci libahci drm_panel_orientation_quirks tg3 [last unloaded: notifier_error_inject]
[ 3062.666150] CPU: 0 PID: 53591 Comm: reuseport_bpf_c Tainted: G W 5.4.0-58-generic #64~18.04.1-Ubuntu
[ 3062.666181] NIP: 7fe000087fe00008 LR: c000000000d20004 CTR: 7fe000087fe00008
[ 3062.666208] REGS: c0000007ff6f7580 TRAP: 0480 Tainted: G W (5.4.0-58-generic)
[ 3062.666235] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 24002228 XER: 20000000
[ 3062.666265] CFAR: c000000000d20000 IRQMASK: 0
[ 3062.666265] GPR00: c000000000da196c c0000007ff6f7810 c000000001abf800 c000000662e546e0
[ 3062.666265] GPR04: c008000003cd0038 c000000662e546e0 0000000000000028 000000006a17a667
[ 3062.666265] GPR08: fffffffe031e7987 7fe000087fe00008 c0000007666a5158 0000000000000000
[ 3062.666265] GPR12: 7fe000087fe00008 c0000007ff7fee00 0000000000000000 00000000000022b8
[ 3062.666265] GPR16: 000000000100007f 00000000...
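
The faulting address in this log (also present in NIP, CTR, GPR09 and GPR12) has a visible pattern. The short Python sketch below splits it into two 32-bit words to make that explicit; `split_words` is a hypothetical helper. As far as I know, 0x7fe00008 is the encoding of the PowerPC `trap` instruction, which would suggest the kernel jumped to a value made of instruction words rather than to a valid function pointer. This is an observation only, not a diagnosis confirmed in this report.

```python
def split_words(addr_hex):
    """Split a 64-bit hex address into its (high, low) 32-bit words."""
    value = int(addr_hex, 16)
    return (value >> 32) & 0xFFFFFFFF, value & 0xFFFFFFFF


high, low = split_words("0x7fe000087fe00008")
# Both halves are the same 32-bit word.
print(f"high={high:#010x} low={low:#010x}")
```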

Po-Hsu Lin (cypressyew) wrote :

This is the error log for 5.4.0-52.57~18.04.1

[ 2861.690893] BUG: Unable to handle kernel instruction fetch (NULL pointer?)
[ 2861.690981] Faulting instruction address: 0x00000000
[ 2861.691113] Oops: Kernel access of bad area, sig: 11 [#1]
[ 2861.691129] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
[ 2861.691177] Modules linked in: binfmt_misc ofpart cmdlinepart joydev input_leds mac_hid powernv_flash mtd at24 uio_pdrv_genirq uio ipmi_powernv ipmi_devintf opal_prd ipmi_msghandler powernv_rng vmx_crypto sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor hid_generic usbhid hid raid6_pq libcrc32c raid1 raid0 multipath linear ast drm_vram_helper i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops crct10dif_vpmsum crc32c_vpmsum drm ahci libahci drm_panel_orientation_quirks tg3 [last unloaded: notifier_error_inject]
[ 2861.691385] CPU: 0 PID: 42482 Comm: reuseport_bpf_c Tainted: G W L 5.4.0-52-generic #57~18.04.1-Ubuntu
[ 2861.691407] NIP: 0000000000000000 LR: c000000000d16524 CTR: 0000000000000000
[ 2861.691426] REGS: c0000007ff6f7580 TRAP: 0400 Tainted: G W L (5.4.0-52-generic)
[ 2861.691446] MSR: 9000000040009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 44002848 XER: 20000000
[ 2861.691466] CFAR: c00000000000dec4 IRQMASK: 0
[ 2861.691466] GPR00: c000000000eb74fc c0000007ff6f7810 c000000001a98a00 c0000007443f2ce0
[ 2861.691466] GPR04: c008000002010038 c0000007443f2ce0 0006000000000000 0000000000000000
[ 2861.691466] GPR08: 0000000000000001 0000000000000000 0000000000000000 00000000b0e34d31
[ 2861.691466] GPR12: 0000000000000000 c0000007ff7fee00 0000000000000000 00000000000022b8
[ 2861.691466] GPR16: 000000000000a5b7 c000000757284938 c000000001d2c100 c0000000019c8d00
[ 2861.691466] GPR20: 000000000000000a 0000000000000000 00000000000022b8 0000000000000001
[ 2861.691466] GPR24: 0000000000000000 0000000000000080 c0000000019c8d00 0000000000000028
[ 2861.691466] GPR28: c008000002010000 c0000007443f2ce0 000000002763df38 c00000073d47b000
[ 2861.691701] NIP [0000000000000000] 0x0
[ 2861.691717] LR [c000000000d16524] reuseport_select_sock+0x104/0x420
[ 2861.691733] Call Trace:
[ 2861.691746] [c0000007ff6f7810] [c0000007ff6f7860] 0xc0000007ff6f7860 (unreliable)
[ 2861.691769] [c0000007ff6f78b0] [c000000000eb74fc] inet6_lhash2_lookup+0x1fc/0x210
[ 2861.691789] [c0000007ff6f7930] [c000000000eb7860] inet6_lookup_listener+0x350/0x420
[ 2861.691808] [c0000007ff6f79f0] [c000000000e955e0] tcp_v6_rcv+0x830/0xca0
[ 2861.691826] [c0000007ff6f7b20] [c000000000e4e440] ip6_protocol_deliver_rcu+0x150/0x630
[ 2861.691845] [c0000007ff6f7ba0] [c000000000e4e9b4] ip6_input+0x54/0x100
[ 2861.691862] [c0000007ff6f7c10] [c000000000e4dba4] ip6_rcv_finish+0xb4/0xe0
[ 2861.691879] [c0000007ff6f7c40] [c000000000e4e120] ipv6_rcv+0x110/0x120
[ 2861.691897] [c0000007ff6f7cc0] [c000000000cd6ca4] __netif_receive_skb_one_core+0x74/0xb0
[ 2861.691916] [c0000007ff6f7d10] [c000000000cd7098] process_backlog+0xc8/0x200
[ 2861.691935] [c0000007ff6f7d80...

summary: ubuntu_kernel_selftest will be interrupted with the reuseport_bpf_cpu /
- reuseport_bpf_numa test in net
+ reuseport_bpf_numa test in net (BUG: Unable to handle kernel instruction
+ fetch (NULL pointer?))
Po-Hsu Lin (cypressyew) wrote :

Since this issue can be reproduced with an older kernel, and we no longer have modoc to check whether it reproduces there, I think it's safe to say this is not a regression.

Krzysztof Kozlowski (krzk) wrote :

The last error log looks exactly like lp:1927076.

tags: added: 5.11 5.8 architecture-ppc64le bugnameltc-192677 focal groovy hirsute severity-high sru-20210412
Krzysztof Kozlowski (krzk) wrote :

See bug lp:1927076 for recent comments/logs/findings. Briefly summarising:
---
This issue can be reproduced by running the tests in the following order:
1. Run the cpu-hotplug test
   sudo ./autotest/client/tmp/ubuntu_kernel_selftests/src/linux/tools/testing/selftests/cpu-hotplug/cpu-on-off-test.sh
2. Run the reuseport_bpf_cpu test
   sudo ./autotest/client/tmp/ubuntu_kernel_selftests/src/linux/tools/testing/selftests/net/reuseport_bpf_cpu
---
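
The two steps above can be wrapped in a small script. This is a Python sketch under stated assumptions: the paths are the ones quoted in the summary and are taken to be relative to the directory that holds the autotest checkout, and `reproduce` is a hypothetical helper name. It defaults to a dry run that only prints the commands, because a real run is expected to crash the kernel and reboot the node.

```python
import subprocess

# Selftests directory as quoted above (assumed relative to the current directory).
SELFTESTS = "./autotest/client/tmp/ubuntu_kernel_selftests/src/linux/tools/testing/selftests"

STEPS = [
    ["sudo", f"{SELFTESTS}/cpu-hotplug/cpu-on-off-test.sh"],
    ["sudo", f"{SELFTESTS}/net/reuseport_bpf_cpu"],
]


def reproduce(dry_run=True):
    """Run the cpu-hotplug test, then reuseport_bpf_cpu; return the commands."""
    for cmd in STEPS:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=False: the second step may take the node down instead of
            # returning an exit status, so a non-zero status is not an error here.
            subprocess.run(cmd, check=False)
    return STEPS


reproduce()  # dry run: prints both commands without executing them
```

Call `reproduce(dry_run=False)` only on a test node with a console attached, so that the oops can be captured.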

Seen in logs:
---
[ 287.477797] Oops: Exception in kernel mode, sig: 4 [#1]
---
Or:
---
[ 417.696448] BUG: Unable to handle kernel instruction fetch (NULL pointer?)
[ 417.696522] Faulting instruction address: 0x00000000
[ 417.696677] Oops: Kernel access of bad area, sig: 11 [#1]
---

This issue can be reproduced on P8 nodes (entei unless another node is named) with:
  * F-5.4 (5.4.0-81-generic)
  * F-5.11 (5.11.0-27-generic #29~20.04.1-Ubuntu)
  * focal-hwe (Linux thiel 5.11.0-27-generic)
  * focal-hwe (5.11.0-34-generic) on thiel
  * H-5.11 (5.11.0-31-generic)
  * Hirsute (5.11.0-31-generic)
  * `gulpin` (8335-GTA) with Hirsute (5.11.0-34-generic)
  * `entei`, also a POWER8 (8335-GTA), with the latest Hirsute kernel (5.11.0-34-generic)

It does not appear to be reproducible on other hardware:
 * Power8 LPAR: P8LPAR05 MAAS
 * Power9: QEMU with 4 or 128 CPUs (and 4 GB of RAM)

Ubuntu Kernel Bot (ubuntu-kernel-bot) wrote : Missing required logs.

This bug is missing log files that will aid in diagnosing the problem. While running an Ubuntu kernel (not a mainline or third-party kernel) please enter the following command in a terminal window:

apport-collect 1909286

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Changed in linux (Ubuntu Focal):
status: New → Incomplete
Changed in linux (Ubuntu Hirsute):
status: New → Incomplete
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in ubuntu-kernel-tests:
status: New → Confirmed
Changed in linux (Ubuntu Focal):
status: Incomplete → Confirmed
Changed in linux (Ubuntu Hirsute):
status: Incomplete → Confirmed
Krzysztof Kozlowski (krzk) wrote :

A system hang (no logs collected, as there was no console) was also seen on node huggins (POWER8NVL, 8335-GTB) with Hirsute 5.11.0-34-generic.

Krzysztof Kozlowski (krzk) wrote :

Reproduced on mainline kernel v5.11.0-20-generic (v5.11.22) on node huggins (POWER8NVL, 8335-GTB) with Hirsute. Logs:
---
[ 167.500544] Faulting instruction address: 0x00000000
[ 167.500562] Oops: Kernel access of bad area, sig: 11 [#1]
[ 167.500578] LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV
...
[ 167.501361] NIP [0000000000000000] 0x0
[ 167.501376] LR [c000000000f2be4c] __bpf_prog_run_save_cb+0x5c/0x190
[ 167.501400] Call Trace:
[ 167.501410] [c000007fff4b38d0] [c00000406dbb4c80] 0xc00000406dbb4c80 (unreliable)
[ 167.501435] [c000007fff4b3940] [c000000000f2c02c] run_bpf_filter+0xac/0x200
[ 167.501457] [c000007fff4b39a0] [c000000000f2c204] reuseport_select_sock+0x84/0x170
[ 167.501648] [c000007fff4b39e0] [c000000001021640] udp4_lib_lookup2+0x200/0x2a0
[ 167.501673] [c000007fff4b3a60] [c000000001022928] __udp4_lib_lookup+0x2f8/0x340
[ 167.501696] [c000007fff4b3b40] [c00000000102873c] __udp4_lib_rcv+0x3ac/0x740
[ 167.501720] [c000007fff4b3c10] [c000000000fce750] ip_protocol_deliver_rcu+0x60/0x2c0
[ 167.501781] [c000007fff4b3c60] [c000000000fcea20] ip_local_deliver_finish+0x70/0x90
[ 167.501909] [c000007fff4b3c80] [c000000000fcddb0] ip_rcv_finish+0xc0/0xf0
[ 167.501931] [c000007fff4b3cc0] [c000000000eebab4] __netif_receive_skb_one_core+0x74/0xb0
[ 167.501956] [c000007fff4b3d10] [c000000000eebf28] process_backlog+0x138/0x280
[ 167.501979] [c000007fff4b3d80] [c000000000eecfc0] napi_poll+0x100/0x3c0
[ 167.502038] [c000007fff4b3e10] [c000000000eed374] net_rx_action+0xf4/0x2d0
[ 167.502175] [c000007fff4b3ea0] [c000000001175e90] __do_softirq+0x150/0x428
[ 167.502198] [c000007fff4b3f90] [c00000000002caec] call_do_softirq+0x14/0x24
[ 167.502220] [c00000003c343840] [c000000000017448] do_softirq_own_stack+0x38/0x50
[ 167.502245] [c00000003c343860] [c000000000165310] do_softirq+0xa0/0xb0
[ 167.502436] [c00000003c343890] [c000000000165418] __local_bh_enable_ip+0xf8/0x120
[ 167.502461] [c00000003c3438b0] [c000000000fd30e8] ip_finish_output2+0x1f8/0x620
[ 167.502486] [c00000003c343950] [c000000000fd77ec] ip_send_skb+0x6c/0x120
[ 167.502508] [c00000003c3439a0] [c000000001026658] udp_send_skb+0x168/0x450
[ 167.502530] [c00000003c3439f0] [c00000000102736c] udp_sendmsg+0x97c/0xd40
[ 167.502707] [c00000003c343bd0] [c00000000103a184] inet_sendmsg+0x64/0xb0
[ 167.502730] [c00000003c343c10] [c000000000ea9930] sock_sendmsg+0x80/0xb0
[ 167.502751] [c00000003c343c40] [c000000000eaeb18] __sys_sendto+0xf8/0x1a0
[ 167.502771] [c00000003c343d90] [c000000000eaec30] sys_send+0x30/0x40
[ 167.502793] [c00000003c343db0] [c000000000036fc4] system_call_exception+0xf4/0x200
[ 167.502982] [c00000003c343e10] [c00000000000d860] system_call_common+0xf0/0x27c
---

As in all other cases, reuseport_select_sock() appears in the trace.
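
Checking a saved oops for a given function can be automated. The Python sketch below pulls the symbol names out of the `LR [...]` line and the call-trace frames, assuming the powerpc oops layout seen in the logs in this report (a 16-digit hex address in brackets followed by `symbol+0xoffset/0xsize`). `oops_symbols` is a hypothetical helper name.

```python
import re

# Matches "[c000000000f2c204] reuseport_select_sock+0x84/0x170" and captures the
# symbol name. Unsymbolised frames such as "0xc00000406dbb4c80 (unreliable)" and
# the "[ 167.501457]" timestamps do not match.
SYMBOL = re.compile(r"\[[0-9a-f]{16}\] ([A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+")


def oops_symbols(log_text):
    """Return the symbolised function names from an oops, in order of appearance."""
    return SYMBOL.findall(log_text)


sample = """\
[ 167.501376] LR [c000000000f2be4c] __bpf_prog_run_save_cb+0x5c/0x190
[ 167.501410] [c000007fff4b38d0] [c00000406dbb4c80] 0xc00000406dbb4c80 (unreliable)
[ 167.501435] [c000007fff4b3940] [c000000000f2c02c] run_bpf_filter+0xac/0x200
[ 167.501457] [c000007fff4b39a0] [c000000000f2c204] reuseport_select_sock+0x84/0x170
"""
print("reuseport_select_sock" in oops_symbols(sample))
```

Including the LR line matters for the 5.4.0-52 log, where reuseport_select_sock shows up as the link register rather than as a call-trace frame.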

Krzysztof Kozlowski (krzk) wrote :

Also reproduced on huggins (POWER8NVL, 8335-GTB):
 * 5.13.17-051317-generic mainline fails even on the first run of the reuseport_bpf_cpu test.
 * 5.14.4-051404-generic mainline fails after 4 tries of the test.

Changed in ubuntu-power-systems:
status: New → Confirmed
Po-Hsu Lin (cypressyew) wrote :

I would like to mark this one as a duplicate of bug 1927076 instead, as Thadeu has provided further test results there. Let's track this issue in that bug.
Thanks!
