net:pmtu.sh of ubuntu_kselftests_net and subsequent subtests time out
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
ubuntu-kernel-tests |
New
|
Undecided
|
Unassigned |
Bug Description
Found on N-nvidia-64k with node hinyari. After net:pmtu.sh times out, these tests also timeout until "TEST SYSTEM FAILURE DETECTED" is eventually triggered: net:udpgso.sh, net:ip_defrag.sh, net:udpgso_
14:44:14 INFO | START ubuntu_
14:44:14 DEBUG| Persistent state client.
14:44:14 DEBUG| Persistent state client.
14:44:14 DEBUG| Waiting for pid 43936 for 2700 seconds
14:44:14 WARNI| System python is too old, crash handling disabled
14:44:14 DEBUG| Running 'make run_tests -C net TEST_PROGS=pmtu.sh TEST_GEN_PROGS='' TEST_CUSTOM_
14:44:14 DEBUG| [stdout] make: Entering directory '/home/
14:44:14 DEBUG| [stdout] TAP version 13
14:44:14 DEBUG| [stdout] 1..1
14:44:14 DEBUG| [stdout] # timeout set to 0
14:44:14 DEBUG| [stdout] # selftests: net: pmtu.sh
14:44:14 DEBUG| [stdout] # TEST: ipv4: PMTU exceptions [ OK ]
14:44:14 DEBUG| [stdout] # TEST: ipv4: PMTU exceptions - nexthop objects [ OK ]
14:44:14 DEBUG| [stdout] # TEST: ipv6: PMTU exceptions [ OK ]
14:44:14 DEBUG| [stdout] # TEST: ipv6: PMTU exceptions - nexthop objects [ OK ]
14:44:14 DEBUG| [stdout] # TEST: ICMPv4 with DSCP and ECN: PMTU exceptions [ OK ]
14:44:15 DEBUG| [stdout] # TEST: ICMPv4 with DSCP and ECN: PMTU exceptions - nexthop objects [ OK ]
14:44:15 DEBUG| [stdout] # 2024/06/26 14:44:15 socat[44926] E xioopen_
14:44:15 DEBUG| [stdout] # TEST: UDPv4 with DSCP and ECN: PMTU exceptions [ OK ]
14:44:15 DEBUG| [stdout] # ./pmtu.sh: line 937: kill: (44926) - No such process
14:44:15 DEBUG| [stdout] # 2024/06/26 14:44:15 socat[45066] E xioopen_
14:44:15 DEBUG| [stdout] # TEST: UDPv4 with DSCP and ECN: PMTU exceptions - nexthop objects [ OK ]
14:44:15 DEBUG| [stdout] # ./pmtu.sh: line 937: kill: (45066) - No such process
14:44:16 DEBUG| [stdout] # TEST: IPv4 over vxlan4: PMTU exceptions [ OK ]
14:44:17 DEBUG| [stdout] # TEST: IPv4 over vxlan4: PMTU exceptions - nexthop objects [ OK ]
14:44:18 DEBUG| [stdout] # TEST: IPv6 over vxlan4: PMTU exceptions [ OK ]
14:44:19 DEBUG| [stdout] # TEST: IPv6 over vxlan4: PMTU exceptions - nexthop objects [ OK ]
14:44:20 DEBUG| [stdout] # TEST: IPv4 over vxlan6: PMTU exceptions [ OK ]
14:44:22 DEBUG| [stdout] # TEST: IPv4 over vxlan6: PMTU exceptions - nexthop objects [ OK ]
14:44:23 DEBUG| [stdout] # TEST: IPv6 over vxlan6: PMTU exceptions [ OK ]
14:44:24 DEBUG| [stdout] # TEST: IPv6 over vxlan6: PMTU exceptions - nexthop objects [ OK ]
14:44:25 DEBUG| [stdout] # TEST: IPv4 over geneve4: PMTU exceptions [ OK ]
14:44:26 DEBUG| [stdout] # TEST: IPv4 over geneve4: PMTU exceptions - nexthop objects [ OK ]
14:44:27 DEBUG| [stdout] # TEST: IPv6 over geneve4: PMTU exceptions [ OK ]
14:44:28 DEBUG| [stdout] # TEST: IPv6 over geneve4: PMTU exceptions - nexthop objects [ OK ]
14:44:30 DEBUG| [stdout] # TEST: IPv4 over geneve6: PMTU exceptions [ OK ]
14:44:31 DEBUG| [stdout] # TEST: IPv4 over geneve6: PMTU exceptions - nexthop objects [ OK ]
14:44:32 DEBUG| [stdout] # TEST: IPv6 over geneve6: PMTU exceptions [ OK ]
14:44:33 DEBUG| [stdout] # TEST: IPv6 over geneve6: PMTU exceptions - nexthop objects [ OK ]
14:44:35 DEBUG| [stdout] # TEST: IPv4, bridged vxlan4: PMTU exceptions [ OK ]
14:44:37 DEBUG| [stdout] # TEST: IPv4, bridged vxlan4: PMTU exceptions - nexthop objects [ OK ]
14:44:40 DEBUG| [stdout] # TEST: IPv6, bridged vxlan4: PMTU exceptions [ OK ]
14:44:42 DEBUG| [stdout] # TEST: IPv6, bridged vxlan4: PMTU exceptions - nexthop objects [ OK ]
14:44:44 DEBUG| [stdout] # TEST: IPv4, bridged vxlan6: PMTU exceptions [ OK ]
14:44:46 DEBUG| [stdout] # TEST: IPv4, bridged vxlan6: PMTU exceptions - nexthop objects [ OK ]
14:44:49 DEBUG| [stdout] # TEST: IPv6, bridged vxlan6: PMTU exceptions [ OK ]
14:44:51 DEBUG| [stdout] # TEST: IPv6, bridged vxlan6: PMTU exceptions - nexthop objects [ OK ]
14:44:53 DEBUG| [stdout] # TEST: IPv4, bridged geneve4: PMTU exceptions [ OK ]
14:44:55 DEBUG| [stdout] # TEST: IPv4, bridged geneve4: PMTU exceptions - nexthop objects [ OK ]
14:44:58 DEBUG| [stdout] # TEST: IPv6, bridged geneve4: PMTU exceptions [ OK ]
14:45:00 DEBUG| [stdout] # TEST: IPv6, bridged geneve4: PMTU exceptions - nexthop objects [ OK ]
14:45:02 DEBUG| [stdout] # TEST: IPv4, bridged geneve6: PMTU exceptions [ OK ]
14:45:04 DEBUG| [stdout] # TEST: IPv4, bridged geneve6: PMTU exceptions - nexthop objects [ OK ]
14:45:07 DEBUG| [stdout] # TEST: IPv6, bridged geneve6: PMTU exceptions [ OK ]
14:45:09 DEBUG| [stdout] # TEST: IPv6, bridged geneve6: PMTU exceptions - nexthop objects [ OK ]
14:45:09 DEBUG| [stdout] # ovs_bridge not supported
14:45:09 DEBUG| [stdout] # TEST: IPv4, OVS vxlan4: PMTU exceptions [SKIP]
14:45:09 DEBUG| [stdout] # ovs_bridge not supported
14:45:09 DEBUG| [stdout] # TEST: IPv6, OVS vxlan4: PMTU exceptions [SKIP]
14:45:09 DEBUG| [stdout] # ovs_bridge not supported
14:45:09 DEBUG| [stdout] # TEST: IPv4, OVS vxlan6: PMTU exceptions [SKIP]
14:45:09 DEBUG| [stdout] # ovs_bridge not supported
14:45:09 DEBUG| [stdout] # TEST: IPv6, OVS vxlan6: PMTU exceptions [SKIP]
14:45:09 DEBUG| [stdout] # ovs_bridge not supported
14:45:09 DEBUG| [stdout] # TEST: IPv4, OVS geneve4: PMTU exceptions [SKIP]
14:45:10 DEBUG| [stdout] # ovs_bridge not supported
14:45:10 DEBUG| [stdout] # TEST: IPv6, OVS geneve4: PMTU exceptions [SKIP]
14:45:10 DEBUG| [stdout] # ovs_bridge not supported
14:45:10 DEBUG| [stdout] # TEST: IPv4, OVS geneve6: PMTU exceptions [SKIP]
14:45:10 DEBUG| [stdout] # ovs_bridge not supported
14:45:10 DEBUG| [stdout] # TEST: IPv6, OVS geneve6: PMTU exceptions [SKIP]
14:45:11 DEBUG| [stdout] # TEST: IPv4 over fou4: PMTU exceptions [ OK ]
14:45:11 DEBUG| [stdout] # TEST: IPv4 over fou4: PMTU exceptions - nexthop objects [ OK ]
14:45:11 DEBUG| [stdout] # TEST: IPv6 over fou4: PMTU exceptions [ OK ]
14:45:12 DEBUG| [stdout] # TEST: IPv6 over fou4: PMTU exceptions - nexthop objects [ OK ]
14:45:12 DEBUG| [stdout] # TEST: IPv4 over fou6: PMTU exceptions [ OK ]
14:45:12 DEBUG| [stdout] # TEST: IPv4 over fou6: PMTU exceptions - nexthop objects [ OK ]
14:45:12 DEBUG| [stdout] # TEST: IPv6 over fou6: PMTU exceptions [ OK ]
14:45:13 DEBUG| [stdout] # TEST: IPv6 over fou6: PMTU exceptions - nexthop objects [ OK ]
14:45:13 DEBUG| [stdout] # TEST: IPv4 over gue4: PMTU exceptions [ OK ]
14:45:13 DEBUG| [stdout] # TEST: IPv4 over gue4: PMTU exceptions - nexthop objects [ OK ]
14:45:13 DEBUG| [stdout] # TEST: IPv6 over gue4: PMTU exceptions [ OK ]
14:45:14 DEBUG| [stdout] # TEST: IPv6 over gue4: PMTU exceptions - nexthop objects [ OK ]
14:45:14 DEBUG| [stdout] # TEST: IPv4 over gue6: PMTU exceptions [ OK ]
14:45:14 DEBUG| [stdout] # TEST: IPv4 over gue6: PMTU exceptions - nexthop objects [ OK ]
14:45:14 DEBUG| [stdout] # TEST: IPv6 over gue6: PMTU exceptions [ OK ]
14:45:14 DEBUG| [stdout] # TEST: IPv6 over gue6: PMTU exceptions - nexthop objects [ OK ]
14:45:15 DEBUG| [stdout] # TEST: IPv4 over IPv4: PMTU exceptions [ OK ]
14:45:15 DEBUG| [stdout] # TEST: IPv4 over IPv4: PMTU exceptions - nexthop objects [ OK ]
14:45:15 DEBUG| [stdout] # TEST: IPv6 over IPv4: PMTU exceptions [ OK ]
14:45:15 DEBUG| [stdout] # TEST: IPv6 over IPv4: PMTU exceptions - nexthop objects [ OK ]
14:45:15 DEBUG| [stdout] # TEST: IPv4 over IPv6: PMTU exceptions [ OK ]
14:45:16 DEBUG| [stdout] # TEST: IPv4 over IPv6: PMTU exceptions - nexthop objects [ OK ]
14:45:16 DEBUG| [stdout] # TEST: IPv6 over IPv6: PMTU exceptions [ OK ]
14:45:16 DEBUG| [stdout] # TEST: IPv6 over IPv6: PMTU exceptions - nexthop objects [ OK ]
15:29:14 INFO | Timer expired (2700 sec.), nuking pid 43936
15:29:14 INFO | ERROR ubuntu_
15:29:14 INFO | END ERROR ubuntu_
15:29:14 DEBUG| Persistent state client.
15:29:14 DEBUG| Persistent state client.
15:29:14 DEBUG| Test has timeout: 2700 sec.
15:29:14 INFO | START ubuntu_
15:29:14 DEBUG| Persistent state client.
15:29:14 DEBUG| Persistent state client.
15:29:14 DEBUG| Waiting for pid 54956 for 2700 seconds
15:29:14 WARNI| System python is too old, crash handling disabled
15:29:14 DEBUG| Running 'make run_tests -C net TEST_PROGS=
15:29:14 DEBUG| [stdout] make: Entering directory '/home/
15:29:14 DEBUG| [stdout] TAP version 13
15:29:14 DEBUG| [stdout] 1..1
15:29:14 DEBUG| [stdout] # timeout set to 0
15:29:14 DEBUG| [stdout] # selftests: net: udpgso.sh
15:29:14 DEBUG| [stdout] # ipv4 cmsg
16:14:14 INFO | Timer expired (2700 sec.), nuking pid 54956
16:14:14 INFO | ERROR ubuntu_
16:14:14 INFO | END ERROR ubuntu_
...
Accompanied by the following dmesg entries, "waiting for veth_A-R1 to become free" appears to repeat until the system is shutdown:
[ 574.404587] unregister_
[ 584.652587] unregister_
[ 594.900587] unregister_
[ 605.148587] unregister_
[ 615.396587] unregister_
[ 625.644587] unregister_
[ 635.892587] unregister_
[ 646.140587] unregister_
[ 656.388587] unregister_
[ 666.636587] unregister_
[ 676.884586] unregister_
[ 687.132586] unregister_
[ 697.380587] unregister_
[ 707.628586] unregister_
[ 717.876586] unregister_
[ 728.124587] unregister_
[ 738.284631] INFO: task modprobe:54505 blocked for more than 122 seconds.
[ 738.291488] Not tainted 6.8.0-1009-
[ 738.297364] "echo 0 > /proc/sys/
[ 738.372587] unregister_
[ 748.620586] unregister_
...
I was able to reproduce the issue on N-nvidia and N-nvidia-64k, versions 6.8.0-1009.9 and 6.8.0-1008.8, so it does not appear to be a new regression. I could also reproduce the issue on upstream 6.10.0-rc7.
I can cause the issue reliably on node hinyari by repeatedly running pmtu.sh from kernel selftests individually with only the pmtu_ipv6_ ipv6_exception test case:
# while true; do ./pmtu.sh pmtu_ipv6_ ipv6_exception; done
After a couple dozen iterations, I see the following in dmesg:
[ 67.586600] ip6_tunnel: ip_a xmit: Local address not yet configured! netdevice: waiting for veth_A-R1 to become free. Usage count = 5 7.729:108) : apparmor="DENIED" operation="open" class="file" profile="rsyslogd" name="/ run/systemd/ sessions/ " pid=1668 comm=72733A6D61 696E20513A52656 7 requested_mask="r" denied_mask="r" fsuid=102 ouid=0 netdevice: waiting for veth_A-R1 to become free. Usage count = 5 7.849:109) : apparmor="DENIED" operation="open" class="file" profile="rsyslogd" name="/ run/systemd/ sessions/ " pid=1668 comm=72733A6D61 696E20513A52656 7 requested_mask="r" denied_mask="r" fsuid=102 ouid=0 netdevice: waiting for veth_A-R1 to become free. Usage count = 5 7.993:110) : apparmor="DENIED" operation="open" class="file" profile="rsyslogd" name="/ run/systemd/ sessions/ " pid=1668 comm=72733A6D61 696E20513A52656 7 requested_mask="r" denied_mask="r" fsuid=102 ouid=0
[ 73.426596] unregister_
[ 73.435693] audit: type=1400 audit(172064544
[ 76.802597] ip6_tunnel: ip_a xmit: Local address not yet configured!
[ 83.546588] unregister_
[ 83.555666] audit: type=1400 audit(172064545
[ 93.690586] unregister_
[ 93.699676] audit: type=1400 audit(172064546
[ 94.722599] ip6_tunnel: ip_a xmit: Local address not yet configured!
[ 96.259457] neighbour: ndisc_cache: neighbor table overflow!
[ 96.259484] neighbour: ndisc_cache: neighbor table overflow!
[ 96.259512] neighbour: ndisc_cache: neighbor table overflow!
[ 96.259539] neighbour: ndisc_cache: neighbor table overflow!
Running ./pmtu.sh pmtu_vti6_exception normally works fine for many iterations. However, if ip6_vti wasn't loaded before hand, and the above "waiting for veth_A-R1 to become free" condition occurs, running it results in it hanging, unable to be interrupted.
The following eventually appears in dmesg: kernel/ hung_task_ timeout_ secs" disables this message. to+0xc8/ 0x100 0x2a8/0x7b0 preempt_ disabled+ 0x30/0x68 write_slowpath+ 0x27c/0x768 0xcc/0xd8 pernet_ device+ 0x38/0x98 init+0x30/ 0xff8 [ip6_vti] initcall+ 0x64/0x3b8 module+ 0xa4/0x280 0x7b8/0x8f0 from_file+ 0x98/0x118
[ 492.525328] INFO: task modprobe:21918 blocked for more than 122 seconds.
[ 492.532186] Tainted: G OE 6.8.0-1008-nvidia #8-Ubuntu
[ 492.539219] "echo 0 > /proc/sys/
[ 492.547228] task:modprobe state:D stack:0 pid:21918 tgid:21918 ppid:828 flags:0x0080000c
[ 492.547232] Call trace:
[ 492.547233] __switch_
[ 492.547241] __schedule+
[ 492.547243] schedule+0x40/0x168
[ 492.547245] schedule_
[ 492.547248] rwsem_down_
[ 492.547252] down_write+
[ 492.547254] register_
[ 492.547263] vti6_tunnel_
[ 492.547272] do_one_
[ 492.547276] do_init_
[ 492.547279] load_module+
[ 492.547281] init_module_
[...