Raspberry Pi 3 network dies shortly after a burst of IPv6 tunnel network load ((lan78xx): transmit queue 0 timed out)

Bug #1861936 reported by mlx
This bug affects 8 people
Affects                 Status          Importance   Assigned to   Milestone
linux-raspi (Ubuntu)    Incomplete      Undecided    Unassigned
  Focal                 Fix Committed   Undecided    Unassigned
  Groovy                Fix Released    Undecided    Unassigned
linux-raspi2 (Ubuntu)   Incomplete      High         Hui Wang
  Eoan                  Won't Fix       Undecided    Unassigned

Bug Description

[Impact]

A mix of incoming TCP traffic, packet forwarding and a saturated link results in lan78xx TX queue timeouts, which bring the network interface down.

[Test Case]

See comment #99 below.

[Where Problems Could Occur]

Probably lost or corrupted TCP packets. Most likely not worse than what we're currently seeing.

[Original Description]

Description changed:
Raspberry Pi 3 network partially dies (transmission doesn't work, reception still does) shortly after a burst of network load over IPv6, when IPv6 connectivity is provided by a tunnel from tunnelbroker.net. The triggering load is typically an HTTP(S) download, and replication can be done without actually saving the file (wget -O /dev/null ...). The problem happens within ~10 GB of downloading, usually within the first 1 GB of traffic.
Replication is 100% as long as _all_ of the following conditions are met:
- 6in4 tunnel to HE.net set up with netplan
- ipv6 rules applied (netfilter-persistent)
- ipv6 forwarding enabled (edit /etc/sysctl.conf)
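
As a minimal sketch of checking the third condition, the snippet below parses sysctl.conf-style text and reports whether IPv6 forwarding is switched on. The key `net.ipv6.conf.all.forwarding` is the standard sysctl name; the helper names and the sample text are illustrative assumptions, not the reporter's actual file.

```python
def sysctl_settings(conf_text):
    """Parse sysctl.conf-style text into a dict, skipping comments and blanks."""
    settings = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        # sysctl.conf treats lines starting with '#' or ';' as comments.
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def ipv6_forwarding_enabled(conf_text):
    """True if net.ipv6.conf.all.forwarding is set to 1 in the given text."""
    return sysctl_settings(conf_text).get("net.ipv6.conf.all.forwarding") == "1"


# Hypothetical sample, not the reporter's actual /etc/sysctl.conf.
SAMPLE = """
# Uncomment the next line to enable packet forwarding for IPv6
net.ipv6.conf.all.forwarding=1
"""
print(ipv6_forwarding_enabled(SAMPLE))
```

To check a real system, one would pass the contents of /etc/sysctl.conf to `ipv6_forwarding_enabled`. Note that this only inspects the file, not the running kernel state.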

kern.log message that appears after a while:

Feb 4 23:42:59 rpi3 kernel: [ 571.878359] ------------[ cut here ]------------
Feb 4 23:42:59 rpi3 kernel: [ 571.878420] NETDEV WATCHDOG: eth0 (lan78xx): transmit queue 0 timed out
Feb 4 23:42:59 rpi3 kernel: [ 571.878550] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:447 dev_watchdog+0x324/0x330
Feb 4 23:42:59 rpi3 kernel: [ 571.878557] Modules linked in: sit tunnel4 ip_tunnel bridge stp llc ip6table_filter ip6_tables xt_tcpudp xt_conntrack nf_conntrack iptable_filter bpfilter nls_ascii dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua btsdio bluetooth ecdh_generic ecc brcmfmac brcmutil cfg80211 bcm2835_v4l2(CE) bcm2835_mmal_vchiq(CE) input_leds vc_sm_cma(CE) v4l2_common videobuf2_vmalloc spidev videobuf2_memops videobuf2_v4l2 raspberrypi_hwmon videobuf2_common videodev mc uio_pdrv_genirq uio sch_fq_codel jool(OE) jool_common(OE) nf_defrag_ipv6 nf_defrag_ipv4 ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid crct10dif_ce sdhci_iproc phy_generic fixed aes_neon_bs aes_neon_blk crypto_simd cryptd aes_arm64
Feb 4 23:42:59 rpi3 kernel: [ 571.878774] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G C OE 5.3.0-1017-raspi2 #19-Ubuntu
Feb 4 23:42:59 rpi3 kernel: [ 571.878781] Hardware name: Raspberry Pi 3 Model B Plus Rev 1.3 (DT)
Feb 4 23:42:59 rpi3 kernel: [ 571.878789] pstate: 60400005 (nZCv daif +PAN -UAO)
Feb 4 23:42:59 rpi3 kernel: [ 571.878800] pc : dev_watchdog+0x324/0x330
Feb 4 23:42:59 rpi3 kernel: [ 571.878807] lr : dev_watchdog+0x324/0x330
Feb 4 23:42:59 rpi3 kernel: [ 571.878812] sp : ffff00001001bd60
Feb 4 23:42:59 rpi3 kernel: [ 571.878817] x29: ffff00001001bd60 x28: 0000000000000140
Feb 4 23:42:59 rpi3 kernel: [ 571.878827] x27: 00000000ffffffff x26: 0000000000000000
Feb 4 23:42:59 rpi3 kernel: [ 571.878836] x25: ffff8ecbefa4e000 x24: ffff305e0f529018
Feb 4 23:42:59 rpi3 kernel: [ 571.878845] x23: 0000000000000000 x22: 0000000000000001
Feb 4 23:42:59 rpi3 kernel: [ 571.878853] x21: ffff8ecbefa4e480 x20: ffff305e0f807000
Feb 4 23:42:59 rpi3 kernel: [ 571.878862] x19: 0000000000000000 x18: 0000000000000000
Feb 4 23:42:59 rpi3 kernel: [ 571.878871] x17: ffff000010fd8218 x16: ffff305e0e30efb8
Feb 4 23:42:59 rpi3 kernel: [ 571.878879] x15: ffff8ecbf922a290 x14: ffffffffffffffff
Feb 4 23:42:59 rpi3 kernel: [ 571.878888] x13: 0000000000000000 x12: ffff305e0f944000
Feb 4 23:42:59 rpi3 kernel: [ 571.878897] x11: ffff305e0f82d000 x10: 0000000000000000
Feb 4 23:42:59 rpi3 kernel: [ 571.878905] x9 : 0000000000000004 x8 : 000000000000017f
Feb 4 23:42:59 rpi3 kernel: [ 571.878913] x7 : 0000000000000000 x6 : 0000000000000001
Feb 4 23:42:59 rpi3 kernel: [ 571.878921] x5 : 0000000000000000 x4 : 0000000000000008
Feb 4 23:42:59 rpi3 kernel: [ 571.878929] x3 : ffff305e0ee15750 x2 : 0000000000000004
Feb 4 23:42:59 rpi3 kernel: [ 571.878937] x1 : 6abb42c67c954600 x0 : 0000000000000000
Feb 4 23:42:59 rpi3 kernel: [ 571.878946] Call trace:
Feb 4 23:42:59 rpi3 kernel: [ 571.878955] dev_watchdog+0x324/0x330
Feb 4 23:42:59 rpi3 kernel: [ 571.878967] call_timer_fn+0x3c/0x178
Feb 4 23:42:59 rpi3 kernel: [ 571.878977] __run_timers.part.0+0x200/0x330
Feb 4 23:42:59 rpi3 kernel: [ 571.878985] run_timer_softirq+0x40/0x78
Feb 4 23:42:59 rpi3 kernel: [ 571.878995] __do_softirq+0x168/0x384
Feb 4 23:42:59 rpi3 kernel: [ 571.879007] irq_exit+0xb0/0xe8
Feb 4 23:42:59 rpi3 kernel: [ 571.879020] __handle_domain_irq+0x70/0xc0
Feb 4 23:42:59 rpi3 kernel: [ 571.879028] bcm2836_arm_irqchip_handle_irq+0x74/0xe0
Feb 4 23:42:59 rpi3 kernel: [ 571.879037] el1_irq+0x108/0x200
Feb 4 23:42:59 rpi3 kernel: [ 571.879050] arch_cpu_idle+0x3c/0x1c8
Feb 4 23:42:59 rpi3 kernel: [ 571.879069] default_idle_call+0x24/0x48
Feb 4 23:42:59 rpi3 kernel: [ 571.879081] do_idle+0x210/0x2a0
Feb 4 23:42:59 rpi3 kernel: [ 571.879091] cpu_startup_entry+0x28/0x30
Feb 4 23:42:59 rpi3 kernel: [ 571.879102] secondary_start_kernel+0x154/0x1c8
Feb 4 23:42:59 rpi3 kernel: [ 571.879108] ---[ end trace 349744d60a20e77a ]---
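
For anyone collecting logs from several boards, a small filter for the watchdog message can be sketched as follows. The regular expression is modelled only on the message shown above; the function name is a hypothetical helper, and the sample lines are copied from this trace.

```python
import re

# Pattern modelled on the "NETDEV WATCHDOG" line in the trace above.
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(log_lines):
    """Return (interface, driver, queue) for every watchdog timeout line."""
    hits = []
    for line in log_lines:
        match = WATCHDOG_RE.search(line)
        if match:
            hits.append(
                (match.group("iface"), match.group("driver"), int(match.group("queue")))
            )
    return hits


sample = [
    "Feb 4 23:42:59 rpi3 kernel: [ 571.878359] ------------[ cut here ]------------",
    "Feb 4 23:42:59 rpi3 kernel: [ 571.878420] NETDEV WATCHDOG: eth0 (lan78xx): transmit queue 0 timed out",
]
print(find_tx_timeouts(sample))
```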

ProblemType: Bug
DistroRelease: Ubuntu 19.10
Package: linux-image-5.3.0-1017-raspi2 5.3.0-1017.19
ProcVersionSignature: Ubuntu 5.3.0-1017.19-raspi2 5.3.13
Uname: Linux 5.3.0-1017-raspi2 aarch64
ApportVersion: 2.20.11-0ubuntu8.2
Architecture: arm64
Date: Tue Feb 4 23:49:19 2020
ProcEnviron:
 TERM=linux
 PATH=(custom, no user)
 LANG=C.UTF-8
 SHELL=/bin/bash
SourcePackage: linux-raspi2
UpgradeStatus: No upgrade log present (probably fresh install)

Revision history for this message
mlx (myxal-mxl) wrote :
mlx (myxal-mxl)
description: updated
Revision history for this message
Hui Wang (hui.wang) wrote :

I will try to reproduce your issue first.

thx.

Changed in linux-raspi2 (Ubuntu):
importance: Undecided → High
assignee: nobody → Hui Wang (hui.wang)
Revision history for this message
Hui Wang (hui.wang) wrote :

I ran the test on the rpi3B+ board: I copied a 3.3G file with scp from the host machine to the rpi3B+ board (mmc card). There were no errors, and there is no call trace in the dmesg of the rpi3B+ board.

log on the host machine:
hwang4@hwang4-Vostro-5390:~/Downloads$ scp dell-bto-bionic-beaver-osp1-shireen-X44-20200103-22.iso ubuntu@192.168.2.103:~/test/
The authenticity of host '192.168.2.103 (192.168.2.103)' can't be established.
ECDSA key fingerprint is SHA256:HbfQtMzTnE9Bl6sg20Y95s6Ruf21ybTJ6T7npMl0Ex4.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.2.103' (ECDSA) to the list of known hosts.
ubuntu@192.168.2.103's password:
dell-bto-bionic-beaver-osp1-shireen-X44-20200103-22.iso 100% 3389MB 641.5KB/s 1:30:09

log on the rpi3B+ board:
ubuntu@ubuntu:~/test$ uname -a
Linux ubuntu 5.3.0-1017-raspi2 #19+otg SMP Wed Jan 29 12:45:11 CST 2020 aarch64 aarch64 aarch64 GNU/Linux
ubuntu@ubuntu:~/test$ ls -la
total 3470028
drwxrwxr-x 2 ubuntu ubuntu 4096 Feb 7 11:53 .
drwxr-xr-x 7 ubuntu ubuntu 4096 Feb 7 11:51 ..
-rw-rw-r-- 1 ubuntu ubuntu 3553296384 Feb 7 13:23 dell-bto-bionic-beaver-osp1-shireen-X44-20200103-22.iso

dmesg:
....
[ 22.099395] audit: type=1400 audit(1581076194.463:10): apparmor="STATUS" operation="profile_load" profile="unconfined" name="lsb_release" pid=1132 comm="apparmor_parser"
[ 22.114632] audit: type=1400 audit(1581076194.479:11): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/sbin/dhclient" pid=1131 comm="apparmor_parser"
[ 27.603378] IPv6: ADDRCONF(NETDEV_CHANGE): wlan0: link becomes ready
[ 112.486200] kauditd_printk_skb: 18 callbacks suppressed
[ 112.486207] audit: type=1400 audit(1581076328.080:30): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.lxd.activate" pid=1678 comm="apparmor_parser"
[ 112.507068] audit: type=1400 audit(1581076328.104:31): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.lxd.buginfo" pid=1680 comm="apparmor_parser"
[ 112.543622] audit: type=1400 audit(1581076328.140:32): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.lxd.benchmark" pid=1679 comm="apparmor_parser"
[ 112.684646] audit: type=1400 audit(1581076328.280:33): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.lxd.check-kernel" pid=1681 comm="apparmor_parser"
[ 113.298782] audit: type=1400 audit(1581076328.896:34): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.lxd.hook.install" pid=1684 comm="apparmor_parser"
[ 113.474537] audit: type=1400 audit(1581076329.072:35): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.lxd.lxc" pid=1685 comm="apparmor_parser"
[ 113.522045] audit: type=1400 audit(1581076329.116:36): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.lxd.hook.configure" pid=1683 comm="apparmor_parser"
[ 113.551776] audit: type=1400 audit(1581076329.148:37): apparmor="STATUS" operation="profile_replace" profile="unconfined"...


Revision history for this message
Hui Wang (hui.wang) wrote :

set it to incomplete first.

Changed in linux-raspi2 (Ubuntu):
status: New → Incomplete
Revision history for this message
Mauricio Faria de Oliveira (mfo) wrote :

Hi Hui,

There seems to be some (small?) difference between your kernel (5.3.0-1017-raspi2 #19+otg) and the reporter's kernel (original 5.3.0-1017-raspi2 #19-Ubuntu) ?

I realize you're of course aware of the changes introduced in this '+otg' custom build you're running, but just wanted to mention that to confirm those are unrelated to the symptoms seen in this bug.

cheers,
Mauricio

Revision history for this message
Hui Wang (hui.wang) wrote :

Oh, 192.168.2.103 is the wifi's IP on the RPI3B+. I redid the test with the LAN IP (192.168.2.107) and still can't reproduce the bug.

log on the host:
hwang4@hwang4-Vostro-5390:~/Downloads$ scp dell-bto-bionic-beaver-osp1-shireen-X44-20200103-22.iso ubuntu@192.168.2.107:~/lan/
The authenticity of host '192.168.2.107 (192.168.2.107)' can't be established.
ECDSA key fingerprint is SHA256:HbfQtMzTnE9Bl6sg20Y95s6Ruf21ybTJ6T7npMl0Ex4.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.2.107' (ECDSA) to the list of known hosts.
ubuntu@192.168.2.107's password:
dell-bto-bionic-beaver-osp1-shireen-X44-20200103-22.iso 100% 3389MB 7.0MB/s 08:06

log on the RPI3B+:

ubuntu@ubuntu:~/lan$ dmesg
[ 0.000000] Booting Linux on physical CPU 0x0000000000 [0x410fd034]
[ 0.000000] Linux version 5.3.0-1017-raspi2 (root@zaku) (gcc version 9.2.1 20191008 (Ubuntu 9.2.1-9ubuntu2)) #19+otg SMP Wed Jan 29 12:45:11 CST 2020 (Ubuntu 5.3.0-1017.19+otg-raspi2 5.3.13)
[ 0.000000] Machine model: Raspberry Pi 3 Model B Plus Rev 1.3
[ 0.000000] efi: Getting EFI parameters from FDT:
...
[ 113.962890] audit: type=1400 audit(1581076329.560:38): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.lxd.lxd" pid=1686 comm="apparmor_parser"
[ 114.104891] audit: type=1400 audit(1581076329.700:39): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.lxd.migrate" pid=1687 comm="apparmor_parser"
[ 6029.845036] lan78xx 1-1.1.1:1.0 eth0: No phy led trigger registered for speed(-1)
[ 6029.853021] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready

ubuntu@ubuntu:~/lan$ ifconfig
eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 192.168.2.107 netmask 255.255.255.0 broadcast 192.168.2.255
        inet6 fe80::ba27:ebff:fece:ca4c prefixlen 64 scopeid 0x20<link>
        ether b8:27:eb:ce:ca:4c txqueuelen 1000 (Ethernet)
        RX packets 2470290 bytes 3725719810 (3.7 GB)
        RX errors 0 dropped 14 overruns 0 frame 0
        TX packets 1074945 bytes 72518896 (72.5 MB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

ubuntu@ubuntu:~/lan$ ls -la
total 3470028
drwxrwxr-x 2 ubuntu ubuntu 4096 Feb 7 13:32 .
drwxr-xr-x 8 ubuntu ubuntu 4096 Feb 7 13:32 ..
-rw-rw-r-- 1 ubuntu ubuntu 3553296384 Feb 7 13:40 dell-bto-bionic-beaver-osp1-shireen-X44-20200103-22.iso

Revision history for this message
mlx (myxal-mxl) wrote :

Hmm, will look into this again. I'm using a feature-rich networking setup (6in4 tunnel, jool NAT64, LAN + WLAN bridge, ip(6)tables); maybe some of it is also required.

Revision history for this message
Hui Wang (hui.wang) wrote :

The only difference between kernel 5.3.0-1017-raspi2 #19+otg and 5.3.0-1017-raspi2 #19-Ubuntu is:
CONFIG_USB_DWC2_HOST=y and CONFIG_USB_DWC2_DUAL_ROLE is not set in the #19-Ubuntu
CONFIG_USB_DWC2_DUAL_ROLE=y and CONFIG_USB_DWC2_HOST is not set in the #19+otg

The difference will not affect this bug. And I built the #19+otg kernel for this bug: #1861070
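
A comparison like the one above can be reproduced with a short script. The sketch below assumes the usual kernel config text format (`CONFIG_X=y` and `# CONFIG_X is not set`); the two sample strings only restate the options quoted in this comment, and the function names are hypothetical.

```python
import re

SET_RE = re.compile(r"^(CONFIG_\w+)=(.*)$")
UNSET_RE = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_kconfig(text):
    """Map option name -> value; options marked 'is not set' map to None."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = SET_RE.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = UNSET_RE.match(line)
        if match:
            options[match.group(1)] = None
    return options


def diff_kconfig(a_text, b_text):
    """Return {option: (value_in_a, value_in_b)} for options that differ.

    An option missing from one file is treated like 'is not set' (None).
    """
    a, b = parse_kconfig(a_text), parse_kconfig(b_text)
    return {
        name: (a.get(name), b.get(name))
        for name in sorted(set(a) | set(b))
        if a.get(name) != b.get(name)
    }


# Samples restating the two options quoted above, not full config files.
stock = "CONFIG_USB_DWC2_HOST=y\n# CONFIG_USB_DWC2_DUAL_ROLE is not set\n"
otg = "# CONFIG_USB_DWC2_HOST is not set\nCONFIG_USB_DWC2_DUAL_ROLE=y\n"
print(diff_kconfig(stock, otg))
```

On Ubuntu the config of an installed kernel is normally available as /boot/config-<version>, so the two texts could be read from there.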

Revision history for this message
mlx (myxal-mxl) wrote :

(This is still with full setup, ie, jool and iptunnel, traffic using the tunnel): The first time I tried I also managed to download gparted without issues, but next time the issue appeared again.

No jool and no tunnel: wget 2x 900MB (ubuntu iso), no issue. scp the same iso from local machine to RPi, also no issue.

No jool, 6in4 tunnel + ip6tables: Replicated on 3rd attempt with wget. I think what may be contributing is that the machine is acting as an IPv6 gateway for my network, and there are some background bursts of load on the tunnel (Youtube playing on the desktop). The wget download itself is also going through the tunnel:

$ wget http://ftp.antik.sk/ubuntu-releases/eoan/ubuntu-19.10-live-server-amd64.iso
--2020-02-08 21:51:00-- http://ftp.antik.sk/ubuntu-releases/eoan/ubuntu-19.10-live-server-amd64.iso
Resolving ftp.antik.sk (ftp.antik.sk)... 2a02:130:9900:30::12, 88.212.10.12
Connecting to ftp.antik.sk (ftp.antik.sk)|2a02:130:9900:30::12|:80... connected.

Can you retry with a 6in4 tunnel? I'm using the free offering from tunnelbroker.net. Or perhaps suggest a way to collect more logs.

I'll upload a set of relevant config files for:
- netplan
- netfilter-persistent
- sysctl

Revision history for this message
mlx (myxal-mxl) wrote :

@hui.wang Is there anything else needed from me? The issue still happens (intermittently under my normal use as an IPv6 gateway).

Revision history for this message
Hui Wang (hui.wang) wrote :

@mlx,

I will try to reproduce the bug according to your steps this weekend.

thx.

Revision history for this message
Hui Wang (hui.wang) wrote :

Hi mlx,

From the call trace, it looks like transmit stops working when the problem happens. It is probably a problem in the lan78xx.c driver or in the usb host driver.

So when the problem happens, do the other usb ports still work? Also, the latest -1019 kernel integrated a couple of patches to lan78xx.c, so you could test with the -1019 kernel.

And looks like other people also met the similar problem (at least similar calltrace):
https://www.raspberrypi.org/forums/search.php?keywords=lan78xx+transmit+queue+0+timed+out&sid=54addf297812dfe931b66a621bcae444

And I am still trying to reproduce this problem.

Revision history for this message
Hui Wang (hui.wang) wrote :

Looks like to setup a 6in4 tunnel, I need a static public IP first, but I don't have that IP.

And you could do a test, let us change the usb host driver, if the problem can't be reproduced anymore, it proves it is a problem on usb host driver.

To change the usb host driver:
add dtoverlay=dwc2 in the config.txt for RPI3B+ board.

to check if usb host driver is changed:
lsusb -t (it will show dwc2 instead of dwc_otg).
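
The check can also be scripted. This is a rough sketch that pulls the root hub driver names out of `lsusb -t` output; the sample text is an illustrative approximation of that output rather than a capture from the board, and the helper name is hypothetical.

```python
import re

# Root hub lines in `lsusb -t` output carry a "Driver=<name>/<ports>p" field.
DRIVER_RE = re.compile(r"Class=root_hub, Driver=([^/,\s]+)")


def root_hub_drivers(lsusb_tree_output):
    """Return the sorted, de-duplicated driver names of all USB root hubs."""
    return sorted(set(DRIVER_RE.findall(lsusb_tree_output)))


# Illustrative sample only; real output may differ in spacing and fields.
sample = (
    "/:  Bus 01.Port 1: Dev 1, Class=root_hub, Driver=dwc2/1p, 480M\n"
    "    |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/4p, 480M\n"
)
print(root_hub_drivers(sample))
```

If the overlay took effect, the list should contain `dwc2` rather than `dwc_otg`.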

Revision history for this message
mlx (myxal-mxl) wrote :

Hi. The USB ports still work fine - I use a USB keyboard to correctly reboot the machine when it gets stuck. My most recent replication was with ~1017 kernel, will try the newer ones.

Revision history for this message
mlx (myxal-mxl) wrote :

Where can I find the ~1019 kernel? The repositories are still at ~1018.

Revision history for this message
mlx (myxal-mxl) wrote :

> Looks like to setup a 6in4 tunnel, I need a static public IP first

I don't think that static IP is needed to make it work at least temporarily, I don't have a static one either, just public. You only need to have public IP on the device you can manage - e.g. router at home, and make sure that raspberry has a static internal address. Then, in the router's firewall/forwarding settings, either set raspberry as the DMZ host, or set forward anything from the remote tunnel IP to raspberry.

That being said, if you are behind a carrier NAT which you can't manage, you won't be able to set up the tunnel.

I've just tried the switch to dwc2 driver, it didn't help. (still on ~1018 kernel).

Revision history for this message
Hui Wang (hui.wang) wrote :

-1019 kernel should be in the eoan-proposed channel.

And I tried to setup the 6in4 on my host laptop, it didn't work, maybe the ISP block the packets because of GFW of china.

And since the dwc2 also has this problem, it looks this issue has nothing to do with usb host driver, could you plug a different usb->ethernet (different brand from lan78xx) and test with this usb->ethernet, let us see if you could still reproduce the problem?

Revision history for this message
mlx (myxal-mxl) wrote :

Sorry, got kinda stuck trying to test with a USB NIC: https://askubuntu.com/q/1212529/325336

After disabling systemd-resolved (+ putting a public DNS IPv4 address in resolv.conf), v6 download over the external USB NIC (Apple USB Ethernet, 05ac:1402) works fine. After switching the cable back to the built-in NIC, the error appeared almost immediately, didn't even get to try the v6 download.

[ 2452.471188] br0: port 3(eth1) entered disabled state
[ 3030.723594] br0: received packet on wlan0 with own address as source address (addr:b8:27:eb:3a:ba:a2, vlan:0)
[ 3030.860954] br0: received packet on wlan0 with own address as source address (addr:b8:27:eb:3a:ba:a2, vlan:0)
[ 3037.463785] br0: received packet on wlan0 with own address as source address (addr:b8:27:eb:3a:ba:a2, vlan:0)
[ 3228.800706] asix 1-1.3:1.0 eth1: link up, 100Mbps, full-duplex, lpa 0xCDE1
[ 3228.804385] br0: port 3(eth1) entered blocking state
[ 3228.804417] br0: port 3(eth1) entered forwarding state
[ 3293.930873] ------------[ cut here ]------------
[ 3293.930958] NETDEV WATCHDOG: eth0 (lan78xx): transmit queue 0 timed out
[ 3293.931102] WARNING: CPU: 3 PID: 0 at net/sched/sch_generic.c:447 dev_watchdog+0x324/0x330
[ 3293.931109] Modules linked in: asix sit tunnel4 ip_tunnel bridge stp llc ip6table_filter ip6_tables xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter nls_ascii dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua btsdio bluetooth ecdh_generic bcm2835_v4l2(CE) ecc bcm2835_mmal_vchiq(CE) vc_sm_cma(CE) brcmfmac v4l2_common videobuf2_vmalloc brcmutil videobuf2_memops videobuf2_v4l2 cfg80211 videobuf2_common input_leds videodev mc spidev raspberrypi_hwmon uio_pdrv_genirq uio sch_fq_codel ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid crct10dif_ce sdhci_iproc phy_generic fixed aes_neon_bs aes_neon_blk crypto_simd cryptd aes_arm64
[ 3293.931334] CPU: 3 PID: 0 Comm: swapper/3 Tainted: G C E 5.3.0-1018-raspi2 #20-Ubuntu
[ 3293.931339] Hardware name: Raspberry Pi 3 Model B Plus Rev 1.3 (DT)
[ 3293.931348] pstate: 60400005 (nZCv daif +PAN -UAO)
[ 3293.931356] pc : dev_watchdog+0x324/0x330
[ 3293.931363] lr : dev_watchdog+0x324/0x330
[ 3293.931366] sp : ffff00001001bd60
[ 3293.931371] x29: ffff00001001bd60 x28: 0000000000000140
[ 3293.931380] x27: 00000000ffffffff x26: 0000000000000000
[ 3293.931388] x25: fffff2fb79363000 x24: ffff4f7cab129018
[ 3293.931396] x23: 0000000000000000 x22: 0000000000000001
[ 3293.931404] x21: fffff2fb79363480 x20: ffff4f7cab407000
[ 3293.931412] x19: 0000000000000000 x18: 0000000000000000
[ 3293.931420] x17: ffff000010fd8218 x16: ffff4f7ca9f0f060
[ 3293.931428] x15: fffff2fb79228510 x14: ffffffffffffffff
[ 3293.931437] x13: 0000000000000000 x12: ffff4f7cab540000
[ 3293.931445] x11: ffff4f7cab42d000 x10: 0000000000000000
[ 3293.931453] x9 : 0000000000000004 x8 : 0000000000000192
[ 3293.931461] x7 : 0000000000000000 x6 : 0000000000000001
[ 3293.931468] x5 : 0000000000000000 x4 : 0000000000000008
[ 3293.931476] x3 : ffff4f7caaa1575...


Revision history for this message
Hui Wang (hui.wang) wrote :

Since the Apple usb NIC doesn't have this issue, it is a problem on the driver lan78xx.c

Do you mean even without ipv6 tunnel, only setup a bridge on rpi3 to let eth0 connect WAN and let wlan0 play a role of AP, this issue could be reproduced?

Could you please share the detailed steps on how to setup a bridge on the rpi3 to reproduce this issue?

Revision history for this message
mlx (myxal-mxl) wrote :

How to set up bridge - is github gist link OK?
https://gist.github.com/myxal/6554bd370658a11621a30cd2e6e7d7a8

Testing without IPv6 tunnel - I'm not sure what you mean. If eth0 is the WAN side, and only wlan0 is for the LAN side, then there's no bridge needed... This would introduce another NAT (currently IPv4 NAT is already performed at the ISP router, and I can't change that), compared to the current setup - see the gist linked above.

I have tried (previously, with kernel 1017 I think) just disabling the IPv6 part altogether (the tunnel interface was not created, static addresses not assigned, etc. wlan and eth0 still in bridge), and the issue was not reproduced.

Revision history for this message
Hui Wang (hui.wang) wrote :

I changed the ip in the two yaml and put them to $rpi3B+/etc/netplan/, reboot, but the ipv6 doesn't work, only ipv4 works. BTW, you said you run wget and play youtube all on the rpi3B+, right?

This is the log on my rpi3B+:

ubuntu@ubuntu:~$ ifconfig
br0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        inet 192.168.2.4 netmask 255.255.255.0 broadcast 192.168.2.255
        inet6 fe80::7c3a:7cff:fe69:3425 prefixlen 64 scopeid 0x20<link>
        inet6 2001:470:35:3b1::2 prefixlen 64 scopeid 0x0<global>
        ether 7e:3a:7c:69:34:25 txqueuelen 1000 (Ethernet)
        RX packets 305 bytes 78562 (78.5 KB)
        RX errors 0 dropped 8 overruns 0 frame 0
        TX packets 137 bytes 14821 (14.8 KB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

eth0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
        ether b8:27:eb:ce:ca:4c txqueuelen 1000 (Ethernet)
        RX packets 318 bytes 83612 (83.6 KB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 137 bytes 14821 (14.8 KB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

he-ipv6: flags=209<UP,POINTOPOINT,RUNNING,NOARP> mtu 1480
        inet6 fe80::c0a8:204 prefixlen 64 scopeid 0x20<link>
        inet6 2001:470:35:3b1::1 prefixlen 64 scopeid 0x0<global>
        sit txqueuelen 1000 (IPv6-in-IPv4)
        RX packets 0 bytes 0 (0.0 B)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 77 bytes 7128 (7.1 KB)
        TX errors 22 dropped 0 overruns 0 carrier 22 collisions 0

lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
        inet 127.0.0.1 netmask 255.0.0.0
        inet6 ::1 prefixlen 128 scopeid 0x10<host>
        loop txqueuelen 1000 (Local Loopback)
        RX packets 807 bytes 61288 (61.2 KB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 807 bytes 61288 (61.2 KB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

ubuntu@ubuntu:~$ wget http://ftp.antik.sk/ubuntu-releases/eoan/ubuntu-19.10-live-server-amd64.iso
--2020-02-24 09:38:24-- http://ftp.antik.sk/ubuntu-releases/eoan/ubuntu-19.10-live-server-amd64.iso
Resolving ftp.antik.sk (ftp.antik.sk)... 2a02:130:9900:30::12, 88.212.10.12
Connecting to ftp.antik.sk (ftp.antik.sk)|2a02:130:9900:30::12|:80... failed: Connection timed out.
Connecting to ftp.antik.sk (ftp.antik.sk)|88.212.10.12|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 883949568 (843M) [application/x-iso9660-image]
Saving to: ‘ubuntu-19.10-live-server-amd64.iso’

9.10-live-server-am 0%[ ] 2.51K --.-KB/s eta 262d ^C

Revision history for this message
mlx (myxal-mxl) wrote :

Not quite.
- for downloads, as the issue happens when using IPv6 tunnel, so I'd use wget -6 ...
- for youtube - this is playing on another machine (Mac running Chromium browser). I mentioned it because it's significant amount of traffic which, in presence of IPv6 connectivity, will (mostly) go through the tunnel.

Your IPv6 addresses look misconfigured, as both he-ipv6 and br0 have addresses from the same /64 prefix.
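
The overlap can be confirmed with Python's standard `ipaddress` module. This is a small sketch using the two global addresses from the ifconfig output above; the function name is a hypothetical helper.

```python
import ipaddress


def same_prefix(addr_a, addr_b, prefixlen=64):
    """True if both addresses fall into the same network of the given length."""
    net_a = ipaddress.ip_interface(f"{addr_a}/{prefixlen}").network
    net_b = ipaddress.ip_interface(f"{addr_b}/{prefixlen}").network
    return net_a == net_b


# br0 and he-ipv6 global addresses from the ifconfig output above.
print(same_prefix("2001:470:35:3b1::2", "2001:470:35:3b1::1"))  # True
```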

Revision history for this message
Hui Wang (hui.wang) wrote :

So both your RPI3B+ and your host mac machine setup the IPv6 tunnel, and they share the same tunnel from tunnelbroker.net.

And this is my example configs and my two yamls:

https://pastebin.ubuntu.com/p/TDCDMrtzCb/

anything is wrong?

Revision history for this message
mlx (myxal-mxl) wrote :

@hui.wang: Only the RPi is acting as the IPv6 gateway for the network. It sends router advertisements to the internal LAN on br0, issued by dnsmasq; the relevant dnsmasq config is here: https://github.com/myxal/DigitalHome/blob/master/configurations/dnsmasq/dnsmasq.d/Echolife6.conf

The mac is unaware of the tunnel, it autoconfigures itself according to the router advertisements, and RPi forwards the packets according to normal forwarding rules. But we're getting off-topic here.

Regarding your config - the internal (br0) and external (he-ipv6) should have addresses from different prefixes. I'm not sure which of the settings is wrong, but seeing as you are setting the gateway to what looks like a correct address, I'm guessing your internal address is wrong. On tunnelbroker, this is listed as:

"Client IPv6 address" - this is the external address, and must be set exactly as indicated by tunnelbroker
"Server IPv6 address" - this is the gateway - seems correctly entered in the config
"Routed /64" - this is the prefix from which you'll pick an address for the internal (br0) interface. I picked ...::1 because it's easy to enter and remember.

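The mapping from the three tunnelbroker fields to interface addresses can be sketched like this. All addresses below come from the IPv6 documentation range 2001:db8::/32 and are placeholders, not real tunnel values; the function and dictionary keys are hypothetical names chosen for illustration.

```python
import ipaddress


def plan_addresses(client_addr, server_addr, routed_prefix):
    """Derive per-interface settings from the three tunnelbroker fields.

    client_addr   -> address for the external tunnel interface (he-ipv6)
    server_addr   -> IPv6 gateway
    routed_prefix -> prefix for the internal interface (br0); ::1 is picked
    """
    client = ipaddress.ip_address(client_addr)
    server = ipaddress.ip_address(server_addr)
    routed = ipaddress.ip_network(routed_prefix)
    # The internal prefix must differ from the one the tunnel endpoints use.
    if client in routed or server in routed:
        raise ValueError("tunnel endpoints must not come from the routed prefix")
    return {
        "he-ipv6": str(client),
        "gateway": str(server),
        "br0": f"{routed.network_address + 1}/{routed.prefixlen}",
    }


# Placeholder values from the documentation prefix, for illustration only.
print(plan_addresses("2001:db8:1::2", "2001:db8:1::1", "2001:db8:2::/64"))
```

Passing an internal prefix that contains the client address raises `ValueError`, which is the kind of misconfiguration described above.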
Revision history for this message
mlx (myxal-mxl) wrote :

re: dnsmasq - whoops, private repo. Here's another link: https://paste.ubuntu.com/p/TCwQBkg9vS/

Revision history for this message
Hui Wang (hui.wang) wrote :

So far, I couldn't set up the ipv6 on my rpi3B+ board, I need to do more study.

And I checked the eoan-proposed, the 1019 kernel is already there.

If this file is not already in the folder /etc/apt/sources.list.d/ on your board, you could put it in that folder and run sudo apt-get update

Then you could install 1019 kernel to do the test.

sudo apt install linux-image-(press tab), it will list 1019 kernel.

Revision history for this message
mlx (myxal-mxl) wrote :

Oh, I had the URL wrong (archive.ubuntu instead of ports...).

Installed .1019 kernel, headers, etc... issue persists with the new kernel - dmesg attached.

Revision history for this message
Hui Wang (hui.wang) wrote :

OK, I will continue studying how to setup ipv6.

And could you please find some easy ways to reproduce this problem (like without ipv6), if you could, it will be easier to debug and we could report this issue to raspberry.org too.

thx.

Revision history for this message
mlx (myxal-mxl) wrote :

Sorry for the delayed response. I moved my network back to RPi1 so I can do experiments on the affected RPi3.

I'm thinking, if this is indeed caused by the usage of a 6in4 tunnel, and you're unable to set up one with HE.net because of unavailability/ISP restrictions, it should be possible to set up a 6in4 tunnel within your own network (obviously, another host would be needed for this). I'll try to provide a config/script/whatever that would do this.

Revision history for this message
mlx (myxal-mxl) wrote :

I made a pair of scripts - one sets up the tunnel, addresses on dummy interfaces, and routing. The other one cleans up.

Testing with iperf does NOT replicate the issue, however. Maybe someone else can find a test case that reliably triggers the bug. I have some ideas which I'll try soon.

https://gist.github.com/myxal/7a191e8a62a45f9187ae56cbe3c862b2

Revision history for this message
mlx (myxal-mxl) wrote :

Huh... even downloading over HE.net's tunnel doesn't kill it (within the 3 attempts I've tried) now. I have previously added a passive cooler to the USB/NIC chip as the RPi3 was down for this test.

I'll keep the RPi3 up as a caching DNS + IPv6 gateway (ie. as before, minus DHCPv4), point DNS queries to it, and see how it behaves for the next week.

Revision history for this message
mlx (myxal-mxl) wrote :

Never mind the LAN7515 heatsink. The issue was avoided by some combination of other missing configuration (I have since managed to trigger the issue):
- iptables-rules not loaded
- dnsmasq not running

Sometimes (in the full HE.net tunnel + v6 gateway configuration), the problem is triggered by an IPv4 download, even.

I'll keep tracking this down.

Revision history for this message
mlx (myxal-mxl) wrote :

Well, I only managed to replicate it on Ubuntu, and it requires many conditions to be met:
- 6in4 tunnel to HE.net
- ipv6 rules applied
- ipv6 forwarding enabled

Removing/replacing any one of these made downloading (wget -O /dev/null) reliable for at least 3 repeats of a large (~1GB+) downloads.
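
A repeat-download test of this kind could be automated roughly as follows. This is only a sketch: the URL is supplied by the caller, three attempts mirror the count mentioned above, the function names are hypothetical, and reading `dmesg` may require root on some systems.

```python
import subprocess


def wget_discard_cmd(url, ipv6_only=True):
    """Build a wget command that downloads url and throws the data away."""
    cmd = ["wget"]
    if ipv6_only:
        cmd.append("-6")  # force IPv6 so the traffic uses the tunnel
    cmd += ["-O", "/dev/null", url]
    return cmd


def kernel_log_has_tx_timeout(dmesg_text):
    """True if the kernel log contains the transmit queue timeout message."""
    return "transmit queue 0 timed out" in dmesg_text


def run_repro(url, attempts=3):
    """Download url up to `attempts` times; return the attempt that hit the bug.

    Returns None if no timeout appeared in the kernel log.
    """
    for attempt in range(1, attempts + 1):
        subprocess.run(wget_discard_cmd(url), check=False)
        dmesg = subprocess.run(
            ["dmesg"], capture_output=True, text=True, check=False
        ).stdout
        if kernel_log_has_tx_timeout(dmesg):
            return attempt
    return None
```

Calling `run_repro(<url of a large file>)` on the affected board would then report on which attempt, if any, the timeout showed up.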

I recreated the setup on Raspbian, and so far haven't had the issue.

summary: - Raspberry Pi 3 network dies shortly after a burst of SD card IO and
- network load ((lan78xx): transmit queue 0 timed out)
+ Raspberry Pi 3 network dies shortly after a burst of IPv6 tunnel network
+ load ((lan78xx): transmit queue 0 timed out)
mlx (myxal-mxl)
description: updated
Revision history for this message
Juerg Haefliger (juergh) wrote :

Can you try the Focal 5.4 kernel to see if the problem still exists there as well?

Add the following line to /etc/apt/sources.list and run (I believe) 'apt install linux-raspi2':
deb http://ports.ubuntu.com/ubuntu-ports focal-proposed universe

Revision history for this message
mlx (myxal-mxl) wrote :

@juergh - I'm using origin/priority/pinning, so I installed that kernel with

apt install linux-raspi2=5.4.0.1004.4 linux-image-raspi2=5.4.0.1004.4 linux-headers=5.4.0.1004.4 # (or something like that)

Now my RPi is stuck at the color palette screen. As there is no grub2 AFAICT, what do I need to do to switch back to the older kernel? (I'm pretty sure the kernel was not uninstalled.)

Revision history for this message
mlx (myxal-mxl) wrote :

For reference, recovered the installation with:

cd <1st partition, FAT>
for FILE in *.bak; do mv -i "${FILE%%.bak}" "${FILE%%.bak}.broken"; mv -i "${FILE}" "${FILE%%.bak}"; done

So yeah, kernel 5.4.0.1004.4 is unbootable.
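
The same rollback can be sketched in Python for anyone who wants a dry run on a copy of the partition first. The file names in the demo are hypothetical, and the function refuses to overwrite an existing `.broken` file, which is an added safety choice that stands in for the `mv -i` prompt.

```python
import tempfile
from pathlib import Path


def restore_backups(boot_dir):
    """For every X.bak: move X to X.broken, then move X.bak back to X."""
    restored = []
    for backup in sorted(Path(boot_dir).glob("*.bak")):
        current = backup.with_suffix("")  # e.g. vmlinuz.bak -> vmlinuz
        broken = current.with_name(current.name + ".broken")
        if broken.exists():
            raise FileExistsError(f"refusing to overwrite {broken}")
        if current.exists():
            current.rename(broken)
        backup.rename(current)
        restored.append(current.name)
    return restored


# Demo on a throwaway directory with hypothetical file names.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "vmlinuz").write_text("new kernel")
    Path(tmp, "vmlinuz.bak").write_text("old kernel")
    print(restore_backups(tmp))
```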

Revision history for this message
mlx (myxal-mxl) wrote :

Minor update - tried to check what's going on with 5.4 kernel by listening on the UART - there's no output. Setting standard priority on the focal-proposed repo, and reinstalling the linux-raspi2 package (along with its dependencies) from focal-proposed makes no difference.

Revision history for this message
Jesus Feliz Fernandez (janzun-w) wrote :

Hi all,

From my tests, it doesn't seem to be a problem related to IPv6, but rather to high load on the 3B+. In my case it's easy to reproduce if I use IP forwarding and connection-state matching on the 20.04 server Pi image, more or less from scratch: static IP, Pi-hole, tun0 and no IPv6.

root@pin:/home/janzun# uname -a
Linux pin 5.4.0-1008-raspi #8-Ubuntu SMP Wed Apr 8 11:13:06 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux

# sysctl -p
net.ipv6.conf.all.disable_ipv6 = 1
net.ipv6.conf.default.disable_ipv6 = 1
net.ipv6.conf.lo.disable_ipv6 = 1
net.ipv4.ip_forward = 1|0 -> it depends

First test:

# echo 0 > /proc/sys/net/ipv4/ip_forward
# iptables -F; iptables -X; for CHAIN in INPUT FORWARD OUTPUT; do iptables -P "$CHAIN" ACCEPT; done
# wget -O /dev/null https://*/10GB.bin ==> Some error in one of the attempts

Second test (forget the purpose of the rules):

# echo 1 > /proc/sys/net/ipv4/ip_forward
# iptables -A INPUT -p tcp -m state --state ESTABLISHED,RELATED -j ACCEPT
# wget -O /dev/null https://*/10GB.bin ==> Always stopped at about 1.5GB and crashed (same with -p udp).

Kernel log:

Apr 25 18:11:21 localhost kernel: [22872.242430] ------------[ cut here ]------------
Apr 25 18:11:21 localhost kernel: [22872.242515] NETDEV WATCHDOG: eth0 (lan78xx): transmit queue 0 timed out
Apr 25 18:11:21 localhost kernel: [22872.242716] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:447 dev_watchdog+0x324/0x330
Apr 25 18:11:21 localhost kernel: [22872.242733] Modules linked in: 8021q garp mrp stp llc xt_multiport xt_state xt_conntrack xt_tcpudp iptab
Apr 25 18:11:21 localhost kernel: [22872.242956] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G C E 5.4.0-1008-raspi #8-Ubuntu
Apr 25 18:11:21 localhost kernel: [22872.242962] Hardware name: Raspberry Pi 3 Model B Plus Rev 1.3 (DT)
Apr 25 18:11:21 localhost kernel: [22872.242972] pstate: 60400005 (nZCv daif +PAN -UAO)
Apr 25 18:11:21 localhost kernel: [22872.242985] pc : dev_watchdog+0x324/0x330
Apr 25 18:11:21 localhost kernel: [22872.242995] lr : dev_watchdog+0x324/0x330
Apr 25 18:11:21 localhost kernel: [22872.243000] sp : ffff80001000bd60
Apr 25 18:11:21 localhost kernel: [22872.243005] x29: ffff80001000bd60 x28: 0000000000000140
Apr 25 18:11:21 localhost kernel: [22872.243014] x27: 00000000ffffffff x26: 0000000000000000
Apr 25 18:11:21 localhost kernel: [22872.243023] x25: ffff00002f650000 x24: ffffa8d07f545018
Apr 25 18:11:21 localhost kernel: [22872.243031] x23: 0000000000000000 x22: 0000000000000001
Apr 25 18:11:21 localhost kernel: [22872.243039] x21: ffff00002f650480 x20: ffffa8d07fa07000
Apr 25 18:11:21 localhost kernel: [22872.243047] x19: 0000000000000000 x18: 0000000000000000
Apr 25 18:11:21 localhost kernel: [22872.243055] x17: ffff800010876378 x16: 0000000000000000
Apr 25 18:11:21 localhost kernel: [22872.243063] x15: ffff000039234090 x14: ffffffffffffffff
Apr 25 18:11:21 localhost kernel: [22872.243072] x13: 0000000000000000 x12: ffffa8d07fb3e000
Apr 25 18:11:21 localhost kernel: [22872.243081] x11: ffffa8d07fa2c000 x10: 0000000000000000
Apr 25 18:11:21 localhost kernel: [22872.243088] x9 : 0000000000000004 x8 : 00000000000001cb
Apr 25 18:11:21 localhost kernel: [22872.243096] x7 : 0000000000000000 x6 : 0000000000000001
Apr ...
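
To check quickly whether a log contains the same watchdog event, the relevant line can be pulled out with sed. This is a minimal sketch that assumes the message format shown in the traces in this report; the function name is made up for illustration.

```sh
# Print "<interface> <driver> <queue>" for each TX watchdog timeout on stdin.
# Assumes the "NETDEV WATCHDOG: ... transmit queue N timed out" wording above.
list_tx_timeouts() {
    sed -n 's/.*NETDEV WATCHDOG: \([^ ]*\) (\([^)]*\)): transmit queue \([0-9]*\) timed out.*/\1 \2 \3/p'
}
```

For example, 'list_tx_timeouts < /var/log/kern.log' prints one line per occurrence and nothing if the watchdog never fired.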


Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-raspi2 (Ubuntu Eoan):
status: New → Confirmed
Revision history for this message
mlx (myxal-mxl) wrote :

Still observed with 5.3.0-1023 on eoan. Will be switching to focal, seeing as it should now boot.

Revision history for this message
Matt Elek Harris (mattelekharris) wrote :

I can confirm having the same issue, including on multiple raspberry pi 3s, one with an IPv6 tunnel from tunnelbroker.net, and one without. It can be triggered by a lot of IPv4 traffic for me as well.

Revision history for this message
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-raspi2 (Ubuntu Focal):
status: New → Confirmed
Revision history for this message
Hui Wang (hui.wang) wrote :

@mlx and Matt,

We have an ubuntu-5.4 kernel for raspi (20.04). Could you please test whether this issue still happens with the 5.4 kernel?

Revision history for this message
mlx (myxal-mxl) wrote :

@hui.wang - yes, I have seen this a few times in Focal since upgrading about a month ago. I didn't look at the logs closely, but the message that appears repeatedly after the interface fails is "eth0: kevent 0 may have been dropped" I think, and bug #1647397 may be related/identical.

A user in that bug ascribes the cause to the use of a bridge, which is also my case. @Matt, are you also running the interface in a bridge?

Revision history for this message
Seamus Ryan (seamooose) wrote :

I can confirm I am also seeing a very similar set of circumstances:

-Hardware name: Raspberry Pi 3 Model B Plus Rev 1.3 (DT)
-5.4.0-1015-raspi #15-Ubuntu SMP Fri Jul 10 05:34:24 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux
-The "tunnel" in my case is using Wireguard to a server hosted externally.
-Wireless continues to work
-eth0 drops (can get arp but not packets)
-Reboot is the only fix

-Happens randomly, haven't been able to pinpoint the actual cause in my case

Revision history for this message
Seamus Ryan (seamooose) wrote :

FWIW, I actually thought this was a bug in the latest Raspbian image and decided to move to 20.04 on my Raspberry Pis; it turns out that wasn't the case.

Revision history for this message
Seamus Ryan (seamooose) wrote :

Forgot to include:

Jul 28 18:18:25 dns1 kernel: [ 1289.970276] ------------[ cut here ]------------
Jul 28 18:18:25 dns1 kernel: [ 1289.970354] NETDEV WATCHDOG: eth0 (lan78xx): transmit queue 0 timed out
Jul 28 18:18:25 dns1 kernel: [ 1289.970463] WARNING: CPU: 1 PID: 0 at net/sched/sch_generic.c:447 dev_watchdog+0x324/0x330
Jul 28 18:18:25 dns1 kernel: [ 1289.970467] Modules linked in: wireguard ip6_udp_tunnel udp_tunnel 8021q garp mrp stp llc dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua btsdio bluetooth ecdh_generic ecc brcmfmac brcmutil cfg80211 bcm2835_v4l2(CE) bcm2835_codec(CE) bcm2835_isp(CE) bcm2835_mmal_vchiq(CE) snd_bcm2835(CE) v4l2_mem2mem videobuf2_vmalloc videobuf2_dma_contig snd_pcm videobuf2_memops videobuf2_v4l2 snd_timer videobuf2_common raspberrypi_hwmon snd videodev mc vc_sm_cma(CE) rpi_poe_fan uio_pdrv_genirq uio sch_fq_codel drm ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_ce spidev phy_generic aes_neon_bs aes_neon_blk crypto_simd cryptd
Jul 28 18:18:25 dns1 kernel: [ 1289.970634] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G C E 5.4.0-1015-raspi #15-Ubuntu
Jul 28 18:18:25 dns1 kernel: [ 1289.970639] Hardware name: Raspberry Pi 3 Model B Plus Rev 1.3 (DT)
Jul 28 18:18:25 dns1 kernel: [ 1289.970646] pstate: 60400005 (nZCv daif +PAN -UAO)
Jul 28 18:18:25 dns1 kernel: [ 1289.970655] pc : dev_watchdog+0x324/0x330
Jul 28 18:18:25 dns1 kernel: [ 1289.970662] lr : dev_watchdog+0x324/0x330
Jul 28 18:18:25 dns1 kernel: [ 1289.970666] sp : ffff80001000bd60
Jul 28 18:18:25 dns1 kernel: [ 1289.970669] x29: ffff80001000bd60 x28: 0000000000000140
Jul 28 18:18:25 dns1 kernel: [ 1289.970677] x27: 00000000ffffffff x26: 0000000000000000
Jul 28 18:18:25 dns1 kernel: [ 1289.970684] x25: ffff000034891000 x24: ffffa6c7d8749018
Jul 28 18:18:25 dns1 kernel: [ 1289.970691] x23: 0000000000000000 x22: 0000000000000001
Jul 28 18:18:25 dns1 kernel: [ 1289.970698] x21: ffff000034891480 x20: ffffa6c7d8c07000
Jul 28 18:18:25 dns1 kernel: [ 1289.970704] x19: 0000000000000000 x18: 0000000000000000
Jul 28 18:18:25 dns1 kernel: [ 1289.970711] x17: 0000000000000000 x16: 0000000000000000
Jul 28 18:18:25 dns1 kernel: [ 1289.970726] x15: ffff000035a35e50 x14: ffffffffffffffff
Jul 28 18:18:25 dns1 kernel: [ 1289.970733] x13: 0000000000000000 x12: ffffa6c7d8d3f000
Jul 28 18:18:25 dns1 kernel: [ 1289.970740] x11: ffffa6c7d8c2c000 x10: 0000000000000000
Jul 28 18:18:25 dns1 kernel: [ 1289.970746] x9 : 0000000000000004 x8 : 00000000000001b9
Jul 28 18:18:25 dns1 kernel: [ 1289.970752] x7 : 0000000000000000 x6 : 0000000000000001
Jul 28 18:18:25 dns1 kernel: [ 1289.970759] x5 : 0000000000000000 x4 : 0000000000000002
Jul 28 18:18:25 dns1 kernel: [ 1289.970765] x3 : ffffa6c7d8015790 x2 : 0000000000000040
Jul 28 18:18:25 dns1 kernel: [ 1289.970772] x1 : ba6e3f480831e100 x0 : 0000000000000000
Jul 28 18:18:25 dns1 kernel: [ 1289.970779] Call trace:
Jul 28 18:18:25 dns1 kernel: [ 1289.970787] dev_watchdog+0x324/0x330
Jul 28 18:18:25 dns1 kernel: [ 1289.970800] call_timer_fn+0x3c/0x178
Jul 28 18:18:25 dns1 kernel: [...


Revision history for this message
mlx (myxal-mxl) wrote :

I got myself the Ruideng TC66 USB tester and played a bit with it - it appears that after the network dies on Raspberry, the system's stable power consumption ends up higher than when it's functional and idling. Some stray endless loop, perhaps?

Revision history for this message
mlx (myxal-mxl) wrote :
Revision history for this message
Hui Wang (hui.wang) wrote :

Recently we found that CONFIG_PREEMPT is not enabled in the Ubuntu kernel, but it is enabled in the Pi OS kernel. We plan to enable CONFIG_PREEMPT in the Ubuntu kernel too, so please wait for that kernel and then test with PREEMPT enabled.

Or you could test the Pi OS kernel first. It looks like there is a 5.4 Pi OS kernel.

thx.

Revision history for this message
mlx (myxal-mxl) wrote :

@Hui, is there a guide/checklist for testing the Pi OS kernel on Ubuntu? I tried replacing the contents of the FAT partition, and the kernel did boot, but that's still missing the modules, and possibly something else..?

Revision history for this message
Hui Wang (hui.wang) wrote :

Please test this kernel, I enabled the preempt in the testing kernel.

https://people.canonical.com/~hwang4/rpi-preempt/

Revision history for this message
mlx (myxal-mxl) wrote :

Thanks. I hope the installation worked - I only have Focal now, and the headers package in that folder was asking for a dependency from Eoan which I don't have.

Still, I managed to boot the system, and unfortunately the behaviour is the same.

Aug 1 12:04:56 rpi3 kernel: [ 117.985005] ------------[ cut here ]------------
Aug 1 12:04:56 rpi3 kernel: [ 117.985053] NETDEV WATCHDOG: eth0 (lan78xx): transmit queue 0 timed out
Aug 1 12:04:56 rpi3 kernel: [ 117.985163] WARNING: CPU: 1 PID: 0 at /home/hwang4/work/mainline/build/eoan-rpi/ubuntu-eoan/net/sched/sch_generic.c:448 dev_watchdog+0x370/0x378
Aug 1 12:04:56 rpi3 kernel: [ 117.985168] Modules linked in: sit tunnel4 ip_tunnel bridge stp llc ip6table_filter ip6_tables xt_tcpudp xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua btsdio bluetooth ecdh_generic ecc brcmfmac brcmutil cfg80211 input_leds bcm2835_v4l2(CE) bcm2835_mmal_vchiq(CE) raspberrypi_hwmon vc_sm_cma(CE) raspberrypi_cpufreq v4l2_common videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common videodev mc uio_pdrv_genirq uio sch_fq_codel drm ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid sdhci_iproc spidev phy_generic crct10dif_ce fixed aes_neon_bs aes_neon_blk crypto_simd cryptd aes_arm64
Aug 1 12:04:56 rpi3 kernel: [ 117.985309] CPU: 1 PID: 0 Comm: swapper/1 Tainted: G C E 5.3.0-1030-raspi2 #32+testpreempt
Aug 1 12:04:56 rpi3 kernel: [ 117.985311] Hardware name: Raspberry Pi 3 Model B Plus Rev 1.3 (DT)
Aug 1 12:04:56 rpi3 kernel: [ 117.985316] pstate: 80400005 (Nzcv daif +PAN -UAO)
Aug 1 12:04:56 rpi3 kernel: [ 117.985320] pc : dev_watchdog+0x370/0x378
Aug 1 12:04:56 rpi3 kernel: [ 117.985324] lr : dev_watchdog+0x370/0x378
Aug 1 12:04:56 rpi3 kernel: [ 117.985326] sp : ffff00001000bd90
Aug 1 12:04:56 rpi3 kernel: [ 117.985329] x29: ffff00001000bd90 x28: ffff567789c1abf0
Aug 1 12:04:56 rpi3 kernel: [ 117.985334] x27: ffffeeb8788d4480 x26: 00000000ffffffff
Aug 1 12:04:56 rpi3 kernel: [ 117.985338] x25: 0000000000000140 x24: ffffeeb86f5c8480
Aug 1 12:04:56 rpi3 kernel: [ 117.985342] x23: ffff56778a807000 x22: ffffeeb8788d445c
Aug 1 12:04:56 rpi3 kernel: [ 117.985346] x21: ffffeeb8788d4000 x20: ffffeeb8788d4480
Aug 1 12:04:56 rpi3 kernel: [ 117.985351] x19: 0000000000000000 x18: ffffffffffffffff
Aug 1 12:04:56 rpi3 kernel: [ 117.985355] x17: 0000000000000000 x16: 0000000000000000
Aug 1 12:04:56 rpi3 kernel: [ 117.985359] x15: ffff56778a809708 x14: ffff56778a9418e0
Aug 1 12:04:56 rpi3 kernel: [ 117.985363] x13: ffff56778a941533 x12: ffff56778a82e000
Aug 1 12:04:56 rpi3 kernel: [ 117.985367] x11: 0000000000000000 x10: ffff56778a940000
Aug 1 12:04:56 rpi3 kernel: [ 117.985372] x9 : 0000000000000000 x8 : ffff56778a94b8ab
Aug 1 12:04:56 rpi3 kernel: [ 117.985376] x7 : 0000000000000000 x6 : 0000000000000002
Aug 1 12:04:56 rpi3 kernel: [ 117.985379] x5 : 0000000000000000 x4 : 0000000000000002
Aug 1 12:04:56 rpi3 kernel: [ 117.98...


Revision history for this message
Seamus Ryan (seamooose) wrote :

This has probably already been noticed, but in an attempt to simply get my PI working somewhat the way i wanted, i changed my tunnel to run over wlan0 instead of eth0 (they both connect back to the same network so doesn't make a huge difference).

Tunnel and interfaces have been up for days, no issues reported.

Revision history for this message
Brian Murray (brian-murray) wrote :

The Eoan Ermine has reached end of life, so this bug will not be fixed for that release.

Changed in linux-raspi2 (Ubuntu Eoan):
status: Confirmed → Won't Fix
Revision history for this message
Matt Elek Harris (mattelekharris) wrote :

@mlx I'm running eth0 in bridge, and also using a tunnel broker ipv6 tunnel like the original poster. How would I go about running the testing kernel?

Revision history for this message
mlx (myxal-mxl) wrote :

@Matt I didn't find any special way to easily select between kernels on raspberry - installing the linux-{headers,image}-<version>-raspi2 would overwrite whatever was on the FAT partition, making it the kernel loaded on next boot. To revert to previous kernel, its package would need to be reinstalled (apt install --reinstall <pkg>).

I did run into a problem where a tested kernel wouldn't boot, so I'd recommend making a backup of the FAT partition in advance so you can recover it outside of the Raspberry's OS session.
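
A backup of the boot files before installing a test kernel might look like the sketch below. It assumes the FAT partition is mounted at /boot/firmware, as on the Ubuntu Pi images; pass another path if yours differs. The function name and archive location are illustrative.

```sh
# Archive the contents of the boot (FAT) partition into a tar file.
# /boot/firmware is an assumed default mount point; DEST should live
# outside SRC so the archive does not try to include itself.
backup_boot_dir() {
    src=${1:-/boot/firmware}
    dest=$2
    [ -d "$src" ] || { echo "no such directory: $src" >&2; return 1; }
    [ -n "$dest" ] || { echo "usage: backup_boot_dir SRC DEST.tar" >&2; return 1; }
    tar -C "$src" -cf "$dest" .
}
```

The resulting archive can be copied off the Pi, or unpacked onto the SD card's FAT partition from another machine if a test kernel fails to boot.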

Revision history for this message
Christopher Yates (scubachristopher) wrote :

RPI 3B+. My configuration is as follows:
Raspbian (buster)
5.4.51-v7+#1333 SMP Mon Aug 10 16:45:19 BST 2020

wlan0 AP, no bridge, fixed IP.
wlan1 USB Wifi adapter
ppp0 Hologram Nova Modem

My application is in the field, so Wifi HotSpots can be unreliable (I can get an IP but can't connect). So I watch to see if I can get to the internet through wlan1 and if I cannot, I switch to cellular. Once on cellular, I look to see if I can get to the Internet via wlan1 (cheaper / faster) by ifconfig wlan1 up.

When I do connect, I am sending a lot of data. Hard to pin what actions are triggering the NETDEV WATCHDOG backtrace, but it leaves wlan1 in a state where it cannot come up again:

SIOCSIFFLAGS: No such device

ip address shows wlan1: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN group default qlen 1000

vcgencmd shows:
Aug 6 2020 16:24:09
version af3...d5b337 (clean) (release) (start)

Not clear this is the same issue, but I thought I'd report as a use case.
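
The failover logic described in this comment (prefer wlan1, fall back to cellular) might look roughly like the following. This is not the poster's actual script: the probe host default, the function names and the PROBE override are assumptions for illustration, and only the interface names come from the comment.

```sh
# Probe connectivity through a given interface with a single ping.
probe() { ping -I "$1" -c 1 -W 2 "$2" >/dev/null 2>&1; }

# Print the uplink to use: wlan1 if it can reach the host, otherwise ppp0.
# PROBE is a hypothetical override so the decision can be tested offline;
# the default host 1.1.1.1 is an arbitrary placeholder.
choose_uplink() {
    host=${1:-1.1.1.1}
    if "${PROBE:-probe}" wlan1 "$host"; then
        echo wlan1
    else
        echo ppp0
    fi
}
```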

Revision history for this message
Hamish McIntyre-Bhatty (hamishmb) wrote :

I'm having this issue as well, but on Raspbian 10. Symptoms seem the same, and it generally occurs with high load. However, I don't seem to need IPv6 packets to cause this - my network is IPv4 only.

I thought I had a power problem because I am also using an external HDD (it's also working as a NAS), but with the powered hub things aren't any better.

Revision history for this message
Hamish McIntyre-Bhatty (hamishmb) wrote :

NB: On Linux 4.9. I'll post here again if it happens on the new kernel (5.4).

Revision history for this message
Christopher Yates (scubachristopher) wrote :

@hamishmb -- I solved this by reverting to a prior fw release:

sudo rpi-update 86b202d127ca3d413d0779d870cce2169afaacd7

Zero issues since. JFYI

Revision history for this message
Hamish McIntyre-Bhatty (hamishmb) wrote :

Also happens on Linux 5.4 on Raspbian 10.

Okay, I'll keep that in mind. At the moment I'm trying a different overlay (driver) for the USB chip to see if that helps (not sure yet, though it hasn't helped with the HDD).

Revision history for this message
Hamish McIntyre-Bhatty (hamishmb) wrote :

@scubachristopher what kernel version did that put you on?

Revision history for this message
Christopher Yates (scubachristopher) wrote :
Revision history for this message
Hui Wang (hui.wang) wrote :

@Christopher,

That is a devicetree, kernel and bootloader binary update. Do you know which source-level change fixes this issue?

Revision history for this message
Christopher Yates (scubachristopher) wrote :

No idea. I didn't march forward on the commits until it breaks, and I don't know when the bug occurs or I'd scream about it.

What I do know is this old version is stable and I've had demonstrably no problems with it.

Revision history for this message
Seamus Ryan (seamooose) wrote :

I *may* be jumping the gun here, but having just installed 5.4.0-1019-raspi this morning, things appear to be.... good!

Did an upgrade ~ 30 minutes ago (usually i find the issue is triggered after around 10 minutes or so).

Will update if/when it crashes.

Revision history for this message
Seamus Ryan (seamooose) wrote :

Cancel that. As expected, issue still present.

Revision history for this message
Christopher Yates (scubachristopher) wrote :

Bummers.

Revision history for this message
Tim Brand (t-brand-h) wrote :

I'm having the exact same issue on my RPi 3. I've been using the Pi for a very long time now and never had an issue. But last week I updated the OS to the latest Raspberry OS and the Pi is now running as a K3s master node. Since then the Pi keeps crashing like this. I'm having a lot of network traffic, way more than it had before it was used as a K3s master node.

About 2 days ago I tried the rpi-update which Christopher provided, and since then the Pi is running without any crashes (yet). Almost 48 hours already, where before it wasn't able to run 24 hours straight.

Revision history for this message
Hamish McIntyre-Bhatty (hamishmb) wrote :

Have you tried the new overlay (dwc2)? It worked really well for me, thread and instructions here: https://github.com/raspberrypi/linux/issues/3843

Revision history for this message
mlx (myxal-mxl) wrote :

FWIW, I tried the dwc2 overlay again with kernel 5.4.0-1021-raspi - as before in #21 this changes nothing - network still dies during first major (>100MB) download, while other USB ports continue working.

On the plus side, the Pi can now actually boot with an external NIC plugged in. I'm not sure if the issue I had (boot getting stuck with external USB Ethernet adapter) got fixed or a hardware change I made (Apple USB Ethernet adapter -> generic RTL8153-based adapter) is responsible.

Revision history for this message
Juerg Haefliger (juergh) wrote :

Ok, I'm utterly confused about the reported issue. Simple steps to reproduce, please?

Revision history for this message
Christopher Yates (scubachristopher) wrote :

@Juerg:

My config for my network setup, which is an AP via wlan0, not bridged and station via USB Wifi dongle:

-------------------------------------------------
#!/bin/bash

# Error management
set -o pipefail
set -o nounset
set -x

apt install -y hostapd dnsmasq
systemctl unmask hostapd
systemctl enable hostapd

systemctl stop hostapd
systemctl stop dnsmasq

echo -e "interface wlan0\n\tstatic ip_address=192.168.10.1/24\n\tnohook wpa_supplicant\n" >> /etc/dhcpcd.conf

systemctl restart dhcpcd
cp /home/pi/hostapd.conf /etc/hostapd/
sed -i 's/^#DAEMON_CONF=.*$/DAEMON_CONF="\/etc\/hostapd\/hostapd.conf"/' /etc/default/hostapd

mv /etc/dnsmasq.conf /etc/dnsmasq.conf.orig
cp /home/pi/dnsmasq.conf /etc/

systemctl unmask hostapd
systemctl enable hostapd
systemctl start hostapd
service dnsmasq start

-------------------------------------------------

Unlike the top post, I'm not using IPV6 so that is not a factor. However, the kernel
log --- CUT HERE --- ... NETDEV WATCHDOG occurs for me.

As I state above in my original post:

-------------------------------------------------
When I do connect, I am sending a lot of data. Hard to pin what actions are triggering the NETDEV WATCHDOG backtrace, but it leaves wlan1 in a state where it cannot come up again:

SIOCSIFFLAGS: No such device

ip address shows wlan1: <BROADCAST,MULTICAST> mtu 1500 qdisc mq state DOWN group default qlen 1000

vcgencmd shows:
Aug 6 2020 16:24:09
version af3...d5b337 (clean) (release) (start)
-------------------------------------------

Seems to me a driver issue when there is a flood of traffic. The older version
of firmware I posted above solves the problem.

Revision history for this message
mlx (myxal-mxl) wrote :

@Christopher: if it's a driver problem, wouldn't your issue with WLAN be distinct from mine, where I'm seeing the wired Ethernet break?

Revision history for this message
Christopher Yates (scubachristopher) wrote :

@mlx -- dunno. What I do know is reverting to prior firmware solves the issue for me and at least @Tim.

I had the same issue in May as there was another release. I reverted and it became stable.

This summer, I tried to sudo apt upgrade and hit the problem again. So I reverted and am stable.

Would like to get back to the tip for sure, but instability is instability. And it seems it's a shared problem across ETH0 and WLAN1 based on this thread.

Revision history for this message
Tim Brand (t-brand-h) wrote :

I want to add that in my case it's not related to WLAN. I'm not using WLAN on the Pi and have also completely disabled it in the config (I was hoping that would help, but it didn't). So the issue is related to eth0.

Revision history for this message
Juerg Haefliger (juergh) wrote :

I'm unable to reproduce this issue. I have not fiddled with IPv6 yet due to my lack of knowledge and (IMO) unclear instructions in previous comments. I've pushed data through a wireguard tunnel while loading the Pi (because this sounds similar: https://github.com/raspberrypi/linux/issues/3782 but I've just realized, that that issue is filed against 3B and not 3B+, duh!). I've also tried the instructions from comment #43 but my test completes multiple times just fine.

I'm afraid there's not much I can do without *clear* instructions on how to reproduce this in an *isolated* environment using an *Ubuntu* image.

Revision history for this message
Hui Wang (hui.wang) wrote :

Oh, the board I have is also a 3B+, maybe that is the reason I can't reproduce the issue.

the silkscreen on my board is:

Raspberry Pi 3 Model B+
@ Raspberry Pi 2017.

Revision history for this message
mlx (myxal-mxl) wrote :

I only have the 3B+ board - the issue was encountered, and is still reproducible on that board. OS has always been Ubuntu.

I'll try to reduce the repro steps/conditions when I have time; my setup (v6 tunnel + v6 router + ip6tables + bridge) reproduces the issue immediately, so hopefully determining what contributes to the issue and what doesn't shouldn't be too hard.
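
While narrowing down the conditions, it can help to record which of the suspected pieces are active on each test run. The sketch below checks IPv6 forwarding and the sit, bridge and ip6table_filter modules seen in the traces above. The function names are made up, the file arguments exist only so the checks can be pointed at test files, and a module compiled into the kernel would show up as "not loaded" here.

```sh
# Report which suspected contributing conditions are active.
# By default the real /proc entries are read.
forwarding_enabled() {
    file=${1:-/proc/sys/net/ipv6/conf/all/forwarding}
    [ "$(cat "$file" 2>/dev/null)" = "1" ]
}

module_loaded() {
    grep -q "^$1 " "${2:-/proc/modules}" 2>/dev/null
}

report_conditions() {
    fwd_file=$1
    mod_file=$2
    if forwarding_enabled "$fwd_file"; then
        echo "ipv6 forwarding: on"
    else
        echo "ipv6 forwarding: off"
    fi
    for mod in sit bridge ip6table_filter; do
        if module_loaded "$mod" "$mod_file"; then
            echo "$mod: loaded"
        else
            echo "$mod: not loaded"
        fi
    done
}
```

Running 'report_conditions' with no arguments before each download attempt gives a compact record to paste alongside the result.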

Revision history for this message
Seamus Ryan (seamooose) wrote :

I am also on a 3B+

ubuntu@dns1:~$ cat /proc/device-tree/model
Raspberry Pi 3 Model B Plus Rev 1.3

I don't believe you need anything fancy with IPv6; I'm not tunneling any v6 and it's occurring for me.
I do have IPv6 enabled on the interface used for tunneling.

Hui Wang, I'm happy to provide a shell with remote access for you to replicate and troubleshoot if required.

Revision history for this message
Juerg Haefliger (juergh) wrote :

@seamooose So you're just setting up a wireguard tunnel and pushing data through it to make it fail? That's what I did, but in my case between two Pis on the same network. Works just fine for me. Can you share some of your relevant (non-standard) configs?

Revision history for this message
Seamus Ryan (seamooose) wrote :

Absolutely!

----------
My Raspberry Pi sits inside my home network, its config:

eth0:
IP: 192.168.200.11
GW: 192.168.200.1
IPv6: enabled (prefix delegation from ISP)
Metric: 300

wlan0:
IP: 192.168.209.11
GW: 192.168.209.1
IPv6: enabled (prefix delegation from ISP)
Metric: 400

wg0: flags=209<UP,POINTOPOINT,RUNNING,NOARP> mtu 1420
        inet 10.241.0.1 netmask 255.255.255.0 destination 10.241.0.1
        unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 1000 (UNSPEC)
        RX packets 37100 bytes 44690744 (44.6 MB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 17505 bytes 4748492 (4.7 MB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

Wireguard config:
root@dns1:~# cat /etc/wireguard/wg0.conf
[Interface]
## This Desktop/client's private key ##
PrivateKey = <removed>

## Client ip address ##
Address = 10.241.0.1/24

[Peer]
## Ubuntu 20.04 server public key ##
PublicKey = <removed>

## set ACL ##
AllowedIPs = 10.241.0.0/24

## Your Ubuntu 20.04 LTS server's public IPv4/IPv6 address and port ##
Endpoint = <My VPS Public IPv4 IP>:51820

## Keep connection alive ##
PersistentKeepalive = 15

ubuntu@dns1:~$ sudo wg
interface: wg0
  public key: <removed>
  private key: (hidden)
  listening port: 53514

peer: <removed>
  endpoint: <My VPS Public IPv4 IP>:51820
  allowed ips: 10.241.0.0/24
  latest handshake: 33 seconds ago
  transfer: 42.68 MiB received, 4.79 MiB sent
  persistent keepalive: every 15 seconds
ubuntu@dns1:~$

----------

My VPS hosted externally:
eth0: DHCP from VPS provider with IPv4/6(public IP, ie not NAT)

wg0: flags=209<UP,POINTOPOINT,RUNNING,NOARP> mtu 1420
        inet 10.241.0.3 netmask 255.255.255.0 destination 10.241.0.3
        unspec 00-00-00-00-00-00-00-00-00-00-00-00-00-00-00-00 txqueuelen 1000 (UNSPEC)
        RX packets 16739 bytes 4405736 (4.4 MB)
        RX errors 0 dropped 0 overruns 0 frame 0
        TX packets 36539 bytes 44736120 (44.7 MB)
        TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0

root@m21:~# cat /etc/wireguard/wg0.conf
[Interface]
Address = 10.241.0.3/24
SaveConfig = true
ListenPort = 51820
FwMark = 0xca6c
PrivateKey = <removed>

[Peer]
PublicKey = <removed>
AllowedIPs = 10.241.0.1/32, 192.168.200.0/24, 192.168.209.0/24, 192.168.210.0/24, 192.168.211.0/24, 192.168.212.0/24
Endpoint = <My home public IPv4 IP>:33962
root@m21:~#

root@m21:~$ sudo wg
interface: wg0
  public key: <removed>
  private key: (hidden)
  listening port: 51820
  fwmark: 0xca6c

peer: <removed>
  endpoint: <My home public IPv4 IP>:53514
  allowed ips: 10.241.0.1/32, 192.168.200.0/24, 192.168.209.0/24, 192.168.210.0/24, 192.168.211.0/24, 192.168.212.0/24
  latest handshake: 1 minute, 36 seconds ago
  transfer: 5.33 MiB received, 42.91 MiB sent
root@m21:~$

Revision history for this message
Seamus Ryan (seamooose) wrote :

Forgot to include:
Raspberry PI:
ubuntu@dns1:~$ uname -a
Linux dns1 5.4.0-1022-raspi #25-Ubuntu SMP PREEMPT Thu Oct 15 13:31:49 UTC 2020 aarch64 aarch64 aarch64 GNU/Linux
ubuntu@dns1:~$

VPS:
root@m21:~# uname -a
Linux m21 5.4.0-52-generic #57-Ubuntu SMP Thu Oct 15 10:57:00 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
root@m21:~#

Revision history for this message
Seamus Ryan (seamooose) wrote :

And the most recent crash:

Oct 22 11:44:51 dns1 kernel: [ 3096.793118] ------------[ cut here ]------------
Oct 22 11:44:51 dns1 kernel: [ 3096.793183] NETDEV WATCHDOG: eth0 (lan78xx): transmit queue 0 timed out
Oct 22 11:44:51 dns1 kernel: [ 3096.793282] WARNING: CPU: 2 PID: 0 at net/sched/sch_generic.c:447 dev_watchdog+0x370/0x378
Oct 22 11:44:51 dns1 kernel: [ 3096.793288] Modules linked in: sctp wireguard ip6_udp_tunnel udp_tunnel 8021q garp mrp stp llc dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua btsdio bluetooth ecdh_generic ecc brcmfmac brcmutil cfg80211 bcm2835_v4l2(CE) bcm2835_isp(CE) bcm2835_codec(CE) v4l2_mem2mem bcm2835_mmal_vchiq(CE) videobuf2_vmalloc videobuf2_dma_contig videobuf2_memops snd_bcm2835(CE) videobuf2_v4l2 videobuf2_common snd_pcm raspberrypi_hwmon videodev snd_timer snd mc vc_sm_cma(CE) rpi_poe_fan uio_pdrv_genirq uio sch_fq_codel drm ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor xor_neon raid6_pq libcrc32c raid1 raid0 multipath linear crct10dif_ce spidev phy_generic aes_neon_bs aes_neon_blk crypto_simd cryptd
Oct 22 11:44:51 dns1 kernel: [ 3096.793477] CPU: 2 PID: 0 Comm: swapper/2 Tainted: G C E 5.4.0-1022-raspi #25-Ubuntu
Oct 22 11:44:51 dns1 kernel: [ 3096.793482] Hardware name: Raspberry Pi 3 Model B Plus Rev 1.3 (DT)
Oct 22 11:44:51 dns1 kernel: [ 3096.793491] pstate: 60400005 (nZCv daif +PAN -UAO)
Oct 22 11:44:51 dns1 kernel: [ 3096.793499] pc : dev_watchdog+0x370/0x378
Oct 22 11:44:51 dns1 kernel: [ 3096.793507] lr : dev_watchdog+0x370/0x378
Oct 22 11:44:51 dns1 kernel: [ 3096.793512] sp : ffff800010013d80
Oct 22 11:44:51 dns1 kernel: [ 3096.793517] x29: ffff800010013d80 x28: ffff0000363c2380
Oct 22 11:44:51 dns1 kernel: [ 3096.793527] x27: 00000000ffffffff x26: ffff00002bf81280
Oct 22 11:44:51 dns1 kernel: [ 3096.793536] x25: ffffa28a3750a018 x24: ffff00002bf81340
Oct 22 11:44:51 dns1 kernel: [ 3096.793546] x23: ffff00003535245c x22: ffff000035352000
Oct 22 11:44:51 dns1 kernel: [ 3096.793555] x21: ffff000035352480 x20: ffffa28a37807000
Oct 22 11:44:51 dns1 kernel: [ 3096.793564] x19: 0000000000000000 x18: 0000000000000000
Oct 22 11:44:51 dns1 kernel: [ 3096.793573] x17: 0000000000000000 x16: 0000000000000000
Oct 22 11:44:51 dns1 kernel: [ 3096.793582] x15: ffff000035a340b0 x14: ffffffffffffffff
Oct 22 11:44:51 dns1 kernel: [ 3096.793591] x13: 0000000000000000 x12: ffffa28a3793f000
Oct 22 11:44:51 dns1 kernel: [ 3096.793600] x11: ffffa28a3782c000 x10: ffffa28a3793fa80
Oct 22 11:44:51 dns1 kernel: [ 3096.793609] x9 : 0000000000000000 x8 : 0000000000000004
Oct 22 11:44:51 dns1 kernel: [ 3096.793617] x7 : 0000000000000000 x6 : 0000000000000000
Oct 22 11:44:51 dns1 kernel: [ 3096.793626] x5 : 0000000000000000 x4 : 0000000000000004
Oct 22 11:44:51 dns1 kernel: [ 3096.793634] x3 : ffffa28a36e15798 x2 : 0000000000000040
Oct 22 11:44:51 dns1 kernel: [ 3096.793643] x1 : 0000000000000000 x0 : 0000000000000000
Oct 22 11:44:51 dns1 kernel: [ 3096.793652] Call trace:
Oct 22 11:44:51 dns1 kernel: [ 3096.793661] dev_watchdog+0x370/0x378
Oct 22 11:44:51 dns1 kernel: [ 3096.793674] call_timer_fn+0x40/0x1e8
Oct 22 11:44:51 dn...


Revision history for this message
Juerg Haefliger (juergh) wrote :

@seamooose Thanks. That's pretty much the config that I had. I've now set up an external VM and created a wireguard tunnel from my Pi 3B+, but it still won't fail running iperf3. Is there a specific payload that you're transferring? Also, your IPv6 config, is that just the default or did you do some manual configuration? On my Pi:
$ ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether b8:27:eb:3e:ab:fb brd ff:ff:ff:ff:ff:ff
    inet 192.168.99.67/24 brd 192.168.99.255 scope global dynamic eth0
       valid_lft 860949sec preferred_lft 860949sec
    inet6 fe80::ba27:ebff:fe3e:abfb/64 scope link
       valid_lft forever preferred_lft forever
3: wlan0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether b8:27:eb:6b:fe:ae brd ff:ff:ff:ff:ff:ff
    inet 192.168.99.57/24 brd 192.168.99.255 scope global dynamic wlan0
       valid_lft 860949sec preferred_lft 860949sec
    inet6 fe80::ba27:ebff:fe6b:feae/64 scope link
       valid_lft forever preferred_lft forever
4: wg0: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1420 qdisc noqueue state UNKNOWN group default qlen 1000
    link/none
    inet 10.241.0.1/24 scope global wg0
       valid_lft forever preferred_lft forever

no longer affects: linux-raspi (Ubuntu Eoan)
Changed in linux-raspi (Ubuntu Focal):
status: New → Confirmed
no longer affects: linux-raspi2 (Ubuntu Focal)
Changed in linux-raspi (Ubuntu):
status: New → Incomplete
Seamus Ryan (seamooose) wrote :

Exactly what type of payload triggers the issue, I can't work out.

I know the traffic profile: the vast majority of it is monitoring-related, as my VPS is running Grafana/Influx and my internal clients are sending data over this tunnel using Telegraf.
There is also a fair bit of ICMP etc. traffic, so it is all "monitoring" related.

For a bit of clarity, my Pi essentially acts as an internal router to link to my VPS.
So clients send traffic originating from 192.168.x.x to 192.168.200.11 (internal static routes) to reach the 10.241.0.x network.

Will ping you offline to see if I can provide a level of access to troubleshoot in real time.

Seamus Ryan (seamooose) wrote :

RE IPv6, there is some manual config.

I have each VLAN/network assigned a /64. My Pis get an address in that /64 (different networks for eth0 and wlan0), but they have a static assignment as well (I should have included that earlier):

interface eth0
metric 300
static ip6_address=2403:XXXX:XXXX:XXX0::11/64
static ip6_routers=2403:XXXX:XXXX:XXX0::1
static ip_address=192.168.200.11/24
static routers=192.168.200.1

interface wlan0
metric 400
static ip6_address=2403:XXXX:XXXX:XXX3::11/64
static ip6_routers=2403:XXXX:XXXX:XXX3::1
static ip_address=192.168.209.11/24
static routers=192.168.209.1

Seamus Ryan (seamooose) wrote :

@juergh I may have (somewhat) found the trigger.

As previously noted, my internal clients have a static route to the remote WireGuard networks via my Raspberry Pi. This is defined in my internal router (Unifi),
i.e.:
client > router > static route (10.241.0.0/24) to pi eth0 IP > pi > tunnel > VPS

I have disabled this route, so now none of my clients can send data over the tunnel. Only the Pi, which is running the tunnel itself, is able to communicate over said tunnel.
No crashes at all.

So, with this in mind, set up an internal client with a static route to your WireGuard network and start sending traffic to the remote site. This should (hopefully) trigger the issue.
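
The routing decision described here can be illustrated with a small Python sketch. The route table is an assumption pieced together from the addresses mentioned in this thread (10.241.0.0/24 via the Pi's eth0 address 192.168.200.11); the default entry with the placeholder label "upstream" and the helper `next_hop` are hypothetical and not taken from any router's actual configuration or API.

```python
import ipaddress

# Hypothetical route table for the internal router, assembled from the
# addresses mentioned in this thread. "upstream" is a placeholder label.
ROUTES = [
    (ipaddress.ip_network("10.241.0.0/24"), "192.168.200.11"),  # via Pi eth0
    (ipaddress.ip_network("0.0.0.0/0"), "upstream"),  # default route
]


def next_hop(destination, routes=ROUTES):
    """Pick the next hop for an IPv4 destination by longest-prefix match."""
    addr = ipaddress.ip_address(destination)
    candidates = [(net, hop) for net, hop in routes if addr in net]
    # The most specific (longest prefix) matching network wins.
    _, hop = max(candidates, key=lambda item: item[0].prefixlen)
    return hop
```

With the static route in place, traffic for the 10.241.0.x network is handed to the Pi and forwarded through the tunnel; if that entry is removed, only the default entry matches and client traffic no longer passes through the Pi, which corresponds to the test with the route disabled.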

Christopher Yates (scubachristopher) wrote :

Ahhh. I do have 3 tunnels up as well. Good find!

Seamus Ryan (seamooose) wrote :

OK, yup, this is almost certainly the issue (when the Pi is tunnelling traffic for something other than itself).

My tunnel has been up for 12 hours, no crashes at all.

Seamus Ryan (seamooose) wrote :

Well, it took more than 24 hours for my Pi to crash again, but it did happen. I wasn't able to capture much of what triggered it as I wasn't home at the time.

Juerg Haefliger (juergh) wrote :

I was finally able to reproduce the problem by pushing data through the Pi, i.e. sender -> Pi -> wireguard tunnel -> receiver. Takes about 1.5 hrs to trigger the timeout.

Boris Prinz (bprinz) wrote :

I found a way to trigger the problem fast (within one minute).

On Raspberry Pi 3 B+:

* Enable IPv4 forwarding: sysctl net.ipv4.ip_forward=1
* Enable masquerading: iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
* Start a netcat listening process: sudo nc -l -p 1234 > /dev/null

On another machine:

* Configure the IPv4 address manually, setting the Raspberry Pi as gateway
* Open browser and load some web pages to test routing
* Send data to Raspberry Pi: nc RASPI_IP 1234 < /dev/zero
* Parallel browsing now leads to network crash on Raspberry Pi.

A few seconds later the kernel warning appears:

    NETDEV WATCHDOG: eth0 (lan78xx): transmit queue 0 timed out

Both machines are connected via ethernet to a 100MBit router.

I tested the following hardware:

* Raspberry Pi 2 Model B v1.1: no crash
* Raspberry Pi 3 Model B+: crash
* Raspberry Pi 4 Model B (4 GB): no crash

$ uname -a
Linux ubuntu 5.8.0-1010-raspi #13-Ubuntu SMP PREEMPT Wed Dec 9 17:19:55 UTC 2020 armv7l armv7l armv7l GNU/Linux
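
To check a captured kernel log for this failure after running the steps above, a minimal Python sketch could look like the following. The function name `find_tx_timeouts` and the regular expression are illustrative assumptions modelled on the warning text quoted in this report; they are not part of any kernel tooling, and the exact wording of the message may differ between kernel versions.

```python
import re

# Pattern modelled on the warning text quoted in this report; the exact
# wording may differ between kernel versions (assumption).
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_tx_timeouts(log_text):
    """Return (interface, driver, queue) for each watchdog timeout line."""
    hits = []
    for line in log_text.splitlines():
        match = WATCHDOG_RE.search(line)
        if match:
            hits.append((match["iface"], match["driver"], int(match["queue"])))
    return hits
```

The text passed in could be the contents of kern.log or saved `dmesg` output; an empty result only means that no matching line was found in that text.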

Jon (jgelsey) wrote :

I can deterministically reproduce this crash with WireGuard on the RPI3 running 20.10 and hence using the lan78xx driver on eth0. A few minutes of browsing public web sites on a client connected through WireGuard to the RPI3 crashes the RPI3. Changing nothing except for running WireGuard's forwarding through wlan0 rather than eth0 does not result in any crashes. So this certainly seems to indicate that the wired Ethernet's lan78xx driver shipping today with 20.10 is the culprit causing these crashes.

Juerg Haefliger (juergh) wrote :

Thanks for the added information. I'll take another crack at trying to reproduce the problem.

Juerg Haefliger (juergh) wrote :

I finally managed to reproduce this problem consistently and within 30 secs. Bisecting the kernel points at [1] as the problematic commit. Whether that commit introduces a real regression or merely exposes a different underlying issue is unclear at the moment. I wasn't able to reproduce the problem with a Realtek USB NIC attached to the Pi 3B+, so that points more towards a lan78xx driver/HW issue.

Reverting commit [1] plus two follow-on fixes makes the problem go away.

This looks similar to [2].

[1] https://github.com/torvalds/linux/commit/4f693b55c3d2d2239b8a0094b518a1e533cf75d5
[2] https://<email address hidden>/

Juerg Haefliger (juergh) wrote :

Upstream fix: https://<email address hidden>/

no longer affects: linux-raspi2 (Ubuntu Groovy)
Changed in linux-raspi (Ubuntu Groovy):
status: New → Confirmed
Hamish McIntyre-Bhatty (hamishmb) wrote :

Nice :)

Does anyone know if this is getting into Raspberry Pi OS?

Christopher Yates (scubachristopher) wrote :

+1 -- I'm stuck on an ancient version of the Pi OS because of this issue...

Juerg Haefliger (juergh)
description: updated
Changed in linux-raspi (Ubuntu Focal):
status: Confirmed → In Progress
Changed in linux-raspi (Ubuntu Groovy):
status: Confirmed → In Progress
Changed in linux-raspi (Ubuntu Focal):
status: In Progress → Fix Committed
Changed in linux-raspi (Ubuntu Groovy):
status: In Progress → Fix Committed
Hamish McIntyre-Bhatty (hamishmb) wrote :

Mentioned to Raspberry Pi people at: https://github.com/raspberrypi/linux/issues/3850

If anyone knows of a more current bug report for this issue, let me know.

Kleber Sacilotto de Souza (kleber-souza) wrote :

The upstream fix "b160c28548bc tcp: do not mess with cloned skbs in tcp_add_backlog()" has been applied to focal/linux as part of the fixes for bug 1915195 (Focal update: v5.4.93 upstream stable release). It will be included in the next focal/linux-raspi; however, this bug report will not be closed automatically.

Hamish McIntyre-Bhatty (hamishmb) wrote :

NB: Have been informed that this will trickle down from upstream into the 5.10 kernels for Raspberry Pi OS as well.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux-raspi - 5.8.0-1016.19

---------------
linux-raspi (5.8.0-1016.19) groovy; urgency=medium

  * groovy/linux-raspi: 5.8.0-1016.19 -proposed tracker (LP: #1914800)

  * Raspberry Pi 3 network dies shortly after a burst of IPv6 tunnel network
    load ((lan78xx): transmit queue 0 timed out) (LP: #1861936)
    - tcp: do not mess with cloned skbs in tcp_add_backlog()

  * Groovy update: upstream stable patchset 2021-01-13 (LP: #1911476)
    - [Config] raspi: updateconfigs for USB_SISUSBVGA_CON
    - [Config] raspi: updateconfigs for ZSMALLOC_PGTABLE_MAPPING

  * Groovy update: upstream stable patchset 2021-01-12 (LP: #1911235)
    - [Config] raspi: update config for INFINIBAND_VIRT_DMA

  [ Ubuntu: 5.8.0-44.50 ]

  * groovy/linux: 5.8.0-44.50 -proposed tracker (LP: #1914805)
  * Packaging resync (LP: #1786013)
    - update dkms package versions
    - update dkms package versions
  * Introduce the new NVIDIA 460-server series and update the 460 series
    (LP: #1913200)
    - [Config] dkms-versions -- drop NVIDIA 435 455 and 440-server
    - [Config] dkms-versions -- add the 460-server nvidia driver
  * [SRU][G/H/U/OEM-5.10] re-enable s0ix of e1000e (LP: #1910541)
    - Revert "UBUNTU: SAUCE: e1000e: bump up timeout to wait when ME un-configure
      ULP mode"
    - e1000e: Only run S0ix flows if shutdown succeeded
    - Revert "e1000e: disable s0ix entry and exit flows for ME systems"
    - e1000e: Export S0ix flags to ethtool
  * suspend only works once on ThinkPad X1 Carbon gen 7 (LP: #1865570) //
    [SRU][G/H/U/OEM-5.10] re-enable s0ix of e1000e (LP: #1910541)
    - e1000e: bump up timeout to wait when ME un-configures ULP mode
  * Cannot probe sata disk on sata controller behind VMD: ata1.00: failed to
    IDENTIFY (I/O error, err_mask=0x4) (LP: #1894778)
    - PCI: vmd: Offset Client VMD MSI-X vectors
  * Enable mute and micmute LED on HP EliteBook 850 G7 (LP: #1910102)
    - ALSA: hda/realtek: Enable mute and micmute LED on HP EliteBook 850 G7
  * SYNA30B4:00 06CB:CE09 Mouse on HP EliteBook 850 G7 not working at all
    (LP: #1908992)
    - HID: multitouch: Enable multi-input for Synaptics pointstick/touchpad device
  * HD Audio Device PCI ID for the Intel Cometlake-R platform (LP: #1912427)
    - SAUCE: ALSA: hda: Add Cometlake-R PCI ID
  * switch to an autogenerated nvidia series based core via dkms-versions
    (LP: #1912803)
    - [Packaging] nvidia -- use dkms-versions to define versions built
    - [Packaging] update-version-dkms -- maintain flags fields
    - [Config] dkms-versions -- add transitional/skip information for nvidia
      packages
  * udpgro.sh in net from ubuntu_kernel_selftests seems not reflecting sub-test
    result (LP: #1908499)
    - selftests: fix the return value for UDP GRO test
  * [UBUNTU 21.04] vfio: pass DMA availability information to userspace
    (LP: #1907421)
    - vfio/type1: Refactor vfio_iommu_type1_ioctl()
    - vfio iommu: Add dma available capability
  * qede: Kubernetes Internal DNS Failure due to QL41xxx NIC not supporting IPIP
    tx csum offload (LP: #1909062)
    - qede: fix offload for IPIP tunnel packets
  * Use DCPD to control HP DreamColor panel (...

Changed in linux-raspi (Ubuntu Groovy):
status: Fix Committed → Fix Released
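
To tell whether a Groovy system is already running a kernel with the fix, one could compare the ABI number reported by `uname -r` against the fixed package version quoted above (5.8.0-1016.19). The following Python sketch assumes that the `uname -r` string has the form `5.8.0-<abi>-raspi` and that `<abi>` matches the ABI number in the package version; `has_fix` is a hypothetical helper and only covers the 5.8.0 raspi series.

```python
import re

# ABI number of the Groovy linux-raspi package that carries the fix
# (5.8.0-1016.19, per the changelog in this report).
FIXED_BASE = (5, 8, 0)
FIXED_ABI = 1016


def has_fix(release):
    """Return True/False for 5.8.0 raspi kernels, None if unrecognised."""
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)-(\d+)-raspi", release)
    if not match:
        return None
    *base, abi = (int(part) for part in match.groups())
    if tuple(base) != FIXED_BASE:
        return None  # other series need their own fixed version
    return abi >= FIXED_ABI
```

For the kernel from the earlier reproduction report (`5.8.0-1010-raspi`), this returns `False`.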