14e4:1687 broadcom tg3 network driver disconnects under high load

Bug #1447664 reported by Toan
This bug affects 18 people
Affects: linux (Ubuntu)
Status: Fix Released
Importance: High
Assigned to: Unassigned

Bug Description

The tg3 Broadcom network driver that binds to the 5762 chipset goes offline and is unable to recover (even with the tg3 watchdog timeout) when network transmit is under high load. Call trace:
https://launchpadlibrarian.net/204185480/dmesg

When this happens, usually only a reboot fixes it. Sometimes, however, bringing the interface down and up again (via ifconfig) recovers networking. I've also tested with the latest tg3 driver (Dec 2014 version) and networking is still problematic. I have also disabled TSO, GSO, etc. with ethtool and the bug still surfaces. This bug may be related to the integrated firmware.

The issue is hard to replicate under moderate network load, so here is a procedure to replicate it.

1. Boot a machine with a Broadcom 5762 NIC (e.g. HP EliteDesk 705) using an Ubuntu/Kubuntu Live CD, 14.04 to 15.04.
2. From another machine: start 5 sessions and, in each session, repeatedly copy (scp with public key authentication) a 70 MB file back and forth to the tg3 machine. (Not sure if this step is necessary.)
3. Create a 1GB file on the tg3 machine, with something like dd if=/dev/urandom of=/my/test/file bs=1024 count=$((1024*1000))
4. From another machine: repeatedly scp that 1GB file from the tg3 machine. This can be done with something like:

while true; do
   scp -i /my/scp/private.key <email address hidden>:/my/test/file /tmp
done

Networking will usually go offline within about 10-30 minutes.
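
To see how many transfers the machine survives before the link drops, the copy loop in step 4 can be wrapped so that it stops at the first failed transfer. This is only a sketch: run_until_fail is a made-up helper name, and the host and paths in the usage comment are placeholders rather than values from this report. Note that scp can take a while to give up once the link is dead.

```shell
#!/bin/sh
# Run the given command repeatedly until it fails, then report how many
# runs succeeded. run_until_fail is a hypothetical helper name.
run_until_fail() {
    count=0
    while "$@"; do
        count=$((count + 1))
    done
    echo "command failed after $count successful runs"
}

# Usage (host and paths are placeholders):
#   run_until_fail scp -i /my/scp/private.key user@tg3-host:/my/test/file /tmp
```

Wrapping the call in `time` gives a rough figure for how long the NIC stayed up.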

WORKAROUND: Disable highdma with ethtool. To make the change permanent, add the following udev rule in /etc/udev/rules.d/80-tg3-fix.rules :
ACTION=="add", SUBSYSTEM=="net", ATTRS{vendor}=="0x14e4", ATTRS{device}=="0x1687", RUN+="/sbin/ethtool -K %k highdma off"

ProblemType: Bug
DistroRelease: Ubuntu 15.04
Package: linux-image-3.19.0-15-generic 3.19.0-15.15
ProcVersionSignature: Ubuntu 3.19.0-15.15-generic 3.19.3
Uname: Linux 3.19.0-15-generic x86_64
ApportVersion: 2.17.2-0ubuntu1
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC1: kubuntu 3748 F.... pulseaudio
 /dev/snd/controlC0: kubuntu 3748 F.... pulseaudio
CasperVersion: 1.360
Date: Thu Apr 23 11:16:24 2015
IwConfig:
 eth0 no wireless extensions.

 lo no wireless extensions.
LiveMediaBuild: Kubuntu 15.04 "Vivid Vervet" - Release amd64 (20150422)
MachineType: Hewlett-Packard HP EliteDesk 705 G1 MT
ProcEnviron:
 LANGUAGE=
 TERM=xterm
 PATH=(custom, no user)
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB: 0 radeondrmfb
ProcKernelCmdLine: BOOT_IMAGE=/casper/vmlinuz.efi file=/cdrom/preseed/hostname.seed boot=casper maybe-ubiquity quiet splash ---
PulseList:
 Error: command ['pacmd', 'list'] failed with exit code 1: Home directory not accessible: Permission denied
 No PulseAudio daemon running, or not running as session daemon.
RelatedPackageVersions:
 linux-restricted-modules-3.19.0-15-generic N/A
 linux-backports-modules-3.19.0-15-generic N/A
 linux-firmware 1.143
RfKill:

SourcePackage: linux
UdevLog: Error: [Errno 2] No such file or directory: '/var/log/udev'
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 10/22/2014
dmi.bios.vendor: Hewlett-Packard
dmi.bios.version: L06 v02.15
dmi.board.asset.tag: 2UA5041TG4
dmi.board.name: 2215
dmi.board.vendor: Hewlett-Packard
dmi.chassis.asset.tag: 2UA5041TG4
dmi.chassis.type: 6
dmi.chassis.vendor: Hewlett-Packard
dmi.modalias: dmi:bvnHewlett-Packard:bvrL06v02.15:bd10/22/2014:svnHewlett-Packard:pnHPEliteDesk705G1MT:pvr:rvnHewlett-Packard:rn2215:rvr:cvnHewlett-Packard:ct6:cvr:
dmi.product.name: HP EliteDesk 705 G1 MT
dmi.sys.vendor: Hewlett-Packard

Revision history for this message
Toan (tpham3783) wrote :
Revision history for this message
Brad Figg (brad-figg) wrote : Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status: New → Confirmed
Toan (tpham3783)
description: updated
description: updated
Revision history for this message
Joseph Salisbury (jsalisbury) wrote : Re: broadcom tg3 network driver disconnects under high load

Did this issue start happening after an update/upgrade? Was there a prior kernel version where you were not having this particular problem?

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v4.0 kernel[0].

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.0-vivid/

Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
Revision history for this message
Toan (tpham3783) wrote :

Joseph,

>Did this issue start happening after an update/upgrade?

No, I have always had this issue. I tested with multiple OSes and kernel versions: kernel 2.6.39 and three Ubuntu live CDs, 12.04, 14.04, and 15.04 (which was released today). I will, however, consider testing with kernel 4.x.

>Was there a prior kernel version where you were not having this particular problem?

No

Revision history for this message
Toan (tpham3783) wrote :

Please note, this bug is unrelated to Bug #1331513 because even with TSO, GSO, etc. disabled, I can still reproduce it. The lock-up only occurs under very high network load, so a typical user (web surfing only) would not catch it easily. On a side note, the machine I am testing is an HP EliteDesk 705 (DMI info below), which is officially certified hardware for Ubuntu.

System Information
        Manufacturer: Hewlett-Packard
        Product Name: HP EliteDesk 705 G1 MT
        Version:
        Serial Number: 2UA5041TG4
        UUID: E24D7A80-9AA4-11E4-8822-8A8247065164
        Wake-up Type: Power Switch
        SKU Number: K5U61UP#ABA
        Family: 103C_53307F G=D

Here is the state of the network interface when the tigon3 driver completely locked up. Attached file is the dmesg log.

eth0 Link encap:Ethernet HWaddr 64:51:06:47:82:8a
          UP BROADCAST MULTICAST MTU:1500 Metric:1
          RX packets:90235313784 errors:30064771065 dropped:7 overruns:0 frame:120259084260
          TX packets:90387363107 errors:30064771065 dropped:0 overruns:0 carrier:0
          collisions:30064771065 txqueuelen:1000
          RX bytes:32978848243 (32.9 GB) TX bytes:321345086545 (321.3 GB)
          Interrupt:18

PS: I just compiled the linux-stable 4.0 trunk; I will try to run it and report back soon.

penalvch (penalvch)
tags: added: latest-bios-2.15
tags: added: trusty
Revision history for this message
Toan (tpham3783) wrote :

Guys,

I've just confirmed that this bug exists in upstream kernel version 4.0. The attached file is the full kernel 4.0 log (from boot-up to the time the Broadcom driver crashes). We may have to report this bug to a Broadcom network driver/firmware developer. Thanks

Toan (tpham3783)
tags: added: bcm5762 broadcom kernel-bug-exists-upstream linux-4.0 lucid tg3 tigon
Po-Hsu Lin (cypressyew)
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
penalvch (penalvch) wrote :

Toan, the issue you are reporting is an upstream one. Could you please report this problem to the appropriate mailing list (netdev) by following the instructions verbatim at https://wiki.ubuntu.com/Bugs/Upstream/kernel ?

Please provide a direct URL to your e-mail to the mailing list once you have made it so that it may be tracked via http://vger.kernel.org/vger-lists.html . It can take a day for the new e-mail to show up in the respective archive.

Thank you for your understanding.

tags: added: kernel-bug-exists-upstream-4.0
removed: bcm5762 broadcom linux-4.0 tg3 tigon
Changed in linux (Ubuntu):
status: Confirmed → Triaged
summary: - broadcom tg3 network driver disconnects under high load
+ 14e4:1687 broadcom tg3 network driver disconnects under high load
Revision history for this message
Toan (tpham3783) wrote :

Here is the bug report email to netdev mailing list:

http://www.spinics.net/lists/netdev/msg326389.html

Revision history for this message
Lauri Võsandi (v6sa) wrote :

Hi, disabling highdma with ethtool seems to work around the issue. I've added the following udev rule to make the change permanent in /etc/udev/rules.d/80-tg3-fix.rules

ACTION=="add", SUBSYSTEM=="net", ATTRS{vendor}=="0x14e4", ATTRS{device}=="0x1687", RUN+="/sbin/ethtool -K %k highdma off"
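
For anyone who wants to try this without rebooting, here is a minimal sketch. It assumes ethtool is installed in /sbin (as in the rule above) and that you substitute your own interface name; tg3_rule is a hypothetical helper name used only for illustration.

```shell
#!/bin/sh
# Print the udev rule from this comment for a given PCI vendor/device ID.
# tg3_rule is a hypothetical helper name, not part of any package.
tg3_rule() {
    vendor="$1"
    device="$2"
    # %%k in the printf format becomes the literal %k that udev expands
    # to the kernel name of the network interface.
    printf 'ACTION=="add", SUBSYSTEM=="net", ATTRS{vendor}=="%s", ATTRS{device}=="%s", RUN+="/sbin/ethtool -K %%k highdma off"\n' "$vendor" "$device"
}

# Usage (as root; IFACE is an assumption, check `ip link` for the real name):
#   IFACE=eth0
#   ethtool -K "$IFACE" highdma off       # apply immediately
#   ethtool -k "$IFACE" | grep highdma    # check the current setting
#   tg3_rule 0x14e4 0x1687 > /etc/udev/rules.d/80-tg3-fix.rules
```

The udev rule only takes effect the next time the device is added, which is why the sketch also applies the setting directly with ethtool.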

Revision history for this message
Toan (tpham3783) wrote :

Thank you for your valuable finding. I'll test your suggestion in the next few days to confirm that it works.

I've also reported the work-around to Broadcom dev team and suggested a patch to the tg3 driver to disable highdma. I'll keep you updated on the issue... thank you once again.

Revision history for this message
Toan (tpham3783) wrote :

Lauri,

Can you let me know whether you tested the workaround on a 64-bit or 32-bit OS? AFAIK, the HIGHMEM option only allows DMA support on 64-bit systems (>4GB), so I don't think it would make a difference if the native OS is 32-bit. I am asking because I've reproduced the bug on both 32-bit and 64-bit systems, so I just don't see how disabling highdma on a 32-bit system would resolve the issue. Regardless, I will try the workaround on a 32-bit system pretty soon.

Revision history for this message
Lauri Võsandi (v6sa) wrote :

Hi,

I am running on a 64-bit system. With the workaround the machine didn't hiccup in ~36 hours, so we stopped testing there; without it I would hit a connection drop within hours, 8 hours tops. For the test I had scp copying data inbound and outbound, and in addition YouTube was playing in several browser tabs. Higher memory usage seemed to trigger the bug faster.

Revision history for this message
Toan (tpham3783) wrote :

Lauri,

I've pumped over 1.5TB of data and have not seen the hiccup yet. I think we've found the smoking gun. Below is a simple patch to the tigon device driver if you prefer not to use the udev rule solution.

I believe the root cause is that the tigon net driver uses virtual memory for DMA transfers. All DMA transfers should be remapped to logical memory using dma_map_page() in order for HIGHDMA feature to work. Broadcom will look into this and hopefully, the bug will be fixed upstream soon... Thanks again...

--- linux-2.6.38.2/drivers/staging/bcm-tg3/tg3.c.vanilla 2016-01-07 14:14:20.000000000 -0500
+++ linux-2.6.38.2/drivers/staging/bcm-tg3/tg3.c 2016-01-06 16:05:37.000000000 -0500
@@ -18992,6 +18992,12 @@

        tg3_init_bufmgr_config(tp);

+ /* pham, patch 5762 chip */
+ if (tp->pdev->device == 0x1687 || tg3_asic_rev(tp) == ASIC_REV_5762){
+ printk("tg3: disable HIGHDMA for tigon3 device 5762\r\n");
+ dev->features &= ~NETIF_F_HIGHDMA;
+ }
+
        /* 5700 B0 chips do not support checksumming correctly due
         * to hardware bugs.
         */

Revision history for this message
Toan (tpham3783) wrote :

It is confirmed, disabling HIGHDMA fixed the NIC problem. This was tested by putting a system under load for 120+ hours, and simulated over 12TB of data through the tg3 NIC. Great find Lauri, and thank you again!

penalvch (penalvch)
description: updated
Changed in linux (Ubuntu):
importance: Medium → High
Revision history for this message
Lauri Võsandi (v6sa) wrote :

Hello, it seems that with a graphical user interface running and highdma off, a similar problem persists:

NETDEV WATCHDOG: eth0 (tg3): transmit queue 0 timed out
[...]
irq 18: nobody cared (try booting with the "irqpoll" option)

After that the device goes offline and can't be brought up again with rmmod/modprobe, and mouse movement becomes jerky. The problem appears quicker if you play around with Firefox etc. I tried booting with irqpoll: the connection still drops, but the module can be reloaded and the mouse isn't jerky. I tried this with the 3.18.25 and 4.4.2 kernels; both exhibited similar behaviour.
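
A quick way to check whether a machine has hit this failure is to filter the kernel log for the messages quoted in this report. A minimal sketch, assuming a POSIX shell and grep; tg3_signatures is a hypothetical helper name, and the pattern list only covers messages that appear in this thread, so it may miss other symptoms.

```shell
#!/bin/sh
# Keep only the kernel log lines (read from stdin) that match failure
# messages quoted in this bug report.
# tg3_signatures is a hypothetical helper name used only for this sketch.
tg3_signatures() {
    grep -E 'NETDEV WATCHDOG: .*\(tg3\)|tg3_stop_block timed out|IO_PAGE_FAULT|irq [0-9]+: nobody cared'
}

# Usage:
#   dmesg | tg3_signatures
```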

Revision history for this message
chriscrutch (chriscrutch) wrote :

Any chance there's been any movement on this bug? It's really a pain for me. Disabling HIGHDMA helped a bit, but now it seems to kick in at different times. Network bandwidth use doesn't seem to be the trigger anymore; instead it disconnects during heavy data transfer over USB. It kicks in when performing large backups to an external hard drive, and when copying large video files to an SD card attached with a USB adapter.

Revision history for this message
Daniel (dkim-b) wrote :

I am having the issue as well on kernel 4.4.0-66 (x64). Disabling HighDMA did not fix anything on my end and I cannot figure out what will trigger the issue. It seems to occur randomly and even if there is no active network traffic.

Revision history for this message
gadi (gadieid) wrote :

It happened to me as well on a ProLiant 360 Gen 9 running Ubuntu 16.04.2 with the 4.4.0-72-generic kernel.
ifconfig -a didn't show any eno devices.
3 identical servers (HW and SW) had no problem at all.
After a simple modprobe tg3 command, all eno devices (1-4) appeared.

Revision history for this message
Jorge Joaquim Gomes Silva (jorgej) wrote :

Any fix to this bug?

I have the same problem: Ubuntu 16.04 LTS, kernel: 4.8.0-46-generic. Same problem in Debian 9 kernel 4.9.

penalvch (penalvch)
tags: added: bios-outdated-2.28
removed: latest-bios-2.15
penalvch (penalvch)
description: updated
penalvch (penalvch)
tags: added: kernel-bug-exists-upstream-4.11
removed: kernel-bug-exists-upstream-4.0
tags: added: xenial
Revision history for this message
luc (glarage) wrote :

HP EliteDesk 705 G1 SFF with NetXtreme BCM5762 Gigabit Ethernet PCIe

FTTH user here. No Ethernet connection after high load (speedtest or YouTube); like other users, I had to reboot. The only workaround I found is to run [sudo ethtool -s eno1 speed 100 duplex full autoneg on] after a reboot. I can then use the network, but not with my full bandwidth.
Lubuntu 17.04 with 4.12.0-041200rc3-generic

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

If this happens on mainline kernel, please file an upstream bug at https://bugzilla.kernel.org.

Revision history for this message
Jorge Joaquim Gomes Silva (jorgej) wrote :

Hi,

I have the same issue with Ubuntu 17.04, kernel 4.10.0.19. Any suggestions to fix this problem, besides reducing the speed of the interface?

Revision history for this message
Roger Techima (techima) wrote :

Hello,

I am having the same problem on an HP EliteDesk 705 G2 Desktop Mini.

I tried the highdma off solution on 14.04, 16.04 and 17.04, but it didn't solve the bug for me.

I am running on a 100Mbit network. I noticed that on gigabit it seems to work, but I may not have tested for long enough.

Best regards,

Roger

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

FWIW, I can't reproduce the issue on the same chip. I used iperf instead of scp though.

Revision history for this message
luc (glarage) wrote :

Hi guys,
Not a fix, but this did the trick: add iommu=soft to your GRUB kernel command line.
You will have a fully working Ethernet connection without reducing the speed.
Mine looks like this: GRUB_CMDLINE_LINUX_DEFAULT="iommu=soft"
Afterwards you have to update GRUB, as usual.
Why?
Because of these lines in dmesg after I updated my BIOS (BIOS L06 v02.28 02/07/2017):

[ 108.769354] psmouse serio1: Wheel Mouse at isa0060/serio1/input0 lost synchronization, throwing 2 bytes away.
[ 108.903961] tg3 0000:03:00.0: tg3_stop_block timed out, ofs=4c00 enable_bit=2
[ 108.945302] tg3 0000:03:00.0 eno1: Link is down
[ 109.305448] AMD-Vi: Event logged [
[ 109.305454] IO_PAGE_FAULT device=03:00.0 domain=0x000d address=0x00000000ffa06e80 flags=0x0020]
[ 109.305459] AMD-Vi: Event logged [
[ 109.305460] IO_PAGE_FAULT device=03:00.0 domain=0x000d address=0x00000000ffa06ec0 flags=0x0020]
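
The GRUB change described in this comment can be sketched as follows. This assumes a Debian/Ubuntu layout (settings in /etc/default/grub, regenerated with update-grub) and GNU sed for the -i option; add_kernel_param is a hypothetical helper name, and the sketch does not handle parameters containing a slash.

```shell
#!/bin/sh
# Append a kernel parameter to GRUB_CMDLINE_LINUX_DEFAULT in the given file,
# unless that line already contains it.
# add_kernel_param is a hypothetical helper name used only for this sketch.
add_kernel_param() {
    file="$1"
    param="$2"
    if grep '^GRUB_CMDLINE_LINUX_DEFAULT=' "$file" | grep -qF -- "$param"; then
        return 0
    fi
    # Insert the parameter just before the closing quote of the existing value.
    sed -i "s/^\(GRUB_CMDLINE_LINUX_DEFAULT=\"[^\"]*\)\"/\1 $param\"/" "$file"
}

# Usage (as root), followed by a reboot:
#   cp /etc/default/grub /etc/default/grub.bak
#   add_kernel_param /etc/default/grub iommu=soft
#   update-grub
# After the reboot, `cat /proc/cmdline` should list iommu=soft.
```

Appending rather than overwriting keeps any existing options such as "quiet splash" in place.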

Revision history for this message
Yngvi Hrafn Pétursson (skuti) wrote :

Having the same issue on an HP EliteDesk 705 G3 Desktop Mini (W4V44AV)
Broadcom Corporation NetXtreme BCM5762 Gigabit Ethernet PCIe (rev 10) and tg3 module

The error is triggered after the link speed is set or negotiated to 100Mbps,
usually within 15 seconds of pinging over the 100Mbps link.
It works OK with 1Gbps links.

It can be triggered by plugging into a 100Mbps port, changing the switch port to 100Mbps, or:
# ethtool -s eno1 speed 100 duplex full autoneg off

Netboot works until the tg3 module takes over.
Windows works ok.
Tested:
- multiple cables, computers and switch vendors
- upgrading bios
- ethtool disable eee and hardware offload
- ubuntu 12.04 - 17.04
- new kernel linux-generic-hwe-16.04-edge Version: 4.11.0.14.22
- disable power management in bios
- disable power management with GRUB switches
- iommu=soft iommu=on iommu=off
- disable highdma

None of the workarounds that i found on Google worked for me.

modinfo tg3 | grep -v alias
filename: /lib/modules/4.4.0-92-generic/kernel/drivers/net/ethernet/broadcom/tg3.ko
firmware: tigon/tg3_tso5.bin
firmware: tigon/tg3_tso.bin
firmware: tigon/tg3.bin
version: 3.137
license: GPL
description: Broadcom Tigon3 ethernet driver
author: David S. Miller (<email address hidden>) and Jeff Garzik (<email address hidden>)
srcversion: 8C06FB0EBBF221DF79133B9
depends: ptp
intree: Y
vermagic: 4.4.0-92-generic SMP mod_unload modversions
parm: tg3_debug:Tigon3 bitmapped debugging message enable value (int)

Revision history for this message
Paulo Abadie Guedes (paulo.guedes) wrote :

Hello, I have seen exactly the same issue, with exactly the same hardware you have: the HP EliteDesk 705 G3 Desktop Mini.

I've tested already a ton of options, including recompiling the latest kernel, booting with several parameters, and so on and so forth. Got nothing more than a big headache. I have 100+ machines to install in a month and my team is having a really hard time to deal with this issue.

I have posted my findings on the fog forums. Fog is an open-source cloning tool. Please check it out:

https://forums.fogproject.org/topic/10731/crash-due-to-timeout-in-tg3-kernel-module-tg3_stop_block-timed-out-ofs-4c00-enable_bit-2

Any ideas on this bug? It seems to be related to 10/100 switches. If both ends are gigabit, it works much more reliably. Problems still arise, but much less frequently. With my old "fast ethernet" switch, the problem always happens.

It's lurking somewhere between the binary blob (the firmware), the kernel driver, the hardware, or some tricky combination of these. Perhaps it is related to the AMD platform.

I can run tests or gather more data, if it helps. The issue always happens here.
Any ideas on how to solve or workaround this issue? Patches or parameters are welcome...

Regards,
Paulo

Revision history for this message
Tessio Fechine (tessiof) wrote :

There was a commit to fix something about the BCM5762 variant, but it seems to be restricted to DELL servers..
https://github.com/torvalds/linux/commit/4419bb1cedcda0272e1dc410345c5a1d1da0e367#diff-ee9b0abeec638cc316efd5b30e0e01e8

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote : Re: [Bug 1447664] Re: 14e4:1687 broadcom tg3 network driver disconnects under high load

> On 11 Jan 2018, at 9:23 PM, Tessio Fechine <email address hidden> wrote:
>
> There was a commit to fix something about the BCM5762 variant, but it seems to be restricted to DELL servers..
> https://github.com/torvalds/linux/commit/4419bb1cedcda0272e1dc410345c5a1d1da0e367#diff-ee9b0abeec638cc316efd5b30e0e01e8

Can you try it without the if block?

If you don’t know how to compile a kernel, I can build a kernel package.


Revision history for this message
Tessio Fechine (tessiof) wrote :

If you point me to the kernel package I can try it..

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :
Revision history for this message
Yngvi Hrafn Pétursson (skuti) wrote :

I tested this kernel but was unable to mount the hard disk.
Missing modules for HP EliteDesk 705 G3 Desktop Mini?

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Probably. I built a new one, please give it a try:
http://people.canonical.com/~khfeng/lp1447664~2/

Revision history for this message
Yngvi Hrafn Pétursson (skuti) wrote :

This kernel works on the HP box i have.
Tested with Firefox and speedtest.net.
Tested with iperf3 on 1Gbps, 100Mbps full-duplex and 100Mbps half-duplex.

No timeouts or errors in dmesg :)

Revision history for this message
Tessio Fechine (tessiof) wrote :

tg3 still crashing..

[ 301.753501] tg3 0000:01:00.0 eno1: Link is up at 100 Mbps, full duplex
[ 301.753546] tg3 0000:01:00.0 eno1: Flow control is off for TX and off for RX
[ 301.753551] tg3 0000:01:00.0 eno1: EEE is disabled
[ 312.032110] NETDEV WATCHDOG: eno1 (tg3): transmit queue 0 timed out
[ 312.032190] ------------[ cut here ]------------
[ 312.032208] WARNING: CPU: 1 PID: 0 at /home/khfeng/Sources/linux-lp1447664/net/sched/sch_generic.c:320 dev_watchdog+0x21e/0x230
[ 312.032209] Modules linked in: rfcomm bnep nls_iso8859_1 edac_mce_amd kvm_amd kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel btusb joydev btrtl btbcm btintel input_leds snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi aes_x86_64 crypto_simd snd_hda_intel snd_hda_codec bluetooth snd_hda_core snd_hwdep glue_helper ecdh_generic cryptd snd_pcm hp_wmi sparse_keymap snd_seq_midi snd_seq_midi_event snd_rawmidi snd_seq wmi_bmof mac_hid shpchp snd_seq_device snd_timer fam15h_power k10temp i2c_piix4 snd tpm_infineon soundcore parport_pc ppdev lp parport autofs4 uas usb_storage hid_generic usbhid hid amdkfd amd_iommu_v2 amdgpu i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops tg3 ahci drm ptp libahci pps_core wmi video
[ 312.032305] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.14.0-17-generic #20~lp1447664
[ 312.032307] Hardware name: HP HP EliteDesk 705 G2 MINI/805B, BIOS N26 Ver. 02.11 11/01/2016
[ 312.032310] task: ffff88952c81c500 task.stack: ffff9df2c19c4000
[ 312.032314] RIP: 0010:dev_watchdog+0x21e/0x230
[ 312.032317] RSP: 0018:ffff88953ec83e50 EFLAGS: 00010282
[ 312.032320] RAX: 0000000000000037 RBX: 0000000000000000 RCX: 0000000000000000
[ 312.032322] RDX: 0000000000000000 RSI: ffff88953ec96598 RDI: ffff88953ec96598
[ 312.032323] RBP: ffff88953ec83e80 R08: 0000000000000001 R09: 00000000000003bf
[ 312.032325] R10: ffff88953ec83ee0 R11: 0000000000000000 R12: 0000000000000005
[ 312.032327] R13: 0000000000000001 R14: ffff8895226ea000 R15: ffff889521856d80
[ 312.032330] FS: 0000000000000000(0000) GS:ffff88953ec80000(0000) knlGS:0000000000000000
[ 312.032333] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 312.032334] CR2: 00000000021a0008 CR3: 00000003a6126000 CR4: 00000000001406e0
[ 312.032337] Call Trace:
[ 312.032341] <IRQ>
[ 312.032349] ? qdisc_rcu_free+0x50/0x50
[ 312.032358] call_timer_fn+0x33/0x130
[ 312.032361] run_timer_softirq+0x3fd/0x460
[ 312.032367] ? ktime_get+0x40/0xa0
[ 312.032371] ? lapic_next_event+0x1d/0x30
[ 312.032377] __do_softirq+0xda/0x2a6
[ 312.032382] irq_exit+0xb6/0xc0
[ 312.032385] smp_apic_timer_interrupt+0x69/0x120
[ 312.032388] apic_timer_interrupt+0x9f/0xb0
[ 312.032390] </IRQ>
[ 312.032397] RIP: 0010:cpuidle_enter_state+0xa2/0x2e0
[ 312.032399] RSP: 0018:ffff9df2c19c7e70 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[ 312.032402] RAX: ffff88953eca2c40 RBX: 00000048a68f835f RCX: 000000000000001f
[ 312.032403] RDX: 00000048a68f835f RSI: fffffffb76b082a3 RDI: 0000000000000000
[ 312.032405] RBP: ffff9df2c19c7eb0 R08: 0000000000000858 R09: 0000000000000861
[ 312.032407] R10: ffff9df2c19c7e40 R11: 0000000000000643 R12: ffff8895...

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Taking a deeper look, I don't think [1] will help the situation. It is mainly meant to solve an issue with jumbo frames.

I think it's better to ask HP and Broadcom to fix the issue.

[1] https://github.com/torvalds/linux/commit/4419bb1cedcda0272e1dc410345c5a1d1da0e367#diff-ee9b0abeec638cc316efd5b30e0e01e8

Revision history for this message
Paulo Abadie Guedes (paulo.guedes) wrote :

Hello, I am still having this bug. I'm working with several HP machines, with the same model as Yngvi. Here it is (from dmesg messages):
Hardware name: HP HP EliteDesk 705 G3 Brazil Desktop Mini/8266, BIOS P26 Ver. 02.03 12/22/2016

Interesting to notice that it always happens with a 10/100 switch, but never occurs with a gigabit one.

I've compiled and tested the 4.15.0-rc8 release candidate, which has the commit 4419bb1cedcda0272e1dc410345c5a1d1da0e367, but it does not solve the issue. I added a few printk calls and can see that the module is correctly compiled and loaded, but my machine is not a Dell. Hence, the "if" condition fails and the body is not executed.

I also tried to force the patch by keeping the "if body" and removing the condition, just to see what happens (with another printk to prove that it runs). The code runs (limiting MRRS to 2048, I think), but it does not solve the bug.
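
To confirm whether a forced patch like this actually changed the MRRS, the value can be read back from lspci. A minimal sketch, assuming the usual `lspci -vv` layout where the PCIe DevCtl section contains "MaxReadReq N bytes"; read_mrrs is a hypothetical helper name, and the PCI address in the usage comment is the one from the dmesg output here and will differ on other machines.

```shell
#!/bin/sh
# Extract the MaxReadReq (MRRS) value in bytes from `lspci -vv` output on stdin.
# read_mrrs is a hypothetical helper name used only for this sketch.
read_mrrs() {
    sed -n 's/.*MaxReadReq \([0-9][0-9]*\) bytes.*/\1/p' | head -n 1
}

# Usage (as root, so lspci can read the capability registers):
#   lspci -vv -s 01:00.0 | read_mrrs
```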
It complains that TSC is unstable, right after tg3 breaks. Here is a dmesg snippet, maybe it helps.

<...>
[ 155.816404] clocksource: timekeeping watchdog on CPU0: Marking clocksource 'tsc' as unstable because the skew is too large:
[ 155.816447] clocksource: 'refined-jiffies' wd_now: fffdcbf3 wd_last: fffdc110 mask: ffffffff
[ 155.816490] clocksource: 'tsc' cs_now: 7d3f16e620 cs_last: 7b2987b172 mask: ffffffffffffffff
[ 155.816533] tsc: Marking TSC unstable due to clocksource watchdog
[ 155.939181] tg3 0000:01:00.0: tg3_stop_block timed out, ofs=4c00 enable_bit=2
[ 156.103998] tg3 0000:01:00.0 eth0: Link is down
[ 156.322988] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
[ 156.323040] sched_clock: Marking unstable (156322980975, 5436)<-(156582881282, -259894745)
[ 156.323144] clocksource: Switched to clocksource refined-jiffies
<...>

If you want to take a deeper look, there are a few logs here. Tried also with "tsc=unstable" and other boot parameters, mostly to see if any would help (feeling lucky, perhaps?). Nothing changed, the bug is still in here. They show mostly the same messages, to me.

log_01_acpi_off.txt
https://pastebin.com/FGQNiLqk

log_02_maxcpus_1.txt
https://pastebin.com/2eEJnA3Z

log_03_nmi_watchdog_off.txt
https://pastebin.com/Su44AqiX

log_04_nmi_watchdog_off.txt
https://pastebin.com/4ja0UZ0c

log_05_noapic_nolapic.txt
https://pastebin.com/fZNJbME5

Well, any ideas? I can reproduce the problem 100% of the time. Would you like me to test any other patch?

Kai-Heng Feng, you mention "it's better to ask HP and Broadcom to fix the issue". I agree, but how can we do that?

Thank you,
Paulo

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

First please file an upstream bug at https://bugzilla.kernel.org/
Product: Drivers
Component: Network

Also, it looks like this is Ubuntu certified hardware, so let me ask around.

Revision history for this message
Paulo Abadie Guedes (paulo.guedes) wrote :

Hello, I would like to confirm that it's useful to file a new bug for this
issue. For me, the problem I'm having is the same as we are discussing in
this thread. Would it be just a duplicate?

Maybe I'm missing something, because I don't know the details of the bug
hunting process for Ubuntu.

Can you please confirm I should open it?

In this case, I can add a detailed description and dmesg logs, with debug
on and the timeout error message inside.

Anyway, I want to report advances in this problem. I have tested a few
kernels and patches in the last weeks, and have found one combination that
does solve the issue.

I also checked that this patch is not yet merged into the latest vanilla
stable kernel, version 4.15, released three days ago. But it applies to and
works on 4.15 as well, which is just great (at least for me).

Will send the details later (or tomorrow), as soon as I get back to my
computer.

Paulo

On Jan 29, 2018 12:54 AM, "Kai-Heng Feng" <email address hidden>
wrote:

> First please file an upstream bug at https://bugzilla.kernel.org/
> Product: Drivers
> Component: Network
>
> Also, looks like it's a Ubuntu certified hardware, let me ask around.

Revision history for this message
Paulo Abadie Guedes (paulo.guedes) wrote :

Hello, this thread has a patch that solved the bug (for me).
https://<email address hidden>/msg189347.html

The patch is here:
https://<email address hidden>/msg189923/0001-tg3-Add-clock-override-support-for-5762.patch

I tested this patch on the following kernels and situations.

1) Stable kernels 4.13.3 and 4.15 crash without the patch (as did all other versions tested). The patch is not yet merged into mainline, up to and including 4.15 (stable).

2) Stable kernels 4.13.3 and 4.15 work great with the patch: no timeouts on tg3, and fast transfers on both gigabit and 10/100 links.

3) On Jan 31 (10 days ago) I wrote to the patch author, mentioned my results and asked when it will be merged. I am still waiting; the author is probably quite busy.

4) Many tests were performed over several weeks. The last session took about one or two weeks, working full time, on an isolated network, using the FOG open source cloning solution. Several hundred GB were transferred during the tests, cloning 100+ machines in a few labs. Both single and multicast cloning sessions were used, with a gigabit switch and also with 10/100 switches. We ran tests sequentially and in parallel, with and without power failures, with and without several patches, in many configurations, with lots of kernel parameters, you name it.

5) The test scenario shows this bug is completely reproducible, 100% of the time. Without the patch, my kernels always fail: I tested about 20 different versions and none worked. With the patch above, the two patched versions always work correctly.

6) A minor detail: the patch applies with a slight offset on 4.15 (2 lines, probably new comments or code) but works anyway.

This work would have been impossible without the cooperation of the FOG team. Sebastian suggested the patch, and others helped a lot. A big "thank you" to them!

I wonder when this will be merged into the mainline kernel. Can anyone please help with this?

Regards,
Paulo

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Kernel with patch in comment #40. Please try it out.

http://people.canonical.com/~khfeng/lp1447664-clk/

Changed in linux (Ubuntu):
assignee: nobody → Kai-Heng Feng (kaihengfeng)
Revision history for this message
Paulo Abadie Guedes (paulo.guedes) wrote :

Thank you, we will try it as soon as possible.

Currently I'm on vacation, and will not be able to test it until about
March 5 (2 weeks from now). But as soon as I test it, I'll let you know
about the results.

It would be great if someone else could try it too.

Thanks,
Paulo

Revision history for this message
marc (boolioncube) wrote :

i recently got one of these EliteDesks. tg3 locks up like once a week; seems to happen when flexget adds a bunch to transmission ... it spikes the TX... and boom. i just installed the patched kernel now. thanks yall.

Revision history for this message
Ed S (imimimx) wrote :

dpkg: dependency problems prevent configuration of linux-headers-4.13.0-34-generic:
 linux-headers-4.13.0-34-generic depends on libssl1.1 (>= 1.1.0); however:
  Package libssl1.1 is not installed

Is this a dependency version problem on Ubuntu 16.04?

ii libssl-dev:amd64 1.0.2g-1ubuntu4.10 amd64 Secure Sockets Layer toolkit - development files

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

The kernel was compiled on Bionic, so it has the wrong dependencies for Xenial.
I built a new one, please give it a try:
http://people.canonical.com/~khfeng/lp1447664-xenial/

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Guys, Broadcom has a new patch [1] that needs testing.
Here's the kernel [2] to try.

[1] https://lkml.org/lkml/2018/3/20/35
[2] https://people.canonical.com/~khfeng/lp1447664-20180320/

Revision history for this message
Paulo Abadie Guedes (paulo.guedes) wrote :

Ok, I'll check it out. Thank you very much!

By the way, we downloaded and tested one of the deb packages you created,
and it worked quite well. I will check exactly which one it was before
reporting (I'm almost sure it was the one for Xenial).

We managed to reproduce the issue easily by booting into PXE and, after the
NIC was started (trying to get an IP address), resetting the machine and
booting into Ubuntu. This behaves very differently from a cold boot directly
into Ubuntu.

My hypothesis is that PXE sets up the NIC in a non-default way, by changing
one or more config bits in some register. The unpatched tg3 driver does not
touch these bits. So a boot may sometimes work, perhaps because the bits
still hold their default values (the tg3 module does not set them), and fail
when the PXE code has set them before Linux is loaded.

Anyway, the unpatched kernel breaks very quickly, while the patched kernel
you provided worked out very well. This happens after running pxe.
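
One way to probe the PXE hypothesis above might be to save a register dump after a cold boot and another after a PXE start followed by a reset, then compare the two. This is only a rough sketch: it assumes the interface is named eno1, that `ethtool -d` (register dump) is usable on this NIC, and the file names and the helper name `changed_dump_lines` are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print only the lines that differ between two saved
# register dumps (for example: cold boot vs. boot after PXE).
changed_dump_lines() {
    # diff marks lines from the first file with '<' and from the second
    # with '>'; keep just those. '|| true' keeps the exit status clean
    # when the dumps are identical and grep finds nothing.
    diff "$1" "$2" | grep '^[<>]' || true
}

# Capture step (needs root and the real hardware), once per boot type:
#   sudo ethtool -d eno1 > cold-boot.regs
#   sudo ethtool -d eno1 > after-pxe.regs
# Compare step:
#   changed_dump_lines cold-boot.regs after-pxe.regs
```

Any register that differs between the two dumps would be a candidate for the bits that PXE changes and the unpatched driver leaves alone.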

I will check your links soon and return with our results in the next days,
hopefully this weekend or next week.

Thank you,
Paulo

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Folks,

tg3 maintainers are waiting for the test result. Hopefully it can fix the issue.

Revision history for this message
luc (glarage) wrote :

Hi Kai-heng,

I tried 4.15.0-14-generic #15~lp1447664 SMP Tue Mar 20 14:31:37 CST 2018 x86_64 x86_64 x86_64 GNU/Linux, on Lubuntu 17.10.
I have a Hewlett-Packard HP EliteDesk 705 G1 SFF/2215, BIOS L06 v02.28 02/07/2017 and Lubuntu is in UEFI mode (my only OS) on this device.
Unfortunately, I have the same problem: tg3 still crashes and a reboot is mandatory.

[ 105.620301] tg3 0000:03:00.0 eno1: 0: Host status block [00000001:000000cc:(0000:002e:0000):(0000:0006)]
[ 105.620309] tg3 0000:03:00.0 eno1: 0: NAPI info [000000cc:000000cc:(0024:0006:01ff):0000:(00f7:0000:0000:0000)]
[ 105.620317] tg3 0000:03:00.0 eno1: 1: Host status block [00000001:00000042:(0000:0000:0000):(0830:0000)]
[ 105.620324] tg3 0000:03:00.0 eno1: 1: NAPI info [00000042:00000042:(0000:0000:01ff):0830:(0030:0030:0000:0000)]
[ 105.620331] tg3 0000:03:00.0 eno1: 2: Host status block [00000001:000000d2:(0fff:0000:0000):(0000:0000)]
[ 105.620370] tg3 0000:03:00.0 eno1: 2: NAPI info [000000d2:000000d2:(0000:0000:01ff):0fff:(07ff:07ff:0000:0000)]
[ 105.755739] tg3 0000:03:00.0: tg3_stop_block timed out, ofs=4c00 enable_bit=2
[ 105.797123] tg3 0000:03:00.0 eno1: Link is down
[ 105.889440] tg3 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x00000000ffe3d640 flags=0x0020]
[ 105.889478] tg3 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x000d address=0x00000000ffe3d680 flags=0x0020]
[ 109.932707] tg3 0000:03:00.0 eno1: Link is up at 1000 Mbps, full duplex
[ 109.932710] tg3 0000:03:00.0 eno1: Flow control is off for TX and off for RX
[ 109.932711] tg3 0000:03:00.0 eno1: EEE is enabled
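
The log above shows AMD-Vi IO_PAGE_FAULT events for the tg3 device right after the tg3_stop_block timeout. When comparing logs from several machines, it may help to pull out just the faulting addresses. A small sketch, assuming the log lines have the same layout as the ones above; the function name `tg3_page_fault_addrs` is invented for this example.

```shell
#!/bin/sh
# Hypothetical helper: read kernel log text on stdin and print the
# address field of every AMD-Vi IO_PAGE_FAULT event logged for tg3.
tg3_page_fault_addrs() {
    grep 'tg3' | grep 'IO_PAGE_FAULT' |
        sed -n 's/.*address=\(0x[0-9a-fA-F]*\).*/\1/p'
}

# Usage on a live system:
#   dmesg | tg3_page_fault_addrs
```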

Revision history for this message
Paulo Abadie Guedes (paulo.guedes) wrote :

We tried this same version yesterday and the bug was still present.
Actually it looked worse, because the machine crashed faster (maybe it was
just an impression). I will collect logs to report this properly soon, in a
few hours.
Paulo

Revision history for this message
Paulo Abadie Guedes (paulo.guedes) wrote :

Hi Kai-heng,

Here are the test results we got.
Kernel 4.15.0-14-generic failed: the transmit queue timed out. The dmesg
output is attached. The tg3 module crashes within seconds of opening
the user session (in less than about 10 seconds).

However, kernel 4.15.0-9-generic worked like a charm. It boots and brings
up tg3, the Ethernet link works and the module seems stable. We used
it to download a few GB and an Ubuntu image, play videos for a few hours
and the like. Not a single crash was observed. The dmesg output for this
working kernel is also attached, because it might help you sort
out what differs between the two kernels.

Would you like us to test another image? Or to gather more information?

Regards,
Paulo

Revision history for this message
luc (glarage) wrote :

Sorry for the multiple posts, I didn't see the 4.15.0-9 kernel before... :)
tg3 still crashes, but not as early... I played several full HD videos and ran several speed tests before losing the connection (FTTH here; my download speed is about 290 Mbps).

Revision history for this message
luc (glarage) wrote :

Hi guys,
A quick report on the new BIOS (2.30) available for the HP EliteDesk 705 G1 SFF/2215: BIOS L06 v02.30 03/22/2018.
It changes nothing for the tg3 driver, which still crashes (without iommu=soft, in my case) .... :(

[ 80.864034] ------------[ cut here ]------------
[ 80.864039] NETDEV WATCHDOG: eno1 (tg3): transmit queue 0 timed out
[ 80.864081] WARNING: CPU: 1 PID: 0 at /home/khfeng/Sources/linux-lp1447664-xenial/net/sched/sch_generic.c:323 dev_watchdog+0x222/0x230
[ 80.864082] Modules linked in: nls_iso8859_1 edac_mce_amd crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi aesni_intel aes_x86_64 hp_wmi snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_timer shpchp snd crypto_simd glue_helper cryptd fam15h_power input_leds serio_raw sparse_keymap soundcore wmi_bmof k10temp tpm_infineon i2c_piix4 mac_hid ip_tables x_tables autofs4 amdkfd amd_iommu_v2 amdgpu chash radeon i2c_algo_bit ttm tg3 ptp psmouse pps_core drm_kms_helper wmi syscopyarea sysfillrect ahci sysimgblt fb_sys_fops libahci drm video
[ 80.864136] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 4.15.0-9-generic #10~lp1447664+xenial
[ 80.864137] Hardware name: Hewlett-Packard HP EliteDesk 705 G1 SFF/2215, BIOS L06 v02.30 03/22/2018
[ 80.864141] RIP: 0010:dev_watchdog+0x222/0x230
[ 80.864143] RSP: 0018:ffff9d3caec83e68 EFLAGS: 00010282
[ 80.864146] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000006
[ 80.864147] RDX: 0000000000000007 RSI: 0000000000000082 RDI: ffff9d3caec96450
[ 80.864149] RBP: ffff9d3caec83e98 R08: 0000000000000001 R09: 00000000000003da
[ 80.864150] R10: 0000000000000000 R11: 00000000000003da R12: 0000000000000005
[ 80.864152] R13: ffff9d3c9b4a4000 R14: ffff9d3c9b4a4478 R15: ffff9d3c9af34d80
[ 80.864154] FS: 0000000000000000(0000) GS:ffff9d3caec80000(0000) knlGS:0000000000000000
[ 80.864156] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 80.864158] CR2: 00002547c8b50c00 CR3: 000000022188c000 CR4: 00000000000406e0
[ 80.864160] Call Trace:
[ 80.864163] <IRQ>
[ 80.864168] ? dev_graft_qdisc+0x70/0x70
[ 80.864174] call_timer_fn+0x32/0x140
[ 80.864178] run_timer_softirq+0x1ed/0x440
[ 80.864182] ? ktime_get+0x3e/0xa0
[ 80.864186] ? lapic_next_event+0x20/0x30
[ 80.864192] __do_softirq+0xf2/0x288
[ 80.864196] irq_exit+0xb6/0xc0
[ 80.864200] smp_apic_timer_interrupt+0x71/0x140
[ 80.864204] apic_timer_interrupt+0x9f/0xb0
[ 80.864205] </IRQ>
[ 80.864210] RIP: 0010:cpuidle_enter_state+0xa7/0x300
[ 80.864212] RSP: 0018:ffffbd7700d4fe60 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff11
[ 80.864215] RAX: ffff9d3caeca2840 RBX: 0000000000000002 RCX: 000000000000001f
[ 80.864216] RDX: 0000000000000000 RSI: 0000000024a3c7c4 RDI: 0000000000000000
[ 80.864218] RBP: ffffbd7700d4fe98 R08: ffff9d3caeca1664 R09: 0000000000000018
[ 80.864219] R10: ffffbd7700d4fe30 R11: 000000000000011c R12: 0000000000000002
[ 80.864221] R13: ffff9d3ca5f1b000 R14: ffffffffbf3802f8 R15: 00000012d3b48a8f
[ 80.864226] cpuidle_enter+0x17/0x20
[ 80.864230] call_cpuidle+0x23/0x40
[ 80.864233] do_idle+0x197/0x200
[ 80.864236] cpu_start...

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

I guess this commit fixes the issue. Can anyone try it?

commit 3a498606bb04af603a46ebde8296040b2de350d1
Author: Sanjeev Bansal <email address hidden>
Date: Mon Jul 16 11:13:32 2018 +0530

    tg3: Add higher cpu clock for 5762.

    This patch has fix for TX timeout while running bi-directional
    traffic with 100 Mbps using 5762.

    Signed-off-by: Sanjeev Bansal <email address hidden>
    Signed-off-by: Siva Reddy Kallam <email address hidden>
    Reviewed-by: Michael Chan <email address hidden>
    Signed-off-by: David S. Miller <email address hidden>

Revision history for this message
Wu (wu) wrote :

Hi,

The commit 3a498606bb04af603a46ebde8296040b2de350d1 mentioned in #54 has been integrated into my current customized RHEL (CentOS) 7 for some time. This fix is also in the newer kernel releases; however, even with the 4.19 mainline kernel the bug is still not solved.

- On 1 Gbps Ethernet, this commit changed nothing. The crash came immediately after some transmission.

- On 100 Mbps Ethernet, the crash is not triggered very often even without the above fix, so I cannot judge yet.

Best regards

Revision history for this message
luc (glarage) wrote :

Actually, with 4.19.6 and HP BIOS v02.31, tg3 still crashes at both 100 Mbps and 1 Gbps.

Logs are still the same

Revision history for this message
Kai-Heng Feng (kaihengfeng) wrote :

Yes, I saw the same issue on Gigabit Ethernet. I raised the issue [1] with the tg3 maintainers.

Do you use 5762?

[1] https://www.spinics.net/lists/netdev/msg538330.html

Revision history for this message
luc (glarage) wrote :

Yep, dmesg | grep tg3 | less shows:
tg3.c:v3.137 (May 11, 2014)
tg3 0000:03:00.0 eth0: Tigon3 [partno(BCM95762) rev 5762100] (PCI Express)
tg3 0000:03:00.0 eth0: attached PHY is 5762C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
tg3 0000:03:00.0 eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
tg3 0000:03:00.0 eth0: dma_rwctrl[00000001] dma_mask[64-bit]
tg3 0000:03:00.0 eno1: renamed from eth0
tg3 0000:03:00.0 eno1: Link is up at 1000 Mbps, full duplex
tg3 0000:03:00.0 eno1: Flow control is on for TX and on for RX
tg3 0000:03:00.0 eno1: EEE is enabled

BTW, thanks for your time, Kai-Heng Feng

Revision history for this message
Bob Lawrence (pilotbob42) wrote :

Confirmed that this is still an issue on 18.04.1. I have an HP 705 G1 with the Broadcom 5762. In my case it's a Plex server. Whenever I try to stream something, the interface goes "NO-CARRIER" and the only way to recover is to reboot. I've tried disabling highdma, tso and gso using ethtool, the iommu=soft kernel parameter, and forcing every combination of 1 Gbps/100 Mbps and half/full duplex. Nothing seems to work around the issue.
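
For anyone who wants to repeat the attempts listed above, the ethtool part could look roughly like the sketch below. It only prints the commands so they can be reviewed first. The interface name eno1, the choice of 100 Mbps full duplex as the forced mode, and the helper name `tg3_workaround_cmds` are assumptions for illustration, and this comment reports that none of these settings helped on this machine.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) the ethtool workaround
# commands for one interface, defaulting to eno1.
tg3_workaround_cmds() {
    iface=${1:-eno1}
    # Turn off the offload features mentioned in this thread.
    echo "ethtool -K $iface highdma off tso off gso off"
    # Force one speed/duplex combination; repeat with other values
    # to try the remaining combinations.
    echo "ethtool -s $iface speed 100 duplex full autoneg off"
}

# Review the output, then run it as root, for example:
#   tg3_workaround_cmds eno1 | sudo sh
#
# iommu=soft is a kernel boot parameter, not an ethtool setting. On
# Ubuntu it is normally added to GRUB_CMDLINE_LINUX_DEFAULT in
# /etc/default/grub, followed by: sudo update-grub
```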

System: Host: Bobs-HTPC Kernel: 4.15.0-43-generic x86_64 bits: 64 Console: tty 1 Distro: Ubuntu 18.04.1 LTS
Machine: Device: desktop System: Hewlett-Packard product: HP EliteDesk 705 G1 DM serial: N/A
           Mobo: Hewlett-Packard model: 225E serial: N/A BIOS: Hewlett-Packard v: L06 v02.31 date: 08/31/2018
Battery hidpp__0: charge: N/A condition: NA/NA Wh
CPU: Quad core AMD A8-7600 Radeon R7 10 Compute Cores 4C+6G (-MCP-) cache: 8192 KB
           clock speeds: max: 3100 MHz 1: 3094 MHz 2: 3094 MHz 3: 3094 MHz 4: 3094 MHz
Graphics: Card: Advanced Micro Devices [AMD/ATI] Kaveri [Radeon R7 Graphics]
           Display Server: N/A drivers: ati,radeon (unloaded: modesetting,fbdev,vesa)
           tty size: 120x53 Advanced Data: N/A out of X
Audio: Card-1 Advanced Micro Devices [AMD] FCH Azalia Controller driver: snd_hda_intel
           Card-2 Advanced Micro Devices [AMD/ATI] Kaveri HDMI/DP Audio Controller driver: snd_hda_intel
           Sound: Advanced Linux Sound Architecture v: k4.15.0-43-generic
Network: Card-1: Intel Wireless 7260 driver: iwlwifi
           IF: wlp2s0 state: up mac: cc:3d:82:a7:bf:ed
           Card-2: Broadcom Limited NetXtreme BCM5762 Gigabit Ethernet PCIe driver: tg3
           IF: eno1 state: up speed: 100 Mbps duplex: half mac: ec:b1:d7:4c:2d:8e
Drives: HDD Total Size: 9501.7GB (42.8% used)
           ID-1: /dev/sda model: ST500LM000 size: 500.1GB
           ID-2: USB /dev/sdb model: 5 size: 9001.6GB
Partition: ID-1: / size: 458G used: 23G (6%) fs: ext4 dev: /dev/sda1
RAID: No RAID devices: /proc/mdstat, md_mod kernel module present
Sensors: System Temperatures: cpu: 40.8C mobo: N/A gpu: 42.0
           Fan Speeds (in rpm): cpu: N/A
Info: Processes: 227 Uptime: 12:49 Memory: 1608.0/5943.7MB Init: systemd runlevel: 5
           Client: Shell (bash) inxi: 2.3.56

Revision history for this message
Paulo Abadie Guedes (paulo.guedes) wrote :

Thank you.
I am still having the problem during our cloning process, although it's not
so frequent. Before I applied the patch, each and every transfer would
ALWAYS trigger the tg3 bug.

Here it seems related to problems with NAPI. AFAIK, this is an approach to
handling interrupt bursts. NICs typically work in bursts: a long time
without packets, then a very large stream of packets, then silence. This is
the common scenario.

Using interrupts to serve sporadic data is OK. But a burst of packets
triggers a burst of interrupts, which is not as efficient as just polling
the NIC (during the burst).

What NAPI does is (in a very, very simplified way): it waits for the first
interrupt from the network, then switches off interrupts and polls the NIC
until there are no more network packets or the "work quota" is
exhausted, whichever happens first. Then it turns interrupts back on and the
cycle repeats. This quota (sorry, I don't remember the correct term) is very
important to prevent the kernel from getting stuck just serving packets.
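
The polling cycle described above can be illustrated with a toy model. This is not kernel code, only a sketch of the idea: in the kernel the "work quota" is called the budget, and the function name `napi_poll_once` and its output format are invented here.

```shell
#!/bin/sh
# Toy model of one NAPI poll pass (illustration only, not kernel code).
# $1 = packets waiting in the NIC ring, $2 = budget (the "work quota").
napi_poll_once() {
    pending=$1
    budget=$2
    handled=0
    # Interrupts are off while polling. Handle packets until the ring
    # is empty or the budget is used up, whichever happens first.
    while [ "$pending" -gt 0 ] && [ "$handled" -lt "$budget" ]; do
        pending=$((pending - 1))
        handled=$((handled + 1))
    done
    if [ "$handled" -lt "$budget" ]; then
        # Ring drained before the budget ran out: go back to interrupts.
        echo "handled=$handled next=reenable-interrupts"
    else
        # Budget exhausted: yield the CPU and stay in polling mode.
        echo "handled=$handled next=poll-again"
    fi
}
```

For example, `napi_poll_once 3 64` reports that the ring was drained and interrupts come back on, while `napi_poll_once 100 64` stops at the budget and leaves the rest for the next pass.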

What's happening is (my understanding) that something went wrong during
this process and the tg3 driver gets stuck.

A colleague told me that it's related to the broadcom driver.

Please try this workaround. Remove the two drivers, then reload "broadcom"
and "tg3" in this order. Maybe then your network will restart.

sudo modprobe -r broadcom tg3
sudo modprobe broadcom
sudo modprobe tg3

Please tell us what happens when you try this. It won't solve the problem,
but perhaps it helps.

Regards,
Paulo

Revision history for this message
marc (boolioncube) wrote :

we're using this hp g1 thing primarily for torrent seeding and
gnuMotion... so it is consistently getting ~5mb/s and pushing out 300kb/s.

the explanation on #60 might explain why this recipe is somewhat stable --
its always active.

i also havnt updated the bios - i wont have physical access for a few
months - but since others are saying it makes no diff ... think i wont
bother

- uptime/ 48 days
- RX/ 18tb
- TX/ 4tb
- 2 tg3 hangs ... and it restarted the driver on its own

------------------
zosky@mintyElite:~$ uname -a
Linux mintyElite 4.15.0-9-generic #10~lp1447664+xenial SMP Thu Feb 22 15:51:40 CST 2018 x86_64 x86_64 x86_64 GNU/Linux

zosky@mintyElite:~$ ifconfig
eno1 Link encap:Ethernet HWaddr 50:65:f3:51:fe:7e
          inet addr:192.168.1.62 Bcast:192.168.1.255 Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
          RX packets:13545805910 errors:0 dropped:1445 overruns:0 frame:0
          TX packets:9698573442 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:18046221779614 (18.0 TB) TX bytes:4257723240257 (4.2 TB)
          Interrupt:45

zosky@mintyElite:~$ dmesg | grep tg3 | head -10
[ 2.828297] tg3.c:v3.137 (May 11, 2014)
[ 2.846250] tg3 0000:01:00.0 eth0: Tigon3 [partno(BCM95762) rev 5762100] (PCI Express) MAC address 50:65:f3:51:fe:7e
[ 2.847035] tg3 0000:01:00.0 eth0: attached PHY is 5762C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
[ 2.847880] tg3 0000:01:00.0 eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[1]
[ 2.848796] tg3 0000:01:00.0 eth0: dma_rwctrl[00000001] dma_mask[64-bit]
[ 2.850519] tg3 0000:01:00.0 eno1: renamed from eth0
[ 46.205677] tg3 0000:01:00.0 eno1: Link is up at 1000 Mbps, full duplex
[ 46.205679] tg3 0000:01:00.0 eno1: Flow control is on for TX and on for RX
[ 46.205681] tg3 0000:01:00.0 eno1: EEE is disabled
[2700404.396192] NETDEV WATCHDOG: eno1 (tg3): transmit queue 0 timed out

Revision history for this message
Bob Lawrence (pilotbob42) wrote :

@paulo.guedes

Yes, removing and re-adding the modules as you describe does at least recover eno1 without rebooting. Still, that's hardly a solution for what was intended to be a headless Plex server. This happens every time I start an MPEG-2 TV stream through my Plex box, which is only about a 20 Mbps load. Sometimes it happens immediately; sometimes it runs for nearly an hour.
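
On a headless box, the module reload could in principle be triggered automatically when the transmit timeout shows up in the kernel log. The following is an untested sketch, not a fix: the function names are invented, the match pattern is taken from the "NETDEV WATCHDOG" lines posted in this thread, and the polling loop in the comments assumes a systemd journal is available.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the log text on stdin contains a tg3
# transmit-queue timeout like the ones posted in this thread.
tg3_tx_timed_out() {
    grep -q 'NETDEV WATCHDOG: .*(tg3): transmit queue [0-9]* timed out'
}

# Reload the drivers in the order suggested in this thread (needs root).
tg3_reload() {
    modprobe -r broadcom tg3 &&
    modprobe broadcom &&
    modprobe tg3
}

# Possible polling loop, run as root (an assumption, not tested here):
#   while sleep 60; do
#       if journalctl -k --since "1 minute ago" | tg3_tx_timed_out; then
#           tg3_reload
#       fi
#   done
```

This only shortens the outage; streams that are running when the interface drops would still be interrupted.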

Also, I compiled a custom kernel with the patch described in post #40. It had no effect on the dropouts for me. They are still occurring.

System:
Host: Bobs-HTPC Kernel: 4.15.18+ x86_64 bits: 64 Console: tty 1 Distro: Ubuntu 18.04.1 LTS

Machine:
Device: desktop System: Hewlett-Packard product: HP EliteDesk 705 G1 DM serial: N/A
Mobo: Hewlett-Packard model: 225E serial: N/A BIOS: Hewlett-Packard v: L06 v02.31 date: 08/31/2018

Network:
Card-1: Intel Wireless 7260 driver: iwlwifi
IF: wlp2s0 state: up mac: cc:3d:82:a7:bf:ed
Card-2: Broadcom Limited NetXtreme BCM5762 Gigabit Ethernet PCIe driver: tg3
IF: eno1 state: up speed: 100 Mbps duplex: full mac: ec:b1:d7:4c:2d:8e

Bob Lawrence (pilotbob42) wrote :

Also, on the last crash, I caught it while it was happening and RX/TX errors and collisions all went through the roof right before it went "no-carrier".

James Johnson (triplej) wrote :

I have been experiencing this issue on an HP 745 G4 with the same BCM 5762 on several kernel versions, from Ubuntu 16.04 up to 4.15.0-45. On my system the network would crash immediately, often before I could log in. Occasionally I would be able to ping for several seconds before the device crashed.

I have tried several workarounds from this thread, but none were successful. Setting iommu to soft may have increased the time before failure from about 10 seconds to about 30; however, I did not test this extensively.

I was able to upgrade to mainline kernel 4.20.7-042007 using ukuu, and I no longer experience any device instability. I'm not sure whether this specific patch was included in that release, but it may be useful for those still experiencing crashes on Ubuntu 18.04.

Shane R. Spencer (whardier) wrote :

Same issue with HP EliteDesk 705 G2 MINI

Turned off all power saving options in BIOS.

Currently running 18.04 HWE EDGE (Linux 5.0.0-15-generic) compiled with:

CONFIG_TIGON3=m
CONFIG_TIGON3_HWMON=y

Tempted to turn off HWMON.

[ 1.314002] tg3 0000:01:00.0 eth0: Tigon3 [partno(BCM95762) rev 5762100] (PCI Express) MAC address c8:d3:ff:a2:96:e9
[ 1.314915] tg3 0000:01:00.0 eth0: attached PHY is 5762C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
[ 1.315781] tg3 0000:01:00.0 eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[1]
[ 1.316661] tg3 0000:01:00.0 eth0: dma_rwctrl[00000001] dma_mask[64-bit]
[ 1.324241] tg3 0000:01:00.0 eno1: renamed from eth0
[ 6.950429] tg3 0000:01:00.0 eno1: Link is up at 1000 Mbps, full duplex
[ 6.950471] tg3 0000:01:00.0 eno1: Flow control is on for TX and on for RX
[ 6.950475] tg3 0000:01:00.0 eno1: EEE is disabled

Has anybody found a stable fix for this problem?

Chris Schwarz (cschwarz) wrote :

I have not experienced the issue since I started using kernel 4.20.11.

On Fri., May 24, 2019, 9:54 a.m. Shane R. Spencer, <email address hidden> wrote:

> Same issue with HP EliteDesk 705 G2 MINI
>
> Turned off all power saving options in BIOS.
>
> Currently running 18.04 HWE EDGE (Linux 5.0.0-15-generic) compiled with:
>
> CONFIG_TIGON3=m
> CONFIG_TIGON3_HWMON=y
>
> Tempted to turn off HWMON.
>
> [ 1.314002] tg3 0000:01:00.0 eth0: Tigon3 [partno(BCM95762) rev 5762100] (PCI Express) MAC address c8:d3:ff:a2:96:e9
> [ 1.314915] tg3 0000:01:00.0 eth0: attached PHY is 5762C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
> [ 1.315781] tg3 0000:01:00.0 eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[1]
> [ 1.316661] tg3 0000:01:00.0 eth0: dma_rwctrl[00000001] dma_mask[64-bit]
> [ 1.324241] tg3 0000:01:00.0 eno1: renamed from eth0
> [ 6.950429] tg3 0000:01:00.0 eno1: Link is up at 1000 Mbps, full duplex
> [ 6.950471] tg3 0000:01:00.0 eno1: Flow control is on for TX and on for RX
> [ 6.950475] tg3 0000:01:00.0 eno1: EEE is disabled
>
> Has anybody found a stable fix for this problem?

Changed in linux (Ubuntu):
assignee: Kai-Heng Feng (kaihengfeng) → nobody

Kai-Heng Feng (kaihengfeng) wrote :

Latest kernels in Xenial, Bionic, Cosmic and Disco have the following commit:
commit 3a498606bb04af603a46ebde8296040b2de350d1
Author: Sanjeev Bansal <email address hidden>
Date: Mon Jul 16 11:13:32 2018 +0530

    tg3: Add higher cpu clock for 5762.

    This patch has fix for TX timeout while running bi-directional
    traffic with 100 Mbps using 5762.

    Signed-off-by: Sanjeev Bansal <email address hidden>
    Signed-off-by: Siva Reddy Kallam <email address hidden>
    Reviewed-by: Michael Chan <email address hidden>
    Signed-off-by: David S. Miller <email address hidden>

Changed in linux (Ubuntu):
status: Triaged → Fix Released

Paulo Abadie Guedes (paulo.guedes) wrote :

Thank you, Kai-Heng Feng. Really appreciate it.

Currently I'm under a lot of pressure at work, but I will try this in the
next few days to see if it fixes the problem for us. My network still has the
same conditions and my previous kernel versions are still breaking, so it
should be easy to reproduce.
I will write back as soon as I can.

Thank you again,
Paulo

On Tue, Jul 2, 2019, 03:15 Kai-Heng Feng <email address hidden> wrote:

> Latest kernels in Xenial, Bionic, Cosmic and Disco have the following
> commit:
> commit 3a498606bb04af603a46ebde8296040b2de350d1
> Author: Sanjeev Bansal <email address hidden>
> Date: Mon Jul 16 11:13:32 2018 +0530
>
> tg3: Add higher cpu clock for 5762.
>
> This patch has fix for TX timeout while running bi-directional
> traffic with 100 Mbps using 5762.
>
> Signed-off-by: Sanjeev Bansal <email address hidden>
> Signed-off-by: Siva Reddy Kallam <email address hidden>
> Reviewed-by: Michael Chan <email address hidden>
> Signed-off-by: David S. Miller <email address hidden>

luc (glarage) wrote :

I am currently on kernel 5.1.15 and, if I am not mistaken, this commit has been merged since 2018-07-16.
A first speedtest-net run gives me this output:
tg3 0000:03:00.0 eno1: 0: Host status block [00000001:000000d5:(0000:0443:0000):(0000:00bc)]
tg3 0000:03:00.0 eno1: 0: NAPI info [000000d5:000000d5:(0117:00bc:01ff):0000:(050c:0000:0000:0000)]
tg3 0000:03:00.0 eno1: 1: Host status block [00000001:000000d2:(0000:0000:0000):(0000:0000)]
tg3 0000:03:00.0 eno1: 1: NAPI info [000000d2:000000d2:(0000:0000:01ff):0000:(0000:0000:0000:0000)]
tg3 0000:03:00.0 eno1: 2: Host status block [00000001:00000031:(0c44:0000:0000):(0000:0000)]
tg3 0000:03:00.0 eno1: 2: NAPI info [00000031:00000031:(0000:0000:01ff):0c44:(0444:0444:0000:0000)]
A second speedtest-net run gives me this output (and I lost the connection):
tg3 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x1ffffed80 flags=0x0000]
tg3 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x1ffffee40 flags=0x0000]
tg3 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x1ffffedc0 flags=0x0000]
tg3 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x1ffffee00 flags=0x0000]
tg3 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x1ffffee80 flags=0x0000]
tg3 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x1ffffeec0 flags=0x0000]
tg3 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x1ffffef40 flags=0x0000]
tg3 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x1ffffef00 flags=0x0000]
tg3 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x1ffffef80 flags=0x0000]
tg3 0000:03:00.0: AMD-Vi: Event logged [IO_PAGE_FAULT domain=0x0000 address=0x1ffffefc0 flags=0x0000]
(...)
tg3 0000:03:00.0: tg3_stop_block timed out, ofs=1400 enable_bit=2
tg3 0000:03:00.0: tg3_stop_block timed out, ofs=c00 enable_bit=2
tg3 0000:03:00.0: tg3_stop_block timed out, ofs=4800 enable_bit=2
tg3 0000:03:00.0 eno1: Link is down
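
To check whether a dropout is accompanied by IOMMU faults like the ones above, the kernel log can be filtered for them. A minimal sketch, assuming the log lines carry both the "tg3" prefix and the "IO_PAGE_FAULT" tag as in this output (other kernel versions may format AMD-Vi events differently); the function name is made up for illustration.

```shell
# Count AMD-Vi IO_PAGE_FAULT events attributed to tg3 in kernel log text.
# Reads log lines on stdin and prints the number of matching lines.
# Note: like grep -c, the exit status is non-zero when the count is 0.
count_tg3_iommu_faults() {
    grep 'tg3' | grep -c 'IO_PAGE_FAULT'
}
```

Typical use would be "dmesg | count_tg3_iommu_faults" (reading dmesg may require root on some systems). A non-zero count right before "Link is down" suggests the NIC's DMA is being rejected by the IOMMU.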

lspci gives me: 03:00.0 Ethernet controller: Broadcom Inc. and subsidiaries NetXtreme BCM5762 Gigabit Ethernet PCIe (rev 10), and I have an HP 705 G1.

I'm grateful for the effort you put into solving this bug and the many reminders to the Broadcom people.

Brad Figg (brad-figg)
tags: added: cscc

Chi-Thanh Christopher Nguyen (chithanh) wrote :

Still an issue here with Dell Latitude 5495 and Kernel 5.2.7.

I noticed that, much like similar problems I had with Realtek LAN, booting with the iommu=pt kernel parameter helped as a workaround.

An rtl8169 report is here: https://bugzilla.kernel.org/show_bug.cgi?id=14962 (and many others exist)

penalvch (penalvch)
no longer affects: linux (Ubuntu)

Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux (Ubuntu):
status: New → Confirmed

penalvch (penalvch) wrote :

Chi-Thanh Christopher Nguyen, please note:

1) Kernel 5.2.7 is not supported here on Launchpad. Hence, please redirect your inquiry to the relevant maintainer(s) upstream.

2) If you can reproduce the issue with a supported kernel then please file a new report to provide debugging logs via a terminal:
ubuntu-bug linux

Please feel free to subscribe me to it.

affects: linux (Debian) → linux (Ubuntu)
Changed in linux (Ubuntu):
importance: Undecided → High
status: New → Fix Released

Bob Lawrence (pilotbob42) wrote :

Problem still exists with kernel 5.3.0-59-generic, on the same machine I reported on previously, after multiple kernel releases. The only change with more recent kernels is that the connection recovers on its own after a few minutes (as opposed to requiring a reboot). Still, the only workaround that has any effect is to manually set the connection to 100 Mbps and half duplex. Pretty useless for a media server.
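
For reference, pinning the link to 100 Mbps half duplex can be done with ethtool's -s option. The sketch below only builds the command line so it can be inspected before running; the helper name force_100_half_cmd is made up for illustration, and the interface name is whatever your system uses (eno1 in the logs here).

```shell
# Build (but do not run) the ethtool command that pins an interface to
# 100 Mb/s half duplex with autonegotiation off.
# Usage: force_100_half_cmd IFACE
force_100_half_cmd() {
    printf 'ethtool -s %s speed 100 duplex half autoneg off\n' "$1"
}
```

Running the printed command as root (for example via sudo sh -c "$(force_100_half_cmd eno1)") applies the setting until the next reboot or driver reload; it is not persistent on its own.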

Example dmesg output when the problem occurs:

[219326.666826] tg3 0000:03:00.0 eno1: transmit timed out, resetting
[219329.265075] tg3 0000:03:00.0 eno1: 0x00000000: 0x168714e4, 0x50100546, 0x02000010, 0x00000000
[219329.265116] tg3 0000:03:00.0 eno1: 0x00000010: 0xe082000c, 0x00000000, 0xe081000c, 0x00000000
[219329.265125] tg3 0000:03:00.0 eno1: 0x00000020: 0xe080000c, 0x00000000, 0x00000000, 0x225e103c

[many, many, hex dump lines repeated here]

[219329.267191] tg3 0000:03:00.0 eno1: 0x00007030: 0x000e0000, 0x000038d8, 0x00230030, 0x80000000
[219329.267198] tg3 0000:03:00.0 eno1: 0x00007500: 0x00000000, 0x00000000, 0x00000081, 0x00000000
[219329.267203] tg3 0000:03:00.0 eno1: 0x00007510: 0x00000000, 0x7fffffbf, 0x00000000, 0x00000000
[219329.267214] tg3 0000:03:00.0 eno1: 0: Host status block [00000001:000000a6:(0000:0481:0000):(0000:00bd)]
[219329.267222] tg3 0000:03:00.0 eno1: 0: NAPI info [000000a6:000000a6:(006a:00bd:01ff):0000:(067e:0000:0000:0000)]
[219329.267229] tg3 0000:03:00.0 eno1: 1: Host status block [00000001:0000003c:(0000:0000:0000):(090c:0000)]
[219329.267236] tg3 0000:03:00.0 eno1: 1: NAPI info [0000003c:0000003c:(0000:0000:01ff):090c:(010c:010c:0000:0000)]
[219329.267244] tg3 0000:03:00.0 eno1: 2: Host status block [00000001:000000b5:(05d9:0000:0000):(0000:0000)]
[219329.267256] tg3 0000:03:00.0 eno1: 2: NAPI info [000000b5:000000b5:(0000:0000:01ff):05d9:(05d9:05d9:0000:0000)]
[219329.267267] tg3 0000:03:00.0 eno1: 3: Host status block [00000001:00000093:(0000:0000:0000):(0000:0000)]
[219329.267273] tg3 0000:03:00.0 eno1: 3: NAPI info [00000093:00000093:(0000:0000:01ff):045b:(045b:045b:0000:0000)]
[219329.267279] tg3 0000:03:00.0 eno1: 4: Host status block [00000001:00000002:(0000:0000:0a76):(0000:0000)]
[219329.267286] tg3 0000:03:00.0 eno1: 4: NAPI info [00000002:00000002:(0000:0000:01ff):0a76:(0276:0276:0000:0000)]
[219329.370520] tg3 0000:03:00.0: tg3_stop_block timed out, ofs=1400 enable_bit=2
[219329.473173] tg3 0000:03:00.0: tg3_stop_block timed out, ofs=c00 enable_bit=2
[219329.575744] tg3 0000:03:00.0: tg3_stop_block timed out, ofs=4800 enable_bit=2
[219329.578634] tg3 0000:03:00.0 eno1: Link is down

INXI output:

System: Host: Bobs-HTPC Kernel: 5.3.0-59-generic x86_64 bits: 64 Desktop: MATE 1.20.1
           Distro: Ubuntu 18.04.4 LTS
Machine: Device: desktop System: Hewlett-Packard product: HP EliteDesk 705 G1 DM serial: N/A
           Mobo: Hewlett-Packard model: 225E serial: N/A
           BIOS: Hewlett-Packard v: L06 v02.31 date: 08/31/2018
CPU: Quad core AMD A8-7600 Radeon R7 10 Compute Cores 4C+6G (-MCP-) cache: 8192 KB
           clock speeds: max: 3100 MHz 1: 1499 MHz 2: 1524 MHz 3: 1438 MHz 4: 1402 MHz
Graphics: Card: Advanced Micro Devices [AMD/ATI] Kaveri [Radeon R7 Graphics]
         ...

Bob Lawrence (pilotbob42) wrote :

I'll also add that although I had previously tried iommu=soft with no luck, trying iommu=pt as suggested by chithanh in post 70 does seem to work around the issue successfully. I'm not sure why that would be the case, as this is neither a VM nor a VM host, but since adding the parameter to my kernel line and rebooting I've been running for several hours with media continuously streaming. Without the parameter it would only stream for a matter of minutes before dropping the connection.
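
For anyone trying the same workaround, here is a small sketch for confirming that the parameter is active after a reboot. The function name has_iommu_pt is made up for illustration. As far as I understand it, iommu=pt puts the IOMMU into passthrough mode for devices used by the host, which would fit the IO_PAGE_FAULT messages seen earlier in this thread, but that link is an assumption, not a confirmed root cause.

```shell
# Return success if a kernel command line string contains iommu=pt.
# Usage: has_iommu_pt "$(cat /proc/cmdline)"
has_iommu_pt() {
    case " $1 " in
        *" iommu=pt "*) return 0 ;;
        *) return 1 ;;
    esac
}

# To make the parameter permanent on a GRUB-booted Ubuntu system
# (assumption: stock /etc/default/grub layout), append iommu=pt to
# GRUB_CMDLINE_LINUX_DEFAULT, for example
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet splash iommu=pt"
# then run "sudo update-grub" and reboot.
```

After rebooting, has_iommu_pt "$(cat /proc/cmdline)" && echo "iommu=pt is active" shows whether the running kernel actually received the parameter.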

Toan (tpham3783) wrote :

Has anyone applied the patch to the tg3 driver that was shared in comment #13? That one solved the issue for me. If that was the real fix, I'd like to inform the tg3 maintainers about it so that we can have it merged into mainline. Thanks.

tp

Bob Lawrence (pilotbob42) wrote :

I did not apply the patch in #13, but I did try disabling highdma with ethtool (essentially what the patch makes permanent) and that had no effect for me (at least not on the kernel I was using at the time). I did try the patch in #40 and that had no effect for me either. The only thing I've found that keeps my Broadcom 5762 alive without disconnecting is the kernel parameter "iommu=pt". I'm just finally grateful to have found a workaround so I can keep this server wired and not have to rely on its wireless only.
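
For anyone repeating the highdma test: the feature is switched off with "ethtool -K eno1 highdma off" (the same command the udev workaround in the bug description runs), and "ethtool -k eno1" lists the current feature states. A minimal sketch for pulling out just the highdma state from that listing; the helper name highdma_state is made up for illustration.

```shell
# Print the highdma state ("on" or "off") from "ethtool -k IFACE" output.
# Reads the feature listing on stdin.
highdma_state() {
    awk '$1 == "highdma:" { print $2 }'
}
```

Typical use would be "ethtool -k eno1 | highdma_state", run once before and once after the change to confirm that it took effect.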

I can't help but think we are chasing a moving target across so many kernel versions since this issue was first reported.

Janno Sannik (jannoke) wrote :

Just letting you know that "iommu=pt" fixed my problem on an HP EliteDesk 705 G2. There was no point in even benchmarking anything, since I could not even download a 100 MB file over a 300 Mbit/s internet connection. It would lose the connection without any logs. It would, however, recover with ifdown/ifup.

This is not Ubuntu, but (up to date) proxmox-ve v6.2-15 using kernel 5.4.65-1-pve, which is based on Debian.

Tony Eckel (teckel) wrote :

I have an EliteDesk mini 705 G2 running Ubuntu 20.04.2 LTS with the identical issue, and none of the fixes worked.
So it isn't fixed.

What do you need to troubleshoot this?

Kai-Heng Feng (kaihengfeng) wrote :

Tony, please file a new bug.
