tg3 network disconnects during high usage

Bug #615053 reported by David Martin
This bug affects 4 people
Affects          Status      Importance  Assigned to  Milestone
Debian           New         Undecided   Unassigned
linux (Ubuntu)   Incomplete  Undecided   Unassigned

Bug Description

Binary package hint: linux-generic

During periods of high-volume network traffic, the link on the tg3 interface repeatedly goes down and comes back up.

Aug 8 10:46:04 repos kernel: [ 296.226367] tg3: eth1: Link is down.
Aug 8 10:46:06 repos kernel: [ 298.757874] tg3: eth1: Link is up at 1000 Mbps, full duplex.
Aug 8 10:46:06 repos kernel: [ 298.757881] tg3: eth1: Flow control is on for TX and on for RX.
Aug 8 10:46:18 repos kernel: [ 310.227127] tg3: eth1: Link is down.
Aug 8 10:46:20 repos kernel: [ 312.796360] tg3: eth1: Link is up at 1000 Mbps, full duplex.
Aug 8 10:46:20 repos kernel: [ 312.796367] tg3: eth1: Flow control is on for TX and on for RX.
Aug 8 10:46:31 repos kernel: [ 323.818955] tg3: eth1: Link is down.
Aug 8 10:46:34 repos kernel: [ 326.177056] tg3: eth1: Link is up at 1000 Mbps, full duplex.
Aug 8 10:46:34 repos kernel: [ 326.177062] tg3: eth1: Flow control is on for TX and on for RX.
Aug 8 10:46:54 repos kernel: [ 346.834385] tg3: eth1: Link is down.
Aug 8 10:46:57 repos kernel: [ 349.258010] tg3: eth1: Link is up at 1000 Mbps, full duplex.
Aug 8 10:46:57 repos kernel: [ 349.258016] tg3: eth1: Flow control is on for TX and on for RX.
Aug 8 10:47:13 repos kernel: [ 365.228754] tg3: eth1: Link is down.
Aug 8 10:47:15 repos kernel: [ 367.671012] tg3: eth1: Link is up at 1000 Mbps, full duplex.
Aug 8 10:47:15 repos kernel: [ 367.671018] tg3: eth1: Flow control is on for TX and on for RX.
Aug 8 10:47:26 repos kernel: [ 378.229198] tg3: eth1: Link is down.
Aug 8 10:47:28 repos kernel: [ 380.674225] tg3: eth1: Link is up at 1000 Mbps, full duplex.
Aug 8 10:47:28 repos kernel: [ 380.674231] tg3: eth1: Flow control is on for TX and on for RX.
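
For anyone who wants to quantify how often the link flaps, here is a minimal sketch that extracts the "Link is down" events from kernel log lines in the format shown above and reports the gap between consecutive drops. The function name `link_down_gaps` and the default interface `eth1` are assumptions made for illustration; the regular expression is only known to fit the message format in this report.

```python
import re

# Matches kernel log lines such as:
#   Aug 8 10:46:04 repos kernel: [ 296.226367] tg3: eth1: Link is down.
LINK_RE = re.compile(r"\[\s*(\d+\.\d+)\]\s+tg3: (\w+): Link is (down|up)")


def link_down_gaps(lines, iface="eth1"):
    """Return the seconds between consecutive 'Link is down' events for iface."""
    downs = []
    for line in lines:
        m = LINK_RE.search(line)
        if m and m.group(2) == iface and m.group(3) == "down":
            # Use the kernel timestamp (seconds since boot) inside the brackets.
            downs.append(float(m.group(1)))
    return [round(later - earlier, 3) for earlier, later in zip(downs, downs[1:])]


# A few lines copied from the log above.
sample = [
    "Aug 8 10:46:04 repos kernel: [ 296.226367] tg3: eth1: Link is down.",
    "Aug 8 10:46:06 repos kernel: [ 298.757874] tg3: eth1: Link is up at 1000 Mbps, full duplex.",
    "Aug 8 10:46:18 repos kernel: [ 310.227127] tg3: eth1: Link is down.",
    "Aug 8 10:46:31 repos kernel: [ 323.818955] tg3: eth1: Link is down.",
]
gaps = link_down_gaps(sample)
print(gaps)
```

On the full log above, the same function could be fed the lines of the syslog file to see whether the drops recur at a roughly regular interval while the link is saturated.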

lspci snippet:
02:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 11)
03:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 11)

dmesg snippet:
[ 1.009868] eth0: Tigon3 [partno(BCM95721) rev 4101] (PCI Express) MAC address 00:17:a4:eb:4d:1f
[ 1.009871] eth0: attached PHY is 5750 (10/100/1000Base-T Ethernet) (WireSpeed[1])
[ 1.009874] eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[1]
[ 1.009876] eth0: dma_rwctrl[76180000] dma_mask[64-bit]
[ 1.029156] eth1: Tigon3 [partno(BCM95721) rev 4101] (PCI Express) MAC address 00:17:a4:eb:4d:1e
[ 1.029160] eth1: attached PHY is 5750 (10/100/1000Base-T Ethernet) (WireSpeed[1])
[ 1.029162] eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
[ 1.029165] eth1: dma_rwctrl[76180000] dma_mask[64-bit]

uname -a:
Linux repos 2.6.32-24-generic #38-Ubuntu SMP Mon Jul 5 09:22:14 UTC 2010 i686 GNU/Linux

I have confirmed this bug on every Broadcom NetXtreme based server I have, and it is completely repeatable by saturating the network link for more than a few seconds.

This might not be a tg3 bug; it is possible that it is a known firmware bug. Broadcom does not issue firmware fixes to the public; instead they rely on vendors to roll them out.

David Martin (dlmarti) wrote :

I forgot to comment on the nicfwupg.log in the attachment.
That is a log from the HP firmware update software. I strongly suspect that this is a firmware bug common to most Broadcom products manufactured more than a year ago. Although a firmware fix was released and repackaged by Broadcom, the update software does not function (I suspect this is due to a TRUE bug in the existing tg3 module).

affects: linux-meta (Ubuntu) → linux (Ubuntu)
Jeremy Foshee (jeremyfoshee) wrote :

Hi David,

This bug was reported a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue? Can you try with the latest development release of Ubuntu? ISO CD images are available from http://cdimage.ubuntu.com/releases/ .

If it remains an issue, could you run the following command from a Terminal (Applications->Accessories->Terminal). It will automatically gather and attach updated debug information to this report.

apport-collect -p linux 615053

Also, if you could test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

    [This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: needs-kernel-logs
tags: added: needs-upstream-testing
tags: added: kj-triage
Changed in linux (Ubuntu):
status: New → Incomplete
David Martin (dlmarti) wrote :

Unfortunately Jeremy, I do not have the time right now to re-setup this test on a non-LTS release of Ubuntu.

I do know that it still exists with my current version, but I only use LTS versions of Ubuntu.

So re-imaging a set of machines just to run the test, and then re-imaging back to an LTS version, is just too time-consuming.

Jim Sizelove (jims+launchpad) wrote :

We are experiencing a very similar situation. We have a server running Debian 6.0 (Squeeze) as a Xen host with several guests. From time to time the network interface on the host locks up in a cycle of timeouts and trying to reset itself. This has happened five times in the past five weeks, three of those in the past week.

We have a second Xen host server with nearly the same configuration that has not experienced this network issue. It does have a newer firmware version for the tg3 (5721-v3.61) than the server experiencing the problem (5721-v3.55a). It is possible that the Xen guests on the second server do not have the network traffic loads that trigger the problem on the first server.

David, can you compare the firmware versions of your servers? Is it possible this issue has been fixed in the firmware?

Below are some facts about our server that is experiencing the problem.

Here is a snippet from the syslog:
May 22 11:07:07 carndum kernel: [174743.812042] tg3: eth0: transmit timed out, resetting
May 22 11:07:07 carndum kernel: [174743.812084] tg3: DEBUG: MAC_TX_STATUS[00000008] MAC_RX_STATUS[00000000]
May 22 11:07:07 carndum kernel: [174743.812122] tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000000]
May 22 11:07:07 carndum kernel: [174743.913855] tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
May 22 11:07:07 carndum kernel: [174743.936658] tg3: eth0: Link is down.
May 22 11:07:07 carndum kernel: [174743.984229] br0: port 1(eth0) entering disabled state
May 22 11:07:10 carndum kernel: [174747.106573] tg3: eth0: Link is up at 1000 Mbps, full duplex.
May 22 11:07:10 carndum kernel: [174747.106612] tg3: eth0: Flow control is off for TX and off for RX.
May 22 11:07:10 carndum kernel: [174747.106849] br0: port 1(eth0) entering forwarding state

lspci snippet:
0e:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 11)
0f:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 11)

lshw snippet:
*-network
                description: Ethernet interface
                product: NetXtreme BCM5721 Gigabit Ethernet PCI Express
                vendor: Broadcom Corporation
                physical id: 0
                bus info: pci@0000:0e:00.0
                logical name: eth0
                version: 11
                serial: 00:19:bb:eb:e4:1c
                size: 1GB/s
                capacity: 1GB/s
                width: 64 bits
                clock: 33MHz
                capabilities: bus_master cap_list ethernet physical tp 10bt 10bt-fd 100bt 100bt-fd 1000bt 1000bt-fd autonegotiation
                configuration: autonegotiation=on broadcast=yes driver=tg3 driverversion=3.102 duplex=full firmware=5721-v3.55a, ASFIPMI v6.17 latency=0 multicast=yes port=twisted pair speed=1GB/s
                resources: irq:16 memory:dc200000-dc20ffff

David Martin (dlmarti) wrote : Re: [Bug 615053] Re: tg3 network disconnects during high usage

How do I check the firmware version?

I have this on all my machines, all running different versions of Ubuntu
(all are similar hardware).

It only occurs when the card is under 100% utilization for several seconds.
I have recently lowered the chance of this happening by bonding the
multiple interfaces.
Now it only happens rarely.

I also try to limit the bandwidth usage through application setup.

On 05/23/2011 04:22 PM, Jim Sizelove wrote:
> We are experiencing a very similar situation. We have a server running
> Debian 6.0 (Squeeze) as a Xen host with several guests. From time to
> time the network interface on the host locks up in a cycle of timeouts
> and trying to reset itself. This has happened five times in the past
> five weeks, three of those in the past week.
>
> We have a second Xen host server with nearly the same configuration that
> has not experienced this network issue. It does have a newer firmware
> version for the tg3 (5721-v3.61) than the server experiencing the
> problem (5721-v3.55a). It is possible that the Xen guests on the second
> server do not have the network traffic loads that trigger the problem on
> the first server.
>
> David, can you compare the firmware versions of your servers? Is it
> possible this issue has been fixed in the firmware?

Jim Sizelove (jims+launchpad) wrote :

We check the firmware version with lshw. Look for any sections with "-network" in the header. We see the firmware version in the configuration line.
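
To compare firmware versions across several servers, the value can be pulled out of the lshw configuration line with a short script. This is a minimal sketch that assumes the line looks like the one in the lshw snippet earlier in this thread; the function name `firmware_from_lshw` is made up for illustration. On many systems `ethtool -i eth0` also reports a firmware-version field, which may be easier to read directly.

```python
import re


def firmware_from_lshw(config_line):
    """Extract the firmware=... value from an lshw 'configuration:' line."""
    # The value can contain spaces and commas, so capture lazily up to the
    # next " key=" token (or the end of the line).
    m = re.search(r"firmware=(.*?)(?=\s+\w+=|$)", config_line)
    return m.group(1) if m else None


# Configuration line copied from the lshw output posted in this report.
line = (
    "configuration: autonegotiation=on broadcast=yes driver=tg3 "
    "driverversion=3.102 duplex=full firmware=5721-v3.55a, ASFIPMI v6.17 "
    "latency=0 multicast=yes port=twisted pair speed=1GB/s"
)
firmware = firmware_from_lshw(line)
print(firmware)
```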

We have not been able to learn how to get and install a newer version of the firmware. We are now considering using a different network card.

The dropout happens consistently during the mornings, which is typically our busiest time of day. However, when I look at our network traffic history, the loads do not appear higher than at many other times when the network interface stays up.

Ville Liski (hount) wrote :

I can confirm this bug.

"lspci" shows
Ethernet controller: Broadcom Corporation NetXtreme BCM5755M Gigabit Ethernet PCI Express (rev 02)

"modinfo tg3 | grep version" gives
version: 3.102

And Ethernet does stop if I transfer enough data through this particular Ethernet card.

Toan (tpham3783) wrote :

I am affected by this bug on an HP 705 MT Server. It, however, uses a Broadcom 6257 chipset. The bug I am experiencing is not recoverable; I had to reboot the system to get the network online again.

If anyone here needs more info, please let me know.

Diego (gran-diego) wrote :

I'm affected too. HP Microserver N54L. Broadcom BCM95723 rev 10 chipset.
