Ubuntu
linux package

14e4:165f tg3 eth1: transmit timed out, resetting on BCM5720

Bug #1331513 reported by wonko on 2014-06-18

This bug affects 18 people

Affects		Status	Importance	Assigned to	Milestone
	linux (Ubuntu)	Won't Fix	High	Unassigned

Bug Description

We have a problem with Dell PowerEdge machines that have the Broadcom 5720 chip. We have this problem on generation 12 systems, across different models (R420, R620), with several combinations of BIOS firmware, Lifecycle firmware, etc. We see this on several versions of the Linux kernel, ranging from 3.2.x up to 3.11, with several versions of the tg3 driver, including a manually compiled latest version (3.133d) loaded in a 3.11 kernel. The latest machine where we can reproduce the problem has Ubuntu Precise installed, but we also see this behaviour on Debian machines. We run Xen on it, running HVM hosts on it. Storage is handled over iSCSI, and it is the iSCSI interface on which we can trigger this bug reproducibly. We have the impression it also happens on other interfaces, but there we don't have a solid, reproducible case.

All this info points in the direction of the tg3 driver and/or the hardware below it not handling certain datastreams or data patterns correctly, and finally crashing the system. It seems unrelated to the kernel version, the Xen version, the number of VMs running, the firmware versions and revisions, etc.

We have been trying to pinpoint this for over a year now, being unable to actually create a scenario where we could reproduce this. As of this week, we finally found a specific setup where we could trigger the error within a reasonable time.

The error is triggered by running a certain VM on the Xen stack and, inside that VM, importing a mysqldump into a running mysql. The VM has its traffic on an iSCSI volume, so this effectively generates a datastream over the eth1 interface of the machine. Within a short amount of time, the system crashes in two steps. We first see a timeout in the tg3 driver on the eth1 interface (dmesg output section attached). This sometimes repeats two or three times, and finally, in step two, the machine freezes and reboots.

While debugging, we noticed that the bug goes away when we disable sg offloading with ethtool.
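
For reference, a minimal sketch of the workaround described above, assuming the affected interface is eth1 as in this report. `ethtool -K` is the standard way to change offload settings; the `set_offload` helper and the `DRY_RUN` switch are illustrative names and are not part of the original report.

```shell
#!/bin/sh
# Illustrative helper (not from the original report): switch one offload
# feature on or off with ethtool. Needs root unless DRY_RUN=1 is set, in
# which case the command is only printed.
set_offload() {
    iface=$1
    feature=$2   # e.g. sg, tso, gso
    state=$3     # on or off
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "ethtool -K $iface $feature $state"
    else
        ethtool -K "$iface" "$feature" "$state"
    fi
}

# Usage: disable scatter-gather on the iSCSI interface, then check the result.
#   set_offload eth1 sg off
#   ethtool -k eth1 | grep scatter-gather
```

Settings made this way do not survive a reboot or a driver reload, so they would have to be reapplied each time the interface comes up.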

If you need any additional info, feel free to ask.

ProblemType: Bug
DistroRelease: Ubuntu 12.04
Package: linux-image-3.11.0-19-generic 3.11.0-19.33~precise1
ProcVersionSignature: Ubuntu 3.11.0-19.33~precise1-generic 3.11.10.5
Uname: Linux 3.11.0-19-generic x86_64
AlsaDevices:
total 0
crw-rw---T 1 root audio 116, 1 Jun 18 16:36 seq
crw-rw---T 1 root audio 116, 33 Jun 18 16:36 timer
AplayDevices: Error: [Errno 2] No such file or directory
ApportVersion: 2.0.1-0ubuntu17.6
Architecture: amd64
ArecordDevices: Error: [Errno 2] No such file or directory
AudioDevicesInUse: Error: command ['fuser', '-v', '/dev/snd/seq', '/dev/snd/timer'] failed with exit code 1:
CRDA: Error: command ['iw', 'reg', 'get'] failed with exit code 1: nl80211 not found.
Date: Wed Jun 18 16:47:27 2014
HibernationDevice: RESUME=UUID=f3577e02-64e3-4cab-b6e7-f30efa111565
InstallationMedia: Ubuntu-Server 12.04.4 LTS "Precise Pangolin" - Release amd64 (20140204)
MachineType: Dell Inc. PowerEdge R420
MarkForUpload: True
PciMultimedia:

ProcFB:

ProcKernelCmdLine: placeholder root=UUID=bbc71780-90bf-4647-b579-e48d5d8c2bce ro vga=0x317
RelatedPackageVersions:
linux-restricted-modules-3.11.0-19-generic N/A
linux-backports-modules-3.11.0-19-generic N/A
linux-firmware 1.79.12
RfKill: Error: [Errno 2] No such file or directory
SourcePackage: linux-lts-saucy
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 01/20/2014
dmi.bios.vendor: Dell Inc.
dmi.bios.version: 2.1.2
dmi.board.name: 0JD6X3
dmi.board.vendor: Dell Inc.
dmi.board.version: A00
dmi.chassis.type: 23
dmi.chassis.vendor: Dell Inc.
dmi.modalias: dmi:bvnDellInc.:bvr2.1.2:bd01/20/2014:svnDellInc.:pnPowerEdgeR420:pvr:rvnDellInc.:rn0JD6X3:rvrA00:cvnDellInc.:ct23:cvr:
dmi.product.name: PowerEdge R420
dmi.sys.vendor: Dell Inc.

Tags:

wonko (bernard-0) wrote on 2014-06-18:

dmesg output (tg3 timeout section and register dumps) (35.6 KiB, text/plain)
AcpiTables.txt (293.7 KiB, text/plain; charset="utf-8")
BootDmesg.txt (76.6 KiB, text/plain; charset="utf-8")
CurrentDmesg.txt (58.2 KiB, text/plain; charset="utf-8")
Dependencies.txt (1.9 KiB, text/plain; charset="utf-8")
IwConfig.txt (663 bytes, text/plain; charset="utf-8")
Lspci.txt (109.3 KiB, text/plain; charset="utf-8")
Lsusb.txt (416 bytes, text/plain; charset="utf-8")
ProcCpuinfo.txt (17.8 KiB, text/plain; charset="utf-8")
ProcEnviron.txt (94 bytes, text/plain; charset="utf-8")
ProcInterrupts.txt (66.5 KiB, text/plain; charset="utf-8")
ProcModules.txt (3.7 KiB, text/plain; charset="utf-8")
UdevDb.txt (358.8 KiB, text/plain; charset="utf-8")
UdevLog.txt (522.6 KiB, text/plain; charset="utf-8")
WifiSyslog.gz (81.5 KiB, application/x-gzip)

Launchpad Janitor (janitor) wrote on 2014-06-18:

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux-lts-saucy (Ubuntu):
status:	New → Confirmed

Kent Baxley (kentb) on 2014-06-18

Changed in dell-poweredge:
status:	New → Incomplete

Kent Baxley (kentb) wrote on 2014-06-18:

Please also try with the lts-trusty kernels, ported to precise, which are based on 3.13:

sudo apt-get install linux-generic-lts-trusty linux-headers-generic-lts-trusty linux-image-generic-lts-trusty

wonko (bernard-0) wrote on 2014-06-18:

Same behaviour, it crashes in the same way, in the expected time-interval.

A bit more info.

root@hostname:~# uname -a
Linux hostname 3.13.0-29-generic #53~precise1-Ubuntu SMP Wed Jun 4 22:06:25 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
root@hostname:~# dmesg | grep tg3
[ 65.567717] tg3.c:v3.134 (Sep 16, 2013)
[ 65.580967] tg3 0000:02:00.0 eth0: Tigon3 [partno(BCM95720) rev 5720000] (PCI Express) MAC address f0:1f:af:e9:33:da
[ 65.581141] tg3 0000:02:00.0 eth0: attached PHY is 5720C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
[ 65.581308] tg3 0000:02:00.0 eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
[ 65.581463] tg3 0000:02:00.0 eth0: dma_rwctrl[00000001] dma_mask[64-bit]
[ 65.592762] tg3 0000:02:00.1 eth1: Tigon3 [partno(BCM95720) rev 5720000] (PCI Express) MAC address f0:1f:af:e9:33:db
[ 65.592909] tg3 0000:02:00.1 eth1: attached PHY is 5720C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
[ 65.593058] tg3 0000:02:00.1 eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
[ 65.593196] tg3 0000:02:00.1 eth1: dma_rwctrl[00000001] dma_mask[64-bit]
[ 65.614306] tg3 0000:08:00.0 eth2: Tigon3 [partno(BCM95720) rev 5720000] (PCI Express) MAC address 00:0a:f7:52:73:aa
[ 65.614441] tg3 0000:08:00.0 eth2: attached PHY is 5720C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
[ 65.614570] tg3 0000:08:00.0 eth2: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
[ 65.614695] tg3 0000:08:00.0 eth2: dma_rwctrl[00000001] dma_mask[64-bit]
[ 65.634161] tg3 0000:08:00.1 eth3: Tigon3 [partno(BCM95720) rev 5720000] (PCI Express) MAC address 00:0a:f7:52:73:ab
[ 65.634296] tg3 0000:08:00.1 eth3: attached PHY is 5720C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
[ 65.634426] tg3 0000:08:00.1 eth3: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
[ 65.634551] tg3 0000:08:00.1 eth3: dma_rwctrl[00000001] dma_mask[64-bit]
[ 75.038881] tg3 0000:02:00.0 eth0: Link is up at 1000 Mbps, full duplex
[ 75.038904] tg3 0000:02:00.0 eth0: Flow control is off for TX and off for RX
[ 75.038908] tg3 0000:02:00.0 eth0: EEE is disabled
[ 77.449876] tg3 0000:02:00.1 eth1: Link is up at 1000 Mbps, full duplex
[ 77.449879] tg3 0000:02:00.1 eth1: Flow control is off for TX and off for RX
[ 77.449881] tg3 0000:02:00.1 eth1: EEE is disabled

I wasn't able to catch the transmit timeout, as the system rebooted too soon. However, from the looks of it, it was the same behaviour as on the other kernels.

As a side note, before changing kernels and rebooting, I was running the test again with both sg and gso off, and it didn't crash within the normal should-crash timespan (3.11 -23 kernel). So sg and/or gso seem to make a difference. I'll restart the test with only sg disabled for now.

Brad Figg (brad-figg) wrote on 2014-06-18: Status changed to Confirmed

This change was made by a bot.

Changed in linux (Ubuntu):
status:	New → Confirmed

penalvch (penalvch) on 2014-06-19

tags:	added: saucy

penalvch (penalvch) wrote on 2014-06-19: Re: tg3 eth1: transmit timed out, resetting on BCM5720

wonko, could you please test the latest upstream kernel available from the very top line at the top of the page (not the daily folder) following https://wiki.ubuntu.com/KernelMainlineBuilds ? It will allow additional upstream developers to examine the issue. Once you've tested the upstream kernel, please comment on which kernel version specifically you tested. If this bug is fixed in the mainline kernel, please add the following tags:
kernel-fixed-upstream
kernel-fixed-upstream-VERSION-NUMBER

where VERSION-NUMBER is the version number of the kernel you tested. For example:
kernel-fixed-upstream-3.16-rc1

This can be done by clicking on the yellow circle with a black pencil icon next to the word Tags located at the bottom of the bug description. As well, please remove the tag:
needs-upstream-testing

If the mainline kernel does not fix this bug, please add the following tags:
kernel-bug-exists-upstream
kernel-bug-exists-upstream-VERSION-NUMBER

As well, please remove the tag:
needs-upstream-testing

Once testing of the upstream kernel is complete, please mark this bug's Status as Confirmed. Please let us know your results. Thank you for your understanding.

tags:	added: bios-outdated-2.1.3
tags:	added: trusty
Changed in linux (Ubuntu):
importance:	Undecided → High
status:	Confirmed → Incomplete
summary:	- tg3 eth1: transmit timed out, resetting on BCM5720
	+ 14e4:165f tg3 eth1: transmit timed out, resetting on BCM5720

wonko (bernard-0) wrote on 2014-06-19:

A little in-between info.

I saw you added the bios-outdated tag. As this is a quick fix, I've updated the BIOS first to 2.1.3 (the latest, as far as I know), and re-ran the test. It still crashes (this was with the 3.13.0-29-generic kernel). I guess this would remove the bios-outdated tag.

The tests with the scatter-gather disabled ran all night, without a single crash. This might be a strong clue.

I'll now upgrade to the latest mainline, do the tests, and report back here, so we can verify whether it is fixed or not in the latest release.

wonko (bernard-0) wrote on 2014-06-19:

Installed

http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.16-rc1-utopic/linux-image-3.16.0-031600rc1-generic_3.16.0-031600rc1.201406160035_amd64.deb

which gives me ...

root@hostname:~# uname -a
Linux hostname 3.16.0-031600rc1-generic #201406160035 SMP Mon Jun 16 04:36:15 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
root@hostname:~# dmesg | grep tg3
[ 65.271832] tg3.c:v3.137 (May 11, 2014)
[ 65.283708] tg3 0000:02:00.0 eth0: Tigon3 [partno(BCM95720) rev 5720000] (PCI Express) MAC address f0:1f:af:e9:33:da
[ 65.283968] tg3 0000:02:00.0 eth0: attached PHY is 5720C (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[1])
[ 65.284202] tg3 0000:02:00.0 eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
[ 65.284472] tg3 0000:02:00.0 eth0: dma_rwctrl[00000001] dma_mask[64-bit]
... same for eth1 up to eth3
[ 74.311982] tg3 0000:02:00.0 eth0: Link is up at 1000 Mbps, full duplex
[ 74.312008] tg3 0000:02:00.0 eth0: Flow control is off for TX and off for RX
... same for eth1

Same behaviour, it crashes within the same time-interval. I'll apply the correct tags.

Changed in linux (Ubuntu):
status:	Incomplete → Confirmed
tags:	added: kernel-bug-exists-upstream kernel-bug-exists-upstream-3.16-rc1

penalvch (penalvch) wrote on 2014-06-19:

wonko, the issue you are reporting is an upstream one. Could you please report this problem through the appropriate channel by following the instructions _verbatim_ at https://wiki.ubuntu.com/Bugs/Upstream/kernel ?

Please provide a direct URL to your e-mail to the mailing list once you have made it so that it may be tracked.

Thank you for your understanding.

Changed in linux (Ubuntu):
status:	Confirmed → Triaged
tags:	added: latest-bios-2.1.3 removed: bios-outdated-2.1.3

wonko (bernard-0) wrote on 2014-06-19:

#10

Mail sent, this is the thread:

http://marc.info/?l=linux-netdev&m=140318618603899&w=2

Jared Dominguez (jared-dominguez) wrote on 2014-06-19:

#11

Bernard,

I've some quick suggestions from my colleague Narendra here at Dell to pass on. First try disabling TSO. If that doesn't work, disable both GSO and TSO. ethtool can be used to disable both GSO and TSO.

Also, I can't see what firmware version you have on the NIC. Can you verify that it's the latest? It's available at http://www.dell.com/support/home/us/en/19/Drivers/DriversDetails?driverId=VPY7Y

Narendra also noted that there have been a couple of related bugs recently, though I believe those are in 3.16-rc1:

tg3: Fix data corruption on 5725 with TSO <https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/net/ethernet/broadcom/tg3.c?id=0f0d15100a8ac875bdd408324c473e16d73d3557>

tg3: Expand 4g_overflow_test workaround to skb fragments of any size <https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/drivers/net/ethernet/broadcom/tg3.c?id=375679104ab3ccfd18dcbd7ba503734fb9a2c63a>

wonko (bernard-0) wrote on 2014-06-19:

#12

Hey Daniel,

Thanks for the input.

The bug is "fixed" when disabling SG. Haven't tried disabling only GSO or TSO, will do that next. However, we would like to use the device to the full capacity; and use the offloading capabilities.

The firmware is listed below; I updated it through the Lifecycle tool 5 days ago:

root@hostname:~# ethtool -i eth1
driver: tg3
version: 3.137
firmware-version: FFV7.8.53 bc 5720-v1.32
bus-info: 0000:02:00.1
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
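
When comparing several hosts, the firmware string can be pulled out of this output mechanically. The sketch below is only an illustration: it parses the `key: value` layout shown above, and the `ethtool_field` name is made up for this example.

```shell
#!/bin/sh
# Read `ethtool -i IFACE` output on stdin and print the value of one field.
# Fields are split on ": " so values that contain colons (bus-info) survive.
ethtool_field() {
    awk -F': ' -v key="$1" '$1 == key { print $2 }'
}

# Usage:
#   ethtool -i eth1 | ethtool_field firmware-version
#   ethtool -i eth1 | ethtool_field version
```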

Both patches you mention are from mid-2013, seems to me like they would normally have made it in the kernel by now...

Jared Dominguez (jared-dominguez) wrote on 2014-06-19:

#13

Whoops, even if you hadn't updated the firmware, I gave you the wrong link anyway.

I did see that the bug is fixed if you disable SG. I'm trying to isolate the cause, and my colleague gave those suggestions. The goal is to fix the bug, not leave you permanently without offloading capabilities.

Yes, as I'd mentioned, the cited patches would be in 3.16-rc1, which you tested.

Changed in dell-poweredge:
importance:	Undecided → High
status:	Incomplete → Triaged

wonko (bernard-0) wrote on 2014-06-19:

#14

No worries, I'm running the TSO and GSO tests now (TSO disabled first, GSO still on). I just mentioned it in case you missed it, as it sits a bit hidden in between the other stuff.

wonko (bernard-0) wrote on 2014-06-19:

#15

Either one disabled doesn't make a difference, I can still crash the system. It seems only SG is affecting the bug.
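
The one-feature-at-a-time test described in this thread can be written down as a small plan generator. This is a sketch under assumptions: the interface name eth1 comes from the report, while `isolation_plan` is a hypothetical helper that only prints commands and runs nothing. As far as I understand the kernel's feature dependencies, switching sg off also forces TSO off, so the "sg off" run is probably "sg and tso off" in practice; verify with `ethtool -k` rather than taking that for granted.

```shell
#!/bin/sh
# Print one command line per test run. Each run first re-enables all three
# offloads, then switches exactly one of them off, so results are comparable.
isolation_plan() {
    iface=$1
    for feature in sg tso gso; do
        echo "ethtool -K $iface sg on tso on gso on && ethtool -K $iface $feature off"
    done
}

# Results reported in this thread so far:
#   sg off  -> ran all night without a crash
#   tso off -> still crashes
#   gso off -> still crashes
#
# Usage: isolation_plan eth1
```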

Kent Baxley (kentb) wrote on 2014-06-20:

#16

Bernard,

I may also want to go ahead and get a soup-to-nuts list of how to set up a machine to reproduce this. I've got an R620 and access to an iscsi volume in my lab. I'm trying currently to reproduce on a simpler scale, but, not having much luck at this time (which is no surprise).

wonko (bernard-0) wrote on 2014-06-20:

#17

Kent,

I've been searching for a long time for a case to trigger this. I have no idea why loading that specific SQL dump into that specific mysql server, on that specific VM, triggers the bug. We tried to rebuild the situation ourselves, and the only way to reproduce this is by doing it on the running VM itself. To make things even harder, the SQL dump is private client data, and I cannot just hand it over.

I'll be searching myself for a scenario to reproduce this on other hardware, but I'm not really confident I might find this in the short term.

Kent Baxley (kentb) wrote on 2014-06-20:

#18

Understood. I didn't realize it was *that* specific. How big is the database you are importing and is there anything special about the VM itself? I'll try and get this as close as possible.

wonko (bernard-0) wrote on 2014-06-21:

#19

The dump itself creates about 50 tables, spread over 3 databases, and adds up to about 2.5 GB of data. So it isn't the smallest one, but not a big one either. We import the data in about 7 to 10 minutes.

The VM is a setup we have running many times. It is the second node of a mysql master-master setup, running Debian wheezy with MariaDB and an idle MongoDB. For us this is a "classic" setup; we have it running for plenty of clients, in several variations, with databases ten times the size... It is specifically the import that triggers the crash, as the machine can sit idle for a very long time. It must be something in the datastream generated to/from the iSCSI target when MariaDB flushes the data to disk.

Server version: 5.5.34-MariaDB-1~wheezy-log mariadb.org binary distribution

On a side note, it seems like I must have missed a dot or a comma in my report to kernel.org, as it gets no attention or replies... Does anyone have an idea how I can get some attention there?

penalvch (penalvch) wrote on 2014-06-21:

#20

wonko, as advised in https://wiki.ubuntu.com/Bugs/Upstream/kernel did you CC the maintainer and the last person to submit commits to tg3?

Kent Baxley (kentb) wrote on 2014-06-23:

#21

Working on a reproducer. I've been able to set up the following:

A Xen HVM guest booting off of an iscsi-attached LVM LUN.
Guest is running Wheezy.
Dom0 is 12.04 on a PowerEdge R620, running the 3.11-based LTS kernels.
The eth1 NIC is being used for the iscsi traffic. This is one of the tg3-based onboard NICs.

I've so far been importing different databases into the running mysql database on the debian guest. No crashes, yet, but I'll keep playing around with different database dumps to see if I can trigger anything.

Kent Baxley (kentb) wrote on 2014-06-23:

#22

@wonko,

If there's any more specific information about the Xen VM, please let me know. What I'm looking for are any special storage or network drivers that the VM is using outside of what the standard Xen VM would use. If there's nothing special about the VM in that respect, then I'll continue with what I have. Also, the Xen config file for the Debian guest you're using might be helpful. No problem if you can't get it for me, though.

wonko (bernard-0) wrote on 2014-06-24:

#23

@christopher: I've mailed the maintainers.

@kent: Your setup looks okay. I have made a similar setup and wasn't able to reproduce the problem myself, even when loading the exact database dump. Moreover, as the failing VM was a member of a two-way mysql master-master setup, we've installed a third machine, acting as a slave, and this one runs just fine.

There must be something different to that specific VM, but I have no idea what exactly it is. And while I can easily test the issue, and make it crash within 10...15 minutes, it is never exactly at the same point, so it isn't the exact content of the dump, generating a specific flush to the disk with dataset xyz in it.

Additionally, I can tell you the iSCSI target is also a linux machine, running debian squeeze, and LIO as iSCSI target. There is no authentication, ACL is done with simple IP based restrictions. The config of the VM is below:

kernel = "/boot/domU/xenu-linux-3.2-amd64"
ramdisk = "/boot/domU/xenu-initrd-3.2-amd64"
name = "crashing-db-host"
vif = [ 'mac=02:00:bc:5d:64:62,bridge=xenbr201' ]
disk = [ 'iscsi:storage:crashing-db-host:root,xvda1,w', 'iscsi:local:crashing-db-host:swap,xvda2,w' ]
memory = 6144
maxmem = 16384
vcpus = 6
cpu_weight = 6144
extra = "4 mem=16384M xencons=hvc0 rootflags=uqnoenforce"
on_poweroff = 'destroy'
on_reboot = 'restart'
on_crash = 'restart'

The "iscsi" block device isn't really special, it just searches our storage systems for the correct iSCSI target, and takes the LVM volume from that system. Once it finds the correct volume, it just returns the exact LVM path, like you would use a phy:... with the path.

The kernel used in the VM is vmlinuz-3.2.0-4-amd64, but we noticed this doesn't make a difference.

wonko (bernard-0) wrote on 2014-07-14:

#24

Just a small update, but without any good news. I never got a single reply or inquiry from the kernel.org developers, neither from the mailing list nor from the tg3 driver developers. It seems I've hit a dead end, as I'm out of options (except for asking Dell for Intel NICs for all my servers until this gets fixed, but I doubt that'll go through ...).

As an extra, we had a server crash again last week, with exactly the same behaviour as the other ones, so this isn't a single isolated case; other server-VM-iSCSI combos are affected as well.

If any of you has any suggestion how this can be addressed further, any help would be appreciated.

Jared Dominguez (jared-dominguez) wrote on 2014-08-06:

#25

It was suggested to me that this may be relevant: https://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=4d8fdc95c60e90d84c8257a0067ff4b1729a3757

wonko (bernard-0) wrote on 2014-08-11:

#26

We tried the above patch and ran the test again, but we still get a crash/reboot at approximately the same time...

penalvch (penalvch) on 2014-08-12

no longer affects:	linux-lts-saucy (Ubuntu)
tags:	added: bios-outdated-2.1.3

Jared Dominguez (jared-dominguez) wrote on 2014-09-02:

#27

Bernard,

We still haven't been able to reproduce the bug, but if you have time, please see if the following patches applied to your kernel help:

[PATCH net v5 1/4] tg3: Limit minimum tx queue wakeup threshold
http://marc.info/?l=linux-netdev&m=140934527707734&w=2

[PATCH net v5 2/4] tg3: Fix tx_pending check for MAX_SKB_FRAGS
http://marc.info/?l=linux-netdev&m=140934526407725&w=2

[PATCH net v5 3/4] tg3: Move tx queue stop logic to its own function
http://marc.info/?l=linux-netdev&m=140934525507718&w=2

[PATCH net v5 4/4] tg3: Fix tx_pending checks for tg3_tso_bug
http://marc.info/?l=linux-netdev&m=140934524407695&w=2

wonko (bernard-0) wrote on 2014-12-13:

#28

All,

It's been a while since I've investigated this further, but I have just tried the same with the mainline 3.18 kernel. The result stays the same: the machine hangs within a couple of minutes if I do the exact same steps...

penalvch (penalvch) on 2014-12-13

tags:	removed: bios-outdated-2.1.3

Toan (tpham3783) wrote on 2015-03-25:

#29

@Wonko,

I've updated the driver to the latest version from broadcom.com, version 3.137h, and I am still experiencing a similar issue. However, when the driver crashes, the machine remains usable about 70% of the time; the other 30% of the time it is totally locked up.

The NIC I am using is new; its product ID is 1687, and it has an external PHY rev. of 5762C. I am able to replicate it with the following method:

1. Start the tg3 machine
2. from another machine: start 5 sessions, repetitively copy (scp with public key authentication) a 70 meg file back and forth to the tg3 machine in each session. (not sure if this is necessary)
3. create a 1GB file on the tg3 machine, with something like dd if=/dev/urandom of=/my/test/file bs=1024 count=$((1024*100))
4. from another machine: repetitively scp copy that 1GB file from the tg3 machine. This can be done with something like:

while [ 0 ]; do
scp -i /my/scp/private.key <email address hidden>:/my/test/file /
done;

I've done it about 40 times, and the tg3 machine will crash anywhere from 5 minutes to 50 minutes into the test.
I am still scratching my head over this bug, and as a matter of fact, we are thinking about switching to an Intel or Realtek NIC, if we can not get this resolved soon.
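
The steps above could be scripted roughly as follows. This is a hedged sketch, not Toan's actual script: `make_test_file`, `pull_loop` and the remote name are placeholders. One detail worth noting is that the dd command quoted in step 3 writes 1024 × 102400 bytes, which is 100 MiB, not 1 GB; the helper below takes the size in MiB explicitly so the intended size is unambiguous.

```shell
#!/bin/sh
# Create a file of SIZE_MIB mebibytes of random data at PATH.
make_test_file() {
    path=$1
    size_mib=$2
    head -c "$((size_mib * 1048576))" /dev/urandom > "$path"
}

# Fetch REMOTE (user@host:/path form) COUNT times, discarding the data,
# and stop at the first failed transfer.
pull_loop() {
    remote=$1
    count=$2
    i=0
    while [ "$i" -lt "$count" ]; do
        scp -q "$remote" /dev/null || return 1
        i=$((i + 1))
    done
}

# Usage (host and path are placeholders):
#   on the tg3 machine:  make_test_file /my/test/file 1024
#   on the other side:   pull_loop user@tg3-host:/my/test/file 40
```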

Emby Server (apps-z) wrote on 2015-06-22:

#30

I am experiencing network issues with my Dell 12th-generation system as well. My issues are with ssh. I have not tested for Toan's bug. Anyhow, every time I connect via ssh I get random "Connection reset by peer" errors. I have tried all the suggestions and fixes mentioned in this bug report, but nothing seems to make the network connection stable or operational.

Emby Server (apps-z) wrote on 2015-06-22:

#31

I forgot to mention: I don't experience this bug on CentOS, only on Ubuntu and other Debian-based distributions.

Fernando Soto (fernando-soto) wrote on 2015-09-01:

#32

I was able to consistently reproduce this issue in Debian Wheezy by setting up two Dell PowerEdge R620 servers directly connected and doing constant scp transfer of large files back and forth while also setting the interface up & down in a loop until it breaks (while [ true ]; do ip link set ${DEV} down; sleep 1; ip link set ${DEV} up; sleep 9; done).
I have then updated the tg3 driver to version 3.137h, but the issue was still reproducible.
I have tried RedHat 7.1 and it works fine (Kernel: 3.10.0-229.el7.x86_64, tg3 3.137).
Then I have tried Debian Jessie (kernel 3.16, tg3 3.137) and the issue is not reproducible.
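
The link-flapping half of this reproducer, written as a bounded function instead of an endless one-liner. A sketch only: the `ip link set` commands and the 1 s / 9 s sleeps come from the comment above, while `flap_link`, `run` and `DRY_RUN` are names invented for this example.

```shell
#!/bin/sh
# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Take DEV down and up ROUNDS times. DOWN_S and UP_S are the pauses after
# each step and default to the 1 s / 9 s used in the report.
flap_link() {
    dev=$1
    rounds=$2
    down_s=${3:-1}
    up_s=${4:-9}
    i=0
    while [ "$i" -lt "$rounds" ]; do
        run ip link set "$dev" down
        sleep "$down_s"
        run ip link set "$dev" up
        sleep "$up_s"
        i=$((i + 1))
    done
}

# Usage (as root, while scp transfers run in both directions):
#   flap_link eth0 1000
```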

Logs from Debian Wheezy:
Aug 28 14:16:27 bond-111 kernel: [ 519.336495] ------------[ cut here ]------------
Aug 28 14:16:27 bond-111 kernel: [ 519.336505] WARNING: at /build/linux-l1NKWv/linux-3.2.68/net/sched/sch_generic.c:256 dev_watchdog+0xf2/0x151()
Aug 28 14:16:27 bond-111 kernel: [ 519.336508] Hardware name: PowerEdge R620
Aug 28 14:16:27 bond-111 kernel: [ 519.336510] NETDEV WATCHDOG: eth0 (tg3): transmit queue 0 timed out
Aug 28 14:16:27 bond-111 kernel: [ 519.336513] Modules linked in: drbd lru_cache nf_conntrack_tftp nf_conntrack virtio_net virtio_blk virtio_pci virtio_ring virtio kvm bonding sb_edac coretemp snd_pcm crc32c_intel ghash_clmulni_intel aesni_intel snd_page_alloc snd_timer snd soundcore aes_x86_64 edac_core joydev pcspkr shpchp iTCO_wdt iTCO_vendor_support evdev dcdbas aes_generic cryptd processor button thermal_sys acpi_power_meter wmi ext3 mbcache jbd microcode usbhid hid sg sr_mod sd_mod cdrom crc_t10dif ahci libahci libata ehci_hcd megaraid_sas scsi_mod usbcore usb_common tg3 libphy [last unloaded: drbd]
Aug 28 14:16:27 bond-111 kernel: [ 519.336580] Pid: 29511, comm: scp Not tainted 3.2.0-4-amd64 #1 Debian 3.2.68-1+deb7u2
Aug 28 14:16:27 bond-111 kernel: [ 519.336583] Call Trace:
Aug 28 14:16:27 bond-111 kernel: [ 519.336585] <IRQ> [<ffffffff81046dbd>] ? warn_slowpath_common+0x78/0x8c
Aug 28 14:16:27 bond-111 kernel: [ 519.336599] [<ffffffff81046e69>] ? warn_slowpath_fmt+0x45/0x4a
Aug 28 14:16:27 bond-111 kernel: [ 519.336605] [<ffffffff812a8c91>] ? netif_tx_lock+0x40/0x75
Aug 28 14:16:27 bond-111 kernel: [ 519.336609] [<ffffffff812a8e01>] ? dev_watchdog+0xf2/0x151
Aug 28 14:16:27 bond-111 kernel: [ 519.336613] [<ffffffff810525f4>] ? run_timer_softirq+0x19a/0x261
Aug 28 14:16:27 bond-111 kernel: [ 519.336615] [<ffffffff812a8d0f>] ? netif_tx_unlock+0x49/0x49
Aug 28 14:16:27 bond-111 kernel: [ 519.336618] [<ffffffff8104c46a>] ? __do_softirq+0xb9/0x177
Aug 28 14:16:27 bond-111 kernel: [ 519.336622] [<ffffffff813583ec>] ? call_softirq+0x1c/0x30
Aug 28 14:16:27 bond-111 kernel: [ 519.336627] [<ffffffff8100fa91>] ? do_softirq+0x3c/0x7b
Aug 28 14:16:27 bond-111 kernel: [ 519.336630] [<ffffffff8104c6d2>] ? irq_exit+0x3c/0x99
Aug 28 14:16:27 bond-111 kernel: [ 519.336632] [<ffffffff8100f66a>] ? do_IRQ+0x82/0x98
Aug 28 14:16:27 bond-111 kernel: [ 519.336637] [<ffffffff813513ee>] ? common_interrupt+0x6e/0x6e
Aug 28 14:16:27 bond-111 kernel: [ 519.336638] <EOI> [<ffffffff811b43cd>] ? copy_user_generic_string+0x2d/0x40
Aug 28 14:16:27 bond-111 kernel: [ 519.336647] [<ffffffff...
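
To check other machines for the same symptom, the watchdog line in the log above is a convenient marker. The filter below is an illustrative sketch: it matches only the "NETDEV WATCHDOG: ... (tg3): transmit queue" message format shown here, and `watchdog_ifaces` is a made-up name.

```shell
#!/bin/sh
# Read kernel log text on stdin and print each interface that logged a tg3
# transmit-queue watchdog timeout, one name per line, without duplicates.
watchdog_ifaces() {
    sed -n 's/.*NETDEV WATCHDOG: \([^ ]*\) (tg3): transmit queue.*/\1/p' | sort -u
}

# Usage:
#   dmesg | watchdog_ifaces
#   watchdog_ifaces < /var/log/kern.log
```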

Duplicates of this bug

Bug #1152580

Changed in linux (Ubuntu):
status:	Triaged → Incomplete
Changed in dell-poweredge:
status:	Triaged → Won't Fix
Changed in linux (Ubuntu):
status:	Incomplete → Won't Fix
