tg3: reports "eth0: DMA Status error. Resetting chip.", fails to work

Bug #1005699 reported by Paul Collins on 2012-05-28
This bug affects 7 people
Affects           Status        Importance  Assigned to       Milestone
linux (Debian)    Fix Released  Unknown
linux (Ubuntu)    Fix Released  High        Joseph Salisbury
  Precise         Fix Released  High        Joseph Salisbury

Bug Description

I recently upgraded a Hewlett-Packard ProLiant DL385 G1 to precise from lucid, landing on linux-image-3.2.0-24-generic 3.2.0-24.39.

When I rebooted, I was unable to log in over the network. I examined the serial console and found the following (full dmesg attached). This sequence is emitted over and over until "ifdown eth0" is run.

tg3 0000:03:06.0: eth0: DMA Status error. Resetting chip.
tg3 0000:03:06.0: eth0: 0x00000000: 0x164814e4, 0x22b00146, 0x02000010, 0x00804010
<many lines removed - see attached dmesg>
tg3 0000:03:06.0: eth0: 0x00007010: 0x44ea66f0, 0x00014a00, 0x009f0020, 0xa184a053
tg3 0000:03:06.0: eth0: 0: Host status block [00000007:00000001:(0000:0000:0000):(0000:0000)]
tg3 0000:03:06.0: eth0: 0: NAPI info [00000000:00000001:(0000:0000:01ff):0000:(00c8:0000:0000:0000)]
tg3 0000:03:06.0: tg3_stop_block timed out, ofs=4800 enable_bit=2

The problem (or a very similar one) is also present in oneiric's kernel (I tested linux-image-3.0.0-20-server 3.0.0-20.34).
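
For anyone triaging a similar capture, the repeating reset sequence above can be counted from a saved dmesg file. The following is a minimal sketch and not part of the original report: the function name `count_dma_resets` and the sample text are illustrative assumptions, and the pattern only matches the message format quoted above.

```python
import re
from collections import Counter

# Matches lines such as:
#   tg3 0000:03:06.0: eth0: DMA Status error. Resetting chip.
RESET_RE = re.compile(
    r"tg3 (?P<pci>[0-9a-f:.]+): (?P<iface>\w+): DMA Status error\. Resetting chip\."
)


def count_dma_resets(dmesg_text):
    """Return a Counter mapping (pci_address, interface) to the number of resets."""
    counts = Counter()
    for line in dmesg_text.splitlines():
        match = RESET_RE.search(line)
        if match:
            counts[(match.group("pci"), match.group("iface"))] += 1
    return counts


# Hypothetical excerpt built from the lines quoted in this report.
sample = """\
tg3 0000:03:06.0: eth0: DMA Status error. Resetting chip.
tg3 0000:03:06.0: tg3_stop_block timed out, ofs=4800 enable_bit=2
tg3 0000:03:06.0: eth0: DMA Status error. Resetting chip.
"""
print(count_dma_resets(sample))
```

A count that keeps growing between two captures would suggest the same reset loop described here, although other tg3 faults could plausibly produce the same message.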

Paul Collins (pjdc) wrote :
summary: - tg3: reports "eth0: DMA Status error. Resetting chip."
+ tg3: reports "eth0: DMA Status error. Resetting chip.", fails to work

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1005699

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
tags: added: precise
Paul Collins (pjdc) wrote :

It's not possible to run apport-collect as the machine has no network connectivity when the problem is recreated.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Changed in linux (Ubuntu Oneiric):
importance: Undecided → Medium
Changed in linux (Ubuntu Precise):
importance: Undecided → High
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in linux (Ubuntu Oneiric):
status: New → Confirmed
Changed in linux (Ubuntu Precise):
status: New → Confirmed
James Troup (elmo) on 2012-05-29
tags: added: regression-release
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.4 kernel [1] (not a kernel in the daily directory). Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag (only that one tag; please leave the other tags). This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text.

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[1] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.4-quantal/

tags: added: needs-upstream-testing
Changed in linux (Ubuntu Precise):
status: Confirmed → Incomplete
Paul Collins (pjdc) wrote :

The machine boots and gets network with the kernel from v3.4-precise. (Please let me know if you really did want v3.4-quantal. If so, do I need to install the matching linux-image-extra package as well?)

pjdc@prat:~$ cat /proc/version
Linux version 3.4.0-030400-generic (apw@gomeisa) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5.1) ) #201205210521 SMP Mon May 21 09:22:02 UTC 2012

tags: added: kernel-fixed-upstream
removed: needs-upstream-testing
Changed in linux (Ubuntu Precise):
status: Incomplete → Confirmed
Joseph Salisbury (jsalisbury) wrote :

@Paul,

It would be great if we could perform a "Reverse Bisect" to identify the commit that fixed this bug. Would it be possible for you to test a few test kernels?

tags: added: kernel-da-key performing-bisect
Brad Figg (brad-figg) wrote :

@paul,

Can you test the Precise kernel that is currently in -proposed? 3.2.0-25.40

Joseph Salisbury (jsalisbury) wrote :

If you can assist in bisecting, we need to identify which upstream kernel fixed this issue, then bisect that version. To start, could you test the following upstream kernels:

v3.2.18: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.2.18-precise/

v3.3-rc1 http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.3-rc1-precise/

v3.3 final: http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.3-precise/

If the bug exists in all those kernels, we can test some of the early v3.4 kernels. Else, we can narrow it down further.

Thanks in advance

Paul Collins (pjdc) wrote :

Both the kernel from -proposed and kernel-ppa's 3.2.18 evince the same symptoms.

3.3 and 3.3-rc1 from kernel-ppa work fine.

Joseph Salisbury (jsalisbury) wrote :

Thanks for testing, Paul. I will bisect between v3.2 final and v3.3-rc1. I'll post a test kernel shortly.
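
For readers unfamiliar with a "reverse bisect": it is an ordinary binary search over the commit history, except that it looks for the first commit where the bug is gone rather than the first where it appears. The sketch below is a simplified illustration under stated assumptions, not the tooling used in this report: it treats history as a linear list, and `first_fixed`, `is_fixed`, and the commit labels are hypothetical names.

```python
def first_fixed(commits, is_fixed):
    """Binary-search an ordered commit list for the first commit without the bug.

    Assumes commits[0] still shows the bug, commits[-1] does not, and the fix
    is never reverted in between.
    """
    lo, hi = 0, len(commits) - 1  # lo: known broken, hi: known fixed
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_fixed(commits[mid]):
            hi = mid  # fix landed at mid or earlier
        else:
            lo = mid  # fix landed after mid
    return commits[hi]


# Hypothetical linear history in which the fix lands at "c5".
history = [f"c{i}" for i in range(8)]
print(first_fixed(history, lambda commit: int(commit[1:]) >= 5))
```

Each call to `is_fixed` corresponds to one round of building a test kernel and booting it on the affected machine, which is why halving the range at every step matters. Real kernel history is a graph rather than a list; `git bisect` handles that, and recent git versions accept `git bisect start --term-old=broken --term-new=fixed` to make the inverted meaning explicit.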

Joseph Salisbury (jsalisbury) wrote :

Hi Paul,

I posted a test kernel at:
http://people.canonical.com/~jsalisbury/lp1005699

Can you test that kernel and report back if it has the bug or not?

The test kernel is built up to commit:
2ac9d7aaccbd598b5bd19ac40761b723bb675442

Paul Collins (pjdc) wrote :

That kernel works.

Linux version 3.2.0-030200-generic (jsalisbury@tangerine) (gcc version 4.4.3 (Ubuntu 4.4.3-4ubuntu5.1) ) #201206042318 SMP Mon Jun 4 22:21:15 UTC 2012

When I compare tg3.c's dmesg entries between working and non-working kernels, this jumps out:

-tg3 0000:03:06.0: eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[0]
+tg3 0000:03:06.0: eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]

I.e., the failing systems have "TSOcap[1]".
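
The comparison above can be automated across several machines. The sketch below is a minimal illustration, not part of the original report. It flags interfaces whose tg3 feature line shows both ASF[1] and TSOcap[1]; combining the two flags is an assumption drawn from the title of the fix commit quoted later in this thread ("Fix TSO CAP for 5704 devs w / ASF enabled"), and the function name `suspect_interfaces` is hypothetical.

```python
import re

# Matches the tg3 feature summary line, e.g.:
#   tg3 0000:03:06.0: eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
FEATURE_RE = re.compile(
    r"tg3 \S+ (?P<iface>\w+): .*ASF\[(?P<asf>[01])\] TSOcap\[(?P<tso>[01])\]"
)


def suspect_interfaces(dmesg_text):
    """Return interfaces that report both ASF[1] and TSOcap[1]."""
    suspects = []
    for line in dmesg_text.splitlines():
        match = FEATURE_RE.search(line)
        if match and match.group("asf") == "1" and match.group("tso") == "1":
            suspects.append(match.group("iface"))
    return suspects


# Feature lines as seen on a failing kernel (eth0) plus an unaffected port (eth1).
sample = """\
tg3 0000:03:06.0: eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[1]
tg3 0000:03:06.1: eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[1]
"""
print(suspect_interfaces(sample))
```

This check is only a heuristic for the 5704 hardware discussed here; on other tg3 devices the same flag combination may be perfectly valid.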

Joseph Salisbury (jsalisbury) wrote :

Thanks for the update, Paul. I'll kick off the next test kernel build and look deeper into the fact that the failing systems have "TSOcap[1]".

The next test kernel will be built up to:
9753dfe19a85e7e45a34a56f4cb2048bb4f50e27

Mathieu Alorent (kumy) wrote :

Just upgraded a Hewlett-Packard ProLiant DL385 G1 to precise, and ran into the same issue.

Is there something I can do to help?

Brad Figg (brad-figg) wrote :

@Mathieu,

You can test the kernels as Joe posts them.

Joseph Salisbury (jsalisbury) wrote :

The following commit may be a fix for this issue:
cf9ecf4b631f649a964fa611f1a5e8874f2a76db - tg3: Fix TSO CAP for 5704 devs w / ASF enabled

I'll build a precise test kernel with this commit and post it shortly.

Mathieu Alorent (kumy) wrote :

I'll do...

I'm waiting for builds based on 9753dfe19a85e7e45a34a56f4cb2048bb4f50e27 ;)

Joseph Salisbury (jsalisbury) wrote :

I posted a test kernel at:
http://people.canonical.com/~jsalisbury/lp1005699

This test kernel is patched with commit
cf9ecf4b631f649a964fa611f1a5e8874f2a76db

Can you test that kernel and report back if it resolves this bug?

Joseph Salisbury (jsalisbury) wrote :

@Mathieu,

I'll only be building a kernel up to commit 9753dfe19a85e7e45a34a56f4cb2048bb4f50e27 if the patch kernel posted in comment #22 does not fix the bug. The commit I mention in comment #17 is for a bisect, which will no longer be needed if the bug is fixed by the kernel in comment #22.

Paul Collins (pjdc) wrote :

Looks like we have a winner: the patched kernel yields a machine with a functioning NIC.

Linux version 3.2.0-25-generic (root@tangerine) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #40~lp1005699v1 SMP Tue Jun 5 15:16:15 UTC 2012

tg3.c:v3.121 (November 2, 2011)
tg3 0000:03:06.0: PCI INT A -> GSI 28 (level, low) -> IRQ 28
tg3 0000:03:06.0: eth0: Tigon3 [partno(349321-001) rev 2100] (PCIX:133MHz:64-bit) MAC address XX:XX:XX:XX:XX:XX
tg3 0000:03:06.0: eth0: attached PHY is 5704 (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[0])
tg3 0000:03:06.0: eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[0]
tg3 0000:03:06.0: eth0: dma_rwctrl[769f4000] dma_mask[64-bit]
tg3 0000:03:06.1: PCI INT B -> GSI 29 (level, low) -> IRQ 29
tg3 0000:03:06.1: eth1: Tigon3 [partno(349321-001) rev 2100] (PCIX:133MHz:64-bit) MAC address YY:YY:YY:YY:YY:YY
tg3 0000:03:06.1: eth1: attached PHY is 5704 (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[0])
tg3 0000:03:06.1: eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[1]
tg3 0000:03:06.1: eth1: dma_rwctrl[769f4000] dma_mask[64-bit]
tg3 0000:03:06.0: eth0: Link is up at 1000 Mbps, full duplex
tg3 0000:03:06.0: eth0: Flow control is off for TX and off for RX
tg3 0000:03:06.1: PME# enabled

Joseph Salisbury (jsalisbury) wrote :

That's great news, Paul. Thanks for all the help testing!

Mathieu Alorent (kumy) wrote :

this new build works here too.

@Joseph
my comment #21 was in response to #17.
I didn't refresh the page before posting, and saw post #20 (which was posted one minute before #21) after.

Joseph Salisbury (jsalisbury) wrote :

Thanks for testing, Mathieu. I'll submit the commit that fixed this bug for SRU consideration.

Changed in linux (Ubuntu Precise):
assignee: nobody → Joseph Salisbury (jsalisbury)
no longer affects: linux (Ubuntu Oneiric)
Changed in linux (Ubuntu):
assignee: nobody → Joseph Salisbury (jsalisbury)
Paul Collins (pjdc) wrote :

If the SRU is accepted, will this also result in the netboot files on cdimage.ubuntu.com being regenerated?

The bug makes it difficult for us to install precise on this class of machine using our existing auto-install infrastructure, since the installer needs to download the preseed file, download packages from the archive, and talk to the configuration management system, before we have an installation we can log into.

Brad Figg (brad-figg) on 2012-06-07
Changed in linux (Ubuntu Precise):
status: Confirmed → Fix Committed
Adam Conrad (adconrad) wrote :

@Paul, there will be new debian-installer builds with SRU kernels generated in the lead-up to the 12.04.1 release, which would include this fix if and when it gets in.

Adam Conrad (adconrad) wrote :

@Joseph, would it be possible to get a test kernel spun up for powerpc with this patch applied? (If that's too much hassle, I can do my own, I suppose.) I think we're running into the same bug on the distro PPC buildds, and it would be nice to test if this fixes them, so they can continue to be upgraded to precise.

Joseph Salisbury (jsalisbury) wrote :

@Adam, the ppc build completed. I posted the kernel to:
http://people.canonical.com/~jsalisbury/lp1005699/ppc/

It would be great if you could update the bug and confirm if it fixes the bug.

Thanks

Adam Conrad (adconrad) wrote :

Okay, turns out that ross was just suffering from a botched kernel upgrade (when they got a console attached to it, they just found it sitting at a yaboot prompt, oops), and the belief that it was suffering from this bug was conjecture. Turns out that, while it's a 5704, it's not the right sort of 5704 to suffer from the ASF/TSOcap confusion.

Joseph Salisbury (jsalisbury) wrote :

Thanks for the update, Adam.

Luis Henriques (henrix) wrote :

This bug is awaiting verification that the kernel for Precise in -proposed solves the problem (3.2.0-26.41). Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-precise' to 'verification-done-precise'.

If verification is not done by one week from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you!

tags: added: verification-needed-precise
Paul Collins (pjdc) wrote :

Looks good.

$ dmesg | grep tg3 | cut -c16-
tg3.c:v3.121 (November 2, 2011)
tg3 0000:03:06.0: PCI INT A -> GSI 28 (level, low) -> IRQ 28
tg3 0000:03:06.0: eth0: Tigon3 [partno(349321-001) rev 2100] (PCIX:133MHz:64-bit) MAC address XX:XX:XX:XX:XX:XX
tg3 0000:03:06.0: eth0: attached PHY is 5704 (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[0])
tg3 0000:03:06.0: eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] TSOcap[0]
tg3 0000:03:06.0: eth0: dma_rwctrl[769f4000] dma_mask[64-bit]
tg3 0000:03:06.1: PCI INT B -> GSI 29 (level, low) -> IRQ 29
tg3 0000:03:06.1: eth1: Tigon3 [partno(349321-001) rev 2100] (PCIX:133MHz:64-bit) MAC address 00:14:38:4b:2b:0f
tg3 0000:03:06.1: eth1: attached PHY is 5704 (10/100/1000Base-T Ethernet) (WireSpeed[1], EEE[0])
tg3 0000:03:06.1: eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] TSOcap[1]
tg3 0000:03:06.1: eth1: dma_rwctrl[769f4000] dma_mask[64-bit]
tg3 0000:03:06.0: eth0: Link is up at 1000 Mbps, full duplex
tg3 0000:03:06.0: eth0: Flow control is off for TX and off for RX
$ cat /proc/version_signature
Ubuntu 3.2.0-26.41-generic 3.2.19
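
The same verification can be scripted when many hosts need checking. The following is a minimal sketch under stated assumptions, not an official tool: it parses the package version out of a `/proc/version_signature` line and compares it with 3.2.0-26.41, the version reported in this bug as carrying the fix. The names `FIXED`, `parse_signature`, and `has_fix` are hypothetical, and the comparison is only meaningful within the precise 3.2.0 series.

```python
import re

# linux 3.2.0-26.41 is the package version named in this report as fixed.
FIXED = (3, 2, 0, 26, 41)


def parse_signature(signature):
    """Extract (major, minor, patch, abi, upload) from a version signature."""
    match = re.search(r"(\d+)\.(\d+)\.(\d+)-(\d+)\.(\d+)", signature)
    if match is None:
        raise ValueError(f"unrecognised signature: {signature!r}")
    return tuple(int(part) for part in match.groups())


def has_fix(signature):
    """Return True if the kernel version is at least the fixed package version."""
    return parse_signature(signature) >= FIXED


print(has_fix("Ubuntu 3.2.0-26.41-generic 3.2.19"))  # signature quoted above
print(has_fix("3.2.0-24.39"))  # version from the original bug description
```

On a live system the input would come from reading `/proc/version_signature`, as in the command shown above.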

tags: added: verification-done-precise
removed: verification-needed-precise
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 3.2.0-26.41

---------------
linux (3.2.0-26.41) precise-proposed; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1012057

  [ Andy Whitcroft ]

  * [Config] fix up postinst to ensure we know which error is which
    - LP: #1002388
  * [Config] highbank -- commonise filesystems
    - LP: #1000831, #1010463
  * [Config] highbank -- commonise subsystems
    - LP: #1000831, #1010463
  * [Config] highbank -- commonise network protocols
    - LP: #1000831, #1010463
  * [Config] highbank -- commonise input drivers
    - LP: #1000831, #1010463
  * [Config] highbank -- commonise CRYPTO options
    - LP: #1000831, #1010463
  * [Config] highbank -- commonise HID options
    - LP: #1000831, #1010463
  * [Config] highbank -- commonise sensors options
    - LP: #1000831, #1010463
  * [Config] highbank -- commonise EXPORTFS/FHANDLE
    - LP: #1000831, #1010463
  * [Config] highbank -- commonise CONFIG_CRYPTO_LZO
    - LP: #1000831, #1010463
  * [Config] highbank -- commonise ENCRYPTED_KEYS
    - LP: #1000831, #1010463
  * [Config] highbank -- commonise CONFIG_ATALK
    - LP: #1000831, #1010463
  * [Config] highbank -- commonise INET/INET6
    - LP: #1000831, #1010463
  * [Config] highbank -- commonise NLS
    - LP: #1000831, #1010463
  * [Config] highbank -- commonise BLK/CHR
    - LP: #1000831, #1010463
  * [Config] highbank -- commonise PHY settings
    - LP: #1000831, #1010463
  * [Config] highbank -- commonise CRC settings
    - LP: #1000831, #1010463
  * [Config] highbank -- commonise BINFMT settings
    - LP: #1000831, #1010463
  * [Config] highbank -- commonise DM settings
    - LP: #1000831, #1010463
  * [Config] highbank -- commonise RTC_DRV settings
    - LP: #1000831, #1010463
  * [Config] highbank -- commonise KEYBOARD/MOUSE settings
    - LP: #1000831, #1010463
  * [Config] highbank -- commonise USB settings
    - LP: #1000831, #1010463
  * [Config] highbank -- commonise GPIO settings
    - LP: #1000831, #1010463
  * [Config] highbank -- commonise I2C settings
    - LP: #1000831, #1010463
  * [Config] highbank -- commonise numerous subsystem selectors
    - LP: #1000831, #1010463
  * [Config] highbank -- commonise A-C modules missmatches
    - LP: #1000831, #1010463
  * [Config] highbank -- commonise D-F modules missmatches
    - LP: #1000831, #1010463
  * [Config] highbank -- commonise G-I modules missmatches
    - LP: #1000831, #1010463
  * [Config] highbank -- commonise J-L modules missmatches
    - LP: #1000831, #1010463
  * [Config] highbank -- commonise M modules missmatches
    - LP: #1000831, #1010463
  * [Config] highbank -- commonise N-P modules missmatches
    - LP: #1000831, #1010463
  * [Config] highbank -- commonise Q-R modules missmatches
    - LP: #1000831, #1010463
  * [Config] highbank -- commonise S modules missmatches
    - LP: #1000831, #1010463
  * [Config] highbank -- commonise T modules missmatches
    - LP: #1000831, #1010463
  * [Config] highbank -- commonise U-Z modules missmatches
    - LP: #1000831, #1010463

  [ Herton Ronaldo Krzesinski ]

  * SAUCE: fix get_gate_vma call in i386 NX emulation code
    - LP: #1009200

  [ Ike Panhc ]

  * [Config] add...

Changed in linux (Ubuntu Precise):
status: Fix Committed → Fix Released
Changed in linux (Ubuntu):
status: Confirmed → Fix Released
tags: removed: performing-bisect
Changed in linux (Debian):
status: Unknown → New
Changed in linux (Debian):
status: New → Fix Released

The verification of this Stable Release Update has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates, please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Dear All,
I am very new to Linux and do not know how to fix this issue or how to collect and share the error log.

My machine is a 32-bit BenQ and gives a similar error. It goes into a loop and does not boot at all:

DMA Status error, Resetting Chip
CPU stuck for 22 sec.

It then prints lines of memory addresses, repeating for different address locations.

Then again: DMA status error, resetting chip....

Please advise which commands I should use to provide you with the error log so that I can be assisted.
