Networking on Dell PowerEdge R300 with Broadcom Corporation NetXtreme BCM5722 Gigabit Ethernet PCI Express is not functional

Bug #960311 reported by Brendan Donegan on 2012-03-20
This bug affects 1 person
Affects                  Status    Importance   Assigned to
linux (Ubuntu)           Expired   High         Unassigned
linux (Ubuntu Precise)   Expired   High         Unassigned

Bug Description

After installing Precise Beta1 on this server, the network card is not usable and this message appears in dmesg:

[ 1385.620013] tg3 0000:01:00.0: vpd r/w failed. This is likely a firmware bug on this device. Contact the card vendor for a firmware update.

Running ifconfig shows only an inet6 address. The system is not accessible by ssh (obviously) but we were able to retrieve dmesg manually.

The system was PXE installed, so at some level the network hardware is working.

summary: - 'tg3 0000:01:00.0: vpd r/w failed' on Dell PowerEdge R300 with Broadcom
- Corporation NetXtreme BCM5722 Gigabit Ethernet PCI Express
+ Networking on Dell PowerEdge R300 with Broadcom Corporation NetXtreme
+ BCM5722 Gigabit Ethernet PCI Express is not functional
Joseph Salisbury (jsalisbury) wrote :

Hi Brendan,

Do you happen to know if this issue happened with previous Ubuntu versions on this server?

Do you have a way to install additional test kernels on the system? If so, it would be great if you could test the latest mainline kernel:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.3-precise/

Changed in linux (Ubuntu):
importance: Undecided → Medium
tags: added: kernel-da-key precise
Changed in linux (Ubuntu):
importance: Medium → High
Stefan Bader (smb) wrote :

Can you go back and install Oneiric? And then provide data from that? Thanks.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 960311

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
status: Incomplete → New
status: New → Incomplete

Thank you for taking the time to file a bug report on this issue.

However, given the number of bugs that the Kernel Team receives during any development cycle it is impossible for us to review them all. Therefore, we occasionally resort to using automated bots to request further testing. This is such a request.

We have noted that there is a newer version of the development kernel than the one you last tested when this issue was found. Please test again with the newer kernel and indicate in the bug if this issue still exists or not.

You can update to the latest development kernel by simply running the following commands in a terminal window:

    sudo apt-get update
    sudo apt-get upgrade

If the bug still exists, change the bug status from Incomplete to Confirmed. If the bug no longer exists, change the bug status from Incomplete to Fix Released.

If you want this bot to quit automatically requesting kernel tests, add a tag named: bot-stop-nagging.

Thank you for your help, we really do appreciate it.

tags: added: kernel-request-3.2.0-19.30
tags: added: blocks-hwcert

So far I have only got to try with Oneiric and the issue is still happening. I think this system got a firmware upgrade very recently which I didn't know about. It's up to you guys to decide if it's still a bug, but in the meantime I have no way to get full logs for this bug. I can try testing the latest daily and get someone to manually install the upstream kernel though.

Stefan Bader (smb) wrote :

The question to ask is whether this particular machine has been seen running any version of Linux before. If yes, will it still run that version? If no... well, has it run anything? Maybe it is just as the message says: broken hardware. If it has been running some other release before but no longer does, again broken hardware. If it still runs a previous release but not Oneiric or Precise, then we can go further.
And also, is this the only machine of that type, or do we have another one? Maybe without a new firmware...?

I managed to get a run done with the following kernel, which I thought would be the latest daily :

Linux ubuntu 3.2.0-2-generic #5-Ubuntu SMP Mon Nov 28 18:10:23 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

With this kernel the network is working. However, I'm confused by the fact that the version number is older than the one I raised the bug against. I think I might have accidentally used an older image and not in fact the latest daily. This is one extra data point though.
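
For anyone wanting to double-check which of two kernel versions is older, the numeric fields of the version strings can be compared directly. This is only a minimal sketch: it assumes version strings shaped like `3.2.0-2-generic` or `3.2.0-17.27`, and `kernel_key` is a made-up helper name, not part of any Ubuntu tool.

```python
import re


def kernel_key(version):
    """Return the leading numeric fields of a kernel version as a tuple of ints.

    Hypothetical helper: assumes strings such as '3.2.0-2-generic' or '3.2.0-17.27'.
    """
    # Keep only the leading run of numbers separated by '.' or '-'
    match = re.match(r"\d+(?:[.-]\d+)*", version)
    if match is None:
        raise ValueError("unrecognised kernel version: %r" % version)
    return tuple(int(part) for part in re.split(r"[.-]", match.group(0)))


tested = kernel_key("3.2.0-2-generic")  # kernel from the run described above
reported = kernel_key("3.2.0-17.27")    # kernel the bug was reported on
print(tested < reported)                # prints True: the tested kernel is older
```

Tuples compare field by field, so ABI number 2 sorts before 17 even though a plain string comparison would get that wrong.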

Stefan,

To answer your questions, this machine has been working until the beginning of this week. I have attached dmesg log from the last successful instance of the system working.

Please disregard comment #8 - I was initially able to SSH to this system but eventually lost connectivity. At 1358 (seconds after boot I presume) the same message as in the bug appeared again.

It should be noted that this is not a black-and-white case of the card never working. It still works well enough to allow the system to PXE install, and the system is accessible for a few minutes after boot (the firmware error is echoed at 1385 in the log, over 20 minutes after booting).
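
The "over 20 minutes" figure can be checked from the dmesg line itself. A small sketch, assuming the bracketed number is seconds since boot (the usual meaning of dmesg timestamps); `dmesg_seconds` is a hypothetical helper name.

```python
import re


def dmesg_seconds(line):
    """Extract the bracketed timestamp from a dmesg line, or None if absent."""
    match = re.match(r"\[\s*(\d+\.\d+)\]", line)
    return float(match.group(1)) if match else None


# Message as quoted in the bug description
line = ("[ 1385.620013] tg3 0000:01:00.0: vpd r/w failed. This is likely a "
        "firmware bug on this device. Contact the card vendor for a firmware update.")

seconds = dmesg_seconds(line)
print("%.1f minutes after boot" % (seconds / 60))  # prints "23.1 minutes after boot"
```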

description: updated

Now successfully tested with the latest development kernel (3.2.0-19.30). The same error still appears in dmesg after 1338 seconds. I need to ask the lab engineer who has local access to attach the dmesg log from this system.

The firmware has not been updated on this system recently so can probably be ruled out as a cause.

Stefan Bader (smb) wrote :

To summarize things as I understand them:
- The machine was ok about a week ago
- There has been no change in BIOS/or firmware
- The card fails after a certain uptime and is not accessible anymore
- This happens on all Precise kernels as tested now and even on an Oneiric kernel (which
  previously was ok)

This really sounds like a hardware problem. Unfortunately the NICs are likely onboard. Since the initial dmesg shows some sign of life (interrupt assignment) from the second interface, have you tried setting that one up as a test?

VPD seems to be "vital product data", and from the symptoms it seems that the communication between driver and card dies at some point, and when the driver attempts a recovery it fails to obtain certain hardware information, as would be expected if the firmware running on the card had crashed for some reason.
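
One way to probe whether the VPD can still be read from userspace is the `vpd` attribute that the kernel exposes in sysfs for PCI devices that advertise it. The following is a rough sketch only: it assumes the attribute exists for this card, that the script runs as root, and that a failed read surfaces as an `OSError`; `read_vpd` is a hypothetical helper, and a wedged device might behave differently.

```python
import os


def read_vpd(pci_address, sysfs_root="/sys/bus/pci/devices"):
    """Return the raw VPD bytes for a PCI device, or None if they cannot be read.

    Hypothetical helper; sysfs_root is a parameter only so the function
    can be exercised against a fake directory tree.
    """
    path = os.path.join(sysfs_root, pci_address, "vpd")
    try:
        with open(path, "rb") as handle:
            return handle.read()
    except OSError:
        # Missing attribute, permission denied, or an I/O error from the device
        return None


data = read_vpd("0000:01:00.0")  # PCI address taken from the dmesg message
print("VPD readable" if data else "VPD not readable")
```

Running this shortly after boot and again once the error has appeared would show whether the VPD read is what stops working, though it says nothing about why.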

I will work with the lab engineer to see if the firmware can be upgraded and/or the card replaced. If that doesn't fix it we can take it from there.

If I reboot the system then this problem never occurs; the network card stays up indefinitely.

<brendand> pinky, what was updated exactly?
<pinky> brendand: there was just a BIOS update, from 1.4.3 to 1.5.2
<pinky> i was hoping for a NIC firmware update, but there wasn't one (unless it was rolled in with the BIOS)

I have not tested whether this fixes the bug; however, the BIOS was marked as an urgent update.

It seems that the firmware update fixed this issue. Apologies for the delay, this should have been updated earlier.

I got the same problem with BIOS 1.5.2

I haven't tested NETW_FRMW_LX_R319248.BIN yet. How can I determine the current firmware of the NIC?

Never mind, just noticed I got
Package version: 6.4.5
Installed version: 3.08.0

gonna fire the firmware-upgrade later on
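
On the question of finding the NIC's current firmware: `ethtool -i <interface>` reports a `firmware-version` field for most drivers, tg3 included. A minimal sketch of pulling that field out follows; the helper names are made up, the interface name is whatever the card is called on your system, and the sample text is illustrative only, not captured from the machine in this bug.

```python
import subprocess


def parse_ethtool_info(text):
    """Turn 'key: value' lines from `ethtool -i` into a dict."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")  # split at the first colon only
        if sep:
            info[key.strip()] = value.strip()
    return info


def nic_firmware(interface):
    """Return the firmware-version string reported for an interface, if any."""
    output = subprocess.check_output(["ethtool", "-i", interface])
    return parse_ethtool_info(output.decode()).get("firmware-version")


# Illustrative sample with assumed values, not real output from this server
sample = (
    "driver: tg3\n"
    "version: 3.121\n"
    "firmware-version: 5722-v3.08\n"
    "bus-info: 0000:01:00.0\n"
)
print(parse_ethtool_info(sample)["firmware-version"])  # prints "5722-v3.08"
```

On the real system, `nic_firmware("eth0")` (assuming the interface is named eth0) would give the value to compare before and after the firmware upgrade.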

I'm now experiencing this problem again with the Precise release version.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Joseph Salisbury (jsalisbury) wrote :

@Brendan,

So did this issue go away for some kernel versions, then re-appear? Or are you just getting back to a specific test case now that the final release is available?

@Joseph,

I can't say for sure - finding an older image could be tricky. Maybe if you could provide me a link to a kernel from around the end of March, I could test that.

Joseph Salisbury (jsalisbury) wrote :

@Brendan,

All of the kernel versions for Precise can be found here:
https://launchpad.net/ubuntu/precise/+source/linux

After selecting a specific version, look for the builds section on the following page. Then click the desired arch to access the .deb download.

It looks like the 3.2.0-17.27 kernel was released on March 2nd:
https://launchpad.net/ubuntu/+source/linux/3.2.0-17.27

Ara Pulido (ara) on 2012-06-19
tags: added: regression-release

@Joseph,

I tested with the Precise -proposed kernel as suggested on IRC yesterday and it still happens:

https://launchpad.net/ubuntu/+source/linux/3.2.0-26.41

Let me know what to try next.

Thanks,

Joseph Salisbury (jsalisbury) wrote :

It would also be good to know if this bug is already fixed upstream. Could you test the latest mainline kernel, which can be downloaded from:

http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.5-rc3-quantal/

tags: added: needs-upstream-testing
Ara Pulido (ara) wrote :

Marking as Incomplete, as Brendan needs to test the upstream kernel

Changed in linux (Ubuntu):
status: Confirmed → Incomplete
Ara Pulido (ara) wrote :

Have we tested this with Quantal? Why test the upstream one if we haven't tested with Quantal yet?

I've confirmed this still happens in Quantal and will test the upstream kernel as soon as possible

Changed in linux (Ubuntu Precise):
importance: Undecided → Medium
status: New → Incomplete
importance: Medium → High

Hi,

I got around to testing with the upstream kernel and unfortunately it seems to be worse. I only have KVM access and when I reboot into the upstream kernel I lose access to the system completely, so I reckon it hasn't booted. If you think it will be productive I can go back and test the old kernel from when I first reported the bug to see if the firmware upgrade we did fixed it.

Ok, this still happens with linux-image-3.2.0-17-generic_3.2.0-17.27, which is the kernel it was originally reported on. I had thought the firmware upgrade we did got rid of the bug, but somehow it's come back. I'm going to check with Oneiric to make sure this isn't just a hardware issue.

This is still somehow happening in Oneiric. I'm starting to think we'll never get to the bottom of this one - maybe we're at a dead end.

Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu) because there has been no activity for 60 days.]

Changed in linux (Ubuntu):
status: Incomplete → Expired
Launchpad Janitor (janitor) wrote :

[Expired for linux (Ubuntu Precise) because there has been no activity for 60 days.]

Changed in linux (Ubuntu Precise):
status: Incomplete → Expired