8086:294c Intel NIC driver e1000e not claiming HW

Bug #1072722 reported by Mark Bidewell
This bug affects 1 person
Affects: linux (Ubuntu)
Status: Invalid
Importance: Medium
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

The Intel e1000e driver does not appear to be claiming an Intel 82566DC-2 NIC. It was working fine until sometime after 12.04.1. 12.10 also exhibits this behavior. The Realtek wireless card I installed as a backup works fine. I built the Intel driver on 12.10 with Version 2.1.14 from http://downloadcenter.intel.com/Detail_Desc.aspx?DwnldID=15817 but it still did not help.
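
One way to confirm that a driver is "not claiming" a device is to look at the device's `driver` symlink in sysfs. The following is a minimal sketch, not part of the original report: the function name `find_pci_devices` and its `sysfs_root` parameter are made up for illustration, and it assumes the standard Linux sysfs layout under `/sys/bus/pci/devices`. The IDs 8086:294c come from the bug title.

```python
import os
from pathlib import Path


def find_pci_devices(vendor, device, sysfs_root="/sys/bus/pci/devices"):
    """Return (pci_address, driver_name_or_None) for each matching device.

    `sysfs_root` is a parameter only so the function can be pointed at a
    test directory; on a real system the default is the usual location.
    """
    matches = []
    root = Path(sysfs_root)
    if not root.is_dir():
        return matches
    for dev in sorted(root.iterdir()):
        try:
            # sysfs stores the IDs as hex strings such as "0x8086"
            ven = int((dev / "vendor").read_text().strip(), 16)
            did = int((dev / "device").read_text().strip(), 16)
        except (OSError, ValueError):
            continue
        if (ven, did) != (vendor, device):
            continue
        # A bound device has a "driver" symlink; an unclaimed one has none.
        link = dev / "driver"
        driver = os.path.basename(os.readlink(link)) if link.is_symlink() else None
        matches.append((dev.name, driver))
    return matches


if __name__ == "__main__":
    for address, driver in find_pci_devices(0x8086, 0x294C):
        print(address, driver or "unclaimed")
```

If the device is listed but reported as unclaimed, the kernel still sees the hardware and the problem lies in driver binding; if nothing is listed at all, the device is not being enumerated on the bus in the first place.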

ProblemType: Bug
DistroRelease: Ubuntu 12.10
Package: linux-image-3.5.0-17-generic 3.5.0-17.28
ProcVersionSignature: Ubuntu 3.5.0-17.28-generic 3.5.5
Uname: Linux 3.5.0-17-generic x86_64
NonfreeKernelModules: nvidia
ApportVersion: 2.6.1-0ubuntu6
Architecture: amd64
AudioDevicesInUse:
 USER PID ACCESS COMMAND
 /dev/snd/controlC0: cbidewell 1710 F.... pulseaudio
Date: Mon Oct 29 09:33:26 2012
HibernationDevice: RESUME=UUID=bd24c89a-e8d0-4552-851e-a550cb201fd6
InstallationDate: Installed on 2012-10-24 (4 days ago)
InstallationMedia: Ubuntu 12.10 "Quantal Quetzal" - Release amd64 (20121017.5)
MarkForUpload: True
ProcEnviron:
 TERM=xterm
 PATH=(custom, no user)
 XDG_RUNTIME_DIR=<set>
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcFB:

ProcKernelCmdLine: BOOT_IMAGE=/vmlinuz-3.5.0-17-generic root=/dev/mapper/cbdesktop-root ro quiet splash
RelatedPackageVersions:
 linux-restricted-modules-3.5.0-17-generic N/A
 linux-backports-modules-3.5.0-17-generic N/A
 linux-firmware 1.95
RfKill:
 0: phy0: Wireless LAN
  Soft blocked: no
  Hard blocked: no
SourcePackage: linux
UpgradeStatus: No upgrade log present (probably fresh install)
dmi.bios.date: 07/15/2009
dmi.bios.vendor: Intel Corp.
dmi.bios.version: DPP3510J.86A.0572.2009.0715.2346
dmi.board.asset.tag: Base Board Asset Tag
dmi.board.name: DP35DP
dmi.board.vendor: Intel Corporation
dmi.board.version: AAD81073-209
dmi.chassis.type: 3
dmi.modalias: dmi:bvnIntelCorp.:bvrDPP3510J.86A.0572.2009.0715.2346:bd07/15/2009:svn:pn:pvr:rvnIntelCorporation:rnDP35DP:rvrAAD81073-209:cvn:ct3:cvr:

Revision history for this message
Mark Bidewell (mbidewel) wrote :

Adding additional logging from lshw

Brad Figg (brad-figg)
Changed in linux (Ubuntu):
status: New → Confirmed
Revision history for this message
penalvch (penalvch) wrote :

Mark Bidewell, thank you for reporting this and helping make Ubuntu better. Regarding your Bug Description:
>" I built the latest Intel driver but it exhibited the same problems"

Could you please provide the specific latest intel driver URL and version?

As well, do you know which kernel specifically from 12.04.1 first demonstrated this problem?

Thank you for your understanding.

Helpful bug reporting tips:
https://help.ubuntu.com/community/ReportingBugs

summary: - Intel NIC driver e1000e not claiming HW
+ 8086:294c Intel NIC driver e1000e not claiming HW
tags: added: needs-upstream-testing regression-release
Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete
Revision history for this message
Mark Bidewell (mbidewel) wrote :

Unfortunately, I cannot pinpoint the exact kernel. I went back and tried a 12.04.1 LiveCD and the bug was there, so I could be mistaken about when it first appeared. It happened once a few weeks ago and a reboot brought the NIC back; then this week eth0 disappeared and has only sporadically returned. What I know for sure is that I installed 12.04 on that box within a week of release and had no problems until recently. I also wonder if this could be a firmware or udev issue. While researching this, I ran across forum posts reporting similar issues with older kernels in other distros:

http://us.generation-nt.com/answer/gentoo-user-interface-eth0-does-not-exist-e1000e-e1000-help-204597341.html
https://bbs.archlinux.org/viewtopic.php?id=99571

The intel driver I built on 12.10 was Version 2.1.14 from http://downloadcenter.intel.com/Detail_Desc.aspx?DwnldID=15817

Unfortunately, I cannot rule out hardware failure. I will try to find an older version to test.

penalvch (penalvch)
tags: added: precise
removed: regression-release
description: updated
Revision history for this message
Mark Bidewell (mbidewel) wrote :

I have tried to reproduce this with Lucid, which had previously worked correctly, but it exhibited the same bug. My only conclusion is that this is a hardware failure.

Changed in linux (Ubuntu):
status: Incomplete → Invalid
Mark Bidewell (mbidewel)
Changed in linux (Ubuntu):
status: Invalid → Incomplete
Revision history for this message
Mark Bidewell (mbidewel) wrote :

I did a little more research, and there were cases in the past where e1000e could cause a HW failure, so I don't know if this is invalid or not. It appears that in the 2.6.27 series there were bugs in the e1000e driver which could corrupt its EEPROM. Could this bug have resurfaced? If I go into the BIOS Setup, eth0 will work until the system is shut down.

https://bugzilla.redhat.com/show_bug.cgi?id=459202

Any thoughts where to go from here?

Revision history for this message
penalvch (penalvch) wrote :

Mark Bidewell, if the EEPROM has indeed been corrupted, you may have a recourse. As per http://news.opensuse.org/2008/10/16/intel-e1000e-corruption-fixed-already-in-opensuse-111-beta2-with-exception-of-debug-vanilla-kernels/ :
>"Karsten Keil has developed a way to fix broken e1000e eproms. Please contact him at <email address hidden> in case you need to recover from this bug."

I would reach out to Karsten and see what could be investigated. As well, please share here what specifically was done to recover the EEPROM if this was indeed corrupted.

Revision history for this message
Mark Bidewell (mbidewel) wrote :

I have attempted to reach out to Karsten. Unfortunately, his @suse.de address is no longer valid. I will update if/when I hear back. What is clear is that this problem is hardware related, because reverting to an older, previously working kernel does not fix it. What is unclear is whether a kernel bug created the issue or it was a coincidence.

Revision history for this message
penalvch (penalvch) wrote :

Mark Bidewell, you could try his new E-Mail address:
keil <at> b1-systems.de

Revision history for this message
Mark Bidewell (mbidewel) wrote :

Thanks, I will do that. I was looking in Launchpad to see if there were any similar bugs. Bug 1014490 mentions in the comments that SuperMicro Tech Support suggested an EEPROM flush.

Revision history for this message
Mark Bidewell (mbidewel) wrote :

I have gotten in contact with Karsten and sent him some data. I took lspci and syslog snapshots in both the working and nonworking conditions, and there are differences which may prove enlightening, so I am including them here.

Revision history for this message
Mark Bidewell (mbidewel) wrote :

Ran some more tests. After noting the BIOS setup correlation, I replaced the BIOS battery and reflashed the BIOS. This did not appear to have any effect.

I just tried creating a modprobe conf file with "alias eth0 e1000e". After a reboot the NIC was back and functioning.

Revision history for this message
Mark Bidewell (mbidewel) wrote :

Unfortunately, the modprobe "fix" turned out to be a mirage, and the card stopped working after the computer had been shut down.

Revision history for this message
Mark Bidewell (mbidewel) wrote :

Some more observations from working on the issue: when the NIC is working, lspci reports that it has the "bus master" flag, MSI is enabled, and the IRQ is typically 46. When it is not working, lspci reports no "bus master" flag, MSI is disabled, and the IRQ is 20. Syslog always reports e1000e looking at IRQ 46. In the failure case, the card remains unclaimed and the mei driver claims IRQ 46.

This raises two questions:
1) Why do the settings change and are both valid?
2) Why does e1000e always report IRQ 46 when lspci reports IRQ 20?

Any ideas on how to investigate further?
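
The working/non-working comparison described above can be made repeatable by extracting the same fields from saved `lspci -v` snapshots. This is a minimal sketch under stated assumptions: `summarize_lspci` is a hypothetical helper, it expects a single device stanza in the usual `lspci -v` text layout (a `Flags:` line, an `MSI: Enable+/-` capability line, and an optional `Kernel driver in use:` line), and the sample stanza in the code is invented for illustration rather than taken from the attachments on this report.

```python
import re


def summarize_lspci(stanza):
    """Pull bus-master, IRQ, MSI and driver state out of one `lspci -v` stanza."""
    flags_match = re.search(r"^\s*Flags:\s*(.*)$", stanza, re.MULTILINE)
    flags = [f.strip() for f in flags_match.group(1).split(",")] if flags_match else []

    irq = None
    for flag in flags:
        m = re.fullmatch(r"IRQ (\d+)", flag)
        if m:
            irq = int(m.group(1))

    # "MSI: Enable+" means enabled, "MSI: Enable-" means disabled.
    # The pattern deliberately does not match "MSI-X: Enable...".
    msi = re.search(r"\bMSI: Enable([+-])", stanza)
    driver = re.search(r"Kernel driver in use:\s*(\S+)", stanza)

    return {
        "bus_master": "bus master" in flags,
        "irq": irq,
        "msi_enabled": (msi.group(1) == "+") if msi else None,
        "driver": driver.group(1) if driver else None,
    }


# Hypothetical sample stanza, NOT copied from this bug's attachments.
SAMPLE = """\
00:19.0 Ethernet controller: Intel Corporation 82566DC-2 Gigabit Network Connection
\tFlags: bus master, fast devsel, latency 0, IRQ 46
\tCapabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
\tKernel driver in use: e1000e
"""

if __name__ == "__main__":
    print(summarize_lspci(SAMPLE))
```

Running the helper on one snapshot taken in each state and comparing the two dictionaries shows at a glance which of the four fields changed, which is easier to track across reboots than reading the raw output by eye.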

penalvch (penalvch)
Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Revision history for this message
Mark Bidewell (mbidewel) wrote :

I am including Karsten's response for the record here. I sent him some additional info (recorded above), but have not heard back.

Hi Mark,

On 03.11.2012 03:05, Mark Bidewell wrote:
> I have a DP35DP mainboard with the 82566DC-2 onboard NIC. I installed
> Ubuntu 12.04 with Kernel 3.2. At some point in that series, the NIC
> stopped being recognized (eth0 would show no such device) moving to
> earlier or later versions did not solve the issue. I am trying to
> determine if this is just a HW failure or if the EEPROM bug from
> 2.6.27 could have resurfaced.

Hmm, I do not think so, but it could be related.
The bug with 2.6.27 was not a single bug; it was a combination of
different issues.

1. The BIOS area holding the e1000e firmware and PCI settings was not
write protected.
2. A bug in combination with a race condition in some kernel component
caused writes to random memory locations.
3. If the write happened on the BIOS flash control area, it caused
an erase of the e1000e BIOS section. On the next reboot the card was no
longer detected by the system, because the PCI IDs of the device were
set to 0:0.

This was fixed by several changes; for example, the e1000e driver now
sets bits which write protect the firmware until the next power
cycle. (And at least the race condition/random memory write bug was
fixed as well.)

So it would be of interest whether lspci -v still shows the device when
it is in the failure mode.

> The symptoms are different but what
> makes me suspicious that this could be an EEPROM issue is that if the
> BIOS Setup is accessed (even if no changes are made) the card will
> work flawlessly until the system is powered off for 5-10 minutes.

Hmm, I would suggest checking the RTC/NVRAM mainboard battery and
replacing it (usually a 3V 2032 lithium cell).

>
> Do you think this could be an EEPROM problem?
>

I cannot rule out that the EEPROM was damaged. As far as I remember,
there were 2 copies of this area, so maybe one got corrupted and
if you go via BIOS setup the other area is used. Which area is in use
may be saved in the BIOS NV data, so if the battery is weak this data
could be lost.

Best Regards

Revision history for this message
Mark Bidewell (mbidewel) wrote :

I got in contact with the e1000e driver maintainers. After trying some of their suggestions, it looks like a case of bad HW. Thanks for your help.

Changed in linux (Ubuntu):
status: Confirmed → Invalid