[intrepid] 2.6.27 e1000e driver places Intel ICH8 and ICH9 gigE chipsets at risk

Bug #263555 reported by Chris Jones on 2008-09-01
This bug affects 14 people
Affects               Status         Importance   Assigned to      Milestone
Linux                 Fix Released   Medium
linux (Fedora)        Fix Released   Medium
linux (Gentoo Linux)  Fix Released   Medium
linux (Mandriva)      Fix Released   Critical
linux (Suse)          Fix Released   Critical
linux (Ubuntu)                       Critical     Tim Gardner
  Intrepid                           Critical     Tim Gardner
linux-lpia (Ubuntu)                  Critical     Amit Kucheria
  Intrepid                           Critical     Amit Kucheria

Bug Description

In some circumstances it appears possible for the 2.6.27-rc kernels to corrupt the NVRAM used by some Intel network parts to store data such as MAC addresses.
This is limited to the new e1000e driver, and reports have only appeared from users of "82566 and 82567 based LAN parts (ich8 and ich9)" (to quote Intel). The reports seem to be isolated to laptops, but it is not clear if this is because desktop/server parts are not vulnerable, or if use cases simply increase the chances of laptop users being hit.

Once this corruption has occurred, recovery may be possible via a BIOS update, but may well require replacement of the hardware. Use of Intel's IBAUTIL.EXE is strongly discouraged, as it will worsen the problem to the point where the network part will no longer appear on the PCI bus.

(this is a new description, the original one was based on too much guesswork. Below are the URLs originally referenced)
(The driver is blacklisted in the latest Ubuntu releases for 2.6.27-rc, so if your network is not working, it is not necessarily damaged; it may just be disabled to prevent any accidents until this bug is solved. Don't worry!)
http://www.blahonga.org/~art/rant.html (search for "em0")
http://<email address hidden>/msg00360.html
http://<email address hidden>/msg00398.html


Chris Jones (cmsj) on 2008-09-01
Changed in linux:
importance: Undecided → Critical
Chris Jones (cmsj) wrote :

I'm wondering if it would be possible for us to patch out the sections of the driver which write to the NVRAM, assuming Intel are not able to make suitable changes before 2.6.27 is released, which prevent this from being possible (e.g. splitting the writing parts out into a separate module which is not loaded by default?)

Ben Collins (ben-collins) wrote :

Removed the regression-2.6.27 tag from this. The 2.6.26 kernel and 2.6.27 kernel have the exact same e1000e driver (one which we downloaded from Intel's e1000 sf.net project).

Still a serious issue, but I don't want it to be classified as a regression.

Chris Jones (cmsj) wrote :

http://marc.info/?t=122038337000003&r=1&w=2 is another interesting thread about this, on linux-netdev.

Hi Chris,

Just an update here in case you missed the chatter in #kernel on Sept 03: Tim has already begun investigating this issue.

Changed in linux:
assignee: nobody → timg-tpi
status: New → Triaged
Yingying Zhao (yingying-zhao) wrote :

We just hit a similar issue while testing Intrepid Alpha5. At first, the LAN worked fine on the x86 system. But after a system hang on the x86_64 system (caused by gfx) on the same machine, we found the Ethernet card no longer worked. "lspci" does not show the correct Ethernet card info, and the x86 system on which e1000e previously worked cannot recognize the card either.

Our investigation is underway now.

Changed in linux:
status: Unknown → Incomplete
Changed in linux:
status: Unknown → Confirmed
Changed in linux:
status: Unknown → Confirmed
Jeffrey Baker (jwbaker) wrote :

This is just my humble opinion, but the Alpha CD downloads should be pulled from the archive. This kernel can partially ruin your hardware, and unsuspecting users shouldn't be able to merrily download it.

Steve Langasek (vorlon) wrote :

Jorge brought this bug to my attention just now; this really needs to be fixed one way or another for beta, even if that would mean blacklisting e1000e altogether until this is resolved. Even with as little as I use the wired ethernet on my laptop, I wouldn't enjoy having to RMA it to fix it after a kernel bug. :/

Changed in linux:
milestone: none → ubuntu-8.10-beta
Colin Watson (cjwatson) wrote :

Jeffrey, we can't afford to do that; we need to be able to test with the Alpha CDs on the wide variety of hardware not affected by this bug, or our development schedules for 8.10 will be seriously compromised. However, I'd be happy to add a warning to the cdimage web pages. Can anyone suggest some text?

Alacrityathome (alacrityathome) wrote :

Colin,

Seems that a warning may be insufficient. I would think most of the folks testing a pre-release may not know they have an e1000e driver or affected NIC.

Maybe blacklist e1000e asap and then re-instate e1000e after a fix is found.

Perhaps have the "warning" state something about the e1000e being temporarily withheld from the pre-release with certain Intel NICs affected.

John

Is Ubuntu willing to risk the liability of distributing software known to destroy hardware?

Scruffynerf (scruffynerf) wrote :

Unless Canonical wants liability for
a) Individual user's destroyed hardware
b) Crippling reputation damages, especially against the 'new to linux' groups
I'd echo the suggestion to pull the liveCD's until this is fixed.

When new Linux users discover permanently corrupted hardware after trying Ubuntu, and this gets out on the wider web, all of Ubuntu's promotion efforts will also be destroyed.

Breaking known good hardware is a problem greater than keeping to a self-imposed delivery schedule.

eentonig (eentonig) wrote :

It's alpha software, people should be considered as being aware that using it might break stuff.

Furthermore, people should be smart enough to read about the known issues prior to installing it.

Yes a warning and blacklisting the e1000 driver should be done, but revoking an alpha because of a (serious) bug just doesn't seem the answer to me, because it blocks you from finding other issues that might bite people when the official release gets out.

Chris Jones (cmsj) wrote :

Colin: FWIW, I think some kind of warning on cdimage and in the alpha release notes seems highly prudent (not because of the bogus liability claims here, but just because it's the good thing to do). I would suggest:

"Due to an unresolved bug in the Linux kernel currently used in Ubuntu 8.10 users with Intel network hardware supported by the e1000e driver should not download and run these images. Doing so may render your network hardware permanently inoperable.
Older Intel network hardware which uses the e1000 driver is not affected by this, however, use of the e1000 driver in older Ubuntu releases is not a reliable indication of which driver will be used by Ubuntu 8.10. Support for hardware which uses a PCI Express bus has been moved from e1000 to e1000e. If in doubt, do not run these images and subscribe to https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555 to receive notifications when the bug is fixed."

Steve: I am not sure exactly where the responsibility for handling this for the Alphas falls (other than being quite sure it's not mine, and suspecting it's yours ;) but I think we should put warnings out fairly prominently, as SuSE has done. The obvious safe default would be to yank e1000e.ko and replace the above warning with something similar which explains why newer Intel network hardware won't work in the Alphas. It's a bit of a nuclear option since there is a lot of this hardware around and the bug mostly seems to be affecting laptops, but since they tend to be doing a lot more "interesting" kernel work (suspending, frequent loading of modules, etc) it could simply be that they are exposing it more easily and server hardware is just as capable of being affected.

For those wishing to discuss this bug, its implications, etc. there is a forum thread which seems more suitable for this, see: http://ubuntuforums.org/showthread.php?t=912666

Arnd (arnd-arndnet) wrote :

> It's alpha software, people should be considered as being aware that using it might break stuff.

That's absolutely ridiculous. I'm being aware that ubuntu alpha or beta can break some stuff (like eating my filesystem or deleting my partitions etc). In fact it already did. However, this is a whole different thing as BREAKING HARDWARE.
To make this clear, we are talking about RMAing laptops and mainboards because of this bug. And we are also talking about reasonably popular hardware. If the position stands that I should expect my hardware to die when I try Ubuntu, I will certainly not try any alpha or beta Ubuntu software ever again.

In my opinion the Alpha should be pulled NOW. Then you can discuss what steps have to be taken to address this problem (e.g. one easy solution: disable e1000 and republish). This won't cost you more than a few days.

Just my 2 cents

Christian Wolf (christianwolf) wrote :

Folks,

I suggest to remove the respective Intrepid AlphaX images from the mirrors ASAP.

Although testing is testing, and everybody knows that there is a risk (I remember a similar issue with Mandrake Linux and CD-Rom drives) and you, as a tester, take a known risk, we also have the responsibility to minimize impact of this issue.

I think only the latest Alpha6 has this flaw?

John Dong (jdong) wrote :

Shall we pull in e1000e-prevent-corruption-of-eeprom-nvm.patch? It seems from the discussion that it isn't a 100% fix (other methods of reaching mmio'ed EEPROM probably exist) but should at least eliminate this disaster scenario of just booting up the distribution causing the card to be hosed.

John Dong wrote:
> Shall we pull in e1000e-prevent-corruption-of-eeprom-nvm.patch? It
> seems from the discussion that it isn't a 100% fix (other methods of
> reaching mmio'ed EEPROM probably exist) but should at least eliminate
> this disaster scenario of just booting up the distribution causing
> the card to be hosed.

no, this patch is for e1000, and has nothing to do with this problem.
Right now, the only reports of this issue are with 82566 and 82567 based
LAN parts (ich8 and ich9).

the eeprom is not MMIO mapped, the registers for accessing it are. I'm
still not clear if a random write to a memory location could corrupt
things, we'll be looking at that today.

Chris Jones (cmsj) on 2008-09-23
description: updated

>http://www.ubuntu.com/testing/intrepid/alpha6

No warning

>http://cdimage.ubuntu.com/releases/intrepid/alpha-6/

No warning

>http://cdimage.ubuntu.com/releases/intrepid/alpha-6/intrepid-desktop-i386.iso

Download works.

How many people test these iso's? How many of them are using an intel motherboard? (10%-40% ?)
How many will ever test again if testing an ISO means you are frying your motherboard.

This isn't a blame game. But if top priority is not removing the alpha, it will very soon be...

We are talking about thousands, if not millions, of laptops and pc's that will be broken beyond repair, if I'm not mistaken...

And some people actually say things like:

>Jeffrey, we can't afford to do that; we need to be able to test with the Alpha CDs on the wide variety of hardware not affected by this bug,

Don't you get it. If you don't pull now, NOBODY WILL TEST THE NEXT VERSION.
There will be no NEXT VERSION because nobody DARES to install it.

>It's alpha software, people should be considered as being aware that using it might break stuff.

Yes, it may corrupt data. [but if it does that beyond its own partition; it should be a big issue as well]
But BREAKING hardware?

I'm quite sure that afterwards some quality control and reflection .. that there will be SOME policy to prevent these mistakes (NOT TAKING THE IMAGE DOWN) ..

But it will be too late.

PULL THE IMAGE: THEN DISCUSS!

abingham (abingham) wrote :

It's been almost a day since discussion on this issue resumed.

The Alpha 6 images are still up with no warning present.

I always assume that an Alpha or Beta release may break things to where I need to reinstall the OS. Battery life could be bad. Etc. But this is literally capable of *destroying* peoples hardware. It's a whole different ball game. Even the LiveCDs are affected, and people testing them can reasonably assume there will be no hardware impact on their system even if it is an Alpha.

These images need to be pulled from availability *now*. Major mirror sites need to be notified.

If the release becomes '8.11' instead of '8.10' because of it, so what. We are talking about destroying motherboards here. Replacing a laptop motherboard can cost > $500.

If this is not dealt with, I will no longer be able to recommend Ubuntu to friends and family. The attitude of 'release on time at all costs' already caused many issues with 8.04, and now people are seriously suggesting continuing distribution of disc images that literally destroy hardware?

Jeffrey Baker (jwbaker) wrote :

There's no reason to be hysterical, but a re-spin of Alpha 6 CDs without the e1000e module may be called for. This is a separate bug, but the recommended workaround of adding "blacklist e1000e" in /etc/modprobe.d/blacklist doesn't work. Somehow, udev or some other thing manages to load it anyway. I had to unlink it.
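
For anyone checking their own configuration, here is a minimal Python sketch that reports which modprobe directives mention a module. It rests on one documented behaviour of modprobe.d: a `blacklist` line only stops alias-based automatic loading, while an `install <module> /bin/false` line is the usual way to block every load path. Whether that difference explains the behaviour described above has not been established in this report. The function name and the sample text are made up for illustration.

```python
def module_directives(conf_text: str, module: str) -> dict:
    """Report which modprobe.d directives in conf_text name the given module."""
    found = {"blacklist": False, "install": False}
    for raw in conf_text.splitlines():
        # Drop trailing comments, then split into whitespace-separated fields.
        fields = raw.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] in found and fields[1] == module:
            found[fields[0]] = True
    return found


# Illustrative file content, not copied from any real system.
SAMPLE = """\
# temporary workaround for LP: #263555
blacklist e1000e
"""

if __name__ == "__main__":
    print(module_directives(SAMPLE, "e1000e"))
```

A result with `blacklist` set but `install` unset means the module can still be loaded explicitly or as a dependency.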

abingham (abingham) wrote :

Intel has ~80% CPU market share and >=70% of the chipset market for their own CPU.

So at least 56% of machines sold are Intel CPUs with Intel chipsets that are susceptible to this bug.

1 in 2 of Ubuntu testers could be vulnerable to this.

I strongly recommend if you are going to test for this bug or haven't seen it
yet on your ich8/9 system, that you RIGHT NOW, do ethtool -e ethX >
savemyeep.txt

Having a saved copy of your eeprom means we can help you write it back to your
system.
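
A saved dump is only useful if it can be read back later. The following Python sketch turns the text written by `ethtool -e` into raw bytes. It assumes the usual layout of that output (an `0xNNNN:` offset followed by hex byte pairs on each line); the sample dump below is invented and is not from an affected machine.

```python
import re

# Matches data lines such as "0x0010:<tabs>ff ff 6b 02 ..." in `ethtool -e` output.
_LINE = re.compile(r"^\s*0x([0-9a-fA-F]+):\s*((?:[0-9a-fA-F]{2}\s*)+)$")


def parse_ethtool_dump(text: str) -> bytes:
    """Return the EEPROM bytes contained in an `ethtool -e` text dump."""
    data = bytearray()
    for line in text.splitlines():
        m = _LINE.match(line)
        if not m:
            continue  # header, separator or blank line
        offset = int(m.group(1), 16)
        if offset != len(data):
            raise ValueError(f"unexpected offset 0x{offset:04x}")
        data.extend(int(pair, 16) for pair in m.group(2).split())
    return bytes(data)


# Invented sample in the assumed format.
SAMPLE = """\
Offset\t\tValues
------\t\t------
0x0000:\t\t00 11 22 33 44 55 00 08 ff ff ff ff ff ff ff ff
0x0010:\t\tff ff ff ff 6b 02 00 00 86 80 4c 10 86 80 ff ff
"""

if __name__ == "__main__":
    eeprom = parse_ethtool_dump(SAMPLE)
    # On these parts the first six bytes are expected to hold the MAC address.
    print(len(eeprom), eeprom[:6].hex(":"))
```

Keeping the original text file as well is still advisable, since it is what the developers have asked for.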

For those who might be interested in testing the Alpha but have an at-risk
machine, is there an accepted workaround that removes the e1000e driver
without jeopardizing the hardware?

okay, let's just use the data we *have* now. What we know is that some
users have reported a corrupt NVM. Intel networking does not have a
current reproduction but is *fully engaged* on trying to solve this
problem. We have only had reports on 82566 and 82567 based machines, no
others. Trying to extrapolate this out to "1 of 2" users is just fear
mongering.

These kernels being released with this problem are still in alpha/beta,
which means our testing audience is smaller, but so is the potential
impact of any problem.

The process is working as far as I can see, we have a set of users that
is reporting the problem, which will help keep the kernels with the
issue from being promoted to full production status.

If you have some useful data to add to this bug, please comment, we're
listening. I think the discussion about pulling alpha cds or whatever
should go to some mailing list, and not be inside this bug.


Tim Gardner (timg-tpi) wrote :

Uploaded module-init-tools_3.3-pre11-4ubuntu10 to temporarily blacklist e1000e.

Harry (harry2o) wrote :

Even if it means making myself look like an idiot: May I also suggest publishing (maybe along with the warnings) hints about how to find out whether your hardware is / will be / might be affected. And when does that eeprom writing actually happen - at boot time, when using the lan interface, or elsewhen?
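
As a partial answer to the first question, a rough Python sketch can scan `lspci` output for the part names Intel has mentioned (82566 and 82567). Matching on the name string is an assumption made for illustration; it does not say when the NVM write happens, and a NIC whose NVM is already damaged may no longer be listed at all. The sample lines are illustrative.

```python
import re

# Part families named in the reports so far.
_AT_RISK = re.compile(r"8256[67]")


def at_risk_nics(lspci_text: str) -> list:
    """Return lspci lines for Ethernet controllers whose name contains 82566 or 82567."""
    hits = []
    for line in lspci_text.splitlines():
        if "Ethernet controller" in line and _AT_RISK.search(line):
            hits.append(line.strip())
    return hits


# Illustrative lspci lines, not taken from any reporter's machine.
SAMPLE = """\
00:02.0 VGA compatible controller: Intel Corporation Mobile GM965/GL960 Integrated Graphics Controller (rev 0c)
00:19.0 Ethernet controller: Intel Corporation 82566MM Gigabit Network Connection (rev 03)
03:00.0 Network controller: Intel Corporation PRO/Wireless 4965 AG or AGN Network Connection (rev 61)
"""

if __name__ == "__main__":
    for hit in at_risk_nics(SAMPLE):
        print("possibly affected:", hit)
```

Running `lspci` itself only reads PCI configuration data, so the check can be done from an older, unaffected release.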

I had the duplicate bug 272630 and consider myself lucky. I had the Intel NIC but had used Alpha 5. I only had the dmesg error and not the hardware eeprom failure. I had Alpha 6 ready to test until I found this bug thread. For folks like me, it will be a good decision to blacklist e1000e pending a resolution. Most 1st time Alpha testers would not be as lucky or have the time to seek out a full bug thread.

>The process is working as far as I can see, we have a set of users that
is reporting the problem, which will help keep the kernels with the
issue from being promoted to full production status.

I'm sorry .. will you buy these people a new laptop? If so, then the process is working.

Try advertising: there is a 50% chance that this will destroy your hardware.
How many testers will you have left? ZERO.

THE IMPLIED RISK OF TESTING ALPHA SOFTWARE IS DATA CORRUPTION.

What's next? The machine blows up and kills people?
Would that be a reason to remove a machine-destroying CD image?

YES, it's not Ubuntu's fault .. the alpha is shipped. The process is working, but the process is not done until somebody QUICKLY removes the image.

No tester signed up for this; and I for sure will not ever put an alpha disc into one of my machines. I'll even wait a couple of months after the release.

It's not that this happened. It's that afterwards official developers consider losing half the hardware of every volunteer who tests a sane and good move. Just part of the process.

This discussion SHOULD NOT BE MOVED TO A MAILING LIST until the CD IMAGE IS GONE.

I know this is not the UBUNTU code of conduct. But neither is WILLFULLY DAMAGING PEOPLE'S PROPERTY.

You know of the BUG, REMOVE THE IMAGE.

Changed in linux:
status: Incomplete → In Progress
Changed in linux:
status: Confirmed → In Progress
M. Salivar (mfsalivar) wrote :

One alpha tester has already been lost, at least partially. When the problem occurred I sucked it up and said to myself, you know, it is alpha. Things like this shouldn't happen, but they do from time to time. Now that I see the indifference of Ubuntu devs to people losing their hardware, and even worse, to the extreme likelihood of more people losing theirs because of a stupidly strict adherence to release schedules, I'll never test an alpha outside of a virtual machine again (your loss, not mine). I'm not sure yet, but I may be through with Ubuntu, period.

You should pull all the current alphas, and quickly release an alpha 7 with the e1000e module removed or an older kernel. It's the only reasonable thing to do. Pulling the alphas and waiting for a fix will cause too many delays, but leaving up the current alphas is just plain immoral.

Thomas McKay (tom-mckay1) wrote :

I for one second the motion to remove the alpha images until they have been prepared so that there is no risk of hardware damage. I would be seriously concerned about the negative effects of ignoring this issue.

There is a liability on Canonical for the distribution of software which causes permanent and irreparable damage. The problem has been identified, and not shielding their customers from this issue is nothing less than WILLFUL NEGLECT.

REMOVE THE IMAGES NOW, and re-issue them when the driver has been blacklisted.

-Zeus- (matthew-momjian) wrote :

Thomas, why do you feel that the present warnings are not enough? I for one feel that they are certainly sufficient warning to people running those chipsets to not download them. Also, this isn't a democracy; we don't vote on whether to pull images or not. That's up to the core developers/Canonical.

Thomas McKay (tom-mckay1) wrote :

Why do I not feel the warnings are enough?

For one, because users who download via BitTorrent will never see those warnings, and, as somebody above noted, not all testers know for certain what hardware they are running. Simply saying "if your computer breaks, tough luck, we warned you" will not garner any respect among Linux users. The alpha testers are doing Canonical a great service, and taking that for granted would be a shame.

I'm not saying it's a democracy; I am just warning of the consequences of ignoring this issue and bricking people's computers.

Ing0R (ing0r) wrote :

I think a warning (with Jesse Brandeburg's advice) should go to *every* tester who is running such a system.
I just read about it on a computer news site and I wish I had got this news from Canonical (maybe via the update manager).

Ing0R

Thomas McKay (tom-mckay1) wrote :

Not to mention the thousands of people who downloaded these images before the warning was posted.

Daniel Kulesz (kuleszdl) wrote :

I was really shocked to see that it took more than one day between the issue becoming apparent and the warning being placed on the website. This is a very serious issue and can cause severe damage to really expensive hardware (i.e. most recent Lenovo ThinkPads like the X200, X301, T400, T500 and so on).

Also, please be aware that some testers might simply take their old download URL in a download manager and increment it from 5 to 6. I therefore really suggest at least moving the images to a different place and replacing the ISO files with text files containing the same warning together with the real download location.

Is there any way to issue a warning to all the testers who are already using or began downloading the ISO? Does the installer query some URL through which the warning could be injected?

The only place where I can find official download links to the
BitTorrent is at http://cdimage.ubuntu.com/releases/intrepid/alpha-6/,
where the warning exists.

Chris Jones (cmsj) wrote :

Please listen to what Jesse said in comment #24. This is a bug report, not a discussion forum.

Calls for ISOs to be pulled, legal claims, accusations and use of capslock should be on the ubuntu-devel or ubuntu-devel-discuss mailing list. For one thing, by posting to those lists your opinion will be seen by a much wider audience.

The people subscribed to this bug have either been affected by this bug (such as myself, I filed this bug), or are trying to fix it.

Yelling at those people (which is what a number of you are doing) will solve nothing. Stop it please. The only relevant discussion here is that which is gathering information about the bug, or attempting to fix it.

I appreciate this is a contentious issue (since my laptop was affected by this), but I want to read about progress, I don't want to read lots of ranting. I also don't want this post to be perceived as negative, or whining, or whatever. I fully sympathise with people who are trying to protect their fellow users from harm, and in that respect I apologise for not shouting more about this bug when it was first uncovered. All I did was make sure as many of the people as possible who could fix it, knew about it.
If you would like to argue with me about this, please do not do it here, email me personally (see my Launchpad overview page for my addresses) or via <email address hidden>.

Michael W. (hotdog003-gmail) wrote :

If we don't pull the images (we should, but I won't comment since it's already being discussed), it might be a good idea to at least make the words "permanently inoperable" on the Alpha 6 testing page in big, bold letters so users have less of a chance to skim over that part.

Think about it: How many times do we read warning labels on the stuff we eat? My point exactly. Having a "WARNING" section on a testing page where people are already expecting things not to work perfectly might not be an accurate indicator of exactly how grave this problem really is.

I think we should do everything in our power to at least let users know what they're dealing with here. Somehow, we've managed to produce a stick of dynamite with a lit fuse. A lot of people are expecting testing images to be imperfect and may skip right over the warning section because they already know the typical "This is just alpha software, hopefully nothing major will happen" lecture that warning sections typically give them. Making "permanently inoperable" in bold letters will make it much more eye-catching than it is now.

Changed in linux:
status: Unknown → Confirmed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 2.6.27-4.6

---------------
linux (2.6.27-4.6) intrepid; urgency=low

  [ Tim Gardner ]

  * Disable e1000e until the NVRAM corruption problem is found.
    - LP: #263555

  [ Upstream Kernel Changes ]

  * Revert "[Bluetooth] Eliminate checks for impossible conditions in IRQ
    handler"

 -- Ben Collins <email address hidden> Tue, 23 Sep 2008 09:53:57 -0400

Changed in linux:
status: Triaged → Fix Released
Jojo (kuzniarpawel) wrote :

according to http://groups.google.com/group/linux.kernel/browse_thread/thread/a5ef7deff8551186/d05c233ecb430178

this bug might be related to xorg and Intel graphics.

I used e1000e for 5 days with a lot of traffic on eth0 and nothing happened (luck?), but I have a T61p with an NV Quadro.

William Grant (wgrant) on 2008-09-24
Changed in linux:
status: Fix Released → In Progress
Changed in linux:
status: Confirmed → Fix Committed
Changed in linux:
status: Unknown → Confirmed

(In reply to comment #91 from Olaf Kirch)
> There's a question whether the NVM we're talking about here is actually larger,
> and is used by components other than the e1000e. If for instance the video BIOS
> maps all of the NVM and, due to some bug, scribbles over parts of it that
> include the e1000e's config space - is there a way to verify this?

the NVM in question is a single part that the entire machine (VGA, BIOS, LAN, Manageability, AHCI, etc) all use.

I couldn't tell you how to verify if something else is mapping over the top of the LAN area of the NVM. The only reports I've heard are that the LAN NVM is corrupted. If you managed to corrupt the BIOS area, the machine wouldn't boot.

Changed in linux:
status: In Progress → Incomplete

(In reply to comment #94 from Egbert Eich)
> Another question came up: does this happen on both 64 and 32 bit installations?

At this point we don't know. At least one reporter I worked with was running 32 bit.

(In reply to comment #96 from Jesse Brandeburg)

> The only reports I've heard are that the LAN NVM is
> corrupted. If you managed to corrupt the BIOS area, the machine wouldn't boot.

Helmut Schaa has an HP 2510p that lost some of its display modes after a hard X crash on an early 2.6.27-rc kernel (it now no longer knows that it has a 1280x800 panel but thinks that it only has 1024x768, the BIOS screen is in the upper left corner instead of centered on the screen). Even though we don't know that this is the same problem, it shows that sh*t happens.

Do we have any dumps of the gfx related crashes? Comment #98 seems to indicate that the video ROM may have also become corrupted (either that or the EEPROM containing the EDID), but I don't currently have any theories about how the gfx driver could cause that...

About Comment #86:
> but first we need reliable way to restore the EEPROM contents, otherwise the
> debugging is almost impossible.

A strange comment in Ubuntu bug Report that, maybe, can help:
"...I have resolved on my hp 8510w with an old image of windows, and my network card is reborn..."
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555/comments/75

Anyone have dual-boot (Windows) and can try this?

Best regards,
Renato

The Windows drivers do not restore NVM images, so I don't think this report was seeing the same issue. If the NVM is really corrupted, loading the Windows driver is not going to fix it. The Windows driver does not calculate and check the checksum, so the device could be using whatever is in the corrupted NVM and running with those settings. Similarly, in some cases on this bug, commenting out the checksum check makes it work for some people (probably with some random MAC address).
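
To make the checksum check concrete, here is a hedged Python sketch. It assumes the convention used by the generic NVM routine in the Linux e1000e driver source, where the first 0x40 sixteen-bit words must sum to 0xBABA (modulo 2^16), and it assumes the dump stores each word little-endian. Both points are assumptions to verify against the driver; the image built in the demo is synthetic.

```python
import struct

NVM_SUM = 0xBABA       # expected 16-bit sum (assumption taken from the e1000e source)
CHECKSUM_WORDS = 0x40  # words 0x00..0x3F; the last one is the checksum word


def nvm_checksum_ok(eeprom: bytes) -> bool:
    """Check whether the first 64 little-endian words sum to NVM_SUM."""
    if len(eeprom) < CHECKSUM_WORDS * 2:
        raise ValueError("dump is shorter than the checksummed region")
    words = struct.unpack("<%dH" % CHECKSUM_WORDS, eeprom[:CHECKSUM_WORDS * 2])
    return (sum(words) & 0xFFFF) == NVM_SUM


def checksum_word_for(first_63_words) -> int:
    """Return the value word 0x3F needs so that the region sums to NVM_SUM."""
    return (NVM_SUM - sum(first_63_words)) & 0xFFFF


if __name__ == "__main__":
    # Synthetic image: three made-up words, the rest erased (0xFFFF).
    words = [0x1100, 0x3322, 0x5544] + [0xFFFF] * 60
    words.append(checksum_word_for(words))
    image = struct.pack("<64H", *words)
    print("intact image valid:", nvm_checksum_ok(image))

    damaged = bytearray(image)
    damaged[0] ^= 0x01  # flip a single bit, as a stray write might
    print("damaged image valid:", nvm_checksum_ok(bytes(damaged)))
```

A single flipped bit anywhere in the region is enough to fail the check, which matches the reports of the driver refusing to load.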

Jesse, John, we have one case where the NVM is not completely destroyed. It seems only the
NVM valid bit is no longer set, and it shows a checksum error.
The Lenovo T61 worked until an attempt to install Beta1 via a network install.
During YaST's X server setup it rebooted, and since then it no longer loads
e1000e because of the checksum error. I will attach ethtool -e and ethregs from
this machine. I have not set the NVM valid bit so far, so the NIC is still in this state.

Created attachment 242026
T61 ethtool -e dump

Created attachment 242027
T61 ethregs output
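
When a known-good dump and a dump from a damaged NIC are both available, listing the words that differ can narrow down what was overwritten. This is a minimal Python sketch under the assumption that both dumps have already been converted to raw bytes and that words are stored little-endian; the function name and demo data are invented.

```python
def changed_words(before: bytes, after: bytes) -> list:
    """Return (word_index, old_value, new_value) for each 16-bit word that differs."""
    if len(before) != len(after):
        raise ValueError("dumps have different lengths")
    diffs = []
    for off in range(0, len(before) - 1, 2):
        old, new = before[off:off + 2], after[off:off + 2]
        if old != new:
            diffs.append((off // 2,
                          int.from_bytes(old, "little"),
                          int.from_bytes(new, "little")))
    return diffs


if __name__ == "__main__":
    good = bytes(range(16))   # invented "before" contents
    bad = bytearray(good)
    bad[6] = 0xFF             # simulate one overwritten byte
    for index, old, new in changed_words(good, bytes(bad)):
        print(f"word 0x{index:02x}: 0x{old:04x} -> 0x{new:04x}")
```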

Shwan (shwan-ciyako) on 2008-09-27
description: updated
Changed in linux:
status: Fix Committed → Confirmed
Tim Gardner (timg-tpi) on 2008-09-30
Changed in linux:
status: In Progress → Fix Committed
Matt Zimmerman (mdz) on 2008-09-30
Changed in linux:
milestone: ubuntu-8.10-beta → none
Steve Langasek (vorlon) on 2008-09-30
Changed in linux:
milestone: none → ubuntu-8.10

Guys, I have an HP 8510w which is experiencing some interesting behaviour. I believe I may have had a graphics corruption first, though I don't recall if the problems started directly afterward. I'm definitely running the e1000e driver; the machine has an NVIDIA Quadro FX570M (Mobile Version). The first thing I noticed was that the Intel Boot Agent in the BIOS reports the following:

Initializing Intel (R) Boot Agent GE v1.2.45
PXE-E05: The LAN adapter's configuration is corrupted or has not been initialized. The Boot Agent cannot continue.

Then the eth0 device would no longer work. I found a link (posted at the end of this comment) which talked about some workarounds using FreeDOS and resetting the Intel Boot Agent with an Intel program called IBAUTIL. At this point I was able to use the NIC under Windows, but not under Linux; Linux would complain with a standard message in YaST that the card was corrupted and that therefore the module was not loaded.

I ran the procedure outlined using IBAUTIL and voila my linux ethernet worked again. However, upon booting up a day after it is back to being dead.

If this is indeed the same situation, this may be all we need to get info out of the card. Also, I may potentially have access to more of these machines that ARE going if that helps.

The windows OS will now not get an IP address either, which I assume isn't just about the address and rather about hardware failure. Event Viewer shows nothing as usual, where's the Windows DMESG!!!! Windows was working fine all day though.

I shall try this procedure again, but I expect I am now out of luck :(

If someone wants me to post some kind of image from a going one of these machines it might be possible, but I'll need to do it from an older version of Linux I expect :)

http://dance.richii.com/article238.html

I am now definitely in the same boat as everyone else: I don't even have lights on my NIC at the hardware level, and the driver has been automatically removed from Windows! The worst part is that the wireless doesn't work on Linux in Beta 1 for me, so there is no network in Linux at all! Now where is that old Cisco wireless card...

(In reply to comment #105 from Quentin Jackson)
> Guys, I have an HP 8510w which is experiencing some interesting behaviour. I
> believe I may have had a graphics corruption first, though I don't recall if
> the problems started directly afterward. I'm definitely running the e1000e
> driver, the machine has an NVIDIA Quadro FX570M (Mobile Version). The first
> thing I noticed was the Intel Boot agent in the BIOS reports the following:
>
> Initializing Intel (R) Boot Agent GE v1.2.45
> PXE-E05: The LAN adapter's configuration is corrupted or has not been
> initialized. The Boot Agent cannot continue.

Quentin,

could you please post the lspci output from the affected machine? If you are experiencing the problem on a system that doesn't have an Intel graphics chip at all, you'd be the very first one, and this would really change the direction of our debugging efforts -- currently the main suspect is the Intel graphics driver in X.org, which apparently couldn't be blamed in such a case.
In addition to that, could you please attach your /etc/X11/xorg.conf?

Thanks.

Created attachment 242732
LSPCI.txt

Created attachment 242733
Xorg.conf

Done :) I don't think the NIC is showing up in lspci at all, from what I can see. I also noticed that my FireWire connector (which shows up as a NIC in Windows) has an X through it; I really hope that's unrelated!

(In reply to comment #110 from Quentin Jackson)
> Done

Thanks. So apparently you are really the first one, to my knowledge, to report the problem on an ICH chipset with no Intel graphics chip at all. In my eyes this really seems to rule out the X.org graphics driver as the cause.

Could you please boot a "Kernel Of The Day" from

          ftp://ftp.suse.com/pub/projects/kernel/kotd/HEAD/

This kernel contains a load of fixes for the e1000e driver. Unfortunately it is not currently able to bring your network card back to life, but it will output an EEPROM contents dump into the 'dmesg' output even if the contents are corrupt. Could you please attach this output?

This will allow us to verify whether you are really hitting the very same problem.

Thanks.

Quentin, it is very important to get the NIC NVM image for this machine with ethtool -e. You could use an old SuSE 11.0 CD for this; the rescue system is enough. You can mount a USB stick and save the ethtool -e eth0 output on it.
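
For anyone who wants to work with such a dump offline, here is a minimal sketch of turning saved `ethtool -e eth0` text into raw bytes. It assumes the usual layout of that output (an "Offset / Values" header followed by rows such as `0x0000:  00 16 ...`); the function name and the sample bytes are invented for illustration and do not come from any attachment on this bug.

```python
import re


def parse_ethtool_dump(text):
    """Convert the text printed by `ethtool -e` into a bytes object."""
    data = bytearray()
    for line in text.splitlines():
        # Data rows look like "0x0010:  ff ff 00 1e ..."; skip headers and rules.
        match = re.match(r"\s*0x([0-9a-fA-F]+):\s*(.*)$", line)
        if not match:
            continue
        offset = int(match.group(1), 16)
        if offset != len(data):
            # A gap or repeated row means the capture is incomplete.
            raise ValueError(f"unexpected offset 0x{offset:04x}")
        data.extend(int(token, 16) for token in match.group(2).split())
    return bytes(data)


# Made-up sample in the assumed format (not a real NVM image).
sample = """Offset\t\tValues
------\t\t------
0x0000:\t\t00 16 d3 12 34 56 30 0b b2 ff 51 00 ff ff ff ff
0x0010:\t\t53 00 03 02 6b 02 ce 20 aa 17 9a 10 86 80 65 b1
"""

image = parse_ethtool_dump(sample)
print(len(image), "bytes; first six:", image[:6].hex(":"))
```

The offset check is there so that a truncated or hand-edited capture is rejected rather than silently shifted.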

Hi guys, until now I'm not affected by this bug, although (according to Jesse Brandeburg) I would be a very hot candidate.
As this bug does not seem to be limited to SUSE Linux (I mostly use Mandriva), and SUSE Labs seemed to me very active on LKML in getting this problem fixed, I subscribed to this bug tracking system too.

I would like to offer my help if desired, because I've got a Lenovo T61 as in Comment #102, and I have a graphics adapter from NVIDIA (NVIDIA Quadro 140M) which should use the same driver as the HP 8510w from Comment #107.

I've got a backup of my working NIC NVM, so if it would help I could post it here. As I need my laptop for daily business work, I can only do further testing if there is a valid method to get a broken NIC back to work. I know that some guys at Intel are working on a tool for that, but I don't know if it has been released yet.

OK, I'll have to do the kernel of the day tonight when I'm at home, but I should be able to use the ethtool dump today, hunting down a laptop now :)

Created attachment 242921
ethtool dump from HP 8510w

Please advise if this suffices. Sounds like you've been looking for it for a while. Theoretically I have one of these machines to play with whenever needed, both dead and not dead.

Created attachment 242983
DMESG output after latest kernel of the day

The kernel upgrade complained that I was upgrading over a newer version; I forced it, as it was dated October, but thought I should mention it in case anything doesn't come through correctly. After the kernel was installed and the machine rebooted, one of the network card lights now comes on; I don't think it was doing this in Windows, and definitely not in Linux. Let me know if there is anything else I can provide, and whether this is this bug or I need to log it somewhere else! :)

I can volunteer too - I have a T60p with a:
02:00.0 Ethernet controller: Intel Corporation 82573L Gigabit Ethernet Controller
and (amusingly) the socket is physically broken (by myself), so I seldom to never use it.

Changed in linux:
status: Incomplete → In Progress
Changed in linux:
status: Fix Committed → Fix Released

Seems to have gone quiet around here :)

Can someone please explain to me what path they expect this bug to take? I am sitting with an unusable system and am wondering whether to go back to OpenSuSE 10.3 as at least I can have working wireless in that version. Unless I can get some direction I see no point in leaving Beta1 on my system as I cannot continue with bug fixing with no network access.

Currently, we're busy testing the patches we've put into beta2. These
are mostly patches from Intel, also posted upstream on LKML.

On beta1, we're able to reproduce the issue pretty reliably by simply booting
into runlevel 3 and shutting down the machine 1 minute later. The problem will
usually show up within 3-20 reboots. With beta2, we have so far run 350
reboots or more without hitting the problem.

We're currently still discussing with Intel and LKML what the cause of the
problem may be. We're chasing a number of leads, but it seems at least
one of the patches we have so far is effective in stopping the corruption
from happening.
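
To put the reboot numbers above in perspective, here is a rough back-of-the-envelope sketch. It assumes each reboot fails independently with a fixed probability p, and reads "within 3-20 reboots" as p lying somewhere between 1/20 and 1/3; both are modelling assumptions, not something measured in this report.

```python
def prob_clean_run(p_fail, reboots):
    """Chance of seeing no failure in `reboots` independent trials."""
    return (1.0 - p_fail) ** reboots


for cycles_to_failure in (3, 20):
    # In a geometric model the mean number of trials to failure is 1/p.
    p = 1.0 / cycles_to_failure
    chance = prob_clean_run(p, 350)
    print(f"p = 1/{cycles_to_failure}: chance of 350 clean reboots = {chance:.1e}")
```

Even with the gentler estimate (p = 1/20), 350 clean reboots would happen by chance only about once in 60 million runs under this model, so the clean beta2 result is very unlikely to be luck. It says nothing about which of the patches is responsible.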

That's a good update, thanks. More specifically, is someone able to advise:

a) Will it eventually be possible to repair this hardware via some kind of software programming?

b) If so, are we awaiting Intel, or can this be done by my providing the ethtool dump above or something more specific?

c) If so, can we presume a fix within 2-4 weeks?

If not, then it would make sense to get my hardware repaired, and no doubt others will be interested in ETAs on this too.

Thanks.

Hello Quentin, to answer your questions:
a) Yes, I'm working on a GPL tool for that.
b) You need an ethtool dump, ideally from the machine itself or otherwise
   from a similar machine (in which case you need to give the MAC address to
   the tool).
   To see whether another machine has the same device, you need the PCI IDs
   from the machine before the corruption happened. In most cases the IDs are
   overridden via the NVM; if the NVM gets corrupted, the device falls back
   to the generic IDs.
c) I hope to have a verified working version early next week.
Michael Losonsky (michl) on 2008-10-03
Changed in linux:
status: Fix Released → In Progress
Changed in linux:
status: Confirmed → Fix Released
Colin Watson (cjwatson) on 2008-10-03
Changed in linux:
status: In Progress → Fix Released

It appears that the patch to use set_memory_ro/rw changes the timings enough in our test boxes that the problem no longer occurs.

We are not currently sure why this patch fixes it, but I wanted to share our findings.

We also have a patch (will attach here soon) to restore the eeprom from an ethtool -e dump, using a sysfs interface to the driver.

(In reply to comment #125 from Renato Yamane)
> Fixed?
> <http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=4a7703582836f55a1cbad0e2c1c6ebbee3f9b3a7>
>

Yes, that's a workaround that prevents the corruption of the EEPROM contents, but it doesn't fix the real problem; it just prevents bad things from happening when the bug triggers.

Changed in linux:
status: In Progress → Fix Released

I'm hanging out to restore my ethernet card firmware. Any chance of getting that EEPROM restore application? Or, if it is not public yet, any chance of emailing it to quentin dot jackson at exclamation dot co dot nz? :)

The restore application does work now; I successfully restored a broken ThinkPad X61s. I'm now preparing a mini ISO with the application and our rescue system, so you can boot from this CD and use the application in a sane environment.

Changed in linux:
status: Confirmed → In Progress

Well, I have gotten hold of and applied the recovery tool. Unfortunately it does not work for the following reason:

The device is not listed by lspci or lspci -n because it is dead, so I cannot find the device ID because it doesn't have one. The tool relies on this information to work. Apparently there are other tools that will get around this via some kind of BIOS update direct from Intel. Thought you would all like to know.

I should have said, this is the case on my device, apparently it is not the case for all devices, you will need to check if your device is listed in lspci or not.

Changed in linux:
status: In Progress → Fix Released
Changed in linux:
status: Confirmed → Fix Released

Here is a patch which we at Intel LAD have been testing today. This looks to be a workaround, with the .28 ftrace code being a fix for the root cause of the problem. The problem was with ftrace, which is what we bisected to last week. On systems that failed within minutes, we have not been able to make it happen once ftrace was disabled. So I think the .28 ftrace needs to get included into SLES11.

>---------- Forwarded message ----------
>From: Steven Rostedt <email address hidden>
>Date: Wed, Oct 15, 2008 at 3:21 PM
>Subject: [PATCH -stable] disable CONFIG_DYNAMIC_FTRACE due to possible
>memory corruption on module unload
>To: LKML <email address hidden>, <email address hidden>
>Cc: Linus Torvalds <email address hidden>, Andrew Morton
><email address hidden>, Arjan van de Ven <email address hidden>,
><email address hidden>, <email address hidden>, Thomas Gleixner
><email address hidden>, Ingo Molnar <email address hidden>
>
>
>
>While debugging the e1000e corruption bug with Intel, we discovered
>today that the dynamic ftrace code in mainline is the likely source of
>this bug.
>
>For the stable kernel we are providing the only viable fix patch:
>labeling CONFIG_DYNAMIC_FTRACE as broken. (see the patch below)
>
>We will follow up with a backport patch that contains the fixes. But
>since the fixes are not a one liner, the safest approach for now is to
>disable the code in question.
>
>The cause of the bug is due to the way the current code in mainline
>handles dynamic ftrace. When dynamic ftrace is turned on, it also
>turns on CONFIG_FTRACE which enables the -pg config in gcc that places
>a call to mcount at every function call. With just CONFIG_FTRACE this
>causes a noticeable overhead. CONFIG_DYNAMIC_FTRACE works to ease this
>overhead by dynamically updating the mcount call sites into nops.
>
>The problem arises when we trace functions and modules are unloaded.
>The first time a function is called, it will call mcount and the mcount
>call will call ftrace_record_ip. This records the calling site and
>stores it in a preallocated hash table. Later on a daemon will
>wake up and call kstop_machine and convert any mcount callers into
>nops.
>
>The evolution of this code first tried to do this without the
>kstop_machine and used cmpxchg to update the callers as they were
>called. But I was informed that this is dangerous to do on SMP machines
>if another CPU is running that same code. The solution was to do this
>with kstop_machine.
>
>We still used cmpxchg to test if the code that we are modifying is
>indeed code that we expect to be before updating it - as a final
>line of defense.
>
>But on 32bit machines, ioremapped memory and modules share the same
>address space. When a module would load its code into memory and
>execute some code, that would register the function.
>
>On module unload, ftrace incorrectly did not zap these functions from
>its hash (this was the bug). The cmpxchg could have saved us in most
>cases (via luck) - but with ioremap-ed memory that was exactly the
>wrong thing to do - the results of cmpxchg on device memory are
>undefined. (and will likely result in a write)
>
>The pending .28 ftrace tree does not have this bug a...
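
The sequence described in the quoted mail can be illustrated with a small toy model. This is not kernel code: `ToyKernel` and all of its methods are invented for illustration, addresses are plain integers, and the "undefined" cmpxchg on device memory is represented simply by reporting which device addresses got touched.

```python
class ToyKernel:
    """Toy model of recorded mcount call sites outliving their module."""

    def __init__(self):
        self.memory = {}        # address -> (kind, contents)
        self.recorded = set()   # call sites remembered for later patching

    def load_module(self, addr):
        # Module code starts with a call to mcount; the first call records it.
        self.memory[addr] = ("code", "call mcount")
        self.recorded.add(addr)

    def unload_module(self, addr, zap_records):
        del self.memory[addr]
        if zap_records:
            # What should happen: forget call sites in the freed module.
            self.recorded.discard(addr)

    def ioremap(self, addr, contents):
        # On 32-bit, device mappings can reuse the freed module address.
        self.memory[addr] = ("device", contents)

    def convert_to_nops(self):
        """Patch every recorded site; return device addresses that were hit."""
        touched_device = []
        for addr in sorted(self.recorded):
            kind, contents = self.memory.get(addr, (None, None))
            if kind == "code" and contents == "call mcount":
                self.memory[addr] = ("code", "nop")
            elif kind == "device":
                # The compare-and-exchange lands on device memory instead.
                touched_device.append(addr)
        self.recorded.clear()
        return touched_device


def run(zap_records):
    kernel = ToyKernel()
    kernel.load_module(0xF8000000)
    kernel.unload_module(0xF8000000, zap_records)
    kernel.ioremap(0xF8000000, "NVM registers")
    return kernel.convert_to_nops()


print("stale records kept:  ", [hex(a) for a in run(zap_records=False)])
print("records zapped:      ", [hex(a) for a in run(zap_records=True)])
```

With the stale record kept, the later patching pass reaches an address that now belongs to a device mapping; with the record removed on unload, nothing outside module code is touched.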


This patch is now included in our SLE11 kernel, as it is in 2.6.27.1, which is the base of our kernel tree.

So, I guess we can close this out now, thanks for all of the work everyone!

Changed in linux:
status: In Progress → Fix Released
Changed in linux:
status: Fix Released → Confirmed
Amit Kucheria (amitk) on 2008-10-24
Changed in linux-lpia:
assignee: nobody → amitk
importance: Undecided → Critical
milestone: none → ubuntu-8.10
status: New → Fix Committed
Changed in linux-lpia:
status: Fix Committed → Fix Released
Changed in linux:
status: Confirmed → Fix Released
Basilisk (bluebal-1) on 2008-11-01
Changed in linux:
assignee: timg-tpi → nobody
assignee: nobody → bluebal-1
William Grant (wgrant) on 2008-11-01
Changed in linux:
assignee: bluebal-1 → timg-tpi

Was the recovery tool ever published? I just ran into a beta user who still has a trashed e1000e.

<email address hidden> can help

I have been dealing with a lot of these recovery requests, and have been using a tool developed by Karsten Keil. The tool reads the (probably) corrupted content, which is sent to me. I repair the image (usually a single-byte corruption) and then return the corrected image to be written back to the NVM using the same tool.

Below are the instructions that I have been providing to the individual reporters I have had....
---- Start of instructions -------
Go to:

      ftp://ftp.suse.com/pub/people/kkeil/testing/e1000e/

Copy and paste this link into a browser window, and you should see a list of files, including one:

      e1000e_recover.iso

This is an ISO image of a CD, so save it to your local system, then burn it to a CD, and use it to boot your problem system. From finding the ISO to actually booting your system is quite a few steps - if you get stuck of course just let me know and I'll guide you through the detail, but for now I'll assume that you're still with me.

From the boot options presented by the CD, select "rescue system", as that's where we'll find the eeprom recovery tool.

When prompted for user, log on as root. There's no password, so just hit return.

1) Read the current eeprom and save it to file. Be patient !

      e1000_nvm -r -u -o ethtool.dmp

2) mount a USB disk to save the file, and send the file to me <email address hidden>

I will then fix up the image, and mail it back to you as ethtoola.dmp, and then, you can boot again to the CD, and

3) Write the new eeprom content back to your system NVM, using something like the following (it may differ depending on the device ID that is indicated in the NVM, but I will provide any update to this step along with the fixed-up NVM image that I return)

      e1000_nvm -u -P 10498086 ethtoola.dmp

And select YES when prompted.

4) You should then be able to remove the recovery CD, and successfully boot back to a working ethernet.

---- End of instructions -------
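
For readers curious what spotting a single-byte corruption might look like, here is a minimal sketch that compares a suspect image against a known-good one byte by byte. It is not the procedure used above; it assumes both images are available as raw bytes of equal length, and the sample data is invented.

```python
def diff_images(good, suspect):
    """Return (offset, good_byte, suspect_byte) for every differing byte."""
    if len(good) != len(suspect):
        raise ValueError("images differ in length")
    return [
        (offset, g, s)
        for offset, (g, s) in enumerate(zip(good, suspect))
        if g != s
    ]


good = bytes(range(16))      # placeholder "known-good" image
bad = bytearray(good)
bad[9] ^= 0x40               # simulate a flipped bit in one byte

for offset, g, s in diff_images(good, bytes(bad)):
    print(f"offset 0x{offset:04x}: expected {g:02x}, found {s:02x}")
```

In practice a known-good reference for the same machine is often unavailable, so a comparison like this would only narrow down candidates when a dump from an identical model is at hand.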

Changed in linux (Ubuntu):
status: Fix Released → Confirmed
Steve Langasek (vorlon) on 2009-12-02
Changed in linux (Ubuntu):
status: Confirmed → Fix Released
Changed in linux:
importance: Unknown → Medium
Changed in linux (Gentoo Linux):
importance: Unknown → Medium
Changed in linux (Mandriva):
importance: Unknown → Critical
Changed in linux (Fedora):
importance: Unknown → Medium
Changed in linux (Suse):
importance: Unknown → Critical
Displaying first 40 and last 40 comments. View all 364 comments or add a comment.