[intrepid] 2.6.27 e1000e driver places Intel ICH8 and ICH9 gigE chipsets at risk

Bug #263555 reported by Chris Jones on 2008-09-01
This bug affects 14 people
Affects                Status        Importance  Assigned to    Milestone
Linux                  Fix Released  Medium
linux (Fedora)         Fix Released  Medium
linux (Gentoo Linux)   Fix Released  Medium
linux (Mandriva)       Fix Released  Critical
linux (Suse)           Fix Released  Unknown
linux (Ubuntu)                       Critical    Tim Gardner
  Intrepid                           Critical    Tim Gardner
linux-lpia (Ubuntu)                  Critical    Amit Kucheria
  Intrepid                           Critical    Amit Kucheria

Bug Description

In some circumstances it appears possible for the 2.6.27-rc kernels to corrupt the NVRAM used by some Intel network parts to store data such as MAC addresses.
This is limited to the new e1000e driver, and reports have only appeared from users of "82566 and 82567 based LAN parts (ich8 and ich9)" (to quote Intel). The reports seem to be isolated to laptops, but it is not clear if this is because desktop/server parts are not vulnerable, or if use cases simply increase the chances of laptop users being hit.

Once this corruption has occurred, recovery may be possible via a BIOS update, but may well require replacement of the hardware. Use of Intel's IBAUTIL.EXE is strongly discouraged, as it will worsen the problem to the point where the network part will no longer appear on the PCI bus.

(This is a new description; the original one was based on too much guesswork. Below are the URLs originally referenced.)
(The driver is blacklisted in the latest Ubuntu releases for 2.6.27-rc, so if your network is not working, it is not necessarily damaged; the driver may simply be disabled to prevent accidents until this bug is solved. Don't worry!)
http://www.blahonga.org/~art/rant.html (search for "em0")
http://<email address hidden>/msg00360.html
http://<email address hidden>/msg00398.html


Description of problem:
I am unable to use my Ethernet controller: Intel Corporation 82566DC Gigabit Network Connection (rev 03). The system does not see it. Please find the dmesg output below.

e1000e: Intel(R) PRO/1000 Network Driver - 0.2.0
e1000e: Copyright (c) 1999-2007 Intel Corporation.
ACPI: PCI Interrupt 0000:00:19.0[A] -> GSI 22 (level, low) -> IRQ 22
PCI: Setting latency timer of device 0000:00:19.0 to 64
iTCO_vendor_support: vendor-support=0
0000:00:19.0: The NVM Checksum Is Not Valid
ACPI: PCI interrupt for device 0000:00:19.0 disabled
e1000e: probe of 0000:00:19.0 failed with error -5

Version-Release number of selected component (if applicable):
Driver version 0.2.0

How reproducible:
Happens every time

Steps to Reproduce:
1. Boot computer

What kernel version is this? Has this adapter ever worked under Fedora? If yes, when did it stop?

I am sorry, I totally forgot about these details.
Kernels which I have:
2.6.25.11-97.fc9.x86_64
2.6.25.14-108.fc9.x86_64

I guess it stopped shortly after I upgraded to F9. It must have been one of the first kernel updates. I am not sure if it ever worked in F9.

Strange thing: I cannot use it on Ubuntu either. I do not have dmesg output yet. I will try and see if it matches.

Can you post the output of 'lspci -nn -s 0000:00:19.0'?

Output you have requested:

00:19.0 Ethernet controller [0200]: Intel Corporation 82566DC Gigabit Network Connection [8086:104b] (rev 03)
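
As a rough way to tell whether a machine belongs to the group with reported problems, the device name in an lspci -nn line like the one above can be matched against the two part families named in this bug (82566 and 82567). The sketch below is only a heuristic: the function name is_at_risk_nic is made up for illustration, and matching on the name string rather than on a list of PCI IDs is an assumption, since no complete ID list is given in this report.

```sh
# Hypothetical helper: succeed if one line of 'lspci -nn' output names an
# Intel Ethernet controller from the 82566 or 82567 family.
is_at_risk_nic() {
    printf '%s\n' "$1" | grep -Eq 'Ethernet controller.*Intel.*8256[67]'
}

# Example use:
#   lspci -nn | while IFS= read -r line; do
#       is_at_risk_nic "$line" && echo "at risk: $line"
#   done
```

A match only says the part is in the reported family, not that its NVM has been or will be corrupted.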

Chris Jones (cmsj) on 2008-09-01
Changed in linux:
importance: Undecided → Critical
Chris Jones (cmsj) wrote :

I'm wondering if it would be possible for us to patch out the sections of the driver which write to the NVRAM, assuming Intel are not able to make suitable changes before 2.6.27 is released, which prevent this from being possible (e.g. splitting the writing parts out into a separate module which is not loaded by default?)

Ben Collins (ben-collins) wrote :

Removed the regression-2.6.27 tag from this. The 2.6.26 kernel and 2.6.27 kernel have the exact same e1000e driver (one which we downloaded from Intel's e1000 sf.net project).

Still a serious issue, but I don't want it to be classified as a regression.

Chris Jones (cmsj) wrote :

http://marc.info/?t=122038337000003&r=1&w=2 is another interesting thread about this, on linux-netdev.

Hi Chris,

Just an update here in case you missed the chatter in #kernel on Sept 03: Tim has already begun investigating this issue.

Changed in linux:
assignee: nobody → timg-tpi
status: New → Triaged

The driver you have supports your hardware, but is erroring out on load.
The "NVM checksum is not valid" means that something corrupted your system BIOS flash.
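
For context, the check that fails here can be sketched as follows. In the e1000 family of drivers, the first 64 16-bit words of the NVM (offsets 0x00 through 0x3F) are expected to sum to 0xBABA, with the last of those words chosen to make the sum come out right. The helper name nvm_checksum_ok and the hex-words-as-arguments interface are invented for this illustration; this is not the driver's code.

```sh
# Hypothetical helper: takes 16-bit NVM words as hex arguments (no 0x prefix)
# and succeeds if their sum, truncated to 16 bits, equals 0xBABA.
nvm_checksum_ok() {
    sum=0
    for word in "$@"; do
        sum=$(( (sum + 0x$word) & 0xFFFF ))
    done
    [ "$sum" -eq "$((0xBABA))" ]
}
```

A real check would pass all 64 words; given fewer, the function simply sums whatever it receives.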

Can you please give us details about the hardware in your system, attach the output of
# lspci -vvv > lspci.txt

# dmidecode > dmiout.txt

we have some reports that Lenovo systems (a lot of them) are starting to have this issue.

Please DO NOT run ibautil as some sites on the web suggest to try to fix this issue. It will likely cause you to have to replace your motherboard to get LAN functionality back.

Created attachment 316491
dmiout.txt

Created attachment 316492
lspci.txt

I have messed around a little with my card. I just wanted to check some suggestions pointed out here: http://www.thinkwiki.org/wiki/Problem_with_e1000:_EEPROM_Checksum_Is_Not_Valid#Solutions

The little orange LED on my Ethernet port is constantly flashing; unloading the e1000e module did not change anything. When I plugged in a cable, it stopped and the green LED came on, meaning that the connection is OK, though the driver still failed to load.
If you need any other info, I will gladly help.

okay, so you have an HP machine with an ICH8 chipset. I don't know what the little orange LED flashing means, I will have to check on that.

can you get into the iAMT setup just after BIOS completes by pressing CTRL-p?
not sure if that might help you or not.

If I attach a debug driver here would you be willing to compile and run it?

I am not able to open the iAMT setup. I believe that I do not have that option: from what I have found, enabling it requires turning it on in the Power section of the BIOS settings, and I do not have it there.

Yes, please attach driver.

Yingying Zhao (yingying-zhao) wrote :

We just hit a similar issue while testing Intrepid Alpha 5. In the beginning, the LAN worked fine on the x86 system. But after a system hang on the x86_64 system (caused by gfx) on the same machine, we found that the Ethernet card no longer works. "lspci" does not show the correct Ethernet card info. The x86 system, on which e1000e worked before, cannot recognize the card either.

Our investigation is underway now.

Changed in linux:
status: Unknown → Incomplete
Changed in linux:
status: Unknown → Confirmed
Changed in linux:
status: Unknown → Confirmed
Jeffrey Baker (jwbaker) wrote :

This is just my humble opinion, but the Alpha CD downloads should be pulled from the archive. This kernel can partially ruin your hardware, and unsuspecting users shouldn't be able to merrily download it.


Created attachment 317425
driver with csum check bypass

here is a driver that just prints the message but doesn't error out if the checksum validation fails.

This should allow you to run ethtool -e ethX after loading the driver.
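
Once ethtool -e works, its hex dump can be turned into 16-bit words for inspection. The sketch below assumes the usual ethtool dump layout (lines of the form "0x0000:" followed by space-separated hex bytes) and assumes each word is stored low byte first; both are assumptions rather than something stated in this bug, and eeprom_dump_words is a made-up name.

```sh
# Hypothetical helper: read an 'ethtool -e' style hex dump on stdin and print
# one 16-bit word per line, assuming the low byte of each word comes first.
eeprom_dump_words() {
    awk '/^0x[0-9a-fA-F]+:/ {
        # Field 1 is the offset; the remaining fields are bytes, taken in pairs.
        for (i = 2; i + 1 <= NF; i += 2)
            printf "%s%s\n", $(i + 1), $i
    }'
}

# Example use (eth0 stands in for the real interface name):
#   ethtool -e eth0 | eeprom_dump_words
```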

the difference in the driver I just attached is:
diff -rup e1000e-0.4.1.7.orig/src/netdev.c e1000e-0.4.1.7/src/netdev.c
--- e1000e-0.4.1.7.orig/src/netdev.c 2008-06-23 09:27:33.000000000 -0700
+++ e1000e-0.4.1.7/src/netdev.c 2008-09-22 16:06:59.000000000 -0700
@@ -56,7 +56,7 @@

 #define DRV_DEBUG

-#define DRV_VERSION "0.4.1.7" DRV_NAPI DRV_DEBUG
+#define DRV_VERSION "0.4.1.7_nocsum" DRV_NAPI DRV_DEBUG
 char e1000e_driver_name[] = "e1000e";
 const char e1000e_driver_version[] = DRV_VERSION;

@@ -5309,8 +5309,10 @@ static int __devinit e1000_probe(struct
                        break;
                if (i == 2) {
                        e_err("The NVM Checksum Is Not Valid\n");
+ /* JJJ skip around error path
                        err = -EIO;
                        goto err_eeprom;
+ JJJ end */
                }
        }

Steve Langasek (vorlon) wrote :

Jorge brought this bug to my attention just now; this really needs to be fixed one way or another for beta, even if that would mean blacklisting e1000e altogether until this is resolved. Even with as little as I use the wired ethernet on my laptop, I wouldn't enjoy having to RMA it to fix it after a kernel bug. :/

Changed in linux:
milestone: none → ubuntu-8.10-beta

also, whole piles of reports now starting to converge, many of them linked here:

http://bugzilla.kernel.org/show_bug.cgi?id=11382

I'm trying to work a plan to help address this soonest.

Colin Watson (cjwatson) wrote :

Jeffrey, we can't afford to do that; we need to be able to test with the Alpha CDs on the wide variety of hardware not affected by this bug, or our development schedules for 8.10 will be seriously compromised. However, I'd be happy to add a warning to the cdimage web pages. Can anyone suggest some text?

Alacrityathome (alacrityathome) wrote :

Colin,

Seems that a warning may be insufficient. I would think most of the folks testing a pre-release may not know they have an e1000e driver or affected NIC.

Maybe blacklist e1000e asap and then re-instate e1000e after a fix is found.

Perhaps have the "warning" state something about the e1000e being temporarily withheld from the pre-release with certain Intel NICs affected.

John


Michal, have you ever booted a Fedora 10 Alpha or rawhide disk on that system?


Is Ubuntu willing to risk the liability of distributing software known to destroy hardware?

Scruffynerf (scruffynerf) wrote :

Unless Canonical wants liability for
a) Individual user's destroyed hardware
b) Crippling reputation damages, especially against the 'new to linux' groups
I'd echo the suggestion to pull the liveCD's until this is fixed.

When new Linux users discover permanently corrupted hardware after trying Ubuntu, and this gets out to the wider web, all of Ubuntu's promotion efforts will be destroyed as well.

Breaking known good hardware is a problem greater than keeping to a self-imposed delivery schedule.

eentonig (eentonig) wrote :

It's alpha software, people should be considered as being aware that using it might break stuff.

Furthermore, people should be smart enough to read about the known issues prior to installing it.

Yes, a warning and blacklisting the e1000e driver should be done, but revoking an alpha because of a (serious) bug just doesn't seem the answer to me, because it blocks you from finding other issues that might bite people when the official release gets out.


Yes, i have rawhide on my system.
Last two kernels i have
2.6.27-0.226.rc1.git5.fc10.i686
2.6.27-0.244.rc2.git1.fc10.i686

I do not know which one killed my port. If you want me to run something, note that I am unable to get any internet connection on those kernels: wifi does not work, and Ethernet, well, you know.

Chris Jones (cmsj) wrote :

Colin: FWIW, I think some kind of warning on cdimage and in the alpha release notes seems highly prudent (not because of the bogus liability claims here, but just because it's the good thing to do). I would suggest:

"Due to an unresolved bug in the Linux kernel currently used in Ubuntu 8.10 users with Intel network hardware supported by the e1000e driver should not download and run these images. Doing so may render your network hardware permanently inoperable.
Older Intel network hardware which uses the e1000 driver is not affected by this, however, use of the e1000 driver in older Ubuntu releases is not a reliable indication of which driver will be used by Ubuntu 8.10. Support for hardware which uses a PCI Express bus has been moved from e1000 to e1000e. If in doubt, do not run these images and subscribe to https://bugs.launchpad.net/ubuntu/+source/linux/+bug/263555 to receive notifications when the bug is fixed."

Steve: I am not sure exactly where the responsibility for handling this for the Alphas falls (other than being quite sure it's not mine, and suspecting it's yours ;) but I think we should put warnings out fairly prominently, as SuSE has done. The obvious safe default would be to yank e1000e.ko and replace the above warning with something similar which explains why newer Intel network hardware won't work in the Alphas. It's a bit of a nuclear option since there is a lot of this hardware around and the bug mostly seems to be affecting laptops, but since they tend to be doing a lot more "interesting" kernel work (suspending, frequent loading of modules, etc) it could simply be that they are exposing it more easily and server hardware is just as capable of being affected.

For those wishing to discuss this bug, its implications, etc. there is a forum thread which seems more suitable for this, see: http://ubuntuforums.org/showthread.php?t=912666

Arnd (arnd-arndnet) wrote :

> It's alpha software, people should be considered as being aware that using it might break stuff.

That's absolutely ridiculous. I'm aware that an Ubuntu alpha or beta can break some stuff (like eating my filesystem or deleting my partitions, etc.). In fact it already did. However, this is a whole different thing: BREAKING HARDWARE.
To make this clear, we are talking about RMAing laptops and mainboards because of this bug. And we are also talking about reasonably popular hardware. With the statement standing in the room that I should expect my hardware to die when I try Ubuntu, I will certainly not try any alpha or beta Ubuntu software ever again.

In my opinion the Alpha should be pulled NOW. Then you can discuss what steps have to be taken to address this problem (e.g. one easy solution: disable e1000e and republish). This won't cost you more than a few days.

Just my 2 cents

Christian Wolf (christianwolf) wrote :

Folks,

I suggest removing the respective Intrepid AlphaX images from the mirrors ASAP.

Although testing is testing, and everybody knows that there is a risk (I remember a similar issue with Mandrake Linux and CD-Rom drives) and you, as a tester, take a known risk, we also have the responsibility to minimize impact of this issue.

I think only the latest Alpha6 has this flaw?

John Dong (jdong) wrote :

Shall we pull in e1000e-prevent-corruption-of-eeprom-nvm.patch? It seems from the discussion that it isn't a 100% fix (other methods of reaching mmio'ed EEPROM probably exist) but should at least eliminate this disaster scenario of just booting up the distribution causing the card to be hosed.


Does this mean Fedora 9 is not to blame for killing e1000e?

Slashdot reported that Fedora 9 and 10 are affected, but it sounds like only rawhide has the problem.

FWIW, I've heard of similar problems with recent -RT kernels.


John Dong wrote:
> Shall we pull in e1000e-prevent-corruption-of-eeprom-nvm.patch? It
> seems from the discussion that it isn't a 100% fix (other methods of
> reaching mmio'ed EEPROM probably exist) but should at least eliminate
> this disaster scenario of just booting up the distribution causing
> the card to be hosed.

no, this patch is for e1000, and has nothing to do with this problem.
Right now, the only reports of this issue are with 82566 and 82567 based
LAN parts (ich8 and ich9).

the eeprom is not MMIO mapped, the registers for accessing it are. I'm
still not clear if a random write to a memory location could corrupt
things, we'll be looking at that today.

Chris Jones (cmsj) on 2008-09-23
description: updated

>http://www.ubuntu.com/testing/intrepid/alpha6

No warning

>http://cdimage.ubuntu.com/releases/intrepid/alpha-6/

No warning

>http://cdimage.ubuntu.com/releases/intrepid/alpha-6/intrepid-desktop-i386.iso

Download works.

How many people test these ISOs? How many of them are using an Intel motherboard? (10%-40%?)
How many will ever test again if testing an ISO means you are frying your motherboard?

This isn't a blame game. But if top priority is not removing the alpha, it will very soon be...

We are talking about thousands, if not millions, of laptops and PCs that will be broken beyond repair, if I'm not mistaken...

And some people actually say things like:

>Jeffrey, we can't afford to do that; we need to be able to test with the Alpha CDs on the wide variety of hardware not affected by this bug,

Don't you get it? If you don't pull now, NOBODY WILL TEST THE NEXT VERSION.
There will be no NEXT VERSION because nobody DARES to install it.

>It's alpha software, people should be considered as being aware that using it might break stuff.

Yes, it may corrupt data. [But if it does that beyond its own partition, it should be a big issue as well.]
But BREAKING hardware?

I'm quite sure that afterwards, following some quality control and reflection, there will be SOME policy to prevent these mistakes (NOT TAKING THE IMAGE DOWN)...

But it will be too late.

PULL THE IMAGE: THEN DISCUSS!

abingham (abingham) wrote :

It's been almost a day since discussion on this issue resumed.

The Alpha 6 images are still up with no warning present.

I always assume that an Alpha or Beta release may break things to where I need to reinstall the OS. Battery life could be bad. Etc. But this is literally capable of *destroying* people's hardware. It's a whole different ball game. Even the LiveCDs are affected, and people testing them can reasonably assume there will be no hardware impact on their system even if it is an Alpha.

These images need to be pulled from availability *now*. Major mirror sites need to be notified.

If the release becomes '8.11' instead of '8.10' because of it, so what. We are talking about destroying motherboards here. Replacing a laptop motherboard can cost > $500.

If this is not dealt with, I will no longer be able to recommend Ubuntu to friends and family. The attitude of 'release on time at all costs' already caused many issues with 8.04, and now people are seriously suggesting continuing distribution of disc images that literally destroy hardware?

Jeffrey Baker (jwbaker) wrote :

There's no reason to be hysterical, but a re-spin of Alpha 6 CDs without the e1000e module may be called for. This is a separate bug, but the recommended workaround of adding "blacklist e1000e" in /etc/modprobe.d/blacklist doesn't work. Somehow, udev or some other thing manages to load it anyway. I had to unlink it.
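
Since a blacklist entry alone apparently did not keep the module out, it is worth checking what is actually loaded. A minimal sketch, assuming the standard /proc/modules layout where the module name is the first field on each line; module_loaded and its optional second argument (an alternative file to read) are hypothetical names for illustration.

```sh
# Hypothetical helper: succeed if the named module appears as the first field
# of any line in /proc/modules (or in the file given as the second argument).
module_loaded() {
    awk -v mod="$1" '$1 == mod { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# Example use:
#   module_loaded e1000e && echo "e1000e is loaded despite the blacklist"
```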

abingham (abingham) wrote :

Intel has ~80% CPU market share and >=70% of the chipset market for their own CPU.

So at least 56% of machines sold are Intel CPUs with Intel chipsets that are susceptible to this bug.

1 in 2 of Ubuntu testers could be vulnerable to this.

I strongly recommend if you are going to test for this bug or haven't seen it
yet on your ich8/9 system, that you RIGHT NOW, do ethtool -e ethX >
savemyeep.txt

Having a saved copy of your eeprom means we can help you write it back to your
system.
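
With a saved copy in hand, a later dump can be compared against it to spot changes. A minimal sketch; eeprom_unchanged is a made-up helper name, and eth0 in the usage comment stands in for whatever ethX the system uses.

```sh
# Hypothetical helper: succeed if two saved EEPROM dumps are byte-for-byte
# identical; fail if they differ or if either file cannot be read.
eeprom_unchanged() {
    cmp -s "$1" "$2"
}

# Example use:
#   ethtool -e eth0 > current.txt
#   eeprom_unchanged savemyeep.txt current.txt || echo "EEPROM dump changed"
```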

For those who might be interested in testing the Alpha but have an at-risk
machine, is there an accepted workaround that removes the e1000e driver
without jeopardizing the hardware?

okay, let's just use the data we *have* now. What we know is that some
users have reported a corrupt NVM. Intel networking does not have a
current reproduction but is *fully engaged* on trying to solve this
problem. We have only had reports on 82566 and 82567 based machines, no
others. Trying to extrapolate this out to "1 of 2" users is just fear
mongering.

These kernels being released with this problem are still in alpha/beta,
which means our testing audience is smaller, but so is the potential
impact of any problem.

The process is working as far as I can see, we have a set of users that
is reporting the problem, which will help keep the kernels with the
issue from being promoted to full production status.

If you have some useful data to add to this bug, please comment, we're
listening. I think the discussion about pulling alpha cds or whatever
should go to some mailing list, and not be inside this bug.


Tim Gardner (timg-tpi) wrote :

Uploaded module-init-tools_3.3-pre11-4ubuntu10 to temporarily blacklist e1000e.
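
To confirm that a blacklist entry is actually present on a given install, the modprobe configuration can be searched for it. A minimal sketch, assuming blacklist lines of the form "blacklist e1000e" in files under /etc/modprobe.d; is_blacklisted and its optional directory argument are hypothetical names for illustration.

```sh
# Hypothetical helper: succeed if any file in the modprobe configuration
# directory (default /etc/modprobe.d) has an active "blacklist <module>" line.
# Commented-out lines do not count.
is_blacklisted() {
    grep -Eqs "^[[:space:]]*blacklist[[:space:]]+$1([[:space:]]|\$)" "${2:-/etc/modprobe.d}"/*
}

# Example use:
#   is_blacklisted e1000e || echo "no blacklist entry for e1000e"
```

A present entry is no guarantee by itself: as reported earlier in this bug, the module was seen loading despite a blacklist line, so checking the loaded modules as well is prudent.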

Harry (harry2o) wrote :

Even if it means making myself look like an idiot: May I also suggest publishing (maybe along with the warnings) hints about how to find out whether your hardware is / will be / might be affected. And when does that eeprom writing actually happen - at boot time, when using the lan interface, or elsewhen?

I had the duplicate bug 272630 and consider myself lucky. I had the Intel NIC but had used Alpha 5. I only had the dmesg error and not the hardware eeprom failure. I had Alpha 6 ready to test until I found this bug thread. For folks like me, it will be a good decision to blacklist e1000e pending a resolution. Most 1st time Alpha testers would not be as lucky or have the time to seek out a full bug thread.

>The process is working as far as I can see, we have a set of users that
is reporting the problem, which will help keep the kernels with the
issue from being promoted to full production status.

I'm sorry... will you buy these people a new laptop? If so, then the process is working.

Try advertising: there is a 50% chance that this will destroy your hardware.
How many testers will you have left? ZERO.

THE IMPLIED RISK OF TESTING ALPHA SOFTWARE IS DATA CORRUPTION.

What's next? The machine blows up and kills people?
Would that be a reason to remove a machine-destroying CD image?

YES, it's not Ubuntu's fault... the alpha is shipped. The process is working, but the process is not done until somebody QUICKLY removes the image.

No tester signed up for this, and I for sure will not ever put an alpha disc into one of my machines. I'll even wait a couple of months after the release.

It's not that this happened. It's that afterwards official developers consider losing half the hardware of every volunteer who tests a sane and good move. Just part of the process.

This discussion SHOULD NOT BE MOVED TO A MAILING LIST until the CD IMAGE IS GONE.

I know this is not in keeping with the UBUNTU code of conduct. But neither is WILLFULLY DAMAGING PEOPLE'S PROPERTY.

You know of the BUG, REMOVE THE IMAGE.

Changed in linux:
status: Incomplete → In Progress
Changed in linux:
status: Confirmed → In Progress
M. Salivar (mfsalivar) wrote :

One alpha tester has already been lost, at least partially. When the problem occurred I sucked it up and said to myself, you know, it is alpha. Things like this shouldn't happen, but they do from time to time. Now that I see the indifference of Ubuntu devs to people losing their hardware, and even worse, to the extreme likelihood of more people losing theirs because of a stupidly strict adherence to release schedules, I'll never test an alpha outside of a virtual machine again (your loss, not mine). I'm not sure yet, but I may be through with Ubuntu, period.

You should pull all the current alphas, and quickly release an alpha 7 with the e1000e module removed or an older kernel. It's the only reasonable thing to do. Pulling the alphas and waiting for a fix will cause too many delays, but leaving up the current alphas is just plain immoral.


I suggest this is severity urgent now.

Thomas McKay (tom-mckay1) wrote :

I for one second the motion to remove the alpha images until they have been prepared so that there is no risk of hardware damage. I would be seriously concerned about the negative effects of ignoring this issue.

There is a liability on Canonical for the distribution of software which causes permanent and irreparable damage. The problem has been identified, and not shielding their customers from this issue is nothing less than WILLFUL NEGLECT.

REMOVE THE IMAGES NOW, and re-issue them when the driver has been blacklisted.

-Zeus- (matthew-momjian) wrote :

Thomas, why do you feel that the present warnings are not enough? I for one feel that they are certainly sufficient warning to people running those chipsets to not download them. Also, this isn't a democracy; we don't vote on whether to pull images or not. That's up to the core developers/Canonical.

Thomas McKay (tom-mckay1) wrote :

Why do I not feel the warnings are enough?

For one, because users who download via BitTorrent will never see those warnings, and, like somebody above noted, not all testers know for certain the hardware they are running. Simply saying "if your computer breaks, tough luck, we warned you" will not garner any respect among Linux users. The alpha testers are doing Canonical a great service, and taking that for granted would be a shame.

I'm not saying it's a democracy; I am just warning of the consequences of ignoring this issue and bricking people's computers.

Ing0R (ing0r) wrote :

I think a warning (with Jesse Brandeburg's advice) should go to *every* tester who is running such a system.
I just read about it on a computer news site and I wish I had gotten this news from Canonical (maybe via Update Manager).

Ing0R

Thomas McKay (tom-mckay1) wrote :

Not to mention the thousands of people who downloaded these images before the warning was posted.

Daniel Kulesz (kuleszdl) wrote :

I was really shocked to see that it took more than one day between the issue becoming apparent and the warning being placed on the website. This is a very serious issue and can cause severe damage to really expensive hardware (e.g. the most recent Lenovo ThinkPads like the X200, X301, T400, T500 and so on).

Also, please be aware that some testers might simply change their old download URL in a download manager and increment from 5 to 6. Therefore I really suggest at least moving the images to some different place and replacing the ISO files with text files containing the same warning together with the real download location.

Is there any way to issue a warning to all the testers who are already using or began downloading the ISO? Does the installer query some URL through which the warning could be injected?

The only place where I can find official download links to the
BitTorrent is at http://cdimage.ubuntu.com/releases/intrepid/alpha-6/,
where the warning exists.

Chris Jones (cmsj) wrote :

Please listen to what Jesse said in comment #24. This is a bug report, not a discussion forum.

Calls for ISOs to be pulled, legal claims, accusations and use of capslock should be on the ubuntu-devel or ubuntu-devel-discuss mailing list. For one thing, by posting to those lists your opinion will be seen by a much wider audience.

The people subscribed to this bug have either been affected by this bug (such as myself, I filed this bug), or are trying to fix it.

Yelling at those people (which is what a number of you are doing) will solve nothing. Stop it please. The only relevant discussion here is that which is gathering information about the bug, or attempting to fix it.

I appreciate this is a contentious issue (since my laptop was affected by this), but I want to read about progress, I don't want to read lots of ranting. I also don't want this post to be perceived as negative, or whining, or whatever. I fully sympathise with people who are trying to protect their fellow users from harm, and in that respect I apologise for not shouting more about this bug when it was first uncovered. All I did was make sure as many of the people as possible who could fix it, knew about it.
If you would like to argue with me about this, please do not do it here, email me personally (see my Launchpad overview page for my addresses) or via <email address hidden>.

Michael W. (hotdog003-gmail) wrote :

If we don't pull the images (we should, but I won't comment since it's already being discussed), it might be a good idea to at least make the words "permanently inoperable" on the Alpha 6 testing page in big, bold letters so users have less of a chance to skim over that part.

Think about it: How many times do we read warning labels on the stuff we eat? My point exactly. Having a "WARNING" section on a testing page where people are already expecting things not to work perfectly might not be an accurate indicator of exactly how grave this problem really is.

I think we should do everything in our power to at least let users know what they're dealing with here. Somehow, we've managed to produce a stick of dynamite with a lit fuse. A lot of people are expecting testing images to be imperfect and may skip right over the warning section because they already know the typical "This is just alpha software, hopefully nothing major will happen" lecture that warning sections typically give them. Making "permanently inoperable" in bold letters will make it much more eye-catching than it is now.

Changed in linux:
status: Unknown → Confirmed

Patches to the e1000e driver to protect the NVM were posted to netdev a few hours ago. They need to be tried on this problem. Either they will fix the problem or they should point to what is causing it. The patches are obviously for the 2.6.27-rc kernels.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 2.6.27-4.6

---------------
linux (2.6.27-4.6) intrepid; urgency=low

  [ Tim Gardner ]

  * Disable e1000e until the NVRAM corruption problem is found.
    - LP: #263555

  [ Upstream Kernel Changes ]

  * Revert "[Bluetooth] Eliminate checks for impossible conditions in IRQ
    handler"

 -- Ben Collins <email address hidden> Tue, 23 Sep 2008 09:53:57 -0400

Changed in linux:
status: Triaged → Fix Released
Jojo (kuzniarpawel) wrote :

According to http://groups.google.com/group/linux.kernel/browse_thread/thread/a5ef7deff8551186/d05c233ecb430178

this bug might be related to xorg and Intel graphics.

I used e1000e for 5 days with a lot of traffic on eth0 and nothing happened (luck?), but I have a T61p with an NVIDIA Quadro.

William Grant (wgrant) on 2008-09-24
Changed in linux:
status: Fix Released → In Progress

Has anyone tried these patches from Jeff Kirsher (Intel)?
http://lkml.org/lkml/2008/9/23/427
http://lkml.org/lkml/2008/9/23/431
http://lkml.org/lkml/2008/9/23/432

I also think it would be a good idea to raise the priority and severity, because this bug can DAMAGE hardware.

Best regards,
Renato

kernel-2.6.27-0.352.rc7.git1.fc10 (http://koji.fedoraproject.org/koji/buildinfo?buildID=64060) includes a fix for e1000 and (temporarily) disables e1000e.

This is probably sufficient for F10Beta (pending some regression testing)

I guess that will work, but you've now killed the wired network on quite a few hardware platforms. Pulling the patches from comment #20 would probably be better for F10Beta.

Changed in linux:
status: Confirmed → Fix Committed
Changed in linux:
status: Unknown → Confirmed
Changed in linux:
status: In Progress → Incomplete

> I also think it would be a good idea to raise the priority and severity,
> because this bug can DAMAGE hardware.

Nobody is changing priority and severity because those fields are meaningless. We should really remove those fields from the interface.

please see my message on lkml titled "e1000e NVM corruption issue status"

http://lkml.org/lkml/2008/9/25/510
This appears to be the post Jesse is referring to.

Shwan (shwan-ciyako) on 2008-09-27
description: updated

Another message from Jesse Brandeburg on LKML is a list of the patches being used to debug the issue and under test as possible fixes:

  http://lkml.org/lkml/2008/9/25/515

Changed in linux:
status: Fix Committed → Confirmed
Tim Gardner (timg-tpi) on 2008-09-30
Changed in linux:
status: In Progress → Fix Committed
Matt Zimmerman (mdz) on 2008-09-30
Changed in linux:
milestone: ubuntu-8.10-beta → none
Steve Langasek (vorlon) on 2008-09-30
Changed in linux:
milestone: none → ubuntu-8.10
Changed in linux:
status: Incomplete → In Progress
Changed in linux:
status: Fix Committed → Fix Released
Michael Losonsky (michl) on 2008-10-03
Changed in linux:
status: Fix Released → In Progress
Changed in linux:
status: Confirmed → Fix Released
Colin Watson (cjwatson) on 2008-10-03
Changed in linux:
status: In Progress → Fix Released

*** Bug 465127 has been marked as a duplicate of this bug. ***

I was just hit by this bug after doing a preupgrade from F9 64bit to F10 beta 64bit. The system states "no network device available". I'm including the output I got after running dmesg and other commands (hope it helps):
[Francisco@localhost ~]$ su -
Password:
[root@localhost ~]# /sbin/ifconfig
lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:16436 Metric:1
          RX packets:124 errors:0 dropped:0 overruns:0 frame:0
          TX packets:124 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:10080 (9.8 KiB) TX bytes:10080 (9.8 KiB)

[root@localhost ~]# dmesg | grep eth
Driver 'sd' needs updating - please use bus_type methods
Driver 'sr' needs updating - please use bus_type methods
[root@localhost ~]# "dhclient eth0" //
-bash: dhclient eth0: command not found
[root@localhost ~]# dhclient eth0
Device "eth0" does not exist.
Cannot find device "eth0"
[root@localhost ~]# dhclient eth1
Device "eth1" does not exist.
Cannot find device "eth1"
[root@localhost ~]# lscpi -v|grep -i ethernet
-bash: lscpi: command not found
[root@localhost ~]# lspci -v|grep -i ethernet
00:19.0 Ethernet controller: Intel Corporation 82566DC Gigabit Network Connection (rev 02)
[root@localhost ~]# ifconfig -a
lo Link encap:Local Loopback
          inet addr:127.0.0.1 Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING MTU:16436 Metric:1
          RX packets:668 errors:0 dropped:0 overruns:0 frame:0
          TX packets:668 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:0
          RX bytes:55232 (53.9 KiB) TX bytes:55232 (53.9 KiB)
[root@localhost ~]#

I was able to solve this by manual installation of the latest available kernel, 2.6.27-0.382.rc8.git4.fc10, along with the equivalent kernel-firmware. Worked immediately.

Changed in linux:
status: In Progress → Fix Released
Changed in linux:
status: Confirmed → In Progress
Changed in linux:
status: In Progress → Fix Released

I have tried the newest rawhide kernel and it does not help.
I have also tried the attached drivers. They did not change anything: still no ethernet. This time I did not mess around with ethtool or any Intel software.

Output of dmesg:

e1000e: Intel(R) PRO/1000 Network Driver - 0.4.1.7_nocsum-NAPI
e1000e: Copyright (c) 1999-2008 Intel Corporation.
ACPI: PCI Interrupt 0000:00:19.0[A] -> GSI 22 (level, low) -> IRQ 22
PCI: Setting latency timer of device 0000:00:19.0 to 64
0000:00:19.0: : Failed to initialize MSI interrupts. Falling back to legacy interrupts.
0000:00:19.0: 0000:00:19.0: The NVM Checksum Is Not Valid
BUG: soft lockup - CPU#0 stuck for 61s! [modprobe:3703]
Modules linked in: e1000e(+) rfkill_input bridge bnep rfcomm l2cap vboxdrv ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi scsi_transport_iscsi fuse sunrpc arc4 ecb crypto_blkcipher b43 ssb rfkill mac80211 cfg80211 input_polldev ipt_REJECT xt_tcpudp nf_conntrack_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables cpufreq_ondemand acpi_cpufreq freq_table dm_mirror dm_log dm_multipath dm_mod ipv6 sr_mod cdrom pcspkr snd_hda_intel serio_raw joydev snd_seq_dummy sg snd_seq_oss snd_seq_midi_event i915 snd_seq ata_piix snd_seq_device pata_acpi snd_pcm_oss snd_mixer_oss video output ata_generic wmi battery ac drm hci_usb snd_pcm i2c_algo_bit i2c_core iTCO_wdt iTCO_vendor_support snd_timer snd_page_alloc bluetooth snd_hwdep snd soundcore ahci libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: e1000e]
CPU 0:
Modules linked in: e1000e(+) rfkill_input bridge bnep rfcomm l2cap vboxdrv ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi scsi_transport_iscsi fuse sunrpc arc4 ecb crypto_blkcipher b43 ssb rfkill mac80211 cfg80211 input_polldev ipt_REJECT xt_tcpudp nf_conntrack_ipv4 xt_state nf_conntrack iptable_filter ip_tables x_tables cpufreq_ondemand acpi_cpufreq freq_table dm_mirror dm_log dm_multipath dm_mod ipv6 sr_mod cdrom pcspkr snd_hda_intel serio_raw joydev snd_seq_dummy sg snd_seq_oss snd_seq_midi_event i915 snd_seq ata_piix snd_seq_device pata_acpi snd_pcm_oss snd_mixer_oss video output ata_generic wmi battery ac drm hci_usb snd_pcm i2c_algo_bit i2c_core iTCO_wdt iTCO_vendor_support snd_timer snd_page_alloc bluetooth snd_hwdep snd soundcore ahci libata sd_mod scsi_mod ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd [last unloaded: e1000e]
Pid: 3703, comm: modprobe Not tainted 2.6.26.5-45.fc9.x86_64 #1
RIP: 0010:[<ffffffffa0649c24>] [<ffffffffa0649c24>] :e1000e:e1000_flash_cycle_ich8lan+0x34/0x60
RSP: 0018:ffff81003c0699d8 EFLAGS: 00000202
RAX: 000000000000e028 RBX: ffff81003c0699f8 RCX: 000000005351a052
RDX: 00000000000006e8 RSI: 00000000000001f4 RDI: 00000000000006c3
RBP: ...


As far as I know, the current fixes in the newest kernel only prevent this from happening to undamaged hardware. They don't repair hardware that is already damaged.

Some people from Intel and Novell were talking about developing a tool to repair it, if you have a backup of the original eeprom contents or access to an identical system. However, I don't know if that tool is already done or where you can get it from.
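If you do have a saved image, it may be worth sanity-checking it before anyone tries to restore from it. Below is a minimal sketch, not the Intel/Novell tool. It assumes the image is a raw byte dump of the NVM, that the words are stored little-endian, and that the first 64 words (0x00 through 0x3F) must sum to 0xBABA, which is my reading of the e1000e checksum code. The function and constant names are made up for illustration.

```python
import struct

# Assumptions for this sketch (taken from my reading of the e1000e source,
# not from anything posted in this bug):
NVM_SUM = 0xBABA           # expected 16-bit sum of the checksummed words
NVM_CHECKSUM_WORD = 0x3F   # index of the last word covered by the sum


def nvm_checksum_ok(image: bytes) -> bool:
    """Return True if words 0x00..0x3F of a raw dump sum to NVM_SUM."""
    n_words = NVM_CHECKSUM_WORD + 1
    if len(image) < 2 * n_words:
        raise ValueError("image too short to hold the checksummed region")
    # Interpret the first 128 bytes as 64 little-endian 16-bit words.
    words = struct.unpack("<%dH" % n_words, image[: 2 * n_words])
    return (sum(words) & 0xFFFF) == NVM_SUM


if __name__ == "__main__":
    # Synthetic image: 63 zero words followed by a word that fixes the sum.
    good = struct.pack("<64H", *([0] * 63 + [NVM_SUM]))
    print(nvm_checksum_ok(good))           # a consistent image
    print(nvm_checksum_ok(b"\xff" * 128))  # an all-0xff image
```

A raw dump can be taken with something like `ethtool -e eth0 raw on > backup.bin` on a machine where the part still works. A passing checksum only says the image is internally consistent, not that it is the right image for your board.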

Well, I did not back up my eeprom, but my laptop is a popular model, so I may be able to get someone else's eeprom image to restore it. I'll just ask someone for an image.

The thing is, I had to disable e1000e loading (I am using the drivers attached to this bug) because it constantly crashes with the message I pasted above, and I cannot boot my kernel unless I blacklist the e1000e module.

I hope someone finds a way to fix it soon.

Changed in linux:
status: Confirmed → Fix Released
Changed in linux:
status: In Progress → Fix Released

It looks like the root cause of this problem has been found. Included here is the work-around for it as well as the reference to the 2.6.28-rc fix for the problem.

>---------- Forwarded message ----------
>From: Steven Rostedt <email address hidden>
>Date: Wed, Oct 15, 2008 at 3:21 PM
>Subject: [PATCH -stable] disable CONFIG_DYNAMIC_FTRACE due to possible
>memory corruption on module unload
>To: LKML <email address hidden>, <email address hidden>
>Cc: Linus Torvalds <email address hidden>, Andrew Morton
><email address hidden>, Arjan van de Ven <email address hidden>,
><email address hidden>, <email address hidden>, Thomas Gleixner
><email address hidden>, Ingo Molnar <email address hidden>
>
>
>
>While debugging the e1000e corruption bug with Intel, we discovered
>today that the dynamic ftrace code in mainline is the likely source of
>this bug.
>
>For the stable kernel we are providing the only viable fix patch:
>labeling CONFIG_DYNAMIC_FTRACE as broken. (see the patch below)
>
>We will follow up with a backport patch that contains the fixes. But
>since the fixes are not a one liner, the safest approach for now is to
>disable the code in question.
>
>The cause of the bug is due to the way the current code in mainline
>handles dynamic ftrace. When dynamic ftrace is turned on, it also
>turns on CONFIG_FTRACE which enables the -pg config in gcc that places
>a call to mcount at every function call. With just CONFIG_FTRACE this
>causes a noticeable overhead. CONFIG_DYNAMIC_FTRACE works to ease this
>overhead by dynamically updating the mcount call sites into nops.
>
>The problem arises when we trace functions and modules are unloaded.
>The first time a function is called, it will call mcount and the mcount
>call will call ftrace_record_ip. This records the calling site and
>stores it in a preallocated hash table. Later on a daemon will
>wake up and call kstop_machine and convert any mcount callers into
>nops.
>
>The evolution of this code first tried to do this without the
>kstop_machine and used cmpxchg to update the callers as they were
>called. But I was informed that this is dangerous to do on SMP machines
>if another CPU is running that same code. The solution was to do this
>with kstop_machine.
>
>We still used cmpxchg to test if the code that we are modifying is
>indeed code that we expect to be before updating it - as a final
>line of defense.
>
>But on 32bit machines, ioremapped memory and modules share the same
>address space. When a module would load its code into memory and
>execute some code, that would register the function.
>
>On module unload, ftrace incorrectly did not zap these functions from
>its hash (this was the bug). The cmpxchg could have saved us in most
>cases (via luck) - but with ioremap-ed memory that was exactly the
>wrong thing to do - the results of cmpxchg on device memory are
>undefined. (and will likely result in a write)
>
>The pending .28 ftrace tree does not have this bug anymore, as a
>general push towards more robustness of code patching, this is done
>differently: we do not use cmpxchg and we do a WARN_ON and turn the
>tracer off if anything deviates from its expected state. Furthermo...
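
For readers trying to follow the sequence described in the forwarded message, here is a toy model of it. It is not kernel code: the classes, addresses and byte strings are all invented for illustration. The only thing it tries to show is how a recorded call site that is not removed on module unload can later cause a cmpxchg against device memory that has been ioremapped into the same address range.

```python
CALL_MCOUNT = b"call"  # stand-in for the bytes of a call to mcount
NOP = b"nop "          # stand-in for the replacement nop bytes


class ToyAddressSpace:
    """Maps an address to (kind, contents); kind is "code" or "device"."""

    def __init__(self):
        self.cells = {}
        self.device_accesses = []  # addresses where cmpxchg hit device memory

    def cmpxchg(self, addr, expected, new):
        kind, value = self.cells[addr]
        if kind == "device":
            # On real hardware the outcome is undefined and may be a write.
            self.device_accesses.append(addr)
            return False
        if value == expected:
            self.cells[addr] = (kind, new)
            return True
        return False


class ToyTracer:
    def __init__(self, zap_on_unload):
        self.zap_on_unload = zap_on_unload
        self.recorded = set()  # call sites seen so far (the "hash")

    def record_ip(self, addr):
        self.recorded.add(addr)

    def module_unloaded(self, start, end):
        # The bug described above corresponds to zap_on_unload=False.
        if self.zap_on_unload:
            self.recorded = {a for a in self.recorded if not start <= a < end}

    def convert_to_nops(self, space):
        for addr in sorted(self.recorded):
            if addr in space.cells:
                space.cmpxchg(addr, CALL_MCOUNT, NOP)


def simulate(zap_on_unload):
    space = ToyAddressSpace()
    tracer = ToyTracer(zap_on_unload)
    space.cells[0x1000] = ("code", CALL_MCOUNT)  # module loaded, function run
    tracer.record_ip(0x1000)
    del space.cells[0x1000]                      # module unloaded
    tracer.module_unloaded(0x1000, 0x2000)
    space.cells[0x1000] = ("device", b"\xff\xff\xff\xff")  # ioremap reuses range
    tracer.convert_to_nops(space)                # later patching pass
    return space.device_accesses


if __name__ == "__main__":
    print(simulate(zap_on_unload=False))  # stale record reaches device memory
    print(simulate(zap_on_unload=True))   # record removed, nothing touched
```

In this model the stale entry is the whole problem: once the record is dropped at unload time, the later patching pass never looks at the reused address.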


Changed in linux:
status: Fix Released → Confirmed
Amit Kucheria (amitk) on 2008-10-24
Changed in linux-lpia:
assignee: nobody → amitk
importance: Undecided → Critical
milestone: none → ubuntu-8.10
status: New → Fix Committed
Changed in linux-lpia:
status: Fix Committed → Fix Released
Changed in linux:
status: Confirmed → Fix Released
Basilisk (bluebal-1) on 2008-11-01
Changed in linux:
assignee: timg-tpi → nobody
assignee: nobody → bluebal-1
William Grant (wgrant) on 2008-11-01
Changed in linux:
assignee: bluebal-1 → timg-tpi

that cpu-stuck bug was a problem in the way the e1000e driver loops to read the NVM.

part of the threads on lkml covered a fix for that issue.

Please contact me directly for assistance restoring your eeprom image if you need help.

Changed in linux (Ubuntu):
status: Fix Released → Confirmed
Steve Langasek (vorlon) wrote :

Please don't change bug statuses without explanation

Changed in linux (Ubuntu):
status: Confirmed → Fix Released

Can anybody provide me with the fix for the cpu-stuck bug?

I also need an eeprom image to restore the Intel® 82573L Gigabit Ethernet LAN controller on my D5400XS motherboard.

Changed in linux:
importance: Unknown → Medium
Changed in linux (Gentoo Linux):
importance: Unknown → Medium
Changed in linux (Mandriva):
importance: Unknown → Critical
Troex Nevelin (troex) wrote :

I have a ThinkPad X60 with an 82573L, and after upgrading to 11.04 beta with the latest kernel it has almost completely stopped working.
I tested e1000e 1.2.20-k2 (in the stock kernel) and the 1.3.10a driver with no luck. Booting with the option "pcie_aspm=force" doesn't help. I've tried e1000e_recover.iso, but it does not boot on my 32bit processor. The last strange thing is that I cannot read the eeprom:

@tpx60:~# ifconfig eth0 up
@tpx60:~# ethtool -e eth0
Cannot get driver information: No such device
@tpx60:~# ifconfig eth0 down
@tpx60:~# ethtool -e eth0
Offset Values
------ ------
0x0000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0010 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0020 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0030 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0040 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0050 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0060 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0070 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

@tpx60:~# dmesg | grep e1000e
[ 1.231354] e1000e: Intel(R) PRO/1000 Network Driver - 1.2.20-k2
[ 1.231358] e1000e: Copyright(c) 1999 - 2011 Intel Corporation.
[ 1.231392] e1000e 0000:02:00.0: Disabling ASPM L1
[ 1.231410] e1000e 0000:02:00.0: PCI INT A -> GSI 16 (level, low) -> IRQ 16
[ 1.231435] e1000e 0000:02:00.0: setting latency timer to 64
[ 1.231639] e1000e 0000:02:00.0: irq 44 for MSI/MSI-X
[ 1.232563] e1000e 0000:02:00.0: Disabling ASPM L0s
[ 1.392249] e1000e 0000:02:00.0: eth0: (PCI Express:2.5GB/s:Width x1) 00:16:d3:3a:47:ae
[ 1.392253] e1000e 0000:02:00.0: eth0: Intel(R) PRO/1000 Network Connection
[ 1.392332] e1000e 0000:02:00.0: eth0: MAC: 2, PHY: 2, PBA No: 005302-003
[ 27.120351] e1000e 0000:02:00.0: irq 44 for MSI/MSI-X
[ 27.176320] e1000e 0000:02:00.0: irq 44 for MSI/MSI-X
[ 28.762602] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
[ 28.762610] e1000e 0000:02:00.0: eth0: 10/100 speed: disabling TSO
[ 32.855747] e1000e 0000:02:00.0: PCI INT A disabled
[ 32.855761] e1000e 0000:02:00.0: PME# enabled
[ 89.756091] e1000e 0000:02:00.0: BAR 0: set to [mem 0xee000000-0xee01ffff] (PCI address [0xee000000-0xee01ffff])
[ 89.756109] e1000e 0000:02:00.0: BAR 2: set to [io 0x2000-0x201f] (PCI address [0x2000-0x201f])
[ 89.756154] e1000e 0000:02:00.0: restoring config space at offset 0xf (was 0x100, writing 0x10b)
[ 89.756208] e1000e 0000:02:00.0: restoring config space at offset 0x1 (was 0x100000, writing 0x100107)
[ 89.756278] e1000e 0000:02:00.0: PME# disabled
[ 89.756372] e1000e 0000:02:00.0: Disabling ASPM L1
[ 89.756474] e1000e 0000:02:00.0: irq 44 for MSI/MSI-X
[ 89.758786] e1000e 0000:02:00.0: eth0: MAC Wakeup cause - Link Status Change
[ 89.828165] e1000e 0000:02:00.0: PME# enabled
[ 89.976110] e1000e 0000:02:00.0: BAR 0: set to [mem 0xee000000-0xee01ffff] (PCI address [0xee000000-0xee01ffff])
[ 89.976128] e1000e 0000:02:00.0: BAR 2: set to [io 0x2000-0x201f] (PCI address [0x2000-0x201f])
[ 89.976184] e1000e 0000:02:00.0: restoring config space at offset 0xf (was 0x100, writing 0x10b)
[ 89.976234] e1000e 0000:02:00.0: restoring config space at offs...
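
For anyone who wants to check a pasted `ethtool -e` dump like the one above without reading it by eye, here is a small sketch. It assumes the text layout shown in this comment (an offset column starting with `0x`, followed by hex bytes); the function names are made up. Note that an all-0xff result can also mean the read itself failed, for example because the device was powered down, so it is not proof that the NVM is erased.

```python
def parse_ethtool_dump(text):
    """Collect the data bytes from the text output of `ethtool -e`."""
    data = bytearray()
    for line in text.splitlines():
        parts = line.split()
        # Skip the header, the separator row and any blank lines.
        if not parts or not parts[0].lower().startswith("0x"):
            continue
        data.extend(int(p, 16) for p in parts[1:])
    return bytes(data)


def looks_erased(data):
    """True if the dump is non-empty and every byte reads back as 0xff."""
    return len(data) > 0 and all(b == 0xFF for b in data)


if __name__ == "__main__":
    sample = """Offset Values
------ ------
0x0000 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
0x0010 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
"""
    dump = parse_ethtool_dump(sample)
    print(len(dump), looks_erased(dump))
```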


This bug is not a catch-all for all e1000e issues. The original issue this bug was filed against is fixed and is highly unlikely to recur. If you're having e1000e issues, please file a new bug.

Changed in linux (Fedora):
importance: Unknown → Medium