EDAC spam in dmesg, edac-utils shows no errors

Bug #367774 reported by Tobatus
This bug affects 5 people
Affects: linux (Ubuntu)
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

After upgrading from 8.10 to 9.04, dmesg is being spammed with messages like:
[258533.482524] EDAC MC0: UE page 0x0, offset 0x0, grain 0, row 3, labels ":": x38 UE
[258533.483835] EDAC MC0: UE page 0x0, offset 0x0, grain 0, row 7, labels ":": x38 UE

But edac-utils shows no errors and the server is not unstable in any way, so I don't really expect my RAM to be broken. My motherboard is a Gigabyte GA-X48-DS5 and I use ECC memory (KVR800D2E6K2/4G).
$ edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: ch0|ch1: 0 Uncorrected Errors
mc0: csrow0: ch0: 0 Corrected Errors
mc0: csrow0: ch1: 0 Corrected Errors
mc0: csrow4: ch0|ch1: 0 Uncorrected Errors
mc0: csrow4: ch0: 0 Corrected Errors
mc0: csrow4: ch1: 0 Corrected Errors

ProblemType: Bug
Architecture: amd64
DistroRelease: Ubuntu 9.04
HibernationDevice: RESUME=/dev/md1
MachineType: Gigabyte Technology Co., Ltd. X48-DS5
Package: linux-image-2.6.28-11-server 2.6.28-11.42
ProcCmdLine: root=/dev/md0 ro quiet splash
ProcEnviron:
 SHELL=/bin/bash
 LANG=en_US
ProcVersionSignature: Ubuntu 2.6.28-11.42-server
SourcePackage: linux

mjames (mcj210) wrote :

I am also seeing this behavior on Kubuntu 9.04 (desktop, amd64). I have a similar setup to Tobatus: a Gigabyte motherboard using the X38 chipset. Let me know if you need any further info from me.

beltbraces (beltbraces) wrote :

Just for the record, I too am seeing this behavior (Gigabyte GA-X48-DS4 motherboard). However, I don't see it with an ASUS P5E motherboard (all the other components being the same, swapped to test the different motherboard).

Also, if I run memtest86+ for a few seconds (just long enough to make a complete test of all memory addresses) before booting into Linux, then I don't see any EDAC errors reported.

It seems as if Linux is attempting to read memory before it has been written (possibly due to pipelining attempting to read addresses that are not actually going to be needed?).

Does anyone know if it is the responsibility of the BIOS to initialize all memory, or if the OS ought to be doing this?

Marty Lucich (mar3ty) wrote :

Since upgrading to 9.04 I see the same thing. It's not just dmesg for me, it's also every console: an endless stream of lines like this, at a rate of one per second:

[51804.532033] EDAC MC0: UE page 0x1abc6, offset 0x0, grain 128, row 0, labels ":": i82975x UE
[51805.532042] EDAC MC0: UE page 0x1abc6, offset 0x0, grain 128, row 0, labels ":": i82975x UE
[51806.532033] EDAC MC0: UE page 0x1abc6, offset 0x0, grain 128, row 0, labels ":": i82975x UE

Makes using a VT difficult.

This is on an ASUS P5W DH Deluxe motherboard that has been stable running Ubuntu for the previous 2 years.
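
To gauge how much of the kernel log each EDAC driver is responsible for, the UE lines can be tallied by driver name. This is only a sketch: the helper name count_edac_ue is made up for illustration, and it assumes the log line format shown in this report, where the driver name is the second-to-last field.

```shell
# count_edac_ue: hypothetical helper, not part of edac-utils.
# Reads kernel log lines on stdin and prints "<driver> <count>" for each
# driver that logged "EDAC MCn: UE ..." messages.
count_edac_ue() {
    awk '/EDAC MC[0-9]+: UE / { n[$(NF - 1)]++ }   # driver name precedes the final "UE"
         END { for (d in n) print d, n[d] }' | sort
}
```

Typical use would be "dmesg | count_edac_ue"; fed the three lines quoted above, it prints "i82975x 3".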

beltbraces (beltbraces) wrote :

Does running memtest86+ fix things until the next cold boot?

mjames (mcj210) wrote :

beltbraces, I tried your trick, but exiting memtest86+ takes me back to the BIOS startup screen, and the bug still seems to happen.

Every so often, I will switch to a virtual console and notice that the bug is *not* occurring. I'd say that I see it on 80% of my boots, rather than 100% of the time. Could it be that when you tried your memtest86+ technique, you might have just run into one of those 20%? Do you think that you see it on 100% of your boots?

I haven't been able to correlate the 80/20 split with anything else that I'm doing, so I don't really have any theories as to what is causing the difference.

beltbraces (beltbraces) wrote :

I see the fix 100% of the time (in fact I do it about once a day, every time I turn on my PC), so it would seem we have different problems.

Can you turn off EDAC reporting by creating a file named /etc/modprobe.d/edac that reads:
    alias i82975x_edac off
and then rebooting?

mjames (mcj210) wrote :

Yeah, I think they are somewhat different.

I tried the edac file that you suggested, but it didn't have a noticeable effect. I am getting the log messages ending in "x38 UE" that Tobatus gave above, rather than ones like yours, so I tried "alias x38_edac off", but that also had no effect.

I actually have very little clue about how the edac system works or how to configure it, but I'm willing to help test any ideas here on this end.

beltbraces (beltbraces) wrote :

Oops, sorry, I was confusing you with Marty Lucich who mentioned i82975x! And I am obviously out of date with my module management.

OK, I created myself a file named /etc/modprobe.d/edac.conf that contains one line:
    blacklist x38_edac
and after reboot I no longer see any EDAC messages on my PC. Does that help you?
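
For anyone scripting this workaround, here is a minimal sketch of writing that one-line file. The helper name write_edac_blacklist and its optional directory argument are made up for illustration; the file name edac.conf and the module name come from the comment above. Note that it overwrites any existing edac.conf in the target directory, and that a blacklist entry only stops the module from being loaded automatically.

```shell
# write_edac_blacklist MODULE [DIR]: hypothetical helper.
# Writes "blacklist MODULE" to DIR/edac.conf (DIR defaults to /etc/modprobe.d).
# Overwrites an existing edac.conf; run as root for the default directory.
write_edac_blacklist() {
    module=$1
    dir=${2:-/etc/modprobe.d}
    printf 'blacklist %s\n' "$module" > "$dir/edac.conf"
}
```

As root, "write_edac_blacklist x38_edac" followed by a reboot should match the manual steps described above.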

Tobatus (toba-hukassa) wrote :

I hate to reboot my server, so do you know if "rmmod x38_edac" works on the fly the same way the blacklist does at reboot? Or is it just better / safer to reboot?

beltbraces (beltbraces) wrote :

That works perfectly well until the next reboot. I just tried it, and it certainly stops the error messages being added to the log, with no other observable effect. It does leave the edac_core module still loaded, so to get the same effect as a blacklist/reboot you would also need to do "rmmod edac_core". I just tried that too, and again no observable effect.

Of course removing the edac module is a bit like removing the battery from a smoke alarm that is malfunctioning: it stops the false alarms, but it doesn't make the alarm work!
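
The on-the-fly removal described above can be sketched as follows. The helper names module_loaded and unload_edac are made up for illustration; the sketch assumes the usual /proc/modules layout (module name in the first column) and removes the chipset driver before edac_core, since the driver depends on it.

```shell
# module_loaded NAME [FILE]: hypothetical helper.
# Succeeds if NAME is listed in FILE (default /proc/modules).
module_loaded() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }' "${2:-/proc/modules}"
}

# unload_edac DRIVER: hypothetical helper, must run as root.
# Removes the chipset driver (e.g. x38_edac) first, then edac_core.
unload_edac() {
    for m in "$1" edac_core; do
        if module_loaded "$m"; then
            rmmod "$m" || return 1
        fi
    done
}
```

As root, "unload_edac x38_edac" would then stop the messages until the next reboot, with the same caveat about the smoke alarm.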

mjames (mcj210) wrote :

That works for me. I now have use of my virtual terminals and my system and kernel logs again! Thanks!

Now I'm just hoping that my house doesn't catch fire....

(It would be nice to have this bug actually fixed, instead of worked around, since I use ECC memory for a reason. At the moment, though, I'd rather have use of my logs than notification of ECC warnings.)

Luis Mondesi (lemsx1) wrote :

my systems hit this bug during Jaunty and now on Karmic it still shows the same crap.

there should be a "verbose" option on edac to throw these errors if one wants them, and keep quiet by default.

i also rmmod this from my system... i have no idea what this is used for but obviously it is not working.

beltbraces (beltbraces) wrote :

What is not working is really your BIOS; the GIGABYTE engineers have been aware of the problem with their BIOSes since it was reported to them back in April, but to date they have been unable to correct the problem. I have no idea if they are still trying, or if they have decided it is not worth the effort. On the other hand, the ASUS motherboards using the same chipset don't suffer from this problem. Not sure about any other brands?

It would be possible for the EDAC module to work around the BIOS bug by initializing all memory at startup, or at least to avoid infinitely repeated errors by scrubbing (either scrubbing all memory in the background, or at least scrubbing each address that reported an error).

However, I would not like to see the default changed to not reporting the errors - not unless the problem motherboards could be identified, and the default changed only for the motherboards with the bad BIOSes...

Jeremy Foshee (jeremyfoshee) wrote :

Hi Tobatus,

This bug was reported a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue? Can you try with the latest development release of Ubuntu? ISO CD images are available from http://cdimage.ubuntu.com/releases/lucid.

If it remains an issue, could you run the following command from a Terminal (Applications->Accessories->Terminal). It will automatically gather and attach updated debug information to this report.

apport-collect -p linux 367774

Also, if you could test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

[This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: needs-kernel-logs
tags: added: needs-upstream-testing
tags: added: kj-triage
Changed in linux (Ubuntu):
status: New → Incomplete
Joseph (jg-jgoettgens) wrote :

After installing 10.04 (Asus P5BV-C with Kingston ECC RAM) dmesg reports the following EDAC error:
[14597.010027] EDAC MC0: UE page 0x0, offset 0x0, grain 1073741824, row 1, labels ":": i3200 UE
Previous Ubuntu versions did not report this error, but EDAC support for my server board could have been added with the latest kernels. The workarounds described above essentially work.

Booting the machine with the Quick Boot option disabled (or running memtest86+ for a while) reduces the frequency of this error (to maybe once every hour), but with the BIOS option Quick Boot enabled I get a message about every second. Memtest86+ 4.0 does not report any error, although for real UEs it probably should. The machine is rock stable, so I guess it is essentially a bogus message.

Simon Eisenmann (longsleep) wrote :

Just upgraded a rock solid machine (SuperMicro, no ECC RAM) and I get tons and tons of the following error spammed to console and syslog:

EDAC MC0: UE page 0x0, offset 0x0, grain 536870912, row 0, labels ":": i3200 UE

It doesn't seem to have any other effect, though it is very annoying.

Joseph (jg-jgoettgens) wrote :

The remaining EDAC messages on my system (Asus P5BV-C with Kingston ECC RAM) appear while running RAM-intensive applications, but they do not seem to be tied to a particular application. It's probably just RAM usage. Maybe this helps in finding out what is going on.

Matt Keys (mk6032) wrote :

I get the same EDAC messages with 10.04 amd64 kernels but not with i386. edac-util -v is showing a few errors but I think this is a false reading... the box has always been rock solid.

matt@home:~$ sudo edac-util -v
mc0: 0 Uncorrected Errors with no DIMM info
mc0: 0 Corrected Errors with no DIMM info
mc0: csrow0: 75644 Uncorrected Errors
mc0: csrow0: ch0: 0 Corrected Errors
mc0: csrow0: ch1: 0 Corrected Errors
mc0: csrow2: 584 Uncorrected Errors
mc0: csrow2: ch0: 0 Corrected Errors
mc0: csrow2: ch1: 0 Corrected Errors
mc0: csrow4: 79763 Uncorrected Errors
mc0: csrow4: ch0: 0 Corrected Errors
mc0: csrow4: ch1: 0 Corrected Errors
mc0: csrow6: 1122 Uncorrected Errors
mc0: csrow6: ch0: 0 Corrected Errors
mc0: csrow6: ch1: 0 Corrected Errors
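
To compare counters like these between boots, the per-row figures can be added up with a short script. This is a sketch only: the helper name sum_ue is made up for illustration, and it assumes the "edac-util -v" output format shown in this report, where the count is the word immediately before "Uncorrected".

```shell
# sum_ue: hypothetical helper, not part of edac-utils.
# Reads `edac-util -v` output on stdin and prints the total of all
# "Uncorrected Errors" counters.
sum_ue() {
    awk '/Uncorrected Errors/ {
             # the count is the field just before the word "Uncorrected"
             for (i = 2; i <= NF; i++)
                 if ($i == "Uncorrected") total += $(i - 1)
         }
         END { print total + 0 }'
}
```

Typical use would be "sudo edac-util -v | sum_ue"; for the output above the total is 157113.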

Jeremy Foshee (jeremyfoshee) wrote :

This bug report was marked as Incomplete and has not had any updated comments for quite some time. As a result this bug is being closed. Please reopen if this is still an issue in the current Ubuntu development release http://cdimage.ubuntu.com/daily-live/current/ . Also, please be sure to provide any requested information that may have been missing. To reopen the bug, click on the current status under the Status column and change the status back to "New". Thanks.

[This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: kj-expired
Changed in linux (Ubuntu):
status: Incomplete → Expired
rew (r-e-wolff) wrote :

Just for documentation for others that might experience problems like this.....

I had similar problems: the EDAC module was falsely reporting errors for my i3200 memory controller (I think it might have to do with the memory controller itself).

I do have ECC memory installed.

I went into the BIOS and disabled "quick boot". I think this is the relevant change I made there, but we also changed a few other settings; for example, "display full screen logo" was changed to "disabled".

Now I don't get these errors anymore.

So: Possible workaround: disable quick boot in the bios.

rew (r-e-wolff) wrote :

P.S. This is in a server that has to run "LTS", so it currently runs 10.04 LTS.

gabrys (piotr-ubuntubugs) wrote :

Disabling Quick Boot solved this for me as well (the BIOS now takes 40 seconds to boot, but the ECC errors are gone).

Artsiom Shchatsko (cioma) wrote :

I had the same kind of UE messages from EDAC on a Shuttle SX58J3 (Intel X58 chipset) and on an HP RV724AV (Intel X38 chipset) under Ubuntu 10.10 x86_64 (kernel 2.6.35-23-generic), but both systems were stable.

Enabling the BIOS memory test (disabling QuickBoot) solved the issue. But EDAC should be modified to be able to handle such things itself.

Christian Brandt (brandtc) wrote :

Same problem with a P4C800-E (yeah, it's very old) after upgrading from Debian 3.whatever+Pentium4-2,6Ghz to Ubuntu 10.04+Pentium4-3,2Ghz, 4x512MB Kingston DDR1-400.

Disabling "Quick Boot" solved the problem for the time being.

Mario Manno (manno) wrote :

I'm seeing this on an HP server with Ubuntu 12.10.

I read http://www.kernel.org/doc/Documentation/edac.txt and disabled edac logging with:

   echo 0 > /sys/module/edac_core/parameters/edac_mc_log_ce
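
The messages quoted throughout this report are UE (uncorrected) rather than CE (corrected) errors, so it may be worth checking both logging switches. The sketch below assumes that your kernel's edac_core module exposes edac_mc_log_ce and edac_mc_log_ue under the directory used above; the helper name edac_log_params is made up for illustration.

```shell
# edac_log_params [DIR]: hypothetical helper.
# Prints the current value of each EDAC logging switch found in DIR
# (default /sys/module/edac_core/parameters); missing files are skipped.
edac_log_params() {
    dir=${1:-/sys/module/edac_core/parameters}
    for p in edac_mc_log_ce edac_mc_log_ue; do
        [ -r "$dir/$p" ] && printf '%s=%s\n' "$p" "$(cat "$dir/$p")"
    done
    return 0
}
```

If edac_mc_log_ue is present and set to 1, writing 0 to it as root in the same way as the command above should silence the UE lines; like rmmod, this hides the reports rather than fixing their cause, and it lasts only until the module is reloaded or the machine reboots.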
