corruption when using md raid1 as root fs on amd64

Bug #129260 reported by Joseph Fisk
Affects: linux (Ubuntu)
Status: Won't Fix
Importance: Undecided
Assigned to: Unassigned

Bug Description

Binary package hint: linux-image-2.6.15-28-amd64-server

md5sum returns inconsistent results when run repeatedly on a large text file stored on a linux software (md) raid1 volume (ext3) which is the root filesystem. md5sum works fine when the raid1 is not the root fs, or when using a single standalone disk (also ext3).

This is not the iommu issue experienced by other amd64 users: https://bugs.launchpad.net/ubuntu/+source/linux-source-2.6.15/+bug/128568

However, this system did experience the iommu issue and is currently running with iommu=soft as a workaround.

This is happening on a dual dual-core Opteron system with 8GB of registered/ECC memory running an up-to-date LTS install. The system ran memtest86 for over 75 passes with no errors.

I started a thread about this problem at http://ubuntuforums.org/showthread.php?p=3101571 , but another user suggested it was turning into a bug report. Some more background info is available there.

System specs:
Tyan Transport B2891G24S4H
Motherboard: Thunder K8SRE S2891 (nForce Professional 2200)
8x1GB of Corsair CM72SD1024RLP-3200/S (S = Samsung)
Disks: root fs on two WD3200RE using linux raid1, SATA

Tags: linux
Revision history for this message
Joseph Fisk (mdmbkr) wrote :

Using the iommu=soft boot option does not fix the problem. Curiously, though, the corruption is no longer visible in the dump file after attempting to load it with mysql: mysql still errors out, but everything looks fine in the dump file at that location. This is nuts!

Revision history for this message
Joseph Fisk (mdmbkr) wrote :

Now I'm using md5sum on a large (16GB) sql file, and seeing different results each time. I'm going to try md5sum'ing some large files created with dd if=/dev/zero to see if the problem appears. That will make it easy for others to try to reproduce.

Revision history for this message
Joseph Fisk (mdmbkr) wrote :

dd if=/dev/zero doesn't reproduce the bug. This might mean that the corruption results from data being transposed from one location in the file to another: in a zero-filled file every block is identical, so transposed data would go unnoticed. I'm now trying again with dd if=/dev/urandom; results coming in 12-16 hours.
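The zero-fill vs. random-fill comparison can be sketched as follows (a scaled-down sketch; the file names and the 64MB size are assumptions, since the original tests used multi-GB files):

```shell
# Scaled-down sketch of the two test cases: a zero-filled file and a
# random-filled file (names and sizes are assumptions).
dd if=/dev/zero of=zeros.bin bs=1M count=64 2>/dev/null
dd if=/dev/urandom of=random.bin bs=1M count=64 2>/dev/null
# Hash each file twice; two different hashes for the same file mean the
# data read back is not stable.
md5sum zeros.bin zeros.bin random.bin random.bin
rm -f zeros.bin random.bin
```

On healthy hardware the two hashes printed for each file should be identical; only the random-filled file can expose block-transposition corruption.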

Revision history for this message
Joseph Fisk (mdmbkr) wrote :

After reading this thread: http://www.nvnews.net/vbulletin/showthread.php?t=81716 , I'm beginning to think the problem I'm experiencing may be the same one krader reports. Specifically, see http://www.nvnews.net/vbulletin/showpost.php?p=1117843&postcount=46

iommu=soft doesn't fix it for me.

I also see that a patch has been committed, but I don't know how to confirm whether that patch is currently present in dapper.

Revision history for this message
Joseph Fisk (mdmbkr) wrote :

I've tried 2.6.15-50 from dapper-proposed with no luck. According to bugzilla.kernel.org, a fix was merged more than two months ago. Is there some way for me to obtain this fix through the dapper repositories?

Revision history for this message
Joseph Fisk (mdmbkr) wrote : Re: opteron amd64 data corruption

OK, I compiled 2.6.22 from kernel.org and the error still appears. I ran:

while true; do md5sum bigfile.sql; done

bigfile.sql is around 16GB. I ran that loop for about an hour, and here are the results:

cb5714b4050178096ee5d7ebade86364 bigfile.sql
f199dc20003be8f5913d38d92d0b99e3 bigfile.sql
cb5714b4050178096ee5d7ebade86364 bigfile.sql
aae3ef4d02ecbe21827d3d96f2179681 bigfile.sql
6edfeafe8a73ee07c075164dc93b319e bigfile.sql
55743d67dded508cacfb5abc0af016e5 bigfile.sql
889c5e708c1c884ea4efd11bdd1ff2c6 bigfile.sql

Any suggestions on how to determine whether this is actually a bug and not faulty hardware? This machine ran memtest for almost a week with no errors!
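A loop like the one above can be made to flag mismatches automatically. A minimal sketch (the function name is mine; pass any large file on the suspect filesystem plus a run count):

```shell
# Re-hash a file N times and count runs whose checksum differs from the
# first run's checksum.
check_stability() {
    file=$1
    runs=$2
    ref=$(md5sum "$file" | cut -d' ' -f1)
    i=1
    fails=0
    while [ "$i" -le "$runs" ]; do
        cur=$(md5sum "$file" | cut -d' ' -f1)
        if [ "$cur" != "$ref" ]; then
            echo "run $i: MISMATCH $cur (first run gave $ref)"
            fails=$((fails + 1))
        fi
        i=$((i + 1))
    done
    echo "$fails mismatching run(s) out of $runs"
}
```

Invoked as, say, `check_stability bigfile.sql 20`, a healthy system should report 0 mismatching runs; any nonzero count reproduces the corruption without eyeballing hash lists.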

Revision history for this message
Joseph Fisk (mdmbkr) wrote :

In the interest of complete clarity, here's the kernel version string:

2.6.22.1-custom #3 SMP Tue Jul 31 22:58:43 CDT 2007 x86_64 GNU/Linux

I've attached output of /proc/cpuinfo and dmesg.

Revision history for this message
Joseph Fisk (mdmbkr) wrote :

More info:

The two systems in question have the following specs:

System 1 ("murray"):
Tyan Transport B2891G24S4H
Motherboard: Thunder K8SRE S2891 (nForce Professional 2200)
8x1GB of Corsair CM72SD1024RLP-3200/S (S = Samsung)
Disks: root fs on two WD3200RE using linux raid1, SATA

System 2 ("morgan"):
Tyan Transport B2891G24S4H
Motherboard: Thunder K8SRE S2891 (nForce Professional 2200)
4x1GB of Corsair CM72SD1024RLP-3200/S
4x1GB of Corsair CM72SD1024RLP-3200/M (M = Micron)
Disks: root fs on a WD740GD, /home on a WD3000, /data on a WD2500, all SATA

morgan's issues appear to have been fixed simply by using iommu=soft.

murray still experiences corruption. I've now tested it in the following ways:

Running 2.6.15-28 amd64 SMP:
with and without iommu=soft
with and without mem=1g
with and without iommu and memhole disabled in BIOS

Running 2.6.22 amd64 SMP:
with and without iommu=soft

Running 2.6.22 amd64 uni processor:
with and without iommu=soft

No luck yet.

Revision history for this message
Joseph Fisk (mdmbkr) wrote :

iommu=soft appears to have fixed the corruption on one system, but another is still having problems. The key difference between the fixed and broken systems is that the broken system has its root filesystem on a linux (md) raid1 disk pair.

I installed a single extra drive in the broken system and ran tests using md5sum with iommu=soft, and no corruption appeared. So apparently the broken system was suffering from the iommu issue as well.

I'm unsure whether to keep this bug report open or file a new one. I could also use some advice on potential workarounds, or on ways to better isolate the problem.

Joseph Fisk (mdmbkr)
description: updated
Joseph Fisk (mdmbkr)
description: updated
Revision history for this message
Joseph Fisk (mdmbkr) wrote : Re: corruption when using md raid1 on amd64 smp

I reinstalled Ubuntu 6.06 LTS on the problem system. Now the root fs is on a single drive, and I made a separate raid1 (using md) on a pair of WD5000's, mounted on /test. So far, I haven't seen any corruption issues.
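The test setup described above can be sketched roughly as follows (device names, partition layout, and mount point are all assumptions about this particular system; these commands are destructive and would need adapting before use):

```shell
# Build a two-disk md raid1 from the pair of WD5000s (assumed here to be
# /dev/sdb1 and /dev/sdc1), put ext3 on it, and mount it at /test.
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb1 /dev/sdc1
mkfs.ext3 /dev/md0
mkdir -p /test
mount /dev/md0 /test
```

Keeping the root fs on a plain drive while exercising the mirror at /test isolates the md layer from whatever the root filesystem does at boot.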

I'll let it keep running for now, and if by tomorrow I don't see any corruption, I'll reinstall again with the root fs on raid1, as before, using the same WD5000's, and see if the problem still occurs.

Previously, the problematic raid1 used a pair of WD3200's.

Joseph Fisk (mdmbkr)
description: updated
Joseph Fisk (mdmbkr)
description: updated
Revision history for this message
seisen1 (seisen-deactivatedaccount-deactivatedaccount) wrote :

Is this still a problem for you or is this problem fixed in one of the latest releases of Ubuntu?

Revision history for this message
Joseph Fisk (mdmbkr) wrote :

At the time it didn't seem as though a solution would be forthcoming, so I abandoned the idea of a mirrored root fs.

Revision history for this message
Launchpad Janitor (janitor) wrote : This bug is now reported against the 'linux' package

Beginning with the Hardy Heron 8.04 development cycle, all open Ubuntu kernel bugs need to be reported against the "linux" kernel package. We are automatically migrating this linux-source-2.6.15 kernel bug to the new "linux" package. We appreciate your patience and understanding as we make this transition. Also, if you would be interested in testing the upcoming Intrepid Ibex 8.10 release, it is available at http://www.ubuntu.com/testing . Please let us know your results. Thanks!

Revision history for this message
Leann Ogasawara (leannogasawara) wrote :

Hi Joseph,

Per your most recent comment I'm going to close this report for now. If you are able to test with Intrepid and verify this is still an issue, please feel free to reopen this by setting the status back to "New". Thanks.

Changed in linux:
status: New → Won't Fix