Comment 5 for bug 403026

Marc Clemente (marc-mclemente) wrote :

Hello,

I have a similar problem. I don't know if it's exactly the same thing. Here's my scenario.

I had a motherboard with an AMD Athlon 64 3700 @ 2.2 GHz and 4 GB of memory, running Debian with a 2.6.32 kernel. Two SATA drives (sda and sdb) were hooked up to the motherboard, each with two partitions (sda1, sda2, sdb1, sdb2). I ran RAID 1 as follows (md0 is swap, and md1 is the root partition):

# cat /proc/mdstat
Personalities : [raid1]
md0 : active (auto-read-only) raid1 sdb1[0] sda1[1]
      4000064 blocks [2/2] [UU]

md1 : active raid1 sdb2[0] sda2[1]
      240195776 blocks [2/2] [UU]

unused devices: <none>
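
In the output above, [UU] means both members of each mirror are up; an underscore in that field (for example [U_]) would mark a missing or failed member. A minimal sketch of checking this from a script, assuming the mdstat layout shown above (the function name degraded_arrays is hypothetical, not part of any tool):

```shell
#!/bin/sh
# Hypothetical helper: print the name of every md array whose status field
# (e.g. [UU]) contains "_", which marks a missing or failed member.
# Reads /proc/mdstat by default, or a file given as the first argument.
degraded_arrays() {
    awk '
        # Remember the array name from lines such as "md1 : active raid1 ..."
        /^md[0-9]+ :/ { name = $1 }
        # The status field only ever contains U and _ between the brackets
        match($0, /\[[U_]+\]/) {
            status = substr($0, RSTART, RLENGTH)
            if (index(status, "_") > 0 && name != "") print name
        }
    ' "${1:-/proc/mdstat}"
}
```

On my system this prints nothing, since both arrays report [UU]. In other words the arrays look healthy to md even while the filesystem on md1 is reporting errors.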

Everything worked fine for years. A few weeks ago, I decided to get a new motherboard, with an Intel i7 950 @ 3.07 GHz and 12 GB memory. An easy drop-in replacement, right? When I first started the computer, everything worked. A few minutes later, I got these errors:

Mar 12 10:57:32 marc kernel: [ 2818.975238] EXT3-fs error (device md1): ext3_readdir: bad entry in directory #21532139: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0
Mar 12 10:57:32 marc kernel: [ 2818.975244] Aborting journal on device md1.
Mar 12 10:57:32 marc kernel: [ 2818.977021] ext3_abort called.
Mar 12 10:57:32 marc kernel: [ 2818.977023] EXT3-fs error (device md1): ext3_journal_start_sb: Detected aborted journal
Mar 12 10:57:32 marc kernel: [ 2818.977025] Remounting filesystem read-only
Mar 12 10:57:32 marc kernel: [ 2819.002578] Remounting filesystem read-only

Of course, this required a reboot and an fsck, only for the same errors to come back a few minutes later. This is what I did to troubleshoot further:

1. It's not the memory. I ran memtest86+ for days at a time with no errors. I replaced the memory with 4 GB from a different manufacturer. Problems continued.

2. The old processor was single-core, non-hyperthreading. The new processor is quad-core, hyperthreading. I went into the BIOS and turned off hyperthreading and multi-core. Problems continued.

3. It's probably not the hard drives. I have never had any hardware errors on them, and they were working fine two weeks ago with the old motherboard.

4. I have forced a resync of the raid array twice. Once by removing and re-adding sda2. Another time by removing and re-adding sdb2. Problems continued.

5. I did not change the kernel when I changed the motherboard.

6. At this point I suspected a Linux software RAID issue. I installed a new hard drive with a single ext3 partition (sdc1), copied the contents of the RAID array to it (cp -avx / /mnt), and rebooted the computer from sdc, so sdc1 is now my root partition. I have not yet had any errors with the root partition on sdc1. If I mount the RAID partition (md1) and start using it, I get ext3 errors on it almost immediately.

7. I run Debian, this is an Ubuntu forum. I know.
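
For reference, the forced resync in step 4 can be sketched roughly as below. The mdadm options (--fail, --remove, --add) are standard manage-mode options, but the wrapper function resync_member and its DRY_RUN switch are hypothetical and only there so the commands can be previewed before anything touches the array. This is a sketch, not a transcript of exactly what I typed.

```shell
#!/bin/sh
# Hypothetical wrapper around the remove/re-add cycle from step 4.
# By default (DRY_RUN=1) it only prints the mdadm commands; set DRY_RUN=0
# and run as root to actually execute them.
resync_member() {
    array="/dev/$1"     # e.g. md1
    member="/dev/$2"    # e.g. sda2
    run=""
    [ "${DRY_RUN:-1}" = "1" ] && run="echo"
    # Mark the member as failed and remove it from the array...
    $run mdadm "$array" --fail "$member" --remove "$member"
    # ...then add it back, which normally starts a rebuild from the other member.
    $run mdadm "$array" --add "$member"
}

# Preview the commands for the first resync (sda2 out of md1)
resync_member md1 sda2
```

Rebuild progress shows up in /proc/mdstat. The second resync in step 4 would be the same call with sdb2 in place of sda2.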

Let me know if you need more info from me.

Marc