EXT3-fs error (device dm-0): htree_dirblock_to_tree: bad entry in directory

Bug #403026 reported by msp3k
This bug affects 4 people
Affects: linux (Ubuntu)
Status: Expired
Importance: Undecided
Assigned to: Unassigned
Milestone: (none listed)

Bug Description

We lost storage on three of our servers to (what I believe to be) a bug in the
ext3 kernel filesystem driver. Syslog shows the following:

Jul 14 01:45:04 home kernel: [981637.615765] EXT3-fs error (device dm-0): htree_dirblock_to_tree: bad entry in directory #217677844: rec_len % 4 != 0 - offset=0, inode=3672412761, rec_len=39053, name_len=147
Jul 14 01:45:04 home kernel: [981637.653163] EXT3-fs error (device dm-0): htree_dirblock_to_tree: bad entry in directory #216096885: rec_len % 4 != 0 - offset=0, inode=1193450085, rec_len=48787, name_len=24
Jul 14 01:45:05 home kernel: [981637.821762] EXT3-fs error (device dm-0): htree_dirblock_to_tree: bad entry in directory #216105487: directory entry across blocks - offset=0, inode=884141655, rec_len=41788, name_len=189

This error repeats, growing in length as other inodes are added to the list of errors.
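
The three messages above name different sanity checks on a directory entry. As a rough illustration of what each one means, here is a minimal Python sketch that assumes an 8-byte entry header, 4-byte alignment, and a 4096-byte block size. None of those values are stated in this report, and the function names are hypothetical. It is inferred from the error strings and is not the kernel's code.

```python
# Sketch only: these checks are inferred from the error strings in the
# syslog lines above; they are not copied from the kernel source.

# Assumed on-disk header: inode (4) + rec_len (2) + name_len (1) + file_type (1)
HEADER_BYTES = 8


def dir_rec_len(name_len):
    """Smallest record that fits the header plus the name, rounded up to 4 bytes."""
    return (name_len + HEADER_BYTES + 3) & ~3


def check_dir_entry(offset, rec_len, name_len, block_size=4096):
    """Return the first failed check as a string, or None if the entry looks sane."""
    if rec_len < dir_rec_len(1):
        return "rec_len is smaller than minimal"
    if rec_len % 4 != 0:
        return "rec_len % 4 != 0"
    if rec_len < dir_rec_len(name_len):
        # Assumed wording; this message does not appear in the logs above.
        return "rec_len is too small for name_len"
    if offset + rec_len > block_size:
        return "directory entry across blocks"
    return None


# Values taken from the first and third syslog lines above.
print(check_dir_entry(offset=0, rec_len=39053, name_len=147))
print(check_dir_entry(offset=0, rec_len=41788, name_len=189))
```

Under these assumptions, a rec_len such as 39053 fails the alignment check, while 41788 is aligned but larger than a whole block. Both are consistent with the directory block holding data that is not a directory entry at all.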

Of the 8 drives formatted as ext3, the only ones affected were very large RAID arrays (12TB and 16TB in size). Three of the five RAID arrays suffered permanent damage and could not be repaired. It is unknown whether the size of the drive is a contributing factor.

Although this seems similar to a bug reported in 2007 here:
  http://marc.info/?l=linux-ext4&m=118067140512836&w=2
I was unable to reproduce the error with the original report's program and script. In fact, I have as yet been unable to reproduce the error at all.

ProblemType: Bug
Architecture: amd64
CurrentDmesg:
 [ 55.509463] NET: Registered protocol family 17
 [ 60.212556] Installing knfsd (copyright (C) 1996 <email address hidden>).
 [ 60.502550] NFSD: Using /var/lib/nfs/v4recovery as the NFSv4 state recovery directory
 [ 60.515702] NFSD: starting 90-second grace period
 [ 61.748024] NET: Registered protocol family 5
DistroRelease: Ubuntu 8.10
HalComputerInfo: Error: [Errno 2] No such file or directory
LsUsb:
 Bus 004 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 005 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
 Bus 003 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
 Bus 001 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub
Package: linux-image-2.6.27-14-server 2.6.27-14.35
ProcCmdLine: root=UUID=7eebe24c-49ee-46af-8c29-662e1574d25d ro quiet splash
ProcEnviron:
 PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games
 LANG=en_US.UTF-8
 SHELL=/bin/bash
ProcVersionSignature: Ubuntu 2.6.27-14.35-server
SourcePackage: linux

msp3k (peek-nimbios) wrote :

The RAID arrays are hardware-based, each presented to the OS as a single /dev/sdX device. LVM is used only to partition the drive (as fdisk cannot handle 16TB); ext3 sits on top of the LVM device.
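
For readers weighing whether array size matters, the arithmetic behind the usual limits can be sketched as follows. The assumptions are not confirmed anywhere in this report: 512-byte sectors with 32-bit sector numbers for an MBR partition table, 4 KiB blocks with 32-bit block numbers for ext3, and array sizes quoted in decimal terabytes. The helper name `fits` is made up for the example.

```python
# Back-of-the-envelope limits. Assumptions: 512-byte sectors for the MBR
# partition table, 4 KiB blocks and 32-bit block numbers for ext3.
TIB = 2**40

mbr_limit = 2**32 * 512    # largest size addressable by a 32-bit sector count
ext3_limit = 2**32 * 4096  # largest size addressable by a 32-bit block count


def fits(size_bytes, limit_bytes):
    """True if a device of size_bytes stays within limit_bytes."""
    return size_bytes <= limit_bytes


for tb in (12, 16):
    size = tb * 10**12  # assumes the reported sizes are decimal terabytes
    print(f"{tb} TB = {size / TIB:.2f} TiB, "
          f"MBR ok: {fits(size, mbr_limit)}, ext3 ok: {fits(size, ext3_limit)}")
```

If the sizes were instead binary (TiB), the 16 TiB array would sit exactly at the assumed ext3 ceiling. This shows only how close the arrays are to that boundary, not that the boundary caused the corruption.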

Jeremy Foshee (jeremyfoshee) wrote :

Hi msp3k,

This bug was reported a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue? Can you try with the latest development release of Ubuntu? ISO CD images are available from http://cdimage.ubuntu.com/releases/ .

If it remains an issue, could you run the following command from a Terminal (Applications->Accessories->Terminal). It will automatically gather and attach updated debug information to this report.

apport-collect -p linux 403026

Also, if you could test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Please let us know your results.

Thanks in advance.

[This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: needs-kernel-logs
tags: added: needs-upstream-testing
tags: added: kj-triage
Changed in linux (Ubuntu):
status: New → Incomplete
Marc Clemente (marc-mclemente) wrote :

Hello,

I have a similar problem. I don't know if it's exactly the same thing. Here's my scenario.

My old system was an AMD Athlon 64 3700 @ 2.2GHz with 4 GB of memory, running a Debian 2.6.32 kernel. Two SATA drives (sda and sdb) were attached to the motherboard, each with two partitions (sda1, sda2, sdb1, sdb2). I ran RAID as follows (md0 is swap, and md1 is the root partition):

# cat /proc/mdstat
Personalities : [raid1]
md0 : active (auto-read-only) raid1 sdb1[0] sda1[1]
      4000064 blocks [2/2] [UU]

md1 : active raid1 sdb2[0] sda2[1]
      240195776 blocks [2/2] [UU]

unused devices: <none>
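
The `[2/2] [UU]` markers above are what show that both mirrors were active. Here is a minimal Python sketch for checking that programmatically. It assumes the two-line-per-array layout shown in this output, and the function and field names are hypothetical.

```python
import re

# Header line, e.g. "md1 : active raid1 sdb2[0] sda2[1]"
ARRAY_RE = re.compile(r"^(md\d+) : (\w+)(?: \(([^)]*)\))? (\w+) (.*)$")
# Status line, e.g. "240195776 blocks [2/2] [UU]"
STATUS_RE = re.compile(r"\[(\d+)/(\d+)\] \[([U_]+)\]")


def parse_mdstat(text):
    """Return {array: info}; 'degraded' is True when a member slot shows '_'."""
    arrays = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = ARRAY_RE.match(line)
        if header:
            name, state, flag, level, rest = header.groups()
            current = {
                "state": state,
                "flag": flag,  # e.g. "auto-read-only", or None
                "level": level,
                "members": re.findall(r"(\w+)\[\d+\]", rest),
            }
            arrays[name] = current
            continue
        status = STATUS_RE.search(line)
        if status and current is not None:
            current["expected"] = int(status.group(1))
            current["active"] = int(status.group(2))
            current["degraded"] = "_" in status.group(3)
    return arrays


SAMPLE = """Personalities : [raid1]
md0 : active (auto-read-only) raid1 sdb1[0] sda1[1]
      4000064 blocks [2/2] [UU]

md1 : active raid1 sdb2[0] sda2[1]
      240195776 blocks [2/2] [UU]

unused devices: <none>
"""

for name, info in parse_mdstat(SAMPLE).items():
    print(name, info["level"], info["members"], "degraded" if info["degraded"] else "ok")
```

A healthy mirror says nothing about whether the data on it is intact, so this check can only rule out a dropped member.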

Everything worked fine for years. A few weeks ago, I decided to get a new motherboard, with an Intel i7 950 @ 3.07 GHz and 12 GB of memory. An easy drop-in replacement, right? When I first started the computer, everything worked. A few minutes later, I got these errors:

Mar 12 10:57:32 marc kernel: [ 2818.975238] EXT3-fs error (device md1): ext3_readdir: bad entry in directory #21532139: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0
Mar 12 10:57:32 marc kernel: [ 2818.975244] Aborting journal on device md1.
Mar 12 10:57:32 marc kernel: [ 2818.977021] ext3_abort called.
Mar 12 10:57:32 marc kernel: [ 2818.977023] EXT3-fs error (device md1): ext3_journal_start_sb: Detected aborted journal
Mar 12 10:57:32 marc kernel: [ 2818.977025] Remounting filesystem read-only
Mar 12 10:57:32 marc kernel: [ 2819.002578] Remounting filesystem read-only
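
When these messages pile up, it helps to group them by device and failure reason. Here is a minimal Python sketch that assumes the log lines keep the "bad entry in directory" shape shown in this report. The regular expression and the function names are illustrative, not part of any existing tool.

```python
import re
from collections import Counter

ERROR_RE = re.compile(
    r"EXT3-fs error \(device (?P<device>[^)]+)\): (?P<func>\w+): "
    r"bad entry in directory #(?P<dir>\d+): (?P<reason>.*?) - (?P<fields>.*)$"
)


def parse_error(line):
    """Return a dict describing one 'bad entry' line, or None for other lines."""
    match = ERROR_RE.search(line)
    if match is None:
        return None
    # Trailing "key=value" pairs such as offset=0, inode=0, rec_len=0
    fields = {k: int(v) for k, v in re.findall(r"(\w+)=(\d+)", match.group("fields"))}
    return {
        "device": match.group("device"),
        "function": match.group("func"),
        "directory": int(match.group("dir")),
        "reason": match.group("reason"),
        "fields": fields,
    }


def summarize(lines):
    """Count 'bad entry' errors per (device, reason)."""
    counts = Counter()
    for line in lines:
        entry = parse_error(line)
        if entry is not None:
            counts[(entry["device"], entry["reason"])] += 1
    return counts


if __name__ == "__main__":
    import sys
    # Example: python3 summarize_ext3.py < /var/log/syslog  (script name is hypothetical)
    for (device, reason), n in summarize(sys.stdin).most_common():
        print(f"{n:6d}  {device}  {reason}")
```

Lines that do not match, such as the journal abort and remount messages, are skipped.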

Of course, this required a reboot and an fsck, only for it to happen again a few minutes later. This is what I did to troubleshoot further:

1. It's not the memory. I ran memtest86+ for days at a time with no errors. I replaced the memory with 4 GB from a different manufacturer. Problems continued.

2. The old processor was single-core, non-hyperthreading. The new processor is quad-core, hyperthreading. I went into the BIOS and turned off hyperthreading and multi-core. Problems continued.

3. It's probably not the hard drives. I have never had any hardware errors, and they were working fine two weeks ago with the old motherboard.

4. I have forced a resync of the raid array twice. Once by removing and re-adding sda2. Another time by removing and re-adding sdb2. Problems continued.

5. I did not change the kernel when I changed the motherboard.

6. At this point it might be a Linux software RAID issue. I installed a new hard drive with a single ext3 partition (sdc1), copied the contents of the RAID array to it (cp -avx / /mnt), and rebooted from sdc, so sdc1 is now my root partition. I have not yet had any errors with the root partition on sdc1. If I mount the RAID partition (md1) and start using it, I get ext3 errors on it almost immediately.

7. I run Debian, this is an Ubuntu forum. I know.

Let me know if you need more info from me.

Marc

braincloud (katielboyle) wrote :

I am also in the middle of tons of data loss. I installed Ubuntu 9.04 on a 64-bit Supermicro system. I have a 4.1TB RAID partition.

Suddenly, at random, programs wouldn't run, and a quick dmesg showed hundreds of "htree_dirblock_to_tree: bad entry in directory" errors as well as "attempt to access beyond end of device".

Jeremy Foshee (jeremyfoshee) wrote :

Marc / braincloud,
    Would you mind opening new bugs for me on these issues? I'd like to track and have them worked separately. You can ping me on IRC to give me the bug numbers if you like.

Thanks!

~JFo

tags: added: kernel-fs
Marc Clemente (marc-mclemente) wrote : Re: [Bug 403026] Re: EXT3-fs error (device dm-0): htree_dirblock_to_tree: bad entry in directory

On 05/25/2010 09:27 AM, Jeremy Foshee wrote:
> Marc / braincloud,
> Would you mind opening new bugs for me on these issues? I'd like to track and have them worked separately. You can ping me on IRC to give me the bug numbers if you like.

Problem is... I can't reproduce the error anymore. I don't know if a
kernel upgrade fixed the problem. It does not happen with
2.6.32-3-amd64 and 2.6.32-5-amd64 (Debian kernels).

I will open a new bug if it happens again.

Marc

Jeremy Foshee (jeremyfoshee) wrote :

Marc,
    We see this often as a result of the integration of upstream stable patches. Thanks for letting me know about that, and I appreciate you opening the new bug should you see it again. :)

Thanks!

~JFo

Jeremy Foshee (jeremyfoshee) wrote :

This bug report was marked as Incomplete and has not had any updated comments for quite some time. As a result this bug is being closed. Please reopen if this is still an issue in the current Ubuntu release http://www.ubuntu.com/getubuntu/download . Also, please be sure to provide any requested information that may have been missing. To reopen the bug, click on the current status under the Status column and change the status back to "New". Thanks.

[This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: kj-expired
Changed in linux (Ubuntu):
status: Incomplete → Expired