kernel bug corrupts filesystem on heavy parallel I/O

Bug #337246 reported by ddi
This bug affects 1 person
Affects          Status         Importance   Assigned to   Milestone
Linux            Fix Released   Unknown
linux (Ubuntu)   Fix Released   Medium       Unassigned

Bug Description

Distro is Ubuntu Server 8.10; the running kernel image is 2.6.27-9-server SMP on x86_64.

(Installed a couple of months ago and never updated, since the box has no internet access by default.)

On one box with particularly heavy database I/O, we're seeing this in the system log:

==================
[1099257.456522] EXT4-fs error (device sdb1): ext4_ext_search_right: bad header in inode #2621457: unexpected eh_depth - magic f30a, entries 340, max 340(0), depth 1(2)
[1099257.495979] EXT4-fs error (device sdb1): ext4_ext_search_right: bad header in inode #2621457: unexpected eh_depth - magic f30a, entries 328, max 340(0), depth 1(2)
[1099257.505934] EXT4-fs error (device sdb1): ext4_ext_search_right: bad header in inode #2621457: unexpected eh_depth - magic f30a, entries 340, max 340(0), depth 1(2)
==================

... ad infinitum.
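
For anyone who wants to count or group these messages, here is a minimal parsing sketch. It assumes the layout of the log lines above, and it assumes that in "max 340(0)" and "depth 1(2)" the first number is the value read from the extent header and the number in parentheses is the value the check expected. The field names (eh_max, expected_max, eh_depth, expected_depth) are labels chosen for this example, not taken from the report.

```python
import re

# Pattern for the "bad header" message shown in the log excerpt.
# Group names are illustrative labels, not kernel identifiers.
BAD_HEADER = re.compile(
    r"EXT4-fs error \(device (?P<device>\S+)\): (?P<function>\w+): "
    r"bad header in inode #(?P<inode>\d+): (?P<reason>.+?) - "
    r"magic (?P<magic>[0-9a-f]+), entries (?P<entries>\d+), "
    r"max (?P<eh_max>\d+)\((?P<expected_max>\d+)\), "
    r"depth (?P<eh_depth>\d+)\((?P<expected_depth>\d+)\)"
)

NUMERIC_FIELDS = (
    "inode", "entries", "eh_max", "expected_max", "eh_depth", "expected_depth",
)


def parse_bad_header(line):
    """Return a dict of fields for a matching log line, or None otherwise."""
    match = BAD_HEADER.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in NUMERIC_FIELDS:
        fields[key] = int(fields[key])
    return fields


sample = (
    "[1099257.456522] EXT4-fs error (device sdb1): ext4_ext_search_right: "
    "bad header in inode #2621457: unexpected eh_depth - magic f30a, "
    "entries 340, max 340(0), depth 1(2)"
)
record = parse_bad_header(sample)
print(record)
```

Feeding every line of the system log through parse_bad_header() and collecting the distinct inode values would show whether a single file is affected, as the excerpt suggests, or several.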

Sounds a lot like a problem reported against 2.6.23, see:
http://markmail.org/search/?q=ext4_ext_search_right+depth+%22bad+header+in+inode%22#query:ext4_ext_search_right%20depth%20%22bad%20header%20in%20inode%22+page:1+mid:cuek4hlduagxee5c+state:results

I couldn't find out if or when the discussed new-extent-function.patch was merged into the mainline kernel, since git.kernel.org has no search function and also blocks search engines such as Google (via "User-agent: * Disallow: /" in robots.txt).

Here's dumpe2fs for the filesystem:
===================
dumpe2fs 1.41.3 (12-Oct-2008)
Last mounted on: <not available>
Filesystem UUID: 7adcc7a6-5dd4-4fd5-b988-c92f9429a06c
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr resize_inode dir_index filetype needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg dir_nlink extra_isize
Filesystem flags: signed_directory_hash
Default mount options: (none)
Filesystem state: clean with errors
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 90177536
Block count: 360683091
Reserved block count: 18034154
Free blocks: 116980051
Free inodes: 90170042
First block: 0
Block size: 4096
Fragment size: 4096
Reserved GDT blocks: 938
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 8192
Inode blocks per group: 512
Flex block group size: 16
Filesystem created: Tue Dec 2 17:23:58 2008
Last mount time: Wed Feb 18 21:44:25 2009
Last write time: Tue Mar 3 15:19:14 2009
Mount count: 5
Maximum mount count: 30
Last checked: Tue Dec 2 17:23:58 2008
Check interval: 15552000 (6 months)
Next check after: Sun May 31 18:23:58 2009
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 256
Required extra isize: 28
Desired extra isize: 28
Journal inode: 8
Default directory hash: half_md4
Directory Hash Seed: 2180b9fc-09e3-445e-8170-302178b2eadd
Journal backup: inode blocks
Journal size: 128M
========================
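
As a rough reading aid for the dump above, the following sketch turns a few of the "Key: value" lines into sizes. The helper name parse_dumpe2fs_header is hypothetical, and only four lines of the dump are reproduced here to keep the example short.

```python
def parse_dumpe2fs_header(text):
    """Split 'Key: value' lines into a dict; lines without a colon are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


# Values copied from the dumpe2fs output in this report.
header = """\
Block count: 360683091
Reserved block count: 18034154
Free blocks: 116980051
Block size: 4096
"""

fields = parse_dumpe2fs_header(header)
block_size = int(fields["Block size"])
block_count = int(fields["Block count"])

total_bytes = block_count * block_size
free_fraction = int(fields["Free blocks"]) / block_count
reserved_fraction = int(fields["Reserved block count"]) / block_count

print(f"total:    {total_bytes / 2**40:.2f} TiB")
print(f"free:     {free_fraction:.1%}")
print(f"reserved: {reserved_fraction:.1%}")
```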

Has the bugfix in the above email been applied to the Ubuntu Server kernel images? If not, then that's probably the root of this issue. *hope* ;-).

Changed in linux:
status: Unknown → Confirmed
Pete Graner (pgraner) wrote :

Saw this on the ext4 list, where it was pointed out that Ubuntu was being "unresponsive" on the bug after only two days, without any importance set. I've linked it with the upstream bug and will work it there.

Leann Ogasawara (leannogasawara) wrote :

Hi ddi,

I'm going to reassign this to the "linux (Ubuntu)" kernel package rather than the "linux-meta (Ubuntu)" package. The "linux (Ubuntu)" package is the actual Ubuntu kernel source package and is monitored more closely for incoming bugs. This is likely why this bug was initially overlooked, and I apologize for that. In the future, feel free to drop into the #ubuntu-kernel IRC channel on FreeNode if you feel a kernel bug needs to be looked at immediately. I do appreciate that you've followed up with upstream as well. Ted appears to be responsive in the upstream bug report and is the best point of contact for ext4 issues.

It sounds like you're going to do a few additional tests as mentioned in the upstream bug report. Additionally, I wanted to let you know that Ubuntu has started packaging the upstream mainline kernel, in case you want to test the 2.6.29-rc7 upstream kernel. Where to find these upstream kernel builds and how to install them is documented at https://wiki.ubuntu.com/KernelMainlineBuilds . Hopefully that can help. Thanks.

Changed in linux-meta:
importance: Undecided → Medium
status: New → Triaged
Andy Whitcroft (apw) wrote :

This is being tracked upstream in the linked kernel bugzilla report. Upstream has requested an fsck of the filesystem with the latest e2fsprogs installed. Supplied links to Ubuntu .debs for the updated tools.

ddi (ddi-dubex) wrote :

If cases regularly 'fall through the cracks', then there must be a bigger inherent problem in the processes used, regardless of any minor technical issues like the "meta" designation.

Anyway, no problem - 'you get what you pay for'! ;-P

The vanilla kernel package and matching e2fsprogs sounds great. Are there other low level utilities that should be matched up to the new kernel before I install, like for instance coreutils and the initramfs generator?

Is there a corresponding source .deb for the vanilla kernel, so I can apply Eric's depth-fix patch on top, before make and make install?

Will the updated e2fsprogs and e2fslibs .debs automatically be overwritten with packages from the mainline repository, once mainline catches up to Jaunty versions and beyond?

And, going to a slightly larger scope, is there an Ubuntu package for open-vm-tools, so I can install new kernels (and enable auto-update on servers for that matter!) without the network connection breaking, performance dropping etc due to missing net/block drivers?

Changed in linux:
status: Confirmed → In Progress
Changed in linux:
status: In Progress → Fix Released
TJ (tj)
Changed in linux (Ubuntu):
assignee: nobody → intuitivenipple
status: Triaged → In Progress
TJ (tj) wrote :

Bug #337246
http://launchpad.net/bugs/337246
"kernel bug corrupts filesystem on heavy parallel I/O"

Upstream has received a patch that fixes this issue via

http://bugzilla.kernel.org/show_bug.cgi?id=12821

I've confirmed it applies cleanly to Jaunty.

Please cherry-pick commit 395a87bfefbc400011417e9eaae33169f9f036c0

Author: Eric Sandeen <email address hidden>
Date: Tue Mar 10 18:18:47 2009 -0400

ext4: fix header check in ext4_ext_search_right() for deep extent trees.

The ext4_ext_search_right() function is confusing; it uses a
"depth" variable which is 0 at the root and maximum at the leaves,
but the on-disk metadata uses a "depth" (actually eh_depth) which
is opposite: maximum at the root, and 0 at the leaves.

The ext4_ext_check_header() function is given a depth and checks
the header against that depth; it expects the on-disk semantics,
but we are giving it the opposite in the while loop in this
function. We should be giving it the on-disk notion of "depth"
which we can get from (p_depth - depth) - and if you look, the last
(more commonly hit) call to ext4_ext_check_header() does just this.

Sending in the wrong depth results in (incorrect) messages
about corruption:

EXT4-fs error (device sdb1): ext4_ext_search_right: bad header
in inode #2621457: unexpected eh_depth - magic f30a, entries 340,
max 340(0), depth 1(2)

http://bugzilla.kernel.org/show_bug.cgi?id=12821

Reported-by: David Dindorp <email address hidden>
Signed-off-by: Eric Sandeen <email address hidden>
Signed-off-by: "Theodore Ts'o" <email address hidden>

Committed via bug #346194
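
The depth mix-up described in the commit message can be illustrated with a toy model. This is a sketch under stated assumptions, not the kernel code: on_disk_depth and header_ok are hypothetical helpers, tree_depth plays the role of p_depth, and level is the walk-order "depth" that is 0 at the root.

```python
def on_disk_depth(tree_depth, level):
    """Map a walk level (0 at the root) to the on-disk eh_depth (0 at the leaves)."""
    if not 0 <= level <= tree_depth:
        raise ValueError("level must lie between 0 and tree_depth")
    return tree_depth - level


def header_ok(eh_depth, expected_depth):
    """Toy stand-in for the header check: compare stored and expected depth."""
    return eh_depth == expected_depth


tree_depth = 3  # hypothetical tree: root at level 0, leaves at level 3
levels = range(tree_depth + 1)

# eh_depth as it would be stored in the header at each level, root first.
stored = [on_disk_depth(tree_depth, level) for level in levels]

# Passing the walk level straight through (the mistake described above).
unconverted = [header_ok(stored[level], level) for level in levels]

# Passing (tree_depth - level), the on-disk notion of depth.
converted = [header_ok(stored[level], tree_depth - level) for level in levels]

print("stored eh_depth:", stored)
print("unconverted check passes:", unconverted)
print("converted check passes:", converted)
```

In this toy tree, level 2 stores eh_depth 1 while the unconverted check expects 2. If the logged "depth 1(2)" reads as "found 1, expected 2", that is the same pair of numbers, although the sketch makes no claim about the actual tree on the affected filesystem.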

Changed in linux (Ubuntu):
assignee: intuitivenipple → nobody
milestone: none → ubuntu-9.04-beta
status: In Progress → Fix Committed
Leann Ogasawara (leannogasawara) wrote :

I'm marking this as a duplicate of bug 346194, as the patch referenced here to fix this bug has already been cherry-picked via that bug. Thanks.

Changed in linux (Ubuntu):
status: Fix Committed → Fix Released
ddi (ddi-dubex) wrote :

Looked at bug 346194.

It's just a snippet of this issue (the active upstream part on kernel.org) along with the patch itself, and a note saying that it was committed.

Why on earth create a whole new issue for that, instead of just putting the commit message in this issue?
