ext4: panic working with large files

Bug #348836 reported by suecom on 2009-03-26

This bug affects 3 people

Affects	Status	Importance	Assigned to	Milestone
linux (Ubuntu)	Fix Released	High	Tim Gardner
Jaunty	Fix Released	High	Tim Gardner

Bug Description

When working on large files (> ~10GB) the file system can become fatally corrupted. The system crashes (freezes) and is then unable to reboot (Grub reports 'Error 2'). Booting from a live/recovery disk and running fsck on the corrupted filesystem yields multiple errors.

I have trashed two systems running Jaunty (Alpha 3 and Alpha 6) on an Ext4 root file system. Both times I was manipulating or using large files. The first failure occurred when I simply removed a 48GB file (the system froze), and the second when VMWare was writing to a virtual disk (a large file). Both systems had all updates installed (2.6.27-11 kernel).

I've attached a screen shot of part of the ensuing fsck. This is after all(?) the master (global?) blocks have been declared invalid. If you can't see from the picture, at this stage fsck is reporting multiply-claimed blocks (claimed by the large files in use at the time and by random smaller files).

The system was a new dual processor (Core Duo X9100) Thinkpad W500 running on a 2.5" SATA drive, 4GB core, Intel GPU.

Related branches

lp:ubuntu/karmic/linux-ports

lp:ubuntu/karmic/linux-rt

suecom (allister-nowatt) wrote on 2009-03-26:

Screen shot of the fsck (47.1 KiB, image/jpeg)

Daniele Napolitano (dnax88) on 2009-03-30

affects:	ubuntu → linux (Ubuntu)

Eric Shattow (eshattow) wrote on 2009-04-08:

This could be related to https://bugzilla.redhat.com/show_bug.cgi?id=490026 "EXT4 panic, list corruption in ext4_mb_new_inode_pa".

I occasionally experience a fatal panic when working with large amounts of data. The system hard-locks, and because I'm usually working in X11 I can't see the panic message to confirm the cause. It does sound similar to the reported issue.

Please cherrypick http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=d33a1976fbee1ee321d6f014333d8f03a39d526c to Ubuntu 2.6.28

summary:

- Ext4 file system fatel corruption
+ ext4: panic working with large files

Leann Ogasawara (leannogasawara) on 2009-04-08

Changed in linux (Ubuntu):
importance:	Undecided → High
status:	New → Triaged

Leann Ogasawara (leannogasawara) wrote on 2009-04-08:

Hi Guys,

Just wanted to also add a note that the kernel is expected to be frozen tomorrow for Jaunty's release. I've pinged the kernel team to see if they can get this pulled in time. If not, I suspect it should qualify for a Stable Release Update for Jaunty. Thanks.

Leann Ogasawara (leannogasawara) wrote on 2009-04-08:

@suecom, I also notice that in your description you say you ran Jaunty Alpha 3 and Alpha 6 with all updates installed, yet you mention a 2.6.27-11 kernel. I assume that was a typo? Jaunty has a 2.6.28-based kernel.

Leann Ogasawara (leannogasawara) wrote on 2009-04-08:

Hi Guys,

One of our kernel devs threw together a test kernel with this patch applied and uploaded it to his PPA:

https://edge.launchpad.net/~timg-tpi/+archive/ppa

The package is "linux - 2.6.28-11.42~lp348836". It is still building, but once it has finished it would be great if you could test it and report back your results. For information on how to test from a PPA, refer to https://wiki.ubuntu.com/Testing/KernelPPA, specifically the Testing Developer PPA section. Thanks.

Eric Shattow (eshattow) wrote on 2009-04-09:

I will build and test, but there is no known use case that reproduces the problem. I've hit (what I think might be) this bug maybe 5 times in 2-3 months of heavy ext4 filesystem usage. There is usually file corruption afterward. My own use case is BitTorrent, so files are checksummed and lost data is thrown out. I don't know whether any user behavior would reproduce the bug described by the original poster more quickly.

Tim Gardner (timg-tpi) wrote on 2009-04-09:

Bah - the PPA is having problems so I built locally and stashed test kernels at http://kernel.ubuntu.com/~rtg/2.6.28-lp348836

Changed in linux (Ubuntu):
assignee:	nobody → Tim Gardner (timg-tpi)
status:	Triaged → In Progress

Eric Shattow (eshattow) wrote on 2009-04-12:

No noticeable ext4-related problems with 2.6.28-11-generic #42~lp348836 SMP. I do not know if the OP's bug is fixed, only that ~lp348836 is working okay.

Tim Gardner (timg-tpi) wrote on 2009-04-13:

@Eric - thanks for your response. I'll add this as an SRU request for the first upload after release.

Launchpad Janitor (janitor) wrote on 2009-04-17:

#10

This bug was fixed in the package linux - 2.6.28-11.42

---------------
linux (2.6.28-11.42) jaunty; urgency=low

  [ Tim Gardner ]

  * Enabled LPIA CONFIG_PACKET=y
    - LP: #362071

  [ Upstream Kernel Changes ]

  * ext4: fix bb_prealloc_list corruption due to wrong group locking
    - LP: #348836

 -- Stefan Bader <email address hidden>  Thu, 16 Apr 2009 08:10:55 +0200

Changed in linux (Ubuntu Jaunty):
status:	In Progress → Fix Released

ArbitraryConstant (anthony-spamtrap) wrote on 2009-04-26:

#11

I am running kernel 2.6.28-11.42 generic amd64. I'm still able to crash my system with large files on ext4.

I used the following script to reproduce this:

while true; do dd if=/dev/zero of=zero bs=1M count=102400; dd if=zero of=/dev/null bs=1M; rm zero; done

The underlying device is an LVM on a VG that spans two disks.
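For anyone who wants to try the same write/read/delete cycle without leaving an endless loop running, here is a rough bounded sketch of it. The function name stress_large_file and its three arguments are made up for illustration, and it assumes GNU dd (for bs=1M). It is not the script that was actually run above, just a parameterized variant of it.

```shell
#!/bin/sh
# Bounded variant of the write/read/delete loop (illustrative sketch).
# stress_large_file and its arguments are hypothetical names.
# Usage: stress_large_file DIR SIZE_MB ITERATIONS
stress_large_file() {
    dir=$1
    size_mb=$2
    iterations=$3
    file="$dir/zero"
    i=1
    while [ "$i" -le "$iterations" ]; do
        # Write SIZE_MB MiB of zeros, read the file back, then delete it.
        dd if=/dev/zero of="$file" bs=1M count="$size_mb" 2>/dev/null || return 1
        dd if="$file" of=/dev/null bs=1M 2>/dev/null || return 1
        rm -f "$file" || return 1
        echo "pass $i of $iterations completed"
        i=$((i + 1))
    done
}

# Example (writes 100 GiB per pass, like the one-liner above):
#   stress_large_file /mnt/ext4test 102400 5
```

The mount point /mnt/ext4test in the example is only a placeholder; only run this on a scratch filesystem, since a crash may corrupt it.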

Changed in linux (Ubuntu Jaunty):
status:	Fix Released → New

ArbitraryConstant (anthony-spamtrap) wrote on 2009-04-26:

#12

I noticed some other stuff:

[ 371.568931] EXT4-fs: barriers enabled
[ 371.569257] kjournald2 starting. Commit interval 5 seconds
[ 371.569824] EXT4 FS on dm-1, internal journal on dm-1:8
[ 371.569828] EXT4-fs: delayed allocation enabled
[ 371.569831] EXT4-fs: file extents enabled
[ 371.571151] EXT4-fs: mballoc enabled
[ 371.571157] EXT4-fs: mounted filesystem with ordered data mode.
[ 379.816940] JBD: barrier-based sync failed on dm-1:8 - disabling barriers

Barriers seem to be disabled.
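A quick way to check for the same message on another machine might look like the sketch below. The helper name barriers_disabled is hypothetical, and the pattern is based only on the JBD line quoted above, so other kernels may word it differently.

```shell
#!/bin/sh
# barriers_disabled: hypothetical helper. Reads kernel log text on stdin and
# succeeds (exit 0) if a "barrier-based sync failed ... disabling barriers"
# line like the one quoted above is present.
barriers_disabled() {
    grep -q 'barrier-based sync failed.*disabling barriers'
}

# Example (reading the kernel log may require root on some systems):
#   dmesg | barriers_disabled && echo "barriers were disabled"
```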

$ sudo lvdisplay --maps bulk/testvol
  --- Logical volume ---
  LV Name /dev/bulk/testvol
  VG Name bulk
  LV UUID q3e1GQ-zqDS-c30Z-jneb-IbEu-GijR-zOwdL5
  LV Write Access read/write
  LV Status available
  # open 0
  LV Size 125.00 GB
  Current LE 32000
  Segments 1
  Allocation inherit
  Read ahead sectors auto
  - currently set to 256
  Block device 252:1

  --- Segments ---
  Logical extent 0 to 31999:
    Type linear
    Physical volume /dev/sda1
    Physical extents 102333 to 134332

The volume isn't spread across both disks.
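To confirm this kind of thing without reading the segment list by eye, something like the following sketch can count the distinct physical volumes in the --maps output. The helper name count_pvs is hypothetical, and it assumes the "Physical volume <device>" line layout shown above.

```shell
#!/bin/sh
# count_pvs: hypothetical helper. Reads `lvdisplay --maps` output on stdin
# and prints how many distinct physical volumes back the logical volume.
count_pvs() {
    awk '$1 == "Physical" && $2 == "volume" { seen[$3] = 1 }
         END { n = 0; for (pv in seen) n++; print n }'
}

# Example:
#   sudo lvdisplay --maps bulk/testvol | count_pvs
```

A result of 1 means every segment sits on a single physical volume, as in the output above.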

The same script running on a comparable ext3 filesystem, on the same disk, on the same machine, has had no problems.

Leann Ogasawara (leannogasawara) wrote on 2009-04-28:

#13

@ArbitraryConstant, it would be better if you opened a new bug for the issue you are seeing - https://wiki.ubuntu.com/KernelTeam/KernelTeamBugPolicies . The patch that was applied and uploaded here apparently didn't fix your issue, so it will likely require a different patch and thus warrants a new bug report. Thanks in advance.