Bug #330824 “Soft lockups (freezes) when deleting files from ext...” : Jaunty (9.04) : Bugs : linux package : Ubuntu

Revision history for this message

Martin Vysny (vyzivus) wrote on 2009-02-27:

#1

I have exactly the same problem with 2.6.28-8.26. The problem started to appear only recently (2-4 days ago). It manifests only when deleting files; it never triggers when adding files. It occurs regardless of whether X is running. Interestingly, the problem occurs on the 32-bit kernel only: 64-bit 2.6.28-8.26 does not seem to be affected.

Michał Zając (quintasan) wrote on 2009-02-28:

#2

I've encountered it more than 7 times (3 times today).
The first time it happened was while moving my /home (4GB) to /mnt/Data; I had to reset the computer and lost some data (not very important, thankfully). Today I tried to clean the pbuilder environment with "ARCH=amd64 DIST=jaunty sudo pbuilder --clean", and after restarting, my .kde directory was gone.

It seems the freeze occurs when moving or deleting big portions of data. Can anyone else confirm it?

Linux nightwalker 2.6.28-8-generic #26-Ubuntu SMP Wed Feb 25 04:27:53 UTC 2009 x86_64 GNU/Linux

dnyaga (daniel-nyaga) wrote on 2009-03-14:

#3

I had experienced the same, and reported it at https://bugs.launchpad.net/ubuntu/+bug/334581. I have had to hard reset my computer four times today.

The circumstances were the same all 4 times: I was copying large directories between different ext4 partitions (using nautilus) when the system locked up. The directories in question have tens of thousands of small sized files. I am going to mark bug 334581 as a duplicate of this one so that we can focus our discussion and testing on one bug report.

dnyaga (daniel-nyaga) wrote on 2009-03-14:

#4

There is an alarming frequency of kernel freezes when working with directories that have lots of tiny files: see https://bugs.launchpad.net/ubuntu/+source/subversion/+bug/342164. That bug reporter's system froze and was hard reset, and ext4 had not written the newest file to disk.

Question: what is causing all these freezes?

Changed in linux:
status:	New → Confirmed

Agent N2O (agentn2o) wrote on 2009-03-15:

#5

I have experienced something similar to the first poster: I installed Ubuntu 9.04 alpha 5 last week on a newly formatted ext4 partition. As I was setting the system up, I was updating the system with the latest package updates but I kept running into an error saying the drive was full (it was actually at 20% full of 160 GB). I tried moving and deleting files off the drive, but nothing worked. Eventually a reboot solved this but I don't know why.

I upgraded to kernel 2.6.28-9 (alpha 6) on Friday. This weekend I went about converting 2 x 1 TB data drives to ext4 (from ext3) and all went well initially, but I wanted to get the full extent (no pun intended) of the ext4 file structure, so I was cutting and pasting data back and forth between the drives using nautilus, and the OS kept freezing. Eventually I figured out that copying and pasting was fine but deleting was the culprit. I tried deleting in nautilus and that hung the OS. I tried in a terminal, same thing. I then booted into recovery mode, dropped down to the root prompt, went about deleting these files, and got a series of these: "BUG: soft locking - CPU#0 stuck for 61s!"

In the end I managed to completely clear one drive off so I reformatted it and then transferred everything back and then reformatted the other. Now both TB drives have "native" ext4 partitions and I can delete from those drives without hangs or freezes.

dnyaga (daniel-nyaga) wrote on 2009-03-15:

#6

From Agent N2O's comments above, it appears that the freezing occurs where ext3 partitions were converted to ext4 partitions. I have 3 converted ext4 partitions and one fresh/new one. I will try to test that theory a little tonight.

To the other reporters: were your ext4 partitions new or converted?

dnyaga (daniel-nyaga) wrote on 2009-03-15:

#7

Same behavior independently reported here: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/340628

The reporter of bug 340628 provided a stack trace.

Agent N2O (agentn2o) wrote on 2009-03-15:

#8

Forgot to mention that the freezing didn't happen ALL the time with the ext4 deletes. When I was cutting and pasting it would get a few mins in and freeze, and elsewise I was able to delete some files but others made it freeze. I suspect it may have been LARGE files but I do not have solid proof of that.

Brian J. Murrell (brian-interlinx) wrote on 2009-03-15: Re: [Bug 330824] Re: ext4 or 2.6.28 is completely freeze my system

#9

On Sun, 2009-03-15 at 20:23 +0000, Agent N2O wrote:
> Forgot to mention that the freezing didn't happen ALL the time with the
> ext4 deletes. When I was cutting and pasting it would get a few mins in
> and freeze, and elsewise I was able to delete some files but others made
> it freeze. I suspect it may have been LARGE files but I do not have
> solid proof of that.

In my bug, 340628, duped to this bug, the cause was almost certainly a
race. I had multiple deletes going on in the filesystem at the same
time.

FWIW, this is not a problem with 2.6.27-12 from Intrepid which I am
currently using with Jaunty due to this issue.

Agent N2O (agentn2o) wrote on 2009-03-15:

#10

It looks like the converted vs native ext4 filesystem info I gave earlier was a RED HERRING! I just got another system freeze deleting files off my EXT4 partition that I had reformatted (using mkfs.ext4) yesterday. I have just dropped down to a root shell on the recovery mode to see if I can figure out which specific file (size?, type?) causes problems.

Agent N2O (agentn2o) wrote on 2009-03-15:

#11

Well, I could not reproduce the latest system freeze. Certainly the frequency of the system freezing from EXT4 deletes is much, much lower on this new native EXT4 partition as opposed to the converted version. I am going to do some more spring cleaning to see if it will freeze up again.

Agent N2O (agentn2o) wrote on 2009-03-16:

#12

3 more nautilus delete freezes to report (all from same "native" EXT4 partition):

1. deleted a folder with a bunch of video files totalling 7.5 GB
2. deleted 16 folders and files totalling 1.1 GB
3. deleted 6 folders and files totalling 1.6 GB

In all 3 cases it froze immediately after I said yes to the "are you sure" prompt, and I was able to carry out the exact same delete after the reboot without issue.

dnyaga (daniel-nyaga) wrote on 2009-03-16:

#13

The freezes I initially reported occurred when I was moving large folders between ext4 partitions (moves between partitions involve deletes). When I am doing this kind of re-organizing, I usually have several move operations going on concurrently. Could it be that the bug is triggered more easily when there are multiple delete/move operations going on concurrently?

Last night I moved 70GB of data between 2 ext4 partitions. All 70 GB was moved in one sequential operation. The computer did not freeze. I dropped one of the ext4 partitions, re-created it, then moved the data back. The machine still did not freeze.

This evening I will "manufacture" some data that I can afford to lose then move it helter skelter between several ext4 partitions, making sure that there is a large number of moves active at any particular time.

dnyaga (daniel-nyaga) wrote on 2009-03-16:

#14

Just had another freeze. I was deleting a virtual machine snapshot (a relatively large file). When I rebooted, I was able to finish deleting.

Pauli Virtanen (pauli-virtanen) wrote on 2009-03-18:

#15

Photo of a stack trace from SysRq+L (28.9 KiB, image/png)

Confirm that similar regular freezing occurs only on my machine, with ext4 FS converted from ext3. Typically the freeze occurs under high disk activity; I believe when the freeze has happened, I have had a rsync job traversing whole /home, which contains a large number of small files.

I managed to get a SysRq+L stack trace, when the freeze occurred. (Photo attached; the machine was unresponsive, so can't attach it as text.) The trace is quite similar to that reported in https://bugs.launchpad.net/ubuntu/+source/linux/+bug/340628
It might be of note that the system did not respond to SysRq commands at first, but only after a few minutes.

These freezes occur very frequently, typically within a few hours of uptime. This bug severely affects viability of using ext4 partitions (if the problem really has to do with ext4).

Probably unrelated information: freezes occur both with the non-free Nvidia driver and the free Xorg nv driver.

yaztromo (tromo) wrote on 2009-03-23:

#16

Posting to confirm same bug. Happens when emptying lots of files from the recycle bin, or doing a big rm -r *

Message is something like "BUG: soft locking - CPU#0 stuck for 61s!"

Xubuntu 9.04 and ext4 file system

yaztromo (tromo) wrote on 2009-03-23:

#17

I should add that my file system is new and not a convert from ext3.

davidnottingham (david-hill-home) wrote on 2009-03-25:

#18

I have experienced this on a daily basis, whenever I try to empty the Trash folder. There are several large files in the Trash. This is on an x86_64 system running Ubuntu, and as mentioned above, it happens both under GNOME and via the command line (using rm -rf).

Xavier Fung (xavier114fch) wrote on 2009-03-26:

#19

Same thing happened to me when I use kdesvn-build to build KDE SVN. Usually it truncates the .svn/entries file just like what has been reported before:

kde-devel@xavier:~$ cd kdesvn/kdesupport
kde-devel@xavier:~/kdesvn/kdesupport$ svn up
svn: Working copy '.' locked
svn: run 'svn cleanup' to remove locks (type 'svn help cleanup' for details)
kde-devel@xavier:~/kdesvn/kdesupport$ svn cleanup
svn: Can't read file 'soprano/includes/Error/.svn/entries': End of file found

A whole-system lockup is the end result, and it needs a hard reset.

Eric Sandeen (sandeen-ubuntu) wrote on 2009-03-26:

#20

When it freezes, attaching the output of sysrq-w, either via

# echo w > /proc/sysrq-trigger
# dmesg > dmesg.txt

or doing the keyboard combination, would probably be helpful for getting to the bottom of what appears to be a deadlock.
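
For anyone who would rather script the capture, the two commands above could be wrapped roughly as follows. This is only a sketch: the function name `capture_sysrq_w` and its parameters are made up for illustration, the real trigger path needs root, and it only helps if the machine is still responsive enough to run a process at all.

```python
import subprocess
from pathlib import Path


def capture_sysrq_w(out_path="dmesg.txt",
                    trigger_path="/proc/sysrq-trigger",
                    log_cmd=("dmesg",)):
    """Ask the kernel to dump blocked tasks, then save the kernel log.

    trigger_path and log_cmd are parameters only so the sketch can be
    tried against stand-ins; the defaults mirror the shell commands.
    """
    # Equivalent of: echo w > /proc/sysrq-trigger
    Path(trigger_path).write_text("w")
    # Equivalent of: dmesg > dmesg.txt
    result = subprocess.run(list(log_cmd), capture_output=True,
                            text=True, check=True)
    Path(out_path).write_text(result.stdout)
    return Path(out_path)
```

Run as root with the defaults, `capture_sysrq_w()` should leave the kernel log, including the sysrq-w output, in dmesg.txt in the current directory.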

Andrius Štikonas (stikonas) wrote on 2009-03-27:

#21

Vanilla kernel 2.6.29-rc8 works well for me. So either this problem was fixed in kernel 2.6.29-rc8, or the problem is caused by Ubuntu kernel patches.

yaztromo (tromo) wrote on 2009-03-27:

#22

Reproducing this bug to get a trace corrupted my system so badly not even a ubuntu jaunty CD will boot without locking the system hard. I'm now stuck on my laptop since I have no way to rescue!

Theodore Ts'o (tytso) wrote on 2009-03-27:

#23

ext4: fix locking typo in mballoc which could cause soft lockup hangs (1.4 KiB, text/plain)

I'm not sure this patch will fix the problem (since I haven't been able to reproduce it yet), but it is at least plausible that this reported "brown paper bag" bug might be responsible for this failure mode.

I've also had one person (irc handle SuperSquirrel) tell us on ext4 that when he went to a stock 2.6.29 kernel, he could no longer reproduce the problem which he could reproduce reliably before. If this is true, then the patch I've attached may not be the solution, and it may be caused by something else in the Ubuntu specific kernel. (Although there was one person who reported a problem very similar to the one reported here on the linux-ext4 list that I don't think was using an Ubuntu kernel, so I'm not sure what to make of this "I went to stock 2.6.29 and it went away" report.)

The patch which I've attached fixes a real bug, and it will be headed to the stable kernel series as soon as it gets accepted upstream, and I'd strongly encourage Ubuntu to pick up this patch. Whether this patch fixes the rm -rf --> soft lockup problem is a different story.

Theodore Ts'o (tytso) wrote on 2009-03-28:

#24

One more thing for folks who can reproduce this to test, from the #ext4 IRC channel:

(09:27:06 PM) SuperSquirrel: I am using ubuntu stock kernel on jaunty now and hasnt frozen since i turned app armor off.
(09:29:18 PM) SuperSquirrel: i have deleted 30000 Files in one directory

Can anyone else confirm that if they disable apparmor, the problem goes away?

joijioj (fdjsio-deactivatedaccount-deactivatedaccount) wrote on 2009-03-28:

#25

Hello I am "SuperSquirrel" on the IRC.

I compiled a 2.6.29 kernel last night and my system has not hung yet. I have also tested the stock kernel in Ubuntu Jaunty alpha with AppArmor deleted and my system has not hung yet. So I think the problem lies with AppArmor somewhere, as some ext4 developer said on IRC yesterday.

yaztromo (tromo) wrote on 2009-03-28:

#26

Simply unloading Apparmor service doesn't help. Is there a quick way to disable apparmor in the kernel too?

Vague guess but does this bug have any relevance? http://osdir.com/ml/file-systems.ext4/2008-01/msg00083.html

It seems to be something that was fixed in 2.6.29.

Theodore Ts'o (tytso) wrote on 2009-03-28:

#27

@21:
>Vanilla kernel 2.6.29-rc8 works well for me. So either this problem was fixed in kernel >2.6.29-rc8, or the problem is caused by Ubuntu kernel patches.

Any chance you can try a vanilla 2.6.28 kernel and see if you can reproduce the problem there? Other very interesting test points would be 2.6.28-rc5, and 2.6.28-rc7. Potential fixes that might have fixed this are:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=ba4439165f0f0d25b2fe065cf0c1ff8130b802eb

and

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7ce9d5d1f3c8736511daa413c64985a05b2feee3

The first patch, which I suspect is more likely the fix, was merged into 2.6.28.8 and 2.6.28-rc6. The second patch was merged into 2.6.28-rc8, and isn't in a 2.6.28.y series yet, although it is in the for_stable branch of the ext4 git tree.

Hence it would be interesting to see if the problem is present in 2.6.28-rc5, and fixed in 2.6.28-rc6. (And thanks to whoever can do the test, since I haven't been able to figure out how to replicate it on my systems yet.)

Theodore Ts'o (tytso) wrote on 2009-03-28:

#28

@26:
>Vague guess but does this bug have any relevance?
>http://osdir.com/ml/file-systems.ext4/2008-01/msg00083.html

I don't think so. The date on that is January 2008, and that patch was integrated long ago.

>It seems to be something that was fixed in 2.6.29.

So you've independently confirmed that it was fixed in stock 2.6.29? If so, then I think we have two people who have confirmed that it was fixed in 2.6.29, and one person who has reported it fixed in 2.6.28-rc8. (See my previous note for potential patches that might have fixed this issue.)

Theodore Ts'o (tytso) wrote on 2009-03-28:

#29

Apparmor seems less likely to be the cause, as does any of Ubuntu's "sauce" patches. I have a report from someone who is using a completely stock kernel who has seen this bug on 2.6.28, 2.6.28.4, and 2.6.29-rc6 (which if confirmed rules out my "most likely fix" in comment #27 above). Since apparmor isn't in a stock mainstream kernel, it now looks like the problem may have been fixed sometime between 2.6.28-rc6 and 2.6.28-rc8.

(I would appreciate it if others could confirm this, though --- since at least some people seem to be able to trigger this very easily, while others seem to only trigger it on the order of once a month or so. So if one of you Gentle Readers who have been able to reliably reproduce this hang can check to see whether or not it is present in stock 2.6.29-rc6, but is apparently fixed in 2.6.29-rc8, I would be most grateful for the independent confirmation.)

Thanks to all who have been helping to work this bug!

Gabriel Thörnblad (gabriel-thornblad) wrote on 2009-03-28:

#30

Just to make things absolutely clear:
the kernel versions you would like us to test are 2.6.29-rc6 and 2.6.29-rc8? There have been numerous references to 2.6.28-rc kernels as well above, which have got me all confused.

Theodore Ts'o (tytso) wrote on 2009-03-28:

#31

@30: Gabriel,

Yes, that's correct; if you could test 2.6.29-rc6 and 2.6.29-rc8, I would be much obliged.

Sorry for the other references to other -rc kernels. I'm gathering information from other sources, including updates from this Launchpad comment stream, and each time I can get more information about "I can reproduce the problem on kernel <foo>" and "The problem seems to go away on kernel version <bar>", we get more information. The object here is to find out which patch actually solves the problem, so I can make a recommendation to the Ubuntu kernel devs to backport that individual patch --- since at this late date it is highly unlikely they will suddenly move Ubuntu Jaunty to use the just-released 2.6.29 kernel.

Thanks, regards,

yaztromo (tromo) wrote on 2009-03-28:

#32

@Theodore,

I just tested 2.6.29-rc6 sourced from http://kernel.ubuntu.com/~kernel-ppa/mainline/

My usual test, which involves deleting 40 GB of video files and reliably crashed 2.6.28, hasn't crashed 2.6.29-rc6 yet. Since I may have just gotten lucky, I'll do some more testing tomorrow.

If I can't get rc6 to crash is there much point in testing rc8?

yaztromo (tromo) wrote on 2009-03-29:

#33

Update: After doing even more testing this morning, I'm 99% sure 2.6.29-rc6 isn't affected by this bug.

dpr (dpr-aha) wrote on 2009-03-29:

#34

Hi, I could not reproduce the bug in 2.6.29-rc6 or 2.6.29 final from the same source (http://kernel.ubuntu.com/~kernel-ppa/mainline/). But I can reproduce it in 2.6.28.9 as well as in the latest ubuntu kernel.

Theodore Ts'o (tytso) wrote on 2009-03-30:

#35

Hmm. So two people have said they haven't been able to reproduce the bug in 2.6.29-rc6. Unfortunately, one poster on the linux-ext4 list claims that he experienced the problem (including getting his file system corrupted) while running that version, 2.6.29-rc6.
I'll have to ask him to confirm this. Also, all of the most likely bug fixes in 2.6.29-rc6 were backported to 2.6.28.8 (and thus would have been in 2.6.28.9).

So we have some contradictory data out there. I'm not sure how to reconcile these reports.

Can those folks who say they aren't seeing a problem with 2.6.29-rc6 try with 2.6.29-rc4 and 2.6.29-rc5, to see if they can trigger the problem there?

yaztromo (tromo) wrote on 2009-03-30:

#36

I haven't tried with rc5 but I can't trigger the lockup in rc3 or rc4 at all (after much trying too!). Going back to ubuntu 2.6.28 I can still trigger it almost immediately.

http://kernel.ubuntu.com/~kernel-ppa/mainline/ doesn't have any more built kernels lower than rc3 so I'm stuck unless someone can point me to a tutorial on compiling rc1 from source.

Andrius Štikonas (stikonas) wrote on 2009-03-30:

#37

@36
download tarball from kernel.org
tar xf linux-*.tar.bz2
fakeroot make-kpkg --initrd linux_image

Andrius Štikonas (stikonas) wrote on 2009-03-30:

#38

@36
I made mistake in instructions:
tar xf linux-2.6.29-rc*.tar.bz2
cd linux-2.6.29-rc*
make menuconfig
fakeroot make-kpkg --initrd kernel_image

I am now compiling rc2. Will tell the result in a few hours.

Theodore Ts'o (tytso) wrote on 2009-03-30:

#39

@yaztromo,

Can you tell me what you do to try to reproduce the problem? As I mentioned, I haven't been able to reproduce it myself, so I've had to rely on other people's bug reports. If there's someone who is familiar with "git bisect", it would be really useful to try to do a "git bisect start v2.6.29 v2.6.28 -- fs/ext4 fs/jbd2" (git bisect start takes the "bad" revision first), reversing the sense of "git bisect good" and "git bisect bad" (i.e., if you can reproduce it, call it "git bisect good", and if you can't reproduce the soft lock, call it "git bisect bad"). It would probably require half a dozen builds or so but at the end of it, it would point us at a patch which apparently fixed the bug. (There are 91 commits involving either the fs/ext4 or fs/jbd2 directories between .28 and .29, and log base 2 of 91 is about 6.5; so it will require approximately 7 git bisect tests in order to localize things down to a single commit.)

Again, this is mostly useful so we can tell the Ubuntu kernel devs which patch to backport for the official Ubuntu Jaunty kernel. (Fedora 11 is going to be using 2.6.29, so they won't see this issue.) So unless someone can help me reproduce it on my test system (which is a 1Gig netbook with a 5400 rpm drive running Ubuntu 8.10 with an updated kernel), I really will need someone who can reproduce it and who knows how to drive git and do kernel builds out of a git source tree to localize this down.
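
The step estimate and the reversed good/bad labelling described above can be sketched in a few lines. The helper names `bisect_steps` and `reversed_verdict` are hypothetical and exist only for this illustration; the 91-commit figure is the one quoted in the comment.

```python
import math


def bisect_steps(num_commits):
    """Approximate number of builds a bisection over num_commits needs."""
    return math.ceil(math.log2(num_commits))


def reversed_verdict(lockup_reproduced):
    """Map a test result to a bisect verdict when hunting for a fix.

    The sense is inverted: a kernel that still locks up counts as "good"
    (it is before the fix) and one that does not counts as "bad", so the
    bisect converges on the commit that made the lockup go away.
    """
    return "good" if lockup_reproduced else "bad"


print(bisect_steps(91))         # 7 (91 commits touching fs/ext4 or fs/jbd2)
print(reversed_verdict(True))   # good
print(reversed_verdict(False))  # bad
```

Since a single "no lockup" run can be luck, it is probably worth repeating the test a few times before marking a build "bad" in this reversed scheme.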

Carey Underwood (cwillu) wrote on 2009-03-30: Re: [Bug 330824] Re: Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

#40

hang.py (612 bytes, text/x-python; charset=US-ASCII; name="hang.py")

I've been able to reproduce this consistently on my desktop (2.5gb
ram, amd@1.6ghz singlecore, 7200rpm drive) by writing half a meg to a
couple thousand different files sequentially, dropping the cache,
deleting them, and starting over. Usually the machine hardlocks
partway into the second cycle. Under 2.6.29, the test completes fine
with no intermittent hanging or otherwise. I haven't tried any other
kernels yet.

My laptop (1gb ram, intel@1.6ghz, 5400rpm drive) hangs intermittently
on the same workload, but doesn't hardlock consistently.
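
For readers without the attachment, the workload described above might look roughly like this. It is a sketch written from the description only and is not the attached hang.py; the function names, file naming scheme and defaults are assumptions. Dropping the cache writes to /proc/sys/vm/drop_caches, which is Linux-only and needs root.

```python
import os


def write_files(directory, count, size):
    """Write `count` files of `size` bytes each, one after another."""
    payload = b"\0" * size
    paths = []
    for i in range(count):
        path = os.path.join(directory, f"hang-{i:05d}.dat")
        with open(path, "wb") as f:
            f.write(payload)
        paths.append(path)
    return paths


def drop_caches():
    """Flush dirty pages, then drop the page cache (Linux, needs root)."""
    os.sync()
    with open("/proc/sys/vm/drop_caches", "w") as f:
        f.write("3\n")


def run_cycles(directory, count=2000, size=512 * 1024, cycles=2, drop=True):
    """Write, drop caches, delete, repeat; returns the number of files deleted."""
    deleted = 0
    for _ in range(cycles):
        paths = write_files(directory, count, size)
        if drop:
            drop_caches()
        for path in paths:
            os.unlink(path)
            deleted += 1
    return deleted
```

Something like `run_cycles("/mnt/test/scratch")` (a hypothetical scratch directory on an ext4 partition holding nothing you care about) would write about 1 GB per cycle. On an affected kernel it may hard-lock the machine, so only run it where a reset is acceptable.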

yaztromo (tromo) wrote on 2009-03-30:

#41

Theodore,

I set an rm -rf going of multiple copies of "The High Voltage SID collection", which in total is around 130,000 small files. At most two tries is usually enough to trigger the soft lock.

I'm compiling RC1 now, thanks to Andrius' instructions. Will let you know the result.

Andrius Štikonas (stikonas) wrote on 2009-03-30:

#42

I am quite familiar with git and can try to bisect, but my laptop is not so fast, so a dozen builds will probably take some time. I reproduce the problem by working (svn up or rm -rf) with the KDE subversion repository.

Carey Underwood (cwillu) wrote on 2009-03-30:

#43

I'm starting a git bisect now.

Brian J. Murrell (brian-interlinx) wrote on 2009-03-30: Re: [Bug 330824] Re: Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

#44

On Mon, 2009-03-30 at 16:56 +0000, yaztromo wrote:
> Theodore,
>
> I set an rm -rf going of multiple copies of "The High Voltage SID
> collection", which in total is around 130,000 small files. At most two
> tries is usually enough to trigger the soft lock.
>
> I'm compiling RC1 now, thanks to Andrius' instructions. Will let you
> know the result.

Yes, my feeling has always been that this is an unlink (or generically,
"rm") race as I only saw it when two processes were processing file
removals (with one being "rm -rf") in two different directories.

Now what might be of relevance is that the two trees would have had
shared a high percentage of hard links. Maybe it's a race on deleting a
hard linked file?

b.

Carey Underwood (cwillu) wrote on 2009-03-30: Re: [Bug 330824] Re: Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

#45

> Yes, my feeling has always been that this is an unlink (or generically,
> "rm") race as I only saw it when two processes were processing file
> removals (with one being "rm -rf") in two different directories.
>
> Now what might be of relevance is that the two trees would have had
> shared a high percentage of hard links. Maybe it's a race on deleting a
> hard linked file?

I can reproduce with a large number of files in a single directory.

Brian J. Murrell (brian-interlinx) wrote on 2009-03-30: Re: [Bug 330824] Re: Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

#46

On Mon, 2009-03-30 at 17:40 +0000, Andrius Štikonas wrote:
> but my laptop is not
> so fast

ccache is your friend. Should most certainly be worth it for half a
dozen kernel (re-)builds.

b.

yaztromo (tromo) wrote on 2009-03-30:

#47

2.6.29-rc1 is stable for me. The only kernels I can reproduce the error on are Ubuntu's 2.6.28 from the repos and 2.6.28.9 from the PPA mentioned above.

Andrius Štikonas (stikonas) wrote on 2009-03-31:

#48

I can also confirm that this bug is not reproducible on 2.6.28 and 2.6.29-rc1 vanilla kernels.

It means that this problem occurred because of Ubuntu kernel patches, and that's why Theodore was not able to reproduce it on his Intrepid machine with updated vanilla kernels.

Theodore Ts'o (tytso) wrote on 2009-03-31:

#49

@Andrius,

Unfortunately, I suspect it's not quite so simple as this. I have had one user who has been using a stock linux kernel using 2.6.28, 2.6.28.4, and 2.6.29-rc6 --- but it takes about a month or so before he's able to replicate the problem.

So at this point, given what everyone has reported, my best guess is that there is something in the Ubuntu "Sauce" patches which makes the bug much more likely to manifest, but which may not be the root cause of the problem by itself.

So what this means is at this point, is that I need some volunteer to try doing a git bisect, this time between a stock kernel version used as the base for an official Ubuntu kernel, and the fully applied set of patches for an Ubuntu kernel, so we can find which patch in the Ubuntu sauce series seems to make this problem easy to manifest (i.e., appearing within hours, versus taking a month or so of normal usage before it shows up). When we find the "problem" patch, it may not be the guilty patch, but it might unmask the real root cause of the problem, and so by looking at the patch, maybe we'll get a good hint about what the real underlying problem might be.

I'm currently frightfully busy, trying to get ready for the upcoming Linux Storage and Filesystem Workshop and the Linux Foundation Collaboration Summit next week, so I have very little time to do the grunt work --- so if someone could step forward and try to do the git bisect to determine the Ubuntu "sauce" patch that seems to be responsible for making the hang easy to occur, I would be terribly grateful.

Thanks in advance....

Carey Underwood (cwillu) wrote on 2009-03-31: Re: [Bug 330824] Re: Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

#50

@Theo

I'm about halfway through a git bisect of mainline 2.6.28 -> 2.6.29,
and (unsurprisingly I now realize) haven't been able to reproduce.
I'm pulling down git://kernel.ubuntu.com/ubuntu/ubuntu-jaunty.git
right now, and will run a bisect against mainline 2.6.28 and ubuntu's
2.6.28-11.38.

On Tue, Mar 31, 2009 at 11:53 AM, Theodore Ts'o <email address hidden> wrote:
> So what this means is at this point, is that I need some volunteer to
> try doing a git bisect, this time between a stock kernel version used as
> the base for an official Ubuntu kernel, and the fully applied set of
> patches for an Ubuntu kernel, so we can find which patch in the Ubuntu
> sauce series seems to make this problem easy to manifest (i.e.,
> appearing within hours, versus taking a month or so of normal usage
> before it shows up). When we find the "problem" patch, it may not be
> the guilty patch, but it might unmask the real root cause of the
> problem, and so by looking at the patch, maybe we'll get a good hint
> about what the real underlying problem might be.

Saivann Carignan (oxmosys) wrote on 2009-04-01:

#51

screenshot of the console (259.2 KiB, image/jpeg)

I get this bug each time I rsync a large quantity of files over the network. It seems to survive longer if I do it from the console compared to a normal boot with GNOME. I attach two screenshots showing the debug output that appears in the console and the debug output from Alt + SysRq + W.

Saivann Carignan (oxmosys) wrote on 2009-04-01:

#52

Screenshot of ALT + SysRec + W (198.0 KiB, image/jpeg)

Theodore Ts'o (tytso) wrote on 2009-04-01:

#53

Saivann,

The screen shot is very much appreciated, but unfortunately sysrq-w generates a *huge* amount of data; much more than can fit on a single screen, alas. Which means to get the full sysrq-w, we would either need to get you set up with a serial console, and then monitor the output via the serial console to record all of the data dumped out to the console, OR, if you can get a separate root partition and test partition, and replicate the hang on the test partition, hopefully writes to the root partition will still work. At that point, you might be able to do "dmesg > /dmesg.txt", and thus save the output of the sysrq-w.

One other thought --- does the Ubuntu kernel come with CONFIG_LOCKDEP and CONFIG_LOCKDEP_SUPPORT enabled? It may be that this will give us the warning we need, hopefully a little bit more succinctly. If someone can reproduce the problem on a custom-made kernel that has Lockdep enabled, maybe that will shed some light on what's going on.

Saivann Carignan (oxmosys) wrote on 2009-04-01:

#54

Theodore : Thanks for your guidance. I tried today to delete files in the destination partition from the ubuntu LiveCD (squashfs) and the result was the same, I wasn't able to input any text and/or use the mouse so it was not possible to give needed information. If you have other ideas, I'm ready to try different methods, including serial consoles if you can guide me a bit.

Special notes about my partition : It is almost full (40Gb total. 4Gb free.) and it is a cryptsetup encrypted ext4 filesystem.

Carey Underwood (cwillu) wrote on 2009-04-01:

#55

git bisect start v2.6.28 Ubuntu-2.6.28-11.38

bisect_log so far:

git bisect start
# good: [4a6908a3a050aacc9c3a2f36b276b46c0629ad91] Linux 2.6.28
git bisect good 4a6908a3a050aacc9c3a2f36b276b46c0629ad91
# bad: [1c211f0a50c10a7a95e958dcad89a185ab2e1a1a] UBUNTU: Ubuntu-2.6.28-11.38
git bisect bad 1c211f0a50c10a7a95e958dcad89a185ab2e1a1a
# good: [c04f828bd3a42d738f547fe6b0549cf70510a380] relay: fix lock imbalance in relay_late_setup_files
git bisect good c04f828bd3a42d738f547fe6b0549cf70510a380
# bad: [d1b53be89bd6a5053596aee8decfccf135e725ae] ALSA: hda - Release ELD proc file
git bisect bad d1b53be89bd6a5053596aee8decfccf135e725ae
# good: [576c67b9784953f3796b54c0ea45ecd68acf0e50] USB: usb-storage: add Pentax to the bad-vendor list
git bisect good 576c67b9784953f3796b54c0ea45ecd68acf0e50

Carey Underwood (cwillu) wrote on 2009-04-01:

#56

Disregard that last bisect good 576c, screwed up the test. 576c is bad.

Revision history for this message

Saivann Carignan (oxmosys) wrote on 2009-04-01:

#57

netconsole.txt (49.5 KiB, text/plain)

I was able to capture linux debug logs using netconsole. File attached

Revision history for this message

Carey Underwood (cwillu) wrote on 2009-04-02:

#58

Almost done bisecting.

git-bisect start
# good: [c04f828bd3a42d738f547fe6b0549cf70510a380] relay: fix lock
imbalance in relay_late_setup_files
git-bisect good c04f828bd3a42d738f547fe6b0549cf70510a380
# bad: [75a9a0bdb7f5d4a9a29711a3232b24fab35eb4e0] cpuidle: Add
decaying history logic to menu idle predictor
git-bisect bad 75a9a0bdb7f5d4a9a29711a3232b24fab35eb4e0
# bad: [0bfe75ee038b6774197e03990c1e6132c26cc4dc] UBUNTU: SAUCE:
(revert before 2.6.28.y update) [PATCH] ext4: Fix race between
read_block_bitmap() and mark_diskspace_used()
git-bisect bad 0bfe75ee038b6774197e03990c1e6132c26cc4dc
# good: [938ded64f043e003a2381b46f890cafb0ebd5e2a] ALSA: hda - More
fixes on Gateway entries
git-bisect good 938ded64f043e003a2381b46f890cafb0ebd5e2a
# good: [7051f08630b7269d548930be358624f2830577df] UBUNTU: SAUCE:
(revert before 2.6.28.y update) [PATCH] ext4: Add support for
non-native signed/unsigned htree hash algorithms
git-bisect good 7051f08630b7269d548930be358624f2830577df
# good: [dad87da3db508b0e7befb67c2d7e70219b2bcafc] UBUNTU: SAUCE:
(revert before 2.6.28.y update) [PATCH] jbd2: Add barrier not
supported test to journal_wait_on_commit_record
git-bisect good dad87da3db508b0e7befb67c2d7e70219b2bcafc
# skip: [dbf8b1c4e8122e705447b69aea9ee6ef3a9caa30] UBUNTU: SAUCE:
(revert before 2.6.28.y update) [PATCH] ext4: Use
EXT4_GROUP_INFO_NEED_INIT_BIT during resize
git-bisect skip dbf8b1c4e8122e705447b69aea9ee6ef3a9caa30

That last one fails on boot to mount the filesystem due to an inode
that was double freed (I think, I'll rerun that one after I finish the
bisect).

This leaves the following patches:

[ bad] UBUNTU: SAUCE: (revert before 2.6.28.y update) [PATCH] ext4:
Fix race between read_block_bitmap() and mark_diskspace_used()
[ ? ] UBUNTU: SAUCE: (revert before 2.6.28.y update) [PATCH] ext4:
don't use blocks freed but not yet committed in buddy cache init
[ ? ] UBUNTU: SAUCE: (revert before 2.6.28.y update) [PATCH] ext4:
cleanup mballoc header files
[skip] UBUNTU: SAUCE: (revert before 2.6.28.y update) [PATCH] ext4:
Use EXT4_GROUP_INFO_NEED_INIT_BIT during resize
[ ? ] UBUNTU: SAUCE: (revert before 2.6.28.y update) [PATCH] ext4:
Add blocks added during resize to bitmap
[ ? ] UBUNTU: SAUCE: (revert before 2.6.28.y update) [PATCH] ext4:
Don't overwrite allocation_context ac_status

Revision history for this message

Carey Underwood (cwillu) wrote on 2009-04-02:

#59

"UBUNTU: SAUCE: (revert before 2.6.28.y update) [PATCH] ext4: cleanup
mballoc header files" also fails with the same error:

relevant retyped excerpt from dmesg:

EXT4-fs: barriers enabled
kjournald2 starting. Commit interval 5 seconds
EXT4-fs: delayed allocation enabled
EXT4-fs: file extents enabled
EXT4-fs: mballoc enabled
EXT4-fs: mounted filesystem with ordered data mode
<snip>
EXT4 FS on sdb1, internal journal on sdb1:8
EXT4-fs error (device sdb1): ext4_mb_generate_buddy: EXT4-fs: group
232: 15995 blocks in bitmap, 15994 in gd
Aborting journal on device sdb1:8
Remounting filesystem read-only
EXT4-fs error (device sdb1) in ext4_reserve_inode_write: Journal has aborted
EXT4-fs error (device sdb1) in ext4_reserve_inode_write: Journal has aborted
EXT4-fs error (device sdb1) in ext4_ext_remove_space: Journal has aborted
EXT4-fs error (device sdb1) in ext4_reserve_inode_write: Journal has aborted
EXT4-fs error (device sdb1) in ext4_orphan_del: Journal has aborted
EXT4-fs error (device sdb1) in ext4_reserve_inode_write: Journal has aborted
EXT4-fs error (device sdb1): mb_free_blocks: double-free of inode 0's
block 7607748(bit 5572 in group 232)
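
The ext4_mb_generate_buddy line reports a mismatch for one block group between two counts: one derived from the block bitmap and one stored in the group descriptor ("gd"). As a small illustration, the Python sketch below pulls the three numbers out of that message. The regular expression is written against the retyped excerpt above and is an assumption about the message layout, not a stable kernel interface; `parse_buddy_mismatch` is a made-up helper name.

```python
import re

# Layout assumed from the retyped excerpt; \s+ also spans the line wrap.
PATTERN = re.compile(r"group\s+(\d+):\s+(\d+) blocks in bitmap,\s+(\d+) in gd")

def parse_buddy_mismatch(text):
    """Return the group number and both counts, or None if no match."""
    m = PATTERN.search(text)
    if m is None:
        return None
    group, in_bitmap, in_gd = map(int, m.groups())
    return {"group": group, "bitmap": in_bitmap, "gd": in_gd,
            "delta": in_bitmap - in_gd}

excerpt = (
    "EXT4-fs error (device sdb1): ext4_mb_generate_buddy: EXT4-fs: group\n"
    "232: 15995 blocks in bitmap, 15994 in gd"
)
mismatch = parse_buddy_mismatch(excerpt)
print(mismatch)
```

Here the two counts differ by a single block in group 232, the same group named in the double-free message at the end of the excerpt.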

One more kernel to test...

Revision history for this message

Carey Underwood (cwillu) wrote on 2009-04-02:

#60

"""There are only 'skip'ped commit left to test.
The first bad commit could be any of:
bfe25765f9d655bcdb0ed883786ef1ad8509b027
dbf8b1c4e8122e705447b69aea9ee6ef3a9caa30
f5de197efcd44096152aacc0e8d3c02637959185
We cannot bisect more!
"""

Final git bisect log:

cwillu@nokia:~/work/kernel/linux-2.6$ git bisect log
git-bisect start
# good: [c04f828bd3a42d738f547fe6b0549cf70510a380] relay: fix lock imbalance in relay_late_setup_files
git-bisect good c04f828bd3a42d738f547fe6b0549cf70510a380
# bad: [75a9a0bdb7f5d4a9a29711a3232b24fab35eb4e0] cpuidle: Add decaying history logic to menu idle predictor
git-bisect bad 75a9a0bdb7f5d4a9a29711a3232b24fab35eb4e0
# bad: [0bfe75ee038b6774197e03990c1e6132c26cc4dc] UBUNTU: SAUCE: (revert before 2.6.28.y update) [PATCH] ext4: Fix race between read_block_bitmap() and mark_diskspace_used()
git-bisect bad 0bfe75ee038b6774197e03990c1e6132c26cc4dc
# good: [938ded64f043e003a2381b46f890cafb0ebd5e2a] ALSA: hda - More fixes on Gateway entries
git-bisect good 938ded64f043e003a2381b46f890cafb0ebd5e2a
# good: [7051f08630b7269d548930be358624f2830577df] UBUNTU: SAUCE: (revert before 2.6.28.y update) [PATCH] ext4: Add support for non-native signed/unsigned htree hash algorithms
git-bisect good 7051f08630b7269d548930be358624f2830577df
# good: [dad87da3db508b0e7befb67c2d7e70219b2bcafc] UBUNTU: SAUCE: (revert before 2.6.28.y update) [PATCH] jbd2: Add barrier not supported test to journal_wait_on_commit_record
git-bisect good dad87da3db508b0e7befb67c2d7e70219b2bcafc
# skip: [dbf8b1c4e8122e705447b69aea9ee6ef3a9caa30] UBUNTU: SAUCE: (revert before 2.6.28.y update) [PATCH] ext4: Use EXT4_GROUP_INFO_NEED_INIT_BIT during resize
git-bisect skip dbf8b1c4e8122e705447b69aea9ee6ef3a9caa30
# good: [455220fb409ce06ac3c902417c5a85d17b0308c0] UBUNTU: SAUCE: (revert before 2.6.28.y update) [PATCH] ext4: Add blocks added during resize to bitmap
git-bisect good 455220fb409ce06ac3c902417c5a85d17b0308c0
# skip: [bfe25765f9d655bcdb0ed883786ef1ad8509b027] UBUNTU: SAUCE: (revert before 2.6.28.y update) [PATCH] ext4: cleanup mballoc header files
git-bisect skip bfe25765f9d655bcdb0ed883786ef1ad8509b027
# bad: [f5de197efcd44096152aacc0e8d3c02637959185] UBUNTU: SAUCE: (revert before 2.6.28.y update) [PATCH] ext4: don't use blocks freed but not yet committed in buddy cache init
git-bisect bad f5de197efcd44096152aacc0e8d3c02637959185
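
To show why the bisect ends with three candidates instead of one, here is a rough Python simulation of bisecting a linear history when some commits cannot be tested. It is a sketch of the idea, not git's actual algorithm. The short labels are made-up stand-ins for the six remaining patches, their oldest-to-newest order is inferred from the list in comment #58, and the verdict for "ac-status" (never tested in the log) is assumed to be good.

```python
def bisect_with_skips(commits, test):
    """commits: oldest-to-newest; the last one is known bad and the parent
    of the first one is known good. test(c) returns 'good', 'bad' or 'skip'.
    Returns the commits that could still be the first bad one."""
    lo = -1                  # index of the newest known-good commit
    hi = len(commits) - 1    # index of the oldest known-bad commit
    skipped = set()
    while True:
        untested = [i for i in range(lo + 1, hi) if i not in skipped]
        if not untested:
            break
        mid = (lo + hi) / 2
        i = min(untested, key=lambda k: abs(k - mid))  # closest to the middle
        verdict = test(commits[i])
        if verdict == "good":
            lo = i
        elif verdict == "bad":
            hi = i
        else:
            skipped.add(i)
    return commits[lo + 1 : hi + 1]

# Hypothetical labels; verdicts follow the final bisect log above.
verdicts = {
    "ac-status": "good",          # assumed, never tested in the log
    "resize-bitmap": "good",      # 455220fb
    "need-init-bit": "skip",      # dbf8b1c4
    "mballoc-headers": "skip",    # bfe25765
    "buddy-cache-init": "bad",    # f5de197e
    "bitmap-race": "bad",         # 0bfe75ee
}
candidates = bisect_with_skips(list(verdicts), verdicts.get)
print(candidates)
```

Because the two skipped commits sit directly between the last good commit and the first bad one, nothing further can be tested, and all three remain suspects.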

Revision history for this message

Theodore Ts'o (tytso) wrote on 2009-04-02:

#61

Hmm, all of those appear to be backports of ext4 patches which are in 2.6.29. Maybe the backports are buggy, or there's something else going on. I'll have to grab the Ubuntu kernel git tree and try to see what's going on. It's weird; some of the "revert before 2.6.28.y" Sauce patches are ones which were never going to be headed for the stable kernel series. (We don't do things like "clean up mballoc header patches" in the stable series.)

Thanks for the work in doing the bisect; it should be very useful!

Revision history for this message

James Clemence (jvc26) wrote on 2009-04-03:

#62

Just another comment, more of the same, amd64, ubuntu jaunty kernel 2.6.28-11-generic. Freezes with either empty wastebasket or rm -rf/rm/etc.

EXT4 native, not a converted EXT3, unfortunately system does not respond to SysRq and getting debug logs is almost impossible.

If I can be any more helpful rather than just adding another person onto the list of 'this affects me and is a bit of a showstopper', please shout.

Revision history for this message

Tim Gardner (timg-tpi) wrote on 2009-04-03:

#63

@Ted - these are the Jaunty fs/ext4 commits that are different from 2.6.28.9:

LP: #346194
bbf2bb7bd0a7efaaea309c88dc6bc7d1f89b7516 ext4: fix header check in ext4_ext_search_right() for deep extent trees.

Per your suggestion to ameliorate the delayed rename behavior:
0903d3a2925f3cffb78ca611c4e3356ac7ffef8a UBUNTU: SAUCE: (drop after 2.6.28) ext4: add EXT4_IOC_ALLOC_DA_BLKS ioctl
a4a01c495e3d445428015ff7f9825430e77f9567 UBUNTU: SAUCE: (drop after 2.6.28) ext4: Automatically allocate delay allocated blocks on close
f305d27b95849da130c3319e51054309c371e92a UBUNTU: SAUCE: (drop after 2.6.28) ext4: Automatically allocate delay allocated blocks on rename

Changed in linux (Ubuntu):
assignee:	nobody → timg-tpi
importance:	Undecided → Medium
status:	Confirmed → In Progress

Revision history for this message

JamesT (james-mi4) wrote on 2009-04-06:

#64

Just adding to the list of occurrences, on 2.6.28-11-generic, Kubuntu 9.04 amd64.

ext4 native again, no conversions. Anything that does really high disk I/O seems to trigger it. The one app type that seems to trigger it quickly is news readers (hellanzb or klibido), although I've seen it on a big apt-get update. Nothing gets logged to the console except "BUG: soft lockup - CPU#0 stuck for 61s!". It always seems to be 61s.

J

Revision history for this message

Leon Nardella (leon.nardella) wrote on 2009-04-12:

#65

Does anybody see a connection between this bug and the data loss bug (Bug #317781)? I just had my laptop freeze again (I was building Firefox) and then it was trashed so hard I couldn't even get it to boot again!

Revision history for this message

Tim Gardner (timg-tpi) wrote on 2009-04-13:

#66

It's possible this bug is related to LP 348836. Can you guys try the kernel referenced in https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/348836/comments/7 ?

Revision history for this message

Wade Menard (wade-ezri) wrote on 2009-04-13:

#67

I may now jinx myself but the patch in that build does appear to fix the issue

Revision history for this message

Theodore Ts'o (tytso) wrote on 2009-04-13:

#68

I will be very happy if this patch fixes the issue --- however, I'll note that this patch showed up in mainline shortly after 2.6.28-rc8, and it doesn't explain the reports from people who said they were not able to reproduce the problem using a stock 2.6.28 or 2.6.29-rc1 kernel.

Anyway, if more people who have been able to reproduce this bug could test the patch referenced above (there is a prebuilt available here: http://kernel.ubuntu.com/~rtg/2.6.28-lp348836) that would be much appreciated.

Revision history for this message

Saivann Carignan (oxmosys) wrote on 2009-04-13:

#69

I installed the kernel package pointed to by Tim Gardner (http://kernel.ubuntu.com/~rtg/2.6.28-lp348836/linux-image-2.6.28-11-generic_2.6.28-11.42_i386.deb), rebooted my computer and I was still able to reproduce the freeze after 3 seconds. So unfortunately, it does not seem to be fixed here.

Revision history for this message

Wade Menard (wade-ezri) wrote on 2009-04-14:

#70

I'm now at a 21 hour uptime doing many things that would trigger the lockups for me pretty consistently under the main ubuntu kernel... compiling and clobbering large source trees, emptying trash, clearing application caches, etc.

I have also experienced the issue mentioned in bug 348836 before as well when creating large (~80GiB) files with TrueCrypt, resulting in unbootable filesystem corruption I had to repair with the LiveCD. I have tried to reproduce that again with this kernel and am not able to.

So far on this kernel I'm quite happy. I am curious about Saivann's case though and would like to see tests from others.

Revision history for this message

dpr (dpr-aha) wrote on 2009-04-14:

#71

Current jaunty kernel and the one from http://kernel.ubuntu.com/~rtg/2.6.28-lp348836/linux-image-2.6.28-11-generic_2.6.28-11.42_i386.deb both freeze for me.

Revision history for this message

Nick B. (futurepilot) wrote on 2009-04-14: Re: [Bug 330824] Re: Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

#72

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I'm still able to reproduce the freeze with hang.py using the
suggested kernel 2.6.28-11 #42
I have a laptop that I have been able to reproduce this bug with
rather easily possibly due to it's low system specs.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAknk67wACgkQAGHzB9Tvw4yGvACcD/HRt17cAgx70Dl9uZEGeETM
0yUAn0dCo/jZJve6yfmgAypOrgVxcWtx
=VxjD
-----END PGP SIGNATURE-----

Revision history for this message

Carey Underwood (cwillu) wrote on 2009-04-14:

#73

The existence of this bug should be noted in the release notes under known issues as well as in the feature summary for "Ext4 filesystem support".

Saivann Carignan (oxmosys) on 2009-04-15

affects:

ubuntu-website → ubuntu-release-notes

Revision history for this message

Brian J. Murrell (brian-interlinx) wrote on 2009-04-15:

#74

On Tue, 2009-04-14 at 23:35 +0000, Carey Underwood wrote:
> The existence of this bug should be noted in the release notes under
> known issues as well as in the feature summary for "Ext4 filesystem
> support".

So basically, the fact that ext4 simply does not work in Ubuntu Jaunty
should be in the release notes? Maybe you should just be honest with
the upcoming user base and simply disable it until it does work.

Delivering something known to be so broken rather than simply failing to
deliver, IMHO, is the worse option.

If people really want/need ext4 and it's a deciding factor in choosing a
distro, they should be allowed to evaluate their choices without the
smoke and mirrors that it works in Ubuntu. Don't waste their time.

I'm not sure why this is not being resolved anyway. Ted gave some
awesome advice on tracking down this bug and some other people spent a
significant portion of their own valuable time to do most of the grunt
work. Why are the results of those efforts not being put towards fixing
the problem and making a release that works?

Revision history for this message

Carey Underwood (cwillu) wrote on 2009-04-15: Re: [Bug 330824] Re: Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

#75

Because it's still being investigated, and 9.04 isn't actually
released yet. There's a perfectly good chance that the culprit is
nailed down and fixed before release, and if it can't, then simply
removing it from the final release notes and installer would suffice.

Revision history for this message

Brian J. Murrell (brian-interlinx) wrote on 2009-04-15: Re: [Bug 330824] Re: Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

#76

On Wed, 2009-04-15 at 11:46 +0000, Carey Underwood wrote:
> Because it's still being investigated, and 9.04 isn't actually
> released yet. There's a perfectly good chance that the culprit is
> nailed down and fixed before release, and if it can't, then simply
> removing it from the final release notes and installer would suffice.

Ahh, yes. This sounds very good.

The previous comment, to which I was replying gave an impression that
nothing more was going to be done (i.e. like perhaps it was too close to
freeze to make any more changes) and Jaunty would just release "as is".

If the decision was made to simply disable it if it cannot be fixed,
then yes, this is very good. Doing otherwise, again, IMHO, would
further besmirch the reputation of Ubuntu's releases[1].

[1] As if Intrepid didn't do enough besmirching of its own. I know
people who are actively searching for alternate distros because of
Intrepid's bugginess. I personally have not upgraded any of my
friends/family to Intrepid because of same said bugginess. The jury is
still out on Jaunty for me, but including evolution 2.26 (technically
anything > 2.22) is not helping Jaunty's case.

Revision history for this message

Theodore Ts'o (tytso) wrote on 2009-04-15:

#77

I'll note that part of the problem is that it doesn't seem to be trivially reproducible even on Ubuntu Jaunty. I was having lunch with a number of Ubuntu kernel developers at the Linux Foundation Collaboration summit, and I talked to them specifically about this bug --- a number of them indicated to me that they were using ext4, and deleting kernel trees, and building new trees all the time, without a problem, and they were on Jaunty.

That being said, there's clearly a problem here, but the exact reproduction conditions haven't been found yet, so not everyone is able to reliably reproduce the problem. And so far I'm not hearing much from the Fedora 11 beta --- but maybe there's something unique that Ubuntu is enabling by default that makes this bug much easier to tickle on Ubuntu than on F11. Argh....

On my side, I still need to figure out why a Ubuntu Jaunty kernel built with my custom "no modules" config causes a black screen lockup when booting on an Ubuntu Intrepid userspace --- or find room to do an install of Ubuntu Jaunty beta and try to reproduce the problem myself. Oh, and I have to get my taxes filed too, and expense reports, and lots of other things related to my day job (which doesn't include ext4; it's been a long time since anyone has paid me to work on ext4 as my day job -- it's something I do in my copious spare time in the evenings or when I have a few spare moments).

Revision history for this message

Saivann Carignan (oxmosys) wrote on 2009-04-15:

#78

Theodore Ts'o : I noticed that this bug is more likely to happen on partitions that do not have much space left, in case this gives you a hint. I have a 30 GB partition which has only 3 GB left. The bug can be reproduced immediately each time on that partition. I have a 700 GB partition with 500 GB left, and it's pretty hard to reproduce the issue on it.
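
To make reports like this easier to compare, a short Python sketch for expressing fullness as a percentage may help. It uses the standard library's `shutil.disk_usage`; the helper names are made up for illustration, and the two example partitions simply reuse the sizes quoted above.

```python
import shutil

def percent_full(total, free):
    """Share of a filesystem in use, given total and free space in the same unit."""
    return 100.0 * (total - free) / total

def percent_full_at(path):
    """Same figure for the filesystem that holds `path`."""
    usage = shutil.disk_usage(path)
    return percent_full(usage.total, usage.free)

small = percent_full(30, 3)      # the 30 GB partition with 3 GB left
large = percent_full(700, 500)   # the 700 GB partition with 500 GB left
print(f"30 GB partition:  {small:.1f}% full")
print(f"700 GB partition: {large:.1f}% full")
print(f"filesystem of the current directory: {percent_full_at('.'):.1f}% full")
```

Note that the free figure returned by `shutil.disk_usage` is the space available to the calling user, so blocks reserved for root count as used here.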

Revision history for this message

Theodore Ts'o (tytso) wrote on 2009-04-15:

#79

@Saivann,

Thanks, that's a really good tip. I'll see if I can reproduce on a mostly full filesystem.

Revision history for this message

Theodore Ts'o (tytso) wrote on 2009-04-15:

#80

One other question --- for the people that have managed to reproduce, do you know if there were any files getting created or otherwise blocks getting allocated on the filesystem under test, or is an "rm -rf" of some hierarchy in the filesystem sufficient on its own to cause the system to hang?

Revision history for this message

Brian J. Murrell (brian-interlinx) wrote on 2009-04-15:

#81

On Wed, 2009-04-15 at 13:42 +0000, Theodore Ts'o wrote:
> I'll note that part of the problem is that it doesn't seem to be
> trivially reproducible even on Ubuntu Jaunty.

It seems to reproduce here when two processes are removing files/trees
from the same filesystem. And as another comment suggests, my
filesystem is pretty damn full (98%) too, so maybe that's the key
factor.

> On my side, I still need to figure out why a Ubuntu Jaunty kernel built
> with my custom "no modules" config causes a black screen lockup when
> booting on an Ubuntu Intrepid userspace --- or find room to do an
> install of Ubuntu Jaunty beta and try to reproduce the problem myself.
> Oh, and I have to get my taxes filed too, and expense reports, and lots
> of other things related to my day job (which doesn't include ext4; it's
> been a long time since anyone has paid me to work on ext4 as my day job
> -- it's something I do in my copious spare time in the evenings or when
> I have a few spare moments).

I hear ya. I didn't at all mean to suggest that you (Ted) should be
working to fix this (indeed, your volunteered contributions on the bug
are significant in its progress), but hopefully somebody who's getting
paid to work on Ubuntu Linux could devote some time to it.

Ted: As to your other question about allocation during removal, I tend
to doubt in my case there was. This is an archive/backup filesystem in
my case and the parallel (but not racing) deletes happen after the
backup run, so there shouldn't be any allocation happening at that time.

Ted: FWIW, you might (or might not) recall we discussed the speed of
deleting hardlink trees in ext3 vs. XFS a few months ago when I was
switching to XFS specifically for deletes speed... well, a few XFS
crashes (and one almost 24 hour xfs_repair) later I am back in the ext*
fold on ext4 and happy to report that XFS was just as slow as ext3 and
ext4 beats them both hands down at deleting big (many) hard linked trees
of files.

Revision history for this message

Nick B. (futurepilot) wrote on 2009-04-15:

#82

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Saïvann Carignan wrote:
> Theodore Ts'o : I noticed that this bug is more likely to happen on
> partitions that does not have many space left, in case that this give
> you some hint. I have a 30 Gb partition which has only 3 Gb left. The
> bug can be reproduced immediately each time on that partition. I have a
> 700 Gb partition with 500 Gb left, it's pretty hard to reproduce the
> issue on it.
>
The two machines I've been able to reproduce this on have a root
partition of around 20GB. One has about 13GB of free space, and the
other has about 15GB of free space. hang.py is writing to /tmp so in
this case the writing and deleting is happening on the root file system.

> One other question --- for the people that have managed to reproduce, do
> you know if there was any files getting created or otherwise blocks
> getting allocated on the filesystem under test, or is an "rm -rf" of
> some hierarchy in the filesystem sufficient on its own to cause the
> system to hang?

hang.py is creating a bunch of files in /tmp/test and then deleting
them all from test/. So it's only deleting files, not a hierarchy of
any sort.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAknmF28ACgkQAGHzB9Tvw4wzVwCdHYSivqxE1KQJZ5ANhOMMWanJ
VPYAoIKBjVPK27+K39OeBaa4pnGfSrDR
=fUB+
-----END PGP SIGNATURE-----
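
For readers without the attachment, the pattern described for hang.py (create a batch of files in one flat directory, then delete them all) can be sketched in Python roughly as follows. This is not the actual hang.py; the function name, file count, and file size are assumptions chosen for illustration. On an affected kernel a larger run might hang the machine, so only try that on a test partition.

```python
import os
import tempfile

def create_then_delete(directory, count=1000, size=4096):
    """Write `count` files of `size` bytes into `directory`, then unlink them.
    Returns the number of files removed."""
    payload = b"\0" * size
    paths = []
    for i in range(count):
        path = os.path.join(directory, f"file{i:06d}")
        with open(path, "wb") as f:
            f.write(payload)
        paths.append(path)
    removed = 0
    for path in paths:
        os.unlink(path)          # flat directory: files only, no hierarchy
        removed += 1
    return removed

# Small, harmless demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as scratch:
    removed = create_then_delete(scratch, count=50, size=1024)
    leftover = os.listdir(scratch)
print(removed, leftover)
```

The count and size would need to be raised considerably to approximate the workloads reported in this thread.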

Revision history for this message

yaztromo (tromo) wrote on 2009-04-15:

#83

@Theodore - At every hang I was only deleting files, no other activity was happening. A simple rm -rf on a lot of small files or some big files would do the trick.

I have two systems here running Jaunty; the slowest (Athlon 1000 MHz) can reproduce the bug very easily, yet I have to try multiple times with the fastest (Celeron 1.7 GHz). HDD speed does not seem to matter.

Revision history for this message

Tim Gardner (timg-tpi) wrote on 2009-04-15:

#84

@Ted - Saivann may have a point about a nearly full file system. Perhaps one of the reasons I've not experienced any issues is that all of my ext4 file systems are quite large (1.8T in one case) with relatively low space utilization (< 5%). I continue to use ext4 on a kernel build server (4 spindle 1.8T RAID0 on a dual quad-core). There is lots of overlapping file system activity on a machine like that.

Revision history for this message

yaztromo (tromo) wrote on 2009-04-15:

#85

Also to add, I can't confirm Saïvann Carignan's experience that it happens on mostly full filesystems, since I have 150gb free on a 500gb partition and can reproduce easily.

I personally feel that this is a massive showstopper for ext4 and that ext4 shouldn't be an option in the final release of Jaunty. I think its priority is worthy of more than "medium" too.

Revision history for this message

David Bowles (ubuntu-david) wrote on 2009-04-15:

#86

Just wanted to register myself as another user experiencing this bug and give some details on the conditions under which I have been able to reproduce it as they do not exactly correlate with some of the comments above.

Latest kernel in repository - 2.6.28.11.14
4.5TB ext4 partition on top of a same size dmcrypt partition, ~3 weeks old
3.9TB free - Indicates that a low amount of free space is not necessarily a large contributing factor

I was able to reliably cause this problem by deleting a folder (rm -rf) that contained two files of ~90GB. This is a data partition and I am reasonably confident (97%) that there were no other operations being performed on that partition at the time.

Lockup could also be caused by directly deleting the file (rm largefile.test) - Indicates that a directory tree traversal is not always necessary

Following a number of repeats of the process -
1. Start deletion of large file
2. Experience lock-up
3. Reboot
4. Repeat
The file was no longer present upon rebooting the machine.

The problem was reproducible though by creating new large files and then attempting to delete them.

I am very interested in getting this bug fixed and happy to help with any diagnosis I can.

Best of luck,
David

Revision history for this message

Theodore Ts'o (tytso) wrote on 2009-04-15:

#87

@yaztromo,

Well, the slowest machine I have is a netbook with an Atom N270 1.6GHz processor with 1.5 GB of memory (Hmm, I wonder if the amount of memory has any significance; are people who can reproduce easily doing so with especially large or small amounts of memory?) and a 5400 rpm drive.

I'll try again with the latest Ubuntu Jaunty kernel, and see if I can figure out why it wasn't booting on that box with an Intrepid userspace.

Revision history for this message

Brian J. Murrell (brian-interlinx) wrote on 2009-04-15:

#88

On Wed, 2009-04-15 at 17:46 +0000, Theodore Ts'o wrote:
> are people who can reproduce easily doing so with an
> especially large or small amounts of memory?

MemTotal: 2851296 kB
MemFree: 164428 kB
Buffers: 469320 kB
Cached: 609504 kB
SwapCached: 1780 kB
Active: 1895404 kB
Inactive: 557956 kB
HighTotal: 1964992 kB
HighFree: 147844 kB
LowTotal: 886304 kB
LowFree: 16584 kB
SwapTotal: 3145720 kB
SwapFree: 3119748 kB
Dirty: 1320 kB
Writeback: 0 kB
AnonPages: 1374060 kB
Mapped: 159116 kB
Slab: 145756 kB
SReclaimable: 103656 kB
SUnreclaim: 42100 kB
PageTables: 6940 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 4571368 kB
Committed_AS: 2345724 kB
VmallocTotal: 110584 kB
VmallocUsed: 60512 kB
VmallocChunk: 49136 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 4096 kB
DirectMap4k: 884736 kB
DirectMap4M: 32768 kB
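
For comparing memory figures across reports, here is a minimal sketch that pulls a few fields out of a dump like the one above. It assumes only the "Key: value kB" layout shown here; `parse_meminfo` and `summarize` are hypothetical helper names, and the sample string is a subset of the values posted in this comment.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integers (kB or plain counts)."""
    values = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, rest = line.split(":", 1)
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def summarize(values):
    """Return total, low and high memory in MiB (0 when a field is absent)."""
    return {name: values.get(name, 0) // 1024
            for name in ("MemTotal", "LowTotal", "HighTotal")}


# Subset of the dump above; LowTotal/HighTotal only appear on highmem (32-bit) kernels.
sample = """MemTotal: 2851296 kB
HighTotal: 1964992 kB
LowTotal: 886304 kB
LowFree: 16584 kB"""

info = parse_meminfo(sample)
print(summarize(info))
```

On a live system the same function could be fed the contents of /proc/meminfo instead of the sample string.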

Revision history for this message

Leon Nardella (leon.nardella) wrote on 2009-04-15:

#89

@Theodore Ts'o
I can consistently reproduce this bug on my Acer Aspire One AOA150, which also has an Atom N270 plus 1GB of RAM and a 120GB SATA HD.

It happens even on a clean install of Jaunty, as well as on a fully updated install, with more than 100GB of HD free.
I can reliably trigger this bug by 'rm -rf'ing the LLVM+GCC+CLANG after I build them.

You asked about other allocations occurring when the bug is triggered. I think the only other thing I was running during these lockups was Firefox plus a few extensions (NoScript, AdBlock Plus, DownloadThemAll), but I can trigger these lockups even when I log in through Ubuntu's recovery menu on boot-up (which, I guess, has the absolute minimum of processes running in the background).

I also should mention that kernel 2.6.29 ( from http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.29.1/ ) seems to have fixed this bug for me.

Revision history for this message

Nick B. (futurepilot) wrote on 2009-04-15: Re: [Bug 330824] Re: Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

#90

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Theodore Ts'o wrote:
> @yaztromo,
>
> Well, the slowest machine I have is an netbook with an Atom N270 1.6GHz
> processor with 1.5gigs of memory (Hmm, I wonder if amount of memory has
> any significance; are people who can reproduce easily doing so with an
> especially large or small amounts of memory?) and a 5400 rpm drive.
>
> I'll try again with the latest Ubuntu Jaunty kernel, and see if I can
> figure out why it wasn't booting on that box with an Intrepid userspace.
>

The laptop I've been able to easily reproduce this on only has 768MB
of RAM. The other laptop has 2GB of RAM and it takes a lot more effort
to get that one to lock up.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)

iEYEARECAAYFAknmIzkACgkQAGHzB9Tvw4yGwACfQRwjMcMdcH2OTnG4KG1hXnxT
OywAoLAhCnG5RKdj25jUO0NmSteEML1p
=90+Y
-----END PGP SIGNATURE-----

Revision history for this message

David Bowles (ubuntu-david) wrote on 2009-04-15:

#91

@Theodore

I can reproduce the bug on a 1.6GHz VIA with 2GB of memory. The partition is on a 4-drive SATA II 7200rpm RAID 5 array. This box runs at console level only. I do not have Gnome or any other applications installed, not even Apache or MySQL. When the problem occurs for me, I have no other applications running.

I can cause this bug with only the barest of processes running, processor load at 0.00 and 1.9GB of free memory before I begin deleting. I say this to highlight that I do not believe resources, especially memory, are the issue.

That said, I now need to contradict myself somewhat.
Due to a different bug, I have had to disable my VIA padlock hardware assisted cryptography chip. (i.e. I have blacklisted the padlock_aes and padlock_sha modules)
Because of this, all operations on my ext4 data partition (which is encrypted) use the processor directly. Therefore during any large operation (e.g. possibly during a large delete such as one that causes the problem for me) I would imagine that the processor hits 100% usage very quickly. Unfortunately I cannot confirm this as the lockup happens too quickly. Therefore, despite the very low usage on my box it is still perhaps a processor constraint issue.

Thanks,
David

Revision history for this message

Saivann Carignan (oxmosys) wrote on 2009-04-15:

#92

I also confirm the "small memory" hypothesis. It's easier to reproduce the bug on two computers which have ~700 MB and 1024 MB of RAM respectively, but much more difficult to reproduce on my other computer, which has 2048 MB (and a large amount of free space). Might or might not be related.

Revision history for this message

Andrius Štikonas (stikonas) wrote on 2009-04-15: Re: [Bug 330824] Re: Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

#93

I have 640 MB, and an svn up on the KDE repository core modules is enough to cause a lockup.
So the lockup happens quite easily.

Revision history for this message

Wade Menard (wade-ezri) wrote on 2009-04-15:

#94

Just to throw a wrench in, I have 8 GB of RAM and my volume is currently 52% full at 287 GB/583 GB.

It's interesting that from my count we have two people that haven't reproduced it yet with the ~lp348836 kernel and two that have. Certainly something else going on here.

Still going strong with the ~lp348836 kernel about 2.5 days in here.

Revision history for this message

Theodore Ts'o (tytso) wrote on 2009-04-15:

#95

The config file used to build git commit 831d45, which is 6 commits after Ubuntu 2.6.28-11.41 (72.8 KiB, text/plain)

For those who are following along, the latest Ubuntu kernel (with a custom config which I can include if anyone is interested) is blowing up in early boot, in what looks like the SATA/SCSI stack, on my Lenovo S10. Standard stock 2.6.28, 2.6.29 and 2.6.30-rc2 work just fine on the Lenovo S10, using substantially the same .config. I have no idea why the Ubuntu kernel isn't even booting on my netbook.

The fact that I can't even get the Ubuntu kernel to boot on my normal crash and burn machine makes it rather difficult for me to try to reproduce the problem; first I have to debug why it's crashing in early boot. (Sigh.)

Revision history for this message

Carey Underwood (cwillu) wrote on 2009-04-15: Re: [Bug 330824] Re: Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

#96

2.5GB of memory, fairly high disk used (~85%), I can reproduce nearly at will.

The bisect I performed above may have only found a patch that makes
things significantly worse: when running a kernel from the 'good'
side of the bisect I still get occasional lockups (on the order of 2-4
days), compared to minutes otherwise. It almost seems that in that
case, concurrent activity is a necessary case: firefox's periodic
session saving (rewriting a 400kb file), starting audio playback,
running a gnome-shell build with io at idle priority, and installing a
package was the triggering state the last time I had a 'good' kernel
lock up.

Revision history for this message

Jordan (jordanu) wrote on 2009-04-15:

#97

I can reproduce this bug (using hang.py) on my testing machine. If it would help, Theodore Ts'o (or any other developer), I can let you log in via ssh and give you free rein to do whatever you need to test. I'm not sure how helpful it would be without physical access, but I figured I would offer anyway.

Revision history for this message

Leann Ogasawara (leannogasawara) wrote on 2009-04-15:

#98

@tytso,

I obviously don't have the same hardware as you do (mine is a Dell Inspiron 1420), but I attempted to reproduce the boot issue you are seeing and couldn't. In case you want to test, I've put the image I built with your
config at http://people.ubuntu.com/~ogasawara/lp330824/

I did run into build issues, attempting to run:

ogasawara@emiko:~/ubuntu-jaunty$ AUTOBUILD=1 NOEXTRAS=1 skipabi=true fakeroot debian/rules binary-generic

However the following worked instead:

ogasawara@emiko:~/ubuntu-jaunty$ fakeroot make-kpkg --initrd --append-to-version=-lp330824 kernel_image kernel_headers

Revision history for this message

Steve Langasek (vorlon) wrote on 2009-04-16:

#99

Documented at <https://wiki.ubuntu.com/JauntyJackalope/ReleaseNotes#Lock-ups when deleting files from ext4 filesystems>:

In some cases, deleting files from an ext4 filesystem is reported to cause soft lock-ups in the kernel. Investigation of this problem is ongoing, and it is expected that a fix for this problem will be made available as a post-release update. To avoid this problem, users may wish to install using the default ext3 filesystem and convert their filesystem to ext4 (as documented on the ext4 wiki) once a fix is available.

Changed in ubuntu-release-notes:
status:	New → Fix Released

Revision history for this message

Justin Myers (chasemyers) wrote on 2009-04-17:

#100

Steve Langasek 18 hours ago
ubuntu-release-notes status: New → Fix Released

That's funny, because it was just about 18 hours ago now that this bug made itself apparent to me.

I actually had Ubuntu freeze when I tried a disk cleaning utility (it didn't occur to me that it froze because things were being deleted). That was a few days ago. Last night, I made the mistake of trying to empty my trash can and it froze, so I just shut it down. Today, I tried again to empty the trash, and the same thing happened. I restarted this time, and the trash was empty. Since then it's frozen once, while opening Firefox.

If the fix was released, how come I'm still having this problem even though I've got all the updates?

Revision history for this message

Sarah Kowalik (hobbsee-deactivatedaccount) wrote on 2009-04-17:

#101

That would be because ubuntu-release-notes is not Ubuntu. It *just* refers to the release notes. As Steve said, they are located at https://wiki.ubuntu.com/JauntyJackalope/ReleaseNotes#Lock-ups. This issue has been documented in the release notes, and thus, the fix for the release notes has been released.

You'll note that the actual issue is still marked as in progress for jaunty, which is why people are still having this problem.

Revision history for this message

Justin Myers (chasemyers) wrote on 2009-04-17:

#102

So it's still being worked on then? Not to be critical, but isn't this a bug that's been known for months?

Revision history for this message

Carey Underwood (cwillu) wrote on 2009-04-17:

#103

> So it's still being worked on then? Not to be critical, but isn't this a
> bug that's been known for months?

Yes, the status is still "In progress".

Note that the mainline kernels don't seem to exhibit this issue
(or at least do so orders of magnitude less often), so installing a 2.6.29
kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline/ will
probably get you up and running (presuming your hardware is supported
by a stock kernel + dkms modules).

Revision history for this message

Theodore Ts'o (tytso) wrote on 2009-04-17:

#104

As an idea: for people who can reproduce it easily, preferably one of you who has been doing this on a relatively small (20-40GB) filesystem, would you be willing to do the following?

(a) Load up the filesystem with the test set of files that, when deleted, will produce the hang.
(b) Run e2image -r /dev/sdXX - | bzip2 > /media/disk-some-other-filesystem/sdXX.e2i.bz2
(c) Run "sync" to make sure the e2image file is safely written to your USB drive, or other filesystem
(d) Try to reproduce the crash by running rm -rf to delete the files.

Repeat steps (a) through (d) until you get a crash. Then contact me off-line and send me the sdXX.e2i.bz2 file. Also tell me exactly what kernel you were running at the time of the crash.

Couple of caveats --- the compressed raw e2image file will not contain any filesystem data, but will contain file names. I promise not to look at them any more than necessary to debug this issue, and I certainly promise not to divulge them, but you should be aware of this from a privacy perspective.

What I will be able to do is to uncompress the raw e2image, and then try to reproduce the crash using exactly your filesystem layout. Thanks in advance for anyone who is willing to lend a hand.
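
Steps (b) and (c) above can be scripted so the same commands are issued on every cycle. The following is only a sketch: it builds the command strings and prints them, leaving execution to the user, and the helper names `e2image_pipeline` and `capture_steps` are made up for illustration. The device and output path are the placeholders from the steps above.

```python
import shlex


def e2image_pipeline(device, output_path):
    """Build the pipeline from step (b): raw e2image to stdout, compressed by bzip2."""
    return "e2image -r {} - | bzip2 > {}".format(
        shlex.quote(device), shlex.quote(output_path))


def capture_steps(device, output_path):
    """Return the commands for steps (b) and (c) in order; nothing is executed."""
    return [e2image_pipeline(device, output_path), "sync"]


# /dev/sdXX and the output file are placeholders, exactly as in the steps above.
for command in capture_steps("/dev/sdXX",
                             "/media/disk-some-other-filesystem/sdXX.e2i.bz2"):
    print(command)
```

Quoting the arguments with `shlex.quote` keeps the pipeline safe to paste into a shell even if the mount point contains spaces.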

Revision history for this message

Carey Underwood (cwillu) wrote on 2009-04-18:

#105

@Theo, I originally experienced the crash on an external drive (i.e.,
not the root fs). I'm going to try to reproduce it on a small loop
image.

Revision history for this message

Carey Underwood (cwillu) wrote on 2009-04-18:

#106

ext4-trial0.e2i.gz (3.9 MiB, application/x-gzip)

@Theo

That was easy :p

I've got a 2.5GB ext4 image which reproduced the issue when mounted on
/mnt/test. Attached is the e2image output produced immediately before
deleting the files. The hang itself occurred part way into writing
the files during the next cycle (as it usually does).

Revision history for this message

Andrius Štikonas (stikonas) wrote on 2009-04-18:

#107

I plugged my hard drive into another computer with 6GB RAM (instead of 640MB) and was not able to reproduce this bug at all, either with my usual methods or with hang.py. I can always easily reproduce the bug on the 640MB machine in a few seconds.

Revision history for this message

Alessandro Rinaldi (rinaldi-aless) wrote on 2009-04-19:

#108

It happened really often for me with the 2.6.28.x kernels, and now it doesn't happen at all with a 2.6.29-1 kernel (I tried everything to reproduce it, but couldn't); the 2.6.29 kernel is also OK.
So we should just find the patch that fixed it and apply it to the current kernel.

Revision history for this message

Carey Underwood (cwillu) wrote on 2009-04-20:

#109

Alessandro, please make a point of reading all the comments on a bug before adding your own. :)

Revision history for this message

Rocko (rockorequin) wrote on 2009-04-24:

#110

I reproduced this bug (kernel panic, total freeze) this morning by doing *two* simultaneous rm -rf commands on folders created by copying /bin /lib /boot /etc /lib64 /root /srv /usr /opt /sbin /var into two backup folders on another ext4 partition.

I had netconsole running on another computer, but nothing at all was logged there about the panic, unlike in https://bugs.launchpad.net/ubuntu/+source/linux/+bug/330824/comments/57.

At least something was damaged in the crash because Skype needed my password after rebooting, but I'm not sure what else might have been affected. The folders I was deleting at the time of the crash were largely intact.

Revision history for this message

Rocko (rockorequin) wrote on 2009-04-25:

#111

10.1.1.11-netconsole.log (188.8 KiB, text/plain)

I got a kernel panic just now (i.e. the PC froze completely) and managed to capture a whole bunch of general protection faults and the panic with netconsole. The PC actually froze when I opened a new tab in Firefox rather than when I went to delete a number of files, but there are a number of ext4 calls in the stack traces, including some that look like ext4_delete_inode, so in case it is useful for this bug, here is the netconsole output.

Revision history for this message

Theodore Ts'o (tytso) wrote on 2009-04-25:

#112

Rocko,

Thanks for your kernel panic log. This doesn't prove anything, but 14 seconds before the oops involving ext4_delete_inode, there was a recursive fault in the X server, which was apparently running the Nvidia proprietary driver.

So I hate to ask this but (a) how many other people who have been having this problem are running with the proprietary Nvidia driver? And (b) would it be possible for people who can easily reproduce this, if they are using the Nvidia proprietary driver, to see if the problem goes away when they shut down the X server, log into a VT console, and run the rm -rf from either the VT console or an ssh login?

It's *possible* that the Nvidia driver is an innocent victim, and not the cause, but I don't use the proprietary closed-source Nvidia driver, and I'm having a devil of a time reproducing the problem. So if we can take the closed-source Nvidia driver out of the equation, it would be useful to see where that leaves us.

Apr 25 17:10:08 10.1.1.11 [85627.288225] Pid: 3678, comm: Xorg Tainted: P      D    2.6.28-11-generic #42-Ubuntu
Apr 25 17:10:08 10.1.1.11 [85627.288228] RIP: 0010:[<ffffffffa0099e26>] 
Apr 25 17:10:08 10.1.1.11 [<ffffffffa0099e26>] _nv020907rm+0x14/0x42 [nvidia]
Apr 25 17:10:08 10.1.1.11 [85627.288379] RSP: 0018:ffff88011e9d19b0  EFLAGS: 00010202
Apr 25 17:10:08 10.1.1.11 [85627.288381] RAX: 64943378215a9b68 RBX: ffff8801074d2b90 RCX: 0000000000000000
Apr 25 17:10:08 10.1.1.11 [85627.288383] RDX: ffff8801074d2b90 RSI: ffff88010b582f70 RDI: e001208000000000
Apr 25 17:10:08 10.1.1.11 [85627.288385] RBP: ffff88010b582f68 R08: 0000000000000001 R09: ffffffffa0a2c330
Apr 25 17:10:08 10.1.1.11 [85627.288387] R10: 0000000000000001 R11: 0000000000011bf2 R12: ffff88010b582f70
Apr 25 17:10:08 10.1.1.11 [85627.288389] R13: 00000000e0012080 R14: ffff88010b582f9c R15: 0000000000000000
Apr 25 17:10:08 10.1.1.11 [85627.288392] FS:  0000000000000000(0000) GS:ffffffff80aa3000(0000) knlGS:0000000000000000
Apr 25 17:10:08 10.1.1.11 [85627.288394] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Apr 25 17:10:08 10.1.1.11 [85627.288396] CR2: 00007f4fa66e0620 CR3: 0000000000201000 CR4: 00000000000006a0
Apr 25 17:10:08 10.1.1.11 [85627.288399] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Apr 25 17:10:08 10.1.1.11 [85627.288402] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Apr 25 17:10:08 10.1.1.11 [85627.288404] Process Xorg (pid: 3678, threadinfo ffff88011e9d0000, task ffff88010b424320)
Apr 25 17:10:08 10.1.1.11 [85627.288406] Stack:
Apr 25 17:10:08 10.1.1.11 [85627.288409]  ffffffffa00d3fc1
Apr 25 17:10:08 10.1.1.11 00000000c1d00001
Apr 25 17:10:08 10.1.1.11 00000000e0012080
Apr 25 17:10:08 10.1.1.11 ffff88010b582f90
Apr 25 17:10:08 10.1.1.11 
Apr 25 17:10:08 10.1.1.11 [85627.288415] 
Apr 25 17:10:08 10.1.1.11 ffffffffa00d3d69
Apr 25 17:10:08 10.1.1.11 ffff88010b582fa0
Apr 25 17:10:08 10.1.1.11 00000000c1d00001
Apr 25 17:10:08 10.1.1.11 0000000000000000
Apr 25 17:10:08 10.1.1.11 
Apr 25 17:10:08 10.1.1.11 [85627.288421] 
Apr 25 17:10:08 10.1.1.11 000000000100cb01
Apr 25 17:10:08 10.1.1.11 0000000000000000
Apr 25 17:10:08 10.1.1.11 ffffffffa03fe7b3
Apr 25 17:10:08 10.1.1.11 ffff880112cb5660
Apr 25 17:10:08 10.1.1.11 
Apr 25 17:10:08 10.1.1.11 [85627.288430] Call Trace:
Apr 25 17:10:08 10.1.1.11 [85627.288433] 
Apr 25 17:10:08 10.1.1.11 [<ffffffffa00d3fc1>] ? _nv019426rm+0x5c/0x91 [nvidia]
Apr 25 17:10:08 10.1.1.11 [85627.288526] 
Apr 25 17:10:08 10.1.1.11 [<ffffffffa00d3d69>] ? _nv019519rm+0x39/0x78 [nvidia]
Apr 25 17:10:08 10.1.1.11 [85627.288617] 
Apr 25 17:10:08 10.1.1.11 [<ffffffffa03fe7b3>] ? _nv003729rm+0x61/0x220 [nvidia]
Apr 25 17:10:08 10.1.1.11 [85627.288747] 
Apr 25 17:10:08 10.1.1.11 [<ffffffffa03fc819>] ? _nv003740rm+0x96/0x20b [nvidia]
Apr 25 17:10:08 10.1.1.11 [85627.288872] 
Apr 25 17:10:08 10.1.1.11 [<ffffffffa03fc4ab>] ? _nv003744rm+0x3e/0x316 [nvidia]
Apr 25 17:10:08 10.1.1.11 [85627.288996] 
Apr 25 17:10:08 10.1.1.11 [<ffffffffa0478407>] ? _nv003714rm+0xe0/0x121 [nvidia]
Apr 25 17:10:08 10.1.1.11 [85627.289124] 
Apr 25 17:10:08 10.1.1.11 [<ffffffffa047a1c7>] ? rm_free_unused_clients+0x69/0xb7 [nvidia]
Apr 25 17:10:08 10.1.1.11 [85627.289249] 
Apr 25 17:10:08 10.1.1.11 [<ffffffffa054cd7f>] ? nv_kern_ctl_close+0x6f/0x100 [nvidia]
Apr 25 17:10:08 10.1.1.11 [85627.289352] 
Apr 25 17:10:08 10.1.1.11 [<ffffffffa054f1fb>] ? nv_kern_close+0x2eb/0x3c0 [nvidia]
Apr 25 17:10:08 10.1.1.11 [85627.289452] 
Apr 25 17:10:08 10.1.1.11 [<ffffffff802e8f6f>] ? __fput+0xcf/0x1f0
Apr 25 17:10:08 10.1.1.11 [85627.289458] 
Apr 25 17:10:08 10.1.1.11 [<ffffffff802e90ad>] ? fput+0x1d/0x30
Apr 25 17:10:08 10.1.1.11 [85627.289462] 
Apr 25 17:10:08 10.1.1.11 [<ffffffff802e553b>] ? filp_close+0x5b/0x90
Apr 25 17:10:08 10.1.1.11 [85627.289466] 
Apr 25 17:10:08 10.1.1.11 [<ffffffff802530ad>] ? put_files_struct+0x7d/0xe0
Apr 25 17:10:08 10.1.1.11 [85627.289471] 
Apr 25 17:10:08 10.1.1.11 [<ffffffff8025315f>] ? exit_files+0x4f/0x60
Apr 25 17:10:08 10.1.1.11 [85627.289475] 
Apr 25 17:10:08 10.1.1.11 [<ffffffff80254f51>] ? do_exit+0x1b1/0x3b0
Apr 25 17:10:08 10.1.1.11 [85627.289480] 
Apr 25 17:10:08 10.1.1.11 [<ffffffffa054d1cd>] ? nv_kern_ioctl+0x15d/0x490 [nvidia]
Apr 25 17:10:08 10.1.1.11 [85627.289581] 
Apr 25 17:10:08 10.1.1.11 [<ffffffff8069f81e>] ? oops_end+0xbe/0xc0
Apr 25 17:10:08 10.1.1.11 [85627.289586] 
Apr 25 17:10:08 10.1.1.11 [<ffffffff80215cbe>] ? die+0x5e/0x90
Apr 25 17:10:08 10.1.1.11 [85627.289591] 
Apr 25 17:10:08 10.1.1.11 [<ffffffff8069f4c8>] ? do_general_protection+0x158/0x160
Apr 25 17:10:08 10.1.1.11 [85627.289594] 
Apr 25 17:10:08 10.1.1.11 [<ffffffff8069e96a>] ? error_exit+0x0/0x70
Apr 25 17:10:08 10.1.1.11 [85627.289598] 
Apr 25 17:10:08 10.1.1.11 [<ffffffffa054d1cd>] ? nv_kern_ioctl+0x15d/0x490 [nvidia]
Apr 25 17:10:08 10.1.1.11 [85627.289698] 
Apr 25 17:10:08 10.1.1.11 [<ffffffff802e31c4>] ? __kmalloc+0x74/0x110
Apr 25 17:10:08 10.1.1.11 [85627.289705] 
Apr 25 17:10:08 10.1.1.11 [<ffffffffa054d1cd>] ? nv_kern_ioctl+0x15d/0x490 [nvidia]
Apr 25 17:10:08 10.1.1.11 [85627.289805] 
Apr 25 17:10:08 10.1.1.11 [<ffffffffa054d53c>] ? nv_kern_unlocked_ioctl+0x1c/0x20 [nvidia]
Apr 25 17:10:08 10.1.1.11 [85627.289905] 
Apr 25 17:10:08 10.1.1.11 [<ffffffff802f62d1>] ? vfs_ioctl+0x31/0xa0
Apr 25 17:10:08 10.1.1.11 [85627.289910] 
Apr 25 17:10:08 10.1.1.11 [<ffffffff802f6685>] ? do_vfs_ioctl+0x75/0x230
Apr 25 17:10:08 10.1.1.11 [85627.289913] 
Apr 25 17:10:08 10.1.1.11 [<ffffffff802f68d9>] ? sys_ioctl+0x99/0xa0
Apr 25 17:10:08 10.1.1.11 [85627.289917] 
Apr 25 17:10:08 10.1.1.11 [<ffffffff802e82a0>] ? sys_read+0x50/0x90
Apr 25 17:10:08 10.1.1.11 [85627.289920] 
Apr 25 17:10:08 10.1.1.11 [<ffffffff8021253a>] ? system_call_fastpath+0x16/0x1b
Apr 25 17:10:08 10.1.1.11 [85627.289924] Code: 
     ....
Apr 25 17:10:08 10.1.1.11 
Apr 25 17:10:08 10.1.1.11 [85627.289993] RIP 
Apr 25 17:10:08 10.1.1.11 [<ffffffffa0099e26>] _nv020907rm+0x14/0x42 [nvidia]
Apr 25 17:10:08 10.1.1.11 [85627.290079]  RSP <ffff88011e9d19b0>
Apr 25 17:10:08 10.1.1.11 [85627.290173] ---[ end trace 3dc5288f733b0548 ]---
Apr 25 17:10:08 10.1.1.11 [85627.290175] Fixing recursive fault but reboot is needed!
Apr 25 17:10:08 10.1.1.11 [85627.300924] general protection fault: 0000 [#6] 
Apr 25 17:10:08 10.1.1.11 SMP 
Apr 25 17:10:08 10.1.1.11
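
To see at a glance which module each frame in a trace like the one above belongs to, something along these lines may help. It is a rough sketch: the regular expression assumes only the `[<address>] ? symbol+0xoff/0xlen [module]` layout visible in this log, and `frames_by_module` is a hypothetical helper name. The sample lines are copied from the trace above.

```python
import re
from collections import Counter

# Matches frames such as "[<ffffffffa00d3fc1>] ? _nv019426rm+0x5c/0x91 [nvidia]".
# The trailing "[module]" tag is optional; frames without it are core kernel code.
FRAME = re.compile(
    r"\[<[0-9a-f]+>\]\s+(?:\?\s+)?(\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+(?:\s+\[(\w+)\])?")


def frames_by_module(log_text):
    """Count call-trace frames per kernel module in a netconsole/dmesg dump."""
    counts = Counter()
    for match in FRAME.finditer(log_text):
        counts[match.group(2) or "core kernel"] += 1
    return counts


sample = """Apr 25 17:10:08 10.1.1.11 [<ffffffffa00d3fc1>] ? _nv019426rm+0x5c/0x91 [nvidia]
Apr 25 17:10:08 10.1.1.11 [<ffffffff802e8f6f>] ? __fput+0xcf/0x1f0
Apr 25 17:10:08 10.1.1.11 [<ffffffff802e90ad>] ? fput+0x1d/0x30"""

for module, count in frames_by_module(sample).most_common():
    print(module, count)
```

A count like this says nothing about cause and effect; it only shows where the faulting task was executing.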

Revision history for this message

Andrius Štikonas (stikonas) wrote on 2009-04-25:

#113

I don't use proprietary software on my system, so the Nvidia driver can't be
the cause of this lockup. Maybe it can trigger it, like "rm -rf" does, but not cause it.

Revision history for this message

yaztromo (tromo) wrote on 2009-04-25:

#114

I can trigger the bug without X running; my card is also an ATI, running the driver provided by Xorg.

Revision history for this message

Saivann Carignan (oxmosys) wrote on 2009-04-25:

#115

I reproduced this bug on 3 computers, running nvidia, intel (proprietary) and ati (OpenSource) cards. I also reproduced this bug in "recovery-mode" without X started, so it is definitely not related to Xorg.

Revision history for this message

Nick B. (futurepilot) wrote on 2009-04-25: Re: [Bug 330824] Re: Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

#116

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I use the Nvidia binary driver however I've been able to reproduce
this from a console without X even running. I logged in from a
console, shut down X, then ran hang.py and it froze after about 3
rounds, that's when I saw the "soft lockup on cpu0" message that just
repeated over and over.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAknzR5MACgkQAGHzB9Tvw4wd/QCgiwUdCjepeSPS/Yv7D7Dt0U1M
UZwAn0FE/VH1jeWxurnaEoLukqqwLn18
=IXdY
-----END PGP SIGNATURE-----

Revision history for this message

Carey Underwood (cwillu) wrote on 2009-04-25: Re: [Bug 330824] Re: Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

#117

Can reliably trigger this on nvidia, nv and vesa.

Theo, did you get my e2image attached above?

Revision history for this message

Derek (bugs-m8y) wrote on 2009-04-28:

#118

Repeatedly triggered trying an rm -rf of a subversion tree.
Intel driver. I guess I'll subscribe and await developments.

A workaround for me was to do the following:
find ~/svn/trunk -type f | while read f;do rm -f "$f";sleep .1;done
find ~/svn/trunk -depth -type d | while read d;do rmdir "$d";sleep .1;done

There were no lockups after that.
Unfortunately, this is not a terribly efficient way to erase a large tree: at one file every 100 milliseconds, it took a quarter of an hour for the 10,000 or so files in question.
Perhaps I could have skipped the sleep or shrunk it to 10 milliseconds.
I will try that the next time I need to do this.

Until a patch is released, I suppose I can make a wrapper script for safely doing rm ;)
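
A wrapper along those lines might look like the sketch below. It mirrors the two `find` loops above (files first, then directories bottom-up, with a pause after each removal); `throttled_rmtree` is a made-up name, and the default 0.1 second delay is simply the value used in the workaround, not a tested threshold.

```python
import os
import tempfile
import time


def throttled_rmtree(root, delay=0.1):
    """Delete a tree one entry at a time, sleeping between removals.

    Returns the number of files removed.
    """
    removed = 0
    # topdown=False yields children before their parents, like `find -depth`.
    for dirpath, dirnames, filenames in os.walk(root, topdown=False):
        for name in filenames:
            os.unlink(os.path.join(dirpath, name))
            removed += 1
            time.sleep(delay)
        for name in dirnames:
            path = os.path.join(dirpath, name)
            if os.path.islink(path):
                os.unlink(path)  # symlink to a directory: remove the link only
            else:
                os.rmdir(path)   # already emptied on an earlier iteration
            time.sleep(delay)
    os.rmdir(root)
    return removed


# Small self-contained demonstration on a scratch tree.
demo_root = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_root, "trunk", "src"))
for index in range(3):
    open(os.path.join(demo_root, "trunk", "src", "file%d" % index), "w").close()
print(throttled_rmtree(demo_root, delay=0.01))
```

Whether pacing the deletes actually avoids the lockup on an affected kernel is only what this one report suggests; the sketch just makes the pacing repeatable.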

Revision history for this message

moli (f-launchpad-moli-hu) wrote on 2009-04-28:

#119

This bug occurs with final Jaunty 2.6.28-11-generic on an AMD Athlon. The system was installed as Hardy, updated by hand to Intrepid, and then updated to Jaunty, with neither GNOME nor an X server. The ext3 filesystem was converted to ext4 by hand; it is a 250 GB PATA HDD with one partition and no swap.

I can reproduce this bug any time with any file. In fact, I cannot delete even one file without the machine freezing so deeply that even the soft power button doesn't work. Now I have a worthless, noisy, heavy black box in the corner.

It would be nice to fix this with really high priority. Thank you.

Revision history for this message

moli (f-launchpad-moli-hu) wrote on 2009-04-28:

#120

To be accurate, replace 'file' with 'directory', sorry. I would like to delete more than 130000 files because I copied all my data within the HDD to use extents.

Revision history for this message

Gytis Raciukaitis (noxxious) wrote on 2009-04-29:

#121

I can always get a lock with Back-In-Time (http://backintime.le-web.org/), which uses rsync.

If I try to back up my home dir (ext4 formatted) to another drive (ext4 formatted), it locks up every time during the initial snapshot generation.

I'm using jaunty with 2.6.28-11-generic.

Revision history for this message

jpfle (jpfle) wrote on 2009-04-29:

#122

I have had this bug too since I updated from Ubuntu 8.10 to Ubuntu 9.04. Here is more info:

- I have relatively small filesystems (30 GB and 80 GB), converted by hand to ext4
- My kernel version is Linux 2.6.28-11-generic #42-Ubuntu SMP Fri Apr 17 01:57:59 UTC 2009 i686 GNU/Linux
- I have an Nvidia card but I don't use the proprietary Nvidia driver
- I've encountered 3 freezes:
  - two times when cleaning my Trash (about 9000 files each time)
  - one time when moving a .tar.gz archive in Trash

When freezing, my system no longer works, so I have to use the reset button.

I'll try to obtain some log files.

Revision history for this message

Derek (bugs-m8y) wrote on 2009-04-29:

#123

Hey guys.
Just wondering.
This recent ext4 fix looks hopeful (typo correction).
http://patchwork.kernel.org/patch/19495/

Found while browsing for potential fixes in ext4.

Does anyone know if it is in a Jaunty backport?

Revision history for this message

Tim Gardner (timg-tpi) wrote on 2009-04-29:

#124

@Derek - that patch is already in Jaunty.

static void ext4_mb_add_n_trim(struct ext4_allocation_context *ac)
...
        list_for_each_entry_rcu(tmp_pa, &lg->lg_prealloc_list[order],
                                                pa_inode_list) {
                spin_lock(&tmp_pa->pa_lock);
                if (tmp_pa->pa_deleted) {
....

Revision history for this message

Phil-S (phil-ingineerix) wrote on 2009-04-30:

#125

Just found this report after a number of forced reboots; chiming in with another verification.

Running Stock Jaunty: 2.6.28-11-generic (buildd@palmer) (gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4) ) #42-Ubuntu SMP Fri Apr 17 01:57:59 UTC 2009

Have a 3TB 4-drive software RAID-5 array with EXT4. Hangs without fail any time a large directory full of files is rm'd. This is my first time using EXT4. I only had confidence in it because of the mainstream Ubuntu release, I would not have tried it otherwise!

This is one of the worst travesties I have experienced using a mainstream Distro, I hope a cause (fix) is found soon!

Revision history for this message

Theodore Ts'o (tytso) wrote on 2009-04-30:

#126

Phil,

The only suggestion I can make at this point is to use a mainline kernel. This seems to be some kind of interaction between Ubuntu "Sauce" patches and the ext4 code. People have reported success with stock 2.6.28, stock 2.6.29 and a bleeding-edge 2.6.30-rcX kernels. There have been attempts to bisect the Ubuntu Sauce patches, but not much luck. The Ubuntu kernel developers that I talked to before Jaunty shipped haven't been able to reproduce the bug, and I haven't as well, which is one of the reasons why fixing it has been slow and painful.

And I'm a volunteer, and I don't have a lot of free time, so while I've finally updated a netbook to Jaunty, I haven't had time to try to track this down, as I've got lots of other items on my todo list. (And I use my own personally built 2.6.30-rcX bleeding edge kernel, and have been using ext4 in production since July of last year, and I've never seen the problem. But then again, I don't use any of the Ubuntu proprietary drivers, I don't use Apparmor, etc., so even if I were using the Ubuntu kernel there's no guarantee I would see the problem.)

Someone has graciously offered to ship me a pre-set up computer that has the problem, but my main problem right now is ENOTIME. Hopefully I'll have more time by mid-May, and hopefully someone else will be able to track this down in the meantime.

P.S. Ubuntu does ship its own pre-built stock kernels w/o any "sauce" patches. So even if you aren't up to building your own kernel, you might want to try using one of those mainline kernels.

Revision history for this message

Derek (bugs-m8y) wrote on 2009-04-30:

#127

A bit more info on my circumstance.
1) I triggered the lockup at one point by running Evolution - no idea what it was doing in the background

2) I was trying to figure out why subversion was giving me an assertion on a particular samba mounted file:// repo.
(this has been going on for a while, intermittently).
So, I was doing a lot of rm -rf of a local trunk/ checkout.
3 or 4 times, everything was fine.
Then, suddenly, a lockup. This time on the *checkout* not the rm -rf

So clearly erasing is not the only thing that triggers it for me :-/

Revision history for this message

Christian Kirbach (christian-kirbach-e) wrote on 2009-04-30:

#128

I think I am bitten by this, too, on AMD64, with a root fs converted from ext3 to ext4 using

tune2fs -O extents,uninit_bg,dir_index /dev/DEV

linux-image-2. 2.6.28-11.42

I have not yet found a way to trigger it; the one time I ran "rm -rf" on ten thousand files, it did not.

On an irregular basis I see any app doing I/O freeze for about 3 minutes.
Causing apps to do I/O (opening files etc) during that period freezes them, too.
I can continue working with anything that does not require/trigger disk I/O.
So this in fact seems to be a kernel lock on any disk I/O.

Periods between I/O freezes vary, it may be hours and sometimes only minutes.
Heavy disk I/O is not required.
There is nothing suspicious in any of the log files in /var/log/.

Revision history for this message

Derek (bugs-m8y) wrote on 2009-05-01:

#129

I've switched to:
http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.30-rc4/ (and previously rc3)

So far, so good, on both machines using ext4, despite still plenty of rm -rf.

here's hoping...

Revision history for this message

Geoff123 (gsking1) wrote on 2009-05-02:

#130

system info attached (3.2 KiB, text/plain)

I also just triggered a crash using the hang.py script from this forum. I just switched to ext4 and had to try.
I have two observations someone might find useful.

1) Running as normal user. Ran hang.py and it locked up on the 5th round. I had to push the reset button. Note that the script did not run correctly: I got the error "sh: cannot create /proc/sys/vm/drop_caches: Permission denied" on each loop.

2) Running as root user. Ran hang.py and it did NOT lock up. Tried it twice and it completed the full 10 rounds both times. The drop_caches write worked as intended in the script.

System is:
Linux version 2.6.28-11-generic (buildd@palmer) (gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4) ) #42-Ubuntu SMP Fri Apr 17 01:57:59 UTC 2009 (Ubuntu 2.6.28-11.42-generic)
2GB RAM
150GB 7400 rpm hard drive 75% full
nvidia driver
Converted from ext3 yesterday and ran fine all day doing basic office stuff today.

Let me know if you need any more info. Geoff

Revision history for this message

Saivann Carignan (oxmosys) wrote on 2009-05-02:

#131

Following Theodore Ts'o's last comment, I am bisecting the ubuntu-jaunty git branch and the results so far are very positive. If everything goes well, I will shortly be able to reveal exactly which commit caused the bug. 65 commits left to test. 4 good commits identified, 2 bad commits identified. The problem lies somewhere between 2.6.28.2 and 2.6.28.3.

Revision history for this message

Saivann Carignan (oxmosys) wrote on 2009-05-03:

#132

bisect.log (3.8 KiB, text/plain)

My git bisect is finished and revealed that commit dbf8b1c4e8122e705447b69aea9ee6ef3a9caa30 is the faulty one, confirming Carey Underwood's previous results. Bisect log and commit diff attached.

Trying to revert that whole patch in current jaunty kernel makes it FTBFS so I could not test reverting the patch to re-confirm that it fixes the bug.

However, results are pretty definitive as this bug is VERY easy to reproduce on my laptop (2 seconds), and each git commit that I tested had to remove the same huge quantity of files without crashing to get a "git bisect good".

Revision history for this message

Saivann Carignan (oxmosys) wrote on 2009-05-03:

#133

dbf8b1c4e8122e705447b69aea9ee6ef3a9caa30 (16.2 KiB, text/plain)

Revision history for this message

Aurius Bendikas Chang (aurius-bendikas) wrote on 2009-05-03:

#134

Also installed http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.30-rc4/. No freezes so far. At last I can work with my workstation.

@Derek - THANK YOU for the post!

Revision history for this message

pritam ghanghas (pritam-ghanghas) wrote on 2009-05-03:

#135

Just increasing the counter of people affected. My system has frozen 4 times since I upgraded my file system to ext4. Luckily my "/" fs is not on ext4.

Revision history for this message

_dan_ (dan-void) wrote on 2009-05-03:

#136

This issue is *not* fixed in current jaunty.
Deleting files no longer freezes the machine immediately, or as often as before.
But it still happens about once a day: approximately 1-3 seconds after the deletion is done, the system just freezes.

Revision history for this message

Matt Zimmerman (mdz) wrote on 2009-05-03: Re: [Bug 330824] Re: Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

#137

On Sun, May 03, 2009 at 01:32:13PM -0000, _dan_ wrote:
> This issue is *not* fixed in current jaunty.

No, it is not fixed in Jaunty, as the status of the bug indicates. "In
Progress" means that someone is working on the bug, but it is not yet fixed.

Fortunately, it looks like we now have a very good idea where the problem
begins, and so hopefully can isolate the root cause now.

--
- mdz

Revision history for this message

Pauli Virtanen (pauli-virtanen) wrote on 2009-05-03:

#138

@Tim Gardner on 2009-04-30:
> @Derek - that patch is already in Jaunty.
>
> static void ext4_mb_add_n_trim(struct ext4_allocation_context *ac)
> ...
> list_for_each_entry_rcu(tmp_pa, &lg->lg_prealloc_list[order],
> pa_inode_list) {
> spin_lock(&tmp_pa->pa_lock);
> if (tmp_pa->pa_deleted) {
> ....

I don't think it is: currently in git://kernel.ubuntu.com/ubuntu/ubuntu-jaunty.git (or is the current Jaunty kernel tree somewhere else?), fs/ext4/mballoc.c reads

http://kernel.ubuntu.com/git?p=ubuntu/ubuntu-jaunty.git;a=blob;f=fs/ext4/mballoc.c;h=add854a140a1c3b0207daaaab0e92d3cca5bb882;hb=HEAD#l4418
                ...
                spin_lock(&tmp_pa->pa_lock);
                if (tmp_pa->pa_deleted) {
                        spin_unlock(&pa->pa_lock);
                        continue;
                }
                ...

whereas in mainline the spin_unlock() call also uses tmp_pa-> rather than pa->, so it releases the same lock that was just taken. The same pa-> version is in the linux-image-2.6.28-11-generic 2.6.28-11.42 tarball.

Has anyone tried to reproduce the bug with this change applied?
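
For readers who want to see why the mismatched unlock matters, here is a toy model in Python. It is not kernel code: ToyLock, ToyPA, scan and survives_two_scans are made-up names, and the "spinlock" simply raises where a real CPU would spin forever. It only illustrates the lock/unlock mismatch described above, and says nothing about whether that typo is what causes the hangs in this report.

```python
class WouldSpinForever(Exception):
    """Raised where a real spinlock would spin forever (a soft lockup)."""


class ToyLock:
    """Stand-in for a spinlock: tracks only whether it is held."""

    def __init__(self):
        self.held = False

    def lock(self):
        if self.held:
            raise WouldSpinForever("lock already held; a real CPU would spin here")
        self.held = True

    def unlock(self):
        self.held = False


class ToyPA:
    """Stand-in for a preallocation entry with its own lock."""

    def __init__(self, deleted=False):
        self.pa_lock = ToyLock()
        self.pa_deleted = deleted


def scan(pa, prealloc_list, buggy):
    """Walk the list the way the quoted loop does, locking each entry."""
    for tmp_pa in prealloc_list:
        tmp_pa.pa_lock.lock()
        if tmp_pa.pa_deleted:
            if buggy:
                pa.pa_lock.unlock()        # typo: releases the wrong lock
            else:
                tmp_pa.pa_lock.unlock()    # releases the lock taken above
            continue
        tmp_pa.pa_lock.unlock()


def survives_two_scans(buggy):
    """Return True if two passes over the same list complete."""
    pa = ToyPA()
    others = [ToyPA(deleted=True), ToyPA()]
    try:
        scan(pa, others, buggy)
        scan(pa, others, buggy)   # second pass meets the leaked lock
    except WouldSpinForever:
        return False
    return True
```

In this model the buggy variant leaves the deleted entry's lock held after the first pass, so the next pass that reaches the same entry can never take it.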

Carey Underwood (cwillu) on 2009-05-03

description:

updated

Revision history for this message

Tim Gardner (timg-tpi) wrote on 2009-05-03:

#139

@Ted - Pauli's observation is correct. I was comparing Jaunty against 2.6.28.9 fs/ext4/mballoc.c. They are indeed the same wrt the code in ext4_mb_add_n_trim(). However, Linus' tree has commit e7c9e3e99adf6c49c5d593a51375916acc039d1e which was Cc <email address hidden>, and corrects the spin_unlock() typo. This commit did not appear in 2.6.28.y (and now likely won't with stable support for 2.6.28 ending).

Revision history for this message

Tim Gardner (timg-tpi) wrote on 2009-05-03:

#140

Test kernels in http://kernel.ubuntu.com/~rtg/2.6.28-13.44-lp330824 contain stable updates through 2.6.28.10 as well as the aforementioned missing commit e7c9e3e99adf6c49c5d593a51375916acc039d1e from Linus' tree.

Revision history for this message

Theodore Ts'o (tytso) wrote on 2009-05-03:

#141

@Tim:

I would be delighted if commit e7c9e3e9 fixes this bug. However, this fix was added after 2.6.29 was released (it was pushed to Linus during the 2.6.29..2.6.30-rc1 merge window), and Ubuntu users have reported that using the 2.6.29 stock kernel was enough to avoid the rm -rf hang problem --- and 2.6.29 would not have had this commit. That being said, (a) I would encourage Ubuntu to include it in a future sauce patch, and (b) I would encourage users to try out the test kernels that you pointed out.

As far as stable patches to 2.6.28, I do have a stack of patches that apply on top of 2.6.28.10. It contains all of the patches that I was planning on pushing to <email address hidden>, but I ran out of time testing them before they released the final 2.6.28.x release. (This is where the problem of no one paying me to work on ext4 rears its ugly head; I only get to do this stuff late at night after I'm done with my day job.) In any case, this set was originally based on 2.6.28.8, and I just rebased them on top of 2.6.28.10. I'm doing some quick QA testing of that branch now, but I expect it to be sane.
You can find it on my ext4-tree git tree on kernel.org, with the branch name for-stable-2.6.28.

Revision history for this message

Saivann Carignan (oxmosys) wrote on 2009-05-03:

#142

The bug can be reproduced with Tim Gardner's test kernels http://kernel.ubuntu.com/~rtg/2.6.28-13.44-lp330824

Revision history for this message

Carey Underwood (cwillu) wrote on 2009-05-03: Re: [Bug 330824] Re: Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

#143

@Tim, Theo:

2.6.28-13.44 from http://kernel.ubuntu.com/~rtg/2.6.28-13.44-lp330824
hung partway through boot with "BUG: soft lockup - CPU#0 stuck for
61s! [rmdir:####]"

Revision history for this message

Theodore Ts'o (tytso) wrote on 2009-05-03:

#144

Saïvann,

Thank you very much for doing the bisect and confirming Carey's results. The problem with this is that the commit that you identified as causing the problem, dbf8b1c4, is identical to commit 82eb4869, which appeared in 2.6.28.7, and commit 920313a7, which appeared in 2.6.29-rc1, and so was in 2.6.29. And users of stock 2.6.28.8 and 2.6.29 kernels have reported that the issue reported in this bug has gone away.

So as I had said earlier, this looks like it's a combination of multiple patches that are necessary to cause this problem, and it's not just commit dbf8b1c4. OK, so what I've done is the following. I ran the command "git log --reverse --pretty=oneline --abbrev-commit --abbrev=8 v2.6.28.. fs/ext4 fs/jbd2" on the Jaunty kernel git tree, and extracted the list of commits that touched the fs/ext4 and fs/jbd2 directories. There are only 10 of these patches between 2.6.28 and commit dbf8b1c4 (including dbf8b1c4). I then did a "for i in `cat /tmp/clist`; do git cherry-pick $i; done" to create a kernel tree which *only* has the Ubuntu sauce patches that touched the fs/ext4 and fs/jbd2 trees.

I've exported this as branch "ubuntu-330824-test" on my ext4.git tree. Unfortunately, it looks like master.kernel.org is being slow to mirror out changes to git.kernel.org, so if you're not seeing it at:

http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git

I've also pushed the branch ubuntu-330824-test to:

http://github.com/tytso/ext4/tree/ubuntu-330824-test
git://github.com/tytso/ext4.git

Saïvann, can you give this tree a try? If it works, then the problem must be an interaction with one of the *other* Sauce patches between v2.6.28 and dbf8b1c4, and that may be a very large haystack. If you get a failure (i.e., you can reproduce the bug), then the problem must be in one of the ten commits on ubuntu-330824-test. Obviously, a failure is preferable since it will be easier to find the problem in a haystack with 10 commits than in a haystack with 700 commits. :-)

Either way, though, it should give us more information.

Thanks for the work you've done so far,
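
The two commands quoted above (a path-limited "git log" followed by a cherry-pick loop) can be sketched in Python as a pair of small helpers. This is only an illustration of the procedure described in the comment: parse_oneline_log and cherry_pick_commands are hypothetical names, and the hashes in SAMPLE_LOG are placeholders, not commits from the Jaunty tree. The helpers only build the command strings; they do not run git.

```python
import shlex


def parse_oneline_log(text):
    """Extract abbreviated hashes from `git log --pretty=oneline --abbrev-commit` output."""
    commits = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line is "<hash> <subject>"; keep only the hash.
        commits.append(line.split(None, 1)[0])
    return commits


def cherry_pick_commands(commits):
    """Build one `git cherry-pick` command per commit, in the given order."""
    return ["git cherry-pick " + shlex.quote(c) for c in commits]


# Placeholder log text; with --reverse, git lists the oldest commit first,
# which is the order the commits need to be re-applied in.
SAMPLE_LOG = """\
11111111 placeholder subject one
22222222 placeholder subject two

33333333 placeholder subject three
"""

if __name__ == "__main__":
    for command in cherry_pick_commands(parse_oneline_log(SAMPLE_LOG)):
        print(command)
```

Keeping the oldest-first order matters because later patches in the list may depend on earlier ones.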

Revision history for this message

Nick B. (futurepilot) wrote on 2009-05-04: Re: [Bug 330824] Re: Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

#145

Confirming the 2.6.28-13.44 kernel from Tim Gardner still has the problem.

Revision history for this message

Eric Shattow (eshattow) wrote on 2009-05-04:

#146

Confirm that 2.6.28-11.42 had the problem. I was able to reproduce the issue 5 or more times by issuing 'rm -fr dirname' on an 80gb directory hierarchy of files up to 5gb each. The system soft locked just like the reported bug symptoms. Booting 2.6.30-020630rc4 resolves the issue for me and 'rm -fr dirname' completed successfully.

Revision history for this message

Saivann Carignan (oxmosys) wrote on 2009-05-04:

#147

Theodore :

I built and tested your branch (git://github.com/tytso/ext4.git) and the bad news is: no crash. I was able to delete my whole filesystem and the problem was not reproducible, whereas it normally takes 2 seconds for the bug to appear.

However the good news is that I'm ready to take a lot more time on this. If it requires me to git bisect the whole jaunty kernel again and to apply commit dbf8b1c4 each time I test one commit, I'm ready to do it. If you can guide me on the best method to continue and find the information you need, I'll be overjoyed to play with this again.

Revision history for this message

Theodore Ts'o (tytso) wrote on 2009-05-05:

#148

Saïvann,

Thanks for your willingness to work on this.

>However the good news is that I'm ready to take a lot more time on this. If it requires me to
>git bisect again the whole jaunty kernel and to apply commit dbf8b1c4 each time I test one
>commit, I'm ready to do it. If you can only guide me about what is the best method to
>continue and find the information you need, I'll be overjoy to play again with this.

Actually, I have a better, more efficient way of doing this. I've updated the "ubuntu-330824-test" branch on the ext4.git and my ext4 github repositories to include the rest of the Ubuntu Sauce patches up to dbf8b1c4. That is, I started with the ubuntu-330824-test branch which you tested, and then I created a script which cherry-picked all of the non-ext4 commits from the Ubuntu kernel branch onto the ubuntu-330824-test branch. So right now, the contents of ubuntu-330824-test and the Ubuntu sauce patch up to dbf8b1c4 are identical; that is, if you were to do a "git diff ubuntu-330824-test dbf8b1c4" you would get no output, indicating there is no difference between these two branches. The difference is that in ubuntu-330824-test the ten ext4 patches were moved to the beginning of the patch series. So if you test the tip of ubuntu-330824-test, you should see a failure, since that tree is identical to the tree pointed to by the commit dbf8b1c4. However, if you bisect this git branch, you will hopefully find the other commit(s) that are required to trigger the bug.

Does that make sense?
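
To illustrate why reordering the series changes what a bisect reports, here is a small Python model. It is a sketch under stated assumptions: the patch names are invented, the "bug" is assumed to need two specific patches together, and first_bad is a plain binary search over prefixes of the series rather than git bisect itself.

```python
def reorder(series, is_ext4):
    """Move the ext4 patches to the front, keeping relative order otherwise."""
    return [p for p in series if is_ext4(p)] + [p for p in series if not is_ext4(p)]


def first_bad(series, is_bad):
    """Binary-search for the patch whose addition first makes the tree bad.

    Assumes the empty tree is good, the full series is bad, and that adding
    patches never turns a bad tree good again.
    """
    lo, hi = 0, len(series)          # series[:lo] is good, series[:hi] is bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(series[:mid]):
            hi = mid
        else:
            lo = mid
    return series[hi - 1]


# Invented series: the bug is assumed to need both "ext4-b" and "other-2".
ORIGINAL = ["other-1", "ext4-a", "other-2", "ext4-b", "other-3"]


def bug_present(applied):
    return {"ext4-b", "other-2"} <= set(applied)


def is_ext4_patch(name):
    return name.startswith("ext4-")


if __name__ == "__main__":
    print(first_bad(ORIGINAL, bug_present))                          # ext4-b
    print(first_bad(reorder(ORIGINAL, is_ext4_patch), bug_present))  # other-2
```

In the original order the search stops at the ext4 patch, because the other ingredient was already applied. With the ext4 patches moved to the front, the search stops at the non-ext4 patch that completes the combination, which is the information the reordered branch is meant to surface.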

Revision history for this message

Saivann Carignan (oxmosys) wrote on 2009-05-05:

#149

Theodore Ts'o : That's excellent, I'm going to start bisecting today. Thanks!

Revision history for this message

unggnu (unggnu) wrote on 2009-05-07:

#150

I can confirm this.

The easiest way to reproduce it is to run bonnie++ without options on an ext4 device. It freezes the system every time for me on a laptop and in a virtual machine.

Revision history for this message

Saivann Carignan (oxmosys) wrote on 2009-05-08:

#151

Theodore Ts'o : As I'm not an expert with git, my previous results were wrong because I tested the "master" branch of git://github.com/tytso/ext4.git instead of the "ubuntu-330824-test" branch (shame on me). So I cloned your current git repository, checked out the branch "ubuntu-330824-test", then checked out commit "18fde579ee7e5895a802e9e04a38c26f4c0ed351" (which is the same as dbf8b1c4e8122e705447b69aea9ee6ef3a9caa30 and which comes after the 9 ubuntu SAUCE patches you added) and then I built and tested the kernel AND I was able to reproduce the bug.

So unless I'm wrong, that proves exactly the opposite of what I said in my last comment https://bugs.launchpad.net/ubuntu/+source/linux/+bug/330824/comments/147

And that would also mean that one of the 10 ubuntu SAUCE patches is the other commit that is necessary to cause this bug.

I would probably need assistance identifying which of the ten commits is the right one, since as long as I don't revert commit 18fde579ee7e5895a802e9e04a38c26f4c0ed351, it is not possible to revert the previous commits without errors. I will probably try to revert some of these manually but if you have a better trick, I'll take it!

Thanks.

Revision history for this message

Saivann Carignan (oxmosys) wrote on 2009-05-08:

#152

Theodore Ts'o : I found that there are not 10 but 9 ubuntu SAUCE commits in your branch. Commit "18fde579ee7e5895a802e9e04a38c26f4c0ed351" is the one I identified as the bad one earlier. For each of the 8 others, I built the kernel with that specific SAUCE patch reverted and tested whether the bug was reproducible. I did not build a kernel with several commits reverted at once; each kernel had only one commit reverted. I was able to test 7 reverts, and reverting any one of the following commits did not fix the bug:

a096e007588607e97dff55e5b0c480a0a828af1d
c144d01a5b47a85e6780f4b05091b3f462037352
bfc1d1043508329247fcde65568045a1fc50ed1d
d86e6874827de254a1af384e82e4466b958a3329
6f8186b11ad47fdd53f11ae35de5152131d54dfc
a9263740b5aea896e3c550ef447094be858aeb8f
3d2e31f32bbda8f156c1a748a2482961e240d836

The only commit left to test is commit f206e5cac5c7e0e3d7985ac82d3cb059b243b32f

This commit comes right before commit 18fde579ee7e5895a802e9e04a38c26f4c0ed351 and is a big commit that touches the same code. Unfortunately, I can't revert this commit without reverting commit 18fde579ee7e5895a802e9e04a38c26f4c0ed351, and all attempts to fix the FAILED hunks led to FTBFS with the following errors:

fs/ext4/balloc.c: In function ‘ext4_free_blocks_sb’:
fs/ext4/balloc.c:400: error: ‘grp’ undeclared (first use in this function)
fs/ext4/balloc.c:400: error: (Each undeclared identifier is reported only once
fs/ext4/balloc.c:400: error: for each function it appears in.)
fs/ext4/balloc.c:540: error: ‘EXT4_GROUP_INFO_NEED_INIT_BIT’ undeclared (first use in this function)
fs/ext4/balloc.c:541: error: ‘blocks_freed’ undeclared (first use in this function)
make[2]: *** [fs/ext4/balloc.o] Error 1
make[1]: *** [fs/ext4] Error 2
make: *** [fs] Error 2

I must admit that I don't really know what to do next. If you have any idea, I'm ready to continue testing.

Revision history for this message

PsychedEric (eric-chaudy) wrote on 2009-05-13:

#153

Maybe I am wrong : so take it just as a suggestion,

Maybe the bug is in gvfs

or Maybe it is a conflict between gvfs and ext4

Revision history for this message

getaceres (getaceres) wrote on 2009-05-14:

#154

It's not related to gvfs because I'm having it in KDE too.

Revision history for this message

Theodore Ts'o (tytso) wrote on 2009-05-26:

#155

Sorry for not checking in sooner, I've been swamped with upstream issues with ext4, and this is pretty clearly an Ubuntu specific bug. Maybe later this weekend I'll have time to try to figure out which of the Ubuntu-backported patches is responsible for the problem.

However, I wanted to report that I have tried to reproduce the bug using 2.6.28-11.15 with a Ubuntu Jaunty userspace, using a filesystem that was originally an image dump of an ext3 filesystem, converted to ext4 using "tune2fs -O extents,dir_index,uninit_bg /dev/mini/testext4; e2fsck -fDp /dev/mini/testext4". I then mounted it using 2.6.28-11.15, and then ran the hang.py python script attached to this bug report. The system that I used was a Lenovo S10 with an N270 Atom processor with 512 megs of memory. These are apparently all of the ideal conditions reported by those who can reproduce the problem ---- and I was not able to reproduce the problem. This makes it even more difficult for me to determine which one of the Ubuntu backports is responsible.

Random question --- people have reported that using the upstream kernels, without the Ubuntu sauce patches, has fixed the problem for them. Has anyone tried using the 2.6.30-7.8 Karmic kernel? Has anyone been able to reproduce the problem using the Karmic Koala Kernel? Or does that kernel also work w/o problems, just as the mainline 2.6.28, 2.6.29, and 2.6.30-rc6 kernels also apparently seem to be free of this bug?

Revision history for this message

Jordan (jordanu) wrote on 2009-05-27:

#156

I cannot reproduce the hang with a karmic kernel ( 2.6.30-6-generic ) on the same machine that can easily reproduce it with a Jaunty kernel ( 2.6.28-11-generic ).

Revision history for this message

Saivann Carignan (oxmosys) wrote on 2009-05-27:

#157

Theodore Ts'o : No, the 2.6.30-6-generic kernel does not have this bug; it is only in jaunty. BTW, did you look at my last comment? I gave you the information you needed concerning which commits cause this bug.

Revision history for this message

unggnu (unggnu) wrote on 2009-05-27:

#158

@Theodore Ts'o
Just run bonnie++. It works for me all the time.

Revision history for this message

Carey Underwood (cwillu) wrote on 2009-05-27:

#159

I can't duplicate this under karmic's 2.6.30-generic (-4, -5 or -6), I don't think karmic suffers from this issue.

Saivann Carignan (oxmosys) on 2009-05-30

Changed in linux (Ubuntu Karmic):
status:	In Progress → Fix Released

Revision history for this message

iamringo (michael-libertin) wrote on 2009-05-30:

#160

So this is probably not terribly helpful, but I'd like to add that like the person in a duplicate bug (#34312), I also experience lockups seemingly at random (I'll sit down at my computer which has been idle for awhile and find everything frozen and the caps lock key flashing). I'm running Jaunty w/ 2.6.28-11-generic kernel and Ext4, and I know of at least one other user who has the same thing happen to her. The reason I'm mentioning this is to point out that while recursively removing lots of files does seem to be a sure-fire way to reproduce the bug (hang.py causes my computer to freeze, as does emptying the trash if it's got a lot of stuff in it), there seem to be other things that set off the behavior as well....

Revision history for this message

Rocko (rockorequin) wrote on 2009-05-30:

#161

@iamringo: I got other frequent random lockups also with the stock kernels (both 2.6.28-11 and -12), possibly related to ext4 (I suspect the mlocate update was causing some). But I've not had a single one in four weeks using the 2.6.30 vanilla kernels and they've been very stable. IMO it's definitely worth installing 2.6.30 (if you don't want to build it from source, you can install it either from the karmic repository or from the vanilla builds at http://kernel.ubuntu.com/~kernel-ppa/mainline/).

Revision history for this message

iamringo (michael-libertin) wrote on 2009-05-30:

#162

@Rocko: just switched, and everything seems fine. I can run hang.py w/o a problem, and I've got a hunch there won't be any random freezes anymore. Given this bug report, I've been meaning to switch kernels, but only just got around to it....Thanks!

Revision history for this message

f3a97 (f3a97) wrote on 2009-05-31:

#163

FWIW,
on different PCs (desktop and laptop ) where I installed Jaunty + ext4 I'm having these hangs,
I can trigger them reliably with the hang.py script...

Revision history for this message

Ioannis (ioannisnousias) wrote on 2009-06-02:

#164

I think it also happens when resizing files. For instance I get lockups when running a VM and the disk image changes in size.

I don't know why this is marked as 'importance = medium', since it renders the system useless.

Revision history for this message

Derek (bugs-m8y) wrote on 2009-06-02:

#165

Perhaps the VM is growing the image by using temp files?

I noticed lockup on checking out a clean subversion tree, which should only have involved file creation.

Oh well, changing kernels is a decent workaround.

Revision history for this message

bcrook (brian-w-crook) wrote on 2009-06-02:

#166

@Ioannis

Probably because ext4 is an optional file system. I believe it is noted in the Release Notes for Jaunty that ext4 has been causing problems...it's not like people are forced to use it.

Revision history for this message

Ioannis (ioannisnousias) wrote on 2009-06-02:

#167

@Derek
it probably does create some temporary files in the process, deleting them shortly after.

@bcrook

good point. That's the price of living on the edge...

I'm a bit stuck here. Karmic has not yet decided what to do with the 'restricted' modules (I have an Nvidia gfx), thus I'm left with the option of building the nvidia module manually and see how that goes.

Revision history for this message

Carey Underwood (cwillu) wrote on 2009-06-02: Re: [Bug 330824] Re: Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

#168

Ioannis, just install a 2.6.30 kernel. I've been running an rc of 2.6.30 on
jaunty for months now with nvidia, works fine. DKMS will take care of the
restricted modules as necessary.

Revision history for this message

Ioannis (ioannisnousias) wrote on 2009-06-02:

#169

patch for nvidia 96.43.10 module for 2.6.30 kernel (3.1 KiB, text/plain)

yes, thanks Carey

In my case, I had to apply a patch to the default nvidia 96.43.10 source. There is a patch for the 180.44 driver that you can find here:

http://www.nvnews.net/vbulletin/showthread.php?t=131597

I've attached a version made for the 96.43.10 module (actually, I didn't try out the above patch directly, it might work)

In more detail:
assuming all required packages are installed (like linux-headers, nvidia-96-kernel-source, etc), apply the patch to the following directory:
/usr/src/nvidia-96.43.10

I've installed the 2.6.30-7-generic kernel, so the build command for this looks like this:
sudo dkms build -m nvidia -v 96.43.10 -k 2.6.30-7-generic

and install it with:
sudo dkms install -m nvidia -v 96.43.10 -k 2.6.30-7-generic

if all went well, 'dkms status' should include an entry:
nvidia, 96.43.10, 2.6.30-7-generic, i686: installed

obviously instead of '96.43.10', you'll have your driver version there (if not the same as mine).

Revision history for this message

Ulrich Hobelmann (u-hobelmann) wrote on 2009-06-07:

#170

This problem definitely IS NOT LIMITED to deletion, contrary to what most comments seem to indicate.

First I only had it sometimes when emptying my trash (with a few roughly 1GB files), but after a reboot the trash was empty, so it seems that the freeze occurred after syncing the disk / completing the operation.

I've also had the freeze yesterday during a svn checkout of chromium (a few GB of data, but rather slowly, so the system had looots of time to sync everything in small chunks, you'd think).

After deleting and re-checking out (which worked the second time), I created a tar-ball (also about 1GB) of the source tree. That froze the system, too (and I wasn't doing much at all in between except surfing the web).

Today I wanted to delete the huge source+binary tree, which resulted in having to reboot five times, until finally an "rm -rf" succeeded without freezing the system. I've found that running a kind of sync-daemon (sleep 5-10 secs, sync) loop in the background might help (I've now created another 1GB tarball and been doing other things, without a freeze).
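
The "sync-daemon" loop mentioned above (sleep a few seconds, sync, repeat) could look roughly like the following Python sketch. It is only a guess at the shape of the workaround, not a fix for the bug: periodic_sync and its parameters are made-up names, and os.sync() is the Unix-only standard-library call that asks the kernel to flush dirty buffers to disk.

```python
import os
import time


def periodic_sync(interval_seconds=5.0, rounds=None, sync=os.sync, sleep=time.sleep):
    """Sleep, then flush dirty buffers, repeatedly.

    rounds=None loops until interrupted; a number stops after that many syncs.
    The sync and sleep callables are injectable so the loop can be exercised
    without touching the disk.
    """
    done = 0
    while rounds is None or done < rounds:
        sleep(interval_seconds)
        sync()
        done += 1
    return done


if __name__ == "__main__":
    # Run in the background while doing the heavy rm/tar; stop with Ctrl-C.
    try:
        periodic_sync(interval_seconds=5.0)
    except KeyboardInterrupt:
        pass
```

The 5 second default follows the "5-10 secs" mentioned in the comment; whether any interval actually avoids the freeze is not established in this thread.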

Not sure if this problem has to do with the system trying to flush lots of data at a time (but then I'm not sure why the svn checkout or the tarballing also froze the system, it wasn't THAT much data per second).

Everything on an up-to-date jaunty system.

Revision history for this message

DanielV (danielveldkamp-deactivatedaccount) wrote on 2009-06-07:

#171

You are correct. I have this problem when my newsreader program MOVES data from the temporary folder to the download folder.

Revision history for this message

Luke Maurer (luke-maurer) wrote on 2009-06-08:

#172

@Ulrich: Are you sure none of those operations involved deleting files? The usual symptom is that the system hangs *after* the rm is successful, so that's not out of character. SVN does some amount of file-based locking, IIUC, which means at some point it has to rm the lock file. Creating a tarball shouldn't remove anything - but surfing the Web may very well, if the cache is full and it has to start clearing out old entries.

And I know that it's not in *all* cases that heavy load is important - as I'm encountering the bug, ANY rm freezes the system every single time.

@DanielV: Are they on the same volume? If not, then mv = cp + rm, no?
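
The distinction in the question above can be checked before moving anything. The sketch below compares device IDs with os.stat(): same_volume and move_strategy are hypothetical helper names, and st_dev equality is used as an approximation of "same volume". A rename within one filesystem does not copy data, while a move across filesystems cannot be a plain rename (the kernel reports EXDEV), so tools fall back to copying and then deleting the source.

```python
import os
import tempfile


def same_volume(src, dst_dir):
    """True if src and dst_dir report the same device ID."""
    return os.stat(src).st_dev == os.stat(dst_dir).st_dev


def move_strategy(src, dst_dir):
    """Describe what a move of src into dst_dir would amount to."""
    return "rename" if same_volume(src, dst_dir) else "copy+delete"


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        path = os.path.join(workdir, "example.txt")
        with open(path, "w") as handle:
            handle.write("data\n")
        print(move_strategy(path, workdir))
```

If the answer is "rename", no rm of file data is involved in the move itself, which would make a freeze during such a move harder to explain by deletion alone.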

Revision history for this message

Scott S (sstehno-cox) wrote on 2009-06-09:

#173

I am using jaunty 64 bit. I formatted a 320 gig hard drive using ext4. It hangs when moving files from the NTFS usb drive that I used for backup. It also tells me the trash can is full. I changed the trash can to 50% or 107 gigs. It should not have been full. It also states the files are too large to delete. I have done some hard reboots to unhang the system when trying to restore my backup files. It hangs on both the delete from the usb or the main hard drive.

I made 4.0 gig boot partition.
I made 60 gig / root partition.
I made 217 gig home partition.

I formatted all from boot cd-disk to be ext4.

In the right corner I have the blue I showing but the files are not being deleted. Seem to be locked up but have internet still running.

2.6.28-11-generic

2.6.28-11-generic
scott@scott-linux:~$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 9.04
Release: 9.04
Codename: jaunty
scott@scott-linux:~$

If someone would like more information or needs me to run some commands to get better information, just let me know.

Revision history for this message

DanielV (danielveldkamp-deactivatedaccount) wrote on 2009-06-09:

#174

@Luke: yes the files that were moved that caused the computer to freeze were on the same volume.

Revision history for this message

Ulrich Hobelmann (u-hobelmann) wrote on 2009-06-09:

#175

@Luke, you're probably right, if even a single rm can freeze the system. I guess some data is always being moved around... When the bug is fixed, I'll see if I still get freezes.

Revision history for this message

iamringo (michael-libertin) wrote on 2009-06-09:

#176

@Ulrich...I know this has been mentioned before, but you could try out one of the karmic kernels and see if you still have problems...if you are, then you can be fairly sure your problems aren't related....

Revision history for this message

vaibhav mishra (vinu76jsr) wrote on 2009-06-11:

#177

I installed the Karmic kernel (using kernelcheck) and rebooted. There have been no system freezes, but as a side effect my wireless stopped working. I am on Jaunty with Broadcom b43 wireless; the LED didn't come on, as if there were no wireless. System > Administration > Hardware Drivers shows that the driver is deactivated, and clicking the Activate button produces a downloading and installing window but changes nothing.

Revision history for this message

vaibhav mishra (vinu76jsr) wrote on 2009-06-11:

#178

Update: Hardware Drivers now says the Broadcom wireless driver is activated, but it is still not working.

Revision history for this message

Åskar (olskar) wrote on 2009-06-14:

#179

@vaibhav mishra; you should open a new report about the issue with your wireless broadcom.

Revision history for this message

Nicholas Roberts (nicholasdavidroberts) wrote on 2009-06-19:

#180

I have resisted posting because 'me too' is not helpful. However, I am operating two very different machines (no two pieces of hardware the same) with the same operating system (Jaunty [ext4] with all the latest updates). Both machines exhibit the same 'freezing' problem on deleting files and the effect is random with seemingly no preference to size or number of files being deleted.

As an engineer I hate to give qualitative (versus quantitative) comment, but I have noticed that the frequency of system freezes has increased roughly threefold (conservatively) in the last few weeks, particularly since the move to 2.6.28-13 from 2.6.28-12.

For the first time in 30 years I am having to modify my software to have it not delete temporary files just to avoid the operating system hanging... madness! Dare I say I cannot remember Windoze ever having a bug this bad; I think this one needs nailing fast gang.

Revision history for this message

Øyvind Stegard (oyvindstegard) wrote on 2009-06-19: Re: [Bug 330824] Re: Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

#181

fr., 19.06.2009 kl. 07.26 +0000, skrev Nicholas Roberts:
> I have resisted posting because 'me too' is not helpful. However, I am
> operating two very different machines (no two pieces of hardware the
> same) with the same operating system (Jaunty [ext4] with all the latest
> updates). Both machines exhibit the same 'freezing' problem on deleting
> files and the effect is random with seemingly no preference to size or
> number of files being deleted.
>
> As an engineer I hate to give qualitative (versus quantitative) comment,
> but I have noticed that the frequency of system freezes has increased
> roughly threefold (conservatively) in the last few weeks, particularly
> since the move to 2.6.28-13 from 2.6.28-12.
>
> For the first time in 30 years I am having to modify my software to have
> it not delete temporary files just to avoid the operating system
> hanging... madness! Dare I say I cannot remember Windoze ever having a
> bug this bad; I think this one needs nailing fast gang.
>

If you need your computers to be stable, then there is no-one stopping
you from using EXT3, which is rock solid. EXT4 is not the Jaunty default
fs and warnings about EXT4 are clearly available in the release-notes:
http://www.ubuntu.com/getubuntu/releasenotes/904#Lock-ups%20when%20deleting%20files%20from%20ext4%20filesystems

My advice: don't jump on EXT4 until Karmic.

Regards,
Øyvind
--
< Øyvind Stegard
< http://www.oyvind.nu/

Revision history for this message

_dan_ (dan-void) wrote on 2009-06-19:

#182

I don't want to be rude, but the "it's your own fault, don't use it" attitude does not help anyone.
I am pretty sure everyone knows they can use ext3; that's not the point of a bug report though.
If Ubuntu ships with ext4 support, it should work, period.
In its current state it's almost unusable.

Revision history for this message

Nicholas Roberts (nicholasdavidroberts) wrote on 2009-06-19:

#183

Regarding Øyvind's wise words...

I completely agree Øyvind and you are right sir. That said, there are an awful lot of people (like me) who have opted for ext4 believing that any major bugs would be sorted in the usual timely fashion that we have come to enjoy from the Ubuntu community. I think in retrospect ext4 should not have been offered as an option because it is just 'too tempting' to either take the option at install time or follow the procedure for upgrading from ext3 to ext4; you only have to look at the plethora of posts in this and other fora to see how many folk (a lot of them newbies to Linux and/or technically weak) were and still are desperate to try out the new FS because that is a normal human trait.

I can live with the bug or go back to ext3 and wait for KK; my post was not really about me. My real point is the damage to reputation and trust that something like this has with the public at large, particularly those 'converts' from Window$ who maybe are not as technically minded as us (and use such fora as this to research what is going on) who will 'bad mouth' Linux/Ubuntu and go back to Windoze for all the wrong reasons.

My personal advice... either the bug gets fixed in a timely fashion or a much stronger warning is issued (including taking the option away to opt for ext4).

Please be assured of my best intentions... Ubuntu is king in my book! Here endeth my lesson, I will not post further on the issue unless I spot a solution ;-)

Revision history for this message

Øyvind Stegard (oyvindstegard) wrote on 2009-06-19:

#184

fr., 19.06.2009 kl. 08.53 +0000, skrev _dan_:
> I don't want to be rude, but the "it's your own fault, don't use it" attitude does not help anyone.
> I am pretty sure everyone knows they can use ext3; that's not the point of a bug report though.
> If Ubuntu ships with ext4 support, it should work, period.
> In its current state it's almost unusable.
>

Yep, and it was not my intention to be rude either. But at the risk of
sounding a bit harsh, I'd say that comments like «this should be fixed
NOW because it ain't working» also don't help. But I certainly do
understand the frustration that many EXT4 Jaunty users must be having
wrt. this, and I think this bug is grave. I initially had all three of
my machines converted to EXT4 when upgrading to Jaunty. But after
observing this bug report for a few weeks, I quickly realised the dangers
and converted about 13 partitions on 5 disks back to EXT3 :).

My reason for replying was that Nicholas Roberts stated that he was
modifying his software to work-around Jaunty EXT4 bugs. And I think he'd
be better off leaving his software alone and going back to EXT3 and wait
a few more months for EXT4 goodness :).

Revision history for this message

Øyvind Stegard (oyvindstegard) wrote on 2009-06-19:

#185

fr., 19.06.2009 kl. 08.57 +0000, Nicholas Roberts:
> Regarding Øyvind's wise words...

I agree with what you're saying ! This bug sucks. Also see my reply to
dan. And the warning about EXT4 should be more prominent in the release
notes. Thankfully, they did not set it as default fs. Sorry for the bug
report spamming in general :), I'll stop now.

Revision history for this message

Andrioid (andri80) wrote on 2009-06-19:

#186

I realize that I'm just adding to the problem, but... Please conduct these discussions somewhere else. There are a lot of people subscribed to this thread, not because they want to discuss it, but probably because they are waiting for a fix to be released.

So at a risk of being rude; please restrict the replies to bugfixing efforts and hopefully this will get fixed at some point.

Revision history for this message

jagnet (jagnetx) wrote on 2009-06-19:

#187

Well said Andri. Now back to the bug.

I have been running 2 entirely different servers with the desktop version of 9.10 for 3 weeks now, one with EXT3 and the other with EXT4.

In 8.10 and 9.04 Ubuntu froze as described above when I deleted large numbers of files, or indeed when any software deleted anything (e.g. Sabnzbd, newsleecher etc.). However, I upgraded the kernel on one server as recommended and upgraded to 9.10 on the other, and the problem hasn't shown its ugly head in three weeks. Whatever the problem is, it isn't restricted to EXT4.

So whatever is in the new kernel or 9.10 is working and should be brought to 9.04.

If, like me, you're not a huge Linux tech and are looking for a quick, easy fix from 9.04: press Alt+F2, type update-manager -d in the box and press Run, then click Upgrade in the next window.

Revision history for this message

Bryan Quigley (bryanquigley) wrote on 2009-06-19: Re: [Bug 330824] Re: Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

#188

I can not recommend strongly enough that you DO NOT follow jagnet's
advice and upgrade to Karmic to get away from this bug. Yes the bug
is fixed in Karmic, but in development releases many many worse bugs
can be found (especially this early).

Unfortunately, the only truly supported solution has always been to
use EXT3 (which means reinstall in most cases).

Revision history for this message

pritam ghanghas (pritam-ghanghas) wrote on 2009-06-19:

#189

Hi

There is a better solution as suggested before in the discussion. Download
2.6.30 from here http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.30/.
I never had a problem after installing mainline. I won't recommend it if you
are using some proprietary drivers from Ubuntu. They may not be supported.
My "/" is not on ext4 though, but all other partitions which I use for
storage are on ext4 now and I was having problems with the stock Jaunty kernel.

On Fri, Jun 19, 2009 at 4:28 PM, Bryan Quigley <gQuigs@gmail.com> wrote:

> I can not recommend strongly enough that you DO NOT follow jagnet's
> advice and upgrade to Karmic to get away from this bug.  Yes the bug
> is fixed in Karmic, but in development releases many many worse bugs
> can be found (especially this early).
>
> Unfortunately, the only truly supported solution has always been to
> use EXT3 (which means reinstall in most cases).
>
> --
> Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28
> https://bugs.launchpad.net/bugs/330824
> You received this bug notification because you are a direct subscriber
> of the bug.
>
> Status in Ubuntu Release Notes: Fix Released
> Status in “linux” source package in Ubuntu: Fix Released
> Status in linux in Ubuntu Jaunty: In Progress
> Status in linux in Ubuntu Karmic: Fix Released
>
> Bug description:
> [
> Please read *all* previous comments before posting.
>
> Mainline kernels are known to not experience this bug, although in general
> are not supported (i.e., using one is a workaround, but if they break other
> things you're generally out of luck).
>
> Additional "me-too" comments aren't useful, feel free to select the "This
> bug affects me too" option and/or subscribe to this bug instead.
> ]
>
> Binary package hint: linux-image-2.6.28-8-generic
>
> I'm using 8.10 Kubuntu with all updates done on system.
>
> System is a clean installed system with EXT4 formating and using 2.6.8-8
> linux kernel.
>
> System sometimes lock and freeze whole inputs even keyboard or mouse.
> I have closed X and kdm and try to reprocedure same bug in console  ( not
> konsole )
> so i have killed X and kdm.
>
> And try to compile qt-copy in one console and try to svn up on KDE and on
> other console
> i tryto apt-get update    to make system under CPU load. and after a while
> it happens again.
>
> No Keyboard response no harddisc response total freeze.
>
> I have waited a while after freeze and about 4 min later a text appeared on
> screen saying :
>
> BUG: soft locking  - CPU#0 stuck for 61s!   [uic: 5356]
>
> after waiting about 4 more minutes a newer but same text appeared unter
> this message :
>
> BUG: soft locking  - CPU#0 stuck for 61s!   [uic: 5356]
> BUG: soft locking  - CPU#0 stuck for 61s!   [uic: 5356]
>
>
> There isn't any error records on /etc/log/messages releated on hardware
> while around freezing/locking times
>
> And for information : Sometimes i have seen that i'm getting messages like
> disc is full but
> I'm sure that it isn't. Because df shows me there are more than 7 Gb
> freespace. Not always getting this error.
> if a file shows this error while i'm updating it i'm deleting it and
> downloading a bigger file system won't interrupts me
> like saying disk is full. I think it is releated to Ext4.
>
> But i'm not sure these 2 bugs releated or not.
>
> Thanks
>

-- 
pritam

Revision history for this message

Theodore Ts'o (tytso) wrote on 2009-06-19:

#190

Another possibility is to use the Karmic kernel but keep a Jaunty userspace. I haven't tried this myself, but I suspect it is highly likely to work.

As far as people complaining --- please note that (a) Ubuntu has not made this the default, and (b) when I talked to the Canonical kernel team, they were using ext4 and very happy with it; they weren't seeing this problem; (c) I've not been able to reproduce the problem, even using some of the Python reproduction scripts provided here, on a 1024meg Atom netbook running Jaunty. It may be because I (and the Canonical kernel team) don't run some critical program which is needed to enable this bug to trigger. (d) From working with the people who *can* trigger this bug reliably, it seems to be related to the Ubuntu specific backports of ext4 patches; if those 10 patches are removed, the system is stable. With *any* upstream kernel, either before, after, or at the Jaunty snapshot, these problems don't show (at least for the people who have done testing for me and who are able to replicate the bug --- as I've said, I can't reproduce it at all on my systems).

As far as jagnet's report that he sees hangs when deleting large number of files using ext3 with the Jaunty kernel --- that's interesting. I don't know how to square that with people who took the stock kernel at the Jaunty snapshot point, saw that they had no problems, applied the Ubuntu-specific ext4 patches from the Jaunty "sauce" (aka 'value-added' distribution patches) and then were able to reproduce system hangs when deleting files.

In any case, I'm a volunteer, and I do ext4 development largely on my own time. I'm not paid to support Ubuntu. So I have to ration my ext4 late-night development hours carefully, and there's been a lot of need of my time to get fixes and support for 64-bit e2fsprogs in upstream. And unfortunately, there is no paid Ubuntu resource supporting ext4, at least as far as I know. Eric Sandeen, on the other hand, is a Red Hat employee, who has spent a *huge* amount of his time (both paid and personal time) helping to make sure that Fedora 11's ext4 was rock solid stable. As a former SGI employee employed to work on XFS, he's extended the XFS test suite to work on ext4, and we're using it to make sure that the upstream ext4 is rock-solid stable, and he's been using it to make sure F11's ext4 is highly stable. I'm sorry I can't spend more time working on Ubuntu's ext4, but at the end of the day it boils down to time management --- especially when there are workarounds such as using the Karmic kernel or using upstream kernel. I'm sorry for those of you using proprietary drivers, but there's a reason upstream kernel developers aren't terribly fond of such things.
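
For readers wondering what a reproduction script of the kind mentioned above might look like, here is a minimal sketch. It is not one of the scripts posted to this bug; the function name `create_and_delete` and its parameters are made up for illustration. It creates many small files in a scratch directory and then times their deletion. On an affected system it may hang the machine, so only run it where a hard reset is acceptable.

```python
import os
import tempfile
import time

def create_and_delete(base_dir, count=1000, size=1024):
    """Create `count` files of `size` bytes under base_dir, then unlink them.

    Returns (files_created, seconds_spent_deleting).
    """
    payload = b"x" * size
    with tempfile.TemporaryDirectory(dir=base_dir) as scratch:
        paths = []
        for i in range(count):
            path = os.path.join(scratch, f"f{i:06d}")
            with open(path, "wb") as fh:
                fh.write(payload)
            paths.append(path)
        if hasattr(os, "sync"):
            os.sync()  # flush pending writes before timing the deletes
        start = time.monotonic()
        for path in paths:
            os.remove(path)
        elapsed = time.monotonic() - start
    return len(paths), elapsed
```

Pointing `base_dir` at a directory on the ext4 partition under test and raising `count` to tens of thousands would roughly match the workloads reporters describe in this thread.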

Revision history for this message

Theodore Ts'o (tytso) wrote on 2009-06-19:

#191

P.S. There is a known ext4 file system corruption bug which is fixed in the 2.6.30 mainline kernel and in 2.6.29.5. It was found after the stable kernel series stopped updating for 2.6.28, but I do carry a fix for it in my for-stable-2.6.28 branch of the ext4 git tree, located here:

git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git for-stable-2.6.28
http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=shortlog;h=for-stable-2.6.28

(This is where it's handy to have a file system specialist working at the distribution; when I found the problem, I was able to contact Eric and he made sure the patch was quickly dropped into the F11 kernel.)

Revision history for this message

Ulrich Hobelmann (u-hobelmann) wrote on 2009-06-19:

#192

FWIW, I've installed a kernel from http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.30/ and I haven't had any problems since then.
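
A quick way to see which kernel series a machine is running is sketched below. The helper names `kernel_version` and `older_than_2_6_30` are hypothetical, and the version number is only a rough hint: per Ted's comment above, the freezes seem tied to Ubuntu-specific ext4 backports rather than to 2.6.28 as such.

```python
import platform
import re

def kernel_version(release):
    """Parse the leading 'A.B.C' of a kernel release string into a tuple of ints."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())

def older_than_2_6_30(release):
    """True when the release string names a kernel older than 2.6.30."""
    return kernel_version(release) < (2, 6, 30)

if __name__ == "__main__":
    # platform.release() returns the same string as `uname -r`.
    release = platform.release()
    print(release, "is older than 2.6.30:", older_than_2_6_30(release))
```

With the release string Scott S posted earlier, `older_than_2_6_30("2.6.28-11-generic")` is true.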

Revision history for this message

Leann Ogasawara (leannogasawara) wrote on 2009-06-19:

#193

Hi Ted,

I've opened bug 389555, https://bugs.edge.launchpad.net/ubuntu/jaunty/+source/linux/+bug/389555, to track the issue you mentioned in comment https://bugs.edge.launchpad.net/ubuntu/+source/linux/+bug/330824/comments/191 . Thanks.

Revision history for this message

Derek (bugs-m8y) wrote on 2009-06-19: Re: [Bug 330824] Re: Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

#194

On Fri, 19 Jun 2009, Theodore Ts'o wrote:

> P.S. There is a known ext4 file system corruption bug which is fixed in
> the 2.6.30 mainline kernel and in 2.6.29.5. It was found after the
> stable kernel series stopped updating for 2.6.28, but I do carry a fix
> for it in my for-stable-2.6.28 branch of the ext4 git tree, located
> here:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git for-stable-2.6.28
> http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=shortlog;h=for-stable-2.6.28
>
> (This is where it's handy to have a file system specialist working at
> the distribution; when I found the problem, I was able to contact Eric
> and he made sure the patch was quickly dropped into the F11 kernel.)

Is this:
http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=16cb5dd9f53e569130584696909d423b6fe38c1e

?

'cause the machines I was getting lockups on were single core.

Revision history for this message

Theodore Ts'o (tytso) wrote on 2009-06-19: Re: [Bug 330824] Re: Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

#195

On Fri, Jun 19, 2009 at 04:49:19PM -0000, Derek wrote:
> > git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git for-stable-2.6.28
> > http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=shortlog;h=for-stable-2.6.28
> >
> > (This is where it's handy to have a file system specialist working at
> > the distribution; when I found the problem, I was able to contact Eric
> > and he made sure the patch was quickly dropped into the F11 kernel.)
>
> Is this:
> http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commitdiff;h=16cb5dd9f53e569130584696909d423b6fe38c1e
>
> 'cause the machines I was getting lockups on were single core.

That's the one; this bug could potentially cause corruption of the inode
table or of block groups, causing data loss. It could *potentially*
show up on single processor systems, if CONFIG_PREEMPT is defined, but
in practice it's unlikely for UP systems to hit the race.

- Ted

Revision history for this message

Jason Waddle (jwaddle) wrote on 2009-06-20:

#196

I registered just to say that I have been seriously affected by this bug, and it certainly deserves more than the "medium" designation. Single core, ~1GHz processor, would deadlock consistently on rm -rf'ing a directory of photos (~4MB each, thousands of them) using the stock Ubuntu kernel 2.6.28-11 (then -13) on an ext4 filesystem (native), LVMed and RAID-1. All of the LVM & RAID overhead might have exacerbated the bug if it was a race condition.

I love Linux in general, and have been very happy with the "hands off" safe, stable feel of Ubuntu distros, but this latest install was a freaking disaster since I primarily use this machine remotely. If EXT4 was known to be so unstable, it should have been marked so. Since the bug has been known for months, and people have offered to ship pre-installed machines that can reliably reproduce the problem, some more progress should have been made. Admittedly, it's tough to bitch when I haven't done anything to help fix the problem, but seriously I would be much happier if someone just upgraded the bug status from "medium".

I've gone to the mainline kernel 2.6.30 and I had to install the newer 180 version of nvidia drivers in order to do so. Very helpful directions for how to do this are given here: https://bugs.launchpad.net/ubuntu/+source/nvidia-common/+bug/384639. After the switch to the mainline kernel, no problem with the mass deletes. I highly suggest that anyone having this problem do the same. These directions should be linked on the main Ubuntu Jaunty page. Releasing with ext4 (without any sort of cautionary warnings in the installer!) in this shape was a serious mistake.

Ubuntu folks, when am I going to be able to tell my mom it's a good idea to ditch the Windows and go Ubuntu?

Revision history for this message

moli (f-launchpad-moli-hu) wrote on 2009-06-20:

#197

Thank you Jason Waddle, I've updated my console-only server to 2.6.30 and succeeded in deleting 35494 files from an ext4 device in ~45 seconds without freezing.

Help for the update:
http://www.ramoonus.nl/2009/06/10/linux-kernel-2-6-30-installation-guide-for-ubuntu-and-debian-linux/
http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.30/
http://www.cyberciti.biz/faq/ubuntu-linux-how-do-i-install-deb-packages/

"""Ubuntu folks, when am I going to be able to tell my mom it's a good idea to ditch the Windows and go Ubuntu?"""

I think this is the most important factor in this case, and one that most people here simply don't get. People will switch back to Windows; they do not care about excuses.

Revision history for this message

Brian Shannon (brian.shannon) wrote on 2009-06-20: Re: [Bug 330824] Re: Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

#198

"""Ubuntu folks, when am I going to be able to tell my mom it's a good
idea to ditch the Windows and go Ubuntu?"""

To be blunt, when you decide to use the stable defaults Ubuntu provides.

Sorry about being off topic but this was just so ridiculous I had to
pass comment.

Revision history for this message

Vincenzo Ciancia (vincenzo-ml) wrote on 2009-06-20:

#199

Moreover, that comment came after the poster updated a "console only" server :)

Revision history for this message

Michael Rooney (mrooney) wrote on 2009-06-20:

#200

Andrius Štikonas (stikonas) wrote on 2009-06-20:

#201

2009/6/20 Brian Shannon <email address hidden>

> """Ubuntu folks, when am I going to be able to tell my mom it's a good
> idea to ditch the Windows and go Ubuntu?"""
>
> To be blunt, when you decide to use the stable defaults Ubuntu provides.
>
> Sorry about being off topic but this was just so ridiculous I had to
> pass comment.

If everybody uses the stable defaults, then who will test or develop new
software?
Every software community must have early testers, and if they report
something it is in the best interests of the community to fix the reported
bug. If early testers are ignored, they will just stop testing, which
will eventually be very bad for Ubuntu.

No progress has been made on this bug for quite some time, and that is
likely to remain so. If nobody knows which patch introduced this bug, just
apply all diffs between 2.6.28 and 2.6.29 in fs/[ext4,jbd2] to the Jaunty
kernel. This is a huge change, but it is unlikely to introduce bugs that are
even half as bad as this one.
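
A minimal sketch of how such a combined diff could be produced, assuming a git clone of the mainline kernel tree in which the release tags v2.6.28 and v2.6.29 are present. The function name and the output file name are made up for illustration, and the resulting patch would very likely need manual fix-ups before it applies to the Jaunty kernel tree:

```shell
# Hypothetical helper: write every change made to fs/ext4 and fs/jbd2
# between two tags of a kernel git tree into a single patch file.
# Usage: ext4_jbd2_diff <kernel-tree> <old-tag> <new-tag> <output-file>
ext4_jbd2_diff() {
    tree=$1 old=$2 new=$3 out=$4
    # The pathspecs after "--" limit the diff to the two directories.
    git -C "$tree" diff "$old" "$new" -- fs/ext4 fs/jbd2 > "$out"
}

# Example (assumes ./linux-2.6 is a clone of the mainline tree):
#   ext4_jbd2_diff ./linux-2.6 v2.6.28 v2.6.29 ext4-jbd2-2.6.29.patch
```

The patch file only collects the changes; whether they can be applied cleanly on top of the Ubuntu backports is a separate question.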

Vincenzo Ciancia (vincenzo-ml) wrote on 2009-06-20:

#202

Sorry for polluting the bug report with multiple comments, but I got things wrong. A warning in the installer would in fact be a nice addition; however, the release notes are clear about the ext4 bugs.

Michael Rooney (mrooney) wrote on 2009-06-20:

#203

On Sat, Jun 20, 2009 at 2:16 AM, Jason Waddle<email address hidden> wrote:
> Releasing with ext4 (without any sort of cautionary warnings in the
> installer!) in this shape was a serious mistake.

It is also documented in the release notes:
http://www.ubuntu.com/getubuntu/releasenotes/904#Lock-ups%20when%20deleting%20files%20from%20ext4%20filesystems

When I decided to give ext4 a try I definitely searched the release
notes for ext4 related issues and understood the risk I was taking.
Note this isn't something most people are expected to do, but it is
definitely a good idea for the minority of people adjusting critical
parts of an OS.

Michael Rooney (mrooney) wrote on 2009-06-20:

#204

On Sat, Jun 20, 2009 at 9:27 AM, Andrius Štikonas<email address hidden> wrote:
> If everybody is using stable defaults then who will test or develop new
> software?
> Every software community must have early testers and if they report
> something it would be in the best interests of community to fix reported bug
> and if early testers are ignored, they will just stop testing which
> eventually will be very bad for Ubuntu.
>

You are correct, and these users are definitely important and
appreciated! We just must keep in mind the perspective of what we are
doing as testers, testing early software, and understand that
sometimes it IS going to have issues and should not be used for production
environments or systems which need to be stable. As such it isn't
reasonable or helpful to become infuriated when such an issue is
experienced.

> No progress is being made on this bug for quite some time and it is likely
> to remain so. If nobody knows which patch introduces this bug just apply all
> diffs between 2.6.28 and 2.6.29 in fs/[ext4,jbd2] to Jaunty kernel. This is
> a huge change, but it unlikely to introduce bugs that are even half as bad
> as this one.

The problem here is that developers' time is limited. Since the issue
is already fixed, their time can be spent benefiting the most users by
working on improving things in the default feature set and future
release, especially since Jaunty is not an LTS. However of course any
contributions by community members are always welcome, including
tracking down the commit or getting out a PPA of a fixed kernel, or
providing workarounds such as Jason did which have already helped one
user on this report (thanks!).

Jason Waddle (jwaddle) wrote on 2009-06-20:

#205

I want to apologize for my last comment. I really like having a free operating system and all the free software, and I love the people who volunteer their time putting it all together and making it work as well as it does (thank you!). I was in a bit of a mood after messing with this thing all night.

I use Ubuntu because I am lazy. I don't go out of my way to install the newest cutting edge stuff because I would rather not be the one who spends his hours sifting through these forums / the source / git bisect / etc. I'd rather benefit from the hard work of the people who like to do that stuff (thanks again!) So once every year or so, whenever I get a new drive or update some hardware in some significant way, I go to the Ubuntu website and download the latest installation image, and usually only after it's been out a few months. With minimal faffing I have my system up and running in half an hour or so, all my old data happily copying over to the new system. I never read release notes, things just work (well, laptops usually need a little tweaking).

When I installed 9.04 this last time, I had no idea that selecting ext4 was in any way dangerous or experimental. Deviating from the "stable defaults that Ubuntu provides" was as easy as cursoring up (or down, I don't remember) from ext3 to ext4. I saw the option and thought to myself, "cool, it's here, this is Ubuntu, it must be working well, so why not use ext4 instead of ext3?" I had no idea that I was doing something even slightly risky, because there was nothing in the lazy man's path from download to having a running system that would have told me so. If there had been, I would have gone with ext3, I never would have read this thread, and you wouldn't be wasting your time reading this right now. From now on, I'll read the release notes and hopefully save us all some time. I'd like to be more lazy, but there simply isn't a better option than Ubuntu right now.

So lazy talk aside, I'd actually like to be more helpful when these things come up. This particular case looks (from the tiny bit I have seen) like a classic deadlock. In the codebase I am most familiar with, it's pretty simple to turn on traces for locks that you know are in contention and use the trace to find the culprit. It's been almost ten years since I've done any significant work in the Linux kernel, however, so I'd have to do some ramping up. Anyone have pointers to your favorite kernel hacking resources (I'm sure even the basic tools have changed a lot in 10 years) in case I find myself motivated and with some time on my hands?

Thanks,
Jason

JoseStefan (josestefan) wrote on 2009-06-20:

#206

This bug doesn't seem to trigger on ext3 file systems mounted as ext4.
Can anyone confirm?

Maybe we can pinpoint which of the new ext4 attributes are needed to trigger the bug?
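
One way to narrow this down might be to compare the feature flags of a filesystem that locks up with those of one that does not. A rough sketch, assuming the "Filesystem features:" line that `tune2fs -l` prints; the function name and the set of flags treated as ext4-specific are illustrative assumptions, not a definitive list:

```shell
# Hypothetical helper: read `tune2fs -l` output on stdin and print, one per
# line, the feature flags that an ext3 filesystem merely mounted as ext4
# would normally lack. The flag list is an assumption for illustration.
ext4_only_features() {
    sed -n 's/^Filesystem features:[[:space:]]*//p' |
        tr ' ' '\n' |
        awk '/^(extent|flex_bg|huge_file|uninit_bg|dir_nlink|extra_isize)$/'
}

# Example (needs root; /dev/sdXN is a placeholder):
#   tune2fs -l /dev/sdXN | ext4_only_features
```

Posting this output for an affected and an unaffected filesystem side by side would show which flags differ.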

Andrew Aylett (andrew-aylett) wrote on 2009-06-20: Re: [Bug 330824] Re: Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

#207

On Sat, 20 Jun 2009 17:49:01 -0000
JoseStefan <email address hidden> wrote:

> This bug doesn't seem to trigger on ext3 file systems mounted as ext4.
> Can anyone confirm?
>
> Maybe we can pinpoint which of the new ext4 attributes are needed to
> trigger the bug?

I can confirm that I only saw the bug after converting my filesystems,
not before when they were ext3 mounted as ext4. They are all converted
now, and I'm running a stock kernel to avoid the crashes so I'm not
best placed to test, sorry.

Vincenzo Ciancia (vincenzo-ml) wrote on 2009-06-21:

#208

On 20/06/2009 Michael Rooney wrote:
> . Ext4 is not the default file system so
> it only affects a small minority of users. See
> https://wiki.ubuntu.com/Bugs/Importance for more information. I am not
> sure if this would qualify as a severe impact and it is already fixed
> in newer kernels as you noted.

As I understand priorities, these are per-package, not per distribution.
Otherwise a bug that makes e.g. xournal completely unusable would be low
priority. Instead it must be high priority, it's high in xournal, not in
ubuntu. Likewise, the "medium" designation for a bug that deadlocks the
kernel seems a bit wrong also in my opinion. It should be "high" in the
kernel.

That said, this bug is being actively worked on, and the priority
"medium" does not reflect the effort devs are putting into it, which is
"high", so there is nothing to complain about.

Stefan Bader (smb) on 2009-07-03

Changed in linux (Ubuntu Jaunty):
status:	In Progress → Fix Committed

Brian J. Murrell (brian-interlinx) wrote on 2009-07-03:

#209

On Fri, 2009-07-03 at 12:35 +0000, Stefan Bader wrote:
> ** Changed in: linux (Ubuntu Jaunty)
> Status: In Progress => Fix Committed

Which kernel? Can you paste the changelog entry?

Theodore Ts'o (tytso) wrote on 2009-07-05:

#210

We might have a break in this bug. For those people who can reliably reproduce the problem, are you using ecryptfs, possibly extensively? Roland Dreier has reported a potential lockdep report that might explain why some of us have had extreme problems reproducing the problem; namely we might not be using ecryptfs. See:

http://lkml.org/lkml/2009/7/4/93

If so, there is a sample patch which **might** fix this problem. See:

http://lkml.org/lkml/2009/7/5/79

Also, if people could try building their kernel with CONFIG_PROVE_LOCKING, that would be interesting to see if we get a lockdep report.

If so, it might be that this bug has been around all along, but it's something about Ubuntu patches that makes it 100,000 times more likely to trigger. (In practice, it looks like it should only trigger on a truncate to a size which is not a multiple of the filesystem blocksize, not on an unlink --- but maybe there was a bug in the Ubuntu backports that made this possible to trigger on an unlink. This is only a theory, but it's the first lockdep report I've gotten that could at least potentially be related to this problem that to date, only Ubuntu users have been complaining about, even though at this point we've got a huge number of Fedora 11 users using ext4 w/o any problems.)
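
For anyone unsure whether their kernel was built with that option, here is a small sketch that checks a kernel config file. It assumes the config of the running kernel is installed as /boot/config-$(uname -r), which is where Ubuntu kernel packages put it; the function name is hypothetical:

```shell
# Hypothetical helper: succeed if a kernel config file enables lockdep's
# lock-order checking. With no argument it looks at the running kernel's
# config under /boot (an assumption that holds for Ubuntu kernel packages).
has_prove_locking() {
    config=${1:-/boot/config-$(uname -r)}
    grep -q '^CONFIG_PROVE_LOCKING=y' "$config"
}

# Example:
#   has_prove_locking && echo "lockdep enabled" \
#       || echo "rebuild with CONFIG_PROVE_LOCKING=y"
# A lockdep report, if one is produced, should show up in the kernel log:
#   dmesg | grep -i 'possible circular locking dependency'
```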

Abraham Smith (adsmith) wrote on 2009-07-06:

#211

Ted,
I've NOT been using ecryptfs, but I was using NFS and rsync-over-ssh on ext4 extensively during my lockups.

Theodore Ts'o (tytso) wrote on 2009-07-06: Re: [Bug 330824] Re: Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

#212

On Mon, Jul 06, 2009 at 01:11:40AM -0000, Abraham Smith wrote:
> Ted,
> I've NOT been using ecryptfs, but I was using NFS and rsync-over-ssh on ext4 extensively during my lockups.

NFS client or server? If NFS server, were you exporting an ext4 filesystem?

- Ted

Abraham Smith (adsmith) wrote on 2009-07-06:

#213

Both, actually. Yes, exporting ext4.

dnyaga (daniel-nyaga) wrote on 2009-07-06: Re: [Bug 330824] Re: Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

#214

Ted,
I've also not been using ecryptfs. I was one of the earliest reporters of
this bug. At that time, (during the beta of Jaunty), the most common use
case that triggered the bug was the running of multiple Virtualbox virtual
machines WHILE simultaneously moving/deleting files across/from multiple
partitions.

I must note that, over the last two and a half months, I have had a
relatively stable experience with Ext4 on Ubuntu. My Ubuntu laptop sees lots
of daily use, and I have only had about five lockups (which may not be the
fault of ext4). I still use Virtualbox - although not as much. I still move
files around - but not as much.

The more recent freezes have all occurred when I have a few large projects
open in Eclipse and I am doing an ant build outside on the command line. The
builds and projects are usually large enough to cause a bit of memory
pressure (and swapping) on a laptop with 4GB RAM.

I am no filesystem expert - so I don't know if this information helps or
just clouds things up further.

Regards

On Mon, Jul 6, 2009 at 4:11 AM, Abraham Smith <ads2@duke.edu> wrote:

> Ted,
> I've NOT been using ecryptfs, but I was using NFS and rsync-over-ssh on
> ext4 extensively during my lockups.
>
> --
> Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28
> https://bugs.launchpad.net/bugs/330824
> You received this bug notification because you are a direct subscriber
> of the bug.
>
> Status in Ubuntu Release Notes: Fix Released
> Status in “linux” package in Ubuntu: Fix Released
> Status in linux in Ubuntu Jaunty: Fix Committed
> Status in linux in Ubuntu Karmic: Fix Released
>
> Bug description:
> [
> Please read *all* previous comments before posting.
>
> Mainline kernels are known to not experience this bug, although in general
> are not supported (i.e., using one is a workaround, but if they break other
> things you're generally out of luck).
>
> Additional "me-too" comments aren't useful, feel free to select the "This
> bug affects me too" option and/or subscribe to this bug instead.
> ]
>
> Binary package hint: linux-image-2.6.28-8-generic
>
> I'm using 8.10 Kubuntu with all updates done on system.
>
> System is a clean installed system with EXT4 formating and using 2.6.8-8
> linux kernel.
>
> System sometimes lock and freeze whole inputs even keyboard or mouse.
> I have closed X and kdm and try to reprocedure same bug in console  ( not
> konsole )
> so i have killed X and kdm.
>
> And try to compile qt-copy in one console and try to svn up on KDE and on
> other console
> i tryto apt-get update    to make system under CPU load. and after a while
> it happens again.
>
> No Keyboard response no harddisc response total freeze.
>
> I have waited a while after freeze and about 4 min later a text appeared on
> screen saying :
>
> BUG: soft locking  - CPU#0 stuck for 61s!   [uic: 5356]
>
> after waiting about 4 more minutes a newer but same text appeared unter
> this message :
>
> BUG: soft locking  - CPU#0 stuck for 61s!   [uic: 5356]
> BUG: soft locking  - CPU#0 stuck for 61s!   [uic: 5356]
>
>
> There isn't any error records on /etc/log/messages releated on hardware
> while around freezing/locking times
>
> And for information : Sometimes i have seen that i'm getting messages like
> disc is full but
> I'm sure that it isn't. Because df shows me there are more than 7 Gb
> freespace. Not always getting this error.
> if a file shows this error while i'm updating it i'm deleting it and
> downloading a bigger file system won't interrupts me
> like saying disk is full. I think it is releated to Ext4.
>
> But i'm not sure these 2 bugs releated or not.
>
> Thanks
>

Feistybird (bryanjen-tw) wrote on 2009-07-06:

#215

Hi Ted,

> even though at this point we've got a huge number of Fedora 11 users using ext4 w/o any problems.)

Fedora 11 uses linux kernel 2.6.29.4, but Ubuntu 9.04 uses kernel 2.6.28 by default.

I don't have any ext4 soft-lockup problems with the Ubuntu mainline kernels 2.6.29 and 2.6.30 either.

Regards,
Bryan
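
Since the reports here split along kernel versions, a quick check of which series a machine is running may help when comparing notes. A small sketch; the function name is hypothetical, and matching the 2.6.28 series is only a heuristic, because mainline 2.6.28 builds are also reported in this thread as unaffected:

```shell
# Hypothetical helper: succeed if a kernel release string belongs to the
# 2.6.28 series. With no argument it checks the running kernel.
is_2_6_28_series() {
    case ${1:-$(uname -r)} in
        2.6.28|2.6.28[-.]*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
#   is_2_6_28_series && echo "running a 2.6.28-series kernel: $(uname -r)"
```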

Derek (bugs-m8y) wrote on 2009-07-06: Re: [Bug 330824] Re: Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

#216

On Sun, 5 Jul 2009, Theodore Ts'o wrote:

> We might have a break in this bug. For those people who can reliably
> reproduce the problem, are you using ecryptfs, possibly extensively?
> Roland Drier has reported a potential lockdep report that might explain
> why some of us have had extreme problems reproducing the problem; namely
> we might not be using ecryptfs. See:

I was not using ecryptfs on either machine that consistently locked up.
The one machine, a laptop, didn't really have any network mounting at all.
I did use sshfs but rarely, and never had it mounted when lockups occurred.
The other machine used CIFS, which might have been in use during the lockups.

Nick B. (futurepilot) wrote on 2009-07-06:

#217

Theodore Ts'o wrote:
> We might have a break in this bug. For those people who can reliably
> reproduce the problem, are you using ecryptfs, possibly extensively?
> Roland Drier has reported a potential lockdep report that might explain
> why some of us have had extreme problems reproducing the problem; namely
> we might not be using ecryptfs. See:
>
> http://lkml.org/lkml/2009/7/4/93
>
> If so, there is a sample patch which **might*** fix this problem. See:
>
> http://lkml.org/lkml/2009/7/5/79
>
> Also, if people could try building their kernel with
> CONFIG_PROVE_LOCKING, that would be interesting to see if we get a
> lockdep report.
>
> If so, it might be that this bug has been around all along, but it's
> something about Ubuntu patches that makes it 100,000 times more likely
> to trigger. (In practice, it looks like it should only trigger on a
> truncate to a size which is not a multiple of the filesystem blocksize,
> not on an unlink --- but maybe there was a bug in the Ubuntu backports
> that made this possible to trigger on an unlink. This is only a theory,
> but it's the first lockdep report I've gotten that could at least
> potentially be related to this problem that to date, only Ubuntu users
> have been complaining about, even though at this point we've got a huge
> number of Fedora 11 users using ext4 w/o any problems.)
>
>
I've been using Ecryptfs only on ~/Private.

Theodore Ts'o (tytso) wrote on 2009-07-06: Re: [Bug 330824] Re: Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

#218

On Mon, Jul 06, 2009 at 02:57:32AM -0000, Feistybird wrote:
> Hi Ted,
>
> > even though at this point we've got a huge number of Fedora 11 users
> using ext4 w/o any problems.)
>
> Fedora 11 uses linux kernel 2.6.29.4, but Ubuntu 9.04 uses kernel 2.6.28
> by default.
>
> I don't have any ext4 soft-lock-up problems using Ubuntu mainline kernel
> 2.6.29 and 2.6.30 netiher

Yes, and some people have reported that mainline 2.6.28 or 2.6.28.*
also seems to be without problems. I just haven't had the time to try
to go through the Ubuntu Jaunty ext4 patch backports, which is where I
assume the problem might be. However, the problem Roland reported
could very well lead to a deadlock in the truncate path, which is very
much related to the problems that are reported. The only way
Roland's potential deadlock could be related to this is if there is a
bug in the Ubuntu ext4 backport such that the truncate code thinks
it's only doing a partial truncate in the case of a delete. But it
seems pretty clear that there was a bug in the Ubuntu backports of
various ext4 patches, so the one-line patch which I suggested might
actually make things better.

- Ted
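
To illustrate the distinction drawn above, here is a sketch that tells a partial truncate (new size not a multiple of the filesystem block size) from a block-aligned one. The helper name is hypothetical, and the 4096-byte default is an assumption; the real value is on the "Block size:" line of `tune2fs -l`:

```shell
# Hypothetical helper: succeed if truncating a file to <size> bytes would
# leave a partially used final block, i.e. <size> is not a multiple of the
# block size (second argument, assumed to be 4096 when omitted).
is_partial_truncate() {
    size=$1 blocksize=${2:-4096}
    [ $((size % blocksize)) -ne 0 ]
}

# Example: exercise the partial-truncate path on a scratch file.
# Only try this on a test machine; on an affected kernel it may hang.
#   dd if=/dev/zero of=/mnt/test/scratch bs=4096 count=4
#   is_partial_truncate 5000 && truncate -s 5000 /mnt/test/scratch
```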

Keith Moyer (keithmoyer) wrote on 2009-07-06:

#219

I also see this problem with no ecryptfs use (no FUSE-anything, just EXT4 all around). Does the "fix" that was committed for 9.04 assume this is caused by the ecryptfs deadlock?

Brian J. Murrell (brian-interlinx) wrote on 2009-07-06: Re: [Bug 330824] Re: Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

#220

On Sun, 2009-07-05 at 23:32 +0000, Theodore Ts'o wrote:
> We might have a break in this bug. For those people who can reliably
> reproduce the problem, are you using ecryptfs, possibly extensively?

Not at all here. But I rolled back to using ext4dev with the Intrepid
kernel and now have leapfrogged over the Jaunty kernel and am using the
Karmic kernel on Jaunty userspace. I just don't have the time to
continually recover from the Jaunty crashes.

Luke Maurer (luke-maurer) wrote on 2009-07-06:

#221

On Sun, 2009-07-05 at 23:32 +0000, Theodore Ts'o wrote:
> We might have a break in this bug. For those people who can reliably
> reproduce the problem, are you using ecryptfs, possibly extensively?

I've been able to reproduce it 100% reliably (rm a single file => crashy crashy) just in the Jaunty LiveCD environment. AFAIK, the most exotic filesystem hackery it uses is that union filesystem, though I was crashing on deleting something on a separate, non-unionized volume.

Roland Dreier (roland.dreier) wrote on 2009-07-08:

#222

Luke Maurer wrote:
> I've been able to reproduce it 100% reliably (rm a single file => crashy crashy) just in the Jaunty
> LiveCD environment. AFAIK, the most exotic filesystem hackery it uses is that union filesystem,
> though I was crashing on deleting something on a separate, non-unionized volume.

Can you give a recipe for how you're able to reproduce it with the Jaunty Live CD?

Martin Pitt (pitti) wrote on 2009-07-08:

#223

Accepted linux into jaunty-proposed, the package will build now and be available in a few hours. Please test and give feedback here. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed. Thank you in advance!

tags:

added: verification-needed

Revision history for this message

martinm1000 (martinmiller-gmail) wrote on 2009-07-08:

#224

I can also reproduce this bug easily, but I never tried EnableProposed.
Where can I find the new kernel in aptitude's text mode?

Revision history for this message

Luke Maurer (luke-maurer) wrote on 2009-07-10: Re: [Bug 330824] Re: Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

#225

Roland Dreier wrote:
>
> Luke Maurer wrote:
> > I've been able to reproduce it 100% reliably (rm a single file => crashy crashy) just in the Jaunty
> > LiveCD environment. AFAIK, the most exotic filesystem hackery it uses is that union filesystem,
> > though I was crashing on deleting something on a separate, non-unionized volume.
>
> Can you give a recipe for how you're able to reproduce it with the
> Jaunty Live CD?

Um, the "rm" command? :-)

Seriously, I boot it up, mount an ext4 volume on a garden-variety disk
partition, and try to delete a file. It hangs (evidently *after*
deleting the file). Every time. (I'm pretty sure I tried an ext4 image
mounted over loopback as well, to no avail.)

I realize it's a decidedly extreme case of the bug, but besides the
ease of reproduction, the symptoms are identical to those reported
here.

- Luke

Revision history for this message

Jared Heath (jared-heath) wrote on 2009-07-10:

#226

Just started using Ubuntu and I had to go looking for why the server constantly locked on rm commands.

This happens pretty much 100% of the time in my ext4 filesystems that are large (100gb 60%+ full). I cannot delete more than 20-30 files before it hangs.

I was also able to go up to 2.6.29 and not see it happen again. I cannot find the 2.6.28 change to bring in, even though this bug says it is committed... which I would prefer, since 2.6.29 causes all manner of Samba issues. I brought down every Jaunty proposed package and the lockup was still happening as of 20 minutes ago... so if the fix is being delivered via the Update Manager/Proposed, I'd have to say it isn't fixed.

I also was able to reproduce this with the Live CD when I mounted the filesystems there and attempted to do the rms...here is what I did:

1 - boot live CD
2 - mount ext4 filesystem
3 - in a terminal, rm files from it
4 - lockup

As far as ecryptfs goes...I don't even know what it is, so I doubt it is running.

Revision history for this message

Igor Tarasov (tarasov-igor) wrote on 2009-07-10:

#227

I have installed new 2.6.28-14 from jaunty-proposed and switched to it from 2.6.29. Today I had another lockup during intensive disk usage. This fix does not work for me.

Revision history for this message

=0yP)F]|L(0YNrv (ccgjsz8xdbdyyvy-deactivatedaccount) wrote on 2009-07-10:

#228

I have all the latest upgraded drivers, kernel, etc. Using a native ext4 partition, Jaunty w/latest KDE. Freezes and locks up the system almost every single time I empty the trash or delete certain items via Dolphin. Noticed this problem about 5 days ago, never seemed to be evident before then. I thought it might be something I did to my system, but I realized that with how much care I have given my system, it can't be this messed up.

I'm not the best Linux guru, but I'm extremely competent, so if any log files are needed, output, errors, more information, etc, just ask, and if possible include a way how to get it. I hope this gets fixed soon!

Revision history for this message

martinm1000 (martinmiller-gmail) wrote on 2009-07-13:

#229

This fix seems to be working for me; didn't have a crash since.
That was not easy to install, but I managed...

Revision history for this message

Pauli Virtanen (pauli-virtanen) wrote on 2009-07-13:

#230

Installed linux-image-2.6.28-14-generic 2.6.28-14.46 from jaunty-proposed. Ran hang.py, and still obtained a soft lockup. Doesn't seem like the update fixes this problem. So I guess Theodore Tso was right in pointing out above that this might not be the correct fix. (Note that for me 2.6.29-02062902 does not have this lockup problem.)

Also, on one machine, obtaining a crash reliably requires running several hang.py in parallel and modifying the script so that each instance uses its own set of file names.
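
For anyone without the script to hand, the parallel variant described above can be approximated with a sketch like the one below. This is not the hang.py attached to this bug: the names `churn` and `run_parallel`, the file counts and sizes, and the use of threads rather than separate processes are all assumptions made for illustration. Each worker gets its own file-name prefix, so instances never touch each other's files. On an affected kernel this may well hang the machine, so only run it where a hard reset is acceptable.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def churn(workdir, prefix, rounds=3, files_per_round=200, size=4096):
    """Create and then delete many small files, all named with this worker's prefix."""
    payload = b"x" * size
    deleted = 0
    for round_no in range(rounds):
        names = [
            os.path.join(workdir, f"{prefix}-{round_no}-{i}.tmp")
            for i in range(files_per_round)
        ]
        for name in names:
            with open(name, "wb") as fh:
                fh.write(payload)
        for name in names:
            os.remove(name)  # deletion is where the lockups were reported
            deleted += 1
    return deleted


def run_parallel(workdir, instances=4, **kwargs):
    """Run several churn() workers at once, each with its own file-name prefix."""
    with ThreadPoolExecutor(max_workers=instances) as pool:
        futures = [
            pool.submit(churn, workdir, f"inst{n}", **kwargs)
            for n in range(instances)
        ]
        return [f.result() for f in futures]


if __name__ == "__main__":
    # Hypothetical usage: replace the temporary directory with a directory
    # on the ext4 volume under test.
    with tempfile.TemporaryDirectory() as d:
        print(run_parallel(d))
```

Threads are used here only to keep the sketch short; starting several copies of a script from separate shells, as described above, exercises the same create/delete pattern from separate processes.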

@martinm1000: Check that `uname -a` reports kernel version 2.6.28-14.
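
The same check can be scripted, which helps when comparing several machines. A minimal sketch, assuming only the Python standard library; the helper name `kernel_matches` is made up for this example, and `platform.release()` returns the same string as `uname -r`.

```python
import platform


def kernel_matches(expected_prefix, release=None):
    """Return True if the kernel release string starts with expected_prefix."""
    if release is None:
        release = platform.release()  # same string as `uname -r`
    return release.startswith(expected_prefix)


if __name__ == "__main__":
    print(platform.release(), kernel_matches("2.6.28-14"))
```

A plain prefix match is deliberately crude; it is only meant to confirm that the machine actually booted into the kernel that was just installed.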

Revision history for this message

martinm1000 (martinmiller-gmail) wrote on 2009-07-13:

#231

I am on .28-14;

I didn't know about the Python script; I'll try it after work.

Revision history for this message

Franz Dietzmann (tdk-le) wrote on 2009-07-13:

#232

I just read through all the comments (I hope), and did not find this mentioned, so I thought it might be helpful..

I had the problem for a long time, but didn't bother too much. Now it got annoying and after some searching I installed mainline 2.6.30 to see if it would work.
As has been mentioned here before, it does, but unfortunately my UMTS didn't work anymore, so I just deleted my Trash and went back to .28
After logging in I found I had 10GB more space on my Home-Partition (the Trash only had ~1GB in it) The partition is only 40 GB total, so that's a lot. I checked if something was missing, but didn't find anything, which was strange.

I ran baobab just out of curiosity and there I found 5GB in ~/.local/share/Trash/expunged/
On closer inspection these were all files I supposedly deleted a long time ago, when the freeze appeared afterwards. I have no idea how they got there, I'm just a user...but maybe that info can point someone into the right direction.
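
Anyone who wants to check for the same leftovers without baobab could total up the directory with a short script. This is only a sketch: the function name `tree_size` is an assumption, and it sums apparent file sizes, which can differ from the on-disk usage that baobab or du report.

```python
import os


def tree_size(root):
    """Sum the apparent sizes, in bytes, of all files under root."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                total += os.lstat(path).st_size  # lstat: do not follow symlinks
            except OSError:
                pass  # file vanished or is unreadable; skip it
    return total


if __name__ == "__main__":
    expunged = os.path.expanduser("~/.local/share/Trash/expunged")
    print(expunged, tree_size(expunged), "bytes")
```

If the directory does not exist, `os.walk` yields nothing and the total is simply 0.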

Revision history for this message

Derek (bugs-m8y) wrote on 2009-07-13: Re: [Bug 330824] Re: Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

#233

On Mon, 13 Jul 2009, Franz Dietzmann wrote:

> I just read through all the comments (I hope), and did not find this
> mentioned, so I thought it might be helpful..
>
> I had the problem for a long time, but didn't bother too much. Now it got annoying and after some searching I installed mainline 2.6.30 to see if it would work.
> As has been mentioned here before it does, but unfortunatly my UMTS didn't work anymore, so I just deleted my Trash and went back to .28
> After logging in I found I had 10GB more space on my Home-Partition (the Trash only had ~1GB in it) The partition is only 40 GB total, so that's a lot. I checked if something was missing, but didn't find anything, which was strange.
>
> I ran baobab just out of curiosity and there I found 5GB in ~/.local/share/Trash/expunged/
> On closer inspection these were all files I supposedly deleted a long time ago, when the freeze appeared afterwards. I have no idea how they got there, I'm just a user...but maybe that info can point someone into the right direction.

I'm sure that this is just one of the many ways to trigger this ext4 thing, still, interested me even if not the cause of the bug.

http://ubuntuforums.org/showthread.php?t=1196171&page=2
Found this thread which seems to be same issue.

Appears that this is related to permission/ownership - so presumably you deleted read-only files.

I can imagine that might happen if, for example, the files were copied off a CD and had default read-only permissions.

I'm surprised nautilus doesn't handle this more gracefully.

Revision history for this message

Franz Dietzmann (tdk-le) wrote on 2009-07-13:

#234

I highly doubt that it had something to do with permissions, as there were really all kinds of files (audio, video, documents..) from different sources (downloads, self-made..).

I didn't mean this to be a cause of the bug, but rather a result and maybe an indicator to where things might be going wrong.

Revision history for this message

martinm1000 (martinmiller-gmail) wrote on 2009-07-14:

#235

Yep, I crashed using hang.py :

Linux lantea 2.6.28-14-generic #46-Ubuntu SMP Wed Jul 8 07:21:34 UTC 2009 i686 GNU/Linux

Filesystem Type Size Used Avail Use% Mounted on
/dev/sda5 ext4 90G 76G 9.9G 89% /

Didn't crash with 10GB of 100GB.

/dev/sda5 ext4 90G 82G 4.2G 96% /

Yep, crashed on round 3.

;-(
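
Since fullness keeps coming up as a factor, it may help to record it alongside each test run. A rough sketch using the standard library; the helper name `use_percent` is made up here, and the figure will not exactly match the Use% column of df, which leaves reserved blocks out of its calculation.

```python
import shutil


def use_percent(path):
    """Return used space on the filesystem holding path, as a percentage of its total size."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    print(f"/ is {use_percent('/'):.0f}% used")
```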

Revision history for this message

Stephan Frank (sfrank) wrote on 2009-07-15: still freezes with 2.6.28-14

#236

Hallo,

I'm sorry to say that my system still hard locks with the new 2.6.28-14
kernel in jaunty when I rsync my home partition (ext3) with my backup
partition (ext4). It does not matter whether I use 'rsync -av --delete
...' or only 'rsync -av ...'. The latter just takes a little longer
for the freeze to happen. This is on an AMD Athlon 64 Processor 3700+.

The weird thing is that I have access to another system with an Intel
Quad-Core CPU that is fully ext4 but runs without a hitch. I think that
suggests that we are really running into a timing/race problem here.

Best regards,
Stephan

Revision history for this message

Luke Maurer (luke-maurer) wrote on 2009-07-16:

#237

Huh. My system's also a single-core Athlon 64, and I'm getting it even worse (a single "rm" hangs). Is it possible that this is a race condition that's *more* likely on a single-core box? Seems like we've exhausted every other theory :-)

Revision history for this message

Jared Heath (jared-heath) wrote on 2009-07-16:

#238

It happened very frequently on my Dual Core i86 based system (never got more than 5 single rm commands off without a hang before I went to the higher kernel) so it certainly can happen on multi-core systems often.

Your theory on race conditions is interesting though--it certainly exhibits the behavior of a race that goes infinite and does not get caught.

Revision history for this message

Colin Sindle (csindle) wrote on 2009-07-16: Re: [Bug 330824] Re: Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

#239

Apologies, this is a qualitative post --- but now that people are talking
about different processors, I'll contribute some fluffy info.

That said, I experienced many "freezes" per day on my Core Solo laptop when
doing "dangerous" operations (svn update, rsync, rm, etc.). Then I swapped
to a Core 2 Duo, and when doing these same operations, I got about the same
number of "freezes", only now they recovered faultlessly (so far...) after
a second or two.
After an upgrade to 2.6.30-020630-generic #020630 from the Ubuntu Kernel-ppa
mainline, (to solve unrelated HP laptop sound issues), I have not
experienced any more "freezes" temporary, or otherwise.

c.

2009/7/16 Jared Heath <email address hidden>

> It happened very frequently on my Dual Core i86 based system (never got
> more than 5 single rm commands off without a hang before I went to the
> higher kernel) so it certanly can happen on multi-core systems often.
>
> Your theory on race conditions is interesting though--it certainly
> exhibits the behavior of a race that goes infinite and does not get
> caught.
>
>

Revision history for this message

Stephan Frank (sfrank) wrote on 2009-07-16: Re: [Bug 330824] Re: Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

#240

Colin Sindle wrote:
> After an upgrade to 2.6.30-020630-generic #020630 from the Ubuntu Kernel-ppa
> mainline, (to solve unrelated HP laptop sound issues), I have not
> experienced any more "freezes" temporary, or otherwise.

I have as well now manually switched to the 2.6.30-020630 kernel and the
freezes are gone...

Best regards,
Stephan

Revision history for this message

martinm1000 (martinmiller-gmail) wrote on 2009-07-16:

#241

I'm going to reboot, after installing 2.6.30 + newer (185.18.14) NVidia drivers
see https://bugs.launchpad.net/ubuntu/+source/nvidia-common/+bug/384639/comments/8
to do it with NVidia working ;-)

Hoping this will solve the crash problem. I would suggest to others to try the same, since the problem was
apparently solved and NOBODY decided to just backport the damn patches from the more recent kernels... I mean, it's been MONTHS, and I'm not running Linux to have random crashes.

Revision history for this message

Borph (borph) wrote on 2009-07-17:

#242

Full acknowledgement!

For me, I installed Kubuntu Jaunty fresh with native ext4 and an external backup drive, also ext4. Actually it was because of a system crash in which I lost my complete partition. So I want to have the backup system working now before I proceed! But I was stuck because of this ext4 bug; the system froze very often!

I'm just a user and didn't want to experiment!! Ext4 is not the default fs on ubuntu I read above, ok but I really regret that I chose this during graphical installation! Sorry that I didn't read the full release notes, I had no idea that it is that experimental!

Anyway, now I'm stuck, as I don't want to re-format my disks, especially not for an issue which doesn't occur in the mainline kernel. So I decided to tweak the system and get the kernel 2.6.29 (the 2.6.30 seems to have other problems..), following:

http://www.ramoonus.nl/2009/03/24/linux-kernel-2629-installation-guide-for-ubuntu-and-debian-linux/

But this doesn't put it in GRUB, so you have to change your menu.lst and do update-grub and update-initramfs.

Well, no crashes so far, even copying about 30gig. I actually removed the "nodelalloc" mount option, still stable so far.

I really recommend getting a newer kernel ( >=.29), especially because this is just an Ubuntu problem and Ted Ts'o is probably busy fixing more important stuff :) But the ubuntu guys should indeed provide an automatic update for the _really_ inexperienced people!

Revision history for this message

JoseStefan (josestefan) wrote on 2009-07-17:

#243

I've also been using the Karmic kernels on Jaunty (and the new nvidia drivers) as suggested by martinm1000. Unfortunately, it seems to require also updating the graphics drivers, in my case nvidia.

I've applied this temporary fix a while back, seeing this is taking too long to fix. I also vote for a backport as a temporary fix, instead of having inexperienced users jump through hoops. Most of the solutions posted so far seem to mess with your 3d acceleration, either requiring an update to the video drivers or manual installation. Another reason why i think a backport would be preferred.

I understand package policy would make it difficult for kernel 2.6.29 or newer to make it into jaunty. But isn't that what "jaunty-backports" is for? Using mainline kernels or getting karmic packages is not exactly a 1 click installation, and in fact could break your system. A backport on the other hand can be enabled using the GUI. And could provide an easier fix for those who need it.

The solution I adopted is very similar to having a backport:
1) Add a pin, by editing /etc/apt/preferences
Package: *
Pin: release a=karmic
Pin-Priority: 50

2) Append karmic to your sources.list:
deb http://us.archive.ubuntu.com/ubuntu/ karmic main restricted

3) Update your repositories.
sudo apt-get update

4) Use apt or synaptic to get the packages you want.
linux-image-2.6.31-3-generic
linux-headers-2.6.31-3-generic
linux-headers-2.6.31-3
nvidia-glx-180
nvidia-kernel-common

Revision history for this message

Borph (borph) wrote on 2009-07-17: Re: [Bug 330824] Re: Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

#244

2009/7/17 JoseStefan <email address hidden>:
> I've also been using the Karmic kernels on Jaunty (and the new nvidia
> drivers) as suggested by martinm1000. Unfortunately, it seems to require
> also updating the graphics drivers, in my case nvidia.

Because I'm using Nvidia, too, and read about some problems, I went
for 2.6.29 and it worked, I have 3D.

> I understand package policy would make it difficult for kernel 2.6.29 or
> newer to make it into jaunty. But isn't that what "jaunty-backports" is
> for? Using mainline kernels or getting karmic packages is not exactly a
> 1 click installation, and in fact could break your system. A backport on
> the other hand can be enabled using the GUI. And could provide an easier
> fix for those who need it.

I actually didn't even enable jaunty-proposed or jaunty-backport, I
wanted just a normal failsafe ubuntu. It took so much time to figure
out it's actually ext4 causing the troubles! There should be an update
even for users who got scared with the sentence "if you enable
'proposed' or 'backport', your system maybe not stable anymore!".

Your "pin" sounds promising, I will try this. But with care, as it's
currently running! :)

Peter

Revision history for this message

enb (elitenoobboy) wrote on 2009-07-23:

#245

Updating the kernel fixed this for me. Thanks JoseStefan for the easy instructions. I think one of the hardest parts of troubleshooting this is that it only seems to happen on certain hardware configurations, which means that initially I thought it was a hardware glitch of some kind, since it did not happen on any other computers with almost the same software setup.

Revision history for this message

Wei-Yee Chan (chanweiyee) wrote on 2009-07-29:

#246

This sounds similar to a problem that I experienced yesterday.

I did a fresh installation of Ubuntu 9.04 recently and formatted every partition to ext4. Yesterday, I was moving huge video files from my home directory to a removable USB hard disk (formatted to ext4 as well) when the system froze permanently (i.e. all hard disks stopped running completely). I did this with a couple of my other removable USB hard disks and the same thing happened many times.

The problem can be replicated by copying or moving files within the same IDE hard disk as well. Just a while ago, the system froze when I emptied Trash.

The computer has Windows XP installed, and no such problem occurs when I'm running it.

However, as far as I know, I have not experienced any data loss.

With reference to a few of the comments made above, I have more than 40Gb on every partition at any time, so the locking up seems unrelated to the amount of free hard disk space that one has.

Revision history for this message

getaceres (getaceres) wrote on 2009-07-29:

#247

I've installed the kernel in Jaunty proposed some days ago and since then I haven't had any hang. My system seems much more stable now.

Revision history for this message

Keith Moyer (keithmoyer) wrote on 2009-08-11:

#248

I have the -14 kernel, and just hit this bug again last night (actually caused me to lose a fair amount of data).

Are people still looking into this? By most accounts, the "fix committed" doesn't fix the problem.

Revision history for this message

Borph (borph) wrote on 2009-08-11:

#249

@getaceres:
Which kernel version are you using exactly?
Mine is 2.6.29-020629-generic, manually installed. But I would prefer to have a system with standard components. But I don't want to risk losing my data again.

Revision history for this message

Igor Tarasov (tarasov-igor) wrote on 2009-08-11:

#250

I've tried using the latest kernel from proposed (2.6.28-15) but I had two lockups, though they may not be easy to provoke. So the bug is not fixed, and I am back on 2.6.29-02062906

Revision history for this message

Xavier Guillot (valeryan-24) wrote on 2009-08-12:

#251

Since the last updates, it has worked better: I could permanently delete files in Nautilus without crashing.

But one time doing this I got a freeze, and 2 times also during copy / cut - paste of files (around 9 Gb), on a partition with a lot of space available for the first one.

Since yesterday, due to this recurring problem (and the risk of losing important data), I have installed Karmic alpha 3...

Arnaud Faucher (arnaud-faucher) on 2009-08-16

Changed in linux (Ubuntu Jaunty):
status:	Fix Committed → Confirmed

Steve Langasek (vorlon) on 2009-08-17

tags:

added: verification-failed
removed: verification-needed

Revision history for this message

Launchpad Janitor (janitor) wrote on 2009-08-17:

#252

This bug was fixed in the package linux - 2.6.28-15.48

---------------
linux (2.6.28-15.48) jaunty-proposed; urgency=low

[ Andy Whitcroft ]

  * SAUCE: pnp: add PNP resource range checking function
    - LP: #349314
  * SAUCE: i915: enable MCHBAR if needed
    - LP: #349314

[ Brad Figg ]

  * SAUCE: Add information to recognize Toshiba Satellite Pro M10 Alps
    Touchpad
    - LP: #330885

[ Colin Ian King ]

  * Input: atkbd - add forced release keys quirk for Samsung Q45
    - LP: #347623

[ Manoj Iyer ]

  * SAUCE: Added quirk to enable the installer to recognize NetXen NIC.
    - LP: #389603

[ Stefan Bader ]

  * SAUCE: input: Blacklist digitizers from joydev.c
    - LP: #300143

[ Tim Gardner ]

  * Revert "SAUCE: md: wait for possible pending deletes after stopping an
    array"
    - LP: #334994

[ Upstream Kernel Changes ]

  * bonding: Fix updating of speed/duplex changes
    - LP: #371651
  * net: fix sctp breakage
    - LP: #371651
  * ipv6: don't use tw net when accounting for recycled tw
    - LP: #371651
  * ipv6: Plug sk_buff leak in ipv6_rcv (net/ipv6/ip6_input.c)
    - LP: #371651
  * netfilter: nf_conntrack_tcp: fix unaligned memory access in tcp_sack
    - LP: #371651
  * xfrm: spin_lock() should be spin_unlock() in xfrm_state.c
    - LP: #371651
  * bridge: bad error handling when adding invalid ether address
    - LP: #371651
  * bas_gigaset: correctly allocate USB interrupt transfer buffer
    - LP: #371651
  * USB: EHCI: add software retry for transaction errors
    - LP: #371651
  * USB: fix USB_STORAGE_CYPRESS_ATACB
    - LP: #371651
  * USB: usb-storage: increase max_sectors for tape drives
    - LP: #371651
  * USB: gadget: fix rndis regression
    - LP: #371651
  * USB: add quirk to avoid config and interface strings
    - LP: #371651
  * cifs: fix buffer format byte on NT Rename/hardlink
    - LP: #371651
  * b43: fix b43_plcp_get_bitrate_idx_ofdm return type
    - LP: #371651
  * Add a missing unlock_kernel() in raw_open()
    - LP: #371651
  * x86, PAT, PCI: Change vma prot in pci_mmap to reflect inherited prot
    - LP: #371651
  * security/smack: fix oops when setting a size 0 SMACK64 xattr
    - LP: #371651
  * x86, setup: mark %esi as clobbered in E820 BIOS call
    - LP: #371651
  * dock: fix dereference after kfree()
    - LP: #371651
  * mm: define a UNIQUE value for AS_UNEVICTABLE flag
    - LP: #371651
  * mm: do_xip_mapping_read: fix length calculation
    - LP: #371651
  * vfs: skip I_CLEAR state inodes
    - LP: #371651
  * net/netrom: Fix socket locking
    - LP: #371651
  * kprobes: Fix locking imbalance in kretprobes
    - LP: #371651
  * netfilter: {ip, ip6, arp}_tables: fix incorrect loop detection
    - LP: #371651
  * ALSA: hda - add missing comma in ad1884_slave_vols
    - LP: #371651
  * SCSI: libiscsi: fix iscsi pool error path
    - LP: #371651
  * SCSI: libiscsi: fix iscsi pool error path again
    - LP: #371651
  * posixtimers, sched: Fix posix clock monotonicity
    - LP: #371651
  * sched: do not count frozen tasks toward load
    - LP: #371651
  * spi: spi_write_then_read() bugfixes
    - LP: #371651
  * powerpc: Fix data-corrupting bug in __futex_atomic_op
    - LP: #371651
  * hpt366: fix HPT370 DMA timeouts
    - LP: #371651
  * pata_hpt37x: fix HPT370 DMA timeouts
    - LP: #371651
  * mm: pass correct mm when growing stack
    - LP: #371651
  * SCSI: sg: fix races during device removal
    - LP: #371651
  * SCSI: sg: fix races with ioctl(SG_IO)
    - LP: #371651
  * SCSI: sg: avoid blk_put_request/blk_rq_unmap_user in interrupt
    - LP: #371651
  * usb gadget: fix ethernet link reports to ethtool
    - LP: #371651
  * USB: ftdi_sio: add vendor/project id for JETI specbos 1201 spectrometer
    - LP: #371651
  * USB: fix oops in cdc-wdm in case of malformed descriptors
    - LP: #371651
  * USB: usb-storage: augment unusual_devs entry for Simple Tech/Datafab
    - LP: #371651
  * Input: gameport - fix attach driver code
    - LP: #371651
  * r8169: Reset IntrStatus after chip reset
    - LP: #371651
  * hugetlbfs: return negative error code for bad mount option
    - LP: #371651
  * block: revert part of 18ce3751ccd488c78d3827e9f6bf54e6322676fb
    - LP: #371651
  * anon_inodes: use fops->owner for module refcount
    - LP: #371651
  * KVM: x86: Reset pending/inject NMI state on CPU reset
    - LP: #371651
  * KVM: call kvm_arch_vcpu_reset() instead of the kvm_x86_ops callback
    - LP: #371651
  * KVM: MMU: Extend kvm_mmu_page->slot_bitmap size
    - LP: #371651
  * KVM: VMX: Move private memory slot position
    - LP: #371651
  * KVM: SVM: Set the 'g' bit of the cs selector for cross-vendor migration
    - LP: #371651
  * KVM: SVM: Set the 'busy' flag of the TR selector
    - LP: #371651
  * KVM: MMU: Fix aliased gfns treated as unaliased
    - LP: #371651
  * KVM: Fix cpuid leaf 0xb loop termination
    - LP: #371651
  * KVM: Fix cpuid iteration on multiple leaves per eac
    - LP: #371651
  * KVM: Prevent trace call into unloaded module text
    - LP: #371651
  * KVM: Really remove a slot when a user ask us so
    - LP: #371651
  * KVM: x86 emulator: Fix handling of VMMCALL instruction
    - LP: #371651
  * KVM: set owner of cpu and vm file operations
    - LP: #371651
  * KVM: Advertise the bug in memory region destruction as fixed
    - LP: #371651
  * KVM: MMU: check for present pdptr shadow page in walk_shadow
    - LP: #371651
  * KVM: MMU: handle large host sptes on invlpg/resync
    - LP: #371651
  * KVM: mmu_notifiers release method
    - LP: #371651
  * KVM: PIT: fix i8254 pending count read
    - LP: #371651
  * KVM: x86: disable kvmclock on non constant TSC hosts
    - LP: #371651
  * KVM: x86: fix LAPIC pending count calculation
    - LP: #371651
  * KVM: VMX: Flush volatile msrs before emulating rdmsr
    - LP: #371651
  * ath9k: implement IO serialization
    - LP: #371651
  * ath9k: AR9280 PCI devices must serialize IO as well
    - LP: #371651
  * md: fix deadlock when stopping arrays
    - LP: #334994
  * block: include empty disks in /proc/diskstats
    - LP: #371651
  * powerpc: Sanitize stack pointer in signal handling code
    - LP: #371651
  * fs core fixes
    - LP: #371651
  * fix ptrace slowness
    - LP: #371651
  * crypto: ixp4xx - Fix handling of chained sg buffers
    - LP: #371651
  * PCI: fix incorrect mask of PM No_Soft_Reset bit
    - LP: #371651
  * b44: Use kernel DMA addresses for the kernel DMA API
    - LP: #371651
  * thinkpad-acpi: fix LED blinking through timer trigger
    - LP: #371651
  * Linux 2.6.28.10
    - LP: #371651
  * ext4: fix locking typo in mballoc which could cause soft lockup hangs
    - LP: #330824, #371651
  * V4L/DVB (9667): Fixed typo in sizeof() causing NULL pointer OOPS
    - LP: #316405
  * ALSA: hdsp - poll for iobox
    - LP: #363003
  * revalidate parent inode when rmdir done within that directory
    - LP: #317274
  * ext4: Fix race in ext4_inode_info.i_cached_extent
    - LP: #389555
  * V4L/DVB (9848): gspca: Webcam 06f8:3004 added in sonixj.
    - LP: #374122
  * kernel/resource.c: fix sign extension in reserve_setup()
    - LP: #370003
  * iwl3945: release resources before shutting down
    - LP: #345710
  * iwl3945: use cancel_delayed_work_sync to cancel rfkill_poll
    - LP: #345710

-- Stefan Bader <stefan.bader@canonical.com>   Mon, 01 Jun 2009 17:25:15 +0200

Changed in linux (Ubuntu Jaunty):
status:	Confirmed → Fix Released

Revision history for this message

Steve Langasek (vorlon) wrote on 2009-08-18:

#253

verification failed, but the patch doesn't appear to have introduced regressions, so the updated kernel has been published to jaunty-updates. Resetting for the next pass.

Changed in linux (Ubuntu Jaunty):
status:	Fix Released → Confirmed
tags:	removed: verification-failed

Revision history for this message

Phil Norbeck (ptn107) wrote on 2009-08-25:

#254

logs.tar.bz2 (66.9 KiB, application/octet-stream)

I can reproduce this every single time when deleting large files from ext3 partitions as well as ext4. I have also noticed that it is easier to reproduce when the working partition is low on free space. In my case, though, when reviewing the log files, each soft lockup instance has lines in common relating to 'eCryptfs'. My other kernels, 2.6.29.6 and 2.6.30.5, do not have this problem.

Logs attached.

Ubuntu 9.04 x86_64
Linux phil-desktop 2.6.28-15-generic #49-Ubuntu SMP Tue Aug 18 19:25:34 UTC 2009 x86_64 GNU/Linux

Revision history for this message

santiago (santiagozky) wrote on 2009-08-30:

#255

I'm running a fully updated Jaunty and I am still experiencing lockups when deleting large files/directories. Any idea when a fix will be released for Jaunty?

Revision history for this message

Theodore Ts'o (tytso) wrote on 2009-08-30:

#256

At this point, it seems pretty clear to me that no one is really working on this for Jaunty; if you must use Jaunty, the only thing I can suggest is to use a mainline kernel --- any mainline kernel, whether it is 2.6.28, 2.6.29, or 2.6.30 will work fine. The problem seems to be in Canonical's backports of patches to the 2.6.28 kernel, and the only people who could work on it are busy working on the Karmic release and/or the Karmic kernel. Those of us (like myself) who are working on the upstream ext4 are busy working on the latest set of improvements and bug fixes that will go into 2.6.31 or 2.6.32.

For those of you who need some proprietary drivers, I'm sorry to say, the only thing you can really do is wait for them to become ported to the Karmic kernel (or port them yourself).

Andrew Berry (andrewberry) wrote on 2009-09-01:

#257

Is there a list somewhere of notable patches / features which Canonical has integrated into their kernel? I'd like to switch to a mainline kernel to avoid this bug (which is still affecting me), but want to be sure I'm not missing anything critical which Canonical has changed.

papukaija (papukaija) wrote on 2009-09-04:

#258

Should we close this bug for Jaunty, as no one is working on it (see comment 256)?

Saivann Carignan (oxmosys) wrote on 2009-09-04:

#259

No, Jaunty is still supported (it's still the latest release) and the bug is still confirmed, so closing it would be inappropriate. Closing it also wouldn't help developers track the bug and work on it later.

tiagolp (tiagolp) wrote on 2009-09-09:

#260

Mounting the ext4 filesystem with the mount options "sync,barrier=1" seems to solve the problem in my case (2.6.28-15-generic).

Logicwax (logicwax) wrote on 2009-09-10:

#261

Thanks tiagolp! I can confirm as well that mounting my native ext4 with the "sync,barrier=1" option in my fstab solves the problem on Jaunty.
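
For reference, an fstab entry using these options might look like the sketch below. The device path and mount point are hypothetical placeholders, not taken from this report. Note also that the sync option generally slows writes noticeably, so this is a workaround rather than a fix.

```text
# /etc/fstab (sketch; /dev/sdb1 and /mnt/data are hypothetical placeholders)
# <device>   <mount point>  <type>  <options>                 <dump>  <pass>
/dev/sdb1    /mnt/data      ext4    defaults,sync,barrier=1   0       2
```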

Logicwax (logicwax) wrote on 2009-09-10:

#262

Actually, I'm sorry, I take that back. I was trying to rm -rf over 1.3TB of data, spread across over 17,000 subdirectories, each containing a dozen or so files.

I too had complete system lock-up when I would try deleting them (moving and copying was fine).

I tried to move the directories in blocks of about 100 or so to another directory, then tried deleting those. I had the same lockup issues.

The method that tiagolp proposed helped a lot, but didn't completely solve my problem. While I can now delete about 100 or so directories, I still can't delete the entire 17k directory tree without a full lock-up.

For the record, I'm running Jaunty 32-bit, 2.6.28-15-generic, with native ext4 on an LVM volume spanned across two 1.5TB SATA drives on a Silicon Image SATA PCI card.

Andrew Berry (andrewberry) wrote on 2009-10-05:

#263

It seems to me that this is fixed in the patches committed from #418197. Can anyone else confirm? I was able to delete around 2.6 million links and files in a single rm -rf, which would previously cause a lockup in a minute or two.

Andrew Berry (andrewberry) wrote on 2009-10-05:

#264

Link since comments don't autolink to bug numbers: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/418197

Rene (g.xrc) wrote on 2009-10-16:

#265

Since I upgraded to
Linux rgm 2.6.28-15-generic #52-Ubuntu SMP Wed Sep 9 10:49:34 UTC 2009 i686 GNU/
there is no freeze when deleting big files together (> 1GB)
and no "BUG: soft lockup - CPU#0 stuck for 61s! [uic: 5356]" message.
That means 2 PCs had the problem, and both are now solved!
Previously I had to switch to mainline kernel (I chose 2.6.30.6).
Thank you.

Andrius Štikonas (stikonas) on 2009-10-16

Changed in linux (Ubuntu Jaunty):
status:	Confirmed → Fix Released

ViPeRaY (mail-erayyilmaz) wrote on 2010-01-07:

#266

It seems like the fix has been released for this, but I am still having this problem. I can copy large files (around 15-20 GB) to an NTFS hard drive and there is no problem. However, when I try to copy the same files to an internal hard drive which uses ext4, the system freezes. I am using Karmic with kernel 2.6.31-16-generic.

My question is, how do I get the fix? I get auto updates, but do I have to manually install the fix? And where are the patch files located?

Thanks,

enb (elitenoobboy) wrote on 2010-01-07:

#267

"However when I try to copy same files to an internal hard drive which uses ext4, the system freezes."

This would be a different bug, as this bug only occurs when removing files.

"My question is, how do I get the fix?"

It looks like the latest karmic kernel release is 2.6.31-17. You might want to try installing that.

If that doesn't work, and assuming that it really is a kernel problem and not caused by something else, you could try the 2.6.32 kernel from lucid's repository, though since lucid is still in alpha stages, it might be best to find out if it really is being caused by the kernel first.

hoover (uwe-schuerkamp) wrote on 2011-01-23:

#268

I have experienced a similar bug removing largish video files (about 4GB or so) from an internal SATA drive formatted with an xfs filesystem.

Sometimes when doing an "rm -rf" on a directory on that file system, the rm will hang and remain pegged at 100% cpu usage. As opposed to other posters in this thread, I don't see any suspicious messages in dmesg about hangs or timeouts, and usually I'm able to "rm -rf" the directory in question from another terminal session without a hang.

The only thing that kills the rm is a reboot; kill -9, Ctrl-C and so on have no effect on that process.

Please let me know if you need any further logs. I'm running kernel 2.6.32-27-generic #49-Ubuntu SMP Wed Dec 1 23:52:12 UTC 2010 i686 GNU/Linux on Linux Mint 10, which is based on Maverick 10.10.

reini (rrumberger) wrote on 2011-01-24:

#269

Since this report is about ext4 and you're having problems with xfs, you really should open a separate report...

Ubuntu
linux package

Soft lockups (freezes) when deleting files from ext4 partitions on 2.6.28

Affects	Status	Importance	Assigned to
Release Notes for Ubuntu	Fix Released	Undecided	Unassigned
linux (Ubuntu)	Fix Released	Medium	Tim Gardner
Jaunty	Fix Released	Medium	Tim Gardner
Karmic	Fix Released	Medium	Tim Gardner