2.6.28-11 causes massive data corruption on 64 bit installations

Reported by Graziano on 2009-03-22
This bug affects 42 people
Affects: linux (Ubuntu)
Importance: Critical
Assigned to: Manoj Iyer
Declined for Dapper by Pete Graner
Declined for Hardy by Pete Graner
Declined for Intrepid by Pete Graner
Declined for Jaunty by Pete Graner
Declined for Karmic by Pete Graner
Declined for Lucid by Pete Graner

Bug Description

Binary package hint: linux-image-2.6.28-11-generic

I have been running Ubuntu on my notebook for more than two years, starting from Gutsy and going through all development releases. I have always used JFS as my root filesystem, with the NVIDIA binary drivers.
The notebook is a COMPAL clone from an Italian assembler, SANTECH. The lshw output is attached.

-) I started upgrading to Jaunty a week after Intrepid was out. All upgrades went fine for months, until last week's switch to the 2.6.28-11 kernel. Without any system crash, JFS silently started to glitch. Files started to disappear. There were no messages from the kernel; applications simply started to segfault. First apt, then python, then the system was lost. I repeat: when the problem occurred there were no kernel messages, just applications segfaulting.
-) Rebooting led to JFS repairing itself, each time losing files (the repair messages were something like d_ino != inode, I can't remember exactly. read ahead).
-) My first thought was that something was wrong with the disk, so I booted sysrescuecd and ran a smartctl long test on the disk. No problems were reported, and the disk is still far from the end of its life.
-) My second thought was that something had gone wrong with the filesystem, and after reading the ext4 threads I thought the developers had screwed something up in journaling, but I was told this was not the case.
-) As my system was in a terrible mess, with tens of files lost, I installed a fresh Jaunty alpha 6. Installation went fine. The system was up again. No installation problems.
-) Went on to the system upgrade. Installed 2.6.28-11 again. Reboot, same problems as before: applications started to segfault one after another. At reboot, filesystem problems again...
-) Next try: this time I installed on ext4. Again the fresh alpha 6 installation was fine. And again the system upgrade brought the same behaviour, this time leading to complete filesystem woes (unable to mount).
-) Again I installed a fresh alpha 6, this time staying on 2.6.28-9. This setup seems stable.

Summary of this.

I do not have much to report, as I get no messages from the kernel when the problems appear, only segfaults from applications. It seems my disk I/O goes crazy after switching to 2.6.28-11. As this happens with both JFS AND ext4, it seems to be something related to the controller and/or the common I/O path.

Have no more hints....

BoomSie (gideon-poort) wrote :

Same issue over here. On Saturday I decided to give Jaunty a try, after a colleague of mine warned me about the issues with ext4.

I made sure that the dist-upgrade would neither touch NOR update the filesystem. Nevertheless:

* First boot OK, although nautilus and some GNOME applets crashed a few times
* Reboot and everything was f*cked
* sbin/init* was gone
* fsck'd both the home and root partitions to recover
* last night it was up and running again, so I figured I could work 'normally' on my laptop again today
* this morning I booted and logged into GNOME -> CRASH. It looked like some configuration issue at first, then I found that my root filesystem was mounted read-only. So fsck again, and a shitload of files in lost+found now

I have a Compal as well, btw: a JFL92 (see attachment).

Hope you guys figure out what's going on before the official release next month.

Cheers & keep up the good work

Changed in linux (Ubuntu):
importance: Undecided → High
status: New → Triaged
BoomSie (gideon-poort) wrote :

Yes Graziano, you hit the nail on the head, I'm afraid.

I'm not really familiar with how these patches work or when to expect the update to be available in the Alpha/Beta repos, so I'll watch from a distance for the coming week(s).

(Unless someone can guarantee me that the fix is already there; then I'm MORE than happy to do a clean alpha 6 install to fiddle around some more with this stunning new release)

~..~
(oo) <<< MOOh

Well, after some days of work on the machine, I can say that at least Jaunty alpha 6 is stable (with all updates installed) IF the 2.6.28-9 kernel is kept running (I have modified the default target in grub's menu.lst, just to be sure).
As of today, on the same system, booting 2.6.28-11 leads to filesystem faults. And without a system rescue CD with filesystem repair tools at hand (and some luck about where the faults happened), the system becomes unusable.
Note that on a desktop system at home (AsRock eSata motherboard, Core E6400 processor) kernel 2.6.28-11 does not have these problems (at least not so visibly: tracker sometimes segfaults, but the system does not crash and the filesystem is preserved).
It seems to be something triggered by the hardware, but latent in the code path.

I am completely with BoomSie: the official release cannot live with these problems around.
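
For anyone wanting to pin the older kernel the same way: in GRUB legacy's menu.lst, the `default` line takes the zero-based position of a `title` entry. The following is only a sketch of how that position could be looked up. The function name `find_entry_index` and the sample menu text are made up for illustration and are not taken from any machine in this report, so a real menu.lst should still be checked by hand before editing.

```python
def find_entry_index(menu_text, kernel_substring):
    """Return the zero-based index of the first 'title' entry whose text
    contains kernel_substring, or None if no entry matches.

    GRUB legacy numbers boot entries in the order their 'title' lines
    appear; blank lines and commented lines (starting with '#') are skipped.
    """
    index = 0
    for raw_line in menu_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if parts[0] != "title":
            continue
        title = parts[1] if len(parts) > 1 else ""
        if kernel_substring in title:
            return index
        index += 1
    return None


# Hypothetical menu.lst excerpt, for illustration only.
SAMPLE_MENU = """\
default 0
timeout 3
# title  this commented-out entry is not counted
title Ubuntu 9.04, kernel 2.6.28-11-generic
kernel /boot/vmlinuz-2.6.28-11-generic root=/dev/sda1 ro quiet splash
title Ubuntu 9.04, kernel 2.6.28-11-generic (recovery mode)
kernel /boot/vmlinuz-2.6.28-11-generic root=/dev/sda1 ro single
title Ubuntu 9.04, kernel 2.6.28-9-generic
kernel /boot/vmlinuz-2.6.28-9-generic root=/dev/sda1 ro quiet splash
"""

if __name__ == "__main__":
    print(find_entry_index(SAMPLE_MENU, "2.6.28-9-generic"))
```

With the sample text, the 2.6.28-9 entry is the third `title` line, so the number to put after `default` would be 2.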

BoomSie (gideon-poort) wrote :

Bug remains in the Beta release of Jaunty too. How can one tell whether this bug is 'fixed' in a release or not? (without rereading the entire changelog)

Manoj Iyer (manjo) wrote :

The patch mentioned above is already in the jaunty Kernel.

god-mok (god-mok) wrote :

I have the same problem on my Clevo M57RU. As long as I boot the older 2.6.28-9 kernel, everything is fine.

An hour ago I tried the 2.6.28-11 kernel and it worked for a few minutes. But when I tried to search in Synaptic, it closed. In the terminal I got only a core dump. After that I couldn't do anything with apt-get or several other programs. My WLAN and every other network connection were dead.

While I left it alone and looked for logs, the contents of my windows seemed to disappear. Under nautilus the borders were there, but the content, menus and buttons were gone. After I had recovered everything I tried the same under KDE. The same thing happened to the system, but there the window contents did not disappear.

The window contents didn't disappear every time I tried it. My system often froze before that point.

After the freeze the system can't mount my user partition or get any network to work. The log files didn't show anything strange, until I noticed that the last entries were from before the freeze. When I tried to run dpkg, it told me that it could not write to the /var/cache/apt folder. I looked there, and everything seemed fine.

When I started gdm, it seemed that X couldn't read xorg.conf. So I looked into that, but nothing was wrong with it. I tried to run apt-get update as a test, and it told me the system seems to be in a read-only state.

No wonder that after the freeze all the logs were untouched.

After that I tried to boot 2.6.28-9, but no luck. There the devil spread his wings. I hope someone will look into this.

Oh yeah: after a fresh reinstall from the beta ISO I got the same problem from the very beginning, without having done anything except looking at files with gedit and nautilus. Stranger still, I could not see any driver under Hardware Drivers, although it worked from the USB stick and the live CD.

And sorry for my bad english :)

Manoj Iyer (manjo) on 2009-04-03
summary: - jaunty kernel 2.6.28-11 kernel destroy system
+ jaunty kernel 2.6.28-11 kernel update renders the system un-usable.

I suspected the patch mentioned above of causing a lot of trouble, but the problem MUST also be related to the hardware configuration.

Tried again today with 2.6.28-11 and got EXACTLY the same behaviour as god-mok. I was really lucky to recover the filesystem and get back here. AGAIN: no problem with 2.6.28-9. With 2.6.28-11 the system becomes unusable within a few minutes (actually the whole filesystem can be lost, so where You call it unusable I prefer destroyed, as I have had to recover from backups multiple times).

I repeat: it seems to be something ALSO related to hardware, as I have tried the same procedure on multiple machines (fresh install with 2.6.28-9, then update to 2.6.28-11) and I had problems ONLY on my notebook.

god-mok, would You please post the output of lshw, as I have not found complete specs for Your Clevo?
Looking at a hardware summary, it seems really similar to the COMPAL setup:

• 17" WSXGA active-matrix glare TFT (1680x1050)
• incl. 1.3 MPix webcam
• nVIDIA GeForce 8800M GTX 512MB
• Core2Duo T9300 / 2.5 GHz 6MB/800 MHz
• 4096 MB (2x2048) SO-DIMM DDR2 800MHz
• 200 GB / 7200 rpm S-ATA
• DVD±R/±RW DL (Dual Layer) 8x/8x multi-standard burner
• Intel Wireless WiFi Link 4965AGN
• int. Bluetooth module

But I am really curious about chipset used. Let us sort this trouble out!

Manoj Iyer (manjo) on 2009-04-03
Changed in linux (Ubuntu):
assignee: nobody → manjo
god-mok (god-mok) wrote :

Sorry, totally forgot. Here is my lshw file.

At the moment I have no battery attached, but that doesn't matter, because it was the same with it.
And yeah, our hardware seems very similar.
The only difference is that I do not have the Bluetooth module. Everything else is very similar to my hardware.

Oh yeah, and another thing: I tried to recover files with testdisk and some other programs, but most of it failed. I found some pictures and some movie files (totally fragmented), so it was totally wasted time.

But before that, I checked the partitions like everyone else. I even tried both ext3 and ext4 on my home partition, but it was the same: many inode issues (199), and after the check nothing had changed. I reran the check and the same thing happened again, as the filesystem could not be changed (read-only).
I checked the disk every time, and once something odd happened: fsck tried to change the filesystem from ext4 to ext2. I don't know why that happened, and it didn't work either, but I tried :)

Maybe the last point doesn't matter, but for me it was very strange...

Oh, and under the live CD I could not find the home partition with gparted, but the root partition was OK, and I could even mount it.

Well, I still wonder why the system thinks it is in read-only state, when I can manage, change and do everything with the files...

If you need any more specific hardware details, just ask. I will gladly look them up in my paperwork.

mirix (miromoman) wrote :

Installing 9.04 beta (AMD) via the Live CD and then compiling the kernel to 2.6.29.1 (before doing any other update via Aptitude/Synaptic) renders a very stable and "performant" system.

god-mok (god-mok) wrote :

@mirix: too bad. If someone knows what he has to do, then it's no problem, but I don't think that's the idea behind all this. As long as there is no support for it, it's not such a good idea, right?

Today I had another crash. I reinstalled again, and after the reboot the bug appeared at the first boot, not, as usual, at the second one after the freeze. So once again I mounted my home partition manually and everything worked so far. I installed updates and drivers, added some repos to my sources list, and then there was the freeze again.

I thought it could be a problem with some new package, but that was not it, because I reinstalled again and after the first safe boot I didn't do anything. I let it run for almost 30 minutes, then I rebooted. Same problem, so I had to mount manually, but sometimes that didn't work.
There was a notice after the failed boot, something like "two files share same sector/inode" in a folder "/home/god-mok/???/a9...x86_64...". Too bad I have no "???" folder and I didn't memorize the whole number, so that's all for now. The number looked like an md5 hash up to the x86_64 part. I have no idea where that came from after a fresh install with formatted partitions.

mirix (miromoman) wrote :

@god-mok:

when you install a beta version you take some risks. as far as I know, the problem only happens when you upgrade the kernel. so if you stick to the kernel provided by the installation CD (2.6.28-9) until the problem is solved, you should be safe. I guess this is a problem specific to the Ubuntu kernel 2.6.28-11.

I upgraded to the latest Linux kernel (2.6.29.1) because I wanted to. but I was not recommending or even suggesting to anybody to do so.

cheers

Just to bump. The problem is still here. 2.6.28-9 is stable, 2.6.28-11 is unusable on this hardware. As all hardware currently works, I have no problem living with -9, but I am worried about the problem.

Today I tried again with 2.6.28-11.41, only to have programs segfaulting after a minute or two and a filesystem repair at startup. I am on ext4 now and cannot report on JFS anymore.

I have read all patches introduced in the -9 -> -11 jump, and I was unable to find anything really interesting.
Jaunty will surely ship with 2.6.28, and as we are now 10 days from release, the problem would be better handled by some kernel people.

I decided to go unstable just to help, as I am a somewhat experienced user who knows how to recover from backups and how to live with system hiccups. If we are here to build a community system, someone has to test it and discover bugs so that inexperienced users can avoid them. Ubuntu was something about community, or am I missing something?

I have plenty of options for living on the bleeding edge; I am here just to help the Ubuntu project. I won't stay with Jaunty and will jump to Karmic as soon as it starts its development phase: lots of bug reporters will be available for the official release, and I am of much more help ahead of them.

Steffen Rusitschka (rusi) wrote :

I'm also a bit concerned about the RC/final of Jaunty... Anyway, here's a short summary of all lshw.txt files attached to this bug and its duplicate:

Common to all machines:
- Intel Core 2 Duo
- 4 GB RAM
- 4965 AG(N) WiFi
- Intel 965 Memory Controller
- Nvidia Graphics Card: 8x00M

I'm not sure if everyone is running a 64-bit version of Jaunty.

But: those hardware combinations are far from being exotic - almost all new Laptops have a similar configuration...

KJ (cortexbuster) wrote :

I have exactly the same HW and I'm running the 64 bit Version of Jaunty.
And I experience also the same problems you all have.
I can't run 2.6.28-11 without severe data loss.

Louis-Dominique Dubeau (ldd) wrote :

I'm experiencing the same problem on a Compal IFL90 (aka Sager NP2090). Running Jaunty 64-bit.

No data loss on my side but I've experienced some random kernel and process crashes. (Emacs works fine but a few hours later all executions of emacs result in automatic segfaults!)

Downgrading to 2.6.28-9 fixed this issue but created other issues. I'm now running into this bug:

https://bugs.launchpad.net/ubuntu/+source/pulseaudio/+bug/330814

Fixing it requires a kernel upgrade!

KJ (cortexbuster) wrote :

the trouble continues with linux-image-2.6.28-11-generic (2.6.28-11.42)
filesystem access fails after a few minutes.

Kevin W. (eyecreate) wrote :

I am also getting the same things as people above. I am currently running off live cd and found this bug here, which hits it right on the mark for me. I have the exact system specs as the common list and only seemed to have the problem after kernel upgrade. Hope there is a fix for release.

Attached lshw just in case.

Kevin W. (eyecreate) wrote :

I forgot to say on the above that I installed with the RC.

KJ (cortexbuster) wrote :

since this is such a showstopper for everyone with this hardware config I'd really appreciate any dev comment.
is anybody working on the issue?
do you need more information?
what can be done to track down the error?
we're close to the official release. this could turn into a disaster for a lot of regular users upgrading to jaunty.
I'm concerned about the quality of ubuntu. it's a great os. I use it on a couple of servers as well as desktops / notebooks. but such a widespread error so close to a release can do the project real harm.
just my two cents.

Kevin W. (eyecreate) wrote :

I just did a disk repair in order to boot up again. Because I'd rather not lose anything else, I'm going to use the ppa for .29 kernel until something better is shown here.

Kevin W. (eyecreate) wrote :

I found out after rebooting to try and install the new kernel that my partition was too far gone to boot up anymore, so I had to reinstall. I also want to add that I found out that when I reinstalled Kubuntu RC for the fourth time that even if I don't upgrade to the latest kernel, it still messes things up. IDK what kernel is in the RC by default, but it seems I will for sure have to use a different kernel. Here goes a fifth time.

Kevin W. (eyecreate) wrote :

Oooh, I found something else interesting out. It's too bad it's this close to release, but it seems the upgrade to the .29 kernel also fixed a bug I(and others) seemed to have about network manager not connecting to encrypted networks. So far, I am able to connect to my WEP wifi hotspot which I couldn't do on livecd or fresh install. I will try my university's WPA2 Enterprise connection next. I do wish this was the default kernel in Jaunty, because it'd make things work better and make life less difficult.

IIRC Jaunty will definitely ship with a 2.6.28 kernel.

https://lists.ubuntu.com/archives/kernel-team/2009-February/004321.html

Can confirm bug is not present using mainline 2.6.29.1, but this will be of some help with Karmic.
For people who do not know how to get mainline, here You can find "stock" linux kernels compiled for Ubuntu:

http://kernel.ubuntu.com/~kernel-ppa/mainline/

I do not recommend doing this in any way, but if You really have problems, this is the way to go. Remember that You will not get help from Ubuntu developers on kernel-related problems when using mainline!
If You want to use the Ubuntu kernel, keep reading this thread for possible solutions. As we are really close to release, I expect this bug to be tracked down by developers after April 24.

Kevin W. (eyecreate) wrote :

Just to comment on my previous comment, it seems WPA2E still doesn't work.

Kevin W. (eyecreate) wrote :

I have gotten WPA2E to work by using wicd as my network manager. Sounds like I have another bug to search for.

Steffen Rusitschka (rusi) wrote :

Did anyone try if the final version of Jaunty still has this issue?

AFAIK 24/04 is release date for jaunty and on my system the current ubuntu kernel version (not the one I am running with, but the one linked to the linux-image-generic) is 2.6.28-11.42. So the message from KJ above,

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/346691/comments/18

answers Your question. Last I personally tried is 11.41.

Ivo Smits (ivo-ufo-net) wrote :

I have seen the same problem here on my HP Compaq 8510w. While downloading updates (using the update manager) the filesystem suddenly locked up (switched to read-only mode). At that moment I had one virtual machine (VirtualBox) running as well, but I do not think that's related to the problem.

After trying ubuntu for the first time, a friend of mine told me about this problem (exactly the same hardware). I told him it couldn't be Ubuntu's fault and finally decided to try myself. My system worked fine for at least a week before it crashed, and fsck was able to repair the filesystem.

I have Ubuntu Jaunty amd64 running on a VMWare virtual machine and have not seen any problems there (yep.. the issue seems to be hardware-related...).

I'll attach my lshw result as well.

®om (rom1v) wrote :

Maybe related : bug 350268

Siegfried Gevatter (rainct) wrote :

Seems like it's the same problem I have... http://bloc.eurion.net/archives/2009/one-week-with-debian/

KJ (cortexbuster) wrote :

as far as I see yes.
either the current 2.6.29 mainline kernel or 2.6.28-9 can help you out.

fuchur (ckellner-gmx) wrote :

Ok, as this seems the original bug report:

Until this is fixed, I am thinking about using a new kernel, but

a) which problems could this cause - there must be a reason that kernels are never replaced during a release cycle

b) how would I do this - apt-get typically hangs after a few packages are downloaded, and the file system quickly becomes read-only.
Could I install the new kernel on Intrepid and then dist-upgrade to Jaunty? What is the correct procedure?

c) If this bug was known on March 20, and confirmed some days before the release, why is there not at least some remark in the release notes?

giorgio130 (gm89) wrote :

same problem here on a compal jhl90. I managed to install 2.6.29 and it seems to run fine now.

str0g (buskol-waw-pl) wrote :

I have the same bug on my JHL90, but downgrading the kernel doesn't work :/

adamski (adam-hasselbalch) wrote :

I am amazed that this bug is not marked as Critical!

Obviously, this affects a great deal of users in a way that is extremely destructive.

In my opinion, this bug alone renders Jaunty completely unfit for a production environment! There's no getting around that data loss and file system corruption due to a kernel error is absolutely and 100% unacceptable in a so-called "stable" release. The fact that it apparently happens on very common hardware does not help.

Sorry for the harsh words, but this is simply Not Good Enough!

str0g (buskol-waw-pl) wrote :

I'll tell you why it isn't critical: my desktop (C2Q, X38 + ICH9R, 4 GB, 3 HDDs) works great with it, and my friend's laptop with an X2 and a desktop with an X2 also work. It's really hard to say what went wrong, but obviously the developers should test the system on modern laptops to avoid this kind of problem...

I've installed this kernel: Linux lukasz-laptop 2.6.29-02062901-generic #02062901 SMP Fri Apr 3 13:36:07 UTC 2009 x86_64 GNU/Linux

I have some minor issues, e.g. today's acpi update cannot be installed and there are some minor errors with the kernel header installation, but the system is stable and there is no data loss.

mirix (miromoman) wrote :

I agree that it is not acceptable to release a so called "stable" version being aware of such a serious bug.

Fuchur:

I have compiled kernel 2.6.29.1 (2.6.29.2 is already available but I have not tested it) a few weeks ago and I have not had any issues since then.

You can download precompiled packages from the Ubuntu site (I do not have the URL) and install them with dpkg.

A few people describe easy ways to compile it from source:

http://izanbardprince.wordpress.com/2009/03/26/how-to-fix-ubuntu-jaunty-warning-hacks-ahead/

http://koroshiyaitchy.wordpress.com/2009/04/25/ubuntu-904-jaunty-jackalope-customised-for-performance-on-a-nexoc-osiris-e705iii-clevo-m57ru-laptop/

I followed these older instructions:

http://symbolik.wordpress.com/2007/11/10/vanilla-kernel-26231-on-gutsy-gibbon/

Just changing the obvious parts. I guess all three methods are actually the same. Just a few kernel configuration options change.

The only annoying and unresolved issue I have found this far is related to this:

http://ubuntuforums.org/showthread.php?p=3593262

I have followed the instructions on that how-to to no avail. I have also tried a Gentoo method with uvesafb with no better luck. However, regular Ubuntu installations also give similar problems if you install the proprietary NVIDIA or ATI drivers.

In fact, given the great deal of manual configuration I have ended up carrying out, I am seriously considering moving back to good old Debian, which is far more stable, faster and less buggy than Ubuntu. Ubuntu is more modern, but less so than, for instance, Fedora.

summary: - jaunty kernel 2.6.28-11 kernel update renders the system un-usable.
+ jaunty kernel 2.6.28-11 kernel update makes the system unusable.
sam tygier (samtygier) on 2009-05-22
Changed in linux (Ubuntu):
importance: High → Critical
tags: added: amd64
summary: - jaunty kernel 2.6.28-11 kernel update makes the system unusable.
+ 2.6.28-11 causes data corruption with ICH8 on 64 bits installations
summary: - 2.6.28-11 causes data corruption with ICH8 on 64 bits installations
+ 2.6.28-11 causes massive data corruption with ICH8/ICH9 on 64 bits
+ installations
summary: - 2.6.28-11 causes massive data corruption with ICH8/ICH9 on 64 bits
- installations
+ 2.6.28-11 causes massive data corruption on 64 bit installations
quixote (commer-greenglim) wrote :

Just two things for what it's worth:

Fact: I've been using the 2.6.29-04 kernel with up to date 64-bit Jaunty on ext3 for a couple of months now with no problems at all.

Opinion: I am still horrified, appalled, perplexed, and angry that there are neither any warnings on LiveCD iso downloads for 64-bit, nor an update for the kernel shipped with the iso. I'm starting to get the impression that whoever makes those decisions at ubuntu doesn't think it's very important if I lose data. I know I'm repeating myself, and it amazes me that in this community that should even be necessary, but that is Not Good.

mosgjig (mosgjig) wrote :

I came across this issue and found that the solution proposed by Lorant Nemeth on comment #122 worked, though with a slight twist. I was unable to install a fresh copy of intrepid because the liveCD could not mount the swap (too lazy to investigate after dealing with this mess), therefore I just installed the liveCD Jaunty 64bit and followed the instructions. So far so good, installed this morning at work and been gradually re-installing all kinds of apps and goodies with frequent reboots.

My specs

Asus M51Sn
4GB Ram
Intel Core2 Duo T8300 @ 2.4GHz
GeForce 9500m GS

Following the instructions, went from Jaunty amd64 live cd with kernel 2.6.28-11-generic to .28-13-generic to .29-02062904-generic before rebooting from install.

If ya don't hear from me in a couple of days, then take it as A solution to this ridiculous bug that ate 5 hrs of me-life. But seriously, other than these minor glitches, good work on the dist, hope to find some time and contribute some code one of these days.

Godspeed!

Thomas Aaron (tom-system76) wrote :

Could we please get an update on the prospects for fixing this bug?
It's been about two weeks since the above post.
Is it fixed in the *-14 kernel?

This thing is wreaking havoc on a lot of our older systems, and possibly a couple of our newer ones. Not only is it destroying data, it's destroying profit.

If there is any information we can add to help, please let us know.

Best Regards,
Tom
System76

swordthower (mnrjj) wrote :

I have successfully applied the fix in #122 as well. I have an ASUS N80Vb laptop. Everything seems to be working, and I have had no crashes or fs corruption after several reboots.

Fingers crossed...

mirix (miromoman) wrote :

The bug is fixed in Karmic Koala alpha 4 (kernel 2.6.31 RC5) and the 2.6.30 family is also bug-free in this respect.

Paradoxically, Koala seems faster and less bloated than Jackalope ;-)

SecuGuru (christopherthe1) wrote :

This bug still exists in the 2.6.28-14 amd64 kernel...interestingly, it didn't manifest in my system until I upgraded my RAM from 2GB to 4GB.

My system was down for mobo RMA (bad voltage reg) for the last 3 weeks. Was running fine since Jaunty first went live when I disassembled for RMA, root filesystem on ext3 partition.

The first indication of a problem came after I booted and let the update manager run...installation of the 2.6.28-15 kernel image keeps failing due to corrupted tarfile errors. Repeatedly tried to download it...some succeed but throw corruption error on unpacking, other attempts fail outright with 'package checksum mismatch.' I pulled it down manually via wget (~24MB), but the md5sum didn't match the published value for the package. Then I re-ran it and got a different value! And kept getting different values on subsequent md5sum runs.

My system is dual-boot XP, so I switch to windows and wget the .deb package again onto my NTFS partition. This time, md5sum returns the published value. Reboot using Ubuntu 8.10 (Intrepid, 32-bit) live CD and mount the NTFS partition read-only. I ran fsck on the ext3 partition, but it aborts as clean...so I force fsck and all checks OK. I run md5sum again on the package, it returns the expected value. I mount the ext3 partition and do a 'cp -av' to copy it to /var/cache/apt/archives for the update manager.

Here's where it gets fun. After the copy, I check the md5sum, and it's wrong. So, I check the md5sum on the original copy of the file on the read-only mounted NTFS partition...it's wrong too! WTF?

I reboot to WinXP and check the md5sum on the packages...the copy on the NTFS partition that was downloaded under XP returns the correct MD5 value, but the copy I made to the ext3 partition under the Intrepid Live CD is wrong. (I mount my ext3 partitions in WinXP using an ext2/3 volume manager) I delete the ext3 copy and use 'copy /b /v' to copy the package again from NTFS to ext3. This time, under WinXP the md5sum returns the correct value for both copies.
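
The repeated md5sum runs described above can be scripted so that the result is easier to attach to a report. This is a minimal sketch, not part of the original test: the helper names `md5_of_file`, `repeated_digests` and `is_stable` are invented here, and the file to check is whatever package was downloaded. Note that repeated reads may be served from the page cache, so a stable result does not prove the on-disk copy is intact.

```python
import hashlib


def md5_of_file(path, chunk_size=1024 * 1024):
    """Return the hex MD5 digest of the file at path, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def repeated_digests(path, runs=5):
    """Hash the same file several times and return every digest seen."""
    return [md5_of_file(path) for _ in range(runs)]


def is_stable(digests):
    """True when all runs produced the same digest (and at least one ran)."""
    return len(set(digests)) == 1
```

On a healthy system `is_stable(repeated_digests(path))` should be true and the digest should equal the published value; changing digests for an unchanged file match the symptom reported here.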

SUMMARY:
Problem doesn't seem to manifest until 4G RAM installed
Problem exists in both 32-bit and 64-bit kernels (errors under both Ubuntu 8.10 i386 Live CD and 9.04 AMD64 HDD installation currently on 2.6.28-14 kernel)
Problem DOES NOT manifest under WinXP boot, making it very unlikely hardware is the cause

SYSTEM SPECS:
Asus A8N-E mainboard (nForce4 Ultra)
Opteron 185 CPU (dual-core)
4GB Patriot DDR-400 (PC3200) SDRAM, CAS 2
Maxtor 1TB SATA-II HDD
**p1 = 250GB, NTFS
**p2 = 8GB, ext3
**p3 = 680GB, ext3
**p4 = 2GB, swap

SecuGuru (christopherthe1) wrote :

Addendum to previous comment's System Specs:

512MB nVidia 9800GT graphics card

SecuGuru (christopherthe1) wrote :

Installed 2.6.29 kernel per workaround suggestion (http://ubuntuforums.org/showpost.php?p=7382178&postcount=29) to no avail.

Data corruption appears to manifest only in files 8MB or larger. Attempting to update package ia32-libs via update manager results in failed download (hash mismatch). Using wget to pull the package manually results in different MD5 sums each time.

Same file downloaded under WinXP checks out with the correct MD5 sum every time.
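
To probe the apparent 8MB threshold without relying on downloads, one option is to write random data of several sizes and compare a checksum of what is read back. The sketch below is only an illustration under assumptions: `roundtrip_ok` and `sweep` are hypothetical helper names, the size list is arbitrary, and the read-back may come from the page cache rather than the disk, so a pass here is weaker evidence than a failure.

```python
import hashlib
import os
import tempfile


def roundtrip_ok(directory, size_bytes):
    """Write size_bytes of random data to a temporary file in directory,
    read it back, and report whether the SHA-256 digests agree."""
    payload = os.urandom(size_bytes)
    expected = hashlib.sha256(payload).hexdigest()
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # ask the kernel to push data to disk
        with open(path, "rb") as handle:
            actual = hashlib.sha256(handle.read()).hexdigest()
    finally:
        os.remove(path)
    return actual == expected


def sweep(directory, sizes_mib=(1, 4, 8, 16, 32)):
    """Run the round trip for each size (in MiB); return {size: passed}."""
    return {size: roundtrip_ok(directory, size * 1024 * 1024)
            for size in sizes_mib}
```

Calling `sweep` on a directory of the suspect partition and noting the first size that fails would show whether the threshold really sits near 8MB on a given machine.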

chastell (chastell) wrote :

Thanks for the detailed testing, SecuGuru. Can you try with 2.6.30 mainline kernel?
http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.30.5/

SecuGuru: can you try to reproduce the problem, booting separately with 'iommu=soft', 'iommu=off', 'mem=2G' please?

Each time, it's worthwhile catching the IOMMU settings with 'dmesg | grep -i iommu' after bootup.
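
When comparing several boots it helps to record which parameters were actually in effect alongside the IOMMU lines. The following is a rough sketch of that bookkeeping: `/proc/cmdline` is the standard Linux file holding the boot command line, while the helper names `parse_cmdline`, `read_cmdline` and `iommu_lines` are invented for this example. The filter does the same job as `grep -i iommu` on saved dmesg output.

```python
def parse_cmdline(cmdline_text):
    """Split a kernel command line into a dict.

    'key=value' tokens map key -> value; bare flags such as 'ro' map to None.
    """
    params = {}
    for token in cmdline_text.split():
        key, separator, value = token.partition("=")
        params[key] = value if separator else None
    return params


def read_cmdline(path="/proc/cmdline"):
    """Read and parse the command line of the running kernel (Linux only)."""
    with open(path, "r") as handle:
        return parse_cmdline(handle.read())


def iommu_lines(log_lines):
    """Keep only log lines mentioning 'iommu', ignoring case."""
    return [line for line in log_lines if "iommu" in line.lower()]
```

For each test boot, the values of `iommu` and `mem` from `read_cmdline()` can then be posted together with the filtered dmesg lines.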

Dr Emixam (dr.emixam) on 2009-10-12
Changed in linux (Ubuntu):
status: Triaged → In Progress
status: In Progress → Confirmed
Zakhar (alainb06) wrote :

I withdraw from this list. As I forecast 5 months ago (post 162), this bug is still not fixed, and now Karmic is out. So I'm no longer waiting for a fix for this bug and am skipping directly to Karmic 64, which is an awesome version.

Keep up the good work!

So none of the Jaunty kernel updates fixed it? Scary... I moved to the mainline kernel, since I was not able to work on my new laptop because of that bug. And I'm still on mainline, now 2.6.31-02063107-generic. I do not see a way to test other kernels, because that could trash my system. If only I had a spare system for dual boot...

raketenman (sesselastronaut) wrote :

I can confirm this bug with a 2.6.31-9-rt kernel
my dmesg:
[27216.779223] EXT4-fs error (device sda3): ext4_add_entry: bad entry in directory #859924: directory entry across blocks - offset=0, inode=3633236108, rec_len=180364, name_len=142
[27216.779231] Aborting journal on device sda3:8.
[27216.779448] EXT4-fs (sda3): Remounting filesystem read-only
[27216.780388] EXT4-fs error (device sda3) in ext4_delete_inode: Journal has aborted
[27216.780393] EXT4-fs error (device sda3) in ext4_create: IO failure
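
Log excerpts like the one above can be condensed per device before posting. This is a sketch under assumptions: the two regular expressions are written against the message wording shown in this excerpt only, other kernel versions or filesystems may phrase these messages differently, and `summarize` is a hypothetical helper name.

```python
import re

# Patterns modelled on the messages quoted in this comment; not exhaustive.
ERROR_RE = re.compile(r"(EXT[234]-fs) error \(device ([^)]+)\)")
REMOUNT_RE = re.compile(
    r"(EXT[234]-fs) \(([^)]+)\): Remounting filesystem read-only")


def summarize(log_lines):
    """Count filesystem error lines per device and note read-only remounts.

    Returns {device: {"errors": int, "remounted_ro": bool}}.
    """
    summary = {}
    for line in log_lines:
        match = ERROR_RE.search(line)
        if match:
            entry = summary.setdefault(
                match.group(2), {"errors": 0, "remounted_ro": False})
            entry["errors"] += 1
            continue
        match = REMOUNT_RE.search(line)
        if match:
            entry = summary.setdefault(
                match.group(2), {"errors": 0, "remounted_ro": False})
            entry["remounted_ro"] = True
    return summary
```

Feeding it the output of dmesg split into lines gives a quick per-device count of errors and whether the filesystem was remounted read-only.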

@raketenman, please post this, along with system details, to <email address hidden>; posting here isn't going to help

raketenman (sesselastronaut) wrote :

attached the lshw associated with the 2.6.31-9-rt kernel

raketenman (sesselastronaut) wrote :

thanks Daniel for this hint - mail is on the road!

Hi everybody. I'm coming here after a lot of searches about fs corruptions in Ubuntu. Description from the original poster seems to apply very well to my situation.

"Suddenly", already running apps start to seg fault, while new started ones usually report some error with shared objects (missing, not loadable due to header problems... I can't recall the exact messages). I got no data loss, maybe because when these errors start to show I shut down the system as quickly as possible. Usually shutdowns fail and sometimes I'm able to see the EXT3-fs error messages (similar to those in #177).

These are the BAD NEWS. Ubuntu is 9.10 Karmic, and:
$ uname -a
Linux frank 2.6.31-20-generic #57-Ubuntu SMP Mon Feb 8 09:05:19 UTC 2010 i686 GNU/Linux

My system is a 32-bit, 6-year-old Acer laptop. I can't remember when the bug first showed up, but it was certainly there with release 2.6.31-16 (or -15).

Now the GOOD NEWS (I hope). It seems I'm able to REPRODUCE IT!

Some weeks ago I tried to run AC3D 6.5, a (non-free) 3D modeler, and after some minutes exploring it, I closed it and got a strange error: "Unable to save configuration file in xxx" (or similar). Very peculiar, but the system seemed OK, so I forgot about it and went on; after a few minutes errors were so frequent that I had to shut down the system and fsck the disk from a Live USB.

I tried that software several more times, always with the same problem, even after some kernel upgrades. So I gave up on it and decided to give K-3D a try. I downloaded and started it. After the splash screen showed, the software had a segmentation fault AND the fs got corrupted once again. After another kernel upgrade, I repeated the test and got the same system failure. Then I tried Blender and... surprise, I experienced the bug once again. I tried another (let's say it) OpenGL (non-free) program, which I had started and operated successfully some months ago, and the bug was there.

From my little experience, I can conclude that any time I start some OpenGL 3D software, this bug shows up. Otherwise, I can keep my system up for days without any problem. Please note that I have Compiz disabled, because it has some glitches with Java Swing applications, and no screensaver running. Moreover, if I remember correctly (must check it), I had some mesa or radeon driver update in these months, between the last working execution of an OpenGL app and the first appearance of the bug.

I'm going to test with some other 3D app just to see whether that path goes anywhere.

I hope this report will help you.

Bye,
Marco

Marco, there is a known hardware data-corruption issue in certain revisions of Via VT82C586A/B/VT82C686/A/B/VT823x/A/C disk controllers; this is most likely the issue you're hitting.

I'm not aware of any workarounds. To confirm whether the disk controller is at fault, mount the internal 2.5" hard disk in a USB enclosure, boot off it, and see whether you can still reproduce the issue over USB.

Daniel, that seems quite strange, because I've been running Linux on this laptop since I bought it in 2003 and never had this kind of problem. It only showed up a few weeks ago, and it always follows the same pattern. However, I'm going to check it following your suggestion as soon as possible. Thank you for the pointer.

chastell (chastell) wrote :

Marco: In my case (64-bit Jaunty on a ThinkPad X301 + a 128 GB Samsung MMCQE28G SSD) the issue went away as soon as I switched to a vanilla (mainline) kernel: https://wiki.ubuntu.com/KernelTeam/MainlineBuilds

Can you try to reproduce your issue with one of these kernels? (I’ve been happily using 2.6.30.5 for quite some time now.)

I finally had some time to test other kernels, and these are my results.
Versions are expressed as shown in the "Installed version" column of Synaptic.

Ubuntu-specific kernels

2.6.31-20.57 - Doesn't work
2.6.31-19.56 - Doesn't work
2.6.31-18.55 - Doesn't work
2.6.31-17.54 - Doesn't work
2.6.28-16.55 - Works

Ubuntu mainline kernels

2.6.32-02063208 - Doesn't work
2.6.31-02063112 - Doesn't work
2.6.30-02063010 - Works

Chances are the bug was introduced in the vanilla 2.6.31 kernel. For now I'm staying with 2.6.30, which works and allows me to do the (simple) 3D tasks I need.
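
To make the boundary explicit, here is a small sketch that orders the versions listed above and reports the newest working and oldest failing kernel in each family. The helper names `version_key` and `regression_window` are hypothetical, and the code assumes the results are monotonic (everything after the first failing version also fails), which is true of the lists above but is not guaranteed in general.

```python
import re

def version_key(version):
    """Turn a string like '2.6.31-20.57' into a tuple of ints for sorting."""
    return tuple(int(part) for part in re.split(r"[.-]", version))

def regression_window(results):
    """results maps version -> True (works) or False (bug shows up).

    Returns (newest working version, oldest failing version).
    """
    ordered = sorted(results, key=version_key)
    good = [v for v in ordered if results[v]]
    bad = [v for v in ordered if not results[v]]
    return (good[-1] if good else None, bad[0] if bad else None)

# Results exactly as listed in this comment.
UBUNTU = {
    "2.6.31-20.57": False,
    "2.6.31-19.56": False,
    "2.6.31-18.55": False,
    "2.6.31-17.54": False,
    "2.6.28-16.55": True,
}
MAINLINE = {
    "2.6.32-02063208": False,
    "2.6.31-02063112": False,
    "2.6.30-02063010": True,
}

print("Ubuntu:  ", regression_window(UBUNTU))
print("Mainline:", regression_window(MAINLINE))
```

The mainline window (2.6.30 works, 2.6.31 does not) is the narrower one, so that is where a real bisection between kernel versions would have to start.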

I'm here for any other test or update. Sadly I still can't check the disk controller path (#182).

Bye,
Marco

chastell (chastell) wrote :

Marco: By „doesn’t work” do you mean that the bug manifests itself, or that the given kernel doesn’t work at all?

I’m asking because on my 64-bit Jaunty, mainline 2.6.32 doesn’t even boot properly (I haven’t tried 2.6.31; I went with mainline 2.6.30.10, which works very well).

The custom 2.6.28 kernel(s) that Ubuntu ships with Jaunty manifested this bug in my case, so I’m very reluctant to try any non-mainline kernel (the data loss is non-obvious and can happen to a backup when its drive is connected, so I don’t see a way to safely test a non-mainline kernel).

The first you said. On my machine, every kernel, both mainline and custom, just works. Apart from this very annoying problem, I haven't had a kernel panic in... 6 years?

However, I also never got a data loss, maybe because I'm quite used to recognise the symptoms and hard stop the computer before any loss can happen.

If the bug is not disk controller related, and if you are able to reproduce it like me, you could prepare an Ubuntu live USB stick, install the program that triggers the bug on it, update it with the kernel you want to test, boot from it and check. I have never done this before, so I can't foresee any practical problems with the procedure.

Bye,
Marco

Csimbi (turbotalicska) wrote :

Hi there,
I am afraid I have the same problem - the EXT4 file system getting corrupted over time.
I've built a NAS from Ubuntu 9.10 Server amd64. The system+temp is on an SSD drive, while the data is on a RAID6 array using an Adaptec 51645 card and 8 identical 1.5TB disks.
I use SSH/PuTTY to manage the box and Samba to access the data (filling it from my Windows machine, playing from XBMC on a Linux mini), and I never reboot the machine unless an update has been installed.

uname -r: 2.6.31-19-server
fstab: /dev/sdb1 /mnt/raid6 ext4 suid,dev,exec,nodelalloc 0 0

I never noticed anything wrong, but today I was looking for a file that was supposed to be there. The file was not there, but there were files from other directories(!). So, obviously there is something wrong.
I removed the mount entry from fstab, rebooted, then ran:
sudo fsck.ext4 -fyv /dev/sdb1
I got massive amounts of inode issues - I could not take a copy because the PuTTY buffer seems to be too small (the text scrolled off very quickly).
Right now it says "Clone multiply-claimed blocks? yes" and it's hanging - I understand it takes quite a while.
I wonder if I am a victim of the same corruption reported in this thread and whether I can fix it using fsck.ext4.
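
Since the terminal scrollback lost most of the output, one way to keep all of it is to have the command's output written to a file as it runs. Below is a minimal sketch; `run_and_log` and the log file name are hypothetical, and the fsck invocation is only shown in a comment (it is the command from this comment, which needs root and an unmounted filesystem, and `-y` answers yes to every repair prompt).

```python
import subprocess
import sys

def run_and_log(cmd, log_path):
    """Run cmd, echo its combined stdout/stderr, and save every line to log_path.

    Returns the command's exit status.
    """
    with open(log_path, "w") as log:
        proc = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # interleave errors with normal output
            text=True,
        )
        for line in proc.stdout:
            sys.stdout.write(line)  # still visible on the terminal
            log.write(line)         # and preserved in the log
        return proc.wait()

# Intended use, NOT executed here (requires root, filesystem unmounted):
# status = run_and_log(["fsck.ext4", "-fyv", "/dev/sdb1"], "fsck-sdb1.log")
```

The resulting log can then be attached to the bug report in full instead of a fragment.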

I would not like to lose any data, because I can't recover it from anywhere (these are HD family movies, pictures and the like, which exist nowhere else). I never planned on making backups because RAID6 offers good protection and using nodelalloc in the mount options should protect from power loss.

Please advise. Thank you.

Csimbi (turbotalicska) wrote :

I managed to grab a part of the long long output (this is just a fraction of the whole dump).
See attachment.

Tim McCormack (phyzome) wrote :

I wiped my Intrepid box and installed Karmic... and hit the bug. 2.6.31-14 still causes superblock corruption on my amd64 machine. Here are my specs:

Clevo M762T <http://www.clevo.com.tw/en/products/prodinfo_2.asp?productid=88> with 250 GB SATA Fujitsu MJA2250BH G2 drive. Intel Corporation ICH9M/M-E 2 port SATA IDE Controller.

I will attempt to set the drive to AHCI mode and try another installation.

Tim McCormack (phyzome) wrote :

I was able to set the drive to AHCI mode by setting OS compatibility in the (Phoenix?) BIOS to "Vista" (instead of "Other"), which unlocked an IDE vs. AHCI switch.

I was unable to reliably reproduce the bug while running in IDE mode (across several wipe-and-installs), but did not encounter it at all in AHCI mode. I kept it in that mode and restored my files, and have not seen corruption. (I did have to nuke my WinXP partition to do this. Win7 seems to be OK with AHCI, but probably needs a fresh install or a repair to accommodate the switchover.)

For testing I tried to use the iozone filesystem benchmarking tool from the repository in an effort to generate lots of file writes in different ways, but it did not do what I had hoped.
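
An alternative to a benchmark is a plain write-then-verify loop: write files with known checksums, read them back, and report any mismatch. The sketch below is only an illustration of that idea, not a known reproducer for this bug; `write_and_verify` and the file naming are my own assumptions. Note that an immediate read-back is likely served from the page cache, so to exercise the disk itself the verification pass would have to run after a remount or reboot.

```python
import hashlib
import os

def write_and_verify(directory, count=20, size=1 << 20):
    """Write `count` files of `size` random bytes, then re-read and compare.

    Returns the list of paths whose contents no longer match what was written.
    """
    expected = {}
    for i in range(count):
        path = os.path.join(directory, f"probe_{i:04d}.bin")
        data = os.urandom(size)
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # ask the kernel to push the data to disk
        expected[path] = hashlib.sha256(data).hexdigest()

    mismatches = []
    for path, digest in expected.items():
        with open(path, "rb") as f:
            if hashlib.sha256(f.read()).hexdigest() != digest:
                mismatches.append(path)
    return mismatches
```

Run it only against a scratch directory on a disk you can afford to lose, given what this bug does to filesystems.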

While my system now functions, the bug still lurks, waiting.

beej (beej) wrote :

manoj: are you still working on this?

Chelmite (steve-kelem) wrote :

I upgraded from Karmic to Lucid on my x86_64 box. I tried upgrading. When that didn't work, I resorted to formatting the drive and installing from scratch. The initial system works, but (a) doesn't have enough of the packages installed that I need for work, and (b) normal apt-get upgrade or synaptic updates put the system in a nearly unusable state.

Right now, when I boot, I get a purplish "starry" screen, the audible tom-toms, and nothing more...no login greeter, no panels, the mouse doesn't reveal anything on the periphery of the screen, right-clicking on the desktop doesn't bring up anything. I end up having to use a console to log in as root to do xhost +, then use another console to log in as me, then start xfce4-panel. Then I can use emacs. But, firefox, synaptic, thunderbird all get a segmentation fault.

I looked in /var/log/gdm.
What's interesting is that the crash happens in libc. When I run synaptic, it also crashes in libc, as reported in bug 577159. The following is from :0-greeter.log:
Window manager warning: Failed to read saved session file /var/lib/gdm/.config/metacity/sessions/10c5860066ae4f5bf1127424360828629600000013940005.ms: Failed to open file '/var/lib/gdm/.config/metacity/sessions/10c5860066ae4f5bf1127424360828629600000013940005.ms': No such file or directory
** (process:1406): DEBUG: Greeter session pid=1406 display=:0.0 xauthority=/var/run/gdm/auth-for-gdm-rEAtyV/database
gdm[1422]: ******************* START **********************************
gdm[1422]: [Thread debugging using libthread_db enabled]
gdm[1422]: 0x00007f667240744e in waitpid () from /lib/libpthread.so.0
gdm[1422]: #0 0x00007f667240744e in waitpid () from /lib/libpthread.so.0
gdm[1422]: #1 0x000000000042d02b in ?? ()
gdm[1422]: #2 0x000000000042d0d7 in ?? ()
gdm[1422]: #3 <signal handler called>
gdm[1422]: #4 0x00007f666e9827f0 in ?? () from /lib/libc.so.6
gdm[1422]: #5 0x00007f666f774a6a in __xmlParserInputBufferCreateFilename ()
gdm[1422]: from /usr/lib/libxml2.so.2
gdm[1422]: #6 0x00007f666f749d9d in xmlNewInputFromFile () from /usr/lib/libxml2.so.2
gdm[1422]: #7 0x00007f666f7647bb in xmlCtxtReadFile () from /usr/lib/libxml2.so.2
gdm[1422]: #8 0x00007f666fa6f786 in xkl_config_registry_load_from_file ()
gdm[1422]: from /usr/lib/libxklavier.so.16
gdm[1422]: #9 0x00007f666fa6fbe5 in xkl_config_registry_load_helper ()
gdm[1422]: from /usr/lib/libxklavier.so.16
gdm[1422]: #10 0x0000000000427a2c in ?? ()
gdm[1422]: #11 0x0000000000428018 in ?? ()
gdm[1422]: #12 0x00000000004279a4 in ?? ()
gdm[1422]: #13 0x0000000000424013 in ?? ()
gdm[1422]: #14 0x00000000004242a8 in ?? ()
gdm[1422]: #15 0x00000000004278f2 in ?? ()
gdm[1422]: #16 0x00007f666f2ee935 in g_type_create_instance ()
gdm[1422]: from /usr/lib/libgobject-2.0.so.0
gdm[1422]: #17 0x00007f666f2d283c in ?? () from /usr/lib/libgobject-2.0.so.0
gdm[1422]: #18 0x0000000000422886 in ?? ()
gdm[1422]: #19 0x00007f666f2d3841 in g_object_newv () from /usr/lib/libgobject-2.0.so.0
gdm[1422]: #20 0x00007f666f2d42ad in g_object_new_valist ()
gdm[1422]: from /usr/lib/libgobject-2.0.so.0
gdm[1422]: #21 0x00007f666f2d44f1 in g_object_new () from /usr/lib/libgobject-2.0.so.0
gdm[1422]: #22 0...


Emily Wind (emilywind) wrote :

It seems this bug is related to https://bugs.launchpad.net/ubuntu/+source/linux/+bug/346691, which seems to randomly affect different 64bit kernel releases and not others. This would explain why the error report on the Ubuntu forums about this dated back to 2008 and such. If the developers looked for a patch pattern within the affected kernels, that would likely be a good start.

I think the reason this recently started affecting me a lot might be the latest kernel (2.6.32-22). I did not have the issue at all with kernel 2.6.32-21, as GUmeR reports, so reverting to that is the best bet for avoiding this issue for now. Cheers.

Emily Wind (emilywind) wrote :

Disregard the above post, except for the points about looking at patch patterns in the affected kernels and that 2.6.32-21 did not have the issue for me and GUmeR who posted in this bug report which seems to cover the same issue: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/515937

Cheers.

Emily Wind (emilywind) wrote :

These are the bugs fixed in 2.6.32-22 according to the update-manager along with https://lists.ubuntu.com/archives/lucid-changes/2010-April/011181.html

[ Andy Whitcroft ]
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/526354

[ Tim Gardner ]
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/567016

It is likely that one of those patches accidentally broke something and caused this error, for example if it was written without 64-bit in mind. I am going to contact Manoj about this to get his attention. Cheers.

Chelmite (steve-kelem) wrote :

I have the problem in 2.6.32.21.

Emily Wind (emilywind) wrote :

It seems that could possibly be unrelated at this point, but https://bugzilla.kernel.org/show_bug.cgi?id=16006 seems to have an answer. It is my error, but reading some of the comments here makes me think this bug report might not be the same as 515937, but could be causing some of the issues reported here. Hopefully we can get the ball rolling on a kernel update soon. :)

Chelmite (steve-kelem) wrote :

I installed the new kernel, 2.6.32, and still have the same problem with the greeter, synaptic, firefox, and thunderbird getting segmentation faults. The gdb traceback for synaptic hints that the problem is in libc. The top of the traceback follows. It looks to my (partially trained) eyes as if there's a problem with strncmp/strcmp for x86_64. This may affect the kernel, but the effect I'm seeing is in programs outside the kernel.

Program received signal SIGSEGV, Segmentation fault.
__strncmp_ssse3 () at ../sysdeps/x86_64/multiarch/../strcmp.S:100
100 ../sysdeps/x86_64/multiarch/../strcmp.S: No such file or directory.
 in ../sysdeps/x86_64/multiarch/../strcmp.S
(gdb) where
#0 __strncmp_ssse3 () at ../sysdeps/x86_64/multiarch/../strcmp.S:100
#1 0x00007ffff6dd6a6a in __xmlParserInputBufferCreateFilename ()
   from /usr/lib/libxml2.so.2

tags: added: cherry-pick

®om (rom1v) wrote :

The bug was fixed in later kernel versions, but it seems to have reappeared in 2.6.35-19 (in the Maverick beta): https://bugs.launchpad.net/ubuntu/+source/linux/+bug/636430

Pete Graner (pgraner) on 2011-01-10
Changed in linux (Ubuntu):
status: Confirmed → Fix Released

Tim McCormack (phyzome) wrote :

Pete, what was the actual bug, and where is the fix released?

RobM (robert-meerman) wrote :

I second that - what was/is the bug, and where can I obtain the fix?

In other words - in which "stock" Ubuntu kernel was it fixed?

--
Dmitry
