2.6.28-11 causes massive data corruption on 64 bit installations

Reported by Graziano on 2009-03-22
This bug affects 42 people
Affects: linux (Ubuntu)
Importance: Critical
Assigned to: Manoj Iyer
Declined for Dapper by Pete Graner
Declined for Hardy by Pete Graner
Declined for Intrepid by Pete Graner
Declined for Jaunty by Pete Graner
Declined for Karmic by Pete Graner
Declined for Lucid by Pete Graner

Bug Description

Binary package hint: linux-image-2.6.28-11-generic

I have been running Ubuntu on my notebook for more than two years, starting with Gutsy and going through all the development releases. I have always used JFS as my root filesystem, with the NVIDIA binary drivers.
The notebook is a COMPAL clone from an Italian assembler, SANTECH. The lshw output is attached.

-) I started upgrading to Jaunty a week after Intrepid was out. All upgrades went fine for months, until last week's switch to the 2.6.28-11 kernel. Without any system crash, JFS silently started to glitch. Files started to disappear. There were no messages from the kernel; applications simply started to segfault. First apt, then python, then the system was lost. I repeat: when the problem occurred there were no kernel messages, just applications segfaulting.
-) Rebooting led to JFS repairing itself, each time losing files (the messages during repair were something like d_ino != inode; I can't remember exactly. Read on).
-) My first thought was that something was wrong with the disk, so I booted from SystemRescueCd and ran a smartctl long test on the disk. No problem was reported, and the disk is still far from the end of its life.
-) My second thought was that something had gone wrong with the filesystem and, having read the ext4 threads, I thought the developers had broken something in journaling, but I was told this was not the case.
-) As my system was in a terrible mess, with tens of files lost, I installed a fresh Jaunty alpha 6. The installation went fine. The system was up again. No installation problems.
-) I went on to upgrade the system and installed 2.6.28-11 again. After the reboot, the same problems as before: applications started to segfault one after another. At the next reboot, filesystem problems again....
-) Next try: this time I installed on ext4. Again the fresh alpha 6 installation was fine, and again the system upgrade brought the same behaviour, this time leading to complete filesystem woes (unable to mount).
-) I installed a fresh alpha 6 once more, this time staying on 2.6.28-9. This setup seems stable.

Summary of this.

I do not have much to report, as there are no messages from the kernel when the problems appear, only segfaults from applications. It seems that my disk I/O goes crazy after switching to 2.6.28-11. As this happens with both JFS AND ext4, it seems to be related to the controller and/or the common I/O path.
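
The pattern reported here (2.6.28-9 stable, 2.6.28-11 affected) can be checked before trusting a machine with data. What follows is only a minimal sketch: it assumes the kernel release string has the form printed by uname -r (for example 2.6.28-11-generic), the function name is hypothetical, and the version pattern reflects only what is reported in this thread.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit status 0) if the given kernel release
# string belongs to the 2.6.28-11 series reported as affected in this bug.
is_affected_kernel() {
    case "$1" in
        2.6.28-11|2.6.28-11-*) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage (reads the running kernel release from uname -r):
#   if is_affected_kernel "$(uname -r)"; then
#       echo "Running a 2.6.28-11 kernel; consider booting 2.6.28-9." >&2
#   fi
```

The check says nothing about the hardware, which the reports suggest also plays a part.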

I have no more hints....

BoomSie (gideon-poort) wrote :

Same issue over here. On Saturday I decided to give Jaunty a try, after a colleague of mine warned me about the issues with ext4.

I made sure that the dist-upgrade would neither touch NOR update the filesystem. Nevertheless:

* First boot OK, although nautilus & applets in gnome crashed a few times
* Reboot and everything was f*cked
* sbin/init* was gone
* fsck'd both home&root partition to recover
* yesterday night it was up and running again, so I figured I could work 'normally' today on my laptop again
* this morning I booted and logged into gnome -> CRASH. It looked like configuration issues at first; then I figured out that my root filesystem was mounted read-only. So again, fsck and a shitload of files in Lost+Found now

I have a Compal as well btw. A JFL92 (see attachment)

Hope you guys figure out what's going on before the official release next month.

Cheers & keep up the good work

Changed in linux (Ubuntu):
importance: Undecided → High
status: New → Triaged
BoomSie (gideon-poort) wrote :

Yes Graziano, you hit the nail on the head, I'm afraid.

I'm not really familiar with how these patches work or when to expect this update/upgrade to be available in the Alpha/Beta repos, so I'll watch from a distance for the coming week(s).

(Unless someone can guarantee me that the fix is already there; then I'm MORE than happy to do a clean alpha 6 install to fiddle around some more with this stunning new release)

~..~
(oo) <<< MOOh

Well, after some days of work on the machine, I can say that at least Jaunty alpha 6 is stable (with all updates installed) IF I keep running the 2.6.28-9 kernel (I have modified the default target in grub's menu.lst, just to be sure).
As of today, on the same system, booting 2.6.28-11 leads to filesystem faults. Without a system rescue CD with filesystem repair tools at hand (and some luck regarding where the faults happen), the system becomes unusable.
Note that on a desktop system at home (AsRock eSata motherboard, Core E6400 processor) the 2.6.28-11 kernel does not have these problems (or at least they are not so visible: tracker sometimes segfaults, but the system does not crash and the filesystem is preserved).
It seems to be something triggered by the hardware, but latent in the code path.
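
Changing the default target in menu.lst means pointing the "default" line at the entry for the 2.6.28-9 kernel. GRUB legacy numbers boot entries from 0 in the order their "title" lines appear, so the index can be looked up with a small helper. This is only a sketch: the function name is hypothetical, and the path in the usage note is assumed to be the usual Ubuntu location.

```shell
#!/bin/sh
# Hypothetical helper: print the zero-based index of the first boot entry in
# a GRUB legacy menu.lst whose "title" line mentions the given kernel version.
# Commented-out lines (starting with #) are ignored because they do not begin
# with "title". Exit status is 1 if no entry matches.
grub_entry_index() {
    menu_file=$1
    kernel=$2
    awk -v pat="$kernel" '
        BEGIN { n = 0; found = 0 }
        /^title/ {
            if (index($0, pat) > 0) { print n; found = 1; exit }
            n++
        }
        END { exit (found ? 0 : 1) }
    ' "$menu_file"
}

# Usage (assumed path; adjust if your menu.lst lives elsewhere):
#   grub_entry_index /boot/grub/menu.lst 2.6.28-9-generic
# The printed number is the value to put on the "default" line.
```

The index can shift whenever kernels are added or removed, so it is worth rechecking after each kernel update.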

I am completely with BoomSie: the official release cannot live with these problems around.

BoomSie (gideon-poort) wrote :

Bug remains in the Beta release of Jaunty too. How can one tell whether this bug is 'fixed' in a release or not? (without rereading the entire changelog)

Manoj Iyer (manjo) wrote :

The patch mentioned above is already in the jaunty Kernel.

god-mok (god-mok) wrote :

I have the same problem on my Clevo M57RU. As long as I boot the older 2.6.28-9 kernel, everything is fine.

An hour ago I tried the 28-11 kernel and it worked for a few minutes. But when I tried to search in Synaptic, it closed. In the terminal I got only a core dump. After that I couldn't do anything with apt-get or with some other programs. My WLAN and every other network connection were dead.

While I left it alone and looked for logs, the contents of my windows seemed to disappear. Under Nautilus the borders were there, but the content, menus and buttons were gone. After I had recovered everything, I tried the same under KDE. The system behaved the same way, but there the contents of the windows did not disappear.

The window contents didn't disappear every time I tried it. My system often froze before that point.

After the freeze the system can't mount my user partition or get any network to work. The log files didn't show anything strange, until I noticed that all the log entries were from before the freeze. When I tried to run dpkg, it told me that it could not write to the /var/cache/apt folder. I looked there, and everything was fine.

When I started gdm, it seemed that X couldn't initialize from xorg.conf. So I looked at it, but nothing was wrong with it. I ran apt-get update as a test, and it told me that the system seemed to be in a read-only state.

No wonder that after the freeze all the logs were untouched.

After that I tried to boot 28-9, but no luck. There the devil spread out his wings. I hope someone will look into this.

Oh yeah: after a fresh reinstall from the beta ISO I had the same problem from the very beginning, without having done anything; I had only looked at files with gedit and Nautilus. Stranger still, I could not see any driver under hardware-driver, yet it worked on the USB stick and the live CD.

And sorry for my bad English :)

Manoj Iyer (manjo) on 2009-04-03
summary: - jaunty kernel 2.6.28-11 kernel destroy system
+ jaunty kernel 2.6.28-11 kernel update renders the system un-usable.

I suspected the patch mentioned above of causing a lot of trouble, but the problem MUST also be related to the hardware configuration.

I tried 2.6.28-11 again today and got EXACTLY the same behaviour as god-mok. I was really lucky to recover the filesystem and get back here. AGAIN: no problem with 2.6.28-9. The system becomes unusable after running 2.6.28-11 for more than a few minutes (actually the whole filesystem can be lost; you may call it unusable, but I prefer destroyed, as I have had to recover from backups multiple times).

I repeat: it seems to be ALSO related to hardware, as I have tried the same procedure on multiple machines (fresh install with 2.6.28-9, then update to 2.6.28-11) and had problems ONLY on my notebook.

god-mok, would you please post the output of lshw, as I have not found complete specs for your Clevo?
Judging from a hardware summary, it seems really similar to the COMPAL setup:

• 17" WSXGA active-matrix glare TFT (1680*1050)
• incl. 1.3 MPix webcam
• nVIDIA GeForce 8800M GTX 512MB
• Core2Duo T9300 / 2.5 GHz, 6 MB / 800 MHz
• 4096 MB (2x2048) SO-DIMM DDR2 800MHz
• 200 GB / 7200 rpm S-ATA
• DVD±R/±RW DL (dual layer) 8x/8x multi-standard burner
• Intel Wireless WiFi Link 4965AGN
• internal Bluetooth module

But I am really curious about chipset used. Let us sort this trouble out!

Manoj Iyer (manjo) on 2009-04-03
Changed in linux (Ubuntu):
assignee: nobody → manjo
god-mok (god-mok) wrote :

Sorry, I totally forgot. Here is my lshw file.

At the moment I have no battery attached, but that doesn't matter, because it was the same with it.
And yeah, our hardware seems very similar.
The only difference is that I do not have the Bluetooth module. Everything else is very similar to my hardware.

Oh yeah, and another thing: I tried to recover files with testdisk and some other programs, but most of it failed. I found some pictures and some movie files (completely fragmented), so it was a total waste of time.

But before that, I checked the partitions like everyone else. I even tried both ext3 and ext4 on my home partition, but the result was the same: many inode issues (199), and after the check nothing had changed. I reran the check and the same thing happened again, as nothing could be changed (read-only).
Because I checked the disk every time, at one point something odd happened: fsck tried to change the filesystem from ext4 to ext2. I don't know why that happened, and it didn't work either, but I tried :)

Maybe the last point doesn't matter, but to me it was very strange...

Oh, and from the live CD I could not find the home partition with gparted, but the root partition was OK, and I could even mount it.

Well, I still wonder why the system thinks it is in a read-only state when I can manage, change and do everything with the files...

If you need any more specific hardware details, just ask. I will gladly look them up in my papers.

mirix (miromoman) wrote :

Installing 9.04 beta (AMD) via the Live CD and then compiling the kernel to 2.6.29.1 (before doing any other update via Aptitude/Synaptic) renders a very stable and "performant" system.

god-mok (god-mok) wrote :

@mirix: too bad. If someone knows what he has to do, then it's no problem, but I don't think that's the idea behind all this. As long as there is no support, it's not such a good idea, right?

Today I had another crash. I reinstalled again, and after the reboot the bug appeared at the first boot, not, as always before, at the second one after the freeze. So once again I mounted my home partition manually, and everything worked so far. I installed updates and drivers, added some repos to my sources list, and then there was the freeze again.

I thought it could be a problem with some new package, but it was not, because I reinstalled again and after the first safe boot I didn't do anything. I let it run for almost 30 minutes, then I rebooted. Same problem, so I had to mount manually, and sometimes even that didn't work.
There was a notice after the failed boot, something like "two files share same sector/inode" in a folder "/home/god-mok/???/a9...x86_64...". Too bad: I have no "???" folder and I didn't memorize the whole number, so that's all for now. The number looked like an MD5 hash up to the x86_64 part. I have no idea where that came from after a fresh install with formatted partitions.

mirix (miromoman) wrote :

@god-mok:

when you install a beta version you take some risks. as far as I know, the problem only happens when you upgrade the kernel. so if you stick to the kernel provided by the installation CD (2.6.28-9) until the problem is solved, you should be safe. I guess this is a problem specific to the Ubuntu kernel 2.6.28-11.

I upgraded to the latest Linux kernel (2.6.29.1) because I wanted to. but I was not recommending or even suggesting to anybody to do so.

cheers

Just to bump. The problem is still here. 2.6.28-9 is stable, 2.6.28-11 is unusable on this hardware. As all the hardware currently works, I have no problem living with -9, but I am worried about the problem.

Today I tried again with 2.6.28-11.41, only to have programs segfaulting after a minute or two and a filesystem repair at startup. I am on ext4 now and cannot report on JFS anymore.

I have read all the patches introduced in the -9 -> -11 jump, and I was unable to find anything really interesting.
Jaunty will certainly ship with 2.6.28, and as we are now 10 days from release, the problem is better handled by some kernel people.

I decided to run unstable just to help, as I am a somewhat experienced user who knows how to recover from backups and how to live with system hiccups. If we are here to build a community system, someone has to test it and discover bugs so that inexperienced users can avoid them. Ubuntu was something about community, or am I missing something?

I have plenty of options for living on the bleeding edge; I am here just to help the Ubuntu project. I won't stay with Jaunty; I will jump to Karmic as soon as its development phase starts. Plenty of bug reporters will be available for the official release, and I am of more help ahead of them.

Steffen Rusitschka (rusi) wrote :

I'm also a bit concerned about the RC/final of Jaunty... Anyway, here's a short summary of all the lshw.txt files attached to this bug and its duplicate:

Common to all machines:
- Intel Core 2 Duo
- 4 GB RAM
- 4965 AG(N) WiFi
- Intel 965 Memory Controller
- Nvidia Graphics Card: 8x00M

I'm not sure if everyone is running a 64-bit version of Jaunty.

But: those hardware combinations are far from being exotic - almost all new Laptops have a similar configuration...

KJ (cortexbuster) wrote :

I have exactly the same HW and I'm running the 64 bit Version of Jaunty.
And I experience also the same problems you all have.
I can't run 2.6.28-11 without severe data loss.

Louis-Dominique Dubeau (ldd) wrote :

I'm experiencing the same problem on a Compal IFL90 (aka Sager NP2090). Running Jaunty 64-bit.

No data loss on my side but I've experienced some random kernel and process crashes. (Emacs works fine but a few hours later all executions of emacs result in automatic segfaults!)

Downgrading to 2.6.28-9 fixed this issue but created other issues. I'm now running into this bug:

https://bugs.launchpad.net/ubuntu/+source/pulseaudio/+bug/330814

Fixing it requires a kernel upgrade!

KJ (cortexbuster) wrote :

the trouble continues with linux-image-2.6.28-11-generic (2.6.28-11.42)
filesystem access fails after a few minutes.

Kevin W. (eyecreate) wrote :

I am also seeing the same things as the people above. I am currently running off the live CD and found this bug report, which describes my situation exactly. I have exactly the system specs in the common list, and the problem only seemed to appear after the kernel upgrade. I hope there is a fix for the release.

Attached lshw just in case.

Kevin W. (eyecreate) wrote :

I forgot to say on the above that I installed with the RC.

KJ (cortexbuster) wrote :

since this is such a showstopper for everyone with this hardware config I'd really appreciate any dev comment.
is anybody working on the issue?
do you need more information?
what can be done to track down the error?
we're close to the official release. this could turn into a disaster for a lot of regular users upgrading to jaunty.
I'm concerned about the quality of ubuntu. it's a great os. I use it on a couple of servers as well as desktops / notebooks. but such a widespread error so close to a release can do the project real harm.
just my two cents.

Kevin W. (eyecreate) wrote :

I just did a disk repair in order to boot up again. Because I'd rather not lose anything else, I'm going to use the ppa for .29 kernel until something better is shown here.

Kevin W. (eyecreate) wrote :

I found out after rebooting to try and install the new kernel that my partition was too far gone to boot up anymore, so I had to reinstall. I also want to add that I found out that when I reinstalled Kubuntu RC for the fourth time that even if I don't upgrade to the latest kernel, it still messes things up. IDK what kernel is in the RC by default, but it seems I will for sure have to use a different kernel. Here goes a fifth time.

Kevin W. (eyecreate) wrote :

Oooh, I found something else interesting out. It's too bad it's this close to release, but it seems the upgrade to the .29 kernel also fixed a bug I(and others) seemed to have about network manager not connecting to encrypted networks. So far, I am able to connect to my WEP wifi hotspot which I couldn't do on livecd or fresh install. I will try my university's WPA2 Enterprise connection next. I do wish this was the default kernel in Jaunty, because it'd make things work better and make life less difficult.

IIRC Jaunty will definitively ship with a 2.6.28 kernel.

https://lists.ubuntu.com/archives/kernel-team/2009-February/004321.html

I can confirm the bug is not present with mainline 2.6.29.1, but this will only be of some help for Karmic.
For people who do not know how to get mainline, here you can find "stock" Linux kernels compiled for Ubuntu:

http://kernel.ubuntu.com/~kernel-ppa/mainline/

I do not recommend doing this in any way, but if you really have problems, this is the way to go. Remember that you will not get help from Ubuntu developers on kernel-related problems when using mainline!
If you want to use the Ubuntu kernel, keep reading this thread for possible solutions. As we are really close to release, I expect this bug to be tracked down by developers after April 24.

Kevin W. (eyecreate) wrote :

Just to comment on my previous comment, it seems WPA2E still doesn't work.

Kevin W. (eyecreate) wrote :

I have gotten WPA2E to work by using wicd as my network manager. Sounds like I have another bug to search for.

Steffen Rusitschka (rusi) wrote :

Did anyone try if the final version of Jaunty still has this issue?

AFAIK 24/04 is release date for jaunty and on my system the current ubuntu kernel version (not the one I am running with, but the one linked to the linux-image-generic) is 2.6.28-11.42. So the message from KJ above,

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/346691/comments/18

answers your question. The last one I personally tried is 11.41.

Ivo Smits (ivo-ufo-net) wrote :

I have seen the same problem here on my HP Compaq 8510w. While downloading updates (using the update manager) the filesystem suddenly locked up (switched to read-only mode). At that moment I had one virtual machine (VirtualBox) running as well, but I do not think that's related to the problem.

After trying ubuntu for the first time, a friend of mine told me about this problem (exactly the same hardware). I told him it couldn't be Ubuntu's fault and finally decided to try myself. My system worked fine for at least a week before it crashed, and fsck was able to repair the filesystem.

I have Ubuntu Jaunty amd64 running on a VMWare virtual machine and have not seen any problems there (yep.. the issue seems to be hardware-related...).

I'll attach my lshw result as well.

®om (rom1v) wrote :

Maybe related : bug 350268

Siegfried Gevatter (rainct) wrote :

Seems like it's the same problem I have... http://bloc.eurion.net/archives/2009/one-week-with-debian/

KJ (cortexbuster) wrote :

as far as I see yes.
either the current 2.6.29 mainline kernel or 2.6.28-9 can help you out.

fuchur (ckellner-gmx) wrote :

Ok, as this seems the original bug report:

Until this is fixed, I am thinking about using a new kernel, but

a) which problems could this cause - there must be a reason that kernels are never replaced during a release cycle

b) how would I do this - apt-get typically hangs after a few packages are downloaded, and the file system quickly gets read-only.
could I install the new kernel on intrepid and then dist-upgrade to jaunty? what is the correct procedure?

c) If this bug was known on March 20, and confirmed some days before the release, why is there not at least some remark in the release notes?

giorgio130 (gm89) wrote :

same problem here on a compal jhl90. I managed to install 2.6.29 and it seems to run fine now.

str0g (buskol-waw-pl) wrote :

I have the same bug on my JHL90, but downgrading the kernel doesn't work :/

adamski (adam-hasselbalch) wrote :

I am amazed that this bug is not marked as Critical!

Obviously, this affects a great deal of users in a way that is extremely destructive.

In my opinion, this bug alone renders Jaunty completely unfit for a production environment! There's no getting around that data loss and file system corruption due to a kernel error is absolutely and 100% unacceptable in a so-called "stable" release. The fact that it apparently happens on very common hardware does not help.

Sorry for the harsh words, but this is simply Not Good Enough!

str0g (buskol-waw-pl) wrote :

I'll tell you why it isn't critical: my desktop (C2Q, X38 + ICH9R, 4 GB, 3 HDDs) works great with it, and my friend's laptop with an X2 and a desktop with an X2 also work. It's really hard to say what went wrong, but obviously the developers should test the system on modern laptops to avoid this kind of problem...

I've installed this kernel: Linux lukasz-laptop 2.6.29-02062901-generic #02062901 SMP Fri Apr 3 13:36:07 UTC 2009 x86_64 GNU/Linux

I have some minor issues: today's acpi update cannot be installed, and there are some minor errors with the kernel header installation. But the system is stable, and there is no data loss.

mirix (miromoman) wrote :

I agree that it is not acceptable to release a so called "stable" version being aware of such a serious bug.

Fuchur:

I have compiled kernel 2.6.29.1 (2.6.29.2 is already available but I have not tested it) a few weeks ago and I have not had any issues since then.

You can download precompiled packages from the Ubuntu site (I do not have the URL) and install them with dpkg.

A few people describe easy ways to compile it from source:

http://izanbardprince.wordpress.com/2009/03/26/how-to-fix-ubuntu-jaunty-warning-hacks-ahead/

http://koroshiyaitchy.wordpress.com/2009/04/25/ubuntu-904-jaunty-jackalope-customised-for-performance-on-a-nexoc-osiris-e705iii-clevo-m57ru-laptop/

I followed these older instructions:

http://symbolik.wordpress.com/2007/11/10/vanilla-kernel-26231-on-gutsy-gibbon/

Just changing the obvious parts. I guess all three methods are actually the same. Just a few kernel configuration options change.

The only annoying and unresolved issue I have found this far is related to this:

http://ubuntuforums.org/showthread.php?p=3593262

I have followed the instructions on that how-to to no avail. I have also tried a Gentoo method with uvesafb with no better luck. However, regular Ubuntu installations also give similar problems if you install the proprietary NVIDIA or ATI drivers.

In fact, given the great deal of manual configuration I have ended up carrying out, I am seriously considering moving back to good old Debian, which is far more stable, faster and less buggy than Ubuntu. Ubuntu is more modern, but less so than, for instance, Fedora.

kikvors (kikvors) wrote :

I can tell you that I lost two days' work to this "bug". Next time I will not assume a final release is stable, and I will wait before upgrading.

KJ (cortexbuster) wrote :

fuchur:
actually I first installed 8.10 and performed a dist-upgrade. after the first reboot I selected the old ubuntu 8.10 kernel in the grub menu. once the system was up and running again I downloaded the 2.6.28-9 debs from packages.ubuntu.com and installed them.

the strict ubuntu release cycle harms the renown of ubuntu. such bugs should simply delay a release. there are so many laptops out there which suffer from this bug.

®om (rom1v) wrote :

I agree this bug should be critical.
It seems to affect only 64-bit Jaunty, but it makes the system totally unusable.

Once it is resolved, a new Jaunty .iso must be released (9.04.1), because it is not possible to install the current final release (which doesn't work) and then apt-get upgrade (which segfaults because of this kernel).

Jaunty alpha4 was more stable...

giorgio130 (gm89) wrote :

@ ®om: let's hope it's solved before karmic....

You can work around the apt-get segfault with "sudo rm /var/cache/apt/*.bin" and then install a different kernel.
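
The workaround removes apt's binary cache files so that apt rebuilds them on its next run. A slightly more careful sketch of the same step follows; the function name is hypothetical, the directory defaults to the path quoted in the comment, and the optional argument exists only so the helper can be tried on a scratch directory first.

```shell
#!/bin/sh
# Hypothetical helper: remove apt's binary cache files (*.bin) from a cache
# directory and report each file removed. Other files are left alone.
clear_apt_bin_cache() {
    cache_dir=${1:-/var/cache/apt}
    for f in "$cache_dir"/*.bin; do
        # Skip the unexpanded pattern when no .bin file exists.
        [ -e "$f" ] || continue
        rm -f -- "$f" && echo "removed $f"
    done
}

# Usage (needs root for the real cache directory):
#   sudo sh -c '. ./clear_apt_cache.sh; clear_apt_bin_cache'
# clear_apt_cache.sh is an assumed file name for this snippet.
```

This only gets apt running again; it does not repair files that were already corrupted.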

But apt-get is not the only bug : many files could be corrupted... cf my duplicate of this bug : bug 350268
How to be sure there is no problem after installing a new kernel which fixes the problem?

giorgio130 (gm89) wrote :

After you've installed kernel 2.6.29 or 2.6.28-9, run a fsck from a live cd.
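
The check is run from a live CD because a filesystem should not be repaired while it is mounted read-write. A minimal sketch of a guard for that follows. The function name and the example device /dev/sda1 are assumptions, and the lookup is a plain string match on the first column of the mounts table, so a root filesystem listed under another name (for example by UUID) would not be detected.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if the given device appears as a mount source
# in a mounts table (default: /proc/mounts).
is_mounted() {
    device=$1
    mounts=${2:-/proc/mounts}
    awk -v dev="$device" '
        $1 == dev { found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$mounts"
}

# Usage from a live CD (/dev/sda1 is only an example device name):
#   if is_mounted /dev/sda1; then
#       echo "/dev/sda1 is mounted; unmount it first" >&2
#   else
#       sudo fsck -f /dev/sda1   # -f is passed on to the filesystem checker
#   fi
```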

str0g (buskol-waw-pl) wrote :

2.6.29-02062901-generic
fsck found no error, works good for me.

KJ (cortexbuster) wrote :

mainline kernel 2.6.29 indeed does not contain the bug anymore. but this is quite inconvenient for the regular user since the modules for proprietary drivers (nvidia) are missing...

2009/5/2 KJ <email address hidden>:
> mainline kernel 2.6.29 indeed does not contain the bug anymore. but this
> is quite inconvenient for the regular user since the modules for
> proprietary drivers (nvidia) are missing...

They work perfectly here. If you install the new kernel using one of
the .deb's from kernel.ubuntu.com the nvidia drivers will
automatically get rebuilt for it.

Indeed, Nvidia works in the current 2.6.29 perfectly. But VMWare Server won't compile the modules needed, even though 2.6.29 headers are installed and even though the gcc version it requests is installed...

I had the same problem.
The system was running fine until I updated to Jaunty (64-bit). After the second reboot the automatic fsck failed (on one of my ext3 partitions), so I repaired it manually. Some data was lost, so I bought an external USB hard disk and started a backup. The system froze and didn't boot anymore. I tried to repair it, without success. I reinstalled (Jaunty again, kernel 2.6.28-11), but the backup (on the external disk, with vfat) was also corrupted. :(
dmesg was full of error messages about the filesystem, a lot of programs (sudo, for example) stopped working, and the filesystem was remounted read-only.
When I booted the Jaunty live CD, the output of fdisk -l /dev/sda was very strange (error messages about partitions that begin and end inside another partition; I don't have this output, and even if I had saved it, it would have been lost).
Using fdisk I recreated the partitions and installed Jaunty again (replacing ext3 with ext4). The filesystem was full of errors again. I found this bug, so I tried to install another kernel from kernel.ubuntu.com (2.6.30-020630rc4-generic). (That isn't easy when /var/lib/dpkg/available is also a corrupted file.)

OK, there isn't a lot of useful information in this post; maybe just the part about the fdisk output. I hope that helps.

Lorant Nemeth (loci) wrote :

On my T61p I got all sorts of filesystem problems, just like Martin Peterka (no I/O errors in dmesg). Applications were crashing with memory corruption errors and so on, so I decided to reinstall the system from scratch. The install went fine, but after the second reboot I was dropped to the initrd prompt, which said that init could not be found, although I was able to mount the rootfs manually. I'll do a new reinstall and check when things get screwed up. I'll let you know about the result.

Lorant Nemeth (loci) wrote :

Reinstall done:
- first boot ok
- install all upgrades
- reboot ok
- install nvidia 180 driver
- first reboot stuck before splash screen !?!
- second reboot successful
- apt-get install wireshark tshark vim-full compizconfig-settings-manager mc ----> complains about corrupted /var/lib/dpkg/status
- dmesg still doesn't show any io errors, but I'll check smart once I fix dpkg

Lorant Nemeth (loci) wrote :

The saga continues:
- cp /var/lib/dpkg/status-old /var/lib/dpkg/status
- apt-get -f install runs fine
root@kolibri:~# apt-get install wireshark tshark vim-full compizconfig-settings-manager mc
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following extra packages will be installed:
  libadns1 liblua5.1-0 libruby1.8 menu python-compizconfig tcl8.4 vim-gnome
  vim-gui-common vim-runtime wireshark-common
Suggested packages:
  adns-tools arj xpdf dbview odt2txt tclreadline cscope vim-doc ttf-dejavu
The following NEW packages will be installed:
  compizconfig-settings-manager libadns1 liblua5.1-0 libruby1.8 mc menu
  python-compizconfig tcl8.4 tshark vim-full vim-gnome vim-gui-common
  vim-runtime wireshark wireshark-common
0 upgraded, 15 newly installed, 0 to remove and 0 not upgraded.
Need to get 0B/26.2MB of archives.
After this operation, 104MB of additional disk space will be used.
Do you want to continue [Y/n]? Y
Selecting previously deselected package libadns1.
(Reading database ... 101625 files and directories currently installed.)
Unpacking libadns1 (from .../libadns1_1.4-2_amd64.deb) ...
Selecting previously deselected package liblua5.1-0.
Unpacking liblua5.1-0 (from .../liblua5.1-0_5.1.4-2_amd64.deb) ...
Selecting previously deselected package libruby1.8.
Unpacking libruby1.8 (from .../libruby1.8_1.8.7.72-3_amd64.deb) ...
Selecting previously deselected package mc.
Unpacking mc (from .../mc_2%3a4.6.2~git20080311-4ubuntu1_amd64.deb) ...
Selecting previously deselected package menu.
Unpacking menu (from .../menu_2.1.41ubuntu1_amd64.deb) ...
Selecting previously deselected package tcl8.4.
Unpacking tcl8.4 (from .../tcl8.4_8.4.19-2_amd64.deb) ...
Selecting previously deselected package wireshark-common.
Unpacking wireshark-common (from .../wireshark-common_1.0.7-1ubuntu1_amd64.deb) ...
Selecting previously deselected package tshark.
Unpacking tshark (from .../tshark_1.0.7-1ubuntu1_amd64.deb) ...
Selecting previously deselected package vim-gui-common.
Unpacking vim-gui-common (from .../vim-gui-common_2%3a7.2.079-1ubuntu5_all.deb) ...
Selecting previously deselected package vim-runtime.
Unpacking vim-runtime (from .../vim-runtime_2%3a7.2.079-1ubuntu5_all.deb) ...
Adding `diversion of /usr/share/vim/vim72/doc/help.txt to /usr/share/vim/vim72/doc/help.txt.vim-tiny by vim-runtime'
Adding `diversion of /usr/share/vim/vim72/doc/tags to /usr/share/vim/vim72/doc/tags.vim-tiny by vim-runtime'
dpkg-deb: subprocess paste killed by signal (Broken pipe)
dpkg: error processing /var/cache/apt/archives/vim-runtime_2%3a7.2.079-1ubuntu5_all.deb (--unpack):
 short read in buffer_copy (backend dpkg-deb during `./usr/share/vim/vim72/doc/tags')
Selecting previously deselected package wireshark.
Unpacking wireshark (from .../wireshark_1.0.7-1ubuntu1_amd64.deb) ...
Selecting previously deselected package python-compizconfig.
Unpacking python-compizconfig (from .../python-compizconfig_0.8.2-0ubuntu1_amd64.deb) ...
Selecting previously deselected package compizconfig-settings-manager.
Unpacking compizconfig-settings-manager (from .../compizconfig-settings-manager_0.8.2-0ubuntu1_all.deb) ...
Selecting previously de...

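The recovery step at the start of this comment (copying status-old over a corrupted status file) can be sketched a little more defensively, so that the damaged file is kept for inspection. The function name and the ".corrupt" suffix are hypothetical; the directory defaults to /var/lib/dpkg as in the commands quoted above. Note that status-old reflects an earlier state, so the most recent package changes may not be recorded in it.

```shell
#!/bin/sh
# Hypothetical helper: replace a damaged dpkg status file with the status-old
# copy kept next to it, saving the damaged file first.
restore_dpkg_status() {
    dpkg_dir=${1:-/var/lib/dpkg}
    # Refuse to continue if there is no non-empty backup to restore from.
    [ -s "$dpkg_dir/status-old" ] || {
        echo "no usable status-old in $dpkg_dir" >&2
        return 1
    }
    # Keep the damaged file for later inspection (the suffix is arbitrary).
    if [ -e "$dpkg_dir/status" ]; then
        cp -p "$dpkg_dir/status" "$dpkg_dir/status.corrupt" || return 1
    fi
    cp -p "$dpkg_dir/status-old" "$dpkg_dir/status"
}

# Usage (needs root for the real directory), followed by the same repair
# command used in the comment above:
#   restore_dpkg_status && apt-get -f install
```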

chriz (christian-seipel) wrote :

I will note here that faulty RAM modules can also result in corrupted files and filesystem errors. That was so in my case, with Ubuntu 9.04 x64 and ext4. Replacing the faulty modules stopped the filesystem errors from multiplying. Nevertheless some errors remained, which in my opinion are related to this bug.

Could a memtest run detect a problem with RAM then?

On Tue, May 5, 2009 at 12:24 PM, chriz <email address hidden> wrote:
> I will note here that corrupted RAM modules also could result in
> corrupted files and file system errors. So in my case with Ubuntu 9.04
> x64 and ext4. Replacing the corrupted modules stopped getting more and
> more file system errors. But nevertheless were some errors remaining
> what is related to this bug in my opinion.
>

Not a memory problem here. System is rock solid with either 2.6.28-9 kernel OR 2.6.29 mainline. Definitively NOT a memory problem: it is just unusable with 2.6.28-11 which is jaunty final kernel.
I am now on karmic devel, using 2.6.29-02062902-generic and have NO problems on same hardware.

Yes. In my case memtest reported errors in my RAM modules.

Edmundo wrote:
> Could a memtest run detect a problem with RAM then?
>
> On Tue, May 5, 2009 at 12:24 PM, chriz <email address hidden> wrote:
>
>> I will note here that corrupted RAM modules also could result in
>> corrupted files and file system errors. So in my case with Ubuntu 9.04
>> x64 and ext4. Replacing the corrupted modules stopped getting more and
>> more file system errors. But nevertheless were some errors remaining
>> what is related to this bug in my opinion.
>>
>>
>
>

In my case there was no problem with memtest. I even ran it for half an hour without problems. Well, I have Jaunty on another notebook, and there it does the job right. In the end I installed the 2.6.29 kernel, and everything works fine. Too bad 2.6.30-rc4 won't install because it fails to build the nvidia parts. There is a trick to get around it, but... well, it is a little bit dirty ;)

I C.... well.... over here, it's a 64 bit laptop AMD based... I'm
using ext3 partitions with a .28 kernel. It's stable. As a matter of
fact, now that I think about it.... why am I associated in this bug?
:-D Anyway.

Hi, I ran memtest and it reported no errors. The very same laptop is stable with no errors if booted from an external drive with intrepid.

Good morning,

My FSC notebook has the same problems described above. It was upgraded to jaunty (stable) from Ibex. After 2 days it didn't boot anymore. An attempt to save data was only partially successful - lots of data is missing :-(

Did some reinstalls with several FS-types (reiser, ext4, ext3) with always the same effect - some reboots ok, suddenly at runtime a read-only FS. The system does not react anymore. Only a hard reset helps. After this the FS is broken.

As it is my production notebook I reinstalled it with Intrepid Ibex and everything is running fine :-(((

Attached is the output of lshw.

It seems strange to me that Canonical has not fixed this bug after so much time. I have appreciated their work for years and many releases, since 4.07 I think.

Best regards,

Claudio

Just for the karmic developers: the bug is no longer present in Ubuntu 2.6.30-2.3-generic.

Radoslav Georgiev (valsodarg) wrote :

I was also a victim of this bug, in which I experienced the following:
1. My applications started receiving segfaults. I tried to strace them and found that the segfaults were mostly due to problems with files (corrupted files).
2. Then my administrative applications started segfaulting, and after that my kernel wouldn't boot as a result of a heavily corrupted file system. I ran fsck.ext3 on my boot partition and all the files were corrupted.

To solve:
Reinstalled ubuntu from the live CD and updated the kernel to the 2.6.29-020629. So far I have had no problem.

I am a little surprised that my brother, who has a 32 bit system, does not have this problem with the 2.6.28-11 kernel; it seems that only 64 bit systems are affected. Can anyone with a 32 bit system confirm this? I also believe that it has to do with how files are written/updated, as many of the errors that fsck reported were:
1. Inode referenced in future data
2. Inode reference count too high
3. Unattached Inodes

Another note is that none of my boot files (as my grub is on its own partition) were corrupted.
System specs:
64 bit, Intel Core 2 duo, 4 GB ram, 320 GB WD hdd (2.5" inch)
Multiboot Setup

Radoslav Georgiev (valsodarg) wrote :

Also I forgot to mention all partitions were ext3 formatted during installation.

Were there any changes in the ata_piix driver (the driver used by the 82801HBM/HEM (ICH8M/ICH8M-E) SATA IDE Controller, which 5 of the people suffering from this bug have in their machines according to the posted lshw output) between 2.6.28-9 and 2.6.28-11?

I can confirm that the affected system has an "82801HBM/HEM (ICH8M/ICH8M-E) SATA IDE Controller".
On another computer without this IDE controller, 2.6.28-11 is running fine.

I've got an ICH9M/M-E here, so it seems both 8 & 9 are affected.

str0g (buskol-waw-pl) wrote :

Weird: if it's a controller error, why doesn't SMART show any errors on ICH9? :P

If it is an error in the drivers (i.e. software), why should it show up in a hardware test?
From what information I've got it looks like something was broken in the ata_piix kernel module between versions 2.6.28-9 and 2.6.28-11 and then subsequently fixed (it could also be an error somewhere else that triggers erroneous behaviour in the ata_piix driver).

Here is the whole changelog:
http://changelogs.ubuntu.com/changelogs/pool/main/l/linux/linux_2.6.28-11.42/changelog

It seems there have been one or two changes to ata_piix. Maybe the problem could be related to some of the libata changes.

KJ (cortexbuster) wrote :

My failing system also contains a 82801HBM/HEM (ICH8M/ICH8M-E) SATA IDE Controller
Just for your information.

str0g (buskol-waw-pl) wrote :

If it's a controller driver error, why are there no errors on my Foxconn X38 with ICH9R?

Lorant Nemeth (loci) wrote :

Hi,

I have ICH8 as well:

00:1f.2 IDE interface: Intel Corporation 82801HBM/HEM (ICH8M/ICH8M-E) SATA IDE Controller (rev 03)

giorgio130 (gm89) wrote :

Has anyone tested 2.6.28-12? It is in the "proposed" repository.

adamski (adam-hasselbalch) wrote :

I have not tested 2.6.28-12, but the changelog does not mention this bug being closed.

Changelog is at https://launchpad.net/ubuntu/jaunty/+source/linux/2.6.28-12.43

PsYcHoK9 (psychok9) wrote :

I have the same problem with ext4, Jaunty 9.04 x64 and kernel 2.6.28-11.
Sometimes, without a power outage or crash, applications on Jaunty disappear or become invisible.
Last time I started GParted and it didn't find ntfsprogs, even though it is installed!
Sometimes the icon of some application disappears from the GNOME menu...
Sometimes aMule configuration files disappear...

josh04 (josh04) wrote :

Confirming that I too have an ICH8M/M-E and had the problem. I upgraded to 2.6.29 and it's not gotten any worse, but I need to fsck.

Lorant Nemeth (loci) wrote :

Those who have upgraded to 2.6.29: did you use a vanilla kernel, or did you use some ubuntu repo?

 My only problem with the kernel upgrade is, that first you'd need to install ubuntu from the CD with the same buggy kernel, and only after that you can install the new kernel. This means that by the end of the installation you can't tell what got correctly written to HDD and what did not.

kikvors (kikvors) wrote :

I used the ubuntu repo. The system was stable enough with the buggy kernel to upgrade to the newer one. After that I ran fsck in case there were any errors. So far no problems with the newer kernel.

str0g (buskol-waw-pl) wrote :

http://kernel.ubuntu.com/~kernel-ppa/mainline/

after installing the system, download the three packages (generic headers, all headers, image) and just run
sudo dpkg -i generic.deb all.deb image.deb
and there won't be any other errors, trust me ;-)

PsYcHoK9 (psychok9) wrote :

This is my lshw.

2.6.30rc5 (vanilla) doesn't seem to be affected.
I've downloaded it from here: http://kernel.ubuntu.com/~kernel-ppa/mainline/

MiceX (rsmad) wrote :

lshw of my system. Affected with x64 and not affected with x32.

This bug has been open for two months, and was fixed upstream some time ago, but not in Ubuntu.

As a consequence, 2.6.28-11-generic and 2.6.28-11-server - the production release media kernels deliver silent and show-stopping data corruption, to the extent that the kernel kills user processes accessing parts of the filesystem corrupted in a particular way. That's as "game over" as it can get, and damaging for Ubuntu's image.

On the upside, there are a bunch of related fixes upstream that would come in if the Ubuntu kernel were rebased to a newer point release, and the patch posted two months ago [1] is probably the fix as well.

What else do we need?

[1] http://kernel.ubuntu.com/git?p=ubuntu/ubuntu-jaunty.git;a=commit;h=b29e79bf557ce777878518da154f4a0becb1de0e

We need it to be made as an official upgrade, is what. For a user to
have to go through "hmm, all my programs just spontaneously crash", go
search for ubuntu bugs, and read through 40 comments, just to get a
fix is asking too much, especially considering a large group of users
barely know how to do anything other than open their email app and
surf the web. Seeing as this is such a wide-spread problem and a
kernel upgrade fixes it in what seems to be 100% of the cases, a
kernel upgrade should really be put into the live repository.

The official installation media has to be updated as well. I wouldn't trust a system which was installed with a kernel that includes such a problem. (I'm surprised no one faced any installation problems so far....)

As a workaround to avoid silent data corruption, boot with 'maxcpus=1' until the kernel is patched.

Lorant Nemeth (loci) wrote :

maxcpus=1 does not help.

I just reinstalled my laptop from minicd. I booted the installer with maxcpus=1 and double checked that even after the install the kernel would boot with that option. I installed a few packages, installed nvidia restricted driver. rebooted and /sbin/init was gone, so next bootup failed.

KJ (cortexbuster) wrote :

2.6.28-12-generic #43-Ubuntu is out there, and maybe this one finally fixes the bug. I'm already testing it as I write these lines.

KJ (cortexbuster) wrote :

I did some testing and to me it seems to be working. My partitions are not in ro mode and none of my apps are segfaulting.
If it continues to work without a hassle I think it's time for a 9.04.2!

KJ (cortexbuster) wrote :

2.6.28-12-generic #43-Ubuntu did it again. but this time it took significantly longer...
the filesystem crashed and I had to recover it using live cd....
narf.
so no bug fix for me yet, except installing 2.6.29 kernel using ppa...

Lorant Nemeth (loci) wrote :

I think this bug deserves a "Critical" flag. It's not only causing instability, but it causes data loss too.

summary: - jaunty kernel 2.6.28-11 kernel update renders the system un-usable.
+ jaunty kernel 2.6.28-11 kernel update makes the system unusable.

I am surprised this is not yet labeled as "Critical". I lost some of my data and belief in Ubuntu with it.

zielgruppe (pajoma-gmx) wrote :

I didn't really lose my faith in Linux through this, but I now have around 6GB in my lost+found directory due to this bug. I was running kernel v2.6.30-rc4, which solved my issues. I just switched back for a few minutes to try out my new Wacom tablet. The open source driver apparently does not support 2.6.30 yet. Well, these few minutes destroyed several days of work :(

sam tygier (samtygier) wrote :

sounds reasonable for this to be labelled critical. it's causing data loss for several people. any news on a fix, Manoj?

Changed in linux (Ubuntu):
importance: High → Critical
tags: added: amd64
summary: - jaunty kernel 2.6.28-11 kernel update makes the system unusable.
+ 2.6.28-11 causes data corruption with ICH8 on 64 bits installations

Wanted to add that I experienced the same bug on a laptop with ICH9 (?): "ICH9M/M-E 2 port SATA IDE Controller"
(full lshw output attached).

giorgio130 (gm89) wrote :

I also have an ICH9. Current description is reductive.

summary: - 2.6.28-11 causes data corruption with ICH8 on 64 bits installations
+ 2.6.28-11 causes massive data corruption with ICH8/ICH9 on 64 bits
+ installations

Same problem on my brother's Samsung X22; lost 3 (re)installations now (including data!!). Had to resort to re-installing 8.10.

00:1f.2 IDE interface: Intel Corporation 82801HBM/HEM (ICH8M/ICH8M-E) SATA IDE Controller (rev 03)

Waiting for a fix...

Hospik (jmhospers) wrote :

I suffer from this problem on a HP/Compaq 8510w with ICH8. Instead of upgrading to a 2.6.29 kernel I'm currently running with 8.10 stable kernel (2.6.27-11) from the Intrepid repos. Which is fine for me. (Will not help you if you want to run ext4).

Still waiting for a kernel upgrade to solve the issue...

Manoj Iyer (manjo) wrote :

Can someone try a mainline kernel and report if it is still broken?

http://kernel.ubuntu.com/~kernel-ppa/mainline/

Dmitry Diskin (diskin) wrote :

> Can someone try a mainline kernel and report if it is still broken?

It was already mentioned here, that 2.6.29 from mainline is fine. I'm running 2.6.29-02062902-generic successfully.

Steffen Rusitschka (rusi) wrote :

I can also confirm that the mainline kernel 2.6.29 from the PPA works fine here.

I observed a milder version of the symptoms with a Dell Latitude D830, which also has ICH8, using ext3.
The problem became noticeable only when fsck found about 30 errors (duplicated blocks, wrong blocksizes, inode inconsistencies..) that needed to be repaired in manual mode during the first routine check after the upgrade to Jaunty. Apparently this concerned only a few files. Otherwise the system had seemed stable.

nbp (nobradpitt) wrote :

I'm also confirming that 2.6.29-02062902-generic is running successfully (finally!).

Why? It's a patch for ext4, but this problem hasn't got anything to do with ext4 at all.

quixote (commer-greenglim) wrote :

As one of the regular users who's been KO'ed by this bug (and who's pretty much frothing that this bug was *known* prior to the final release and nobody bothered to put out a huge general HEADS UP about it ... but that's another whole issue) --

could somebody post easy instructions on how to install the kernel that seems to solve the problem? At least then us poor schmoes could get some use out of our jaunty installs. I'll be glad to put the workaround in the "known jaunty bugs and workarounds" sticky on ubuntuforums.

For instance, would something along these lines be right:
[CODE]wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.29.3/
sudo dpkg -i linux-image-2.6.29-and-then-what-goes-here??.deb
sudo dpkg -i linux-headers-2.6.29-and-then-what-goes-here??.deb linux-headers-2.6.29-and-then-what-goes-here??_all.deb[/CODE]

(I gather the 29 kernel would be the preferred solution? Yes? No?)

Which commands -- or which further commands -- would one need?

I understand that this is not standard practice, yadda, yadda, yadda. Having constant filesystem corruption is worse, though, so please give us a fix!

quixote (commer-greenglim) wrote :

Or at least a workaround!

I previously experienced massive ext4 inode bitmap corruption on an x86-64 Opteron w/ a CK804 chipset while performing a large rsync, so the ext4 corruption issue is *not specific* to ICH8/9.

A number of the reports (including duplicate) mention ext4 and the original report in this LP entry mentions JFS. Are we over-merging bug reports?

Also, it's crucial to understand if the ext4 corruption has been observed on i386 systems or not...please mention if you've seen ext4 corruption on i386.

@quixote workaround is:

$ wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.29.4/linux-image-2.6.29-02062904-generic_2.6.29-02062904_amd64.deb
$ sudo dpkg -i linux-image-2.6.29-02062904-generic_2.6.29-02062904_amd64.deb
(or _i386.deb if on 32bit)

summary: - 2.6.28-11 causes massive data corruption with ICH8/ICH9 on 64 bits
- installations
+ 2.6.28-11 causes massive data corruption on 64 bit installations

This bug also occurs with ext3 and other filesystems. If someone experiences ext4 filesystem corruption on i386, it is not related to this problem. Maybe some bug reports were over merged.

This bug is specifically about the corruption that occurs mainly on Compal notebooks with the AMD64 version of Ubuntu, regardless of filesystem.

quixote (commer-greenglim) wrote :

@Daniel J. Blueman: Thanks! I'll try that out today.

My system specs, btw, are: Core 2 Duo P8400 2.26GHz, 4GB RAM, Intel GMA X4500 graphics, 64-bit Jaunty and Intrepid dual boot, and ext3 filesystem. The laptop is an "MSI 1223" (i.e. no-name brand, I guess.)

The filesystem corruption I was experiencing with the stock Jaunty kernel (2.6.28-11-server) on an x86-64 system was down to the nVidia CK804 PCIe chipset corrupting data on PCIe read completions from the SATA controller's DMA engine. I have observed this with a PCIe bus analyser.

On a system without the CK804 (or MCP55 or related) chipset, I find the stock Jaunty kernel solid.

Lesley Lutomski (lutomski) wrote :

This is my first report, so I apologise if it's incorrect, and for my lack of technical knowledge.

I upgraded my desktop from Hardy to Jaunty, and after a week, the problems described above began - random system freezes, followed by X server error and running fsck, which reported various inode errors. I originally thought the problem was to do with the nVidia drivers, but I removed them completely several days ago and the freezes still occurred. Yesterday's crash has left me unable to boot up at all - error messages about missing files, then kernel panic. (I have more details, if needed.)

I am running 32bit Ubuntu, with ext3. I have a Core Duo 3.4GHz processor, nVidia GeForce 7900 GS graphics card and 2Gb RAM.

giorgio130 (gm89) wrote :

@Lesley:
I think your report is important. Until now, it happened always on x86_64 installations... attach the output of your lshw.

Lesley Lutomski (lutomski) wrote :

Attached as requested. I'm going to have to do a complete reinstall (of 8.10) to get the system up and running again; is there anything else you need before I overwrite the disk?

giorgio130 (gm89) wrote :

I think this is enough... Anyway, as stated before, if you want to keep using 9.04 you can always use a more recent kernel, or, as you said you came from hardy, the old one that shipped with it, which should be still selectable in grub.

Richard Huddleston (rhuddusa) wrote :

The issue is on 64bit kernels 2.6.29-02062904-generic AND kernel 2.6.28-11-server.

I believe I'm seeing the same issue on ICH10 AND SiI 3132 ... separate software raid (raid1 and raid5) on both controllers are suffering. Everything was OK until a couple of days ago, issue only came up when I started playing with larger files. Issue manifests itself on XFS, ext4, and directly on software raid device, and directly against disk.

I've run memtest 6+ passes with no errors, in addition to badblocks on each disk and the combined arrays. No SMART issues on the disks.

My test is basically creating files (11 gigs in size) with /dev/random and dd, and then reading back X bytes with md5sum.
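
A minimal sketch of the write-then-read-back test described above, in Python rather than dd and md5sum. The function names are illustrative, not what was actually run, and os.urandom stands in for /dev/random. Repeated reads of a small file may be served from the page cache rather than from the disk, so a real run would need a file larger than RAM (such as the 11 GB mentioned above) or a cache drop between passes.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, read in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_random_file(path, size_bytes, chunk_size=1 << 20):
    """Write size_bytes of random data and return the MD5 of what was written."""
    digest = hashlib.md5()
    remaining = size_bytes
    with open(path, "wb") as handle:
        while remaining > 0:
            chunk = os.urandom(min(chunk_size, remaining))
            handle.write(chunk)
            digest.update(chunk)
            remaining -= len(chunk)
        handle.flush()
        os.fsync(handle.fileno())  # push the data out of user-space buffers
    return digest.hexdigest()


def readback_mismatches(path, expected_md5, passes=3):
    """Re-read the file several times; return every digest that differs."""
    digests = (md5_of_file(path) for _ in range(passes))
    return [d for d in digests if d != expected_md5]
```

An empty list from readback_mismatches means every pass matched the checksum computed while writing; any entry in the list is a read that returned different data.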

To build a better understanding of the mechanism, it's worthwhile finding out:
 - is there sufficient cooling for the southbridge and northbridge?
 - are you running the latest BIOS?
 - are you running the vendor's validated BIOS defaults?
 - is the power supply of reasonable quality/spec?
  -> for lower ripple and supply rails within tolerances
 - the output from 'lspci'
 - importantly: can the corruption be provoked in MS Windows?

I've experienced two cases where bad memory was exposed through fast I/O (in an HPC environment) but memtest didn't detect issues, so there is still a small chance.

The first question above is not applicable in my case: my laptop was rock solid with Ubuntu 8.10 and is rock solid with development karmic. I cannot report about Windows, and IMHO such a question shouldn't even be asked. No settings have been modified/altered, and the BIOS is vendor default.
Attached my lspci output.

ACH SO.... Use this, I was normal user.....

Lorant Nemeth (loci) wrote :

I thought a short howto would be good for some people, about how to get Jaunty running on the affected systems. This is for paranoid people like me, who don't trust a system installed with an installer running the buggy kernel or booted up with it once (and have no intention to create/wait for a new installer). Note: headers might not be needed on some systems, but they cannot hurt (and they will be needed for the nvidia driver).

- save all your data if possible
- install 64 bit intrepid
- boot up system
- upgrade to latest packages
- upgrade to jaunty, but do not restart system after upgrade finished
- wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.29.4/linux-headers-2.6.29-02062904-generic_2.6.29-02062904_amd64.deb
- wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.29.4/linux-headers-2.6.29-02062904_2.6.29-02062904_all.deb
- wget http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.29.4/linux-image-2.6.29-02062904-generic_2.6.29-02062904_amd64.deb
- dpkg -i linux*2.6.29-02062904*deb
- reboot
- make sure to select the above kernel (most probably it will be the default, but it is worth checking)
- install nvidia driver if needed (it's ok if you installed it on intrepid; it will be "recompiled" automatically when you install the new kernel)
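
The "make sure to select above kernel" step can be double-checked after the reboot by looking at the running release string (what uname -r prints). A rough sketch, assuming only what is reported in this thread: release strings starting with "2.6.28-11-" are treated as the affected builds, and both helper names are made up for illustration.

```python
import platform

# Assumption taken from reports in this thread only: the Jaunty 2.6.28-11
# builds (for example 2.6.28-11-generic and 2.6.28-11-server) are affected.
REPORTED_BAD_PREFIXES = ("2.6.28-11-",)


def is_reported_bad(release):
    """True if the release string matches a build reported as bad here."""
    return release.startswith(REPORTED_BAD_PREFIXES)


def running_expected_kernel(expected, release=None):
    """True if the running kernel release equals the one just installed."""
    if release is None:
        release = platform.release()  # same string as `uname -r`
    return release == expected


if __name__ == "__main__":
    current = platform.release()
    status = "reported bad" if is_reported_bad(current) else "not on the reported-bad list"
    print(current, "->", status)
```

For the howto above, the expected string would be 2.6.29-02062904-generic. The prefix list is not authoritative; a later comment reports corruption with 2.6.28-12 as well.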

This way your system should not get corrupted, as you'll never run the buggy kernel. Of course this way you'll download most of the packages 3 times (intrepid version, intrepid update version, jaunty version). In case there's an updated install medium, please let us know, because that way all this stuff is unnecessary.

Have a nice day! Loci

Richard Huddleston (rhuddusa) wrote :

Well, I'm still seeing my large file data corruption issue on mainline kernel 2.6.30-020630-generic

md1 : active raid5 sdc1[0] sdf1[3] sdd1[1] sdg1[4] sde1[2]
      1953535744 blocks level 5, 64k chunk, algorithm 2 [5/5] [UUUUU]

md0 : active raid1 sda1[0] sdb1[1]
      78123968 blocks [2/2] [UU]

md0 is 2 sata disks on intel ich10 chipset

md1 is 5 sata disks on sil Silicon Image, Inc. SiI 3132

pv -L 35M bigfile | md5sum
46a2c9e932c3feb32dc3edcd60a81d98 -
pv -L 35M bigfile | md5sum
24287e601aa23abb7b380c6577750bb0 -
pv -L 20M bigfile | md5sum
c65b3137a83c00d0bd20fd95c1ee2e88 -

I've gone in and checked the temperatures of all the components i can and they are well under max temps as stated in the system's tech specs. (using both in bios diagnostics and lm-sensors). the power supply is 300W and only consuming 100W ... the 5 sata disks are in an external chassis with separate power supply.

I've also rotated the disk, rebuilt the raid arrays, and tried downgrading the disks from 3.0 to 1.5 through jumper changes (and verified speed with dmesg). i've also removed a sata cd drive which was on ich10 controller.

i also thought this might have something to do with pci latency, as linux was setting everything to 64 latency (seen in dmesg), but my bios was set to 32 (don't know if that was actually an issue though). I've changed the bios values from 32 to 64 and all the way up to 248 (or something like that). still no change. i'm currently running at 64

i've also tried throttling down read speeds with pv -L rate.

interesting note ... i burned a new ubuntu install cd, and did a media test of it on the cd drive (before removed) and the test failed ... but performing on another box said the media was good. don't know if it was a bad cd drive or another manifestation of this issue.

my lspci is attached on a previous comment

i have not tested my system under 32 bit mode because of the install media verification issue.

Richard Huddleston, do you have a "Intel Corporation 82801HBM/HEM (ICH8M/ICH8M-E) SATA IDE Controller" (or another controller that uses the ata_piix kernel module)? If not, this bug is probably not the cause of your problems.

Richard Huddleston (rhuddusa) wrote :

I'm using the Intel board DQ45CB which has the Q45 Chipset with the 82801JO I/O Controller Hub, the information is available in my lspci from my first post. I don't know the differences between the 82801JO and the 82801HBM/HEM controllers.

It seems like you've got a SATA controller using the ata_piix kernel module, which I believe is common to all people suffering from this bug. My personal experience (and other people are reporting similar observations in earlier comments to this bug) is that upgrading to a mainline kernel of a later version than the one shipped with jaunty will fix this issue. I am however not using RAID, which could explain the difference.

mirix (miromoman) wrote :

I cannot believe this bug is still open.

To all the people doing hardware diagnostics: This is not a hardware issue. As far as I can tell, it exclusively affects the Ubuntu kernel 2.6.28-11. Any other kernel/distro I have tried (older or newer) works perfectly. I am running 2.6.30 from kernel.org now. No problem.

David Birch (david-birch) wrote :

This is the worst ubuntu bug i have ever struck - i just discovered my backup of my root part was no good due to being made from a usb boot of 9.04, and now my real root partition is gone too. this bites big time. I have requested some update to the release notes...

I feel pretty stuffed here now as to whether any of my data touched since upgrade is any good...

If this truly is a software bug, our best shot at addressing it is via bisection, given the lack of specific knowledge.

There are enough reports that we can consider the ubuntu kernel 2.6.28-9(.31?) good, and we know (at least) ubuntu kernel 2.6.28-11.42 is bad. We can rebuild intermediate kernels from [http://kernel.ubuntu.com/git?p=ubuntu/ubuntu-jaunty.git;a=tags].

First, it may be easier to see if we can bisect using the mainline kernels, since they are prebuilt at [http://kernel.ubuntu.com/~kernel-ppa/mainline/]. We get the ubuntu to mainline mapping from [http://kernel.ubuntu.com/~kernel-ppa/info/kernel-version-map.html].

 -> ubuntu 2.6.28-9.31 is based on mainline 2.6.28.7 ("A")
http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.28.7/linux-image-2.6.28-02062807-generic_2.6.28-02062807_amd64.deb

 -> ubuntu 2.6.28-11.42 is based on mainline 2.6.28.9 ("Z")
http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.28.9/linux-image-2.6.28-02062809-generic_2.6.28-02062809_amd64.deb

** Can at least two people who can reproduce this issue with 2.6.28-11.42, double-check they can/can't reproduce it with kernel Z please? ** We'll know if it's ubuntu patches/configuration, or upstream then.
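
For anyone unfamiliar with bisection, the idea can be sketched as a binary search over an ordered list of kernel builds between a known-good and a known-bad one. This is only an illustration: first_bad is a made-up helper, the intermediate build names in the usage note are placeholders rather than real tags, and is_bad stands for "install this kernel and run it long enough to see whether corruption shows up", which for this bug may take hours and is not perfectly reliable.

```python
def first_bad(versions, is_bad):
    """Find the first bad entry in `versions`.

    `versions` is ordered oldest to newest, versions[0] is known good and
    versions[-1] is known bad. `is_bad` is called once per tested version.
    Returns (first bad version, number of versions that had to be tested).
    """
    low, high = 0, len(versions) - 1  # low: known good, high: known bad
    tests = 0
    while high - low > 1:
        mid = (low + high) // 2
        tests += 1
        if is_bad(versions[mid]):
            high = mid  # the breakage happened at or before mid
        else:
            low = mid   # the breakage happened after mid
    return versions[high], tests
```

With a list such as ["2.6.28-9.31", "step-1", "step-2", ..., "2.6.28-11.42"], each test halves the remaining range, so even a few dozen intermediate builds need only a handful of test installs.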

quixote (commer-greenglim) wrote :

fwiw, since I moved to kernel 2.6.29.02062904 (which is a bit later than either "A" or "Z"?) I haven't had any problems. No crashes, no instability, even after repeated suspends.

Richard Huddleston (rhuddusa) wrote :

I saw my data corruption issues on kernels:
ubuntu kernel 2.6.28-11-server
mainline kernel 2.6.29-02062904-generic
mainline kernel 2.6.30-020630-generic

i've seen issue on both SATA and USB attached disks. also, doing MD5sum checks of the install cd fails on that machine ... but same disk does not fail on other machines.

unfortunately, in attempting to diagnose my problem, i upgraded to a new intel bios released on 6/10 ... and it bricked my board. All attempts at loading a recovery bios have failed. i'm working with intel to get a new board. perhaps i had a bad board all along ?

Bobby (robert-rankin-jr) wrote :

Am installing Kernel "Z" right now, will report back in the morning after letting it run all night. Also, if there's still any doubt, it is a 64bit only issue. I installed 32bit ubuntu on the same computer and it works perfectly.

Bobby (robert-rankin-jr) wrote :

02062809 "Z" seems to work for me. It's not usable in my case because the nVidia driver won't work, but that applies to any replacement kernels. Still, no filesystem errors here.

2009/6/18 Bobby <email address hidden>:
> 02062809 "Z" seems to work for me. It's not usable in my case because
> the nVidia driver won't work, but that applies to any replacement
> kernels. Still, no filesystem errors here.

Not sure what that "kernel Z" is, but I'm using one of the vanilla
kernels provided by the Kernel Team and it works great (including the
NVIDIA kernel module, which is rebuilt through DKMS).

Bobby (robert-rankin-jr) wrote :

It was just what Daniel asked to be tested in order to identify the problem. Never mind, though, installed the headers and everything works perfectly. No FS issues, and graphics are fine. Hope that helps.

This bug bit me on a Dell Latitude D830 laptop (64-bit Core2 Duo CPU, ICH8 SATA controller).

Laptop had been very stable while running Intrepid for months.

Installed the Jaunty CD (9.04, from April 2009), which has the now-infamous 2.6.28-11 kernel. Got lots of silent filesystem corruption within days. :-(

Upgraded to linux-generic 2.6.28.13.17, problem appears to have gone away, though my disk is still corrupted :-(

I wish there was a note in the Jaunty release notes about this! And a new Jaunty install CD with the fixed kernel. I guess it's back to Intrepid and my most recent backup for me.

mecat (habdankm) wrote :

notebook asus f6a
Linux 2.6.28-11
DISTRIB_DESCRIPTION="Ubuntu 9.04"
Intel(R) Core(TM)2 Duo CPU P8400
ICH9M/M-E 2 port SATA IDE Controller
I tried many filesystem configurations.
I have lost whole LVM volumes (root XFS on a logical volume). I tried ext4 on a raw partition with the same result: all data was lost.
XFS on a raw partition was the only filesystem configuration that was able to recover after a system crash. The system usually hangs after 10 minutes.
Now I am testing 2.6.28-02062809-generic.

mecat (habdankm) wrote :

After the latest updates on
notebook asus f6a
Linux 2.6.27-14-generic x86_64
DISTRIB_DESCRIPTION="Ubuntu 8.10"
Intel(R) Core(TM)2 Duo CPU P8400
ICH9M/M-E 2 port SATA IDE Controller
I have the same problem as above. Maybe this information can be useful: on both versions of Ubuntu I have big problems with ACPI on the new kernel (the Fn key to set brightness does not work, and I am unable to set brightness via procfs: echo 81 > /proc/acpi/video/VGA/LCDD/brightness).
On the older kernel version 2.6.27-9 there was no problem with ACPI/brightness or file system corruption.

On Ubuntu 9.04 with the 2.6.28-02062809-generic kernel on x86_64, the system still has not hung (2 hours later :-).

Tim McCormack (phyzome) wrote :

The important question here is not "How do we fix this bug?" but "How do we prevent this sort of bug from occurring again?" Please take the time to read this excellent article about how software is written for NASA's shuttles, paying close attention to the part about "The Process": http://www.fastcompany.com/node/28121/print

The Ubuntu team's process failed to stop a major data corruption bug from being released, even though it was known about ahead of time. I'm not assigning blame to any one person; we are all affected by the process. So, what part of the process failed *us*? What needs to change about the way software is released in the Ubuntu project?

Lorant Nemeth (loci) wrote :

Let's not go too deep into this NASA thingy... They are in a much easier situation from a HW platform point of view. I doubt there are too many kinds of space shuttles they have to support :) On the other hand I agree that these kinds of problems should be discovered earlier; maybe gathering and registering testers with different kinds of HW would increase HW coverage and prevent such severe problems.

Bobby (robert-rankin-jr) wrote :

While I agree that that discussion needs to take place, I really don't think that this is the place for that discussion. Maybe a forum topic would be more appropriate. Just my opinion.

It seems I have found a solution which should work for at least Compal notebooks. I found this based on information in one of the duplicate bugs. It seems at least the ICH9 controllers have two modes of operation. One is called IDE compatible or no-AHCI mode and is enabled by default. This uses the ata-piix driver. If you manage to switch the controller to AHCI mode it should use the ahci driver instead, which does not seem to have any problems.

You can check what mode your controller is in using lspci. It will show either IDE or AHCI depending on what mode is set. You can also check the dmesg output for ata-piix or ahci.
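
As a rough illustration of the lspci check, the sketch below classifies storage controller lines by looking for "IDE" or "AHCI" in the device description. The function name is made up, and the exact wording of the AHCI line is an assumption (the IDE line is the one quoted earlier in this report), so treat the matching rules as a starting point rather than a definitive parser.

```python
def sata_controller_mode(lspci_output):
    """Map PCI slot -> 'ahci' or 'ide' for storage controllers in lspci text.

    Lines that mention neither mode (graphics, network, ...) are skipped.
    """
    modes = {}
    for line in lspci_output.splitlines():
        slot, _, description = line.partition(" ")
        lowered = description.lower()
        if "ahci" in lowered:
            modes[slot] = "ahci"   # controller is driven by the ahci driver
        elif "ide interface" in lowered or "ide controller" in lowered:
            modes[slot] = "ide"    # IDE compatible mode, handled by ata_piix
    return modes
```

The input can be pasted from a terminal or captured with subprocess.run(["lspci"], capture_output=True, text=True).stdout on Python 3.7 or later. A result of "ide" for the SATA controller would mean the machine is in the no-AHCI mode described above.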

Switching the controller to AHCI mode can be quite tricky. I was able to do it using a DOS program from Compal's site. Here are the instructions:
1. Make sure syslinux is installed
2. Get https://haar.student.utwente.nl/~julius/ahci.dsk.gz
3. Gunzip the archive and put it in /boot
4. Add the following lines to /boot/grub/menu.lst:
title FreeDOS AHCI switch disk
root (hd0,1) #copy this from the other Ubuntu item
kernel /usr/lib/syslinux/memdisk #if your /boot is on a different partition, copy this file to /boot and put /boot/memdisk here
initrd /boot/ahci.dsk
5. Boot the new boot item from grub (just press enter on the time & date prompt, there is no config.sys)
6. Run ahci_en

This should enable AHCI mode. You can verify this by running lspci or checking the dmesg output. I built this bootdisk myself based on FreeDOS and the program from Compal's website. You can also disable AHCI again with this disk.
I was able to do a clean Jaunty AMD64 install, do an update and run quite a few applications after I had done this. fsck doesn't find any problems at all after this and there were no strange crashes anymore.

I suspect that mainly Compal notebooks come with this mode set to no-AHCI, which is why there are so many reports of this problem with Compal notebooks.
For others without Compal notebooks with this problem, verify your hardware is in no-AHCI mode and try to switch it to AHCI mode through a similar tool. My bootdisk probably won't work.

I understand that AHCI mode also gives better performance, so this should be an optimal solution. The problem with the ata-piix driver in this kernel should still be considered a bug though.

giorgio130 (gm89) wrote :

Has anyone tried the above solution? I'd try it, but this is my only machine and I'd like to preserve it from such corruption... :) Moreover, is this supposed to work with a Compal jhl90?

mirix (miromoman) wrote :

A new kernel, 2.6.28-13, is available from the update repositories. Now I guess that when someone installs Ubuntu from the CD image, the kernel will be updated from 2.6.28-9 (which is known to be free from this bug) to 2.6.28-13 (which I hope will be bug-free as well). So, practically speaking, the problem should be solved, even if the bug remains.

The problem is that this bug was first reported when Jaunty was in alpha stage, then in beta, then RC and then with the stable version, and it took months for the Ubuntu developers to release a new kernel. The situation is particularly absurd because the problematic kernel was not the one released with the iso image, but an update.

I think that the huge delay before the Ubuntu community reacted to such a critical bug should make us reconsider the efficiency of the current structures and pipelines.

2.6.28-13 is not likely to be bug-free. I checked the changelog and there is no change since the 2.6.28-11 kernel that seems to be related to this problem.

giorgio130, I suspect it will work on your system; I have a jhl91 myself, which appears to be very similar. In any case I would suggest enabling AHCI mode, because it should work better anyway.

giorgio130 (gm89) wrote :

@Julius:
Well, it worked and even quite painlessly. I'll try the buggy kernel as soon as I've the time to make a decent backup. However, I don't think I'll switch back to 2.6.28.

quixote (commer-greenglim) wrote :

This is just to second (and third and fourth and fifteenth!) the comments by mirix and Tim McCormack above.

It is essential that Ubuntu's processes for software release are able to prevent such appalling disasters in the future. It is a measure of how much goodwill Ubuntu has in the community that nobody took this story and ran with it. That goodwill won't last if this sort of thing happens.

I'm not saying the devs have to be perfect and never make a mistake.

What I'm saying is that when a potentially data-destroying bug is KNOWN to be present before the release, there need to be methods in place to prevent the problem.

mecat (habdankm) wrote :

On Ubuntu 9.04 with the 2.6.28-02062809-generic kernel on x86_64 I do not have any problems with the fs. I am using:
product: ICH9M/M-E 2 port SATA IDE Controller
configuration: driver=ata_piix latency=0
Is it possible to switch from IDE mode to AHCI without reinstalling the system (I tried with Windows and I wasn't able to boot it any more, even after switching back to piix)?
On this kernel, there are still a lot of bugs with ACPI (e.g. brightness/buttons/HDD suspending/power management) which weren't present on older kernels.

Switching shouldn't bring any problems to installed systems. The kernel will automatically use the correct driver. At least it worked fine for me (also with my already installed system) :)

mecat (habdankm) wrote :

You were right - it works in this mode. I will try new kernel.

mecat (habdankm) wrote :

Linux inf16 2.6.28-13-generic #44-Ubuntu SMP Tue Jun 2 07:55:09 UTC 2009 x86_64 GNU/Linux
In AHCI mode system works correctly.

Richard Hansen (a7x) wrote :

2.6.28-13.44 (amd64, core i7, AHCI mode) does NOT work for me -- I also experienced filesystem corruption.

Theodore Ts'o (lead ext4 developer) has some patches in the 'for-stable-2.6.28' branch of his git repository (see http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=shortlog;h=for-stable-2.6.28) that are NOT integrated into 2.6.28-13.44. One of these is a fix for a filesystem corruption bug (see http://git.kernel.org/?p=linux/kernel/git/tytso/ext4.git;a=commit;h=16cb5dd9f53e569130584696909d423b6fe38c1e) which may or may not be related to what everyone is reporting.

According to the commit message, this bug has been in ext4 for a very long time, but people seem to only be experiencing this bug in 2.6.28-11 and 2.6.28-13. Also, this bug is not specific to 64-bit systems yet people seem to only be having problems on 64-bit platforms. However, because it's a concurrency bug, it could be that the bug is only tickled under certain circumstances and some change between 2.6.28-9 and 2.6.28-11 is causing the bug to be tickled much more regularly on 64-bit systems.

Thus, I'm hopeful that this patch will fix the problem for everyone, but I'm not confident. This patch is scheduled to be included in the next Jaunty kernel release (see bug #389555).

In the meantime, I added all of tytso's patches to 13.44 and uploaded the resulting package to my PPA. I basically took the difference between git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git ('for-stable-2.6.28' branch) and git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6.28.y.git ('master' branch) and applied it to git://kernel.ubuntu.com/ubuntu/ubuntu-jaunty.git ('Ubuntu-2.6.28-13.44' tag). If you're feeling adventurous and want to try it out, go to https://launchpad.net/~a7x/+archive/ext4fixes and follow the install instructions.

giorgio130 (gm89) wrote :

@a7x: this is not related only to ext4. Corruption comes with ext3, reiserfs as well.

feanorknd (regino-m) wrote :

@giorgio130: OK, there are problems with ext3, reiserfs, etc. But for now, there are fixes/patches created by the lead developer of ext4 for confirmed data corruption problems; they are available to fix the current 2.6.28 kernel in jaunty and have been applied directly to the 2.6.29 and 2.6.30 kernels. So there is an ext4-related data corruption problem, and there are patches for it. Why not include them in a new 2.6.28 release?

Thanks.

@feanorknd: These patches may all be valid, but this is unrelated to this bug. This bug is about data corruption that occurs on ext3, reiserfs and other non-ext4 filesystems. Anyone with ext4 problems should not post here if they cannot reproduce this with ext3. Do a proper search for the appropriate bug or create a new bug.

What we know:

*) The bug is NOT FS related: we have reports so far for jfs, xfs, reiserfs, ext3 and ext4
*) Apart from one doubtful case, it is x86_64 related, but concurrency is not the issue (it persists even when going UP).
*) It is FOR SURE hardware related, the ata_piix driver being the most probable candidate to host the bug.
*) It is critical for a lot of people, as it can trigger complete loss of data stored on disk

About the method:

-) I can be considered responsible for not pointing out the problem clearly at first. I apologize. My faults are:
   *) I named ext4 in my first post, at a moment when lots of bugs were being reported and solved on that filesystem, and this surely lowered the perceived importance of the bug.
   *) I first titled the bug "kernel 2.6.28-11 kernel destroy system", which was surely an overshoot that I apologize for, and which made developers think I was a N00b playing with his shiny new toy.
-) I nevertheless consider the forced inclusion of the ext4 filesystem in jaunty the root problem here: it triggered lots of bugs in a critical timeframe (just weeks before the release), and patches to make it work in a kernel not ready for it possibly triggered bugs in other drivers considered stable.

just my 2€c

Manoj Iyer (manjo) wrote :

Is this still an issue with Jaunty? IIRC this bug was opened early in the dev stage of Jaunty; Jaunty proposed is at 2.6.28-14.46. Can someone verify that this is fixed in Jaunty so that I can close this bug as fix-released?

Thanks a ton

Probably the bug is still there as long as there are no ata-piix related changes in the changelog. Other than that it appears data corruption bugs and system stability aren't very high priority with the Ubuntu devs :S
Even with my controller in AHCI mode, I've experienced several different types of crashes & hangs with the AMD64 kernel (no data corruption anymore though!). I'll be trying out Karmic soon on my hardware, I guess we should hope Karmic won't be such a catastrophe as Jaunty.

Richard Hansen (a7x) wrote :

It may not be the ata_piix driver -- I set my controller to AHCI mode before I even installed Jaunty, yet I experienced filesystem corruption. My corruption could have been caused by an unrelated bug, however.

I haven't had any problems since I upgraded to a backported Karmic kernel. If anyone else wants to try the Karmic kernel, you can get it from my PPA at <https://launchpad.net/~a7x/+archive/kbp>. I will try to keep it up-to-date as newer versions of the Karmic kernel are released.

Hospik (jmhospers) wrote :

I have not dared to try the 2.6.28-14 kernel with the ata_piix driver because I did not see any related changes in the changelog either. However, after switching my HP/Compaq 8510w (through the BIOS) to AHCI mode (or native mode as HP calls it) I did upgrade to 2.6.28-14 and have not noticed any crashes/corruption since!

Currently running an up-to-date Jaunty...

Dan Halbert (dhalbert) wrote :

@Manoj (2009-07-09): I believe this bug is still present in 2.6.28-13.44, the latest released kernel. Is there any reason to expect it is fixed in 2.6.28-14.46 (in jaunty proposed)? Do you have a specific patch in mind?

A colleague had these symptoms on a Dell E6500 (has ICH10 and Nvidia graphics). The problem is not as pervasive as that of some of the original reporters, but there has been significant filesystem corruption every few days. We have updated his kernel to a 2.6.29 kernel just today, and hope to see an improvement.

Zakhar (alainb06) wrote :

This bug is really a show-stopper.
I can't believe the Ubuntu team wants to close this report even if the bug is still there (see 5 posts above)

This has stopped me from upgrading to Jaunty, and I bet this bug will stay uncorrected up to Karmic, where it will (hopefully) disappear with the .30 kernel.
Ubuntu team, if people stopped complaining and reporting about this bug, it is simply because they lost hope that someone really knows what's going on here and will have time to fix it. I don't blame anyone; I know such bugs are hard to fix... if you would just admit it, we will simply forget this doomed version of Ubuntu.
Isn't there a status : "we won't correct that" (although it is said to be critical!)

Such bugs really tear down the good image of Ubuntu... especially when you know the bug was ("supposedly") corrected upstream.
I hope you will consider delaying the next release if such critical bugs happen again... a good thing for that would be to release at the beginning of the month; it would give some time to handle such things.

kikvors (kikvors) wrote :

If you add up the amount of time lost by people having lost data due to file corruption with this bug, it will probably outweigh the effort needed to fix this.

Johan Sköld (johan-skold) wrote :

Note that the bug has already been fixed in later kernel versions, it's just
the ISO that hasn't been updated. Even though it really should be updated,
the bug will not persist 'til Karmic.

quixote (commer-greenglim) wrote :

Just two things for what it's worth:

Fact: I've been using the 2.6.29-04 kernel with up to date 64-bit Jaunty on ext3 for a couple of months now with no problems at all.

Opinion: I am still horrified, appalled, perplexed, and angry that there are neither any warnings on LiveCD iso downloads for 64-bit, nor an update for the kernel shipped with the iso. I'm starting to get the impression that whoever makes those decisions at ubuntu doesn't think it's very important if I lose data. I know I'm repeating myself, and it amazes me that in this community that should even be necessary, but that is Not Good.

mosgjig (mosgjig) wrote :

I came across this issue and found that the solution proposed by Lorant Nemeth on comment #122 worked, though with a slight twist. I was unable to install a fresh copy of intrepid because the liveCD could not mount the swap (too lazy to investigate after dealing with this mess), therefore I just installed the liveCD Jaunty 64bit and followed the instructions. So far so good, installed this morning at work and been gradually re-installing all kinds of apps and goodies with frequent reboots.

My specs

Asus M51Sn
4GB Ram
Intel Core2 Duo T8300 @ 2.4GHz
GeForce 9500m GS

Following the instructions, went from Jaunty amd64 live cd with kernel 2.6.28-11-generic to .28-13-generic to .29-02062904-generic before rebooting from install.

If ya don't hear from me in a couple of days, then take it as A solution to this ridiculous bug that ate 5 hrs of me-life. But seriously, other than these minor glitches, good work on the dist, hope to find some time and contribute some code one of these days.

Godspeed!

Thomas Aaron (tom-system76) wrote :

Could we please get an update on the prospects for fixing this bug?
It's been about two weeks since the above post.
Is it fixed in the *-14 kernel?

This thing is wreaking havoc on a lot of our older systems, and possibly a couple of our newer ones. Not only is it destroying data, it's destroying profit.

If there is any information we can add to help, please let us know.

Best Regards,
Tom
System76
<email address hidden>

swordthower (mnrjj) wrote :

I have successfully applied the fix in #122 as well. I have an ASUS N80Vb laptop. Everything seems to be working, and I have had no crashes or fs corruption after several reboots.

Fingers crossed...

mirix (miromoman) wrote :

The bug is fixed in Karmic Koala alpha 4 (kernel 2.6.31 RC5) and the 2.6.30 family is also bug-free in this respect.

Paradoxically, Koala seems faster and less bloated than Jackalope ;-)

SecuGuru (christopherthe1) wrote :

This bug still exists in the 2.6.28-14 amd64 kernel...interestingly, it didn't manifest in my system until I upgraded my RAM from 2GB to 4GB.

My system was down for mobo RMA (bad voltage reg) for the last 3 weeks. Was running fine since Jaunty first went live when I disassembled for RMA, root filesystem on ext3 partition.

The first indication of a problem came after I booted and let the update manager run...installation of the 2.6.28-15 kernel image keeps failing due to corrupted tarfile errors. Repeatedly tried to download it...some succeed but throw corruption error on unpacking, other attempts fail outright with 'package checksum mismatch.' I pulled it down manually via wget (~24MB), but the md5sum didn't match the published value for the package. Then I re-ran it and got a different value! And kept getting different values on subsequent md5sum runs.
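
The repeated md5sum check described above can be sketched in Python as follows. This is only an illustration; the helper names `md5_of` and `digests_stable` are made up for this example, and the page cache may hide problems that only show up once the data is actually re-read from the drive.

```python
import hashlib

def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def digests_stable(path, runs=5):
    """Hash the same file several times; return (is_stable, set_of_digests)."""
    seen = {md5_of(path) for _ in range(runs)}
    return len(seen) == 1, seen
```

On a healthy system the set should hold exactly one digest, which can then be compared with the published value for the package.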

My system is dual-boot XP, so I switch to windows and wget the .deb package again onto my NTFS partition. This time, md5sum returns the published value. Reboot using Ubuntu 8.10 (Intrepid, 32-bit) live CD and mount the NTFS partition read-only. I ran fsck on the ext3 partition, but it aborts as clean...so I force fsck and all checks OK. I run md5sum again on the package, it returns the expected value. I mount the ext3 partition and do a 'cp -av' to copy it to /var/cache/apt/archives for the update manager.

Here's where it gets fun. After the copy, I check the md5sum, and it's wrong. So, I check the md5sum on the original copy of the file on the read-only mounted NTFS partition...it's wrong too! WTF?

I reboot to WinXP and check the md5sum on the packages...the copy on the NTFS partition that was downloaded under XP returns the correct MD5 value, but the copy I made to the ext3 partition under the Intrepid Live CD is wrong. (I mount my ext3 partitions in WinXP using an ext2/3 volume manager) I delete the ext3 copy and use 'copy /b /v' to copy the package again from NTFS to ext3. This time, under WinXP the md5sum returns the correct value for both copies.

SUMMARY:
Problem doesn't seem to manifest until 4G RAM installed
Problem exists in both 32-bit and 64-bit kernels (errors under both Ubuntu 8.10 i386 Live CD and 9.04 AMD64 HDD installation currently on 2.6.28-14 kernel)
Problem DOES NOT manifest under WinXP boot, making it very unlikely hardware is the cause

SYSTEM SPECS:
Asus A8N-E mainboard (nForce4 Ultra)
Opteron 185 CPU (dual-core)
4GB Patriot DDR-400 (PC3200) SDRAM, CAS 2
Maxtor 1TB SATA-II HDD
**p1 = 250GB, NTFS
**p2 = 8GB, ext3
**p3 = 680GB, ext3
**p4 = 2GB, swap

SecuGuru (christopherthe1) wrote :

Addendum to previous comment's System Specs:

512MB nVidia 9800GT graphics card

SecuGuru (christopherthe1) wrote :

Installed 2.6.29 kernel per workaround suggestion (http://ubuntuforums.org/showpost.php?p=7382178&postcount=29) to no avail.

Data corruption appears to manifest only in files 8MB or larger. Attempting to update package ia32-libs via update manager results in failed download (hash mismatch). Using wget to pull the package manually results in different MD5 sums each time.

Same file downloaded under WinXP checks out with the correct MD5 sum every time.
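
To see where a bad copy starts to differ from a known-good one (for example, whether the damage really begins past the 8MB mark), a block-by-block comparison along these lines might help. It is a sketch under the assumption that both copies can be read on the same machine; `first_difference` is a hypothetical helper name.

```python
def first_difference(path_a, path_b, block_size=1 << 20):
    """Return the byte offset of the first difference between two files,
    or None if they are identical."""
    offset = 0
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        while True:
            block_a = a.read(block_size)
            block_b = b.read(block_size)
            if block_a != block_b:
                # Narrow down to the exact byte inside this block.
                for i, (x, y) in enumerate(zip(block_a, block_b)):
                    if x != y:
                        return offset + i
                # No differing byte found: one file is a prefix of the other.
                return offset + min(len(block_a), len(block_b))
            if not block_a:
                return None  # both files ended without a difference
            offset += len(block_a)
```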

chastell (chastell) wrote :

Thanks for the detailed testing, SecuGuru. Can you try with 2.6.30 mainline kernel?
http://kernel.ubuntu.com/~kernel-ppa/mainline/v2.6.30.5/

SecuGuru: can you try to reproduce the problem, booting separately with 'iommu=soft', 'iommu=off', 'mem=2G' please?

Each time, it's worthwhile catching the IOMMU settings with 'dmesg | grep -i iommu' after bootup.
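
For anyone collecting these results across several boots, a small Python sketch like the one below can record which options were active and which dmesg lines mention the IOMMU. The function names are assumptions for illustration; the text is expected to come from `dmesg` and from /proc/cmdline on the machine under test.

```python
def iommu_lines(dmesg_text):
    """Return the dmesg lines that mention IOMMU, ignoring case
    (the same filter as `dmesg | grep -i iommu`)."""
    return [line for line in dmesg_text.splitlines() if "iommu" in line.lower()]

def boot_options(cmdline, names=("iommu", "mem")):
    """Pick the requested key=value options out of a kernel command line,
    e.g. the contents of /proc/cmdline."""
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in names:
            found[key] = value
    return found
```

Saving both outputs next to each test result makes it easier to tell later which boot used 'iommu=soft', 'iommu=off' or 'mem=2G'.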

Dr Emixam (dr.emixam) on 2009-10-12
Changed in linux (Ubuntu):
status: Triaged → In Progress
status: In Progress → Confirmed
Zakhar (alainb06) wrote :

I withdraw from this list. As I forecast 5 months ago (post 162), this bug is still uncorrected and now Karmic is out. So I'm not waiting anymore for a correction of this bug, and am skipping directly to Karmic 64, which is an awesome version.

Keep up the good job !..

Dmitry Diskin (diskin) wrote :

So, none of the kernel updates for Jaunty fixed it? Scary.. I moved to
mainline kernel, since I was not able to work on my new laptop because of
that bug. And I'm still on mainline, now it is 2.6.31-02063107-generic. I do
not see a way to test other kernels, because it would possibly trash my
system. If I only had a spare system on dual boot..

raketenman (sesselastronaut) wrote :

i can confirm this bug with an 2.6.31-9-rt kernel
my dmesg:
[27216.779223] EXT4-fs error (device sda3): ext4_add_entry: bad entry in directory #859924: directory entry across blocks - offset=0, inode=3633236108, rec_len=180364, name_len=142
[27216.779231] Aborting journal on device sda3:8.
[27216.779448] EXT4-fs (sda3): Remounting filesystem read-only
[27216.780388] EXT4-fs error (device sda3) in ext4_delete_inode: Journal has aborted
[27216.780393] EXT4-fs error (device sda3) in ext4_create: IO failure
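
When a log contains many such lines, it can help to pull out just the filesystem errors with their device and kernel function. The Python sketch below is one possible way to do that; the regular expression is an assumption based only on the two message shapes visible in the dmesg excerpt above, so other kernels may format these lines differently.

```python
import re

# Matches "EXT4-fs error (device sda3): func: msg" and
# "EXT4-fs error (device sda3) in func: msg" (also EXT2/EXT3).
FS_ERROR = re.compile(
    r"^\[\s*(?P<time>[\d.]+)\]\s+EXT(?P<version>[234])-fs error "
    r"\(device (?P<device>\w+)\)(?::| in) (?P<function>\w+): (?P<message>.*)$"
)

def fs_errors(dmesg_text):
    """Return one dict per ext2/3/4 error line found in the dmesg text."""
    errors = []
    for line in dmesg_text.splitlines():
        match = FS_ERROR.match(line.strip())
        if match:
            errors.append(match.groupdict())
    return errors
```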

@raketenman, please post this, along with system details, to <email address hidden>; posting here isn't going to help

raketenman (sesselastronaut) wrote :

attached the lshw associated with the 2.6.31-9-rt kernel

raketenman (sesselastronaut) wrote :

thanks Daniel for this hint - mail is on the road!

Hi everybody. I'm coming here after a lot of searches about fs corruptions in Ubuntu. Description from the original poster seems to apply very well to my situation.

"Suddenly", already running apps start to seg fault, while new started ones usually report some error with shared objects (missing, not loadable due to header problems... I can't recall the exact messages). I got no data loss, maybe because when these errors start to show I shut down the system as quickly as possible. Usually shutdowns fail and sometimes I'm able to see the EXT3-fs error messages (similar to those in #177).

These are the BAD NEWS. Ubuntu is 9.10 Karmic, and:
$ uname -a
Linux frank 2.6.31-20-generic #57-Ubuntu SMP Mon Feb 8 09:05:19 UTC 2010 i686 GNU/Linux

My system is a 32bit, 6 years old Acer laptop. I can't remember when the bug first showed up, but surely it was there with release 2.6.31-16 (or -15).

Now the GOOD NEWS (I hope). It seems I'm able to REPRODUCE IT!

Some weeks ago I tried to run AC3D 6.5, a (not free) 3d modeler, and after some minutes exploring it, closed it and got a strange error: "Unable to save configuration file in xxx" (or similar). Very peculiar, but the system seemed ok, so I forgot about it and went on; after a few minutes errors were so frequent that I had to shut down the system and fsck the disk from a Live USB.

I tried that software several more times, always getting the same problem, even after some kernel upgrades. So I gave up on it and decided to give K-3D a try. I downloaded and started it. After the splash screen showed, the software had a segmentation fault AND the fs got corrupted once again. Again after some kernel upgrade, I repeated the test and got the same system failure. Then I tried Blender and... surprise, I experienced the bug once again. I tried another (let's say it) OpenGL (non free) software, which I started and operated successfully some months ago, and the bug was there.

From my little experience, I can conclude that any time I start some OpenGL 3D software, this bug shows up. Otherwise, I can keep my system up for days without any problem. Please note that I have Compiz disabled, because it has some glitches with Java Swing applications, and no screensaver running. Moreover, if I remember correctly (must check it), I had some mesa or radeon driver update in these months, between the last working execution of an OpenGL app and the first appearance of the bug.

I'm going to test with some other 3D app just to see whether that path goes anywhere.

I hope this report will help you.

Bye,
Marco

Marco, there is a known hardware data-corruption issue in certain revisions of Via VT82C586A/B/VT82C686/A/B/VT823x/A/C disk controllers; this is most likely the issue you're hitting.

I'm not aware of what workarounds exist. To confirm the issue is with this disk controller, mount the internal 2.5" harddisk in a USB enclosure, boot off it and see if you can reproduce the issue via USB, and you'll know.

Daniel, it seems quite strange because I've been running Linux on this laptop since I bought it, in 2003, and never got this kind of problem. It only showed up a few weeks ago and always follows the same pattern. However, I'm going to confirm the issue following your suggestions as soon as possible. Thank you for your notice.

chastell (chastell) wrote :

Marco: In my case (64-bit Jaunty on a ThinkPad X301 + a 128 GB Samsung MMCQE28G SSD) the issue went away as soon as I switched to a vanilla (mainline) kernel: https://wiki.ubuntu.com/KernelTeam/MainlineBuilds

Can you try to reproduce your issue with one of these kernels? (I’ve been happily using 2.6.30.5 for quite some time now.)

Shot, finally I had some time for testing with other kernels and these are my results.
Versions are expressed as shown in the "Installed version" column of Synaptic.

Ubuntu-specific kernels

2.6.31-20.57 - Doesn't work
2.6.31-19.56 - Doesn't work
2.6.31-18.55 - Doesn't work
2.6.31-17.54 - Doesn't work
2.6.28-16.55 - Works

Ubuntu mainline kernels

2.6.32-02063208 - Doesn't work
2.6.31-02063112 - Doesn't work
2.6.30-02063010 - Works

Chances are the bug was introduced in the 2.6.31 vanilla kernel. For now I'm staying with 2.6.30, which works and allows me to do the (simple) 3D tasks I need.
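
The reasoning behind that guess can be written down as a small Python sketch: sort the tested versions and report the newest one that works and the oldest one that fails. The helper names are hypothetical, and the result only means something if every working version is older than every failing one.

```python
import re

def version_key(version):
    """Turn '2.6.31-02063112' or '2.6.28-16.55' into a sortable tuple of ints."""
    return tuple(int(part) for part in re.split(r"[.-]", version))

def good_bad_boundary(results):
    """Given {version: works?}, return (newest working, oldest failing)."""
    ordered = sorted(results, key=version_key)
    good = [v for v in ordered if results[v]]
    bad = [v for v in ordered if not results[v]]
    return (good[-1] if good else None, bad[0] if bad else None)

# Mainline results as listed above.
mainline = {
    "2.6.32-02063208": False,
    "2.6.31-02063112": False,
    "2.6.30-02063010": True,
}
print(good_bad_boundary(mainline))
```

Testing a mainline build between the two reported versions, if one exists, would narrow the range further.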

I'm here for any other test or update. Sadly I still can't check the disk controller path (#182).

Bye,
Marco

chastell (chastell) wrote :

Marco: By „doesn’t work” do you mean that the bug manifests itself, or that the given kernel doesn’t work at all?

I’m asking because on my 64-bit Jaunty, mainline 2.6.32 doesn’t even boot properly (haven’t tried 2.6.31, went with mainline 2.6.30.10 which works very well).

The custom 2.6.28 kernel(s) that Ubuntu ships with Jaunty manifested this bug in my case, so I’m very reluctant to try any non-mainline kernel (the data loss is non-obvious and can happen to a backup when its drive is connected, so I don’t see a way to safely test a non-mainline kernel).

The first you said. On my machine, every kernel, both mainline and custom, just works. Apart from this very annoying problem, I haven't had a kernel panic in... 6 years?

However, I also never got a data loss, maybe because I'm quite used to recognise the symptoms and hard stop the computer before any loss can happen.

If the bug is not disk controller related, and if you are able to reproduce it like me, you could prepare an Ubuntu live usb pen, install the bug firing program on it, update with the want-to-test kernel, boot from it and check. Never did it before, so I can't figure out any practical problem with this procedure.

Bye,
Marco

Csimbi (turbotalicska) wrote :

Hi there,
I am afraid I have the same problem - the EXT4 file system getting corrupted over time.
I've built a NAS from Ubuntu 9.10 Server amd64. The system+temp is on an SSD drive, while the data is on a RAID6 array using an Adaptec 51645 card and 8 identical 1.5TB disks.
I use SSH/PUTTY to manage the box and Samba to access the data (fill using my Windows machine, play using XBMC on Linux mini), and I never reboot the machine unless there was an update installed.

uname -r: 2.6.31-19-server
fstab: /dev/sdb1 /mnt/raid6 ext4 suid,dev,exec,nodelalloc 0 0
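
For reference, the fstab entry above breaks down into the six standard fields as in this Python sketch (the function name `parse_fstab_line` is made up for this example). Note that a zero in the last field means fsck does not check this filesystem at boot.

```python
def parse_fstab_line(line):
    """Split one fstab entry into its six standard fields."""
    device, mount_point, fs_type, options, dump, fsck_pass = line.split()
    return {
        "device": device,
        "mount_point": mount_point,
        "fs_type": fs_type,
        "options": options.split(","),
        "dump": int(dump),
        "pass": int(fsck_pass),  # 0 means: never checked by fsck at boot
    }

entry = parse_fstab_line("/dev/sdb1 /mnt/raid6 ext4 suid,dev,exec,nodelalloc 0 0")
print(entry["fs_type"], "nodelalloc" in entry["options"], entry["pass"])
```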

I never noticed anything wrong, but today I was looking for a file that was supposed to be there. The file was not there, but there were files from other directories(!). So, obviously there is something wrong.
I removed the mount entry from fstab, rebooted, then ran:
sudo fsck.ext4 -fyv /dev/sdb1
I got massive amounts of inode issues - I could not take a copy because the PUTTY buffer seems to be too small (text has been pushed off very quickly).
Right now it says "Clone multiply-claimed blocks? yes" and it's hanging - I understand it takes quite a while.
I wonder if I am a victim of the same corruption reported in this thread and whether I can fix it using fsck.ext4.

I would not like to lose any data, because I just can't recover it from anywhere (these are HD family movies, pictures and such, nowhere else to be found). I never planned on making backups because RAID6 offers good protection and using nodelalloc in the mount options should protect from power loss.

Please advise. Thank you.

Csimbi (turbotalicska) wrote :

I managed to grab a part of the long long output (this is just a fraction of the whole dump).
See attachment.

Tim McCormack (phyzome) wrote :

I wiped my Intrepid box and installed Karmic... and hit the bug. 2.6.31-14 still causes superblock corruption on my amd64 machine. Here are my specs:

Clevo M762T <http://www.clevo.com.tw/en/products/prodinfo_2.asp?productid=88> with 250 GB SATA Fujitsu MJA2250BH G2 drive. Intel Corporation ICH9M/M-E 2 port SATA IDE Controller.

I will attempt to set the drive to AHCI mode and try another installation.

Tim McCormack (phyzome) wrote :

I was able to set the drive to AHCI mode by setting OS compatibility in the (Phoenix?) BIOS to "Vista" (instead of "Other"), which unlocked an IDE vs. AHCI switch.

I was unable to reliably reproduce the bug while running in IDE mode (across several wipe-and-installs), but did not encounter it at all in AHCI mode. I kept it in that mode and restored my files, and have not seen corruption. (I did have to nuke my WinXP partition to do this. Win7 seems to be OK with AHCI, but probably needs a fresh install or a repair to accommodate the switchover.)

For testing I tried to use the iozone filesystem benchmarking tool from repository in an effort to generate lots of file writes in different ways, but it did not do as I hoped.

While my system now functions, the bug still lurks, waiting.

beej (beej) wrote :

manoj: are you still working on this?

Chelmite (steve-kelem) wrote :

I upgraded from Karmic to Lucid on my x86_64 box. I tried upgrading. When that didn't work, I resorted to formatting the drive and installing from scratch. The initial system works, but (a) doesn't have enough of the packages installed that I need for work, and (b) normal apt-get upgrade or synaptic updates put the system in a nearly unusable state.

Right now, when I boot, I get a purplish "starry" screen, the audible tom-toms, and nothing more...no login greeter, no panels, the mouse doesn't reveal anything on the periphery of the screen, right-clicking on the desktop doesn't bring up anything. I end up having to use a console to log in as root to do xhost +, then use another console to log in as me, then start xfce4-panel. Then I can use emacs. But, firefox, synaptic, thunderbird all get a segmentation fault.

I looked in /var/log/gdm.
What's interesting is that the crash happens in libc. When I run synaptic, it also crashes in libc, as reported in bug 577159. The following is from :0-greeter.log:
Window manager warning: Failed to read saved session file /var/lib/gdm/.config/metacity/sessions/10c5860066ae4f5bf1127424360828629600000013940005.ms: Failed to open file '/var/lib/gdm/.config/metacity/sessions/10c5860066ae4f5bf1127424360828629600000013940005.ms': No such file or directory
** (process:1406): DEBUG: Greeter session pid=1406 display=:0.0 xauthority=/var/run/gdm/auth-for-gdm-rEAtyV/database
gdm[1422]: ******************* START **********************************
gdm[1422]: [Thread debugging using libthread_db enabled]
gdm[1422]: 0x00007f667240744e in waitpid () from /lib/libpthread.so.0
gdm[1422]: #0 0x00007f667240744e in waitpid () from /lib/libpthread.so.0
gdm[1422]: #1 0x000000000042d02b in ?? ()
gdm[1422]: #2 0x000000000042d0d7 in ?? ()
gdm[1422]: #3 <signal handler called>
gdm[1422]: #4 0x00007f666e9827f0 in ?? () from /lib/libc.so.6
gdm[1422]: #5 0x00007f666f774a6a in __xmlParserInputBufferCreateFilename ()
gdm[1422]: from /usr/lib/libxml2.so.2
gdm[1422]: #6 0x00007f666f749d9d in xmlNewInputFromFile () from /usr/lib/libxml2.so.2
gdm[1422]: #7 0x00007f666f7647bb in xmlCtxtReadFile () from /usr/lib/libxml2.so.2
gdm[1422]: #8 0x00007f666fa6f786 in xkl_config_registry_load_from_file ()
gdm[1422]: from /usr/lib/libxklavier.so.16
gdm[1422]: #9 0x00007f666fa6fbe5 in xkl_config_registry_load_helper ()
gdm[1422]: from /usr/lib/libxklavier.so.16
gdm[1422]: #10 0x0000000000427a2c in ?? ()
gdm[1422]: #11 0x0000000000428018 in ?? ()
gdm[1422]: #12 0x00000000004279a4 in ?? ()
gdm[1422]: #13 0x0000000000424013 in ?? ()
gdm[1422]: #14 0x00000000004242a8 in ?? ()
gdm[1422]: #15 0x00000000004278f2 in ?? ()
gdm[1422]: #16 0x00007f666f2ee935 in g_type_create_instance ()
gdm[1422]: from /usr/lib/libgobject-2.0.so.0
gdm[1422]: #17 0x00007f666f2d283c in ?? () from /usr/lib/libgobject-2.0.so.0
gdm[1422]: #18 0x0000000000422886 in ?? ()
gdm[1422]: #19 0x00007f666f2d3841 in g_object_newv () from /usr/lib/libgobject-2.0.so.0
gdm[1422]: #20 0x00007f666f2d42ad in g_object_new_valist ()
gdm[1422]: from /usr/lib/libgobject-2.0.so.0
gdm[1422]: #21 0x00007f666f2d44f1 in g_object_new () from /usr/lib/libgobject-2.0.so.0
gdm[1422]: #22 0...
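
To read a trace like this quickly, one can look at the frame directly after "<signal handler called>", which is where the fault was raised. The Python sketch below shows one way to extract it; the parsing rules are assumptions based only on the line shapes in the log above (a "gdm[1422]: " prefix, and "from ..." sometimes wrapped onto the next line).

```python
import re

def parse_backtrace(text):
    """Parse gdb backtrace lines into dicts with number, function, library."""
    frames = []
    for raw in text.splitlines():
        line = re.sub(r"^\w+\[\d+\]:\s*", "", raw.strip())  # drop "gdm[1422]: "
        head = re.match(r"#(\d+)\s+(.*)", line)
        if head:
            rest = head.group(2)
            func = re.search(r"in (\S+) \(\)", rest)
            lib = re.search(r"from (\S+)", rest)
            frames.append({
                "number": int(head.group(1)),
                "function": func.group(1) if func else rest,
                "library": lib.group(1) if lib else None,
            })
        elif line.startswith("from ") and frames:
            # Wrapped continuation line: belongs to the previous frame.
            frames[-1]["library"] = line.split()[1]
    return frames

def faulting_frame(frames):
    """Return the frame right after '<signal handler called>', if any."""
    for index, frame in enumerate(frames):
        if frame["function"] == "<signal handler called>" and index + 1 < len(frames):
            return frames[index + 1]
    return None
```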


Emily Wind (emilywind) wrote :

It seems this bug is related to https://bugs.launchpad.net/ubuntu/+source/linux/+bug/346691, which seems to randomly affect different 64bit kernel releases and not others. This would explain why the error report on the Ubuntu forums about this dated back to 2008 and such. If the developers looked for a patch pattern within the affected kernels, that would likely be a good start.

I think the reason this recently started affecting me a lot might be the latest kernel (2.6.32-22). I did not have the issues at all with kernel 2.6.32-21, as GUmeR reports, so reverting to that is the best bet for avoiding this issue for now. Cheers.

Emily Wind (emilywind) wrote :

Disregard the above post, except for the points about looking at patch patterns in the affected kernels and that 2.6.32-21 did not have the issue for me and GUmeR who posted in this bug report which seems to cover the same issue: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/515937

Cheers.

Emily Wind (emilywind) wrote :

These are the bugs fixed in 2.6.32-22 according to the update-manager along with https://lists.ubuntu.com/archives/lucid-changes/2010-April/011181.html

[ Andy Whitcroft ]
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/526354

[ Tim Gardner ]
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/567016

It is likely that one of those patches accidentally broke something and caused this error, for example if it was written without 64-bit in mind. I am going to contact Manoj about this to get his attention. Cheers.

Chelmite (steve-kelem) wrote :

I have the problem in 2.6.32.21.

Emily Wind (emilywind) wrote :

It seems that could possibly be unrelated at this point, but https://bugzilla.kernel.org/show_bug.cgi?id=16006 seems to have an answer. It is my error, but reading some of the comments here makes me think this bug report might not be the same as 515937, but could be causing some of the issues reported here. Hopefully we can get the ball rolling on a kernel update soon. :)

Chelmite (steve-kelem) wrote :

I installed the new kernel, 2.6.32, and still have the same problem with the greeter, synaptic, firefox, and thunderbird getting segmentation faults. The gdb traceback for synaptic hints that the problem is in libc. The top of the traceback follows. It looks to my (partially trained) eyes as though there's a problem with strncmp/strcmp for x86_64. This may affect the kernel, but the effect I'm seeing is in programs outside the kernel.

Program received signal SIGSEGV, Segmentation fault.
__strncmp_ssse3 () at ../sysdeps/x86_64/multiarch/../strcmp.S:100
100 ../sysdeps/x86_64/multiarch/../strcmp.S: No such file or directory.
 in ../sysdeps/x86_64/multiarch/../strcmp.S
(gdb) where
#0 __strncmp_ssse3 () at ../sysdeps/x86_64/multiarch/../strcmp.S:100
#1 0x00007ffff6dd6a6a in __xmlParserInputBufferCreateFilename ()
   from /usr/lib/libxml2.so.2

tags: added: cherry-pick
®om (rom1v) wrote :

The bug was fixed in later versions of the kernel, but it seems to have reappeared in 2.6.35-19 (in the Maverick beta): https://bugs.launchpad.net/ubuntu/+source/linux/+bug/636430

Pete Graner (pgraner) on 2011-01-10
Changed in linux (Ubuntu):
status: Confirmed → Fix Released
Tim McCormack (phyzome) wrote :

Pete, what was the actual bug, and where is the fix released?

RobM (robert-meerman) wrote :

I second that - what was/is the bug, and where can I obtain the fix?

In other words - in which "stock" Ubuntu kernel was it fixed?

--
Dmitry
