ext4 filesystem errors on SSD disk

Bug #992424 reported by Peter Meier
This bug affects 19 people
Affects         Status        Importance  Assigned to  Milestone
Linux           Fix Released  Medium
linux (Ubuntu)  Fix Released  High        denace

Bug Description

After upgrading from 11.10 to 12.04 I quickly started getting EXT4 filesystem errors on my root fs, which result in / being remounted read-only.

How to reproduce:

1. Boot 12.04 with the latest 12.04 kernel (3.2.0.24.26)
2. Start some IO-heavy (probably write-intensive) task, like syncing your mailboxes with offlineimap
3. Boom: / is remounted read-only (there are two partitions, /boot and /)
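For step 2, a synthetic write-heavy load along these lines may stand in for offlineimap (a sketch; WORKDIR, the file count, and the file sizes are arbitrary choices, not from the original report):

```shell
# Hammer the filesystem with many small fsync'd writes, roughly what a
# mailbox sync does. WORKDIR is an assumption -- point it at a directory
# on the affected filesystem to stress that disk.
WORKDIR="${WORKDIR:-/tmp/io-stress}"
mkdir -p "$WORKDIR"
for i in $(seq 1 200); do
    # 8 x 4 KiB blocks per file, flushed to disk before dd returns
    dd if=/dev/zero of="$WORKDIR/file$i" bs=4k count=8 conv=fsync 2>/dev/null
done
sync
```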

dmesg then usually shows these four lines, but nothing more:

[11742.577091] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:739: group 908, 32254 clusters in bitmap, 32258 in gd
[11742.577100] Aborting journal on device dm-1-8.
[11742.577337] EXT4-fs (dm-1): Remounting filesystem read-only
[11742.577357] EXT4-fs (dm-1): ext4_da_writepages: jbd2_start: 9223372036854775807 pages, ino 14876673; err -30

You can then reboot your system, let fsck find a few errors it can fix, reboot again (as the root fs changed) and repeat the steps above.
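The check/repair cycle can be rehearsed safely on a scratch image file rather than the live root fs, since the e2fsprogs tools work on plain files (a sketch; IMG is an assumed path, and on the real disk fsck has to run from a rescue boot while / is unmounted):

```shell
# Build a throwaway ext4 image and run a read-only check against it.
IMG="${IMG:-/tmp/ext4-scratch.img}"
dd if=/dev/zero of="$IMG" bs=1M count=64 2>/dev/null
mkfs.ext4 -q -F "$IMG"          # -F: operate on a regular file
fsck.ext4 -f -n "$IMG"          # -n: check only; a real repair would use -y
```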

I can boot my 12.04 system with the 11.10 kernel (2.6.38-12.51) and everything works fine (except the wireless card, but that's probably an unrelated bug), so I assume it must be a regression in the EXT4 code of the latest 12.04 kernel.
Also, I had no problems with 11.10 and all its previous versions.

This happens only on my laptop, which uses an Intel SSDSA2CW160G3. On another machine, which I upgraded at the same time and use as frequently as my laptop, but which has a normal SATA disk, it hasn't happened so far. Both machines have two partitions; the LVM for the root filesystem and swap sits on the second, an encrypted partition:

/dev/sda1 /boot
/dev/sda2 -> cryptsetup
  -> lvm
    -> root
    -> swap

As both systems are set up the same way, but only the laptop (the SSD machine) misbehaves on the latest kernel, I assume it could have something to do with the SSD; hence the SSD in the title of this bug report.

My fstab looks like this:

$ cat /etc/fstab
# /etc/fstab: static file system information.
#
# <file system> <mount point> <type> <options> <dump> <pass>
proc /proc proc defaults 0 0
# /dev/mapper/foo-root
UUID=5eb462f7-485f-48f0-a50b-f07de47c8d01 / ext4 defaults,errors=remount-ro,relatime 0 1
# /dev/sda1
UUID=c69c5d4d-179f-43c7-a793-bc10254f2b1c /boot ext3 defaults,relatime 0 2
# /dev/mapper/foo-swap_1
UUID=e8c2dc03-1b7a-4bb9-a983-cdf40d77d50f none swap sw 0 0

Attached are a dmesg output from the 12.04 kernel and lspci -vnn output. After submitting this bug I will boot into the newer kernel and also attach uname and version_signature output.

If you need any additional information, please let me know.

$ lsb_release -rd
Description: Ubuntu 12.04 LTS
Release: 12.04

$ apt-cache policy linux-image
linux-image:
  Installed: 3.2.0.24.26
  Candidate: 3.2.0.24.26
  Version table:
 *** 3.2.0.24.26 0
        500 http://archive.ubuntu.com/ubuntu/ precise-updates/main amd64 Packages
        100 /var/lib/dpkg/status
     3.2.0.23.25 0
        500 http://archive.ubuntu.com/ubuntu/ precise/main amd64 Packages

Revision history for this message
Peter Meier (peter-meier) wrote :

$ cat /proc/version_signature
Ubuntu 3.2.0-24.37-generic 3.2.14
$ uname -a
Linux foo 3.2.0-24-generic #37-Ubuntu SMP Wed Apr 25 08:43:22 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Brad Figg (brad-figg)
affects: linux-meta (Ubuntu) → linux (Ubuntu)
Brad Figg (brad-figg)
Changed in linux (Ubuntu):
status: New → Confirmed
tags: added: precise
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.4 kernel [1] (not a kernel in the daily directory). Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag (only that one tag; please leave the other tags). This can be done by clicking on the yellow pencil icon next to the tags located at the bottom of the bug description and deleting the 'needs-upstream-testing' text.

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[1] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.4-rc4-precise/

Changed in linux (Ubuntu):
importance: Undecided → Medium
importance: Medium → High
tags: added: needs-upstream-testing
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
tags: added: kernel-da-key
Revision history for this message
Peter Meier (peter-meier) wrote :

I'm now running a 3.4 kernel. Will report back if there are problems.

Revision history for this message
Peter Meier (peter-meier) wrote :

Unfortunately it also happened with the upstream kernel:

[14081.450885] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:741: group 904, 32254 clusters in bitmap, 32258 in gd
[14081.450895] Aborting journal on device dm-1-8.
[14081.451151] EXT4-fs (dm-1): Remounting filesystem read-only
[14081.452467] EXT4-fs (dm-1): ext4_da_writepages: jbd2_start: 9223372036854775807 pages, ino 2408796; err -30

$ uname -a
Linux foo 3.4.0-030400rc4-generic #201204230908 SMP Mon Apr 23 13:10:03 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

tags: added: kernel-bug-exists-upstream
removed: needs-upstream-testing
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

This issue appears to be an upstream bug, since you tested the latest upstream kernel. Would it be possible for you to open an upstream bug report at bugzilla.kernel.org [1]? That will allow the upstream Developers to examine the issue, and may provide a quicker resolution to the bug.

If you are comfortable with opening a bug upstream, it would be great if you could report the upstream bug number back in this bug report. That will allow us to link this bug to the upstream report.

[1] https://wiki.ubuntu.com/Bugs/Upstream/kernel

Changed in linux (Ubuntu):
status: Incomplete → Triaged
Revision history for this message
Peter Meier (peter-meier) wrote :

Imho this is upstream bug #42723 -> https://bugzilla.kernel.org/show_bug.cgi?id=42723

I went back to 11.04 with 3.0.0-19-generic, which has worked fine and without any problems for nearly a week now.

Revision history for this message
Steff (s-teff) wrote :

Same problem with a P-ATA hard disk (ATA WDC WD2500BEVE-00WZT0) in a Thinkpad T43.
It works with Xubuntu 12.04 and kernel 3.0.0-17-generic.

Revision history for this message
Jim Bander (jim-bander) wrote :

Same problem on SSD boot disk with a recently-updated copy of Linux Mint 10 Julia:

 # lsb_release -rd
Description: Linux Mint 10 Julia
Release: 10
# apt-cache policy linux-image
linux-image:
  Installed: (none)
  Candidate: 2.6.35.32.42
  Version table:
     2.6.35.32.42 0
        500 http://archive.ubuntu.com/ubuntu/ maverick-updates/main amd64 Packages
        500 http://security.ubuntu.com/ubuntu/ maverick-security/main amd64 Packages
     2.6.35.22.23 0
        500 http://archive.ubuntu.com/ubuntu/ maverick/main amd64 Packages

Changed in linux:
importance: Unknown → Medium
status: Unknown → Confirmed
Revision history for this message
Jeremy Sanders (jeremysanders) wrote :

Same problem for me with a Seagate Barracuda 7200.9 ST3500641AS using ata_piix, connected to SATA (IDE interface: Intel Corporation N10/ICH7 Family SATA Controller [IDE mode] (rev 01)).

Revision history for this message
Heiko Sieger (h-sieger) wrote :

I'm experiencing similar file system corruption issues with my Sandisk Extreme 120GB SSD. All Ubuntu- or Debian-based Linux OSes I have tried so far, including Ubuntu 12.04, as well as Fedora 16 and 17, intermittently report file system errors on bootup. After fsck fixes the errors (or in case it boots normally), I get various segmentation faults and/or CRC errors for files. Having tried 5 or 6 different distributions with the same results, I must conclude this may be a kernel issue.

I will update my SSD firmware in the hope that this solves the issue, but I doubt it will, since the updated firmware supposedly only addresses TRIM issues (the discard option in fstab).

All distributions worked perfectly fine when booting them from a live USB stick. I checked my RAM using memtest86+ and it reports no errors. Also smartctl does not reveal any problems with the SSD.

I managed to install and boot several 3.2 and later kernels from SSD, but when using synaptic for updates and installs I eventually get segmentation faults for synaptic, or the system doesn't boot anymore. It looks like something is corrupting the file system.

Again, this happened with all new Linux kernels 3.2 and above (see the list of distributions I tried above). I do not have any issues with the SSD while running from a live USB stick. I can format the disk, install on it, chroot into it and install or modify things, but I can't get a stable system booting from the SSD. One of the common errors when the SSD doesn't boot is an efidisk read error. But it also doesn't work with an MBR-formatted SSD.

I wonder whether SSDs can be used with kernel 3.2 and above at all. I didn't try older kernel versions, though. Sorry I can't post more specific details, but if anybody is interested in output of debugging commands from Ubuntu 12.04, I will try.

Revision history for this message
Thomas Hood (jdthood) wrote :

With ThinkPad X220 and 240 GB Intel 520 SSD I also get serious filesystem corruption errors running Ubuntu 12.04 desktop. "Serious" means: just now I couldn't boot from the Ubuntu partition on the SSD; I had to boot from another disk and fsck the Ubuntu partition on the SSD, and even then the booted Ubuntu 12.04 system behaves erratically, indicating file corruption has occurred.

Until I figure out what is going on I have gone back to running Ubuntu 11.10 from my trusty old Hitachi hard disk.

Revision history for this message
Noisome (jeffry-r-walsh) wrote :

I am having problems as well. I thought my SSD was dying. During updates it would hit an input/output error and halt the updates, but an older version of Ubuntu functioned without errors. I switched to ReiserFS and have not had any errors so far.

Revision history for this message
André Desgualdo Pereira (desgua) wrote :

I tested ReiserFS as suggested, but in my case, despite fsck not showing disk corruption, I got some other problems: after a few minutes read/write goes super slow (from ~100 MB/s to ~1 MB/s as shown by hdparm -t /dev/sda); then I got "Bus error" when trying to launch some apps (e.g. gnome-terminal, gedit, rhythmbox, eog), which I could not fix (I even compiled gnome-terminal but got the same error); and finally the system would not boot anymore, with DRDY ERR. Not to mention that I couldn't find a way to TRIM (neither fstrim nor hdparm worked for this).
So I am wondering whether the problem involves other components besides the filesystem.

Revision history for this message
Phattanon Duangdara (sfalpha) wrote :

It seems to be related to the chipset and IDE mode in the BIOS.

I found this problem only on an old server (Dell 860) using IDE mode on ICH7/ICH7R. And it affects not only SSDs, but also HDDs.

Another server with ICH10 in AHCI mode works flawlessly.

Revision history for this message
André Desgualdo Pereira (desgua) wrote : Re: [Bug 992424] Re: ext4 filesystem errors on SSD disk

I am using AHCI mode. I have now been testing Btrfs for two days. Although it is much slower to boot (11.6 seconds vs 4.8 with ext4) and to install, so far I have not experienced any file corruption. I will report again in a couple of weeks.
Regards.

Revision history for this message
André Desgualdo Pereira (desgua) wrote :

More info, trying to elucidate this issue.
Using btrfs: I unintentionally pressed the evil suspend button and the computer went into a loop of sleeping and waking until it rebooted with a corrupted filesystem. Btrfs seems to somehow recover itself, and after ~300 seconds it managed to boot after a few DRDY errors and other errors. I tried to boot from a live CD and repair it, without success.

Revision history for this message
NIkolaos Papadakis (nkpapas) wrote :

The same applies to me. It is a nightmare! I have a 90 GB SSD and Kubuntu 12.04 64-bit.
Since I installed 12.04, the SSD is remounted read-only at random times.
When I reboot I have to disconnect the SSD from power, and then the PC boots indicating no errors.
I will try to format the disk and install the OS on a clean ext3 file system on the same SSD, and see if the problem persists.
I will keep you updated.

N.-

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Can folks affected by this bug test the latest mainline kernel, which is v3.6-rc7 [0]?

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.6-rc7-quantal/

Changed in linux (Ubuntu):
status: Triaged → Fix Released
Revision history for this message
Thomas Hood (jdthood) wrote :

@Charles: Can you please explain why this report has been reclassified as "Fix Released"? Which kernel version contains the fix? How has this fix been tested?

Revision history for this message
Charles Forsyth (charles-forsyth) wrote :

Sorry. I accidentally changed the status from Triaged, thinking I was clicking through to more details, and it won't let me change it back. Perhaps someone can revert that change.

I was intending just to add a note that this bug has been affecting several different Ubuntu systems I maintain for myself, friends and relatives, at least one of which does not have an SSD. On my own laptop the journal ends up corrupted about once every two days, sometimes once a day. I switched off the "discard" mount option last week.

The following is a typical diagnostic, but the details vary.

[250162.298456] EXT4-fs error (device sda1): ext4_ext_remove_space:2574: inode #1077692: comm Chrome_CacheThr: bad header/extent: invalid magic - magic 87cf, entries 10138, max 3135(0), depth 51679(0)
[250162.298471] Aborting journal on device sda1-8.
[250162.298587] EXT4-fs (sda1): Remounting filesystem read-only
[250162.298594] EXT4-fs error (device sda1) in ext4_ext_remove_space:2637: IO failure
[250162.298706] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4494: Journal has aborted

In my case, the system has been reliable for months before. This trouble seemed to start on 4 October, when an Ubuntu update-manager invocation remade initrd (but the kernel didn't change from /boot/vmlinuz-3.2.0-32-generic).

I'm trying Linux 3.6.2-030602-generic #201210121823 SMP Fri Oct 12 22:31:22 UTC 2012 i686 i686 i386 GNU/Linux

Revision history for this message
Charles Forsyth (charles-forsyth) wrote :

@Thomas See my last message. I used the web page interface incorrectly and then could not undo my mistake.

Revision history for this message
Thomas Hood (jdthood) wrote :

@Charles: I've had the same problem in the past. :)

Revision history for this message
Charles Forsyth (charles-forsyth) wrote :

It might be worth noting that the bug is fairly nasty. I had to recover my system yesterday afternoon. Today, after the overnight locatedb build (which might be disk intensive) but relatively little use of the disk during a few hours of normal work, the bug was triggered again by my downloading and unpacking (dpkg -i) the 3.6.2-030602 kernel components. It failed during the unpack, and I had to boot a rescue CD and run fsck, which reset the file system state back many hours (by discarding part of the journal); then I downloaded and unpacked the kernel packages again, since all of that had been lost.
That worked, so I'm running with that now.

Revision history for this message
Thomas Hood (jdthood) wrote :

Yep, nasty enough for me to stop using my new 240 GB Intel "520 series" SSD (model number SSDSC2CW240A3) and return to my old Hitachi 320GB HD which works perfectly.

Revision history for this message
Charles Forsyth (charles-forsyth) wrote :

And this just in:
Linux ... 3.6.2-030602-generic #201210121823 SMP Fri Oct 12 22:31:22 UTC 2012 i686 i686 i386 GNU/Linux

[88861.206938] EXT4-fs error (device sda1): __ext4_ext_check_block:472: inode #1080426: comm Chrome_CacheThr: bad header/extent: invalid magic - magic 81a4, entries 1000, max 30863(0), depth 0(0)
[88861.206944] Aborting journal on device sda1-8.
[88861.207068] EXT4-fs (sda1): Remounting filesystem read-only
[88861.207076] EXT4-fs error (device sda1) in ext4_ext_remove_space:2790: IO failure
[88861.207152] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4550: Journal has aborted
[88861.207246] EXT4-fs error (device sda1) in ext4_ext_truncate:4308: Journal has aborted
[88861.207352] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4550: Journal has aborted
[88861.207467] EXT4-fs error (device sda1) in ext4_orphan_del:2491: Journal has aborted
[88861.207572] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4550: Journal has aborted

Zut alors!

Revision history for this message
Charles Forsyth (charles-forsyth) wrote :

I turned off the "discard" option last week after the first bout of trouble, based on a suggestion in a googled bug tracker that there was an off-by-one in scsi_debug in the part that implemented TRIM. (Not that I thought I was running scsi_debug, but since I was having similar problems I suspected there might be similar trouble elsewhere.)

What would there be that's SSD-related apart from trim?
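For anyone else who wants to try the same experiment, disabling discard boils down to removing the option from the fstab entry and remounting. A sketch against a sample line (the line below is hypothetical, not taken from any system in this report; on a real machine edit /etc/fstab itself and then remount):

```shell
# Strip the "discard" mount option from an fstab-style line.
FSTAB=/tmp/fstab.sample
printf 'UUID=5eb462f7-485f-48f0-a50b-f07de47c8d01 / ext4 defaults,errors=remount-ro,discard,relatime 0 1\n' > "$FSTAB"
sed -i -e 's/,discard//g' -e 's/discard,//g' "$FSTAB"
cat "$FSTAB"
# On the real system, after editing /etc/fstab:
#   sudo mount -o remount /
```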

Revision history for this message
Charles Forsyth (charles-forsyth) wrote :

Since I get problems anyway, I've enabled "discard" again.

I might be willing to believe my SSD has hardware problems or is wearing out, but I've seen trouble on two different systems so far.
Also, in February, when I had corruption with ext2/3, I thought it might be the memory or the SSD, but memory tests came back fine, and then it happened on a conventional SATA drive on a different machine as well.
I changed those machines to ext4 (since that seemed more actively maintained, and the diagnostic I was getting related to a known race in ext2/3), and... the problem went away until last week.

There have never been relevant errors in the drive's error log (SMART).

I could change over to a replacement SSD I bought in February, but it's a bit of a slog that won't work if there really is a driver problem somewhere.

Revision history for this message
Charles Forsyth (charles-forsyth) wrote :

Since I seem to be able to reproduce it, not at will, but at least within a day or two, is there something constructive I could do to help track this down? It looks useful to run with a USB stick attached and mounted, so I can copy logs etc. onto it when things go wrong.

Revision history for this message
André Desgualdo Pereira (desgua) wrote :

@Charles
Can you explain (step by step if possible) how to reproduce the error?
(I am willing to do some tests, but the corruption seems random to me)

Revision history for this message
Charles Forsyth (charles-forsyth) wrote :

@André
By being able to reproduce it, I meant only that it happens so regularly from day to day that I can try to capture more information: it isn't "once in a blue moon". It does seem to be a function of the amount of file IO (which makes sense), so I thought I might risk generating a load synthetically.

Revision history for this message
André Desgualdo Pereira (desgua) wrote :

@Charles
Ok, I thought you could trigger it whenever you wanted. I agree with you that it seems to be a function of the amount of file IO. Unfortunately I could not reproduce it when I tried to generate a lot of reads and writes, but my files get corrupted after 15-20 days.
I am now testing the same disk (Intel SSD) in another notebook (also with up-to-date Ubuntu 12.04), just to make sure it is not related to a hardware problem with another component, like the motherboard for example.

Revision history for this message
Charles Forsyth (charles-forsyth) wrote :

Today's contribution, after hardly any work (allowing for overnight locatedb updates and anything else cron might do):

[70391.556798] EXT4-fs error (device sda1): __ext4_ext_check_block:472: inode #560844: comm Chrome_CacheThr: bad header/extent: invalid magic - magic 3262, entries 13113, max 14435(0), depth 12385(0)
[70391.556806] Aborting journal on device sda1-8.
[70391.556876] EXT4-fs (sda1): Remounting filesystem read-only
[70391.556881] EXT4-fs error (device sda1) in ext4_ext_remove_space:2790: IO failure
[70391.556956] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4550: Journal has aborted
[70391.557068] EXT4-fs error (device sda1) in ext4_ext_truncate:4308: Journal has aborted
[70391.557159] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4550: Journal has aborted
[70391.557260] EXT4-fs error (device sda1) in ext4_orphan_del:2491: Journal has aborted
[70391.557360] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4550: Journal has aborted

Similar traceback.

I find it interesting that the locatedb is often supposedly corrupt at the same time:

% locate cron
locate: `/var/lib/mlocate/mlocate.db' does not seem to be a mlocate database

After a reboot and a file system check, however, it's fine.

Some output from dumpe2fs:
dumpe2fs 1.42 (29-Nov-2011)
Filesystem volume name: <none>
Last mounted on: /
Filesystem UUID: 4bfde6cd-c859-40f6-8848-9ecaa5d93265
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr dir_index filetype needs_recovery extent sparse_super large_file uninit_bg
Filesystem flags: signed_directory_hash
Default mount options: discard
Filesystem state: clean with errors
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 2215168
Block count: 4442364
Reserved block count: 222117
Free blocks: 381539
Free inodes: 1611147
First block: 0
Block size: 4096
Fragment size: 4096
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 16288
Inode blocks per group: 509
Last mount time: Wed Oct 17 14:44:21 2012
Last write time: Thu Oct 18 10:20:17 2012
Mount count: 1
Maximum mount count: 30
Last checked: Wed Oct 17 15:40:55 2012
Check interval: 0 (<none>)
Lifetime writes: 437 GB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 128
Journal inode: 8
First orphan inode: 560844
Default directory hash: half_md4
Directory Hash Seed: 51028ff9-6909-473c-bbb9-01690bdd9a66
Journal backup: inode blocks
FS Error count: 7
First error time: Thu Oct 18 10:20:17 2012
First error function: __ext4_ext_check_block
First error line #: 472
First error inode #: 560844
First error block #: 0
Last error time: Thu Oct 18 10:20:17 2012
Last error function: ext4_reserve_inode_write
Last error line #: 4550
Last error inode #: 560844...


Revision history for this message
Charles Forsyth (charles-forsyth) wrote :

I think I know what it is, in my case. If my revised system gets through the next few days, let alone a week, without trouble, I'll consider it reasonably certain.

Revision history for this message
Charles Forsyth (charles-forsyth) wrote :

I was wrong about that. fsck -c -c -k ... had found 3 bad blocks, so I thought "ah! it was a device error after all". Having moved the bad blocks out of the way, I expected all to return to normal, and would have moved on to complaining about the complete lack of visible diagnostics (including in dmesg) of any IO error when writing to the bad blocks. (It's possible that the IO ends up in the device cache, and it's not until that is flushed to the drive that any error is detected; that isn't communicated back to the host, so there's little the software can do.)

In fact, the system has continued on in the same old way. Just now:
[73851.280405] EXT4-fs error (device sda1): ext4_mb_generate_buddy:741: group 67, 6802 clusters in bitmap, 6777 in gd
[73851.280416] Aborting journal on device sda1-8.
[73851.280527] EXT4-fs (sda1): Remounting filesystem read-only
[73851.280541] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4550: Journal has aborted
[73851.280639] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4550: Journal has aborted
[73851.280836] EXT4-fs error (device sda1) in ext4_ext_remove_space:2790: Journal has aborted
[73851.280922] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4550: Journal has aborted
[73851.281005] EXT4-fs error (device sda1) in ext4_ext_truncate:4308: Journal has aborted
[73851.281093] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4550: Journal has aborted
[73851.281165] EXT4-fs error (device sda1) in ext4_orphan_del:2491: Journal has aborted
[73851.281313] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4550: Journal has aborted
[73851.331505] EXT4-fs error (device sda1): ext4_mb_generate_buddy:741: group 128, 8453 clusters in bitmap, 8437 in gd
[73851.331513] EXT4-fs (sda1): pa f61a05e8: logic 2637, phys. 4209875, len 3
[73851.331516] EXT4-fs error (device sda1): ext4_mb_release_inode_pa:3607: group 128, free 3, pa_free 2

and after an fsck, it has reverted chunks of the file system because (presumably, not that it tells you anywhere) it has discarded the tail of the journal.

This has become unusable.

Revision history for this message
Charles Forsyth (charles-forsyth) wrote :

I wondered whether ext4_mb_generate_buddy might be related to http://help.lockergnome.com/linux/ext4fs-error-ext4_mb_generate_buddy-741-group-16-8160-cluste--ftopict559576.html
so, as a final experiment, I've switched off "discard" once more. That bug seemed confined to scsi_debug, but who knows?

On the other hand, another round of fsck -c -c -k found 3 more bad blocks, once again in the journal. Note that fsck -c -c -k has a problem: it adds the blocks to the bad-block list, and promptly complains that there are duplicate blocks in the bad-block inode and in the journal. It then asks whether to clone the multiply-allocated blocks. Unfortunately, it gives no sign of how it will clone them: will the bad blocks remain in the bad-block list, with the copies going into the journal, or will the blocks remain in the journal, with copies uselessly placed in the bad-block list? I decided it was safer to delete the journal, re-run the check (leaving the blocks only in the bad-block list), then recreate the journal, then switch discard off.
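The delete-journal / check / recreate-journal sequence might look like the following, rehearsed here on a scratch image (a sketch; IMG is an assumed path, and on the real disk these commands must run from a rescue boot with the filesystem unmounted):

```shell
# Build a throwaway ext4 image, then rebuild its journal the same way.
IMG="${IMG:-/tmp/journal-scratch.img}"
dd if=/dev/zero of="$IMG" bs=1M count=64 2>/dev/null
mkfs.ext4 -q -F "$IMG"
tune2fs -O ^has_journal "$IMG"   # delete the journal
e2fsck -f -y "$IMG" >/dev/null   # full check with no journal in the way
tune2fs -j "$IMG"                # recreate the journal
```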

The bad blocks were as follows:

(0-2):2277409-2277411
(3-5):2277665-2277667

The pattern is easier to see in hex, I think:
22C021
22C022
22C023
22C121
22C122
22C123

Hmm. 3 in a row each time. 256 might be important in the internal geometry. *Might* be a failing drive.
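The decimal-to-hex conversion above can be redone with printf, just to double-check the arithmetic (the block numbers are the six from the fsck output quoted above):

```shell
# Print the six bad-block numbers in hex; note the 0x100 (256) stride
# between the two groups of three.
for blk in 2277409 2277410 2277411 2277665 2277666 2277667; do
    printf '%X\n' "$blk"
done
```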

Revision history for this message
André Desgualdo Pereira (desgua) wrote :

It would be very nice if Linus gave his opinion on this. He has said that he doesn't like spinning disks, so besides coordinating the kernel he also uses solid-state disks. Is this a hardware failure? Or a bug?

Revision history for this message
Charles Forsyth (charles-forsyth) wrote :

If I get through the next day or two without the problems that plagued me last week, that might suggest trying again with "discard" to see if problems reappear.

On the other hand, if I have further trouble, I've got a new replacement SSD to try.

Since my usage doesn't seem to be that unusual, and is if anything fairly light, and many people these days use SSDs with Ubuntu, even Ubuntu 12.04, I'd assumed that if there were a bug causing frequent corruption of [file systems on] SSDs at the rate I've seen, there would be widespread reports of dismay; but that doesn't seem to be so.

Revision history for this message
André Desgualdo Pereira (desgua) wrote :

@Charles
I think you are right. Keep us updated with your tests. :-)

Revision history for this message
Charles Forsyth (charles-forsyth) wrote :

At 2am, after working away happily for 1.5 days (I copied the work out regularly), some time after a 45 MB Software Update, it began to go wrong. A reboot prompted the following repair, and I left it:

[ 6.314129] EXT4-fs (sda1): orphan cleanup on readonly fs
[ 6.314137] EXT4-fs (sda1): ext4_orphan_cleanup: deleting unreferenced inode 309788
[ 6.315820] EXT4-fs (sda1): ext4_orphan_cleanup: deleting unreferenced inode 766985
[ 6.317334] EXT4-fs (sda1): ext4_orphan_cleanup: deleting unreferenced inode 769166
[ 6.317369] EXT4-fs (sda1): ext4_orphan_cleanup: deleting unreferenced inode 769170
[ 6.317391] EXT4-fs (sda1): ext4_orphan_cleanup: deleting unreferenced inode 768775
[ 6.320236] EXT4-fs (sda1): ext4_orphan_cleanup: deleting unreferenced inode 814440
[ 6.320273] EXT4-fs (sda1): ext4_orphan_cleanup: deleting unreferenced inode 768541
[ 6.322899] EXT4-fs (sda1): ext4_orphan_cleanup: deleting unreferenced inode 311047
[ 6.322945] EXT4-fs (sda1): ext4_orphan_cleanup: deleting unreferenced inode 977877
[ 6.322973] EXT4-fs (sda1): ext4_orphan_cleanup: deleting unreferenced inode 310939

I left it until this morning, and at 9am, after restoring my work by copying it back in and a little use of Chrome:

[ 665.062016] rm[14080]: segfault at 4 ip b77ab60d sp bfbd20a0 error 4 in ld-2.15.so[b77a9000+20000]
[ 665.160702] mv[14084]: segfault at 4 ip b7790481 sp bfa935b0 error 4 in ld-2.15.so[b7785000+20000]
[ 665.264448] touch[14088]: segfault at 4 ip b77c9481 sp bf82cac0 error 4 in ld-2.15.so[b77be000+20000]
[ 1348.241462] dell_wmi: Received unknown WMI event (0x11)
[ 7654.917714] readlink[14150]: segfault at 4 ip b776a481 sp bfeedef0 error 4 in ld-2.15.so[b775f000+20000]
[ 7655.051813] dirname[14156]: segfault at 4 ip b76fc481 sp bff23930 error 4 in ld-2.15.so[b76f1000+20000]
[ 7655.182041] mkdir[14163]: segfault at 4 ip b77b2481 sp bfe16680 error 4 in ld-2.15.so[b77a7000+20000]
[14836.922957] readlink[14226]: segfault at 4 ip b7784481 sp bfd0e0c0 error 4 in ld-2.15.so[b7779000+20000]
[14837.011050] dirname[14230]: segfault at 4 ip b77c5481 sp bfc2a6a0 error 4 in ld-2.15.so[b77ba000+20000]
[14837.107422] mkdir[14235]: segfault at 4 ip b775e481 sp bf91e510 error 4 in ld-2.15.so[b7753000+20000]
[15019.548968] EXT4-fs error (device sda1): __ext4_ext_check_block:472: inode #310825: comm rs:main Q:Reg: bad header/extent: invalid magic - magic 8b1f, entries 8, max 20596(0), depth 20597(0)
[15019.548974] Aborting journal on device sda1-8.
[15019.549058] EXT4-fs (sda1): Remounting filesystem read-only
[15019.549069] EXT4-fs error (device sda1) in ext4_da_write_begin:2533: IO failure
[22018.927076] readlink[14332]: segfault at 4 ip b77c2481 sp bfbe0f00 error 4 in ld-2.15.so[b77b7000+20000]
[22019.009226] dirname[14334]: segfault at 4 ip b776e481 sp bfbcb140 error 4 in ld-2.15.so[b7763000+20000]
[22019.086704] mkdir[14337]: segfault at 4 ip b7707481 sp bfc313a0 error 4 in ld-2.15.so[b76fc000+20000]
[25526.009125] dell_wmi: Received unknown WMI event (0x11)
[25550.643627] uname[14420]: segfault at 4 ip b772c481 sp bfef8260 error 4 in ld-2.15.so[b7721000+20000]

 By the way: one of my complaints was that there wasn't any warni...


Revision history for this message
breek (breek) wrote :

I can't even install Xubuntu 12.10 (minimal ISO + xubuntu-desktop).
Sometimes it is the mini ISO installer that stops due to this bug (when installing the basic system); other times the errors occur when installing the xubuntu-desktop packages.

(SSD: Samsung 830; motherboard: Asus P7P55D-E, set to AHCI mode)

Revision history for this message
Charles Forsyth (charles-forsyth) wrote :

Since I've now been running for days longer than previously (since the trouble started at the start of October) with no errors at all, it seems to have been a hardware problem. (One can imagine software problems that depend on an odd structure in the original file system, and I didn't copy the file system bytes: I copied its files into a new, empty file system. Even so, that possibility seems unlikely.) I don't think it was a fundamental mismatch between device and software, since I've had the SSD in the machine for about two years without fuss.

I've got two remaining questions: what was the actual hardware error, and why did it not show up as a hardware error to the software (or if the hardware did try to signal it, why didn't the software diagnose that right away)?
I've kept the old SSD, and when I've got some spare time I hope to experiment with it in another system, to see whether it misbehaves there, undetectably.

It's tricky to do experiments with a device and a system that are being used for production (ie, paying) work.

Now I can try the upgrade to 12.10 ...

Revision history for this message
André Desgualdo Pereira (desgua) wrote :

@Charles Thank you for the news.

Changed in linux:
status: Confirmed → Fix Released
Revision history for this message
André Desgualdo Pereira (desgua) wrote :

The weirdest thing: I moved the SSD to another notebook over a month ago and have seen no errors. I put an HDD in my primary notebook and no errors there either.
Next move: I will buy another SSD for this notebook and see what happens.

Revision history for this message
André Desgualdo Pereira (desgua) wrote :

The new SSD (Corsair Force GT) has been working great for one month. The old one (Intel 510) is also working great in another computer. So I am wondering if there is a compatibility problem between my Malibal Lotus P151HM1 and the Intel SSD.

Revision history for this message
99Sono (nunogdem) wrote :

Same problem here: Ubuntu 3.5.0-25-generic #39-Ubuntu SMP Mon Feb 25 18:26:58 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux,
running on top of a 120 GB SSD (OCZ Agility 3), is systematically getting / remounted as read-only.

I have already tried to minimize the I/O on the / partition by mounting some of the directories as tmpfs.

99sono@99sono-Satellite-A665:/tmp$ mount -l
/dev/sda1 on / type ext4 (rw,noatime,nodiratime,errors=remount-ro)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
none on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
udev on /dev type devtmpfs (rw,mode=0755)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
tmpfs on /tmp type tmpfs (rw,noatime,mode=1777)
tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
none on /run/lock type tmpfs (rw,noexec,nosuid,nodev,size=5242880)
none on /run/shm type tmpfs (rw,nosuid,nodev)
none on /run/user type tmpfs (rw,noexec,nosuid,nodev,size=104857600,mode=0755)
tmpfs on /var/tmp type tmpfs (rw,noatime,mode=1777)
tmpfs on /var/log type tmpfs (rw,noatime,mode=0755)
tmpfs on /var/log/apt type tmpfs (rw,noatime)
/dev/sdb9 on /home/99sono/Documents/workspaceEclipse type ext4 (rw,nosuid,nodev,errors=remount-ro) [EclipseWorkspace]
/dev/sdb5 on /home/99sono/Dropbox type ext4 (rw,nosuid,nodev,errors=remount-ro) [Dropbox]
/dev/sdb6 on /home/99sono/Downloads type ext4 (rw,nosuid,nodev,errors=remount-ro) [Download]
/dev/sdb7 on /home/99sono/SpiderOak type ext4 (rw,noexec,nosuid,nodev,errors=remount-ro) [SpiderOak]
/dev/sdb8 on /var/cache type ext4 (rw,noatime,nodiratime,errors=remount-ro) [VarCache]

(sda is the SSD; sdb is a plain SATA disk)
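
For reference, tmpfs mounts like the ones above can be set up in /etc/fstab roughly as follows. This is only a sketch: the size limits are illustrative assumptions, not values from this report, and putting /var/log on tmpfs means logs are lost on every reboot.

# /etc/fstab fragment — keep write-heavy directories off the SSD (sizes are illustrative)
tmpfs  /tmp      tmpfs  rw,noatime,mode=1777,size=512m  0  0
tmpfs  /var/tmp  tmpfs  rw,noatime,mode=1777,size=256m  0  0
tmpfs  /var/log  tmpfs  rw,noatime,mode=0755,size=128m  0  0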

All indications would suggest that I have hardware problems... but in reality I highly doubt it: the SSD has had virtually no use, and OCZ SSDs are reputed to be of top quality. So, unless OCZ has an undeserved reputation, the SSD should still be in mint condition.

dmesg lists log messages such as (sda1 is the root partition):

[ 6717.553653] Buffer I/O error on device sda1, logical block 1218984
[ 6717.553655] Buffer I/O error on device sda1, logical block 1218985
[ 6717.553657] Buffer I/O error on device sda1, logical block 1218986
[ 6717.553659] Buffer I/O error on device sda1, logical block 1218987
[ 6717.553661] Buffer I/O error on device sda1, logical block 1218988
[ 6717.553663] Buffer I/O error on device sda1, logical block 1218989
[ 6717.553665] Buffer I/O error on device sda1, logical block 1218990
[ 6717.553667] Buffer I/O error on device sda1, logical block 1218991
[ 6717.553669] Buffer I/O error on device sda1, logical block 1218992
[ 6717.553671] Buffer I/O error on device sda1, logical block 1218993
[ 6717.553673] Buffer I/O error on device sda1, logical block 1218994
[ 6717.553675] Buffer I/O error on device sda1, logical block 1218995
[ 6717.553677] Buffer I/O error on device sda1, logical block 1218996
[ 6717.553679] Buffer I/O error on devi...


Revision history for this message
Heitzso (heitzso) wrote :

My wife has a U300s running Debian Mint. It got bitten recently with a custom 3.5.2 kernel (the standard .config updated with defaults and compiled). I'm compiling 3.8.4 on it now (again the Debian Mint standard .config updated with defaults for 3.8.4) to see if that fixes it. The filesystem is ext4. I know I'm on the Ubuntu Launchpad site rather than Mint/Debian/upstream, but I have read through the problems here.

My wife says the system never ran on battery until the battery died (I asked). However, she likely suspends the system every evening. I don't know whether that triggered the filesystem corruption.

I'm frustrated (calmly so) and wondering whether JFS, a bleeding-edge kernel, or something else will fix it. This is the only SSD in my house (of 5 computers) and the only computer whose filesystem trashes itself this way. On a U300s (which I will never buy again) you cannot pop in a standard replacement drive (it uses a non-standard SSD).

Again, I hope this adds some insight. I'm not trying to flood the wrong bug-reporting system.

Revision history for this message
THCTLO (thctlo) wrote :

1. Always check that power management is off for the SATA/SSDs.
2. Always use AHCI.

I have multiple systems with different SSDs (ADATA 510, Vertex 3, Crucial V4), running Debian Wheezy (kernel 3.2) or Ubuntu kernels 3.2 and 3.8, and have found only one problem.
My OCZ Vertex 3 has been running for more than a year now: zero problems (latest firmware).
The ADATA SSD: two weeks now, zero problems (firmware 5.06).
The Crucial V4 (32 GB): two SSDs in two different machines. One disk has errors; I upgraded it to the latest firmware, it was fine for a few days, and now it has errors again. That disk will go back for repair.

Bad SSDs and faulty firmware account for most of the problems.
Always check that TRIM is working.

Simple test.

wget -O /tmp/test_trim.sh "https://sites.google.com/site/lightrush/random-1/checkiftrimonext4isenabledandworking/test_trim.sh?attredirects=0&d=1"

chmod +x /tmp/test_trim.sh
sudo /tmp/test_trim.sh tempfile 50 /dev/sdX

Revision history for this message
THCTLO (thctlo) wrote :

And for the users:

99Sono (nuno-godinhomatos) wrote on 2013-03-03:
the discard option is missing from fstab.
This is what you should have:

/dev/sda1 / ext4 noatime,discard,errors=remount-ro 0 1

And yes, you can optimize a lot more, but it's not needed; try the line above, reboot, and test TRIM.

Revision history for this message
Peter Karasev (karasevpa) wrote :

@THCTLO and all, is there a type of SSD that works better for you??

   For me, kernels 3.13, 3.19, and 4.2 (all variants of Linux Mint 17.x) create this issue with a 460 GB Intel SSD. Extremely annoying, because Windows 7 worked like a beast on this drive for over a year, and I just migrated the disk to run Linux in another PC...

   I guess I should avoid mounting / on an SSD for the foreseeable future... ??

Revision history for this message
Peter Karasev (karasevpa) wrote :

Additional info: note that the drive was beaten down absolutely mercilessly during that year: building 20 GB+ MSVC solutions in numerous directories, recursive permission changes on many files, and weekly unpacking and writing of many TBs of zip files.

      Could the I/O error symptoms be more of an issue once the drive has seen a lot of heavy use?

Revision history for this message
denace (denace03) wrote :

Good

Changed in linux (Ubuntu):
assignee: nobody → denace (denace03)