ext4 filesystem errors on SSD disk

Bug #992424 reported by Peter Meier
This bug affects 19 people
Affects         Status        Importance  Assigned to  Milestone
Linux           Fix Released  Medium
linux (Ubuntu)  Fix Released  High        denace

Bug Description

After upgrading from 11.10 to 12.04 I quickly started getting EXT4 filesystem errors on my root fs, which result in / being remounted read-only.

How to reproduce:

1. Boot 12.04 with the latest 12.04 kernel (3.2.0.24.26)
2. Start some IO-heavy (probably write-intensive) task, like syncing your mailboxes with offlineimap
3. Boom: / is remounted read-only (there are two partitions, /boot and /)
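For step 2, a synthetic write-heavy load along these lines may stand in for offlineimap (a sketch; WORKDIR, the file count, and the file sizes are arbitrary choices, not from the original report):

```shell
# Hammer the filesystem with many small fsync'd writes, roughly what a
# mailbox sync does. WORKDIR is an assumption -- point it at a directory
# on the affected filesystem to stress that disk.
WORKDIR="${WORKDIR:-/tmp/io-stress}"
mkdir -p "$WORKDIR"
for i in $(seq 1 200); do
    # 8 x 4 KiB blocks per file, flushed to disk before dd returns
    dd if=/dev/zero of="$WORKDIR/file$i" bs=4k count=8 conv=fsync 2>/dev/null
done
sync
```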

dmesg then usually shows these four lines, but nothing more:

[11742.577091] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:739: group 908, 32254 clusters in bitmap, 32258 in gd
[11742.577100] Aborting journal on device dm-1-8.
[11742.577337] EXT4-fs (dm-1): Remounting filesystem read-only
[11742.577357] EXT4-fs (dm-1): ext4_da_writepages: jbd2_start: 9223372036854775807 pages, ino 14876673; err -30

You can then reboot your system, let fsck find a few errors it can fix, reboot again (as the root fs changed) and repeat the steps above.
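The check/repair cycle can be rehearsed safely on a scratch image file rather than the live root fs, since the e2fsprogs tools work on plain files (a sketch; IMG is an assumed path, and on the real disk fsck has to run from a rescue boot while / is unmounted):

```shell
# Build a throwaway ext4 image and run a read-only check against it.
IMG="${IMG:-/tmp/ext4-scratch.img}"
dd if=/dev/zero of="$IMG" bs=1M count=64 2>/dev/null
mkfs.ext4 -q -F "$IMG"          # -F: operate on a regular file
fsck.ext4 -f -n "$IMG"          # -n: check only; a real repair would use -y
```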

I can boot my 12.04 system with the 11.10 kernel (2.6.38-12.51) and everything works fine (except the wireless card, but that's probably an unrelated bug), so I assume it must be a regression in the EXT4 code of the latest 12.04 kernel.
Also, I had no problems with 11.10 and all its previous versions.

This happens only on my laptop, which uses an Intel SSDSA2CW160G3. On another machine, which I upgraded at the same time and use as frequently as my laptop, but which has a normal SATA disk, it hasn't happened so far. Both machines have two partitions; the LVM for the root filesystem and swap sits on the second, an encrypted partition:

/dev/sda1 /boot
/dev/sda2 -> cryptsetup
  -> lvm
    -> root
    -> swap

As both systems are set up the same way, but only the laptop (the SSD machine) misbehaves on the latest kernel, I assume it could have something to do with the SSD; hence the SSD in the title of this bug report.

My fstab looks like this:

$ cat /etc/fstab
# /etc/fstab: static file system information.
#
# <file system> <mount point> <type> <options> <dump> <pass>
proc /proc proc defaults 0 0
# /dev/mapper/foo-root
UUID=5eb462f7-485f-48f0-a50b-f07de47c8d01 / ext4 defaults,errors=remount-ro,relatime 0 1
# /dev/sda1
UUID=c69c5d4d-179f-43c7-a793-bc10254f2b1c /boot ext3 defaults,relatime 0 2
# /dev/mapper/foo-swap_1
UUID=e8c2dc03-1b7a-4bb9-a983-cdf40d77d50f none swap sw 0 0

Attached are a dmesg output from the 12.04 kernel and lspci -vnn output. After submitting this bug I will boot into the newer kernel and also attach uname and version_signature output.

If you need any additional information, please let me know.

$ lsb_release -rd
Description: Ubuntu 12.04 LTS
Release: 12.04

$ apt-cache policy linux-image
linux-image:
  Installed: 3.2.0.24.26
  Candidate: 3.2.0.24.26
  Version table:
 *** 3.2.0.24.26 0
        500 http://archive.ubuntu.com/ubuntu/ precise-updates/main amd64 Packages
        100 /var/lib/dpkg/status
     3.2.0.23.25 0
        500 http://archive.ubuntu.com/ubuntu/ precise/main amd64 Packages

Revision history for this message
Peter Meier (peter-meier) wrote :

$ cat /proc/version_signature
Ubuntu 3.2.0-24.37-generic 3.2.14
$ uname -a
Linux foo 3.2.0-24-generic #37-Ubuntu SMP Wed Apr 25 08:43:22 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

Brad Figg (brad-figg)
affects: linux-meta (Ubuntu) → linux (Ubuntu)
Brad Figg (brad-figg)
Changed in linux (Ubuntu):
status: New → Confirmed
tags: added: precise
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Would it be possible for you to test the latest upstream kernel? Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please test the latest v3.4 kernel [1] (not a kernel in the daily directory). Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag (only that one tag; please leave the other tags). This can be done by clicking on the yellow pencil icon next to the tags located at the bottom of the bug description and deleting the 'needs-upstream-testing' text.

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream'.
Once testing of the upstream kernel is complete, please mark this bug as "Confirmed".

Thanks in advance.

[1] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.4-rc4-precise/

Changed in linux (Ubuntu):
importance: Undecided → Medium
importance: Medium → High
tags: added: needs-upstream-testing
Changed in linux (Ubuntu):
status: Confirmed → Incomplete
tags: added: kernel-da-key
Revision history for this message
Peter Meier (peter-meier) wrote :

I'm now running a 3.4 kernel. Will report back if there are problems.

Revision history for this message
Peter Meier (peter-meier) wrote :

Unfortunately it also happened with the upstream kernel:

[14081.450885] EXT4-fs error (device dm-1): ext4_mb_generate_buddy:741: group 904, 32254 clusters in bitmap, 32258 in gd
[14081.450895] Aborting journal on device dm-1-8.
[14081.451151] EXT4-fs (dm-1): Remounting filesystem read-only
[14081.452467] EXT4-fs (dm-1): ext4_da_writepages: jbd2_start: 9223372036854775807 pages, ino 2408796; err -30

$ uname -a
Linux foo 3.4.0-030400rc4-generic #201204230908 SMP Mon Apr 23 13:10:03 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

tags: added: kernel-bug-exists-upstream
removed: needs-upstream-testing
Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

This issue appears to be an upstream bug, since you tested the latest upstream kernel. Would it be possible for you to open an upstream bug report at bugzilla.kernel.org [1]? That will allow the upstream Developers to examine the issue, and may provide a quicker resolution to the bug.

If you are comfortable with opening a bug upstream, it would be great if you could report the upstream bug number back in this bug report. That will allow us to link this bug to the upstream report.

[1] https://wiki.ubuntu.com/Bugs/Upstream/kernel

Changed in linux (Ubuntu):
status: Incomplete → Triaged
Revision history for this message
Peter Meier (peter-meier) wrote :

Imho this is upstream bug #42723 -> https://bugzilla.kernel.org/show_bug.cgi?id=42723

I went back to 11.04 with 3.0.0-19-generic, which has worked fine and without any problems for nearly a week now.

Revision history for this message
Steff (s-teff) wrote :

Same problem with a P-ATA hard disk (ATA WDC WD2500BEVE-00WZT0) in a Thinkpad T43.
It works with Xubuntu 12.04 and kernel 3.0.0-17-generic.

Revision history for this message
Jim Bander (jim-bander) wrote :

Same problem on SSD boot disk with a recently-updated copy of Linux Mint 10 Julia:

 # lsb_release -rd
Description: Linux Mint 10 Julia
Release: 10
# apt-cache policy linux-image
linux-image:
  Installed: (none)
  Candidate: 2.6.35.32.42
  Version table:
     2.6.35.32.42 0
        500 http://archive.ubuntu.com/ubuntu/ maverick-updates/main amd64 Packages
        500 http://security.ubuntu.com/ubuntu/ maverick-security/main amd64 Packages
     2.6.35.22.23 0
        500 http://archive.ubuntu.com/ubuntu/ maverick/main amd64 Packages

Changed in linux:
importance: Unknown → Medium
status: Unknown → Confirmed
Revision history for this message
Jeremy Sanders (jeremysanders) wrote :

Same problem for me with a Seagate Barracuda 7200.9 ST3500641AS using ata_piix, connected to SATA (IDE interface: Intel Corporation N10/ICH7 Family SATA Controller [IDE mode] (rev 01)).

Revision history for this message
Heiko Sieger (h-sieger) wrote :

I'm experiencing similar file system corruption issues with my Sandisk Extreme 120GB SSD. All Ubuntu- or Debian-based Linux OSes I have tried so far, including Ubuntu 12.04, as well as Fedora 16 and 17, intermittently report file system errors on bootup. After fsck fixes the errors (or in case it boots normally), I get various segmentation faults and/or CRC errors for files. Having tried 5 or 6 different distributions with the same results, I must conclude this may be a kernel issue.

I will update my SSD firmware in the hope that this solves the issue, but I doubt it will, since the updated firmware supposedly only addresses TRIM issues (the discard option in fstab).

All distributions worked perfectly fine when booting them from a live USB stick. I checked my RAM using memtest86+ and it reports no errors. Also smartctl does not reveal any problems with the SSD.

I managed to install and boot several 3.2 and later kernels from SSD, but when using synaptic for updates and installs I eventually get segmentation faults for synaptic, or the system doesn't boot anymore. It looks like something is corrupting the file system.

Again, this happened with all new Linux kernels 3.2 and above (see the list of distributions I tried above). I do not have any issues with the SSD while running from a live USB stick. I can format the disk, install on it, chroot into it and install or modify things, but I can't get a stable system booting from the SSD. One of the common errors when the SSD doesn't boot is an efidisk read error. But it also doesn't work with an MBR-formatted SSD.

I wonder whether SSDs can be used with kernel 3.2 and above at all. I didn't try older kernel versions, though. Sorry I can't post more specific details, but if anybody is interested in output of debugging commands from Ubuntu 12.04, I will try.

Revision history for this message
Thomas Hood (jdthood) wrote :

With ThinkPad X220 and 240 GB Intel 520 SSD I also get serious filesystem corruption errors running Ubuntu 12.04 desktop. "Serious" means: just now I couldn't boot from the Ubuntu partition on the SSD; I had to boot from another disk and fsck the Ubuntu partition on the SSD, and even then the booted Ubuntu 12.04 system behaves erratically, indicating file corruption has occurred.

Until I figure out what is going on I have gone back to running Ubuntu 11.10 from my trusty old Hitachi hard disk.

Revision history for this message
Noisome (jeffry-r-walsh) wrote :

I am having problems as well. I thought my SSD was dying. During updates it would hit an input/output error and halt the updates, but an older version of Ubuntu functioned without errors. I switched to ReiserFS and have not had any errors so far.

Revision history for this message
André Desgualdo Pereira (desgua) wrote :

I tested ReiserFS as suggested, but in my case, despite fsck not showing disk corruption, I got some other problems: after a few minutes read/write goes super slow (from ~100 MB/s to ~1 MB/s as shown by hdparm -t /dev/sda); then I got "Bus error" when trying to launch some apps (e.g. gnome-terminal, gedit, rhythmbox, eog), which I could not fix (I even compiled gnome-terminal but got the same error); and finally the system would not boot anymore, with DRDY ERR. Not to mention that I couldn't find a way to TRIM (neither fstrim nor hdparm worked for this).
So I am wondering whether the problem involves other components besides the filesystem.

Revision history for this message
Phattanon Duangdara (sfalpha) wrote :

It seems to be related to the chipset and IDE mode in the BIOS.

I found this problem only on an old server (Dell 860) using IDE mode on ICH7/ICH7R. And it affects not only SSDs, but also HDDs.

Another server with ICH10 in AHCI mode works flawlessly.

Revision history for this message
André Desgualdo Pereira (desgua) wrote : Re: [Bug 992424] Re: ext4 filesystem errors on SSD disk

I am using AHCI mode. I have now been testing Btrfs for two days. Although it is much slower to boot (11.6 seconds vs 4.8 with ext4) and to install, so far I have not experienced any file corruption. I will report again in a couple of weeks.
Regards.

Revision history for this message
André Desgualdo Pereira (desgua) wrote :

More info, trying to elucidate this issue.
Using btrfs: I unintentionally pressed the evil suspend button and the computer went into a loop of sleeping and waking until it rebooted with a corrupted filesystem. Btrfs seems to somehow recover itself, and after ~300 seconds it managed to boot after a few DRDY errors and other errors. I tried to boot from a live CD and repair it, without success.

Revision history for this message
NIkolaos Papadakis (nkpapas) wrote :

The same applies to me. It is a nightmare! I have a 90 GB SSD and Kubuntu 12.04 64-bit.
Since I installed 12.04, the SSD is remounted read-only at random times.
When I reboot I have to disconnect the SSD from power, and then the PC boots indicating no errors.
I will try to format the disk and install the OS on a clean ext3 file system on the same SSD, and see if the problem persists.
I will keep you updated.

N.-

Revision history for this message
Joseph Salisbury (jsalisbury) wrote :

Can folks affected by this bug test the latest mainline kernel, which is v3.6-rc7 [0]?

Thanks in advance.

[0] http://kernel.ubuntu.com/~kernel-ppa/mainline/v3.6-rc7-quantal/

Changed in linux (Ubuntu):
status: Triaged → Fix Released
Revision history for this message
Thomas Hood (jdthood) wrote :

@Charles: Can you please explain why this report has been reclassified as "Fix Released"? Which kernel version contains the fix? How has this fix been tested?

Revision history for this message
Charles Forsyth (charles-forsyth) wrote :

Sorry. I accidentally changed the status from Triaged, thinking I was clicking through to more details, and it won't let me change it back. Perhaps someone can revert that change.

I was intending just to add a note that this bug has been affecting several different Ubuntu systems I maintain for myself, friends and relatives, at least one of which does not have an SSD. On my own laptop the journal ends up corrupted about once every two days, sometimes once a day. I switched off the "discard" mount option last week.

The following is a typical diagnostic, but the details vary.

[250162.298456] EXT4-fs error (device sda1): ext4_ext_remove_space:2574: inode #1077692: comm Chrome_CacheThr: bad header/extent: invalid magic - magic 87cf, entries 10138, max 3135(0), depth 51679(0)
[250162.298471] Aborting journal on device sda1-8.
[250162.298587] EXT4-fs (sda1): Remounting filesystem read-only
[250162.298594] EXT4-fs error (device sda1) in ext4_ext_remove_space:2637: IO failure
[250162.298706] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4494: Journal has aborted

In my case, the system has been reliable for months before. This trouble seemed to start on 4 October, when an Ubuntu update-manager invocation remade initrd (but the kernel didn't change from /boot/vmlinuz-3.2.0-32-generic).

I'm trying Linux 3.6.2-030602-generic #201210121823 SMP Fri Oct 12 22:31:22 UTC 2012 i686 i686 i386 GNU/Linux

Revision history for this message
Charles Forsyth (charles-forsyth) wrote :

@Thomas See my last message. I used the web page interface incorrectly and then could not undo my mistake.

Revision history for this message
Thomas Hood (jdthood) wrote :

@Charles: I've had the same problem in the past. :)

Revision history for this message
Charles Forsyth (charles-forsyth) wrote :

It might be worth noting that the bug is fairly nasty. I had to recover my system yesterday afternoon. Today, after the overnight locatedb build (which might be disk intensive) but relatively little use of the disk during a few hours of normal work, the bug was triggered again by my downloading and unpacking (dpkg -i) the 3.6.2-030602 kernel components. It failed during the unpack, and I had to boot a rescue CD and run fsck, which reset the file system state back many hours (by discarding part of the journal); then I downloaded and unpacked the kernel packages again, since all of that had been lost.
That worked, so I'm running with that now.

Revision history for this message
Thomas Hood (jdthood) wrote :

Yep, nasty enough for me to stop using my new 240 GB Intel "520 series" SSD (model number SSDSC2CW240A3) and return to my old Hitachi 320GB HD which works perfectly.

Revision history for this message
Charles Forsyth (charles-forsyth) wrote :

And this just in:
Linux ... 3.6.2-030602-generic #201210121823 SMP Fri Oct 12 22:31:22 UTC 2012 i686 i686 i386 GNU/Linux

[88861.206938] EXT4-fs error (device sda1): __ext4_ext_check_block:472: inode #1080426: comm Chrome_CacheThr: bad header/extent: invalid magic - magic 81a4, entries 1000, max 30863(0), depth 0(0)
[88861.206944] Aborting journal on device sda1-8.
[88861.207068] EXT4-fs (sda1): Remounting filesystem read-only
[88861.207076] EXT4-fs error (device sda1) in ext4_ext_remove_space:2790: IO failure
[88861.207152] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4550: Journal has aborted
[88861.207246] EXT4-fs error (device sda1) in ext4_ext_truncate:4308: Journal has aborted
[88861.207352] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4550: Journal has aborted
[88861.207467] EXT4-fs error (device sda1) in ext4_orphan_del:2491: Journal has aborted
[88861.207572] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4550: Journal has aborted

Zut alors!

Revision history for this message
Charles Forsyth (charles-forsyth) wrote :

I turned off the "discard" option last week after the first bout of trouble, based on a suggestion in a googled bug tracker that there was an off-by-one in scsi_debug in the part that implemented TRIM. (Not that I thought I was running scsi_debug, but since I was having similar problems I suspected there might be similar trouble elsewhere.)

What would there be that's SSD-related apart from trim?
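For anyone else who wants to try the same experiment, disabling discard boils down to removing the option from the fstab entry and remounting. A sketch against a sample line (the line below is hypothetical, not taken from any system in this report; on a real machine edit /etc/fstab itself and then remount):

```shell
# Strip the "discard" mount option from an fstab-style line.
FSTAB=/tmp/fstab.sample
printf 'UUID=5eb462f7-485f-48f0-a50b-f07de47c8d01 / ext4 defaults,errors=remount-ro,discard,relatime 0 1\n' > "$FSTAB"
sed -i -e 's/,discard//g' -e 's/discard,//g' "$FSTAB"
cat "$FSTAB"
# On the real system, after editing /etc/fstab:
#   sudo mount -o remount /
```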

Revision history for this message
Charles Forsyth (charles-forsyth) wrote :

Since I get problems anyway, I've enabled "discard" again.

I might be willing to believe my SSD has hardware problems or is wearing out, but I've seen trouble on two different systems so far.
Also, in February, when I had corruption with ext2/3, I thought it might be the memory or the SSD, but memory tests came back fine, and then it happened on a conventional SATA drive on a different machine as well.
I changed those machines to ext4 (since that seemed more actively maintained, and the diagnostic I was getting related to a known race in ext2/3), and... the problem went away until last week.

There have never been relevant errors in the drive's error log (SMART).

I could change over to a replacement SSD I bought in February, but it's a bit of a slog that won't work if there really is a driver problem somewhere.

Revision history for this message
Charles Forsyth (charles-forsyth) wrote :

Since I seem to be able to reproduce it, not at will, but at least within a day or two, is there something constructive I could do to help track this down? It looks useful to run with a USB stick attached and mounted, so I can copy logs etc. onto it when things go wrong.

Revision history for this message
André Desgualdo Pereira (desgua) wrote :

@Charles
Can you explain (step by step if possible) how to reproduce the error?
(I am willing to do some tests, but the corruption seems random to me)

Revision history for this message
Charles Forsyth (charles-forsyth) wrote :

@André
By being able to reproduce it, I meant only that it happens so regularly from day to day that I can try to capture more information: it isn't "once in a blue moon". It does seem to be a function of the amount of file IO (which makes sense), so I thought I might risk generating a load synthetically.

Revision history for this message
André Desgualdo Pereira (desgua) wrote :

@Charles
Ok, I thought you could trigger it whenever you wanted. I agree with you that it seems to be a function of the amount of file IO. Unfortunately I could not reproduce it when I tried to generate a lot of reads and writes, but my files get corrupted after 15-20 days.
I am now testing the same disk (Intel SSD) in another notebook (also with up-to-date Ubuntu 12.04), just to make sure it is not related to a hardware problem with another component, like the motherboard for example.

Revision history for this message
Charles Forsyth (charles-forsyth) wrote :

Today's contribution, after hardly any work (allowing for overnight locatedb updates and anything else cron might do):

[70391.556798] EXT4-fs error (device sda1): __ext4_ext_check_block:472: inode #560844: comm Chrome_CacheThr: bad header/extent: invalid magic - magic 3262, entries 13113, max 14435(0), depth 12385(0)
[70391.556806] Aborting journal on device sda1-8.
[70391.556876] EXT4-fs (sda1): Remounting filesystem read-only
[70391.556881] EXT4-fs error (device sda1) in ext4_ext_remove_space:2790: IO failure
[70391.556956] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4550: Journal has aborted
[70391.557068] EXT4-fs error (device sda1) in ext4_ext_truncate:4308: Journal has aborted
[70391.557159] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4550: Journal has aborted
[70391.557260] EXT4-fs error (device sda1) in ext4_orphan_del:2491: Journal has aborted
[70391.557360] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4550: Journal has aborted

Similar traceback.

I find it interesting that the locatedb is often supposedly corrupt at the same time:

% locate cron
locate: `/var/lib/mlocate/mlocate.db' does not seem to be a mlocate database

After a reboot and a file system check, however, it's fine.

Some output from dumpe2fs:
dumpe2fs 1.42 (29-Nov-2011)
Filesystem volume name: <none>
Last mounted on: /
Filesystem UUID: 4bfde6cd-c859-40f6-8848-9ecaa5d93265
Filesystem magic number: 0xEF53
Filesystem revision #: 1 (dynamic)
Filesystem features: has_journal ext_attr dir_index filetype needs_recovery extent sparse_super large_file uninit_bg
Filesystem flags: signed_directory_hash
Default mount options: discard
Filesystem state: clean with errors
Errors behavior: Continue
Filesystem OS type: Linux
Inode count: 2215168
Block count: 4442364
Reserved block count: 222117
Free blocks: 381539
Free inodes: 1611147
First block: 0
Block size: 4096
Fragment size: 4096
Blocks per group: 32768
Fragments per group: 32768
Inodes per group: 16288
Inode blocks per group: 509
Last mount time: Wed Oct 17 14:44:21 2012
Last write time: Thu Oct 18 10:20:17 2012
Mount count: 1
Maximum mount count: 30
Last checked: Wed Oct 17 15:40:55 2012
Check interval: 0 (<none>)
Lifetime writes: 437 GB
Reserved blocks uid: 0 (user root)
Reserved blocks gid: 0 (group root)
First inode: 11
Inode size: 128
Journal inode: 8
First orphan inode: 560844
Default directory hash: half_md4
Directory Hash Seed: 51028ff9-6909-473c-bbb9-01690bdd9a66
Journal backup: inode blocks
FS Error count: 7
First error time: Thu Oct 18 10:20:17 2012
First error function: __ext4_ext_check_block
First error line #: 472
First error inode #: 560844
First error block #: 0
Last error time: Thu Oct 18 10:20:17 2012
Last error function: ext4_reserve_inode_write
Last error line #: 4550
Last error inode #: 560844...


Revision history for this message
Charles Forsyth (charles-forsyth) wrote :

I think I know what it is, in my case. If my revised system gets through the next few days, let alone a week, without trouble, I'll consider it reasonably certain.

Revision history for this message
Charles Forsyth (charles-forsyth) wrote :

I was wrong about that. fsck -c -c -k ... had found 3 bad blocks, so I thought "ah! it was a device error after all". Having moved the bad blocks out of the way, I expected all to return to normal, and would have moved on to complaining about the complete lack of visible diagnostics (including in dmesg) of any IO error when writing to the bad blocks. (It's possible that the IO ends up in the device cache, and it's not until that is flushed to the drive that any error is detected; that isn't communicated back to the host, so there's little the software can do.)

In fact, the system has continued on in the same old way. Just now:
[73851.280405] EXT4-fs error (device sda1): ext4_mb_generate_buddy:741: group 67, 6802 clusters in bitmap, 6777 in gd
[73851.280416] Aborting journal on device sda1-8.
[73851.280527] EXT4-fs (sda1): Remounting filesystem read-only
[73851.280541] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4550: Journal has aborted
[73851.280639] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4550: Journal has aborted
[73851.280836] EXT4-fs error (device sda1) in ext4_ext_remove_space:2790: Journal has aborted
[73851.280922] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4550: Journal has aborted
[73851.281005] EXT4-fs error (device sda1) in ext4_ext_truncate:4308: Journal has aborted
[73851.281093] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4550: Journal has aborted
[73851.281165] EXT4-fs error (device sda1) in ext4_orphan_del:2491: Journal has aborted
[73851.281313] EXT4-fs error (device sda1) in ext4_reserve_inode_write:4550: Journal has aborted
[73851.331505] EXT4-fs error (device sda1): ext4_mb_generate_buddy:741: group 128, 8453 clusters in bitmap, 8437 in gd
[73851.331513] EXT4-fs (sda1): pa f61a05e8: logic 2637, phys. 4209875, len 3
[73851.331516] EXT4-fs error (device sda1): ext4_mb_release_inode_pa:3607: group 128, free 3, pa_free 2

and after an fsck, it has reverted chunks of the file system because (presumably, not that it tells you anywhere) it has discarded the tail of the journal.

This has become unusable.

Revision history for this message
Charles Forsyth (charles-forsyth) wrote :

I wondered whether ext4_mb_generate_buddy might be related to http://help.lockergnome.com/linux/ext4fs-error-ext4_mb_generate_buddy-741-group-16-8160-cluste--ftopict559576.html
so, as a final experiment, I've switched off "discard" once more. That bug seemed confined to scsi_debug, but who knows?

On the other hand, another round of fsck -c -c -k found 3 more bad blocks, once again in the journal. Note that fsck -c -c -k has a problem: it adds the blocks to the bad-block list, and promptly complains that there are duplicate blocks in the bad-block inode and in the journal. It then asks whether to clone the multiply-allocated blocks. Unfortunately, it gives no sign of how it will clone them: will the bad blocks remain in the bad-block list, with the copies going into the journal, or will the blocks remain in the journal, with copies uselessly placed in the bad-block list? I decided it was safer to delete the journal, re-run the check (leaving the blocks only in the bad-block list), then recreate the journal, then switch discard off.
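The delete-journal / check / recreate-journal sequence might look like the following, rehearsed here on a scratch image (a sketch; IMG is an assumed path, and on the real disk these commands must run from a rescue boot with the filesystem unmounted):

```shell
# Build a throwaway ext4 image, then rebuild its journal the same way.
IMG="${IMG:-/tmp/journal-scratch.img}"
dd if=/dev/zero of="$IMG" bs=1M count=64 2>/dev/null
mkfs.ext4 -q -F "$IMG"
tune2fs -O ^has_journal "$IMG"   # delete the journal
e2fsck -f -y "$IMG" >/dev/null   # full check with no journal in the way
tune2fs -j "$IMG"                # recreate the journal
```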

The bad blocks were as follows:

(0-2):2277409-2277411
(3-5):2277665-2277667

The pattern is easier to see in hex, I think:
22C021
22C022
22C023
22C121
22C122
22C123

Hmm. 3 in a row each time. 256 might be important in the internal geometry. *Might* be a failing drive.
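The decimal-to-hex conversion above can be redone with printf, just to double-check the arithmetic (the block numbers are the six from the fsck output quoted above):

```shell
# Print the six bad-block numbers in hex; note the 0x100 (256) stride
# between the two groups of three.
for blk in 2277409 2277410 2277411 2277665 2277666 2277667; do
    printf '%X\n' "$blk"
done
```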

Revision history for this message
André Desgualdo Pereira (desgua) wrote :

It would be very nice if Linus gave his opinion on this. He has said that he doesn't like spinning disks, so besides coordinating the kernel he also uses solid-state disks. Is this a hardware failure? Or a bug?

Revision history for this message
Charles Forsyth (charles-forsyth) wrote :

If I get through the next day or two without the problems that plagued me last week, that might suggest trying again with "discard" to see if problems reappear.

On the other hand, if I have further trouble, I've got a new replacement SSD to try.

Since my usage doesn't seem to be that unusual, and is if anything fairly light, and many people these days use SSDs with Ubuntu, even Ubuntu 12.04, I'd assumed that if there were a bug causing frequent corruption of [file systems on] SSDs at the rate I've seen, there would be widespread reports of dismay; but that doesn't seem to be so.

Revision history for this message
André Desgualdo Pereira (desgua) wrote :

@Charles
I think you are right. Keep us updated with your tests. :-)

Revision history for this message
Charles Forsyth (charles-forsyth) wrote :

At 2am, after working away happily for 1.5 days (I copied the work out regularly), some time after a 45 MB Software Update, it began to go wrong. A reboot prompted the following repair, and I left it:

[ 6.314129] EXT4-fs (sda1): orphan cleanup on readonly fs
[ 6.314137] EXT4-fs (sda1): ext4_orphan_cleanup: deleting unreferenced inode 309788
[ 6.315820] EXT4-fs (sda1): ext4_orphan_cleanup: deleting unreferenced inode 766985
[ 6.317334] EXT4-fs (sda1): ext4_orphan_cleanup: deleting unreferenced inode 769166
[ 6.317369] EXT4-fs (sda1): ext4_orphan_cleanup: deleting unreferenced inode 769170
[ 6.317391] EXT4-fs (sda1): ext4_orphan_cleanup: deleting unreferenced inode 768775
[ 6.320236] EXT4-fs (sda1): ext4_orphan_cleanup: deleting unreferenced inode 814440
[ 6.320273] EXT4-fs (sda1): ext4_orphan_cleanup: deleting unreferenced inode 768541
[ 6.322899] EXT4-fs (sda1): ext4_orphan_cleanup: deleting unreferenced inode 311047
[ 6.322945] EXT4-fs (sda1): ext4_orphan_cleanup: deleting unreferenced inode 977877
[ 6.322973] EXT4-fs (sda1): ext4_orphan_cleanup: deleting unreferenced inode 310939

I left it until this morning, and at 9am, after restoring my work by copying it back in and a little use of Chrome:

[ 665.062016] rm[14080]: segfault at 4 ip b77ab60d sp bfbd20a0 error 4 in ld-2.15.so[b77a9000+20000]
[ 665.160702] mv[14084]: segfault at 4 ip b7790481 sp bfa935b0 error 4 in ld-2.15.so[b7785000+20000]
[ 665.264448] touch[14088]: segfault at 4 ip b77c9481 sp bf82cac0 error 4 in ld-2.15.so[b77be000+20000]
[ 1348.241462] dell_wmi: Received unknown WMI event (0x11)
[ 7654.917714] readlink[14150]: segfault at 4 ip b776a481 sp bfeedef0 error 4 in ld-2.15.so[b775f000+20000]
[ 7655.051813] dirname[14156]: segfault at 4 ip b76fc481 sp bff23930 error 4 in ld-2.15.so[b76f1000+20000]
[ 7655.182041] mkdir[14163]: segfault at 4 ip b77b2481 sp bfe16680 error 4 in ld-2.15.so[b77a7000+20000]
[14836.922957] readlink[14226]: segfault at 4 ip b7784481 sp bfd0e0c0 error 4 in ld-2.15.so[b7779000+20000]
[14837.011050] dirname[14230]: segfault at 4 ip b77c5481 sp bfc2a6a0 error 4 in ld-2.15.so[b77ba000+20000]
[14837.107422] mkdir[14235]: segfault at 4 ip b775e481 sp bf91e510 error 4 in ld-2.15.so[b7753000+20000]
[15019.548968] EXT4-fs error (device sda1): __ext4_ext_check_block:472: inode #310825: comm rs:main Q:Reg: bad header/extent: invalid magic - magic 8b1f, entries 8, max 20596(0), depth 20597(0)
[15019.548974] Aborting journal on device sda1-8.
[15019.549058] EXT4-fs (sda1): Remounting filesystem read-only
[15019.549069] EXT4-fs error (device sda1) in ext4_da_write_begin:2533: IO failure
[22018.927076] readlink[14332]: segfault at 4 ip b77c2481 sp bfbe0f00 error 4 in ld-2.15.so[b77b7000+20000]
[22019.009226] dirname[14334]: segfault at 4 ip b776e481 sp bfbcb140 error 4 in ld-2.15.so[b7763000+20000]
[22019.086704] mkdir[14337]: segfault at 4 ip b7707481 sp bfc313a0 error 4 in ld-2.15.so[b76fc000+20000]
[25526.009125] dell_wmi: Received unknown WMI event (0x11)
[25550.643627] uname[14420]: segfault at 4 ip b772c481 sp bfef8260 error 4 in ld-2.15.so[b7721000+20000]

 By the way: one of my complaints was that there wasn't any warni...


Revision history for this message
breek (breek) wrote :

I can't even install Xubuntu 12.10 (minimal ISO + xubuntu-desktop).
Sometimes it is the mini ISO installer that stops due to this bug (when installing the basic system); other times the errors occur when installing the xubuntu-desktop packages.

(SSD: Samsung 830; motherboard: Asus P7P55D-E, set to AHCI mode)

Revision history for this message
Charles Forsyth (charles-forsyth) wrote :

Since I've now been running for days longer than previously (since the trouble started at the start of October) with no errors at all, it seems to have been a hardware problem. (One can imagine software problems that depend on an odd structure in the original file system, and I didn't copy the file system bytes: I copied its files into a new, empty file system. Even so, that possibility seems unlikely.) I don't think it was a fundamental mismatch between device and software, since I've had the SSD in the machine for about two years without fuss.

I've got two remaining questions: what was the actual hardware error, and why did it not show up as a hardware error to the software (or if the hardware did try to signal it, why didn't the software diagnose that right away)?
I've kept the old SSD, and when I've got some spare time I hope to experiment with it in another system, to see whether it misbehaves there, undetectably.

It's tricky to do experiments with a device and a system that are being used for production (ie, paying) work.

Now I can try the upgrade to 12.10 ...

Revision history for this message
André Desgualdo Pereira (desgua) wrote :

@Charles Thank you for the news.

Changed in linux:
status: Confirmed → Fix Released
Revision history for this message
André Desgualdo Pereira (desgua) wrote :

The weirdest thing: I moved the SSD to another notebook over a month ago and have seen no errors. I put an HDD in my primary notebook and no errors there either.
Next move: I will buy another SSD for this notebook and see what happens.

Revision history for this message
André Desgualdo Pereira (desgua) wrote :

The new SSD (Corsair Force GT) has been working great for one month. The old one (Intel 510) is also working great in another computer. So I am wondering if there is a compatibility problem between my Malibal Lotus P151HM1 and the Intel SSD.

Revision history for this message
99Sono (nunogdem) wrote :

Same problem here: Ubuntu 3.5.0-25-generic #39-Ubuntu SMP Mon Feb 25 18:26:58 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux,
running on top of a 120 GB SSD (OCZ Agility 3), is systematically getting / remounted as read-only.

I have already tried to minimize the I/O on the / partition by mounting some of the directories as tmpfs.

99sono@99sono-Satellite-A665:/tmp$ mount -l
/dev/sda1 on / type ext4 (rw,noatime,nodiratime,errors=remount-ro)
proc on /proc type proc (rw,noexec,nosuid,nodev)
sysfs on /sys type sysfs (rw,noexec,nosuid,nodev)
none on /sys/fs/fuse/connections type fusectl (rw)
none on /sys/kernel/debug type debugfs (rw)
none on /sys/kernel/security type securityfs (rw)
udev on /dev type devtmpfs (rw,mode=0755)
devpts on /dev/pts type devpts (rw,noexec,nosuid,gid=5,mode=0620)
tmpfs on /tmp type tmpfs (rw,noatime,mode=1777)
tmpfs on /run type tmpfs (rw,noexec,nosuid,size=10%,mode=0755)
none on /run/lock type tmpfs (rw,noexec,nosuid,nodev,size=5242880)
none on /run/shm type tmpfs (rw,nosuid,nodev)
none on /run/user type tmpfs (rw,noexec,nosuid,nodev,size=104857600,mode=0755)
tmpfs on /var/tmp type tmpfs (rw,noatime,mode=1777)
tmpfs on /var/log type tmpfs (rw,noatime,mode=0755)
tmpfs on /var/log/apt type tmpfs (rw,noatime)
/dev/sdb9 on /home/99sono/Documents/workspaceEclipse type ext4 (rw,nosuid,nodev,errors=remount-ro) [EclipseWorkspace]
/dev/sdb5 on /home/99sono/Dropbox type ext4 (rw,nosuid,nodev,errors=remount-ro) [Dropbox]
/dev/sdb6 on /home/99sono/Downloads type ext4 (rw,nosuid,nodev,errors=remount-ro) [Download]
/dev/sdb7 on /home/99sono/SpiderOak type ext4 (rw,noexec,nosuid,nodev,errors=remount-ro) [SpiderOak]
/dev/sdb8 on /var/cache type ext4 (rw,noatime,nodiratime,errors=remount-ro) [VarCache]

(sda is the SSD; sdb is a plain SATA disk)
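
For reference, tmpfs mounts like the ones above can be set up in /etc/fstab roughly as follows. This is only a sketch: the size limits are illustrative assumptions, not values from this report, and putting /var/log on tmpfs means logs are lost on every reboot.

# /etc/fstab fragment — keep write-heavy directories off the SSD (sizes are illustrative)
tmpfs  /tmp      tmpfs  rw,noatime,mode=1777,size=512m  0  0
tmpfs  /var/tmp  tmpfs  rw,noatime,mode=1777,size=256m  0  0
tmpfs  /var/log  tmpfs  rw,noatime,mode=0755,size=128m  0  0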

All indications would suggest that I have hardware problems... but in reality I highly doubt it: the SSD has had virtually no use, and OCZ SSDs are reputed to be of top quality. So, unless OCZ has an undeserved reputation, the SSD should still be in mint condition.

dmesg lists log messages such as (sda1 is the root partition):

[ 6717.553653] Buffer I/O error on device sda1, logical block 1218984
[ 6717.553655] Buffer I/O error on device sda1, logical block 1218985
[ 6717.553657] Buffer I/O error on device sda1, logical block 1218986
[ 6717.553659] Buffer I/O error on device sda1, logical block 1218987
[ 6717.553661] Buffer I/O error on device sda1, logical block 1218988
[ 6717.553663] Buffer I/O error on device sda1, logical block 1218989
[ 6717.553665] Buffer I/O error on device sda1, logical block 1218990
[ 6717.553667] Buffer I/O error on device sda1, logical block 1218991
[ 6717.553669] Buffer I/O error on device sda1, logical block 1218992
[ 6717.553671] Buffer I/O error on device sda1, logical block 1218993
[ 6717.553673] Buffer I/O error on device sda1, logical block 1218994
[ 6717.553675] Buffer I/O error on device sda1, logical block 1218995
[ 6717.553677] Buffer I/O error on device sda1, logical block 1218996
[ 6717.553679] Buffer I/O error on devi...


Revision history for this message
Heitzso (heitzso) wrote :

My wife has a U300s running Debian Mint. It got bitten recently with a custom 3.5.2 kernel (the standard .config updated with defaults and compiled). I'm compiling 3.8.4 on it now (again the Debian Mint standard .config updated with defaults for 3.8.4) to see if that fixes it. The filesystem is ext4. I know I'm on the Ubuntu Launchpad site rather than Mint/Debian/upstream, but I have read through the problems here.

My wife says the system never ran on battery until the battery died (I asked). However, she likely suspends the system every evening. I don't know whether that triggered the filesystem corruption.

I'm frustrated (calmly so) and wondering whether JFS, a bleeding-edge kernel, or something else will fix it. This is the only SSD in my house (of 5 computers) and the only computer whose filesystem trashes itself this way. On a U300s (which I will never buy again) you cannot pop in a standard replacement drive (it uses a non-standard SSD).

Again, I hope this adds some insight. I'm not trying to flood the wrong bug-reporting system.

Revision history for this message
THCTLO (thctlo) wrote :

1. Always check that power management is off for the SATA/SSDs.
2. Always use AHCI.

I have multiple systems with different SSDs (ADATA 510, Vertex 3, Crucial V4), running Debian Wheezy (kernel 3.2) or Ubuntu kernels 3.2 and 3.8, and have found only one problem.
My OCZ Vertex 3 has been running for more than a year now: zero problems (latest firmware).
The ADATA SSD: two weeks now, zero problems (firmware 5.06).
The Crucial V4 (32 GB): two SSDs in two different machines. One disk has errors; I upgraded it to the latest firmware, it was fine for a few days, and now it has errors again. That disk will go back for repair.

Bad SSDs and faulty firmware account for most of the problems.
Always check that TRIM is working.

Simple test.

wget -O /tmp/test_trim.sh "https://sites.google.com/site/lightrush/random-1/checkiftrimonext4isenabledandworking/test_trim.sh?attredirects=0&d=1"

chmod +x /tmp/test_trim.sh
sudo /tmp/test_trim.sh tempfile 50 /dev/sdX

Revision history for this message
THCTLO (thctlo) wrote :

And for the users:

99Sono (nuno-godinhomatos) wrote on 2013-03-03:
the discard option is missing from fstab.
This is what you should have:

/dev/sda1 / ext4 noatime,discard,errors=remount-ro 0 1

And yes, you can optimize a lot more, but it's not needed; try the line above, reboot, and test TRIM.

Revision history for this message
Peter Karasev (karasevpa) wrote :

@THCTLO and all, is there a type of SSD that works better for you??

   For me, kernels 3.13, 3.19, and 4.2 (all variants of Linux Mint 17.x) create this issue with a 460 GB Intel SSD. Extremely annoying, because Windows 7 worked like a beast on this drive for over a year, and I just migrated the disk to run Linux in another PC...

   I guess I should avoid mounting / on an SSD for the foreseeable future... ??

Revision history for this message
Peter Karasev (karasevpa) wrote :

Additional info: note that the drive was beaten down absolutely mercilessly during that year: building 20 GB+ MSVC solutions in numerous directories, recursive permission changes on many files, and weekly unpacking and writing of many TBs of zip files.

      Could the I/O error symptoms be more of an issue once the drive has seen a lot of heavy use?

Revision history for this message
denace (denace03) wrote :

Good

Changed in linux (Ubuntu):
assignee: nobody → denace (denace03)