Suspend and hibernate may cause data corruption because they neither sync nor unmount external drives beforehand

Bug #198125 reported by sibidiba on 2008-03-03
This bug affects 3 people
Affects: pm-utils (Ubuntu)
Importance: Low
Assigned to: Unassigned
Nominated for Hardy by Alain Baeckeroot

Bug Description

Binary package hint: acpid

I'm running the current Hardy on a laptop, and after a data corruption incident I noticed that suspend/hibernate neither syncs nor unmounts external disks.
This may cause file-system corruption.

I attach a small patch that syncs as the last step of preparation for suspend/hibernate.

sibidiba (sibidiba) wrote :
TerryG (tgalati4) wrote :

Thanks for your bug submission. Sorry for your loss. This sounds serious. Marking as Confirmed.

What version of Hardy and what make/model of laptop?

What does the following say when this happened?

dmesg | tail -100

or perhaps

tail -100 /var/log/syslog
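
To narrow the output of either command down to the relevant entries, a small filter along these lines might help. This is only a sketch: the function name `storage_errors` is made up, and the pattern list is an assumption based on the kernel messages quoted later in this report, not an exhaustive list of error strings.

```python
import re

# Patterns taken from the kernel messages quoted in this report.
# The list is an assumption and is certainly not exhaustive.
PATTERNS = re.compile(
    r"EXT[234]-fs error|Buffer I/O error|lost page write|Aborting journal"
)


def storage_errors(lines):
    """Return the lines that look like filesystem or block I/O errors."""
    return [line.rstrip("\n") for line in lines if PATTERNS.search(line)]
```

Feeding it the lines of /var/log/syslog (for example via `open("/var/log/syslog")`) would return only the matching entries, in their original order.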

Changed in acpid:
status: New → Confirmed
sibidiba (sibidiba) wrote :

HW: ThinkPad R61i
SW: Hardy with daily updates; so far I have been running kernel 2.6.24-8-generic

I have to apologize, because further examination of the logs revealed that there were I/O errors probably before I first suspended the box:

Mar 3 07:39:58 Kamorka kernel: [38410.059414] usb 2-2: reset high speed USB device using ehci_hcd and address 3
Mar 3 07:40:08 Kamorka kernel: [38412.832572] usb 2-2: reset high speed USB device using ehci_hcd and address 3
Mar 3 07:40:14 Kamorka kernel: [38413.130585] EXT3-fs error (device sdb2): htree_dirblock_to_tree: bad entry in directory #884738: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Mar 3 07:40:14 Kamorka kernel: [38413.137189] EXT3-fs error (device sdb2): htree_dirblock_to_tree: bad entry in directory #14123009: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Mar 3 07:40:14 Kamorka kernel: [38413.149699] EXT3-fs error (device sdb2): ext3_readdir: bad entry in directory #11: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Mar 3 07:40:14 Kamorka kernel: [38413.155155] EXT3-fs error (device sdb2): htree_dirblock_to_tree: bad entry in directory #15040513: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
Mar 3 07:40:14 Kamorka kernel: [38413.163752] EXT3-fs error (device sdb2): htree_dirblock_to_tree: bad entry in directory #28327937: rec_len % 4 != 0 - offset=0, inode=1919240992, rec_len=25966, name_len=108
Mar 3 07:40:15 Kamorka kernel: [38413.174590] EXT3-fs error (device sdb2): htree_dirblock_to_tree: bad entry in directory #14483457: rec_len % 4 != 0 - offset=0, inode=1684628289, rec_len=28535, name_len=108
Mar 3 07:40:15 Kamorka kernel: [38413.183440] EXT3-fs error (device sdb2): htree_dirblock_to_tree: bad entry in directory #20447233: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
(...)
Mar 3 07:40:16 Kamorka kernel: [38413.779429] EXT3-fs error (device sdb2): ext3_readdir: bad entry in directory #11: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0

Also, my flatmate said there was a short power outage in the morning, just before I left, but I did not notice it at all.
One possible scenario is that the file-system corruption occurred when the laptop kept running on battery while the external disk shut down.

When first resumed without the disk attached:
Mar 3 09:46:43 Kamorka hald[6155]: forcibly attempting to lazy unmount /dev/sdb2 as enclosing drive was disconnected
Mar 3 09:46:43 Kamorka kernel: [40226.138935] Buffer I/O error on device sdb2, logical block 1545
Mar 3 09:46:43 Kamorka kernel: [40226.138941] lost page write due to I/O error on sdb2
Mar 3 09:46:43 Kamorka kernel: [40226.138968] WARNING: at /build/buildd/linux-2.6.24/fs/buffer.c:1169 mark_buffer_dirty()
Mar 3 09:46:43 Kamorka kernel: [40226.138972] Pid: 22078, comm: umount Not tainted 2.6.24-8-generic #1
Mar 3 09:46:43 Kamorka kernel: [40226.139006] [ext3:mark_buffer_dirty+0x7a/0x150] mark_buffer_dirty+0x7a/0x90
Mar 3 09:46:43 Kamorka kernel: [40226.139029] [<f89ab8e0>] journal_update_superblock+0x70/0xd0 [j...


description: updated
Changed in acpid:
status: Confirmed → New
TerryG (tgalati4) wrote :

I assume that a 500 GB drive has a wall-wart for power. Loss of power to the drive with the laptop still running could be problematic. I'm going to plug mine into a spare UPS that I have lying around. You can track the problems by following the timestamps in the syslog file from when you booted to when the problems occurred, and see whether that corresponds to the time of the power outage. Any VCRs or microwave clocks flashing?
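
For lining up syslog entries with a wall-clock event, something like the following rough sketch could estimate the boot time from a single kernel line. The names `parse` and `estimated_boot` are made up, the year 2008 is an assumption (syslog lines carry no year), and the result is only approximate: in the excerpts quoted in this report the bracketed kernel counter advances about 1816 s between two lines whose wall-clock stamps are 7605 s apart, so it evidently does not count time spent suspended.

```python
import re
from datetime import datetime, timedelta

# Matches lines such as:
#   Mar 3 07:39:58 Kamorka kernel: [38410.059414] usb 2-2: ...
LINE = re.compile(
    r"^(\w{3}\s+\d+ \d\d:\d\d:\d\d) \S+ kernel: \[\s*(\d+\.\d+)\]"
)


def parse(line, year=2008):
    """Return (wall_clock, kernel_seconds) or None if the line does not match."""
    m = LINE.match(line)
    if m is None:
        return None
    wall = datetime.strptime(f"{year} {m.group(1)}", "%Y %b %d %H:%M:%S")
    return wall, float(m.group(2))


def estimated_boot(line, year=2008):
    """Wall-clock time minus the kernel counter: a rough boot-time estimate."""
    parsed = parse(line, year)
    if parsed is None:
        return None
    wall, kernel_seconds = parsed
    return wall - timedelta(seconds=kernel_seconds)
```

Applied to the first "usb 2-2: reset" line quoted above, this puts the boot at roughly 21:00 on the previous evening, provided no suspend happened before that line was logged.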

Theodore Ts'o (tytso) wrote :

Given the I/O errors reported by the user, the filesystem was probably very badly damaged before the power loss event. Normally ext3 recovers from power failures without a hitch. Enabling laptop mode may increase the number of files whose data might be lost, but power failures will not result in the kind of damage reported by the user here and in bug #198131.

Jim Braux-Zin (j-brauxzin) wrote :

This bug may be related to bug #108854.

I said there :

Hardy amd64 on a Lenovo 3000 N200 laptop (Core 2 Duo)

My external hard disks aren't switched off either.

What is more problematic to me is that they do not seem to be unmounted: when I unplugged a drive before resume, its icon was still on the desktop. Worse, when I plugged it in again it was mounted at a different location ("WD Passport_" instead of "WD Passport", which the system thought was already in use), making all my bookmarks nonfunctional until reboot.

Jim Braux-Zin (j-brauxzin) wrote :

Please, it's getting worse and worse! Yesterday I noticed my external hard drive was mounted at "/media/WD Passport____" and there were three empty folders starting with "WD Passport". All my bookmarks are disabled and Rhythmbox can't find my music.

Also, I don't understand why there aren't more people complaining about this issue since it requires a manual removal of the empty folders.

DaveAbrahams (boostpro) wrote :

This is a serious problem, and it applies to the internal disks as well. I am using JFS on LVM and have been testing suspend-to-RAM lately. Every time it failed, I ended up with really bad disk corruption (often couldn't boot or couldn't "touch /forcefsck").

Puhleeeze fix it. The cost is so low and the benefits so high!

Hendy Irawan (ceefour) wrote :

Probably related to bug:
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/203537

Does this bug still exist on release Hardy?

If so this is *VERY* dangerous people!!!!!!!!!!!!

sibidiba (sibidiba) wrote :

It is still not fixed.

Removable media are mounted asynchronously, and there is no sign of any attempt to unmount/remount them upon suspend/resume.
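
One way to check this on a running system is to look at the mount options. A rough sketch, assuming removable media are mounted under /media (as in the paths quoted elsewhere in this report) and that the caller passes in the contents of /proc/mounts; the function name `async_mounts` is made up for illustration:

```python
def async_mounts(mounts_text, prefix="/media/"):
    """List (device, mount point) pairs under `prefix` mounted without 'sync'."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # not a well-formed mount entry
        device, mount_point, _fstype, options = fields[:4]
        # /proc/mounts writes spaces in paths as the octal escape \040.
        mount_point = mount_point.replace("\\040", " ")
        if mount_point.startswith(prefix) and "sync" not in options.split(","):
            found.append((device, mount_point))
    return found
```

Calling `async_mounts(open("/proc/mounts").read())` would list every file system under /media that does not carry the `sync` option, that is, every one whose writes may still be sitting in the page cache when the machine goes to sleep.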

alex941021 (alex941021) wrote :

Confirmed on an ACER Ferrari 5000 with Hardy. Suspend does suspend the machine properly; however, upon resume it fails--the screen stays blank and a hard power off is required to reboot. Upon normal reboot I receive a GRUB error 17, indicative of a file system corruption. Even after an fsck, the disk is unrecoverable.

Same problem occurs if non-proprietary video drivers are not installed.

Confirmed on an ACER Ferrari 1100 with Hardy 64. Suspend does suspend the machine properly; however, upon resume it fails--the screen stays blank.
After a hard power off the system prompts GRUB error 17, indicating that the file system is corrupt.
As my data are on a separate home partition, not all was lost.

alex941021 (alex941021) wrote :

I've found a workaround for this problem: passing "iommu=soft" to the kernel! Amazing, everything works!

It has something to do with incompatibilities between the AMD IOMMU module and the kernel.

Daniel T Chen (crimsun) on 2008-11-30
Changed in acpid:
importance: Undecided → Low
status: New → Confirmed
Loïc Minier (lool) wrote :

Most comments here allude to the FS not being "sync"-ed; however /usr/lib/pm-utils/bin/pm-action (pm-suspend) in pm-utils does sync. I'm reassigning to pm-utils for now, but I rather suspect that this is a driver/hardware issue as we got a relatively low number of such reports.

These logs also point at drivers/hardware issues rather than userspace issues:
Mar 3 09:46:43 Kamorka kernel: [40226.138935] Buffer I/O error on device sdb2, logical block 1545
Mar 3 09:46:43 Kamorka kernel: [40226.138941] lost page write due to I/O error on sdb2

Steve Lemke (steve-lemkeville) wrote :

Any chance this might be related to disk image corruption issues when running Hardy in VMware Fusion?

I have lost numerous VMware Fusion images with what seems to be a similar problem. Typically in the middle of a large project build, everything will come to a grinding halt with (something like) the following error:

[ 3380.587304] EXT3-fs error (device sda1): htree_dirblock_to_tree: bad entry in directory #1777878: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
[ 3380.587411] Aborting journal on device sda1.
[ 3380.588457] Remounting filesystem read-only
[ 3380.664339] __journal_remove_journal_head: freeing b_committed_data

After rebooting the virtual Hardy machine, the disk is generally unbootable. I use suspend/resume all the time in VMware, but thought this was some strange VMware I/O problem happening during my build. Now I'm wondering if it's a Hardy bug?

Steve (stevenm86) wrote :

Seeing this exact problem in Ocelot. The filesystem on the external storage device is extremely upset that the device was ripped out from under it, and wasn't there yet (due to delay in the drive spinning up / enumerating) when the system was resumed. The system needs to delay suspending until all external storage devices are unmounted, and needs to NOT suspend if unmounting fails (due to a program/terminal/etc still being open somewhere pointing to these devices).

The current behavior of suspending regardless of external device state is WRONG and DOES result in data corruption. I don't understand why this is marked as 'Low priority'.
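
The policy proposed in this comment (unmount every external device first, and refuse to suspend if any unmount fails) could be sketched roughly as follows. This is not how pm-utils behaves according to this thread; the names `umount` and `safe_to_suspend` are hypothetical, and calling the real `umount` command normally requires root.

```python
import subprocess


def umount(mount_point):
    """Try to unmount one file system; True on success."""
    return subprocess.run(["umount", mount_point]).returncode == 0


def safe_to_suspend(mount_points, unmount=umount):
    """Unmount each external file system and report the ones that refused.

    Returns (ok, failed): ok is True only if every unmount succeeded.
    """
    failed = [mp for mp in mount_points if not unmount(mp)]
    return (not failed, failed)
```

A caller would go ahead with the suspend only when `ok` is true and otherwise show the user the `failed` list, for example because a terminal or program still has a directory open on one of those devices.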

heckheck (jinfo) wrote :

I too have experienced disk corruption following suspend to ram, which I use on my NAS storage server in conjunction with PowerNap. I have observed this working in both Lucid and Natty over the last year. In my experience, this problem is not limited to external drives. I have observed corruption of my boot drive on three different internal SATA controller cards in a Nehalem class X86 server. These include a

* Highpoint Rocket 620 PCIe add in card running Bios Version 1.1
* LSI MPT SAS controller running in IT mode on a Supermicro S8DA3 motherboard running MPTSAS BIOS v 6.30.00.00 2009_11_12
* Old Promise SATA300 PCI add in card

Note that I am using all 6 of the ICH10 SATA ports for a RAID5 array, so I do not know if this corruption ever occurs using the standard Intel ICH10 SATA ports. Perhaps that is why it is not reported more widely.

The corruption occurred most often when using the LSI SAS controller (about 1 in 5 boots). It occurs much less frequently on the Highpoint Rocket 620 card, but it just happened for the first time yesterday after about 2 months of testing.

I'm sorry I don't have fresh logs to post, but I had to get my system back on-line ASAP. I'll add logs the next time it happens if I can scrape them out of the corrupted filesystem.

This is a very serious problem, and I am baffled that it is marked Low priority. It doesn't get much more grave than when your boot drive gets corrupted every few months due to something not being right in the syncing of disks going into and out of suspend to ram.

If Ubuntu is serious about power management in the upcoming Precise release, this MUST be addressed.

Best Regards,

-Jim Heck

Tomasz 'Zen' Napierala (tzn) wrote :

I cannot find how it might be related, but we are seeing massive filesystem corruptions in virtual guests on kvm in Lucid.
Host was running several kernels, from stock Lucid up to 3.0.0-14-server. Guests were booted with several different kernels as well. We also changed the storage backend from qcow to raw and eventually to LVM-based, to no avail.
The usual messages just after going read-only:
[2012-03-01 04:39:06] EXT4-fs (vda): error count: 10
[2012-03-01 04:39:06] EXT4-fs (vda): initial error at 1323754623: htree_dirblock_to_tree:586: inode 371080: block 1229922
[2012-03-01 04:39:06] EXT4-fs (vda): last error at 1329327878: ext4_remount:3754: inode 170313: block 543763
[2012-03-02 04:40:50] EXT4-fs (vda): error count: 10
[2012-03-02 04:40:50] EXT4-fs (vda): initial error at 1323754623: htree_dirblock_to_tree:586: inode 371080: block 1229922
[2012-03-02 04:40:50] EXT4-fs (vda): last error at 1329327878: ext4_remount:3754: inode 170313: block 543763
[2012-03-03 04:42:38] EXT4-fs (vda): error count: 10
[2012-03-03 04:42:38] EXT4-fs (vda): initial error at 1323754623: htree_dirblock_to_tree:586: inode 371080: block 1229922
[2012-03-03 04:42:38] EXT4-fs (vda): last error at 1329327878: ext4_remount:3754: inode 170313: block 543763
[2012-03-04 04:44:25] EXT4-fs (vda): error count: 10
[2012-03-04 04:44:25] EXT4-fs (vda): initial error at 1323754623: htree_dirblock_to_tree:586: inode 371080: block 1229922
[2012-03-04 04:44:25] EXT4-fs (vda): last error at 1329327878: ext4_remount:3754: inode 170313: block 543763
[2012-03-04 20:34:20] EXT4-fs error (device vda): htree_dirblock_to_tree:587: inode #171186: block 546842: comm chown: bad entry in directory: rec_len is smaller than minimal - offset=0(0), inode=4210740, rec_len=0, name_len=0
[2012-03-04 20:34:20] Aborting journal on device vda-8.
[2012-03-04 20:34:20] EXT4-fs (vda): Remounting filesystem read-only
[2012-03-05 04:46:13] EXT4-fs (vda): error count: 11
[2012-03-05 04:46:13] EXT4-fs (vda): initial error at 1323754623: htree_dirblock_to_tree:586: inode 371080: block 1229922
[2012-03-05 04:46:13] EXT4-fs (vda): last error at 1330893259: htree_dirblock_to_tree:587: inode 171186: block 546842

Or

[20768.343508] EXT3-fs error (device vda): htree_dirblock_to_tree: bad entry in directory #837494: rec_len is smaller than minimal - offset=0, inode=4210740, rec_len=0, name_len=0
[20768.348149] Aborting journal on device vda.
[20768.352064] EXT3-fs (vda): error: remounting filesystem read-only
[20768.396397] __journal_remove_journal_head: freeing b_committed_data
[20768.396405] __journal_remove_journal_head: freeing b_committed_data
[20768.396407] __journal_remove_journal_head: freeing b_committed_data
[20769.700102] ------------[ cut here ]------------
[20769.700125] WARNING: at /build/buildd/linux-lts-backport-oneiric-3.0.0/fs/ext3/inode.c:1571 ext3_ordered_writepage+0x223/0x250()
[20769.700127] Hardware name: Bochs
[20769.700128] Modules linked in: nfs lockd fscache auth_rpcgss nfs_acl sunrpc psmouse serio_raw virtio_balloon i2c_piix4 raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov floppy raid6_pq async_tx raid1 raid0 multipath linear
[20769.700146] Pid: 2496, comm: fl...
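
The "initial error at" and "last error at" values in the EXT4-fs messages quoted in this comment are seconds since the Unix epoch, so they can be turned into dates to see when the file system first recorded a problem. A small sketch (the function name `error_times` is made up for illustration):

```python
import re
from datetime import datetime, timezone

# Matches "initial error at 1323754623:" and "last error at 1329327878:".
STAMP = re.compile(r"(initial|last) error at (\d+):")


def error_times(log_line):
    """Map 'initial'/'last' to the UTC time of the epoch value in the line."""
    return {
        kind: datetime.fromtimestamp(int(secs), tz=timezone.utc)
        for kind, secs in STAMP.findall(log_line)
    }
```

For the lines above, 1323754623 works out to 2011-12-13 05:37:03 UTC, so the first recorded error predates the March 2012 log entries by almost three months, and 1330893259 works out to 2012-03-04 20:34:19 UTC, which lines up with the "bad entry in directory" message logged at 20:34:20.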

