[All releases] Suspend/Resume with rootfs on USB, causes filesystem corruptions and kernel panic on mount attempt, leaving system unbootable with data lost.

Bug #706795 reported by Eugene San on 2011-01-24
This bug affects 5 people
Affects          Status      Importance  Assigned to  Milestone
linux (Fedora)   New         Undecided   Unassigned
linux (Ubuntu)   Incomplete  Medium      Unassigned

Bug Description

Suspending a system whose root filesystem is on a USB drive causes filesystem
corruption, which crashes the kernel on the next mount attempt and leaves the
system unbootable.

Affects (at least) Red Hat 6.0+, Fedora 13+ and Ubuntu 10+.

Reconstruction (Full):
---------------
a) Install any Red Hat/Fedora/Ubuntu release on a USB drive.
b) Boot into a user session.
c) Suspend.
d) Resume.
e) Repeat c) and d) (up to 4 times) until the kernel reports filesystem corruption.
f) Reset the system -> the system fails to boot and the kernel panics because of
the corrupted FS.
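
Steps c) to e) above can be automated roughly as follows. This is a minimal sketch, not the reporter's attached script: it assumes `rtcwake` from util-linux is installed, that it runs as root, and that a read-only remount of `/` is the first visible symptom. The function names and the 30-second wake timer are illustrative choices.

```python
import subprocess


def mount_options(mounts_text, mount_point="/"):
    """Return the option list of the last /proc/mounts entry for mount_point."""
    options = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 4 and fields[1] == mount_point:
            # Later entries shadow earlier ones (e.g. "rootfs" vs. the real device).
            options = fields[3].split(",")
    return options


def is_readonly(mounts_text, mount_point="/"):
    """True if mount_point is currently mounted read-only."""
    options = mount_options(mounts_text, mount_point)
    return options is not None and "ro" in options


def suspend_once(seconds=30):
    """Suspend to RAM and wake via the RTC alarm (assumes util-linux rtcwake, root)."""
    subprocess.run(["rtcwake", "-m", "mem", "-s", str(seconds)], check=True)


def cycles_until_readonly(max_cycles=4, suspend=suspend_once):
    """Suspend/resume up to max_cycles times; return the failing cycle or None."""
    for cycle in range(1, max_cycles + 1):
        suspend()
        with open("/proc/mounts") as f:
            if is_readonly(f.read()):
                return cycle
    return None
```

On an affected machine, `cycles_until_readonly()` would return the cycle number at which `/` turned read-only, or `None` if all cycles passed. Nothing is suspended until that function is called explicitly.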

Reconstruction (Short):
---------------
a) Install any Red Hat/Fedora/Ubuntu release on a USB drive.
b) Boot into a user session.
c) Suspend.
d) Hard reset the system (before performing a resume) -> the system fails to
boot and the kernel panics because of the corrupted FS.

This report actually describes two severe issues:
--------------------------------------------
a) FS corruption during suspend/resume.
b) Kernel panic when trying to mount the affected FS.

Facts:
------
*) Happens on systems whose boot device is on USB.
*) When the system returns from suspend, there is a ~25% chance that the rootfs
on USB is remounted read-only, because the USB device is briefly unreadable
(a device settle delay is required).
*) An attempt to boot from the affected FS ends at a useless initramfs prompt,
with no way to solve the issue locally.
*) Fixing the issue requires an additional Linux system (with auto-mount
deactivated!).
*) An attempt to mount the affected EXT4 filesystem results in the crash provided below.
   The corruption is silent and the FS is marked as clean, which is what causes
the kernel panic when the kernel tries to mount the corrupted FS.
*) The issue is probably caused by a partially synced disk cache.
*) All tested kernels are affected (Mainline, Lucid, Maverick, Natty,
Natty-Mainline-Daily, Fedora 14, Red Hat 6).
*) Kernels above 2.6.38 manage to mark the FS as dirty during the failed mount
attempt, so the corruption is effectively repaired after 2 reboots, but the data
loss remains.
*) The bug affects all filesystems, but is most severe for EXT4 (data loss,
though fixable from another system without automount) and BTRFS (unfixable,
because the mount-time repair fails and there is no way to fix it manually).
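
The "marked clean but corrupted" claim can be checked from a second system with the e2fsprogs tools. A minimal sketch, assuming the USB partition is unmounted, that `dumpe2fs -h` prints a `Filesystem state:` line, and that `e2fsck -n` sets bit 4 of its exit status when errors are left uncorrected, as its man page describes. The function names are illustrative, and the device path passed to `inspect()` is whatever the USB partition happens to be on the rescue system.

```python
import subprocess

E2FSCK_ERRORS_UNCORRECTED = 4  # exit-status bit documented in e2fsck(8)


def superblock_state(dumpe2fs_output):
    """Extract the 'Filesystem state' value from `dumpe2fs -h` output."""
    for line in dumpe2fs_output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "Filesystem state":
            return value.strip()
    return None


def is_silently_corrupt(state, fsck_status):
    """Superblock says clean, yet a forced read-only check found errors."""
    return state == "clean" and bool(fsck_status & E2FSCK_ERRORS_UNCORRECTED)


def inspect(device):
    """Run both tools against an UNMOUNTED ext2/3/4 device (needs root)."""
    header = subprocess.run(["dumpe2fs", "-h", device],
                            capture_output=True, text=True, check=True).stdout
    # -f forces a check even if marked clean, -n answers "no" to every fix.
    fsck = subprocess.run(["e2fsck", "-f", "-n", device],
                          capture_output=True, text=True)
    return is_silently_corrupt(superblock_state(header), fsck.returncode)
```

Because `e2fsck` runs with `-n`, the check itself does not modify the filesystem, so it can be run before any repair attempt.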

Eugene San (eugenesan) on 2011-01-24
description: updated
Jeremy Foshee (jeremyfoshee) wrote :

Hi Eugene,

Please be sure to confirm this issue exists with the latest development release of Ubuntu. ISO CD images are available from http://cdimage.ubuntu.com/daily/current/ . However, note you can only test Suspend, not Hibernate, when using a LiveCD. If the issue remains, run the following command from a Terminal (Applications->Accessories->Terminal). It will automatically gather and attach updated debug information to this report.

apport-collect -p linux 706795

Also, please be sure to take a look at https://wiki.ubuntu.com/DebuggingKernelSuspendHibernateResume . If you can provide any additional information outlined there it would be much appreciated.

Additionally, if you could try to reproduce this with the upstream mainline kernel that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. Thanks in advance.

    [This is an automated message. Apologies if it has reached you inappropriately; please just reply to this message indicating so.]

tags: added: kernel-suspend
tags: added: needs-kernel-logs
tags: added: needs-upstream-testing
tags: added: kj-triage
Changed in linux (Ubuntu):
status: New → Incomplete
Eugene San (eugenesan) wrote :

Hi,

What information exactly is this experiment expected to provide?
1) Logging a failed suspend/resume is not possible in this situation.
2) The filesystem corruption and the kernel crash during the mount attempt are already documented.
3) The inadequacy of the filesystem recovery procedure is obvious here.

If it's worth anything, the suspend/resume issues are the same on Fedora 14 and Red Hat 6. I'm not sure about the kernel crash, since their filesystem recovery procedure is different.

My observations show that, occasionally, the VFS wakes up before the block device on USB becomes ready (probably due to a delayed/forced USB host reset).
After the failure, XFS retries and survives, while EXT fails immediately and remounts root as R/O.
The big question here is why EXT leaves the filesystem in a broken state in the first place.
Isn't it supposed to commit both data and journal before suspend?

I am pretty sure mainline Linux is also affected, but I am ready to perform the validation.
Since the suspend/resume part of the issue is reproducible only if the root filesystem is EXT4 and on USB, a LiveCD is not an option. Is it OK to install a clean system on USB?

Eugene San (eugenesan) on 2011-01-30
summary: - Terminated suspend causes ext4 corruptions which crashes kernel on mount
- attempt
+ Suspending with rootfs (mainly ext4) on USB, cause corruption which
+ crashes kernel on mount attempt
summary: - Suspending with rootfs (mainly ext4) on USB, cause corruption which
- crashes kernel on mount attempt
+ [All releases] Suspending with rootfs (mainly ext4) on USB, cause silent
+ corruptions which panics kernel on mount attempt. Making system
+ unbootbale
summary: [All releases] Suspending with rootfs (mainly ext4) on USB, cause silent
corruptions which panics kernel on mount attempt. Making system
- unbootbale
+ unbootable
summary: [All releases] Suspending with rootfs (mainly ext4) on USB, cause silent
corruptions which panics kernel on mount attempt. Making system
- unbootable
+ unbootable.
Changed in linux (Ubuntu):
status: Incomplete → Confirmed

Hi all,

Update:
1) Corruption always occurs when suspending with the rootfs on USB.
2) The corruption is silent and the FS is marked as clean, which is what causes the kernel panic when the kernel tries to mount the corrupted FS.
3) The issue is probably caused by a partially synced disk cache.
4) All tested kernels are affected (Lucid, Maverick, Natty, Natty-Mainline-Daily, Fedora 14, Red Hat 6).
5) Kernels above 2.6.38 manage to mark the FS as dirty during the failed mount attempt, so the corruption is effectively repaired after 2 reboots, but the data loss remains.
6) The bug affects all filesystems, but is most severe for EXT4 (data loss, though fixable from another system without automount) and BTRFS (unfixable, because the mount-time repair fails and there is no way to fix it manually).

summary: - [All releases] Suspending with rootfs (mainly ext4) on USB, cause silent
- corruptions which panics kernel on mount attempt. Making system
+ [All releases] Suspending with rootfs on USB, cause silent corruptions
+ cause kernel panic on mount attempt, data loss and making system
unbootable.
summary: - [All releases] Suspending with rootfs on USB, cause silent corruptions
- cause kernel panic on mount attempt, data loss and making system
- unbootable.
+ [All releases] Suspending with rootfs on USB, causes silent corruptions,
+ kernel panic on mount attempt, data loss and leaving system unbootable.
Eugene San (eugenesan) wrote :

Attaching failed suspend-resume log.

Comments:
1) An unrecoverable read error is reported, but the device is fully readable when tested manually a few seconds later. The device probably needs more time to settle; maybe an analog of rootdelay= is required for resume?
2) The corruption also occurs when no resume is performed (after manually ejecting the device or hard-resetting the system).
3) The kernel is suspected of getting the VFS/USB suspend order wrong, or of not giving the USB device enough time after sync to physically commit data before suspending/resetting/powering off the USB host.
4) Several USB storage devices were tested; they all fail.
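
The settle-time hypothesis in comment 1 can be probed from user space by retrying a raw read with a delay. A sketch only, assuming the read fails with `OSError` while the device is not ready; the helper names, attempt count and delay are illustrative. This is a diagnostic, not a fix for the kernel-side ordering suspected in comment 3.

```python
import time


def read_with_settle(read_fn, attempts=5, delay=1.0, sleep=time.sleep):
    """Call read_fn until it succeeds; return (data, failed_attempts).

    Re-raises the last OSError if every attempt fails.
    """
    for failed in range(attempts):
        try:
            return read_fn(), failed
        except OSError:
            if failed == attempts - 1:
                raise
            sleep(delay)  # give the USB device time to settle


def read_first_sector(device, size=512):
    """Read the first bytes of a block device (needs read permission)."""
    with open(device, "rb", buffering=0) as f:
        return f.read(size)
```

Right after resume, something like `read_with_settle(lambda: read_first_sector(dev))` would report how many attempts failed before the device answered. A read served from the page cache may succeed without touching the device, so a result of zero failures is weak evidence on its own.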

Eugene San (eugenesan) on 2011-01-30
description: updated
Eugene San (eugenesan) on 2011-01-30
tags: removed: needs-kernel-logs needs-upstream-testing
Eugene San (eugenesan) wrote :

Providing suspend resume cycle with pm-trace and USB debug

Eugene San (eugenesan) wrote :

Reconstruction script, please retry up to 4 times

Eugene San (eugenesan) on 2011-02-25
summary: - [All releases] Suspending with rootfs on USB, causes silent corruptions,
- kernel panic on mount attempt, data loss and leaving system unbootable.
+ [All releases] Suspend/Resume with rootfs on USB, causes filesystem
+ corruptions and kernel panic on mount attempt, leaving system unbootable
+ with data lost.
description: updated
Hendrik van Wyk (tonberry) wrote :

This seems very similar to what I am seeing on my system if I use 2.6.37 instead of Maverick's usual 2.6.35. I have been running Ubuntu off a flash drive on this system since 9.04 and only recently started to encounter this. Unlike the original report, I have never had the accompanying kernel panics at boot. Fsck runs at boot, manages to repair the filesystem, and the system then boots properly.

It should probably be noted that my setup is a little different: I run two USB flash drives in RAID 0 instead of a single drive, and my /boot partition is on the SSD that came with the laptop and has so far not been affected by this. I also ran a single USB drive for a while a year or so back, and I do not recall encountering this issue.

Brad Figg (brad-figg) on 2011-09-12
tags: added: maverick
somejan (somejan) wrote :

I have a similar problem which I suspect is the same bug. In my case it's a non-root ext3 fs on an external USB disk. I regularly leave it plugged in while suspending/resuming, and sometimes it becomes corrupted. The fs stays mounted across suspend/resume, but after the corruption programs start getting strange I/O errors. Once the fs was reported as being over 200 TB in size (it's about 240 GB). So far an fsck has been able to fix it (in some cases only after several runs) and I haven't lost any data. I'm running an Ubuntu 2.6.32 LTS kernel, but I could test under mainline later when I have time.

Ubfan (ubfan1) wrote :

I started seeing disk checks at boot time 25% of the time on a new 8G USB Maverick installation, currently running kernel 2.6.35.30. I noticed that at shutdown the USB light blinks right up to the fraction of a second when power is cut to the laptop, so I assumed that some filesystem buffer had not been fully flushed. On an older 4G USB Maverick installation, there is a noticeable gap between when the USB light stops blinking and when the power to the laptop is turned off.
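
One way to probe the unflushed-buffer suspicion is to look at how much dirty data the kernel still holds just before power-off or suspend. A sketch, assuming the `Dirty:` and `Writeback:` fields of `/proc/meminfo` (reported in kB); the function names are illustrative. Zero pending data here would not prove that the USB device's own write cache has been committed.

```python
def pending_writeback_kb(meminfo_text):
    """Sum the Dirty and Writeback fields (in kB) from /proc/meminfo text."""
    total = 0
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("Dirty", "Writeback"):
            total += int(rest.split()[0])
    return total


def current_pending_kb():
    """Read /proc/meminfo and return the pending write-back volume in kB."""
    with open("/proc/meminfo") as f:
        return pending_writeback_kb(f.read())
```

Logging `current_pending_kb()` from a late shutdown or pre-suspend hook would show whether a large amount of data is still queued at the moment the light is blinking.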

The second half of the bug looks like https://bugs.launchpad.net/ubuntu/+source/linux/+bug/766970
(My experience is filed under a duplicate as https://bugs.launchpad.net/ubuntu/+source/linux/+bug/910041)

I also saw fsck fixing some issues at boot before I saw the oops on mount. But I had attributed it to hdd errors or something.

Eugene San, this bug was reported a while ago and there hasn't been any activity in it recently. We were wondering if this is still an issue? Can you try with the latest development release of Ubuntu? ISO CD images are available from http://cdimage.ubuntu.com/releases/ .

If it remains an issue, could you run the following command in the development release from a Terminal (Applications->Accessories->Terminal). It will automatically gather and attach updated debug information to this report.

apport-collect -p linux <replace-with-bug-number>

Also, if you could test the latest upstream kernel available that would be great. It will allow additional upstream developers to examine the issue. Refer to https://wiki.ubuntu.com/KernelMainlineBuilds . Please do not test the kernel in the daily folder, but the one all the way at the bottom. Once you've tested the upstream kernel, please remove the 'needs-upstream-testing' tag. This can be done by clicking on the yellow pencil icon next to the tag located at the bottom of the bug description and deleting the 'needs-upstream-testing' text. As well, please comment on which kernel version specifically you tested.

If this bug is fixed in the mainline kernel, please add the following tag 'kernel-fixed-upstream'.

If the mainline kernel does not fix this bug, please add the tag: 'kernel-bug-exists-upstream'.

If you are unable to test the mainline kernel, for example it will not boot, please add the tag: 'kernel-unable-to-test-upstream', and comment as to why specifically you were unable to test it.

Please let us know your results. Thanks in advance.

Helpful Bug Reporting Links:
https://help.ubuntu.com/community/ReportingBugs#Bug_Reporting_Etiquette
https://help.ubuntu.com/community/ReportingBugs#A3._Make_sure_the_bug_hasn.27t_already_been_reported
https://help.ubuntu.com/community/ReportingBugs#Adding_Apport_Debug_Information_to_an_Existing_Launchpad_Bug
https://help.ubuntu.com/community/ReportingBugs#Adding_Additional_Attachments_to_an_Existing_Launchpad_Bug

tags: added: needs-upstream-testing
tags: added: lucid natty
Changed in linux (Ubuntu):
importance: Undecided → Medium
status: Confirmed → Incomplete