[TOPBLOCKER] file corruption on touch images in rw portions of the filesystem

Bug #1387214 reported by Jamie Strandboge on 2014-10-29
20
This bug affects 3 people
Affects Status Importance Assigned to Milestone
Canonical System Image
Undecided
Unassigned
android (Ubuntu)
Critical
Sergio Schvezov
android (Ubuntu RTM)
Critical
Ricardo Salveti
android-tools (Ubuntu)
Low
Oliver Grawert
android-tools (Ubuntu RTM)
Low
Oliver Grawert
initramfs-tools-ubuntu-touch (Ubuntu)
Critical
Ricardo Salveti
initramfs-tools-ubuntu-touch (Ubuntu RTM)
Critical
Ricardo Salveti
linux-mako (Ubuntu)
Critical
Paolo Pisati
linux-mako (Ubuntu RTM)
Critical
Paolo Pisati

Bug Description

Symptoms are that cache files in /var/cache/apparmor and profiles in /var/lib/apparmor/profiles are sometimes corrupted after a reboot. We've already fixed several bugs in the apparmor and click-apparmor and made both more robust in the face of corruption and we've reduced the impact when there is a corrupted profile, but we've still not found the cause of the corruption. This corruption can still affect real-world devices: if a profile in /var/lib/apparmor/profiles is corrupted and the cache file is out of date, then the profile won't compile and that app/scope won't start.

Workaround: remove the affected profile and then run 'sudo aa-clickhook'. This obviously is not viable on an end-user device.

The investigation is ongoing and this may not be a problem with the kernel at all, so this bug may be retargeted to another project.

The security team and the kernel team have discussed this a lot and Colin King is currently looking at this. This bug is just so it can be tracked. Here is an excerpt from my latest email to Colin:

"I believe I have conclusively ruled out apparmor_parser and aa-clickhook by creating a new 'home/bug/test-with-true.sh'. Here is the test output:

http://paste.ubuntu.com/8648109/

Specifically, home/bug/test-with-true.sh changes the interesting parts of the algorithm to:

1. wait for unity8 to start (this ensures the apparmor upstart job is finished)
2. restore apparmor_parser and aa-clickhook, if needed
3. if /home/bug/profiles... exists, perform a diff -Naur /home/bug/profiles...
   /var/lib/apparmor/profiles and fail if differences (note, apparmor_parser
   and aa-clickhook were /bin/true during boot so they could not have changed
   /var/lib/apparmor/profiles)
4. verify the profiles, exit with error if they do not
5. alternately upgrade/downgrade the packages
6. verify the profiles, exit with error if they do not
7. copy the known good profiles in the previous step to /home/bug/profiles...
8. have apparmor_parser and aa-clickhook point to /bin/true
9. reboot
10. go to step 1

In the paste you'll notice that in step 6 the profiles were successfully created by the installation of the packages, then verified, then copied aside, then apparmor_parser and aa-clickhook diverted, then rebooted, only to have the profiles in /var/lib/apparmor/profiles be different than what was copied aside. It would be nice to verify on your device as well (I reproduced several times here) and verify the reproducer algorithm. I think this suggests this is a kernel issue and not userspace.

IMPORTANT: you will want to update the reproducer and refollow all of these steps (ie, I updated the scripts, the debs, the sudoers file, etc):
$ wget http://people.canonical.com/~jamie/cking/aa-corruption.tar.gz
$ tar -zxvf ./aa-corruption.tar.gz
...

$ adb push ./aa-corruption.tar.gz /tmp
$ adb shell
phablet@ubuntu-phablet:~$ cd /tmp
phablet@ubuntu-phablet:~$ tar -zxvf ./aa-corruption.tar.gz
phablet@ubuntu-phablet:~$ sudo mount -o remount,rw /
phablet@ubuntu-phablet:~$ sudo cp ./aa-corruption/etc/sudoers.d/phablet
/etc/sudoers.d/
phablet@ubuntu-phablet:~$ sudo mount -o remount,ro /
phablet@ubuntu-phablet:~$ sudo cp -a ./aa-corruption/home/bug /home
phablet@ubuntu-phablet:~$ exit
$ cd ./aa-corruption
$ ./test-from-host.sh
...

The old script is still in place. Simply adjust ./test-from-host.sh to have:
testscript=/home/bug/test.sh
#testscript=/home/bug/test-with-true.sh"

The kernel team has verified the above reproducer and symptoms.

Related bugs:
* bug 1371771
* bug 1371765
* bug 1377338

description: updated
description: updated
description: updated
description: updated
description: updated
Jamie Strandboge (jdstrand) wrote :

Added application-confinement and apparmor tags since this bug affects both and it will be easier to find.

description: updated
tags: added: application-confinement
tags: added: apparmor
tags: added: kernel-key
Colin Ian King (colin-king) wrote :

Made some progress today.

On the phone, I am seeing:

/var/lib/apparmor/profiles/click_com.ubuntu.filemanager_filemanager_0.3.275 containing a pathname and all zeros. The start is always on a page boundary and the end is always on a page boundary.

I copied the entire partition /dev/mmcblk0p23 over adb back to my laptop, mounted it and then mounted /mnt/ubuntu.img and the same file is sane and not corrupted. So the underlying data is OK.

corrupted data contains /usr/share/click/preinstalled/com.ubuntu.music/1.3.625/apparmor.json

Cannot find any symlinks that would relate to this.

Next step, I'm adding debug into the symlink name to see if this appears in the corrupt data to verify it is a symlink.

Colin Ian King (colin-king) wrote :

The corruption to /var/lib/apparmor/profiles/click_com.ubuntu.filemanager_filemanager_0.3.275 survives multiple reboots. I'll take another 6GB snapshot of the underlying partition and see if that's now corrupted.

Colin Ian King (colin-king) wrote :

After several reboots, the data still appears corrupted on the phone, but copying the underlying raw device /dev/mmcblk0p23 to my laptop and loop mounting it and then loop mounting ubuntu.img shows an uncorrupted var/lib/apparmor/profiles/click_com.ubuntu.filemanager_filemanager_0.3.275. I'm now going to test this on another device with a 3.4 kernel to see what I get.

Colin Ian King (colin-king) wrote :

Ruled out the bind mount of /var/lib/apparmor/profiles on /userdata/system-data/var/lib/apparmor/profiles, still see corruption there on the device

Colin Ian King (colin-king) wrote :

Sanity checked the raw data from /dev/mmcblk0p23:

1. copied raw data off the phone to may laptop
2. using sshfs, mounted the directory containing the raw data snapshot back on the phone
3. loop mounted it
4. loop mounted ubuntu.img from this
5. /ubuntu/var/lib/apparmor/profiles is sane, no corruption

Colin Ian King (colin-king) wrote :

On the phone:

debugfs /userdata/ubuntu.img
cat /var/lib/apparmor/profiles/click_com.ubuntu.filemanager_filemanage
# vim:syntax=apparmor

#include <tunables/global>

# Define vars with unconfined since autopilot rules may reference them
# Specified profile variables
@{APP_APPNAME}="filemanager"
@{APP_ID_DBUS}="com_2eubuntu_2efilemanager_5ffilemanager_5f0_2e3_2e275"
@{APP_PKGNAME_DBUS}="com_2eubuntu_2efilemanager"
@{APP_PKGNAME}="com.ubuntu.filemanager"
@{APP_VERSION}="0.3.275"
@{CLICK_DIR}="/usr/share/click/preinstalled"
..
etc

so the underlying file system is verified as sane

Colin Ian King (colin-king) wrote :

I've searched the entire block device for the string /usr/share/click/preinstalled/com.ubuntu.music/1.3.625/apparmor.json and tagged it in such a way as it is obvious it that it has been modified on the flash drive. I rebooted and double checked - the modified data is still modified on disk however the corrupted file contains the original data.

So the underlying file system is sane. The in-memory view of the file seems borked.

Colin Ian King (colin-king) wrote :

After a lot of deep digging into the bind mount, loop driver, and buffer cache and tracking the corrupt pages back down the layers of the stack we've sanity checked this down to the image. The smoking gun was the kernel message:

Nov 6 12:15:16 ubuntu-phablet kernel: [ 3.940485] do_mount: /dev/loop0 -> /root [<null>]
Nov 6 12:15:16 ubuntu-phablet kernel: [ 3.941095] EXT2-fs (loop0): warning: mounting unchecked fs, running e2fsck is recommended
Nov 6 12:15:16 ubuntu-phablet kernel: [ 3.941431] do_mount return -> 0

(apologies for my extra debug).

So it appears that /dev/loop0 is being mounted and it is corrupted. I ran fsck on /userdata/system.img and /userdata/ubuntu.img only to find that the system.img needed some fixing:

fsck /userdata/system.img
fsck from util-linux 2.25
e2fsck 1.42.10 (18-May-2014)
/userdata/system.img was not cleanly unmounted, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 3A: Optimizing directories
Pass 4: Checking reference counts
Unattached inode 3225
Connect to /lost+found<y>? yes
Inode 3225 ref count is 2, should be 1. Fix<y>? yes
Unattached inode 3709
Connect to /lost+found<y>? yes
Inode 3709 ref count is 2, should be 1. Fix<y>? yes
Unattached inode 3808
Connect to /lost+found<y>? yes
Inode 3808 ref count is 2, should be 1. Fix<y>? yes
Unattached inode 4427
Connect to /lost+found<y>? yes
Inode 4427 ref count is 2, should be 1. Fix<y>? yes
Unattached inode 4485
Connect to /lost+found<y>? yes
Inode 4485 ref count is 2, should be 1. Fix<y>? yes
Unattached inode 5889
Connect to /lost+found<y>? yes
Inode 5889 ref count is 2, should be 1. Fix<y>? yes
Unattached inode 5943
Connect to /lost+found<y>? yes
Inode 5943 ref count is 2, should be 1. Fix<y>? yes
Unattached inode 7853
Connect to /lost+found<y>? yes
Inode 7853 ref count is 2, should be 1. Fix<y>? yes
yyyPass 5: Checking group summary information
Block bitmap differences: -70903 -71144 -71201 -(71674--71675) -71727 -71852 -72689 -72757 -(74519--74520) -74869 -74961 +(92082--92087) +(92089--92092) -92102 +92104 +92114 +y92119 +(92121--92131)
Fix<y>? yes
Free blocks count wrong for group #13 (8813, counted=8820).
Fix<y>? yes
Free blocks count wrong (133222, counted=133229).
Fix<y>? yes
Inode bitmap differences: +(19989--20010) +(20013--20014) -(20545--20549)y -(20551--20569)
Fix<y>? yes
Free inodes count wrong for group #13 (3225, counted=3232).
Fix<y>? yes
Directories count wrong for group #13 (761, counted=760).
Fix<y>? yes
Free inodes count wrong (81946, counted=81953).
Fix<y>? yes

/userdata/system.img: ***** FILE SYSTEM WAS MODIFIED *****
/userdata/system.img: ***** REBOOT LINUX *****

So, there are two big issues outstanding, most probably in the user space shutdown and initrd stages:

1. The file system is not being flushed and unmounted properly.
2. The file system is not being fsck'd before mounting - this is a cardinal sin IMHO

The end result is mounting a corrupt file system that is causing the garbage in the apparmor files.

tags: added: rtm14
affects: linux (Ubuntu) → system-image (Ubuntu)
Stéphane Graber (stgraber) wrote :

A suggested fix for this would be:

initrd:
 - At every boot, do a minimal fsck check for problems, if problem, write "fsck <partition>" to /cache/recovery/ubuntu_command and reboot to recovery

recovery:
 - Add support for the fsck stanza, when getting it, run fsck in fix mode, if possible, supporting the usual yes/no questions using pixelflinger and button input.

rootfs:
 - If /cache/recovery/broken_root exists, immediately pull the latest full image and reboot for flashing

system-image-client:
 - Add a fsck run to the standard ubuntu_command prior to flashing

Stéphane Graber (stgraber) wrote :

Adding an initramfs-tools-ubuntu-touch and android task since that's where the bulk of the work ought to go.

Jamie Strandboge (jdstrand) wrote :

Huge thanks to Colin (and apw) for this find.

I want to state that this is a very real problem and not theoretical. I've seen it many times, it is triggerable with enough reboots, and it has been seen on krillin by non-developers (ie, just through normal reboots and system-image updates). stgraber outlined the start of a fix here: http://paste.ubuntu.com/8851794/

IMO, we must solve #2 from Colin's comment #9 (perform an fsck using something like stgraber's approach) before the golden image. If we have that, we can solve #1 in OTA (though in an ideal world that would be fixed prior to golden image since we don't want people dropping into recovery mode if we can at all help it).

Changed in system-image (Ubuntu):
assignee: Colin Ian King (colin-king) → nobody
Changed in initramfs-tools-ubuntu-touch (Ubuntu):
importance: Undecided → Critical
Changed in android (Ubuntu):
importance: Undecided → Critical
Changed in system-image (Ubuntu):
status: Confirmed → Triaged
Changed in initramfs-tools-ubuntu-touch (Ubuntu):
status: New → Triaged
Changed in android (Ubuntu):
status: New → Triaged
Michael Frey (mfrey) on 2014-11-06
Changed in android (Ubuntu):
assignee: nobody → Ricardo Salveti (rsalveti)
Olli Ries (ories) on 2014-11-06
tags: added: touch-2014-11-06
Barry Warsaw (barry) wrote :

If you want a separate fsck command in ubuntu_command, please file a separate bug on upstream system-image project. Other than that, it doesn't look like this particular bug affects system-image client.

no longer affects: system-image (Ubuntu)
Olli Ries (ories) on 2014-11-06
summary: - file corruption on touch images in rw portions of the filesystem
+ [TOPBLOCKER] file corruption on touch images in rw portions of the
+ filesystem
tags: added: lt-blocker lt-category-visible lt-prio-high
Jamie Strandboge (jdstrand) wrote :

FYI, unconfirmed report via ubuntu-phone@: "Glad it isn't only me having this issue -- I have come across this issue twice and it's been files in /home/ getting corrupted."

description: updated
James Hunt (jamesodhunt) wrote :

See bug 1387214 (logrotate failing, resulting in high logfile growth) for another example of FS corruption that is seemingly caused by this issue.

I posted the comment linked in #15.

The partition I have had files corrupted on does have a e2fsck run on it before mounting, it's a LUKS encrypted partition and I added the e2fsck check after it became corrupted and I have also had files corrupted since.

I hadn't done anything to ensure that this partition was unmounted when the device is shutdown but I have now amended /etc/init.d/umountfs so it should be next time.

Niklas Wenzel (nikwen) wrote :

I've had this issue multiple times yet. For me it has always been like that: The content of a file gets overriden and therefore the system fails to do something.

Last time my system settings broke completely. When doing a "cat ~/.config/dconf/user" I got:
"/opt/click.ubuntu.com/com.ubuntu.developer.webapps.webapp-gmail/1.0.25"
And after a reboot it now shows this:
"/opt/click.ubuntu.com/com.ubuntu.developer.filip-dobrocky.2048native/0.2"

Deleting "~/.config/dconf/user" did the trick for me.

Some other time system updates didn't work and doing a "cat /cache/recovery/image-master.tar.xz.asc" revealed it contained the following content:
"/opt/click.ubuntu.com/com.ubuntu.developer.majster-pl.utorch/1.6"

Oliver Grawert (ogra) wrote :

your dconf issues are caused by illegal use of dconf by an app (apps should not be able to use dconf according to the security policy as i understand it, to avoid exactly this), please file a fresh bug for this.

Niklas Wenzel (nikwen) wrote :

@ogra: I don't think it's a dconf issue. Why would the same happen to /cache/recovery/image-master.tar.xz.asc then?

Niklas Wenzel (nikwen) wrote :

Doing a short grep, I also found that line in a lot of files, e.g. in the following one: "~/.cache/upstart/logrotate.log"

The file started with "/opt/click.ubuntu.com/com.ubuntu.developer.filip-dobrocky.2048native/0.2", then continued with its normal content and in the middle of the file the following line could be found which clearly isn't there for logical reasons:
"error: b/opt/click.ubuntu.com/com.ubuntu.developer.dpniel.dekko/0.2.9error: bad top line in state file /home/phablet/.cache/logrotate/status" All other lines looked like this one: "error: bad top line in state file /home/phablet/.cache/logrotate/status"

Having seen that error message, I checked the file it mentioned. Guess what its content is:
"/opt/click.ubuntu.com/com.ubuntu.developer.dpniel.dekko/0.2.9"

There actually seem to be a lot of corrupt files like these ones and it's the second Ubuntu (Touch) installation which gives me those results.

However, if you still think, it's a different bug, I will be happy to hide all my comments here and file a new bug. ;)

Matthew Paul Thomas (mpt) wrote :

As far as I know, there is no design spec for the startup process, and no designer assigned to it. In the meantime...

An automatic filesystem check being interactive makes less and less sense over time. (As the sheer number of files on a device increases, and the proportion of apps that are document-based decreases, you're less and less likely to recognize a file by its pathname.) And it makes *much* less sense on a phone, where you barely see the filesystem at all, so the probability that you can make an informed decision about fixing an individual file is pretty much zero.

So, in the "minimal fsck reveals an error" case, I suggest just triggering a fsck -y. Don't make it interactive. Ideally, add text something like "Repairing…" to the startup screen, or (during string freeze) vary the visual design of the startup screen in *some* tasteful way, to minimize the proportion of people who will think that the phone has got stuck while starting up and try to fix it by holding down the power button.

For the "immediately pull the latest full image and reboot for flashing" case, you might find useful the design for a failed system update. You'd need to change the text a bit. <https://wiki.ubuntu.com/SoftwareUpdates#Download_or_installation_failure>

Changed in android (Ubuntu):
status: Triaged → In Progress
Changed in initramfs-tools-ubuntu-touch (Ubuntu):
status: Triaged → In Progress
assignee: nobody → Ricardo Salveti (rsalveti)
Changed in android (Ubuntu):
assignee: Ricardo Salveti (rsalveti) → Sergio Schvezov (sergiusens)
Changed in initramfs-tools-ubuntu-touch (Ubuntu):
status: In Progress → Fix Released
Oliver Grawert (ogra) wrote :

adding an android-tools task, since "adb reboot" forces a hard reset without unmounting the filesystems ... adding something along:

<ogra_> root@ubuntu-phablet:~# echo u > /proc/sysrq-trigger
<ogra_> root@ubuntu-phablet:~# touch /userdata/foo
<ogra_> touch: cannot touch ‘/userdata/foo’: Read-only file system

to core/adbd/services.c in the reboot_service function (instead of the vdc call) should help here

Changed in android-tools (Ubuntu):
status: New → In Progress
Changed in android-tools (Ubuntu RTM):
status: New → In Progress
Changed in android-tools (Ubuntu):
importance: Undecided → Critical
Changed in android-tools (Ubuntu RTM):
importance: Undecided → Critical
Changed in android-tools (Ubuntu):
assignee: nobody → Oliver Grawert (ogra)
Changed in android-tools (Ubuntu RTM):
assignee: nobody → Oliver Grawert (ogra)
Daniel Holbach (dholbach) wrote :

I think bug 1385464 was meant in the case of logrotate (comment 15).

Changed in android (Ubuntu RTM):
assignee: nobody → Ricardo Salveti (rsalveti)
status: New → In Progress
importance: Undecided → Critical
Changed in initramfs-tools-ubuntu-touch (Ubuntu RTM):
status: New → In Progress
importance: Undecided → Critical
assignee: nobody → Ricardo Salveti (rsalveti)
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package android - 20141117-0039-0ubuntu1

---------------
android (20141117-0039-0ubuntu1) vivid; urgency=medium

  * New upstream release:
    - Checking partitions and filesystems when updating (LP: #1387214)
  * Dropping changing_series_to_vivid.patch and
    fixing-adb-lock-inhibition.patch, both already upstream
 -- Ricardo Salveti de Araujo <email address hidden> Mon, 17 Nov 2014 11:21:45 -0200

Changed in android (Ubuntu):
status: In Progress → Fix Released
Ricardo Salveti (rsalveti) wrote :

So this is not a normal fs corruption issue. I was able to run the aa-corruption test quite a few times with the extra e2fsck logic we added, and was easily able to get the corrupted files on krillin, but which is even more interesting is that the files are not actually corrupted for e2fsck (not pointing out a single fs error, even though the files are incorrect).

There's clearly a pattern with click paths under /opt/click.ubuntu.com inside the corrupted files. /opt/click.ubuntu.com is also bind-mounted, and it seems to be where it's mostly heavily used by the system.

So in the end it looks like with a quite bizarre error under rw filesystems (ext4) that are bind-mounted. After migrating /opt/click.ubuntu.com, /var/lib/apparmor and /var/cache/apparmor under '/' (not bind-mounted), I was able to run the same aa-corruption test case for a few without any issues (by default I can always get the issue after the first test iteration).

Will let it running for a few more iterations and report back, but one bad thing is that running e2fsck is really not going to help much here.

Ricardo Salveti (rsalveti) wrote :

Just migrated /opt/click.ubuntu.com under '/' (not bind-mounted) and was able to run the same test case for more than 2 hours without any issue.

It fails again at the moment I move it back under /data (bind-mounted at /opt/click.ubuntu.com), on the first run.

Ricardo Salveti (rsalveti) wrote :

Used links instead of bind-mounts (for /userdata) and was still able to reproduce the issue. So not necessarily related with bind-mounts itself.

Ricardo Salveti (rsalveti) wrote :

It works fine if I move the writable paths under /, and it also works fine after creating another image file (ext4) that gets stored under /userdata, to be consumed by the writable paths (using bind-mounts and everything).

But if I use bind mounts from /userdata/system-data directly, I can easily get it to break. The same setup works fine when using / instead (same bind-mounts and etc, but on a different dir path).

The only peculiarity with /userdata is that it's shared between both containers. Ubuntu uses /userdata/system-data, /userdata/SWAP.img and /userdata/user-data. Android uses /userdata/android-data.

Ricardo Salveti (rsalveti) wrote :

I'm also unable to reproduce this issue when mounting /userdata (and also it's bind mounts) with data=journal instead of data=ordered. This might be a consequence of it being slower, not yet sure.

Decided to try that to see if I'd get any error when running e2fsck, but still nothing. Only side effect is that I can't really reproduce the problem.

Running for almost an hour now, will let it for a few more.

Oliver Grawert (ogra) wrote :

after researching the adbd part for two days it seems that adbd already tries to call "echo u > /proc/sysrq-trigger" when adb reboot is issued but at this point we already dropped privs to the phablet user ...

/proc/sysrq-trigger is owned by root:system and writable for both, the only solution i see (beyond making /proc/sysrq-trigger owned by phablet or its group which would rip a giant security hole) is to make adbd start with "setguid system" and have it drop this group membership right before any adb shell call (so that the logged in phablet user is not member of system by default)

i'm trying to implement this but am constantly running into smaller issues.

Ricardo Salveti (rsalveti) wrote :

Colin King is investigating the issue on the filesystem level. He got an easier way to reproduce the problem, and is currently investigating possible patch sets from upstream.

Changed in linux-mako (Ubuntu):
assignee: nobody → Colin Ian King (colin-king)
Changed in linux-mako (Ubuntu RTM):
assignee: nobody → Colin Ian King (colin-king)
Changed in linux-mako (Ubuntu):
status: New → Confirmed
Changed in linux-mako (Ubuntu RTM):
status: New → Confirmed
Changed in linux-mako (Ubuntu):
importance: Undecided → Critical
Changed in linux-mako (Ubuntu RTM):
importance: Undecided → Critical
Changed in android (Ubuntu RTM):
status: In Progress → Fix Committed
Changed in initramfs-tools-ubuntu-touch (Ubuntu RTM):
status: In Progress → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package android - 20141117-0039-0ubuntu2

---------------
android (20141117-0039-0ubuntu2) vivid; urgency=medium

  * No change rebuild to pick up initramfs changes.
 -- Ricardo Salveti de Araujo <email address hidden> Tue, 18 Nov 2014 22:21:39 -0200

Changed in android (Ubuntu RTM):
status: Fix Committed → Fix Released
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package initramfs-tools-ubuntu-touch - 0.80

---------------
initramfs-tools-ubuntu-touch (0.80) vivid; urgency=medium

  * scripts/touch: mounting userdata with data=journal as a workaround
    for bug LP: #1387214
 -- Ricardo Salveti de Araujo <email address hidden> Tue, 18 Nov 2014 15:33:16 -0200

Changed in initramfs-tools-ubuntu-touch (Ubuntu RTM):
status: Fix Committed → Fix Released
tags: removed: lt-blocker
Colin Ian King (colin-king) wrote :

We've been exercising the rtm images that include the latest workaround now for a considerable amount of time repeating the test that originally triggered the issue.

On mako, 2 devices running:
     device #1, 329 iterations - no corruption observed
     device #2, 251 iterations - no corruption observed

And also on another spare device using rtm image,
     device #3, 176 iterations, - no corruption observed

I can continue running these for another few days if need be, but I think it is fairly conclusive that the data=journal change prevents the apparmor metadata from ending up on the /var/lib/apparmor/profiles/* files.

Changed in linux-mako (Ubuntu):
assignee: Colin Ian King (colin-king) → Paolo Pisati (p-pisati)
Changed in linux-mako (Ubuntu RTM):
assignee: Colin Ian King (colin-king) → Paolo Pisati (p-pisati)
Colin Ian King (colin-king) wrote :

@Ricardo,

I'm added copious amounts of debug into the kernel and a shim on umount too and I can't see /dev/mmcblk023 being unmounted anywhere. Can you inform me where to expect this umount is actioned on shutdown. I just can't see it.

With full journaling on I'd expect this to cope with a non-umount as it can replay the journaled data and metadata and so we don't get errors, however, with the data=ordered option I could expect to see metadata being perhaps sane where as the file data being not written back and hence we see old data and possibly the reason for this corruption.

So, I'd really like to have some idea where the umount actually occurs.

The debug shows me:

boot --> initrd --> mount /dev/mmcpblk023 onto /tmpmnt

and then

[ 5.142499] mount: /root/userdata
[ 5.142591] do_mount: /tmpmnt -> /root/userdata [<null>]
[ 5.143811] do_mount return -> 0

But for the life of me, I can't see this being unmounted. So that's my concern, it may not be umounted and hence the root cause of the data corruption.

Colin Ian King (colin-king) wrote :

So just to re-iterate:

1. I believe the file system is not being cleanly umounted.
2. data=journal saved us from disaster because it works so well.

My recommendation is to keep data=journal because a user can power off the device (battery death, pulling out battery, random kernel reboot, etc) and data=journal has conclusively shown it the only RELIABLE way to ensure data and metadata are sane.

I really would like to warn against fixing this bug by doing the umount and changing back to the faster but definitely less rugged data=ordered option. I think that would really be too risky IMHO.

Oliver Grawert (ogra) wrote :

after running with the change for one week on my krillin i must say that i dont really see any performance impact or any other bad behavior from the change ...

it would be nice to have some upstart job that at least does an emergecy readonly mount on shutdown/reboot like we plan for adbd "adb reboot" by echoing "u" into /proc/sysrq-trigger ... that way we would at least be sure we do not unmount dirty and calling this command takes close to no time.

Colin Ian King (colin-king) wrote :

I don't thing we're seeing many writes to that partition, most of it's probably data pages being flushed out in the background so it's not going to bite too hard. We do have tools to figure out how much is being written if it needs some analysis.

On Fri, Nov 28, 2014 at 3:29 PM, Oliver Grawert <email address hidden> wrote:
> it would be nice to have some upstart job that at least does an emergecy
> readonly mount on shutdown/reboot like we plan for adbd "adb reboot" by
> echoing "u" into /proc/sysrq-trigger ... that way we would at least be
> sure we do not unmount dirty and calling this command takes close to no
> time.

We could cover most of the possible causes for not unmounting it
properly, just wonder if we should still take the risk of moving back
to data=ordered.

Some use cases:
* Hard reboot (power + volume up): the userspace can probably be
notified by that, forcing the ro remount
* Adb: doing the emergency readonly mount as suggested by oliver
* $ sudo reboot: just making sure the halt process indeed umount the
partition successfully
* Low battery: forcing shutdown if batter lower than 3% (?)
* Kernel panic: can we force kernel to flush things into the disk when
we get a crash? This is important as well as it seems we can easily
make the kernel to crash still (specially when using bluetooth).

Oliver Grawert (ogra) wrote :

i think the "sudo reboot" and "low battery" case are the same, both should use init's shutdown/reboot commands ...

i dont think a kernel panic is a "normal usecase", if we ship panicking devices i think we failed ... developers should simply be aware that panics can cause corruption, this is no different to desktop development

Ricardo Salveti (rsalveti) wrote :

On Tue, Dec 2, 2014 at 7:30 AM, Oliver Grawert <email address hidden> wrote:
> i think the "sudo reboot" and "low battery" case are the same, both
> should use init's shutdown/reboot commands ...

Right.

> i dont think a kernel panic is a "normal usecase", if we ship panicking
> devices i think we failed ... developers should simply be aware that
> panics can cause corruption, this is no different to desktop development

Well, it might not necessarily be a "normal" usecase, but we should
expect that to happen. It's better to better handle the crash than
expecting it to no crash at all, as we can't guarantee that with this
kernel.

And we should be thinking about normal users instead of developers :-)

Colin Ian King (colin-king) wrote :

I think also we need to consider security of data is the ultimate goal. Quite frankly, unpredictable things happen users can do all kinds of things like yank out the battery and trip kernel panics. Nobody has formally verified that the kernel is perfect so there is always the possibility of the device spontaneously rebooting or dieing before data is written back.

Jamie Strandboge (jdstrand) wrote :

There are multiple tasks still open for this bug. Can they be closed or are there still things to do?

Oliver Grawert (ogra) wrote :

adbd can still not remount the filesystem readonly on "adb reboot" due to the fact that we drop all privileges writing to /proc/sysrq-trigger isnt possible and i havent found a way around this yet ... (this is a corner case and potential fs corruption should be handled by the fsck on next boot anyway, but it would be nice if adbd could work as expected here, which is why the task is still open)

Oliver Grawert (ogra) wrote :

setting the android-tools task to a realistic value, the remaining bit is a corner case for normal users and is already shielded from doing any damage by the fact that we do an fsck on every boot now.

Changed in android-tools (Ubuntu):
importance: Critical → Low
Changed in android-tools (Ubuntu RTM):
importance: Critical → Low
Pat McGowan (pat-mcgowan) wrote :

marking resolved for purposes of this project

Changed in canonical-devices-system-image:
status: New → Fix Released

I might have hit this issue again, I have some files in a git repo on my phone (mako running devel channel) and when I tried to do a `git commit -a` just now I got:

error: corrupt loose object '4e1f9449b61f1b1eea1415de21c1c90db0a01f45'
fatal: loose object 4e1f9449b61f1b1eea1415de21c1c90db0a01f45 (stored in .git/objects/4e/1f9449b61f1b1eea1415de21c1c90db0a01f45) is corrupt

This is a 6.5K zlib compressed data file, when I moved it out of the way and tried to do a `git commit -a` I was prompted to add things to the repo that were already there, at this point I gave up and checked out a fresh working copy of the repo.

It is of course possible that this corruption was caused by something like a `git pull` failing half way though due to a network connection issue, but I don't remember this happening.

平凡 (sysy5435) on 2016-07-12
Changed in linux-mako (Ubuntu):
status: Confirmed → Fix Released
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers