Snapd causes corruption on upgrade

Bug #1769669 reported by Viktor Petersson
This bug affects 2 people
Affects: snapd
Status: Fix Released
Importance: Critical
Assigned to: Unassigned
Milestone: (none)

Bug Description

Cross post from https://forum.snapcraft.io/t/snapd-causes-corruption-on-upgrade/5253 as per KyleN's recommendation.

Revision history for this message
Viktor Petersson (vpetersson) wrote :

Renat outlined the steps to reproduce the issue here (https://gist.github.com/renat-galimov/8b940cc5cd552fb46301046a678fec3a). As you can see, lots of these steps could be eliminated on a vanilla image, and we are trying to reproduce this on a vanilla core image as we speak.

The problem appears to stem from the rollback of the Gadget snap.

Kyle Nitzsche (knitzsche) wrote :

@viktor, did your customized gadget (for the boot splash support) possibly introduce the incorrectly cased uboot.env filename?

Viktor Petersson (vpetersson) wrote :

@Kyle - here's a complete diff from the gadget snap https://paste.ubuntu.com/p/CtgTNwbH5n/

Zygmunt Krynicki (zyga)
Changed in snappy:
status: New → Confirmed
importance: Undecided → Critical
affects: snappy → snapd

Oliver Grawert (ogra) wrote :

that the psplash.img file has any influence on this is very unlikely (it just sits there on disk and gets loaded along during boot ... it also never changes)

what i notice though is that a custom u-boot branch is being used instead of the official upstream repo at https://denx.de ... but it seems the "source-branch:" parameter in snapcraft.yaml points to a non-existent branch in that custom repo ... not sure what snapcraft does here during build, but my suspicion would be it falls back to master if it cannot find the defined tree ... if that is true it would mean you get the latest development source of u-boot from that tree (it seems to be synced with denx.de's master tree) and thus also the experimental changes and features from there.

Kyle Nitzsche (knitzsche) wrote :

Viktor, can you build your normal image but with a stock gadget (pi3 gadget) and try to reproduce in order to test whether ogra's idea above is the root cause?

Viktor Petersson (vpetersson) wrote :

Just some more information here. While we have had this corruption on multiple disk images, the reproduction steps outlined are using the following snaps:

core: 4409
pi2-kernel: 51

If you also look at the attached image from one of the many bricked screens, you can see that it was upgrading to 4489. Hence the issue is likely something in the upgrade process between core 4409 and 4489. That said, I have no idea whether it has to do with those particular versions or with the upgrade logic itself.

Michael Vogt (mvo) wrote :

The file that sets "snap_mode=trying" is written by uboot:
"""
snappy_boot=if test "${snap_mode}" = "try"; then setenv snap_mode "trying"; saveenv; if test "${snap_try_core}" != ""; then setenv snap_core "${snap_try_core}"; fi; if test "${snap_try_kernel}" != ""; then setenv snap_kernel "${snap_try_kernel}"; fi; elif test "${snap_mode}" = "trying"; then setenv snap_mode ""; saveenv; fi; run loadfiles; setenv mmcroot "/dev/disk/by-label/writable ${snappy_cmdline} snap_core=${snap_core} snap_kernel=${snap_kernel}"; run mmcargs; bootz ${loadaddr} ${initrd_addr}:${initrd_size} 0x02000000
"""
I.e. on refresh the snap_mode is "try" and the saveenv will write it back to disk (the only change being snap_mode=trying).
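
To make the sequence above easier to follow, here is a minimal Python sketch that mirrors the branches of the `snappy_boot` script quoted above. It is only a model, not snapd or u-boot code: the `snappy_boot` function name, the dict-based environment, and the separate "persisted" and "running" copies are illustrative assumptions. It shows that in the "try" pass only `snap_mode=trying` reaches the disk, and that a second pass that still sees "trying" boots the old core again.

```python
def snappy_boot(env):
    """Model one pass of the quoted snappy_boot script.

    `env` is the environment as stored on disk. Returns a tuple
    (persisted_env, booted_core, booted_kernel).
    """
    persisted = dict(env)  # what uboot.env holds on disk
    running = dict(env)    # in-memory environment during this boot
    mode = running.get("snap_mode", "")

    if mode == "try":
        running["snap_mode"] = "trying"
        persisted = dict(running)  # saveenv: only snap_mode changed so far
        # The switch to the new snaps happens after saveenv, so it is
        # not written back to disk by the bootloader.
        if running.get("snap_try_core", "") != "":
            running["snap_core"] = running["snap_try_core"]
        if running.get("snap_try_kernel", "") != "":
            running["snap_kernel"] = running["snap_try_kernel"]
    elif mode == "trying":
        # The previous try boot was never confirmed: clear the mode and
        # boot whatever snap_core/snap_kernel already point at.
        running["snap_mode"] = ""
        persisted = dict(running)  # saveenv

    return persisted, running["snap_core"], running["snap_kernel"]
```

Feeding the persisted result of a "try" pass straight back into the function models a reboot where nothing confirmed the new core in between; the model then boots the original `snap_core` again, which matches the "reboot and revert" behaviour described in this comment.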

So I think it's definitely worth looking into uboot and seeing if there is any difference between the tree we use and the tree in this gadget. In parallel we are looking at how to make this more robust, but it's very tricky because what is happening is an "impossible" state in our model. I *think* http://paste.ubuntu.com/p/bP9Z7xjDq9/ will prevent the bricking at least; it will still prevent refreshes, i.e. the system will reboot and revert (but at least it should no longer go into boot fail mode). I will look at this again in my morning and see if I can write a proper testcase.

Viktor Petersson (vpetersson) wrote :

@ogra - Interesting find regarding the upstream repo. Our repo is just a mirror of the upstream one to make sure we can control for changes.

In any case, regarding the branch that's an interesting observation, but it seems to be absent in the upstream one too:

$ git clone git://git.denx.de/u-boot.git uboot-upstream
Cloning into 'uboot-upstream'...
remote: Counting objects: 551362, done.
remote: Compressing objects: 100% (93183/93183), done.
remote: Total 551362 (delta 455649), reused 545865 (delta 450199)
Receiving objects: 100% (551362/551362), 112.18 MiB | 7.60 MiB/s, done.
Resolving deltas: 100% (455649/455649), done.
$ cd uboot-upstream
$ git branch -a
* master
  remotes/origin/HEAD -> origin/master
  remotes/origin/master
  remotes/origin/next
  remotes/origin/origin
  remotes/origin/u-boot-2009.11.y
  remotes/origin/u-boot-2013.01.y
  remotes/origin/u-boot-2016.09.y

As you can see, there is no branch named "v2017.05" there either.
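
Note that `git branch -a` lists branches only, so a name that is missing from the output above can still exist as a tag. One way to check both at once, without cloning, is to inspect the output of `git ls-remote`, where branches appear under `refs/heads/` and tags under `refs/tags/`. The Python sketch below classifies a name from such output; the `classify_ref` helper and the sample listing (with all-zero placeholder hashes) are illustrative assumptions, not real data from the repository.

```python
def classify_ref(ls_remote_output, name):
    """Return a sorted list of the kinds ("branch", "tag") that `name`
    has in the given `git ls-remote` output."""
    kinds = set()
    for line in ls_remote_output.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        ref = parts[1]
        if ref == f"refs/heads/{name}":
            kinds.add("branch")
        elif ref in (f"refs/tags/{name}", f"refs/tags/{name}^{{}}"):
            # The "^{}" suffix marks the peeled entry of an annotated tag.
            kinds.add("tag")
    return sorted(kinds)


# Placeholder listing in the `git ls-remote` format: "<hash>\t<ref>".
# The hashes are dummies; only the ref names matter for this sketch.
SAMPLE = "\n".join([
    "0" * 40 + "\trefs/heads/master",
    "0" * 40 + "\trefs/heads/next",
    "0" * 40 + "\trefs/tags/v2017.05",
    "0" * 40 + "\trefs/tags/v2017.05^{}",
])
```

In practice the first argument would be the captured output of `git ls-remote <repository-url>`; an empty result means the name is neither a branch nor a tag on that remote.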

Viktor Petersson (vpetersson) wrote :

Also, just to add a bit more about uboot: master in our fork is currently on 6d7403bf72b5ea46497fe8222d0303cb79563379 (Feb 22nd), whereas the upstream master branch is currently on 890e79f2b1c26c5ba1a86d179706348aec7feef7 (May 7th). If the build does fall back to the master branch when the named branch doesn't exist, that would explain a delta.

Renat (renat2017) wrote :

Ogra, hi!

The repository we use is an exact fork of the official one. We had to move to GitHub because the original repository is not always available, which causes our CI builds to fail.

We use the same commit as the official pi3 gadget uses: https://github.com/Screenly/u-boot/releases/tag/v2017.05 (or used, when we worked with the official pi3 gadget repo).

Kyle Nitzsche (knitzsche) wrote :

@ogra, so the stock pi3-gadget snapcraft.yaml (and screenly's) does specify uboot "source-branch: v2017.05" [1]

And as Viktor points out, there is no branch of that name in upstream uboot, but there *is* a git tag of that name. [2]

Is this an error in the snapcraft.yaml? (There is also a snapcraft "source-tag:" key. [3])

Could this cause the wrong version of uboot to end up in these gadgets (and maybe be the root cause)?

[1] https://github.com/snapcore/pi3-gadget/blob/master/snapcraft.yaml#L15
[2] https://git.denx.de/?p=u-boot.git;a=tags
[3] https://docs.snapcraft.io/reference/plugins/source

Renat (renat2017) wrote :

Hi guys.

@Kyle, I tried to create an image with older versions of "core" and "pi2-kernel" snaps, but didn't succeed.

I noticed that the core and kernel snap names are hardcoded in the uboot.env file, but whenever I modify the file, the SD card becomes unbootable.

Here is the content I am trying to modify:

```
# Just did `cat uboot.env`
donescriptaddr=0x00000000snap_core=core_4489.snapsnap_kernel=pi2-kernel_52.snapsnappy_boot=if
```

Also, I install pi2-kernel version 51 into the seed directory, but in the system-boot partition there is a directory named pi2-kernel_52.snap, and renaming that directory doesn't help at all.
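
One possible explanation for the unbootable card, not confirmed anywhere in this thread: a U-Boot environment file is a binary blob, not a text file. It starts with a CRC32 of the data that follows, and the data is a run of NUL-terminated `key=value` strings ended by an empty string, which is why `cat` shows the entries run together. Editing it in a text editor changes the data without updating the checksum. The Python sketch below parses and rebuilds such a blob under stated assumptions: a plain 4-byte header (no redundant-environment flag byte), a little-endian CRC, and zero padding. The `parse_env` and `build_env` names are hypothetical; in practice, tools such as `fw_setenv` from U-Boot exist for this job.

```python
import struct
import zlib

HEADER_SIZE = 4  # assumption: CRC32 only, no redundant-env flag byte


def parse_env(blob):
    """Parse a U-Boot style environment blob into a dict, checking the CRC."""
    (stored_crc,) = struct.unpack_from("<I", blob, 0)  # assumption: little-endian
    data = blob[HEADER_SIZE:]
    if zlib.crc32(data) & 0xFFFFFFFF != stored_crc:
        raise ValueError("environment CRC mismatch")
    env = {}
    for entry in data.split(b"\0"):
        if not entry:
            break  # an empty string terminates the variable list
        key, _, value = entry.partition(b"=")
        env[key.decode()] = value.decode()
    return env


def build_env(env, size):
    """Serialize a dict into a blob of exactly `size` bytes with a fresh CRC."""
    body = b"".join(f"{k}={v}".encode() + b"\0" for k, v in env.items()) + b"\0"
    room = size - HEADER_SIZE
    if len(body) > room:
        raise ValueError("environment does not fit in the requested size")
    data = body.ljust(room, b"\0")  # assumption: pad with zero bytes
    return struct.pack("<I", zlib.crc32(data) & 0xFFFFFFFF) + data
```

Under these assumptions, a safe edit is `parse_env`, change the dict, then `build_env` with the original file size, so the checksum and length stay consistent. An in-place byte edit of the blob would fail the CRC check in `parse_env`.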

Kyle Nitzsche (knitzsche) wrote :

@Renat.

What happens if you build your standard image but with the stock pi3 gadget: reproducible?

Regarding seeding, you could probably accomplish that same thing using ubuntu-image --extra-snaps pointing to .snap files, rather than mounting and seeding those snaps into the image after creation (which is harder on core I see now).

Renat (renat2017) wrote :

The tricky part is that we need an update to brick the device, @Kyle.

Guys, do you plan any pi2-kernel gadget updates in the near future? I have a default image built against pi2-kernel version 53, so if you plan to release version 53, I can check whether our error is reproducible in the Ubuntu Core Pi3 image.

Renat (renat2017) wrote :

Sorry, it should be

> I have a default image built against pi2-kernel version 52

Renat (renat2017) wrote :

@Kyle,

> What happens if you build your standard image but with the stock pi3 gadget: reproducible?

Even if I build it against stock pi3, I will need to wait until pi2-kernel snap gets outdated.

> Regarding seeding, you could probably accomplish that same thing using ubuntu-image --extra-snaps pointing to .snap files, rather than mounting and seeding those snaps into the image after creation (which is harder on core I see now).

I didn't try that, but as far as I know - it will not attempt to update snaps installed that way.

Kyle Nitzsche (knitzsche) wrote :

Ogra confirms the snapcraft.yaml error (comment #11): it should be source-tag, not source-branch. However, since this issue exists in both the pi3 gadget and screenly's gadget, it is unlikely to be the root cause.

Kyle Nitzsche (knitzsche) wrote :

@Renat: You can use a local .snap file with --extra-snaps for a snap named in the model (for example core, or pi2-kernel, or your gadget), and the snap is refreshable at runtime (that is, it can be updated). I just tried it with an older core snap like this:

ubuntu-image -O img-2 --extra-snaps core_4489.snap model

At runtime I got this:
knitzsche@localhost:~$ snap info core
name: core
summary: snapd runtime environment
publisher: canonical
license: unknown
description: |
  The core runtime environment for snapd
type: core
snap-id: 99T7MUlRhtI3U0QFgl5mXXESAiSwt776
tracking: stable
refreshed: 2018-04-16T10:40:40Z
installed: 16-2.32.5 (4489) 76MB core
channels:
  stable: 16-2.32.6 (4573) 76MB -
  candidate: 16-2.32.6 (4573) 76MB -
  beta: 16-2.32.6 (4573) 76MB -
  edge: 16-2.32.6+git711.8d0e194 (4628) 76MB -

And I ran 'snap refresh core' and it is refreshing to 4573.

Zygmunt Krynicki (zyga) wrote :

I'm marking this as Fix Released based on the discussion in the forum and confirmation from mvo.

Changed in snapd:
status: Confirmed → Fix Released
This report contains Public information  
Everyone can see this information.
