Wrong dtb loaded on rpi

Bug #1940553 reported by Paul Larson
This bug affects 1 person
Affects  Status         Importance  Assigned to  Milestone
snapd    Fix Committed  Critical    Unassigned

Bug Description

Still gathering more information about this, but here's what we've noticed so far. We saw several new test failures on rpi devices that seemed pretty random: wifi tests failing on rpi4, playback device detection failing on armhf, etc. We also noticed that on most armhf devices and some arm64 devices, our test suite didn't run the cpu scaling test because it failed to find /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors

This is on uc18 images, and the current set of snaps where we can reliably reproduce it is:
Snaps currently installed (rpi4, armhf):
Name             Version                Rev    Tracking       Publisher            Notes
bluez            5.48-3                 286    latest/stable  canonical*           -
checkbox-snappy  18.16                  2102   18/stable      ce-certification-qa  devmode
checkbox18       1.22                   1685   latest/stable  ce-certification-qa  -
core             16-2.51.3              11425  latest/stable  canonical*           core
core18           20210816               2140   latest/beta    canonical*           base,ignore-validation
docker           19.03.13               801    latest/stable  canonical*           -
pi               18-1                   100    18-pi/stable   canonical*           gadget
pi-bluetooth     1.0                    4      latest/stable  cwayne18             -
pi-kernel        5.4.0-1042.46~18.04.3  338    18-pi/stable   canonical*           kernel
snapd            2.51.3                 12705  latest/stable  canonical*           snapd

I was still able to reproduce it after rolling back core18 to earlier revisions. I'm unable to roll back the kernel because of the recent gadget change. However, these errors didn't happen when we tested the current pi-kernel in beta.

Juergh did some excellent investigation and found that it's loading the wrong dtbs.

A quick way to check one of the symptoms (the missing sysfs file I mentioned earlier) is:
ubuntu@localhost:~$ cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors
cat: /sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors: No such file or directory
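
For checking several devices, the same symptom test can be wrapped in a small function. This is only a sketch: the function name and the optional sysfs-root argument are made up for illustration and are not part of our test suite.

```shell
# Hypothetical helper: report whether the cpufreq governors file is present.
# The optional first argument overrides the sysfs root (default: /sys).
check_cpufreq_governors() {
    sysfs_root="${1:-/sys}"
    gov_file="$sysfs_root/devices/system/cpu/cpu0/cpufreq/scaling_available_governors"
    if [ -r "$gov_file" ]; then
        # File exists: print the governors the kernel exposes.
        echo "ok: $(cat "$gov_file")"
        return 0
    fi
    # File absent: this is the symptom described in this bug.
    echo "missing: $gov_file"
    return 1
}
```

Calling check_cpufreq_governors with no argument on an affected device should print the "missing:" line and return non-zero.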

Paul Larson (pwlars) wrote :

I was also able to reproduce this on an rpi3 with arm64. Here's the current list of snaps there:
Name       Version                Rev    Tracking       Publisher   Notes
core18     20210722               2127   latest/stable  canonical✓  base
pi         18-1                   98     18-pi/stable   canonical✓  gadget
pi-kernel  5.4.0-1042.46~18.04.3  337    18-pi/stable   canonical✓  kernel
snapd      2.51.3                 12707  latest/stable  canonical✓  snapd

description: updated
Paul Larson (pwlars) wrote :

Some information that was requested:
- Diff of files in /snap/pi-kernel/current/dtbs/ vs /boot/uboot/ - https://paste.ubuntu.com/p/KHBdjZSyMh/
- snap change --last=auto-refresh - https://paste.ubuntu.com/p/bQQQ5JFrJP/
- sudo journalctl -u snapd - https://paste.ubuntu.com/p/zyKwQ7kt24/
- sudo journalctl --no-pager - https://paste.ubuntu.com/p/Kb28wP87wt/
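
A rough way to repeat the dtb comparison from the first item on another device is sketched below. The function name is hypothetical, and it assumes the .dtb files sit at the top level of both directories; it does not look into subdirectories.

```shell
# Hypothetical helper: list .dtb files from the kernel snap that are missing
# from, or differ from, the copies on the boot partition.
# Both paths default to the ones named in the list above.
compare_dtbs() {
    snap_dir="${1:-/snap/pi-kernel/current/dtbs}"
    boot_dir="${2:-/boot/uboot}"
    status=0
    for src in "$snap_dir"/*.dtb; do
        [ -e "$src" ] || continue   # glob matched nothing
        name=$(basename "$src")
        if [ ! -e "$boot_dir/$name" ]; then
            echo "missing: $name"
            status=1
        elif ! cmp -s "$src" "$boot_dir/$name"; then
            echo "differs: $name"
            status=1
        fi
    done
    # Non-zero when at least one dtb is missing or differs.
    return "$status"
}
```

Running compare_dtbs with no arguments uses the two default paths; any "differs:" line means the boot partition holds a dtb that does not match the installed kernel snap.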

Michael Vogt (mvo) wrote :

I think I have an idea about what the bug might be: the old snapd generates the tasks to update assets, but because it is old it only generates them for the gadget update, not for the kernel update. Then snapd gets restarted, and when this happens it needs to inject a new task; I think that is missing. I will confirm with a bit more testing, but that is the most likely theory right now.

Michael Vogt (mvo) wrote :

ok, so actually it looks like the assumes: [kernel-assets] is missing from the meta/snap.yaml for the kernel. See the second paragraph of https://bugs.launchpad.net/snapd/+bug/1907056. I think that is the issue; with that, snapd will only refresh the kernel if snapd is at a version that supports this.

Michael Vogt (mvo) wrote :

Sorry - ignore comment #4 - the "assumes" is there, just using a different syntax (which is fine of course). Let me dig further.

Michael Vogt (mvo) wrote :

I downloaded the pi stable image now and it auto-refreshed. Snapd is 2.45.3, and the auto-refresh log indicates that everything just got refreshed, i.e. the "assumes: [kernel-assets]" was not taken into account for unknown reasons. I will create a manager test for this next.

Michael Vogt (mvo) wrote :

I believe this is understood now:

1. Old snapd (2.45.3.1) creates the refresh tasks for the snapd/pi/pi-kernel update. No kernel-related asset-update task is generated here because the old snapd does not know about this yet.
2. The refresh happens and the first thing that gets restarted is snapd. Now we have a snapd running that knows about "asset-updates".
3. Then the pi-kernel with the "assumes: [kernel-assets]" is mounted and checked. At this point the running snapd supports kernel-assets, so snapd does not hold this refresh back. But that is wrong because the "gadget-update" task for the kernel is still missing.

So the bug is that the "assumes" is not checked early enough. Unfortunately we can't fix the past, so the only way forward I can see is to have code in snapd that ensures that, on a snapd refresh, all changes that contain a kernel refresh are checked and "asset-update" tasks are injected. I would love to have Samuele's input here as well, but that is currently the only fix I can think of.
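
To see whether a given refresh change is affected by the sequence above, one could look for a kernel asset-update task in its task list. The sketch below is only an illustration: the function name is hypothetical, and the summary text it greps for ("Update assets from kernel") is an assumption about how snapd words that task, so adjust it to whatever the change output on the device actually shows.

```shell
# Hypothetical helper: read `snap change <id>` output on stdin and report
# whether it lists an asset-update task for the kernel snap.
# ASSUMPTION: the task summary contains "Update assets from kernel".
has_kernel_asset_task() {
    if grep -q 'Update assets from kernel'; then
        echo "kernel asset-update task present"
        return 0
    fi
    echo "kernel asset-update task missing"
    return 1
}
```

For example: snap change --last=auto-refresh | has_kernel_asset_task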

Changed in snapd:
importance: Undecided → Critical
status: New → In Progress
Michael Vogt (mvo) wrote :

I pushed a draft PR https://github.com/snapcore/snapd/pull/10656 but it will need some input from Samuele before it can land.

Michael Vogt (mvo) wrote :

There is also an alternative way to solve it in https://github.com/snapcore/snapd/pull/10666 but it also needs Samuele's review (and maybe he comes up with an even better way :)

Paweł Stołowski (stolowski) wrote :

The fix is expected in 2.51.7

Changed in snapd:
status: In Progress → Fix Committed