ubuntu core installation goes down regularly

Bug #1843417 reported by Roger Peppe
This bug affects 1 person
    Affects  Status     Importance  Assigned to  Milestone
    snapd    Confirmed  Medium      Unassigned

Bug Description

I have an Ubuntu Core installation running on a Raspberry Pi 3 (model B). It goes down regularly, at least once every few hours, and requires a manual reboot each time. When rebooting (by power cycling), the restart usually fails the first time, requiring another power cycle before it starts up successfully.

On inspecting the system journal, I see messages like this:

    Sep 09 07:40:40 localhost snapd[715]: handlers.go:459: Reported install problem for "core18" as fc12d9b2-d2e7-11e9-b568-fa163e102db1 OOPSID

`snap version` reports:

    snap 2.40
    snapd 2.40
    series 16
    kernel 4.15.0-1041-raspi2

`snap changes` reports:

    ID  Status  Spawn                   Ready                   Summary
    11  Error   yesterday at 19:25 UTC  yesterday at 23:47 UTC  Auto-refresh snaps "core", "pi-kernel", "core18"
    12  Done    yesterday at 23:47 UTC  yesterday at 23:48 UTC  Update kernel and core snap revisions
    13  Error   today at 08:18 UTC      today at 08:20 UTC      Auto-refresh snap "core18"

`snap tasks 11` reports http://paste.ubuntu.com/p/CnjtBWwrtr/
`snap tasks 13` reports http://paste.ubuntu.com/p/SvvXP7rP5R/
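
For anyone who wants to repeat this on another device, here is a minimal sketch that picks the failed change IDs out of `snap changes` output so they can be passed to `snap tasks`. It assumes the column layout shown in this report (ID first, Status second); the function name `failed_changes` is made up for illustration and is not part of snapd.

```python
def failed_changes(snap_changes_output):
    """Return the IDs of changes whose Status column reads 'Error'."""
    ids = []
    for line in snap_changes_output.splitlines():
        fields = line.split()
        # Skip blank lines and the header row ("ID Status Spawn ...").
        if len(fields) < 2 or not fields[0].isdigit():
            continue
        if fields[1] == "Error":
            ids.append(fields[0])
    return ids


# Sample input copied from the `snap changes` output in this report.
sample = """\
ID  Status  Spawn                   Ready                   Summary
11  Error   yesterday at 19:25 UTC  yesterday at 23:47 UTC  Auto-refresh snaps "core", "pi-kernel", "core18"
12  Done    yesterday at 23:47 UTC  yesterday at 23:48 UTC  Update kernel and core snap revisions
13  Error   today at 08:18 UTC      today at 08:20 UTC      Auto-refresh snap "core18"
"""

print(failed_changes(sample))  # prints ['11', '13']
```

Only the first two columns are parsed, because the Spawn and Ready columns contain spaces and cannot be split reliably on whitespace.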

The contents of `/var/lib/snapd/state.json` (secrets redacted) are at http://paste.ubuntu.com/p/6sMn6tzBmH/

Zygmunt Krynicki (zyga) wrote :

I'm just triaging this bug to leave a trace.

I'm periodically working with Roger to debug this issue. It seems to be a combination of several different problems that unfortunately occur together.

We already identified and fixed a single issue with certainty: core18 did not ship fsck for vfat so a corruption of the boot partition, caused by anything at all, would not be automatically fixed.

We are working on the following additional threads now:

1) Understanding what is causing the reboot on that particular machine and if that is snapd or another service on the system.

2) Understanding why the machine does not come up cleanly after reboot. That is, the reboot seems to fail and requires a power cycle or two to recover from.

3) Understanding the situation of the hardware watchdog during the reboot process and in the bootloader. It is possible that we could at least fix the need to manually power cycle a device that has failed to boot correctly.

We've received a copy of the first partition of the SD card that this machine is using, and we will perform additional analysis of how U-Boot and Linux agree or disagree about the contents of the essential boot variables.

Changed in snapd:
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Zygmunt Krynicki (zyga)
status: Confirmed → In Progress
Zygmunt Krynicki (zyga) wrote :

Some more comments that I forgot to include:

- The fsck issue is discussed further at https://forum.snapcraft.io/t/core18-shortcoming-missing-fsck-vfat-for-boot-fat-partition/13276

- The machine does not have an RTC and may also surface a bug in our RTC handling. Specifically, on the Raspberry Pi it relies on an accurate timestamp of /var/lib/snapd/snaps, and it may be that nothing is touching that file and syncing the filesystem reliably on shutdown.
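
One rough way to look for this on an affected device is to compare the system clock with the modification time of that path. The sketch below is only an illustration of that comparison, not how snapd or the boot stack handles time; the helper name `clock_lag_seconds` is hypothetical.

```python
import os
import time


def clock_lag_seconds(path, now=None):
    """Return how many seconds the clock is behind the mtime of `path`.

    Returns 0.0 when the clock is at or ahead of the file's mtime.
    `now` defaults to the current system time and exists mainly for testing.
    """
    mtime = os.stat(path).st_mtime
    if now is None:
        now = time.time()
    return max(0.0, mtime - now)


if __name__ == "__main__":
    # Path taken from the comment above; it only exists on systems with snapd.
    snaps_dir = "/var/lib/snapd/snaps"
    if os.path.exists(snaps_dir):
        lag = clock_lag_seconds(snaps_dir)
        print(f"clock is {lag:.0f}s behind the mtime of {snaps_dir}")
```

A non-zero result right after boot would mean the clock started earlier than the last recorded modification of the path. A stale mtime, on the other hand, would suggest the path was not updated and synced before the previous shutdown.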

Zygmunt Krynicki (zyga) wrote :

We've received an image of the SD card of the affected Raspberry Pi and could not reproduce this issue after a week of "aging", during which the device was left idle, connected to the network, and operating normally at all times.

I cannot explain the failure that was originally seen by Roger. I suspect it is something environmental, specific to the hardware used.

We can perform further debugging to try to pinpoint the cause, but I don't believe this is a general issue in the software. Instead, it looks like an interaction between the software and this particular hardware that we were unable to reproduce elsewhere.

I've un-assigned myself from the issue to reflect the current status.

Changed in snapd:
assignee: Zygmunt Krynicki (zyga) → nobody
status: In Progress → Confirmed
importance: High → Medium