console-conf when run with auto-refresh of core20 will crash and become non-responsive

Bug #1880156 reported by Ian Johnson
This bug affects 2 people
Affects    Status        Importance  Assigned to  Milestone
snapd      Fix Released  High        Ian Johnson
subiquity  New           Undecided   Unassigned

Bug Description

When booting a VM image on amd64, I tried to configure the device with console-conf, but it seems that an auto-refresh of the core20 snap was in progress and subsequently snapd attempted to reboot the device immediately, causing what I think are the following bugs:

1. console-conf hung after I entered my email and hit "Done". It showed no progress indication and stayed hung until the VM rebooted itself a while later, maybe a minute or slightly less.
2. Upon rebooting I could no longer use console-conf at all. The message "Press enter to configure" was displayed, but when I hit enter some other message appeared on the screen briefly and then all text went away, leaving me with an empty screen and no way to run console-conf. I waited for a minute or two before giving up and rebooting the VM, as console-conf normally runs much quicker in this VM, but perhaps it would

Upon a manual reboot after 2 I was then able to use console-conf and login successfully.

To be clear, during 2 I still could not login to the device over SSH with my configured credentials.

I think the sequencing is basically this:

1. snapd starts to auto-refresh core20
2. I hit enter and console-conf starts to run
3. snapd finishes the auto-refresh
4. I finish entering in my email
5. console-conf is stuck trying to run `snap create-user` which for some reason just hangs
6. eventually snapd's scheduled reboot for the refresh of the core20 snap hard-reboots the system

Probably this is a snapd bug in that when snapd has requested a scheduled reboot for the base snap, etc. it should fail quickly with some message that it can't respond now.

What I think console-conf could do better is to show some sort of spinning animation while waiting for `snap create-user` to return, because there could be other situations where `snap create-user` takes longer than expected, e.g. high system load from background service snaps, internet connectivity issues talking to the store, or even some kind of store issue.
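As a rough illustration of the spinner idea, here is a minimal Python sketch that runs a command without blocking the UI loop and animates a spinner until it returns or times out. It assumes an asyncio-based event loop; `run_with_spinner`, its parameters, and the 60-second default timeout are hypothetical names and values, not taken from the console-conf code.

```python
import asyncio
import itertools
import sys


async def run_with_spinner(argv, timeout=60.0, out=sys.stderr):
    """Run argv as a subprocess while drawing a spinner on `out`.

    Returns (returncode, combined_output_bytes). Raises asyncio.TimeoutError
    if the command has not finished after `timeout` seconds; the child is
    killed in that case. Hypothetical helper, not console-conf's own code.
    """
    proc = await asyncio.create_subprocess_exec(
        *argv,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.STDOUT,
    )

    async def spin():
        # Redraw one spinner frame every 100 ms until cancelled.
        for frame in itertools.cycle("|/-\\"):
            out.write("\r" + frame + " waiting for " + argv[0])
            out.flush()
            await asyncio.sleep(0.1)

    spinner = asyncio.ensure_future(spin())
    try:
        output, _ = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        # Do not leave a hung child behind; let the caller report the timeout.
        proc.kill()
        await proc.wait()
        raise
    finally:
        spinner.cancel()
        out.write("\r")
    return proc.returncode, output
```

A caller could pass the `snap create-user` command line as `argv` and show an explanatory message when the timeout fires, rather than leaving the screen frozen until the reboot.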

Tags: core20 uc20
Ian Johnson (anonymouse67) wrote :

I can reproduce issue 1 with the same image again, but I have not been able to reproduce issue 2.

Ian Johnson (anonymouse67) wrote :

Another thing that maybe console-conf could do would be something like what the subiquity snap does and have the UI for console-conf ask if it should check for updates before continuing, but in this case it would check for updates to the core20 snap and if available it should make clear to the user it needs to reboot for the new console-conf to run.

Changed in snapd:
importance: Undecided → Medium
Dimitri John Ledkov (xnox) wrote :

I'm confused how a refresh of core20 can do anything, given that core20 is the rootfs and we don't change the rootfs unless we reboot.

Can snapd, refresh snapd.snap without reboot? Cause that may cause a hang too.

Do you have logs from that system? Ie. all the snapd changes?

On the subiquity live server we disable snap refreshes by default. Can we inhibit / pause refreshes and reboots, if somebody launched consoleconf? (with a timeout, i.e. kill console-conf if it is not completed in 10min)

Dimitri John Ledkov (xnox) wrote :

Which images / channels were you using?

Changed in subiquity:
status: New → Incomplete
Changed in snapd:
status: New → Incomplete
Michael Vogt (mvo)
Changed in snapd:
status: Incomplete → Triaged
importance: Medium → High
Ian Johnson (anonymouse67) wrote :

@xnox I was using the beta channels, probably with a custom built image.

> I'm confused how a refresh of core20 can do anything.

A refresh of core20 will cause snapd to stop responding on its REST API until it has rebooted, causing console-conf to hang trying to run `snap create-user`.

> Can snapd, refresh snapd.snap without reboot?

Yes the snapd snap can be refreshed without a reboot.

> Do you have logs from that system? Ie. all the snapd changes?

Not anymore but this is pretty easily reproduced if you just build an image with the beta core20 snap as a revision older than what's currently on the beta channel.

> Can we inhibit / pause refreshes and reboots, if somebody launched consoleconf?

Sort of, see https://snapcraft.io/docs/system-options#heading--refresh specifically refresh.hold key. Note that setting this key is slightly buggy though as per https://bugs.launchpad.net/snapd/+bug/1856063

Changed in subiquity:
status: Incomplete → New
Changed in snapd:
status: Triaged → New
Dimitri John Ledkov (xnox) wrote :

So snapd can be refusing API connections.

If it does, console-conf should say that one cannot take ownership right now because snapd is refreshing and/or rebooting, and that the user should try again later. But we should not crash.

Potentially there should be two-way locking between console-conf and snapd, such that neither reboots nor configures the system if the other one is in progress.

Michael Hudson-Doyle (mwhudson) wrote :

Well for one thing, the way the UI is completely blocked while console-conf runs snap create-user is terrible and we should fix that. Probably console-conf should just talk to the snapd api directly rather than via the command line tool (console-conf was first written back in the mists of time before I wrote code support for talking to the snapd api).

Once that is fixed, hm. In general, a device should be updated asap but of course an update may not be possible until the network has been configured. So the flow for console-conf should probably be something like:

1. configure networking
2. check for snap updates
3. apply snap updates if any, reboot if core20 was updated
4. now claim ownership if applicable

As Ian sort of comments, the subiquity codebase has all the code required to do this in support of the self-refresh feature in the server installer.

Thoughts?

Without doing the above (which is a bit of effort), is there a way for console-conf to tell that the reason snapd isn't answering its connections is because a reboot is pending? In that case, console-conf could at least explain what is going on a bit better.

Changed in subiquity:
status: New → Incomplete
Changed in snapd:
status: New → Incomplete
Ian Johnson (anonymouse67) wrote :

I guess personally as someone who has been bitten basically 10 million and a half times by the issue where a freshly configured Ubuntu Core device immediately reboots itself (many times twice!) for snap updates, my humble opinion is that we should offer the choice of at least completing console-conf without a reboot in the middle, and that we should also indicate somewhere in the UI that an update is going to happen if the user declines the choice of completing console-conf without a reboot and accepts their fate to wait a bit longer.

In my mind that vision looks like this kind of flow:

1. device turns on, console-conf starts

2. console-conf checks if there are currently running refreshes (in the case that snapd beat console-conf to the punch and started refreshing using e.g. DHCP and ethernet)

3. if there are currently running refreshes, console-conf displays some message like "wait for reboot" etc. and the flow restarts at 1 after the reboot

4. if there are not any currently running refreshes (either because we don't have network yet or there are some available but snapd hasn't started them yet because devices are slow and humans are fast at hitting enter), console-conf should automatically delay refreshes for an hour (or some reasonably small amount of time, maybe 10 minutes is okay)

5. now console-conf delayed refreshes and there is a human typing things into a screen or a serial port, so console-conf lets people push buttons and network things get setup and users get created without any reboots interrupting the process

6. now that console-conf has setup networking and users, if there are refreshes available which were delayed, console-conf will offer the option of waiting for those to finish politely with a UI spinner page or just "exiting" console-conf early, with the latter option just undelaying snap refreshes and giving the "This device is registered to ..." page immediately

This obviously is a big change from the current flow and would probably require a fair amount of work in console-conf, but the primitives should exist for this to happen:

* a GET to /v2/changes will give you all currently happening refreshes
* a PUT to /v2/snaps/system/conf with refresh.hold=$(date --date="+10 minutes" +%Y-%m-%dT%H:%M:%S%:z) will delay refreshes for 10 minutes while console-conf runs.
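For illustration, the two primitives above could be driven from Python roughly as follows. This is only a sketch under assumptions: `hold_until`, `UnixHTTPConnection` and `snapd_request` are hypothetical helper names, the socket path `/run/snapd.socket` is the usual snapd socket location, and the request body is assumed to be a JSON object mapping the option name to its value.

```python
import http.client
import json
import socket
from datetime import datetime, timedelta


def hold_until(minutes=10, now=None):
    """Timestamp `minutes` from now, in the same format as
    date --date="+10 minutes" +%Y-%m-%dT%H:%M:%S%:z
    """
    now = now or datetime.now().astimezone()  # local time with UTC offset
    return (now + timedelta(minutes=minutes)).isoformat(timespec="seconds")


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP over a unix domain socket (hypothetical helper)."""

    def __init__(self, path):
        super().__init__("localhost")
        self.unix_path = path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self.unix_path)


def snapd_request(method, url, body=None, socket_path="/run/snapd.socket"):
    """Send one request to snapd and return the decoded JSON response."""
    conn = UnixHTTPConnection(socket_path)
    try:
        payload = None if body is None else json.dumps(body)
        conn.request(method, url, body=payload,
                     headers={"Content-Type": "application/json"})
        return json.loads(conn.getresponse().read())
    finally:
        conn.close()
```

With these helpers, `snapd_request("GET", "/v2/changes")` would list changes and `snapd_request("PUT", "/v2/snaps/system/conf", {"refresh.hold": hold_until(10)})` would request the 10 minute hold; the latter presumably needs to run as root.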

One other important thing is that we need to be sure that we do not delay refreshes for too long, because we still want auto-updates to continue functioning after some reasonable amount of waiting.

That's my big vision for how this would work in the bestest of worlds.

Maybe there are small things we can do in the meantime such as just making console-conf not let a user do anything useful in the UI if there is an in-progress refresh to prevent interrupted actions.

To respond to your specific question Michael, no there is not a great way to tell that snapd is not responding to messages due to it being down waiting for a reboot, because you can't get snap changes via the API or snap commands because snapd is down, but we do have a snap debug command which could maybe be used for this purpose, `snap debug state /var/lib/snapd/state.json`, but that'...


Changed in snapd:
status: Incomplete → New
Changed in subiquity:
status: Incomplete → New
Zygmunt Krynicki (zyga) wrote :

Marking as confirmed since Ian has already started working on this.

Changed in snapd:
status: New → Confirmed
Zygmunt Krynicki (zyga)
Changed in snapd:
assignee: nobody → Ian Johnson (anonymouse67)
status: Confirmed → In Progress
Ian Johnson (anonymouse67) wrote :

So we have some PRs up for addressing this story, namely:

* https://github.com/snapcore/snapd/pull/9418 (which has been merged)
* https://github.com/snapcore/snapd/pull/9489 (which should be merged soon)
* https://github.com/CanonicalLtd/subiquity/pull/858 (which after merging will need to be pulled into the core20 snap somehow, I assume xnox will take care of that)

This gets us a bit closer, namely that this enables subiquity/console-conf to tell snapd to delay refreshes while console-conf runs by running `snap routine console-conf-start` (which it will after the subiquity PR is pulled in and backported to the core20 snap as needed). Additionally, the snapd side of things has been implemented in such a way that eventually console-conf can stop using the `snap routine` command and instead directly call the REST API, and do different things like display a prettier UI, etc. based on the result of the API call, but that will be left to the console-conf folks to work out.

For reference, the REST API is as follows:

A POST to /v2/internal/console-conf-start with no body will "invoke" the routine, and will ensure that auto-refreshes are delayed for 20 minutes from the present time, and will also return a JSON structure like the following:

{
  "type": "sync",
  "status-code": 200,
  "result": {
    "active-auto-refreshes": ["1"],
    "active-auto-refresh-snaps": ["pc-kernel"]
  }
}

where "active-auto-refreshes" is the list of change IDs for active auto-refreshes, and "active-auto-refresh-snaps" is the list of snaps being refreshed. The endpoint does not wait for pending auto-refresh changes to complete; the waiting described above is done on the `snap` client side. So console-conf can poke this endpoint periodically while displaying some sort of progress bar until there are no more change IDs in "active-auto-refreshes", and at that point can proceed with a guarantee that the device will not automatically reboot for an auto-refresh or have snapd restart itself until 20 minutes into the future.
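The polling described here could look something like the following Python sketch. The function name `wait_for_auto_refreshes` and its parameters are hypothetical; `invoke` is assumed to be any callable that performs the POST to /v2/internal/console-conf-start and returns the decoded JSON response, and a missing or empty "active-auto-refreshes" list is assumed to mean that nothing is refreshing.

```python
import time


def wait_for_auto_refreshes(invoke, interval=2.0, max_polls=300,
                            sleep=time.sleep, on_progress=None):
    """Poll until the endpoint reports no active auto-refreshes.

    invoke: callable returning the decoded JSON response of the endpoint.
    on_progress: optional callback(change_ids, snap_names) for UI updates.
    Returns True once no auto-refresh is active, False after max_polls polls.
    """
    for _ in range(max_polls):
        result = invoke().get("result") or {}
        change_ids = result.get("active-auto-refreshes") or []
        if not change_ids:
            # Nothing in flight; each call also re-arms the refresh delay.
            return True
        if on_progress is not None:
            on_progress(change_ids,
                        result.get("active-auto-refresh-snaps") or [])
        sleep(interval)
    return False
```

Passing `invoke` and `sleep` in as parameters keeps the loop independent of how the request is actually made, so the same logic works whether the caller goes through the REST API or wraps the `snap routine` command.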

Note that the snapd side doesn't handle time changes, so if the clock moves forward by more than 20 minutes after this endpoint is invoked, the device could still auto-refresh at that point. We will work on making snapd more robust against that, and I will file another bug about it. In the meantime, console-conf could perhaps just keep calling this endpoint after the network connection has been established (or more precisely after time synchronization is done) for good measure, to ensure that the auto-refresh is delayed until console-conf is all done.

Finally, we still should work on the following things to have a really good user experience here (and I will file bugs about individual things here so that we can close out this bug after console-conf uses the routine command):

* The user should be given the option in the console-conf UI of letting
  refreshes happen before logging in and taking ownership or choosing specific
  snaps that should be updated before logging in.
* There is no UI displayed to the user about ongoing refreshes, the only messages
  that `snap routine console-conf-start` will show are a single message...


Changed in snapd:
milestone: none → 2.48
Ian Johnson (anonymouse67) wrote :

Just for book-keeping here are other relevant bugs for the story:

snapd bugs:
* time changes breaking auto-refresh delays: https://bugs.launchpad.net/snapd/+bug/1901766

subiquity bugs:
* i18n and l10n in console-conf: https://bugs.launchpad.net/subiquity/+bug/1621378
* display info about broken seeds: https://bugs.launchpad.net/subiquity/+bug/1901767
* calling snap managed: https://bugs.launchpad.net/subiquity/+bug/1901769
* selecting which snaps to refresh in the UI: https://bugs.launchpad.net/subiquity/+bug/1901773
* nice progress bar UI for ongoing snap auto-refreshes: https://bugs.launchpad.net/subiquity/+bug/1901774

Ian Johnson (anonymouse67) wrote :

I think this task can be closed for subiquity on UC20/main, since subiquity will no longer crash if a refresh happens mid-way through: it will be blocked from starting its UI if there is an ongoing refresh.

Not sure if backporting makes sense, the fix in snapd will apply equally to uc16 and uc18, so I think it would make sense, and the changes in subiquity to enable this should be fairly simple to backport.

The other bug reports mentioned remain however.

Changed in snapd:
status: In Progress → Fix Released