Can't cope with crashing snap in auto-refresh status

Bug #1956451 reported by Hadmut Danisch
Affects: snapd
Status: New
Importance: Undecided
Assigned to: Unassigned
Milestone: (none)

Bug Description

Hi,

snapd can get in a state where it's impossible to stop or remove a snap.

I was following

https://ubuntu.com/tutorials/getting-started-with-microk8s-on-ubuntu-core#1-introduction

trying to run microk8s as a snap on Ubuntu Core 20, running on a Raspberry Pi 4 with 8 GB of RAM.

As recommended, I did

snap install microk8s --channel=latest/edge/strict

which, instead of the v1.22.3 mentioned in the tutorial, brought in v1.23.0, which kept crashing due to several bugs and flaws (as `microk8s inspect` reveals).

Today, after booting the machine, it auto-refreshed itself to the latest core version and, after rebooting, updated the microk8s snap to v1.23.1.

The problem is that the bugs in microk8s still exist: it goes into an endless loop restarting kubelite, occupying between 100% and 250% CPU.

But now, I can't get rid of it in a clean way:

# snap disable microk8s
error: snap "microk8s" has "auto-refresh" change in progress

# snap remove microk8s
error: snap "microk8s" has "auto-refresh" change in progress

# snap stop microk8s
error: snap "microk8s" has "auto-refresh" change in progress

since the buggy snap never finishes its auto-refresh due to its bugs.

I consider this a security vulnerability, because someone could intentionally build a snap that can no longer be stopped or removed (by regular administrative steps).

Revision history for this message
Ian Johnson (anonymouse67) wrote :

Hi, can you provide the output of `snap changes` and also specify how long you waited for snapd before running these commands? Snapd should at some point time out on the change IIRC (I think it's like 10 minutes in this case if a service that is trying to be started is broken), and you can also abort changes manually.
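For reference, inspecting and aborting a stuck change by hand looks roughly like this (the change ID is a placeholder; use whatever `snap changes` reports as still "Doing"):

# snap change <change-id>   (lists the individual tasks of the change and their states)
# snap abort <change-id>    (asks snapd to abort the whole change)
# snap watch <change-id>    (waits until the change settles)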

Changed in snapd:
status: New → Incomplete
Revision history for this message
Hadmut Danisch (hadmut) wrote :

# snap changes
ID Status Spawn Ready Summary
31 Doing yesterday at 10:57 UTC - Auto-refresh snaps "microk8s", "pi-kernel"
32 Done yesterday at 11:03 UTC yesterday at 11:03 UTC Running service command for snap "microk8s"
33 Done yesterday at 11:05 UTC yesterday at 11:06 UTC Running service command for snap "microk8s"
34 Done yesterday at 11:08 UTC yesterday at 11:08 UTC Running service command for snap "microk8s"
35 Done yesterday at 11:08 UTC yesterday at 11:08 UTC Running service command for snap "microk8s"
36 Done yesterday at 11:10 UTC yesterday at 11:11 UTC Running service command for snap "microk8s"
37 Done yesterday at 11:13 UTC yesterday at 11:13 UTC Running service command for snap "microk8s"
38 Done yesterday at 11:15 UTC yesterday at 11:16 UTC Running service command for snap "microk8s"
39 Done yesterday at 11:16 UTC yesterday at 11:16 UTC Running service command for snap "microk8s"
40 Done yesterday at 11:18 UTC yesterday at 11:18 UTC Running service command for snap "microk8s"
41 Done yesterday at 11:20 UTC yesterday at 11:20 UTC Running service command for snap "microk8s"
42 Done yesterday at 11:23 UTC yesterday at 11:23 UTC Running service command for snap "microk8s"
43 Done yesterday at 11:25 UTC yesterday at 11:25 UTC Running service command for snap "microk8s"
44 Done yesterday at 11:25 UTC yesterday at 11:25 UTC Running service command for snap "microk8s"

I don't recall how long I waited; I didn't pay attention to that. But long enough, meaning at least some tens of minutes.

By the way, I had noticed that the kubelite process had a new PID every now and then, so it was not a single process going berserk; it was being restarted repeatedly.

Another hint: I had turned off the system after opening this bug. Today I restarted it to answer your question, and the microk8s snap is still not working, but it is no longer in that particular error condition; there are no running processes.

Now I was able to disable the snap.

However, re-enabling the snap took a very long time for almost every one of these "Run hook..." steps (I didn't measure, but minutes for each single step). top reveals that kubelite goes into that error loop again; its parent PID is 1.

This is for testing only, and I do have root access and can see what's going on. But if this were a black-box IoT device in real use, that would really break it. Although that microk8s snap comes from an edge rather than a stable channel, snapd should be more robust against such problems.

Revision history for this message
Hadmut Danisch (hadmut) wrote :

Finally, after a long time, the `snap enable microk8s` failed. Here is the output, if it helps:

...
Run hook connect-plug-k8s-kubeproxy of snap "microk8s" -
error: cannot perform the following tasks:
- Run hook connect-plug-k8s-kubeproxy of snap "microk8s" (run hook "connect-plug-k8s-kubeproxy":
-----
+ sleep 5
++ date +%s
+ now=1641471192
+ [[ 1641471192 > 1641471217 ]]
+ is_apiserver_ready
+ return 1
+ sleep 5
++ date +%s
+ now=1641471197
+ [[ 1641471197 > 1641471217 ]]
+ is_apiserver_ready
+ return 1
+ sleep 5
++ date +%s
+ now=1641471202
+ [[ 1641471202 > 1641471217 ]]
+ is_apiserver_ready
+ return 1
+ sleep 5
++ date +%s
+ now=1641471207
+ [[ 1641471207 > 1641471217 ]]
+ is_apiserver_ready
+ return 1
+ sleep 5
++ date +%s
+ now=1641471212
+ [[ 1641471212 > 1641471217 ]]
+ is_apiserver_ready
+ return 1
+ sleep 5
++ date +%s
+ now=1641471217
+ [[ 1641471217 > 1641471217 ]]
+ is_apiserver_ready
+ return 1
+ sleep 5
++ date +%s
+ now=1641471222
+ [[ 1641471222 > 1641471217 ]]
+ break
++ date +%s
+ now=1641471222
+ [[ 1641471222 < 1641471217 ]]
+ check_snap_interfaces 1
+ interfaces=("docker-privileged" "docker-support" "dot-kube" "dot-config-helm" "firewall-control" "hardware-observe" "home" "home-read-all" "k8s-journald" "k8s-kubelet" "k8s-kubeproxy" "kernel-module-observe" "kubernetes-support" "log-observe" "login-session-observe" "mount-observe" "network" "network-bind" "network-control" "network-observe" "opengl" "process-control" "system-observe")
+ declare -ra interfaces
+ missing=()
+ declare -a missing
+ for interface in ${interfaces[@]}
+ snapctl is-connected docker-privileged
+ for interface in ${interfaces[@]}
+ snapctl is-connected docker-support
+ for interface in ${interfaces[@]}
+ snapctl is-connected dot-kube
+ for interface in ${interfaces[@]}
+ snapctl is-connected dot-config-helm
+ for interface in ${interfaces[@]}
+ snapctl is-connected firewall-control
+ for interface in ${interfaces[@]}
+ snapctl is-connected hardware-observe
+ for interface in ${interfaces[@]}
+ snapctl is-connected home
+ for interface in ${interfaces[@]}
+ snapctl is-connected home-read-all
+ for interface in ${interfaces[@]}
+ snapctl is-connected k8s-journald
+ for interface in ${interfaces[@]}
+ snapctl is-connected k8s-kubelet
+ for interface in ${interfaces[@]}
+ snapctl is-connected k8s-kubeproxy
+ for interface in ${interfaces[@]}
+ snapctl is-connected kernel-module-observe
+ for interface in ${interfaces[@]}
+ snapctl is-connected kubernetes-support
+ for interface in ${interfaces[@]}
+ snapctl is-connected log-observe
+ for interface in ${interfaces[@]}
+ snapctl is-connected login-session-observe
+ for interface in ${interfaces[@]}
+ snapctl is-connected mount-observe
+ for interface in ${interfaces[@]}
+ snapctl is-connected network
+ for interface in ${interfaces[@]}
+ snapctl is-connected network-bind
+ for interface in ${interfaces[@]}
+ snapctl is-connected network-control
+ for interface in ${interfaces[@]}
+ snapctl is-connected network-observe
+ for interface in ${interfaces[@]}
+ snapctl is-connected opengl
+ for...

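For what it's worth, the trace suggests the connect-plug hook is a simple deadline-based retry loop. A rough reconstruction (not the actual microk8s hook source; the trace is truncated, so the real timeout and the deadline setup are assumptions):

deadline=$(( $(date +%s) + 25 ))
while true; do
    sleep 5
    now=$(date +%s)
    # note: '>' inside [[ ]] compares strings, not numbers; it only behaves
    # numerically here because the epoch timestamps have the same length
    if [[ $now > $deadline ]]; then
        break
    fi
    if is_apiserver_ready; then
        break
    fi
done
now=$(date +%s)
if [[ $now < $deadline ]]; then
    echo "apiserver became ready in time"   # hypothetical success path
else
    check_snap_interfaces 1                 # deadline hit: report missing interfaces
fi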

Revision history for this message
Ian Johnson (anonymouse67) wrote :

Thanks for that output. Personally I don't think this needs to be a private security bug, and arguably it is not a security bug at all because snapd eventually recovers. I think the fact that it is taking a long time is exacerbated by:

1. the Raspberry Pi 4 (8 GB), while the fastest Pi out there, is still a Pi and thus inherently a bit slow to complete operations
2. MicroK8s is incredibly resource intensive, running many services, containers, pods, and so on

So the fact that in error conditions things take a long time to recover, i.e. more than an hour (but less than, say, 24 hours), is not a security DoS vulnerability to me. We can and should definitely try to do better here, but I don't think the fact that a snap service is misbehaving is a security vulnerability.

Note that we also don't prevent someone from shipping a snap that uses no interfaces but instead just uses 100% CPU doing something like `dd if=/dev/zero of=/dev/null` on all CPUs, so it's not a unique problem.
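Purely as an illustration of that point (this is not taken from any real snap), a service script along these lines would peg every core without requesting a single interface:

# spawn one busy dd per CPU core, copying /dev/zero to /dev/null forever
for i in $(seq "$(nproc)"); do
    dd if=/dev/zero of=/dev/null &
done
wait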

I will leave it up to the security team to make this bug non-private if they agree with my reasoning here.

Changed in snapd:
status: Incomplete → New
Revision history for this message
Hadmut Danisch (hadmut) wrote :

Actually, the Raspberry Pi isn't slow; it is a capable 64-bit computer, good enough for desktop and video applications.

But, after all, wasn't Ubuntu Core made specifically as an IoT OS for the smallest computers, and supposed to work well on those tiny, power-saving machines?

Revision history for this message
Alex Murray (alexmurray) wrote :

From a security perspective, this could be seen as a potential DoS issue: I can imagine a snap that combines Ian's suggestion of a simple CPU DoS with this bug to essentially make a device unusable once the snap is installed (and hence it can't easily be uninstalled at all, especially once the snap is trying to use all available CPU cycles, denying snapd the chance to actually uninstall it).

However, I don't see this as an issue which needs to be kept private either - so I am marking this non-private, but still 'security' for now.

information type: Private Security → Public Security
Revision history for this message
Paweł Stołowski (stolowski) wrote :

Any chance you still have the system log (journalctl) for that period?
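Something along these lines would capture it, assuming the journal for that boot still exists (the time range is a placeholder for the period of the stuck refresh):

# journalctl -u snapd.service --since "<start>" --until "<end>" > snapd.log
# snap logs microk8s -n 10000 > microk8s-services.log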

Revision history for this message
Hadmut Danisch (hadmut) wrote :

Unfortunately not. It's an Ubuntu Core system, which does little logging, and journalctl shows only the logs of the current boot.
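For completeness: on a stock systemd setup the journal can be made persistent so it survives reboots, e.g. by creating the directory it persists to and restarting journald (whether Ubuntu Core's read-only layout allows exactly this is an assumption I haven't verified):

# mkdir -p /var/log/journal
# systemctl restart systemd-journald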
