snapfuse uses a lot of CPU inside containers when "core" and "snapd" are installed

Bug #1817276 reported by Björn Tillenius
This bug affects 4 people

Affects: snapd
Status: Triaged
Importance: Medium
Assigned to: Michael Vogt

Bug Description

When using the MAAS snap inside an LXD container, I noticed that it was a lot
slower than usual. I also noticed that snapfuse was often using 100% CPU.

You can reproduce this by installing the MAAS snap:

  snap install maas --edge --devmode

Then run 'maas init', accepting the default options. During the
"Performing database migrations" step, you'll see that snapfuse
is doing a lot of work, slowing things down.

As a data point, I measured how long it took to run 'maas init'
on my laptop:

  'snap install' inside a container: 60 seconds
  'snap install' outside a container: 45 seconds
  'snap try' inside a container: 45 seconds

Both the container and host are up-to-date bionic systems.

If it makes a difference, I use ZFS as the backing file system, both
on the system itself and for LXD.
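
A quick way to see the overhead is to watch snapfuse while 'maas init' runs in another terminal (a rough sketch; process names and exact output will vary):

  # show snapfuse processes and their CPU usage, refreshing every 2 seconds
  top -b -d 2 | grep snapfuse

  # or time the whole init step
  time maas init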

Tags: maas
Alberto Donato (ack) wrote :

I see similar behavior on a machine using btrfs for both the host and the containers.

Zygmunt Krynicki (zyga) wrote :

I believe this was debugged and fixed yesterday. Assigning to mvo, who did the work, for tracking.

The relevant pull request is https://github.com/snapcore/snapd/pull/7477

Changed in snapd:
assignee: nobody → Michael Vogt (mvo)
status: New → In Progress
milestone: none → 2.42
Zygmunt Krynicki (zyga) wrote :

The referenced pull request was merged into snapd. I'm marking this as fix committed.

Changed in snapd:
status: In Progress → Fix Committed
Changed in snapd:
status: Fix Committed → Fix Released
Alberto Donato (ack) wrote :

I'm still seeing this issue in containers when "core" is also installed.

In a fresh, up-to-date bionic container with snapd 2.42.1+18.04:

sudo snap install --channel 2.7/edge maas
sudo maas init --maas-url http://localhost:5240/MAAS --admin-username $USER --admin-password $USER --admin-email $USER@$USER.net --admin-ssh-import lp:$USER --mode all

maas works fine after this. After manually installing the "core" snap and rebooting, squashfuse uses 100% CPU and maas processes keep respawning (also using a lot of CPU).
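
Roughly, the steps that trigger it (a sketch of what I did, not an exact transcript):

  # inside the container, after the maas snap is already working
  sudo snap install core
  sudo reboot
  # after the reboot, check which processes are using CPU
  top -b -n 1 | grep -E 'squashfuse|snapfuse'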

Michael Vogt (mvo)
Changed in snapd:
status: Fix Released → Triaged
summary: - snapfuse uses a lot of CPU inside containers
+ snapfuse uses a lot of CPU inside containers when "core" is installed
summary: - snapfuse uses a lot of CPU inside containers when "core" is installed
+ snapfuse uses a lot of CPU inside containers when "core" and "snapd" are
+ installed
Michael Vogt (mvo) wrote :

I was trying to reproduce this with the following setup:
- 19.10 (eoan) VM with a clean image and an ext4 filesystem
- installed the lxd/3.20 snap
- created a bionic container
- installed snapd and the maas snap in the container
- then manually installed core
- rebooted both host and container
- ran "maas init" and watched "top" output - I see 1-2 snapfuse processes taking up to 60% CPU

I could not reproduce the 100% CPU scenario yet; I will try with ZFS or btrfs next.
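
For reference, a rough transcript of the container setup (image alias and channels are approximate, not the exact commands I ran):

  lxc launch ubuntu:18.04 bionic-maas
  lxc exec bionic-maas -- apt update
  lxc exec bionic-maas -- apt install -y snapd
  lxc exec bionic-maas -- snap install maas --edge --devmode
  lxc exec bionic-maas -- snap install core
  lxc restart bionic-maas
  lxc exec bionic-maas -- maas init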

Michael Vogt (mvo) wrote :

I just re-did my test with btrfs as the LXD storage pool and see similar behavior as before: snapfuse CPU hovers around 20-30% with some spikes to ~65%. Maybe I'm missing something to reproduce?
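
The btrfs pool was set up along these lines (a sketch; pool name and size are illustrative):

  lxc storage create btrfs-pool btrfs size=20GB
  lxc launch ubuntu:18.04 bionic-maas -s btrfs-pool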

Michael Vogt (mvo) wrote :

I was asked for the supervisor log: http://paste.ubuntu.com/p/MkgJwgfhnf/ - my test system got up to 1600 processes and was close to OOM. I wonder if this is what is causing the high snapfuse CPU load, but of course it could just be a coincidence.
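
Quick checks for the process count and memory situation mentioned above (generic commands, nothing snapd-specific):

  # total number of processes on the system
  ps -e --no-headers | wc -l
  # free/used memory in MiB
  free -m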

Paul Dhaliwal (subpaul) wrote :

I am having a similar issue. Brand new MAAS 2.7 running as a snap on Ubuntu 18.04 LTS.
Here is brief supervisor log:
https://paste.ubuntu.com/p/tqgfT6P362/

This is a small Atom-based system, but all it is running is the MAAS snap. MAAS without the snap ran without issue.

Here is my paste for the following command:
ps ax | grep supervisor
https://paste.ubuntu.com/p/bctbxqJHC5/

Michael Vogt (mvo) wrote :

@Paul thanks! This looks very similar to what I see. The number of processes seems to keep growing.
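
A simple way to watch the growth over time (the grep pattern is a guess at the relevant process name; adjust as needed):

  watch -n 10 "ps ax | grep -c '[s]upervisor'"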

Changed in snapd:
importance: Undecided → Medium
Boris Lukashev (rageltman) wrote :

I've been seeing this happen every single time on OpenStack controllers provisioned via MAAS for a month or so - the controllers are KVM VMs atop Xeon v2s, in host CPU passthrough mode (16 cores each). It happens after ~1h - https://pasteboard.co/K3kcOVH.png - and completely kills the VM after a while. I'm going to try unrolling the VMs into LXCs to run it on bare metal... but the overall design doesn't scale.
The reason filesystems run in kernel space on Linux is to avoid expensive context switches to and from userspace - FUSE is not performant, and with multiple LXD units running snaps on a single host, the snap mounts eat more CPU than the actual workload being deployed. Juju's isolation/scale paradigm seems rather incompatible with the snap storage semantics.

Boris Lukashev (rageltman) wrote :

There appear to be two factors involved in these degenerating conditions:
1. Memory pressure
2. Context-switch overhead

As far as I can tell, having 4GB of memory available to the page cache on a system with 24 LXD units on 16 Xeon cores effectively reproduces the problem. It looks like squashfuse tries to grab significant chunks of memory, pushing evictions from the LRU, to map the decompressed data into memory, then mount it as a filesystem, putting the data into the page cache again after copying from userspace, resulting in pressure stalls.
With the controller stack and kernel using ~12G, I gave the VMs 32G and they have been stable since. I will gradually reduce that to see where the breaking point is, but it might be useful to document the runtime memory requirements for units leveraging snaps in order to avoid resource waste or instability (so that users could provision the correct amount of memory and cores to handle a snap-storm, or whatever you want to call this effect).
I tried reproducing this effect using kernel-mode squashfs and was able to mount and iterate over 32 4G squashfs filesystems on a 16G system without causing stalls and panics - so it seems FUSE-related, possibly squashfuse...
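
For comparison, the kernel-vs-FUSE test can be approximated like this (sizes and paths are illustrative, not the exact setup used):

  # build a test squashfs image from an existing directory tree
  mksquashfs /usr /tmp/test.squashfs -comp xz
  mkdir -p ~/kernel-squash ~/fuse-squash
  # kernel driver: decompression happens in kernel space
  sudo mount -t squashfs -o loop,ro /tmp/test.squashfs ~/kernel-squash
  # userspace driver: roughly what snapfuse/squashfuse do inside unprivileged containers
  squashfuse /tmp/test.squashfs ~/fuse-squash
  # walk both mounts to compare CPU usage while reading
  du -s ~/kernel-squash ~/fuse-squash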
