snapd memory size grows huge over time

Bug #1642068 reported by Colin Ian King
This bug affects 1 person
Affects: snapd (Ubuntu)
Status: Fix Released
Importance: High
Assigned to: Unassigned

Bug Description

I observed that installing and removing snaps seemed to make snapd grow in size, so to test whether this was true I repeatedly installed and removed a small package over the course of a few hours and measured the memory footprint with smemstat.

Attached is the smemstat log, with samples taken every 60 seconds. As one can see, the USS, PSS and RSS sizes grow steadily over time to more than 90MB. This suggests that either data is being cached too aggressively, there is a memory leak somewhere, or garbage collection is not occurring as expected.

Although my test case is rather artificial, it does show that there is an issue with snapd's memory footprint on a long-running system where snaps are installed and removed. This is concerning for small embedded IoT devices.

Notes:

USS = Unique Set Size
PSS = Proportional Set Size
RSS = Resident Set Size
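
For reference, the same figures can be sampled without smemstat by summing per-mapping fields from /proc/<pid>/smaps. A minimal Go sketch (the 60-second interval matches the sampling above; reading another process's smaps needs root):

    // memsample.go - a rough stand-in for smemstat: compute USS/PSS/RSS
    // for a PID by summing per-mapping fields from /proc/<pid>/smaps
    // (values are reported by the kernel in kB).
    package main

    import (
        "bufio"
        "fmt"
        "os"
        "strconv"
        "strings"
        "time"
    )

    func sample(pid string) (uss, pss, rss int64, err error) {
        f, err := os.Open("/proc/" + pid + "/smaps")
        if err != nil {
            return 0, 0, 0, err
        }
        defer f.Close()

        sc := bufio.NewScanner(f)
        for sc.Scan() {
            fields := strings.Fields(sc.Text())
            if len(fields) < 2 {
                continue
            }
            kb, _ := strconv.ParseInt(fields[1], 10, 64)
            switch fields[0] {
            case "Private_Clean:", "Private_Dirty:":
                uss += kb // USS: pages private to this process
            case "Pss:":
                pss += kb // PSS: private pages + share of shared pages
            case "Rss:":
                rss += kb // RSS: everything resident
            }
        }
        return uss, pss, rss, sc.Err()
    }

    func main() {
        if len(os.Args) != 2 {
            fmt.Fprintln(os.Stderr, "usage: memsample <pid>")
            os.Exit(1)
        }
        for {
            uss, pss, rss, err := sample(os.Args[1])
            if err != nil {
                fmt.Fprintln(os.Stderr, err)
                os.Exit(1)
            }
            fmt.Printf("USS=%dK PSS=%dK RSS=%dK\n", uss, pss, rss)
            time.Sleep(60 * time.Second)
        }
    }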

Attached: smemstat log, plus data in a LibreOffice spreadsheet with a graph showing the memory size growth.

Revision history for this message
Colin Ian King (colin-king) wrote :
Revision history for this message
Colin Ian King (colin-king) wrote :
Changed in snapd (Ubuntu):
importance: Undecided → High
Revision history for this message
Oliver Grawert (ogra) wrote :

Could this be related to bug 1636847?

Revision history for this message
Colin Ian King (colin-king) wrote :

I don't think so; this is just snapd itself growing as it eats up heap.

Revision history for this message
Colin Ian King (colin-king) wrote :

valgrind is finding ~288 bytes being leaked per thread creation:

==12683== Thread 1:
==12683== 288 bytes in 1 blocks are possibly lost in loss record 1 of 7
==12683== at 0x4C2EB45: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==12683== by 0x40139AA: allocate_dtv (dl-tls.c:322)
==12683== by 0x40139AA: _dl_allocate_tls (dl-tls.c:539)
==12683== by 0x4E412C9: allocate_stack (allocatestack.c:586)
==12683== by 0x4E412C9: pthread_create@@GLIBC_2.2.5 (pthread_create.c:539)
==12683== by 0xA89731: _cgo_sys_thread_start (in /usr/lib/snapd/snapd)
==12683== by 0x5EB33C: ??? (in /usr/lib/snapd/snapd)
==12683== by 0xC820000A7F: ???
==12683== by 0xC84CBF: ??? (in /usr/lib/snapd/snapd)

...but I suspect there are other memory consumers somewhere else too, as this can't explain the megabytes of leakage I'm seeing.

Revision history for this message
Colin Ian King (colin-king) wrote :

gdb is showing that Go is expanding the arena via anonymous mmaps, e.g. 128MB, 64MB, etc.:

__mmap (addr=0x0, len=134217728, prot=0, flags=16418, fd=-1, offset=0)
__mmap (addr=0x7ff774000000, len=67108864, prot=0, flags=16418, fd=-1, offset=0)
etc..

so something is causing Go to chomp up a large amount of heap.
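
Worth noting: these big anonymous mmaps (prot=0) are the Go runtime reserving heap arenas up front, so mapped size alone doesn't prove live data is growing. A minimal sketch of instrumentation that would separate reserved from live heap, assuming one can add it to the daemon:

    // heapstats.go - log the Go runtime's view of its heap so that arena
    // reservation (HeapSys) can be told apart from live allocations
    // (HeapAlloc).
    package main

    import (
        "log"
        "runtime"
        "time"
    )

    func main() {
        var m runtime.MemStats
        for range time.Tick(60 * time.Second) {
            runtime.ReadMemStats(&m)
            log.Printf("HeapAlloc=%dK HeapSys=%dK HeapIdle=%dK HeapReleased=%dK NumGC=%d",
                m.HeapAlloc/1024, m.HeapSys/1024, m.HeapIdle/1024,
                m.HeapReleased/1024, m.NumGC)
        }
    }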

Revision history for this message
Colin Ian King (colin-king) wrote :

And running health-check on snapd shows the same kind of arena expansions with mmap:

Memory Change via mmap() and munmap():
  PID Process mmaps munmaps Change (K) Rate (K/Sec)
 21627 snapd 36 1 87124 1452.06 (growing fast)
 21637 snapd 35 1 86992 1449.86 (growing fast)
 21236 snapd 1 1 65536 1092.26 (growing fast)
 21860 snapd 41 1 38971 649.53 (growing fast)
 21665 snapd 41 1 38971 649.53 (growing fast)
 21598 snapd 41 1 38971 649.53 (growing fast)
 21664 snapd 23 2 19995 333.26 (growing fast)
 21707 snapd 12 1 12138 202.31 (growing moderately fast)
 20838 snapd 3 0 9248 154.13 (growing moderately fast)
 20843 snapd 1 0 8196 136.60 (growing moderately fast)

Revision history for this message
Gustavo Niemeyer (niemeyer) wrote :

If you install/remove in a loop, this will create changes which are kept in memory until they are pruned. Pruning happens after three days for successful ones, and after a week for unsuccessful ones (which are aborted, and later pruned). Eventually we'll need to move to a model where this data is kept on disk with proper transactional control around it, but for now this is a cheap way for us to get going.
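
Not the actual snapd code, but a minimal sketch of the time-based pruning model just described (the Change type here is illustrative):

    // prune.go - illustrative time-based pruning: successful changes are
    // dropped after three days, unsuccessful ones after a week.
    package main

    import (
        "fmt"
        "time"
    )

    type Change struct {
        ID      string
        Ready   bool // finished, successfully or not
        Failed  bool
        ReadyAt time.Time
    }

    const (
        pruneWait = 3 * 24 * time.Hour // successful changes
        abortWait = 7 * 24 * time.Hour // unsuccessful changes
    )

    // prune returns the changes that survive the time-based policy.
    func prune(changes []*Change, now time.Time) []*Change {
        kept := changes[:0]
        for _, c := range changes {
            wait := pruneWait
            if c.Failed {
                wait = abortWait
            }
            if c.Ready && now.Sub(c.ReadyAt) > wait {
                continue // expired: drop from in-memory state
            }
            kept = append(kept, c)
        }
        return kept
    }

    func main() {
        now := time.Now()
        changes := []*Change{
            {ID: "ok-old", Ready: true, ReadyAt: now.Add(-4 * 24 * time.Hour)},
            {ID: "failed-old", Ready: true, Failed: true, ReadyAt: now.Add(-4 * 24 * time.Hour)},
            {ID: "running"},
        }
        for _, c := range prune(changes, now) {
            fmt.Println("kept:", c.ID) // prints failed-old and running
        }
    }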

Revision history for this message
Gustavo Niemeyer (niemeyer) wrote :

Btw, there's a "snap changes" command with which you can see which changes currently exist.

Revision history for this message
Colin Ian King (colin-king) wrote :

Pruning based on a time interval should perhaps be complemented with a memory threshold check too. On small devices we need to be sure we constrain the memory footprint, especially on devices without swap, otherwise processes will be OOM-killed.
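
To illustrate the idea, a sketch of such a complementary check; the 20MB budget and the pruneOldChanges hook are hypothetical, not snapd API:

    // memguard.go - sketch of a memory-threshold prune trigger to
    // complement time-based pruning on small devices. heapBudget and
    // pruneOldChanges() are hypothetical placeholders.
    package main

    import (
        "log"
        "runtime"
        "time"
    )

    const heapBudget = 20 << 20 // e.g. a 20MB budget on a small IoT device

    func pruneOldChanges() {
        // placeholder: drop the oldest completed changes here
    }

    func main() {
        var m runtime.MemStats
        for range time.Tick(time.Minute) {
            runtime.ReadMemStats(&m)
            if m.HeapAlloc > heapBudget {
                log.Printf("heap %dK over %dK budget, pruning early",
                    m.HeapAlloc/1024, heapBudget/1024)
                pruneOldChanges()
            }
        }
    }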

Revision history for this message
Colin Ian King (colin-king) wrote :

Well, until it's fixed, does that mean that unsuccessful transactions will cause snapd to eat memory? If so, that seems like a DoS vector to me.

Revision history for this message
John Lenton (chipaca) wrote :

Out of curiosity, how many changes per second were you doing?


Revision history for this message
Gustavo Niemeyer (niemeyer) wrote :

As mentioned above, unsuccessful changes are aborted after a few days, and pruned thereafter.

Revision history for this message
Colin Ian King (colin-king) wrote :

I still think a daemon that can gobble up a load of memory through misuse needs fixing. Maybe I'm old school in believing that a small-footprint daemon that functions correctly even when abused is the correct behaviour.

Revision history for this message
Colin Ian King (colin-king) wrote :

@John Lenton: about 3 per sec

Revision history for this message
John Lenton (chipaca) wrote :

Unless I'm misreading something, it's growing at about 2MB every 60 sec. 180 changes should use a *lot* less than 2MB. So there probably is something else going on...

Revision history for this message
Mark Shuttleworth (sabdfl) wrote :

Is pruning expensive? Could we set a bound on the number of unsuccessful
transactions in the system and prune when that is reached? So the number
is bounded in time (3 days) and in length?

Revision history for this message
Gustavo Niemeyer (niemeyer) wrote :

@Colin, sure, it definitely needs to be improved. As I said above, this is just a cheap way to get us going, and the proper fix is transactional storage on disk.

@Mark, Yes, we can prune based on number of changes as well. It's not expensive. That's a good trick to get us going farther.

@John, 2MB / (10 tasks * 180 changes) is about 1.1kB per task, which is not unrealistic I think. Also worth noting it's not just plain data... there will be a small amount of additional garbage created per entry when processing it.

Revision history for this message
Colin Ian King (colin-king) wrote :

Sure. The other issue was the potential leak of ~288 bytes per thread creation at the C/Go interface layer, as mentioned in comment #5. I suspect that's an issue that needs fixing in the Go glue layer.

Revision history for this message
Gustavo Niemeyer (niemeyer) wrote :

Did you dig into that problem? I don't see the potential leak looking at the function. All of the arguments other than *ThreadStart are allocated on the stack, and ts is freed in threadentry, which is the function provided as the start_routine parameter of pthread_create.

What is the actual issue reported?

Revision history for this message
Colin Ian King (colin-king) wrote :

@gustavo, I was just reporting some incidental data that I observed with valgrind, as I thought it might be pertinent. It was simply an observation that made me dig a bit deeper, since it didn't explain the larger heap allocations. I just wanted to be thorough and report my findings, as it may be useful if it's a leak issue that could affect other threaded Go applications.

Revision history for this message
John Lenton (chipaca) wrote :

I've put a branch up that drops a few things so that building snapd without cgo is possible. Could you take it from https://github.com/chipaca/snappy/tree/no-c, build it, and repeat the memory tests?

You can either build it entirely by hand with CGO_ENABLED=0 set, to force it to build a static binary, or build it normally (using dpkg-buildpackage) and then force the resolver to use the pure Go one with GODEBUG=netdns=go.

Thanks!

Revision history for this message
John Lenton (chipaca) wrote :

(if you'd rather I put the binary somewhere for you, do let me know)

Revision history for this message
Gustavo Niemeyer (niemeyer) wrote :

@Colin, that's what I understood previously, but thanks for confirming it. The question was whether you had dug further to see whether that warning made any sense, or whether you had any further data confirming the leak to be real. I'll take your response as a no, which is fine.

Revision history for this message
Colin Ian King (colin-king) wrote :

@Gustavo, sorry for the confusion, my response is no.

@John, building this on my Zesty box:

$ dpkg-buildpackage
...
debian/rules:67: recipe for target 'override_dh_auto_build' failed
make[1]: *** [override_dh_auto_build] Error 1
make[1]: Leaving directory '/home/king/tmp/snappy'
debian/rules:50: recipe for target 'build' failed
make: *** [build] Error 2
dpkg-buildpackage: error: debian/rules build gave error exit status 2

perhaps you could supply a .deb for me

Revision history for this message
John Lenton (chipaca) wrote :
Revision history for this message
John Lenton (chipaca) wrote :

attached

Revision history for this message
John Lenton (chipaca) wrote :

(that one has a static binary; the only caveat is that, because it's master, it'll bump your state.json to the latest format, which might then not let you go back to whatever you were using before)

Revision history for this message
Colin Ian King (colin-king) wrote :

@John, I'm not seeing memory leaks on the thread create paths now. I guess that was expected.

Revision history for this message
John Lenton (chipaca) wrote :

Could you try again with this one? This one is dynamically linked but has an Environment line in /lib/systemd/system/snapd.service to force it to use the Go resolver.

And, lastly, if you can remove that line from the service file and try again, then it should use the cgo resolver.

Revision history for this message
Gustavo Niemeyer (niemeyer) wrote :

Once more, unless we have evidence showing otherwise, there were no leaks before either. Let's please stop pretending otherwise.

Revision history for this message
Colin Ian King (colin-king) wrote :

@Gustavo, OK, I'll defer to your wishes. I'm just not 100% sure how valgrind could spot a leak that isn't there.

Revision history for this message
Gustavo Niemeyer (niemeyer) wrote :

@Colin, Sorry if it sounds different, but this is not a wish of mine. It's just about respecting the data we have. Not even Valgrind itself is reporting this as a leak. It's saying "block possibly lost". Unless we have either very basic empirical evidence of a leak (e.g. offending code alone in a loop grows the process larger), or someone that disagrees with the assessment I've made above reading the code, we shouldn't regard this as a leak.

Revision history for this message
John Lenton (chipaca) wrote :

For example, doing `snap find` in a loop grows snapd. I struggle to not call that a leak.

Revision history for this message
Gustavo Niemeyer (niemeyer) wrote :

If it never stops growing, it's obviously a leak.. somewhere! If it stabilizes after a while, then it's a side effect of Go being a non-deterministic garbage collected language. Again, data.

If the goal is testing this particular code path, it'd be best to have a tiny test case exercising it, rather than doing something like "snap find".
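
Something along these lines, where findPath is a placeholder for the suspect code path rather than the real "snap find" implementation:

    // growth.go - skeleton for the kind of tiny test case suggested above:
    // exercise one suspect code path in a loop and compare live heap
    // before and after. findPath() is a placeholder.
    package main

    import (
        "fmt"
        "runtime"
    )

    func findPath() {
        _ = make([]byte, 512) // placeholder: call the suspect code path here
    }

    // liveHeap forces a GC so HeapAlloc approximates live data only.
    func liveHeap() uint64 {
        runtime.GC()
        var m runtime.MemStats
        runtime.ReadMemStats(&m)
        return m.HeapAlloc
    }

    func main() {
        before := liveHeap()
        for i := 0; i < 100000; i++ {
            findPath()
        }
        after := liveHeap()
        fmt.Printf("live heap before=%dK after=%dK delta=%+dK\n",
            before/1024, after/1024, (int64(after)-int64(before))/1024)
    }

If the delta keeps climbing as the iteration count goes up, something is holding references; if it flattens out, it was just GC slack.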

Revision history for this message
Michael Hudson-Doyle (mwhudson) wrote :

On 18 November 2016 at 02:26, Colin Ian King <email address hidden>
wrote:

> @gustavo, I was just reporting some incidental data that I observed with
> valgrind as I thought it may be pertinent.

FWIW, I wouldn't in general expect valgrind to be useful with a go program.
The Go runtime manages its heap in a totally different way to a C program.

Revision history for this message
babak (bk-ravandi) wrote :

I am having the same issue. The /usr/lib/snapd/snapd memory usage grows very large and is never released. Here is what my application does:

1) at runtime, attach volumes to the Ubuntu server using OpenStack cinder
2) run an FIO test on each volume every 30 seconds
3) randomly detach volumes

snapd grows so large that I get "[Errno 12] Cannot allocate memory" when trying to run any application (even pkill), so the only option is rebooting the system. Even when I stop my application, the memory isn't released unless I reboot.

I am using FIO version: fio-2.2.10

Below is my Ubuntu version (it's 32-bit):
Distributor ID: Ubuntu
Description: Ubuntu 16.04.1 LTS
Release: 16.04
Codename: xenial

ubuntu@vm-1:~$ uname -a
Linux vm-1 4.4.0-47-generic #68-Ubuntu SMP Wed Oct 26 19:39:59 UTC 2016 i686 i686 i686 GNU/Linux

Michael Vogt (mvo)
Changed in snapd (Ubuntu):
status: New → In Progress
status: In Progress → Triaged
Revision history for this message
Michael Vogt (mvo) wrote :

The latest snapd (2.20) no longer uses cgo, so the risk of memory leaks from that path is greatly reduced.

There is still the problem that changes accumulate over time and that can eat quite a bit of memory. The following branch addresses this: https://github.com/snapcore/snapd/pull/2545

With the above branch I get something like this: http://paste.ubuntu.com/23728089/ (while running "while true; do sudo snap remove hello; sudo snap install hello; done").

Revision history for this message
Michael Vogt (mvo) wrote :

When running snapd with the following branches:
- https://github.com/snapcore/snapd/pull/2546
- https://github.com/snapcore/snapd/pull/2545
under:
 `while true; do sudo smemstat -p snapd ; sleep 60; done`
and with constantly installing/removing a snap:
 `while true; do sudo snap remove hello; sudo snap install hello; done`
it is going into a steady state at some point:
  http://paste.ubuntu.com/23732666/
where on my system the total RSS is around ~50MB but no longer growing.

The number of changes goes up to ~350 and is then purged down to 100
again after the 10min purge interval.

So it looks like once the changes limiter lands this should be fixed.
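
For the record, the shape of such a changes limiter, with the trigger and floor taken from the numbers above (a sketch, not the code in the PRs):

    // limiter.go - sketch of a count-bounded changes limiter: once
    // completed changes exceed a high-water mark (~350 observed above),
    // purge the oldest back down to a floor of 100.
    package main

    import (
        "fmt"
        "sort"
        "time"
    )

    type Change struct {
        ID      int
        Ready   bool
        ReadyAt time.Time
    }

    const (
        maxChanges = 350 // purge trigger
        minChanges = 100 // floor after a purge
    )

    func limitChanges(changes []*Change) []*Change {
        if len(changes) <= maxChanges {
            return changes
        }
        // Sort oldest-finished first so those are dropped first.
        sort.Slice(changes, func(i, j int) bool {
            return changes[i].ReadyAt.Before(changes[j].ReadyAt)
        })
        drop := len(changes) - minChanges
        kept := changes[:0]
        for _, c := range changes {
            if drop > 0 && c.Ready {
                drop--
                continue // purge this completed change
            }
            kept = append(kept, c)
        }
        return kept
    }

    func main() {
        var cs []*Change
        for i := 0; i < 400; i++ {
            cs = append(cs, &Change{ID: i, Ready: true,
                ReadyAt: time.Now().Add(time.Duration(i) * time.Second)})
        }
        fmt.Println("kept:", len(limitChanges(cs))) // kept: 100
    }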

Revision history for this message
Michael Vogt (mvo) wrote :

The fix for this has landed: we now have an upper bound on the number of changes we keep.

Changed in snapd (Ubuntu):
status: Triaged → Fix Committed
Changed in snapd (Ubuntu):
status: Fix Committed → Fix Released