unexpectedly large memory usage of mounted snaps

Bug #1636847 reported by Zygmunt Krynicki on 2016-10-26
Affects          Importance  Assigned to
Snappy           Critical    Unassigned
linux (Ubuntu)   Medium      Colin Ian King
  Xenial         Critical    Andy Whitcroft
  Yakkety        Critical    Andy Whitcroft
  Zesty          Medium      Colin Ian King

Bug Description

This is a tracking bug for what might be kernel bugs or kernel configuration changes.

As described [1], memory used by simply mounting a squashfs file (even an empty one) is ranging from almost nothing (on certain distributions) to 131MB on Ubuntu 16.04 and 16.10 on a single-core machine or VM.

The amount is excessive and should be investigated by the kernel team. We may need to change the kernel or at least the configuration we ship in our packages and kernel snaps.

[1] https://github.com/zyga/mounted-fs-memory-checker
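The checker's approach can be sketched as: sample available memory after each additional mount and report the per-mount drop. A minimal, hypothetical helper (illustrative only; not code from the linked repository):

```python
# Hypothetical helper that turns a series of MemAvailable samples -- one taken
# after 0, 1, 2, ... squashfs mounts -- into per-mount memory deltas in MB.

def per_mount_deltas(mem_available_kb):
    """Given MemAvailable (in kB) sampled after 0, 1, 2, ... mounts,
    return the extra memory consumed by each additional mount, in MB."""
    mb = [kb / 1024.0 for kb in mem_available_kb]
    # Each mount reduces MemAvailable; the delta is the drop per mount.
    return [round(mb[i] - mb[i + 1], 2) for i in range(len(mb) - 1)]

# Example samples showing a ~131 MB drop per mount, as on the 1-CPU Xenial VM.
samples = [900000, 765744, 631488]  # kB
print(per_mount_deltas(samples))    # -> [131.11, 131.11]
```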


This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1636847

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete
Jamie Strandboge (jdstrand) wrote :

Marking as 'confirmed' so the bot doesn't auto-close it.

Changed in linux (Ubuntu):
status: Incomplete → Confirmed
Andy Whitcroft (apw) wrote :

Evidence suggests this is triggered by the parallel decompressor. Will spin a kernel with that switched out:

    CONFIG_SQUASHFS_DECOMP_SINGLE=y

Andy Whitcroft (apw) wrote :

OK, test kernels are available at the URL below; please report testing results back here:

    people.canonical.com/~apw/lp1636847-xenial/

Thanks.

Zygmunt Krynicki (zyga) wrote :

After using the CONFIG_SQUASHFS_DECOMP_SINGLE=y option in an experimental kernel, the memory usage dropped significantly. The traces below are for the -45 and the experimental -46 kernels; the 1 and 4 in the command lines are the number of CPUs on the system.

A single-CPU system used to consume 131MB per mounted snap; this is now reduced to just 4MB per snap. A four-CPU system shows a less dramatic improvement, from 7MB to 4MB per snap.

This suggests some kind of bug in the CONFIG_SQUASHFS_DECOMP_MULTI_PERCPU option.

+ echo Ubuntu 16.04
Ubuntu 16.04
+ ./analyze.py ubuntu 16.04 4.4.0-45-generic 1 size-1m.squashfs.xz.heavy
# num-mounted extra-memory delta
0: 142.60MB
1: 274.71MB (delta: 132.11MB)
2: 406.55MB (delta: 131.84MB)
3: 538.36MB (delta: 131.81MB)
4: 670.19MB (delta: 131.82MB)
+ ./analyze.py ubuntu 16.04 4.4.0-46-generic 1 size-1m.squashfs.xz.heavy
# num-mounted extra-memory delta
0: 62.28MB
1: 66.99MB (delta: 4.71MB)
2: 71.05MB (delta: 4.07MB)
3: 75.12MB (delta: 4.06MB)
4: 79.19MB (delta: 4.07MB)
+ ./analyze.py ubuntu 16.04 4.4.0-45-generic 4 size-1m.squashfs.xz.heavy
# num-mounted extra-memory delta
0: 235.43MB
1: 242.38MB (delta: 6.96MB)
2: 249.38MB (delta: 7.00MB)
3: 256.45MB (delta: 7.06MB)
4: 263.42MB (delta: 6.97MB)
+ ./analyze.py ubuntu 16.04 4.4.0-46-generic 4 size-1m.squashfs.xz.heavy
# num-mounted extra-memory delta
0: 72.79MB
1: 75.90MB (delta: 3.11MB)
2: 79.96MB (delta: 4.06MB)
3: 83.93MB (delta: 3.97MB)
4: 88.00MB (delta: 4.07MB)

Andy Whitcroft (apw) on 2016-10-26
Changed in linux (Ubuntu Xenial):
status: New → Confirmed
Changed in linux (Ubuntu Yakkety):
status: New → Confirmed
Changed in linux (Ubuntu):
assignee: nobody → Colin Ian King (colin-king)
importance: Undecided → Medium
Changed in linux (Ubuntu Xenial):
importance: Undecided → Critical
Changed in linux (Ubuntu Yakkety):
importance: Undecided → Critical
Changed in linux (Ubuntu Xenial):
assignee: nobody → Andy Whitcroft (apw)
Changed in linux (Ubuntu Yakkety):
assignee: nobody → Andy Whitcroft (apw)
Tim Gardner (timg-tpi) on 2016-10-26
Changed in linux (Ubuntu Xenial):
status: Confirmed → Fix Committed
Changed in linux (Ubuntu Yakkety):
status: Confirmed → Fix Committed
Changed in linux (Ubuntu):
status: Confirmed → Fix Committed
Seth Forshee (sforshee) wrote :

@zyga: I'm honestly very surprised that the config change had that drastic an impact on the single-CPU system. Can you tell me what 'cat /sys/devices/system/cpu/possible' says on that system?

Can I assume that the "size-1M" implies a 1MB block size?

Let's start with the simplest options for reducing RAM usage. The most obvious place for gains is the buffers squashfs uses for decompression and caching. squashfs has three caches:

- The "metadata" cache caches 8 blocks of metadata. The metadata block size is fixed at 8KB and the cache is fixed at 8 blocks, so this consumes 64KB of RAM (plus overhead). So there isn't a lot to be gained here.

- The "data" cache is the one affected by CONFIG_SQUASHFS_DECOMP_SINGLE. squashfs allocates RAM up front for each possible decompression thread, at fs_block_size bytes per thread. The previously used config option allocated one cache per possible CPU in the system (which is why I was surprised at the numbers for the single-CPU system; at a 1MB block size that implies the system supports over 100 CPUs). The only simple way to gain more here would be to reduce the block size of the filesystems.

- The "fragment" cache. This caches CONFIG_SQUASHFS_FRAGMENT_CACHE_SIZE blocks, which is currently set to 3 in our kernels. So it would seem that this accounts for the bulk of the remaining RAM usage. That means for a 1MB block size it's a fixed size of 3MB. We could reduce this to 1 to save some RAM, and reducing the block size of the squashfs images would again help here.

So if the images do have a 1MB block size there are two simple things that will yield the biggest gains - reducing the fragment cache to 1 block and reducing the block size of the squashfs images. Obviously any reduction in cache sizes may result in a loss of performance, depending on access patterns.
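The three cache sizes above can be put into a back-of-the-envelope calculator; the formulas follow the descriptions in this comment, and the function name is illustrative, not a kernel identifier:

```python
# Per-mount squashfs cache footprint, per the three caches described above:
# metadata (8 blocks x 8 KB, fixed), data (one block per decompressor thread),
# and fragment (CONFIG_SQUASHFS_FRAGMENT_CACHE_SIZE blocks).

def squashfs_cache_mb(block_size_mb, decomp_threads, fragment_cache_blocks):
    metadata = 8 * 8 / 1024.0                      # 64 KB, in MB
    data = decomp_threads * block_size_mb
    fragment = fragment_cache_blocks * block_size_mb
    return metadata + data + fragment

# 1 MB blocks, DECOMP_MULTI_PERCPU on a VM reporting 128 possible CPUs,
# fragment cache of 3 blocks: roughly the 131 MB/mount seen on Xenial.
print(squashfs_cache_mb(1, 128, 3))   # ~131.06
# DECOMP_SINGLE (one thread) with the same fragment cache: ~4 MB/mount.
print(squashfs_cache_mb(1, 1, 3))     # ~4.06
```

These estimates line up with the measured 131.8MB and 4.07MB deltas in comment #4.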

I'll go ahead and build a kernel with CONFIG_SQUASHFS_FRAGMENT_CACHE_SIZE=1 for you to test. I would suggest taking a look at performance in addition to RAM usage when you're testing.

If 1MB is the block size you're using, would you be open to making this smaller?

Seth Forshee (sforshee) wrote :

Test build with CONFIG_SQUASHFS_FRAGMENT_CACHE_SIZE=1 is done.

http://people.canonical.com/~sforshee/lp1636847/linux-4.4.0-46.67+lp1636847v201611071220/

Seth Forshee (sforshee) wrote :

Adding some notes and ideas based on looking more at the squashfs implementation.

One idea I had was to reduce the data cache (or maybe even eliminate it) by making squashfs decompress directly into the page cache. It turns out that squashfs already does this if CONFIG_SQUASHFS_FILE_DIRECT=y, which is the case in our kernel. Rather, it tries to read into the page cache directly first and falls back to using the internal cache if that fails, so we can't eliminate the cache entirely. But it does mean that, with SQUASHFS_FILE_DIRECT set, squashfs can probably get by with much less data cache than it currently allocates for SQUASHFS_DECOMP_MULTI and SQUASHFS_DECOMP_MULTI_PERCPU. We can't reduce the data cache for CONFIG_SQUASHFS_DECOMP_SINGLE since it always needs enough cache for at least one uncompressed data block.

Also important to note is that although the squashfs data cache size is determined by which decompressor implementation is selected, the cache and decompressor states are managed independently, therefore there's no requirement that they are so tightly coupled. The coupling makes sense if SQUASHFS_FILE_DIRECT is disabled but seems less sensible if it's enabled.

The decompressor implementations come with their own tradeoffs:

- CONFIG_SQUASHFS_DECOMP_SINGLE: Uses less memory. However decompression is serialized, i.e. for a given superblock only one block can be decompressed at a time.

- SQUASHFS_DECOMP_MULTI: Allows for (num_online_cpus() * 2) parallel decompressions, which means the number of parallel decompressions could vary depending on how many CPUs were online when the filesystem was mounted. Also allocates the same number of blocks for the data cache. This implementation has more overhead associated with managing decompressor state memory than the others.

- SQUASHFS_DECOMP_MULTI_PERCPU: This maintains per-CPU decompressor state, so it is lockless and there is no overhead for managing the decompressor state memory. That means it uses more RAM, though, both for decompressor state memory and for the data cache. You can also have as many parallel decompressions going on as there are CPU cores.
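The tradeoffs above come down to how many decompressors (and hence data-cache blocks) each option allocates. A sketch of the sizing rules as described, with an illustrative function name:

```python
# Number of parallel decompressors each squashfs decompressor option
# allocates, per the descriptions above. Each one also implies the same
# number of fs_block_size data-cache blocks.

def decompressors(option, online_cpus, possible_cpus):
    if option == "SINGLE":
        return 1
    if option == "MULTI":
        return online_cpus * 2   # fixed at mount time
    if option == "MULTI_PERCPU":
        return possible_cpus     # possible, not online, CPUs
    raise ValueError(option)

# A 1-CPU VM whose firmware advertises 128 possible CPUs:
for opt in ("SINGLE", "MULTI", "MULTI_PERCPU"):
    print(opt, decompressors(opt, online_cpus=1, possible_cpus=128))
```

On such a VM the per-CPU option allocates 128 times the data cache that the single-threaded option does, which is exactly the pathology in comment #4.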

Based on this, here are a couple of ideas we could try to strike a good balance between performance and RAM usage:

1. Add a config option for the maximum number of data cache blocks. This would allow us to use one of the implementations which allows for parallel decompression without the data cache size exploding. As long as most blocks get decompressed directly into the page cache (which we'll need to verify) having only one or two blocks in the data cache should not be detrimental to performance.

2. Make a new decompressor implementation which allows for full customization of the number of parallel decompressions and number of blocks in the data cache. In my opinion SQUASHFS_DECOMP_SINGLE is too little parallelization but the others are too much on systems with large numbers of CPUs. Instead we could specify both the maximum number of parallel decompressions in the kernel config (possibly capped at the number of possible CPU cores) and the number of data blocks to cache to suit our needs.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.4.0-47.68

---------------
linux (4.4.0-47.68) xenial; urgency=low

  [ Kamal Mostafa ]

  * Release Tracking Bug
    - LP: #1636941

  * Add a driver for Amazon Elastic Network Adapters (ENA) (LP: #1635721)
    - lib/bitmap.c: conversion routines to/from u32 array
    - net: ethtool: add new ETHTOOL_xLINKSETTINGS API
    - net: ena: Add a driver for Amazon Elastic Network Adapters (ENA)
    - [config] enable CONFIG_ENA_ETHERNET=m (Amazon ENA driver)

  * unexpectedly large memory usage of mounted snaps (LP: #1636847)
    - [Config] switch squashfs to single threaded decode

 -- Kamal Mostafa <email address hidden> Wed, 26 Oct 2016 10:47:55 -0700

Changed in linux (Ubuntu Xenial):
status: Fix Committed → Fix Released
Seth Forshee (sforshee) wrote :

Regarding #2 in comment #8 - I found that we can more or less do this with a few simple modifications to SQUASHFS_DECOMP_MULTI. The config options are upper bounds on the number of decompressors and data cache blocks. I tested this with the mounted-fs-memory-checker for comparison, limiting squashfs to 1 data cache block and 4 decompressors per super block (and with CONFIG_SQUASHFS_FRAGMENT_CACHE_SIZE=1). Here's what I got for the "heavy" filesystems on a 2-core VM:

size-0m.squashfs.xz.heavy
# num-mounted extra-memory delta
0: 39.45MB
1: 39.85MB (delta: 0.40MB)
2: 41.91MB (delta: 2.06MB)
3: 43.99MB (delta: 2.07MB)
4: 46.06MB (delta: 2.08MB)
size-1m.squashfs.xz.heavy
# num-mounted extra-memory delta
0: 39.45MB
1: 39.85MB (delta: 0.40MB)
2: 41.91MB (delta: 2.06MB)
3: 43.97MB (delta: 2.06MB)
4: 46.04MB (delta: 2.06MB)

I expect this is identical to what we'd get with the kernel from comment #7, and is probably the minimum we can expect (2 * fs_block_size).
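That floor of 2 * fs_block_size follows directly from keeping one block in each remaining cache; a trivial sketch (illustrative name):

```python
# With one data-cache block and one fragment-cache block per super block,
# the steady-state floor is two uncompressed blocks of RAM per mount.

def min_cache_mb(block_size_mb, data_blocks=1, fragment_blocks=1):
    return (data_blocks + fragment_blocks) * block_size_mb

print(min_cache_mb(1))  # 2 -> matches the ~2 MB/mount deltas above
```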

I want to do some performance comparison between these kernels and 4.4.0-47.68, and to get some idea of how often squashfs has to fall back to using the data cache rather than decompressing into the page cache directly.

My most recent build (with one block in the data cache, one block in the fragment cache, and a maximum of 4 parallel decompressors) can be found at

http://people.canonical.com/~sforshee/lp1636847/linux-4.4.0-47.68+lp1636847v201611101005/

Seth Forshee (sforshee) wrote :

Observations from some very unscientific testing. Testing was done with fio using 8 parallel jobs doing random reads in an amd64 VM with 8 cores.

* The kernel in comment #10 and 4.4.0-45 (with SQUASHFS_DECOMP_MULTI_PERCPU) performed comparably for the most part. 4.4.0-47 (with CONFIG_SQUASHFS_DECOMP_SINGLE) was somewhat slower.

* With 4K and 128K block sizes, I did not see the kernel from comment #10 falling back to using the data block cache at all during my tests. With a 1M block size it was falling back to the data block cache sometimes.

Luis Henriques (henrix) wrote :

This bug is awaiting verification that the kernel in -proposed solves the problem. Please test the kernel and update this bug with the results. If the problem is solved, change the tag 'verification-needed-yakkety' to 'verification-done-yakkety'.

If verification is not done by 5 working days from today, this fix will be dropped from the source code, and this bug will be closed.

See https://wiki.ubuntu.com/Testing/EnableProposed for documentation how to enable and use -proposed. Thank you!

tags: added: verification-needed-yakkety
Zygmunt Krynicki (zyga) wrote :

Hey Seth.

To reply to your earlier question:

@zyga: I'm honestly very surprised that the config change had that drastic an impact on the single-CPU system. Can you tell me what 'cat /sys/devices/system/cpu/possible' says on that system?

This was in a virtual machine with one CPU and the file listed above says:

cat /sys/devices/system/cpu/possible
0-127

Interestingly, with more CPUs (4 virtual CPUs), the numbers change to:

$ cat /sys/devices/system/cpu/possible
0-7

So it looks like a bug in the kernel or the VM software (in this case, VMware).
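Since DECOMP_MULTI_PERCPU sizes its caches from the count of possible CPUs, not online ones, the range string above is what matters. A small illustrative parser for that sysfs format:

```python
# Parse a /sys/devices/system/cpu/possible style string, e.g. "0-127" or
# "0,2-3", into a count of possible CPUs. Illustrative helper, not kernel code.

def possible_cpu_count(ranges):
    total = 0
    for part in ranges.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            total += int(hi) - int(lo) + 1
        else:
            total += 1
    return total

print(possible_cpu_count("0-127"))  # 128 -> 128 x 1 MB data-cache blocks/mount
print(possible_cpu_count("0-7"))    # 8
```

With "0-127" on a single-CPU VM, the per-CPU decompressor allocated 128 one-megabyte cache blocks per mount, which accounts for the 131MB figure.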

I will give the new kernels a try and report back.

Seth Forshee (sforshee) wrote :

Thanks for the responses, that explains the strange results on the single CPU system. Have you had a chance to try the new kernels yet?

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.8.0-28.30

---------------
linux (4.8.0-28.30) yakkety; urgency=low

  [ Luis Henriques ]

  * Release Tracking Bug
    - LP: #1641083

  * lxc-attach to malicious container allows access to host (LP: #1639345)
    - Revert "UBUNTU: SAUCE: (noup) ptrace: being capable wrt a process requires
      mapped uids/gids"
    - (upstream) mm: Add a user_ns owner to mm_struct and fix ptrace permission
      checks

  * [Feature] AVX-512 new instruction sets (avx512_4vnniw, avx512_4fmaps)
    (LP: #1637526)
    - x86/cpufeature: Add AVX512_4VNNIW and AVX512_4FMAPS features

  * zfs: importing zpool with vdev on zvol hangs kernel (LP: #1636517)
    - SAUCE: (noup) Update zfs to 0.6.5.8-0ubuntu4.1

  * Move some device drivers build from kernel built-in to modules
    (LP: #1637303)
    - [Config] CONFIG_TIGON3=m for all arches
    - [Config] CONFIG_VIRTIO_BLK=m, CONFIG_VIRTIO_NET=m

  * I2C touchpad does not work on AMD platform (LP: #1612006)
    - pinctrl/amd: Configure GPIO register using BIOS settings

  * guest experiencing Transmit Timeouts on CX4 (LP: #1636330)
    - powerpc/64: Re-fix race condition between going idle and entering guest
    - powerpc/64: Fix race condition in setting lock bit in idle/wakeup code

  * QEMU throws failure msg while booting guest with SRIOV VF (LP: #1630554)
    - KVM: PPC: Always select KVM_VFIO, plus Makefile cleanup

  * [Feature] KBL - New device ID for Kabypoint(KbP) (LP: #1591618)
    - SAUCE: mfd: lpss: Fix Intel Kaby Lake PCH-H properties

  * hio: SSD data corruption under stress test (LP: #1638700)
    - SAUCE: hio: set bi_error field to signal an I/O error on a BIO
    - SAUCE: hio: splitting bio in the entry of .make_request_fn

  * cleanup primary tree for linux-hwe layering issues (LP: #1637473)
    - [Config] switch Vcs-Git: to yakkety repository
    - [Packaging] handle both linux-lts* and linux-hwe* as backports
    - [Config] linux-tools-common and linux-cloud-tools-common are one per series
    - [Config] linux-source-* is in the primary linux namespace
    - [Config] linux-tools -- always suggest the base package

  * SRU: sync zfsutils-linux and spl-linux changes to linux (LP: #1635656)
    - SAUCE: (noup) Update spl to 0.6.5.8-2, zfs to 0.6.5.8-0ubuntu4 (LP:
      #1635656)

  * [Feature] SKX: perf uncore PMU support (LP: #1591810)
    - perf/x86/intel/uncore: Add Skylake server uncore support
    - perf/x86/intel/uncore: Remove hard-coded implementation for Node ID mapping
      location
    - perf/x86/intel/uncore: Handle non-standard counter offset

  * [Feature] Purley: Memory Protection Keys (LP: #1591804)
    - x86/pkeys: Add fault handling for PF_PK page fault bit
    - mm: Implement new pkey_mprotect() system call
    - x86/pkeys: Make mprotect_key() mask off additional vm_flags
    - x86/pkeys: Allocation/free syscalls
    - x86: Wire up protection keys system calls
    - generic syscalls: Wire up memory protection keys syscalls
    - pkeys: Add details of system call use to Documentation/
    - x86/pkeys: Default to a restrictive init PKRU
    - x86/pkeys: Allow configuration of init_pkru
    - x86/pkeys: Add self-tests

  * kernel invalid ...

Changed in linux (Ubuntu Yakkety):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for linux has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

Michael Vogt (mvo) on 2016-11-30
Changed in snappy:
status: New → In Progress
importance: Undecided → Critical
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package linux - 4.8.0-30.32

---------------
linux (4.8.0-30.32) yakkety; urgency=low

  * CVE-2016-8655 (LP: #1646318)
    - packet: fix race condition in packet_set_ring

 -- Brad Figg <email address hidden> Thu, 01 Dec 2016 08:02:53 -0800

Changed in linux (Ubuntu Zesty):
status: Fix Committed → Fix Released
Michael Vogt (mvo) on 2017-01-17
Changed in snappy:
status: In Progress → Fix Released