arale has very high memory usage compared to krillin

Bug #1468077 reported by Sturm Flut
This bug affects 4 people
Affects: Canonical System Image
Status: Won't Fix
Importance: High
Assigned to: Yuan-Chen Cheng

Bug Description

While using arale I noticed that the kernel reports a much higher memory usage than on krillin. I would expect some increase, e.g. all the image buffers need to be larger because the display has a much higher resolution. But the difference is so vast that I can't explain it easily, and there is a huge chunk of memory that seemingly can't be attributed to anything in userspace.

Let's make some definitions first:

- "Total PSS" is the sum of all proportional set size values of all running processes, as e.g. reported by Colin Ian King's "smemstat" (http://kernel.ubuntu.com/~cking/smemstat/) utility.

- "Used" is the actual amount of used memory, excluding buffers and caches, thus the result of MemTotal-MemFree-Buffers-Cached as reported in /proc/meminfo. This value is e.g. reported in the second output row of the "free" command under "used".

I would expect that Used is always a bit higher than Total PSS because the kernel needs some memory for itself. This assumption turns out to be true on all my devices, except arale.
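
For reference, a minimal sketch of how both values can be computed directly (assuming a standard Linux /proc layout; reading every process' smaps requires root, and this only approximates what smemstat reports):

#!/usr/bin/env python3
# Sketch: compute "Used" and "Total PSS" as defined above.
import glob

def meminfo_kb():
    """Parse /proc/meminfo into a dict of kB values."""
    info = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, value = line.split(":", 1)
            info[key] = int(value.split()[0])  # values are reported in kB
    return info

def used_kb(info):
    # Used = MemTotal - MemFree - Buffers - Cached
    return info["MemTotal"] - info["MemFree"] - info["Buffers"] - info["Cached"]

def total_pss_kb():
    # Sum the Pss: lines of every process' smaps file.
    total = 0
    for path in glob.glob("/proc/[0-9]*/smaps"):
        try:
            with open(path) as f:
                total += sum(int(line.split()[1]) for line in f
                             if line.startswith("Pss:"))
        except (PermissionError, FileNotFoundError):
            pass  # process exited or is not readable
    return total

info = meminfo_kb()
print("Used:      %.1f M" % (used_kb(info) / 1024.0))
print("Total PSS: %.1f M" % (total_pss_kb() / 1024.0))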

Let's look at some results.

krillin: channel ubuntu-touch/stable/bq-aquaris.en, image r23 (OTA-4)

* Step 1: Turn off all radios and turn on Developer Mode (if necessary)

* Step 2: Reboot the device

* Step 3: Unlock the device, wait a minute until it has "settled"

* Step 4: Log in via phablet shell and measure

krillin Total PSS: 350.9 M
krillin Used: 394 M
krillin Swap: 0 M

arale Total PSS: 364.7 M
arale Used: 543 M
arale Swap: 0 M

* Step 5: Turn on WiFi, connect to an access point, open the browser and navigate to www.ubuntu.com. Make sure no other tabs are open! Measure when the page has loaded.

krillin Total PSS: 502.6 M
krillin Used: 498 M
krillin Swap: 6.0 M

arale Total PSS: 603.8 M
arale Used: 778 M
arale Swap: 0 M

* Step 6: Start the Telephone app, wait until it has loaded, then measure.

krillin Total PSS: 505.3 M
krillin Used: 533 M
krillin Swap: 14 M

arale Total PSS: 632.4 M
arale Used: 861 M
arale Swap: 0 M

These are just examples, but the general direction is always like this: Used is *much* higher than Total PSS on arale. I've seen edge cases where Used was nearly twice as high as Total PSS.

Changed in canonical-devices-system-image:
assignee: nobody → John McAleely (john.mcaleely)
Revision history for this message
John McAleely (john.mcaleely) wrote :

Discussing in IRC, Simon's hunch is that this may be the GPU. That certainly sounds plausible.

Revision history for this message
Sturm Flut (sturmflut) wrote :

After unsuccessfully wrangling with ftrace for a while, I decided to take a deeper look at the contents of /sys/kernel/debug and found the /sys/kernel/debug/ion/ directory, associated with the GPU driver. It has a subdirectory called "clients", which holds files for a couple of PIDs. Each of these entries contains two lines like the following:

       heap_name: size_in_bytes
     ion_mm_heap: 44253184

From the file name and labels I would suspect that this value shows the size of some internal buffer the ION GPU driver subsystem holds for a given PID. There are two additional files, "display-0" (maybe the display framebuffer) and "RGX-0" (RGX is the name of the GPU chip). After a reboot the following values are reported (I replaced PIDs with process names where applicable):

unity-system-compositor ion_mm_heap: 44253184
unity8 ion_mm_heap: 51843072
unity8-dash ion_mm_heap: 25300992
display-0 ion_mm_heap: 8847360
RGX-0 ion_mm_heap: 69554176

If display-0 really is the display framebuffer, then the value is spot on: 1920x1152 pixels at a depth of 32 bits per pixel gives exactly 8847360 bytes.
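
A quick check of that arithmetic, for reference (32 bpp = 4 bytes per pixel):

print(1920 * 1152 * 4)  # -> 8847360, exactly the display-0 value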

Now I repeated my measurements, and there is a correlation if I add the sum of all ion_mm_heap entries into the picture:

Step 1: Measure after a reboot

arale Total PSS: 382.6 M
arale Used: 572 M
Difference: ~189 M
ion_mm_heap sum: ~182 M

Step 2: Start the web browser

arale Total PSS: 432.5 M
arale Used: 668 M
Difference: ~235 M
ion_mm_heap sum: ~262 M

Step 3: Start a couple of random other apps (but keep note which you started)

arale Total PSS: 567.9 M
arale Used: 1135 M
Difference: ~567 M
ion_mm_heap sum: ~635 M

Step 4: Kill all running apps

arale Total PSS: 376.8 M
arale Used: 772 M
Difference: 395 M
ion_mm_heap sum: 182 M

Step 5: Restart the web browser and all the apps you started in Step 3

arale Total PSS: 580.6 M
arale Used: 1135 M
Difference: 554 M
ion_mm_heap sum: 700 M

Notice that beginning with Step 4, Difference and the ion_mm_heap sum start to drift apart quite a bit. I think this is because some PIDs show up multiple times in /sys/kernel/debug/ion/clients (stale entries?) and the sum calculation is fed wrong data. For example, I see the following files after the last step:

# ls /sys/kernel/debug/ion/clients/
1530-0 2441-0 4343-0 4481-0 857-12 857-15 857-5 857-8 RGX-0
1892-0 4232-0 4385-0 857-10 857-13 857-16 857-6 857-9
2287-0 4300-0 4414-0 857-11 857-14 857-4 857-7 display-0

PID 857 is the camera-app in my case; I somewhat doubt it really has 13 active entries.
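
For reference, a minimal sketch of how such an ion_mm_heap sum can be collected from debugfs (assuming the client files contain "heap_name: size_in_bytes" lines as shown earlier; this needs root, and grouping entries by their PID prefix is only one guess at how the duplicate "857-N" entries should be treated):

#!/usr/bin/env python3
# Sketch: sum the ion_mm_heap sizes of all clients under
# /sys/kernel/debug/ion/clients. Entries sharing a PID prefix are summed
# into one row so possible stale duplicates stay visible.
import os
from collections import defaultdict

CLIENTS = "/sys/kernel/debug/ion/clients"

def ion_mm_heap_bytes(path):
    """Return the ion_mm_heap size (in bytes) recorded in one client file."""
    with open(path) as f:
        for line in f:
            fields = line.split(":")
            if len(fields) == 2 and fields[0].strip() == "ion_mm_heap":
                return int(fields[1])
    return 0

per_client = defaultdict(int)
for name in os.listdir(CLIENTS):
    size = ion_mm_heap_bytes(os.path.join(CLIENTS, name))
    per_client[name.rsplit("-", 1)[0]] += size   # "857-12" -> "857"

for client, size in sorted(per_client.items(), key=lambda kv: -kv[1]):
    print("%-12s %12d bytes" % (client, size))
print("ion_mm_heap sum: %.1f M" % (sum(per_client.values()) / 1024.0 / 1024.0))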

Changed in canonical-devices-system-image:
assignee: John McAleely (john.mcaleely) → Yuan-Chen Cheng (ycheng-twn)
Changed in canonical-devices-system-image:
assignee: Yuan-Chen Cheng (ycheng-twn) → Alex Tu (alextu)
importance: Undecided → High
status: New → Confirmed
Revision history for this message
Alex Tu (alextu) wrote :

Attached is the debug record from a kernel to which I added debug messages in kernel/drivers/staging/android/ion/ion.c.

It looks like the ion_mm_heap value of each node under clients/ counts not only the buffers allocated by that client itself, but also buffers allocated by others.

For example, two camera clients (PIDs 3130 and 825) and pvrsrvctl (PID 830) all counted the same buffer (address = d9ce9b00, size = 786432).

This may explain why Step 5 in comment #2 shows "ion_mm_heap sum" > Difference.

But I still have no idea how to calculate the memory used by the kernel.
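
If per-buffer records (client, buffer address, size in bytes) can be pulled out of such a debug log, deduplicating by buffer address before summing would avoid counting a shared buffer once per client. A tiny sketch; the record format here is purely hypothetical:

# Hypothetical records: (client, buffer_address, size_in_bytes), e.g. parsed
# from the kind of debug output attached above.
def dedup_ion_total(records):
    """Count each distinct buffer address only once, however many clients map it."""
    seen = {}
    for client, address, size in records:
        seen[address] = size          # same address -> same underlying buffer
    return sum(seen.values())

# The shared buffer from the comment above: three clients all reference
# address d9ce9b00 (786432 bytes), but it is counted only once.
records = [
    ("camera-3130",   0xd9ce9b00, 786432),
    ("camera-825",    0xd9ce9b00, 786432),
    ("pvrsrvctl-830", 0xd9ce9b00, 786432),
]
print(dedup_ion_total(records))  # -> 786432, not 3 * 786432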

Revision history for this message
Alex Tu (alextu) wrote :

Attached are the changes I used to print the debug messages in comment #3.

Revision history for this message
Alex Tu (alextu) wrote :

Thanks to Colin for his detailed analysis; I paste his message from the mail below:

Fortunately, I have an arale to hand and I was able to do some simple
analysis on a cleanly booted phone with today's latest updates.

My summary is as follows:

The phone has 1927.992M of mappable pages.

* "free" is reporting:
  Used: 653540K (638.2MB)

* Inspection of kernel boot information:
  Kernel (text, bss, etc): 36280K = 35.4MB

* Inspection of /proc/meminfo:

  Kernel stack: 5744K = 5.6MB
  Cached pages (transient file data): 96548K = 94MB
  Buffers (general kernel buffering): 116K = 0.1MB
  Slab: 30432K = 29MB
  Page Tables: 7844K = 7.6MB
  Vmalloc: 111424K = 108.8MB
  MMap'd and shared pages: 400484K = 391MB
  Total: 652592K = 637.29M
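
A quick re-addition of those kB figures, for reference:

# Re-adding the quoted /proc/meminfo figures (all values in kB)
parts = [5744, 96548, 116, 30432, 7844, 111424, 400484]
print("%d K = %.2f M" % (sum(parts), sum(parts) / 1024.0))  # 652592 K = 637.30 M
print("'used' from free: %.1f M" % (653540 / 1024.0))       # 638.2 M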

So this almost accounts for all the "used" memory. Here is how I see it:

Kernel stack can only be reduced by dropping the number of processes and
threads. This isn't so big, so no worries over this.

Cached pages will drop when memory pressure gets high, the kernel will
throw away cached pages as these are just a file system cache.

Buffers are small kernel buffering pages, no issue with that.

Slab allocations are relatively large. This kernel does not have
/proc/slabinfo enabled (see CONFIG_SLABINFO). It would be useful to
enable /proc/slabinfo to get some idea where the 29MB of slab is being used.
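
If slabinfo were enabled, a rough per-cache ranking could be pulled out of it along these lines (a sketch, assuming the standard slabinfo 2.x column layout):

# Sketch: rank slab caches by approximate footprint once /proc/slabinfo
# (CONFIG_SLABINFO) is available. Data lines look like:
#   name active_objs num_objs objsize objperslab pagesperslab : tunables ...
entries = []
with open("/proc/slabinfo") as f:
    for line in f:
        if line.startswith("#") or line.startswith("slabinfo"):
            continue  # skip the version and header lines
        fields = line.split()
        name, num_objs, objsize = fields[0], int(fields[2]), int(fields[3])
        entries.append((num_objs * objsize, name))  # rough bytes per cache

for size, name in sorted(entries, reverse=True)[:15]:
    print("%10d K  %s" % (size // 1024, name))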

Finally, Vmalloc is using a large 108.8MB; the top vmalloc users are:

 Size K  Where used
  38916 disp_hal_allocate_framebuffer
  28748 binder_mmap
  19212 iotable_init
  10500 vm_map_ram
   2064 stp_dbg_init
   2060 ccci_config_modem
   1776 n_tty_open
   1256 OSMMUPxMap
   1040 create_log.constprop.7
   1040 cmdqCoreInitialize
   1028 zram_meta_alloc
    852 osal_malloc
    464 OSAllocMem
    288 pcpu_get_vm_areas
    268 MTKPP_Init
    260 atomic_pool_init
    212 wlanAdapterCreate
    212 AudDrv_Allocate_mem_Buffer
    196 mtk_afe_hdmi_probe
    188 nicAllocateAdapterMemory
    132 SyS_swapon
    128 _ex_mu3d_hal_alloc_qmu_mem
    104 md_cd_init
   etc...
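
For reference, a per-caller summary like the table above can be approximated from /proc/vmallocinfo; a minimal sketch (this assumes the standard vmallocinfo format and needs root; the table itself may have been produced with different tooling):

#!/usr/bin/env python3
# Sketch: aggregate /proc/vmallocinfo by caller. Lines look roughly like:
#   0xc0800000-0xc0900000 1048576 disp_hal_allocate_framebuffer+0x1c/0x44 ...
from collections import defaultdict

usage = defaultdict(int)
with open("/proc/vmallocinfo") as f:
    for line in f:
        fields = line.split()
        if len(fields) < 3 or not fields[1].isdigit():
            continue                      # skip entries without a size/caller
        size = int(fields[1])             # mapping size in bytes
        caller = fields[2].split("+")[0]  # "func+0x1c/0x44" -> "func"
        usage[caller] += size

print("%8s  %s" % ("Size K", "Where used"))
for caller, size in sorted(usage.items(), key=lambda kv: -kv[1])[:20]:
    print("%8d  %s" % (size // 1024, caller))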

So it looks like various kernel drivers are using some large chunks
of memory, as expected (e.g. display, binder, etc). It may be worth
checking these to see if they are legitimate. The MediaTek drivers are a
bit sloppy, so it may be worth checking whether the vm_map_ram, iotable_init,
binder_mmap and zram allocations are required for our phone.

MMap'd and shared pages are application mappings, see below for more
details. Needless to say, the biggest offenders are unity8-dash (105.2MB)
and unity8 (67.9MB), which account for 44% of the page mappings in terms
of PSS (proportional set size). The top offenders are:

PSS Size Application
  98.7 M unity8-dash
  59.7 M unity8
  25.0 M /usr/lib/evolution/evolution-calendar-factory
  17.4 M maliit-server
5832.0 K media-hub-server
7028.0 K /usr/lib/ubuntu-push-client/ubuntu-push-client
5284.0 K /usr/lib/arm-linux-gnueabihf/unity-scopes/scoperegistry
5476.0 K pulseaudio
3988.0 K unity-system-compositor
5428.0 K /sbin/dhclient
3324.0 ...
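
For reference, a ranking like this can be approximated by grouping per-process PSS (a sketch; needs root, and the process naming may differ from whatever tool produced the table above):

# Sketch: per-process PSS ranking from /proc/<pid>/smaps and /proc/<pid>/cmdline.
import glob, os

pss_by_name = {}
for pid_dir in glob.glob("/proc/[0-9]*"):
    try:
        with open(os.path.join(pid_dir, "cmdline"), "rb") as f:
            name = f.read().split(b"\0")[0].decode() or os.path.basename(pid_dir)
        pss_kb = 0
        with open(os.path.join(pid_dir, "smaps")) as f:
            for line in f:
                if line.startswith("Pss:"):
                    pss_kb += int(line.split()[1])
    except (PermissionError, FileNotFoundError):
        continue  # process exited or is not readable
    pss_by_name[name] = pss_by_name.get(name, 0) + pss_kb

for name, pss_kb in sorted(pss_by_name.items(), key=lambda kv: -kv[1])[:15]:
    print("%8.1f M  %s" % (pss_kb / 1024.0, name))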


Revision history for this message
Alex Tu (alextu) wrote :

Referring to comment #7, I don't think there is a memory-leak concern; the higher memory usage is caused by the MTK drivers.

The arale kernel can be found at https://github.com/meizuosc/m75, if anyone is interested.

Revision history for this message
Yuan-Chen Cheng (ycheng-twn) wrote :

Since the memory is used by drivers and we didn't find a memory leak, we currently have no plan to work further on this.

Marking it as Incomplete; I'll mark it as Won't Fix if there are no objections within a few days.

Changed in canonical-devices-system-image:
status: Confirmed → Incomplete
assignee: Alex Tu (alextu) → nobody
assignee: nobody → Yuan-Chen Cheng (ycheng-twn)
Changed in canonical-devices-system-image:
status: Incomplete → Won't Fix