[Regression]: Krillin gets to a state where OOM starts killing every app (also Dash)

Bug #1647982 reported by Andrea Bernabei
This bug affects 1 person
Affects                       Status     Importance  Assigned to        Milestone
Canonical System Image        Confirmed  High        Alejandro J. Cura
indicator-datetime (Ubuntu)   Confirmed  High        Charles Kerr

Bug Description

Krillin, rc-proposed/bq-aquaris.en, r486

Description:
In the last 3 weeks I managed to get the device into a state where it was completely unusable.
Every time I started an app (the *only* one currently running), it would get killed
by OOM a few seconds later, so I could not do anything.
At one point it started killing the Dash as well, since no apps were running and it had nothing else to kill.

A reboot fixes it, and once the device is in that state a reboot seems to be the only thing you can do.
I'm not sure how to reproduce this yet.

I experienced this problem about 3 times in the last couple of weeks. I already reported it to the Unity8 team when it happened, but it seems like the logs I provided were not enough to show the cause of the issue.

The logs show that some processes have a very high virtual memory size (VSZ), probably because they link to many shared libraries or load many plugins. That is not necessarily the cause, but it may be something we want to keep an eye on.
See account-polld and ubuntu-push-client: each has 900+ MB of VSZ.

Evidence of OOM killing Telegram and then the Dash:
"free -m" was showing 151 MB with no apps running, then when I tried to run one, it would get killed.

Nov 18 11:08:33 ubuntu-phablet kernel: [182604.420249] Free memory is -2912kB above reserved [gfp(0x200da)]
Nov 18 11:09:19 ubuntu-phablet kernel: [182650.251867]Killing 'telegram' (15175), adj 100,
Nov 18 11:09:19 ubuntu-phablet kernel: [182650.251871] to free 89460kB on behalf of 'QSGRenderThread' (15244) because
Nov 18 11:09:19 ubuntu-phablet kernel: [182650.251876] cache 65396kB is below limit 65536kB for oom_score_adj 12
Nov 18 11:09:19 ubuntu-phablet kernel: [182650.251881] Free memory is -2820kB above reserved [gfp(0x200da)]

Nov 18 11:30:37 ubuntu-phablet kernel: [183928.542584]select 'unity8-dash' (1385), adj 50, size 8384, to kill
Nov 18 11:30:37 ubuntu-phablet kernel: [183928.542617]Killing 'unity8-dash' (1385), adj 50,
Nov 18 11:30:37 ubuntu-phablet kernel: [183928.542623] to free 33536kB on behalf of 'lsb_release' (1428) because
Nov 18 11:30:37 ubuntu-phablet kernel: [183928.542629] cache 65248kB is below limit 65536kB for oom_score_adj 12
Nov 18 11:30:37 ubuntu-phablet kernel: [183928.542634] Free memory is -2900kB above reserved [gfp(0x200da)]
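
To compare kill records like these across log captures, here is a minimal sketch that extracts the victim, its adj value, the amount to free, and the requesting process from each two-line record. The regular expressions are assumptions based only on the lines quoted in this report, so other kernels may format these messages differently. `parse_kills` and `SAMPLE` are illustrative names, not part of any existing tool.

```python
import re

# Assumed formats, taken only from the log lines quoted in this report.
KILL_RE = re.compile(r"Killing '(?P<victim>[^']+)' \((?P<pid>\d+)\), adj (?P<adj>-?\d+)")
FREE_RE = re.compile(r"to free (?P<kb>\d+)kB on behalf of '(?P<requester>[^']+)' \((?P<rpid>\d+)\)")


def parse_kills(lines):
    """Return one dict per 'Killing ...' record found in the given log lines."""
    kills = []
    pending = None  # record still waiting for its "to free ..." line
    for line in lines:
        m = KILL_RE.search(line)
        if m:
            pending = {"victim": m["victim"], "pid": int(m["pid"]), "adj": int(m["adj"])}
            kills.append(pending)
            continue
        m = FREE_RE.search(line)
        if m and pending is not None:
            pending["freed_kb"] = int(m["kb"])
            pending["on_behalf_of"] = m["requester"]
            pending = None
    return kills


# Lines copied from the log excerpt in this report.
SAMPLE = [
    "Nov 18 11:09:19 ubuntu-phablet kernel: [182650.251867]Killing 'telegram' (15175), adj 100,",
    "Nov 18 11:09:19 ubuntu-phablet kernel: [182650.251871] to free 89460kB on behalf of 'QSGRenderThread' (15244) because",
    "Nov 18 11:30:37 ubuntu-phablet kernel: [183928.542617]Killing 'unity8-dash' (1385), adj 50,",
    "Nov 18 11:30:37 ubuntu-phablet kernel: [183928.542623] to free 33536kB on behalf of 'lsb_release' (1428) because",
]

for kill in parse_kills(SAMPLE):
    print(kill["victim"], kill["adj"], kill["freed_kb"], kill["on_behalf_of"])
```

Feeding it a whole syslog capture (one string per line) should give a quick list of what was killed and on whose behalf, without reading the raw log by eye.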

TOP RAM EATERS:
phablet@ubuntu-phablet:~$ ps aux --sort=-%mem | awk 'NR<=20{print $0}'
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
phablet 2998 0.7 13.3 856732 131420 ? Ssl Nov14 44:56 unity8 --mode=full-greeter
phablet 5292 6.6 9.8 525300 96952 ? Ssl 11:32 0:15 unity8-dash --desktop_file_hint=unity8-dash.desktop
phablet 2163 0.1 1.8 26096 17996 ? Ss Nov14 11:36 dbus-daemon --fork --session --address=unix:abstract=/tmp/dbus-VwsPw6LOmi
phablet 3706 0.2 1.7 357620 17064 ? Ssl Nov14 15:48 maliit-server
phablet 3721 0.1 1.7 935372 16760 ? Ssl Nov14 6:38 /usr/bin/account-polld
phablet 3828 0.0 1.5 219696 15360 ? Sl Nov14 2:17 /usr/lib/evolution/evolution-calendar-factory
phablet 3696 0.0 1.5 137904 15140 ? Ssl Nov14 4:40 /usr/lib/arm-linux-gnueabihf/sync-monitor/sync-monitor
phablet 3623 0.0 0.9 193676 9592 ? Ssl Nov14 3:56 /usr/lib/arm-linux-gnueabihf/indicator-datetime/indicator-datetime-service
phablet 2941 0.0 0.8 146332 8028 ? Ssl Nov14 0:22 /usr/lib/arm-linux-gnueabihf/address-book-service/address-book-service
phablet 3694 0.0 0.8 976116 7904 ? Ssl Nov14 1:58 /usr/lib/ubuntu-push-client/ubuntu-push-client
phablet 2959 0.0 0.6 97412 6836 ? S Nov14 1:45 /usr/bin/history-daemon
root 1619 0.3 0.4 111084 4624 ? Ssl Nov14 20:10 NetworkManager
phablet 3768 0.0 0.4 394092 4360 ? Ssl Nov14 1:33 /usr/lib/arm-linux-gnueabihf/unity-scopes/smartscopesproxy upstart
phablet 20493 0.0 0.3 52036 3928 ? Sl 11:18 0:00 /usr/lib/arm-linux-gnueabihf/thumbnailer/thumbnailer-service
phablet 3814 0.0 0.3 63724 3912 ? Ssl Nov14 1:14 /usr/lib/arm-linux-gnueabihf/indicator-network/indicator-network-service
phablet 3861 0.0 0.3 352740 3848 ? Ssl Nov14 1:28 /usr/lib/arm-linux-gnueabihf/unity-scopes/scoperegistry
phablet 2794 0.0 0.3 131456 3712 ? Ssl Nov14 0:10 /usr/bin/telephony-service-indicator
root 1949 0.1 0.3 178760 3408 ? Sl Nov14 11:05 unity-system-compositor --disable-overlays=false --spinner=/usr/bin/unity-system-compositor-spinner --file /run/mir_socket --from-dm-fd 10 --to-dm-fd 13 --vt 1
phablet 7245 0.0 0.3 120452 3408 ? Sl Nov14 0:09 /usr/bin/telephony-service-approver

Andrea Bernabei (faenil)
summary: - [Regression]: OOM gets to the point of killing dash for no apparent
- reason
+ [Regression]: Krillin gets to a state where OOM starts killing every app
+ (also Dash)
Pat McGowan (pat-mcgowan) wrote :

@andrea fwiw there have not been any other reports so we will need a way to reproduce it.

Changed in canonical-devices-system-image:
assignee: nobody → Pat McGowan (pat-mcgowan)
status: New → Incomplete
Andrea Bernabei (faenil) wrote :

I have it again right now.
I think we have a clue: indicator-datetime-service is using 52% of the memory :)

The Dash is dead, i.e. I see indicators and launcher, but the rest is black.

I open apps, but they are killed by OOM.
phablet@ubuntu-phablet:~$ free -m
             total       used       free     shared    buffers     cached
Mem:           960        937         22          0          0         34
-/+ buffers/cache:        902         57
Swap:          511        396        115

phablet@ubuntu-phablet:~$ ps aux --sort=-%mem | awk 'NR<=20{print $0}'
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
phablet 3038 22.0 52.3 803160 514864 ? Rsl 08:21 53:28 /usr/lib/arm-linux-gnueabihf/indicator-datetime/indicator-datetime-service
phablet 2735 1.3 5.4 601244 53516 ? Ssl 08:21 3:12 unity8 --mode=full-greeter
phablet 3149 26.7 2.3 415928 22768 ? Sl 08:21 64:53 /usr/lib/evolution/evolution-calendar-factory
phablet 3081 0.3 0.9 332396 9152 ? Ssl 08:21 0:47 maliit-server
phablet 22687 5.2 0.6 101664 6424 ? Sl 12:20 0:10 /usr/lib/arm-linux-gnueabihf/syncevolution/syncevo-dbus-helper --dbus-verbosity 3
phablet 3073 0.3 0.3 129248 3292 ? Ssl 08:21 0:54 /usr/lib/arm-linux-gnueabihf/sync-monitor/sync-monitor
phablet 2159 1.7 0.3 8320 3076 ? Ss 08:20 4:15 dbus-daemon --fork --session --address=unix:abstract=/tmp/dbus-gGhieosHGl
root 1588 0.9 0.2 110956 2920 ? Ssl 08:20 2:17 NetworkManager
phablet 3152 0.0 0.2 54428 2896 ? Ssl 08:21 0:08 /usr/lib/arm-linux-gnueabihf/indicator-network/indicator-network-service
phablet 21142 0.0 0.2 113208 2576 ? Sl 11:03 0:00 /usr/bin/telephony-service-approver
phablet 2642 0.0 0.2 132572 2572 ? Ssl 08:21 0:01 /usr/bin/telephony-service-indicator
message+ 859 0.6 0.2 7040 2404 ? Ss 08:20 1:28 dbus-daemon --system --fork
phablet 13067 0.2 0.2 78700 2332 ? Sl 11:59 0:03 /usr/lib/arm-linux-gnueabihf/syncevolution/syncevo-dbus-server
phablet 22697 1.1 0.2 84364 2160 ? Sl 12:20 0:02 /usr/lib/arm-linux-gnueabihf/syncevolution/syncevo-local-sync
root 1991 0.2 0.2 178760 2080 ? Sl 08:20 0:31 unity-system-compositor --disable-overlays=false --spinner=/usr/bin/unity-system-compositor-spinner --file /run/mir_socket --from-dm-fd 10 --to-dm-fd 13 --vt 1
whoopsie 1895 0.0 0.2 63264 2044 ? Ssl 08:20 0:04 whoopsie -f
phablet 2673 0.0 0.2 146168 2028 ? Sl 08:21 0:07 /usr/lib/arm-linux-gnueabihf/address-book-service/address-book-service
phablet 3592 0.0 0.2 73240 2016 ? Ssl 08:21 0:00 /usr/bin/msyncd
phablet 2653 0.0 0.1 60480 1964 ? S 08:21 0:00 /usr/bin/telephony-service-handler
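
As a sanity check on the 52% figure, here is a small sketch that parses the indicator-datetime-service row above and works out the total RAM it implies. It assumes `ps` computes %MEM as RSS divided by physical memory; `parse_ps_row` is an illustrative helper, not an existing tool.

```python
def parse_ps_row(row):
    """Split one `ps aux` row into its 11 columns; COMMAND may contain spaces."""
    user, pid, cpu, mem, vsz, rss, tty, stat, start, cputime, command = row.split(None, 10)
    return {
        "pid": int(pid),
        "mem_pct": float(mem),
        "vsz_kb": int(vsz),
        "rss_kb": int(rss),
        "command": command,
    }


# Row copied from the ps listing in this comment.
ROW = ("phablet 3038 22.0 52.3 803160 514864 ? Rsl 08:21 53:28 "
       "/usr/lib/arm-linux-gnueabihf/indicator-datetime/indicator-datetime-service")

proc = parse_ps_row(ROW)
# Assumption: %MEM = RSS / physical RAM, so RSS / (%MEM / 100) estimates total RAM.
implied_total_mib = proc["rss_kb"] / (proc["mem_pct"] / 100) / 1024
name = proc["command"].rsplit("/", 1)[-1]
print(f"{name}: RSS {proc['rss_kb'] / 1024:.0f} MiB, implied total RAM about {implied_total_mib:.0f} MiB")
```

The implied total lands close to the 960 MB that "free -m" reports, which suggests the 52% is resident memory in this one process rather than inflated virtual size.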

Changed in canonical-devices-system-image:
assignee: Pat McGowan (pat-mcgowan) → Alejandro J. Cura (alecu)
status: Incomplete → Confirmed
Andrea Bernabei (faenil) wrote :

I managed to reproduce the problem by triggering
https://bugs.launchpad.net/ubuntu/+source/network-manager/+bug/1580146

i.e. I just connect to the WiFi in the London office. After a few minutes the device stays connected to the AP but cannot actually access the internet (IPv6 kernel problems, afaik), so calendar sync starts failing.

At that point, I see dbus-monitor being spammed with events, and the memory taken by indicator-datetime-service increases until settling at 40-50% of the total memory in "ps", once the spamming ends.
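
To put numbers on that growth while reproducing, here is a minimal sketch that samples a process's resident and virtual size from /proc at a fixed interval. It assumes a Linux /proc filesystem with the usual `VmRSS` and `VmSize` lines in `/proc/<pid>/status`. The function names, interval, and sample count are illustrative choices.

```python
import time


def parse_vm_fields(status_text):
    """Extract VmSize and VmRSS (in kB) from the text of /proc/<pid>/status."""
    fields = {}
    for line in status_text.splitlines():
        key, _, rest = line.partition(":")
        if key in ("VmSize", "VmRSS"):
            fields[key] = int(rest.split()[0])  # the value is followed by "kB"
    return fields


def sample_rss(pid, interval_s=5.0, count=12):
    """Print a timestamped VmRSS/VmSize reading for pid every interval_s seconds."""
    for _ in range(count):
        with open(f"/proc/{pid}/status") as f:
            fields = parse_vm_fields(f.read())
        print(time.strftime("%H:%M:%S"), fields.get("VmRSS"), fields.get("VmSize"))
        time.sleep(interval_s)
```

On the device this could be called as `sample_rss(pid)` with the current pid of indicator-datetime-service taken from "ps". Logging the readings next to the dbus-monitor output should show whether the growth starts with the event spam or only after it.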

A stack trace without debug symbols shows a malloc (as expected); I need to try again with more debug symbols:
Thread 1 (Thread 0xb3ed9000 (LWP 2978)):
#0 0xb6897dea in ?? () from /lib/arm-linux-gnueabihf/libc.so.6
#1 0xb689995e in malloc () from /lib/arm-linux-gnueabihf/libc.so.6
#2 0xb6930000 in ?? () from /lib/arm-linux-gnueabihf/libc.so.6
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

Andrea Bernabei (faenil) wrote :

I confirm I can reliably reproduce the problem. I just have to connect to WiFi and wait for the IPv6 problem to kick in, leaving me connected to the WiFi AP but without an actual internet connection.

I believe you should be able to reproduce it by creating a fake AP, or connecting the device to your working AP and then unplugging the modem from the internet

Andrea Bernabei (faenil) wrote :

I installed:
libglib2.0-0-dbg
evolution-data-server-dbg
libc6-dbg

Here are 2 more stacktraces I saved after a reboot.
I'm no longer sure the problem is triggered by the connection failing:
the internet was working this time, but I still had the same problem.

indicator-datetime-service stays at about 4% memory, eds at 6%, for a few minutes, during which I see spamming of GetObject calls in dbus-monitor.

At some point datetime-service then starts allocating memory and slowly grows until it reaches a limit.

Here are 2 stacktraces I saved while datetime-service was allocating memory:
https://pastebin.canonical.com/173275/
https://pastebin.canonical.com/173276/

After a few minutes, gdb reported that 10 threads exited.
At that point datetime-service stopped allocating.

Hope that helps!

Changed in indicator-datetime (Ubuntu):
status: New → Confirmed
importance: Undecided → High
assignee: nobody → Charles Kerr (charlesk)
Changed in canonical-devices-system-image:
importance: Undecided → High