unity8 leaks file descriptors cause unresponsive ui but power button works display

Bug #1495871 reported by Alexandros Frantzis
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Canonical System Image
Fix Released
High
kevin gunn
qtmir (Ubuntu)
Fix Released
High
Daniel d'Andrada
ubuntu-app-launch (Ubuntu)
Fix Released
Critical
Ted Gould

Bug Description

unity8 leaks file descriptors in various cases, causing the process to eventually fail (hang and/or restart). These leaks were first discovered during investigations of https://bugs.launchpad.net/canonical-devices-system-image/+bug/1491566 , but it's not clear yet if they are indeed responsible for that particular bug. In any case, this is a critical problem, and should be addressed independently.

The fds are leaked in various scenarios, including:

* opening/closing apps
* switching between apps
* locking the screen

Tags: hotfix

Related branches

Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

For completeness, here are Jean-Baptiste's findings:

"I reproduce the issue by following the "fd leak" lead with the following test case:

1. Open 2 apps
2. Switch from the app scope to app1
3. Switch from app1 to the app scope
4. Switch from the app scope to app2
5. Lock the screen (optional, but it leaks even more fds)

At the same time monitor the number of file descriptors used by unity8:
$ watch -n1 "ls /proc/$(pidof unity8)/fdinfo/| wc -l "

After a while unity8 runs out of fd and the system hangs. Eventually unity8 will restart a moment later."

Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

As mentioned in the description, one way to leak fds is just opening and closing an app. It leaks an fd of type 'anon_inode:[eventfd]' per app connection. I am attaching a simple script I used to reproduce this and track it over multiple app invocations.

Sample output from a run (omitting the lsof and /proc/*/fd information) is:

#### Starting at 2015-09-15T08:10:55+0000 ####
# Before fds: 119
# During fds: 131
# After fds: 120
#### Starting at 2015-09-15T08:11:02+0000 ####
# Before fds: 120
# During fds: 133
# After fds: 121
#### Starting at 2015-09-15T08:11:09+0000 ####
# Before fds: 121
# During fds: 134
# After fds: 122
#### Starting at 2015-09-15T08:11:16+0000 ####
# Before fds: 122
# During fds: 135
# After fds: 123
...

You can clearly see unity8 leaking one fd every time the app connects and disconnects.
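The attached script itself is not reproduced on this page; a minimal sketch of the same idea, assuming the ubuntu-app-launch/ubuntu-app-stop CLI tools and a placeholder app id (both assumptions; adjust for your image), might look like:

```shell
#!/bin/sh
# Hypothetical reconstruction of the repro script (the original attachment
# is not shown here). APP and the ubuntu-app-launch/ubuntu-app-stop tools
# are assumptions; adjust for what is installed on the device.
APP="dialer-app"

count_fds() {
    # Count open file descriptors of a process via /proc/<pid>/fd
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

U8_PID=$(pidof unity8)
echo "#### Starting at $(date -u +%Y-%m-%dT%H:%M:%S%z) ####"
echo "# Before fds: $(count_fds "$U8_PID")"
if command -v ubuntu-app-launch >/dev/null 2>&1; then
    ubuntu-app-launch "$APP" && sleep 3
    echo "# During fds: $(count_fds "$U8_PID")"
    ubuntu-app-stop "$APP" && sleep 3
fi
echo "# After fds: $(count_fds "$U8_PID")"
```

Run in a loop, a steadily climbing "After" number indicates the per-connection leak.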

Revision history for this message
Albert Astals Cid (aacid) wrote :

I don't see any eventfd() call in unity8; Mir has at least two eventfd calls.

Revision history for this message
Albert Astals Cid (aacid) wrote :

According to alf_ it can't be mir since those eventfd calls are stored in a Mir::Fd that takes care of closing them

Changed in canonical-devices-system-image:
assignee: nobody → kevin gunn (kgunn72)
status: New → Confirmed
importance: Undecided → Critical
milestone: none → ww40-2015
tags: added: hotfix
Revision history for this message
Albert Astals Cid (aacid) wrote :

Investigating; it seems pretty clear Mir is not to blame here.

Changed in mir (Ubuntu):
status: New → Invalid
Revision history for this message
Gerry Boland (gerboland) wrote :

Watching the eventfd2 & close calls to see which one leaks:

sudo strace -p `pidof unity8` 2>&1 | grep -P "(eventfd|close)"

I get on app open:

eventfd2(0, O_NONBLOCK|O_CLOEXEC) = 118
eventfd2(0, O_NONBLOCK|O_CLOEXEC) = 117
close(117) = 0
eventfd2(0, O_NONBLOCK|O_CLOEXEC) = 116
close(116) = 0

on app close:

eventfd2(0, O_NONBLOCK|O_CLOEXEC) = 109
close(109) = 0
eventfd2(0, O_NONBLOCK|O_CLOEXEC) = 109
close(109) = 0
eventfd2(0, O_NONBLOCK|O_CLOEXEC) = 109
close(109) = 0
eventfd2(0, O_NONBLOCK|O_CLOEXEC) = 109
close(109) = 0

I saw no close call for FD 118 - the first one.
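A capture like this can be post-processed to report exactly which eventfd2() return values never see a matching close(). A sketch, using the sample lines from the app-open trace above:

```shell
# Save a capture first, e.g.:
#   sudo strace -p "$(pidof unity8)" -e trace=eventfd2,close -o trace.log
# Sample lines from the app-open capture above:
cat > trace.log <<'EOF'
eventfd2(0, O_NONBLOCK|O_CLOEXEC) = 118
eventfd2(0, O_NONBLOCK|O_CLOEXEC) = 117
close(117) = 0
eventfd2(0, O_NONBLOCK|O_CLOEXEC) = 116
close(116) = 0
EOF

# Remember every fd returned by eventfd2(); forget it when it is close()d.
leaked=$(awk '
    /^eventfd2\(/ { open[$NF] = 1 }
    /^close\(/    { fd = $1; gsub(/[^0-9]/, "", fd); delete open[fd] }
    END           { for (fd in open) print "never closed: fd " fd }
' trace.log)
echo "$leaked"    # → never closed: fd 118
```

This confirms mechanically what the manual reading above found: fd 118, the first eventfd2 on app open, is the one left behind.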

In gdb the first eventfd call on app launch has this backtrace:

Breakpoint 1, 0xb614ae36 in eventfd () from /lib/arm-linux-gnueabihf/libc.so.6
(gdb) bt
#0 0xffffffff in eventfd () at /lib/arm-linux-gnueabihf/libc.so.6
#1 0xffffffff in g_wakeup_new () at /build/buildd/glib2.0-2.44.1/./glib/gwakeup.c:146
#2 0xffffffff in g_main_context_new () at /build/buildd/glib2.0-2.44.1/./glib/gmain.c:624
#3 0xffffffff in cgroup_manager_connection () at /build/ubuntu-app-launch-qvXALV/ubuntu-app-launch-0.5+15.04.20150827/helpers-shared.c:153
#4 0xffffffff in pids_for_appid (appid=appid@entry=0xa1f1e8 "dialer-app") at /build/ubuntu-app-launch-qvXALV/ubuntu-app-launch-0.5+15.04.20150827/libubuntu-app-launch/ubuntu-app-launch.c:1515
#5 0xffffffff in ubuntu_app_launch_pid_in_app_id (pid=17487, appid=0xa1f1e8 "dialer-app") at /build/ubuntu-app-launch-qvXALV/ubuntu-app-launch-0.5+15.04.20150827/libubuntu-app-launch/ubuntu-app-launch.c:1576
#6 0xffffffff in qtmir::upstart::ApplicationController::appIdHasProcessId(int, QString const&) (this=<optimized out>, pid=17487, appId=...) at /home/gerry/dev/projects/qtmir/multimonitor/src/modules/Unity/Application/upstart/applicationcontroller.cpp:180
#7 0xffffffff in qtmir::TaskController::appIdHasProcessId(QString const&, unsigned long long) const (this=this@entry=0x61b138, appId=..., pid=<optimized out>) at /home/gerry/dev/projects/qtmir/multimonitor/src/modules/Unity/Application/taskcontroller.cpp:94
#8 0xffffffff in qtmir::ApplicationManager::authorizeSession(unsigned long long, bool&) (this=0x618328, pid=<optimized out>, authorized=@0xadefeb2c: false) at /home/gerry/dev/projects/qtmir/multimonitor/src/modules/Unity/Application/application_manager.cpp:557
#9 0xffffffff in QMetaCallEvent::placeMetaCall(QObject*) (a=<optimized out>, r=0x618328, this=0x62c248) at ../../include/QtCore/../../src/corelib/kernel/qobject_impl.h:124
#10 0xffffffff in QMetaCallEvent::placeMetaCall(QObject*) (this=<optimized out>, object=0x618328) at kernel/qobject.cpp:483
#11 0xffffffff in QObject::event(QEvent*) (this=<optimized out>, e=<optimized out>) at kernel/qobject.cpp:1245
#12 0xffffffff in QCoreApplication::notify(QObject*, QEvent*) (this=<optimized out>, receiver=<optimized out>, event=<optimized out>) at kernel/qcoreapplication.cpp:997
#13 0xffffffff in QCoreApplication::notifyInternal(QObject*, QEvent*) (this=0x518f00, receiver=receiver@entr...


kevin gunn (kgunn72)
Changed in ubuntu-app-launch (Ubuntu):
assignee: nobody → Ted Gould (ted)
importance: Undecided → Critical
status: New → Confirmed
no longer affects: mir (Ubuntu)
no longer affects: unity8 (Ubuntu)
Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

From Gerry's backtrace it is clear that the fd leak we get when opening/closing apps can be traced back to u-a-l.

However, Jean-Baptiste found a few other scenarios (see first comment) that lead to leaked fds. Have we confirmed that all the leaks in these scenarios can be traced back to this u-a-l leak? If not, we need to ensure this is reflected in the bug status (by keeping unity8 as affected), or opening a new bug report.

kevin gunn (kgunn72)
summary: - unity8 leaks file descriptors
+ unity8 leaks file descriptors cause unresponsive ui but power button
+ works display
Revision history for this message
kevin gunn (kgunn72) wrote :

Had a question earlier today about whether this was also on stable (OTA6); I just confirmed it is.

Changed in ubuntu-app-launch (Ubuntu):
status: Confirmed → In Progress
Revision history for this message
Albert Astals Cid (aacid) wrote :

> Have we confirmed that all the leaks in these scenarios can be traced back to this u-a-l leak?
No we have not; it'll be easier once this one has been fixed, but given we don't use eventfd or glib extensively, there's a high chance this is not in unity8 itself.

> If not, we need to ensure this is reflected in the bug status (by keeping unity8 as affected), or opening a new bug report.
I guess this is a matter of taste, feel free to re-add it if you prefer.

Revision history for this message
Alexandros Frantzis (afrantzis) wrote :

>> If not, we need to ensure this is reflected in the bug status (by keeping unity8 as affected), or opening a new bug report.
> I guess this is a matter of taste, feel free to re-add it if you prefer.

I just want to make sure we track this properly, not close it when the u-a-l part is fixed (unless, of course, u-a-l turns out to be the source of all our leak instances). I guess tracking it in "Canonical System Image" should be enough for this.

Revision history for this message
Albert Astals Cid (aacid) wrote :

I can confirm that Ted's branch seems to fix all the fd leaking for me. A second source would be cool.

It's a bit unfortunate we need ~8 fds per app though; maybe we need to optimize that at some point in the future.

Revision history for this message
Albert Astals Cid (aacid) wrote :

s/second source/second confirmation

Revision history for this message
Ted Gould (ted) wrote : Re: [Bug 1495871] Re: unity8 leaks file descriptors cause unresponsive ui but power button works display

On Wed, 2015-09-16 at 15:22 +0000, Albert Astals Cid wrote:

> It's a bit unfortunate we need ~8 fd per app though, may we need to
> optimize that at some point in the future.

It will go down a bunch with the switch to systemd as we can request on
the system bus instead of building a custom dbus peer-to-peer connection
with CGManager.

Revision history for this message
kevin gunn (kgunn72) wrote :

using ted's branch
https://code.launchpad.net/~ted/ubuntu-app-launch/lp1495871-unref-context-15.04/+merge/271373
which is conveniently available in silo 60
i tested using the script above to open/close apps...which kept the number of open fd's to a steady state

however, I then modified the script to not open/close apps and just print out the unity8 process's number of open fds.
I then manually opened as many apps as possible/available on the basic image.
I then watched the ubuntu-app-watch output & ubuntu-app-list to make sure apps were getting closed due to memory pressure. I then would use the spread and pull up the very "last" or front app in the spread (e.g. all the way to the right in the spread) to make sure I was cycling between all the apps available in the spread.
When I started with no apps open it reported 135 fds open; at some point after opening all the apps fresh (no toggling yet) it was at ~200+. Upon toggling, the number of fds grew to 480; I stopped toggling and allowed steady state/screen blanking... it remained at 480. Then I swiped away every available app in the spread, which reduced the number of open fds down to 255. So that's 120 fds orphaned. Not sure if these are associated with screenshots in the spread or ual somehow.

Revision history for this message
kevin gunn (kgunn72) wrote :

Just out of curiosity, I set up the phone sim and sent text messages, since that's when this seems to happen the most.
The fds for unity8 would jump up by one but eventually drop back down by one.
I also watched the fd count for message+ & indicator-message-service, which stayed constant while sending text messages - although maybe these aren't the right processes to watch?

Revision history for this message
kevin gunn (kgunn72) wrote :

OK, watched pulseaudio and media-hub-server using the phonesim.
pulseaudio was fine.
media-hub-server was leaking fds: 2 per text message alert & phone call.

Changed in media-hub (Ubuntu):
importance: Undecided → Critical
Revision history for this message
Jean-Baptiste Lallement (jibel) wrote :

If it's useful to anyone, here is a small script to track the number of fds used per process. By default it takes a sample every 5 minutes. Then you can diff the logs to find which process is suspicious.
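The attachment is not reproduced on this page; a minimal sketch of the same idea (snapshot every process's fd count so successive logs can be diffed) might look like:

```shell
#!/bin/sh
# Log "fd-count pid command" for every process, highest counts first.
# One log file per run; diff successive logs to spot the growing process.
LOGDIR=${LOGDIR:-/tmp/fdlogs}
mkdir -p "$LOGDIR"

snapshot() {
    for fddir in /proc/[0-9]*/fd; do
        pid=${fddir#/proc/}; pid=${pid%/fd}
        n=$(ls "$fddir" 2>/dev/null | wc -l)
        # cmdline is NUL-separated; flatten it for display
        cmd=$(tr '\0' ' ' 2>/dev/null < "/proc/$pid/cmdline")
        printf '%s %s %s\n' "$n" "$pid" "$cmd"
    done | sort -rn
}

LOG="$LOGDIR/fds-$(date +%Y%m%dT%H%M%S).log"
snapshot > "$LOG"
# To sample every 5 minutes as the original script does:
#   while :; do snapshot > "$LOGDIR/fds-$(date +%Y%m%dT%H%M%S).log"; sleep 300; done
```

`diff` of two logs then shows which pid's count keeps climbing between samples.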

Revision history for this message
Albert Astals Cid (aacid) wrote :

The leaks in unity8 when the memory pressure is high can be easily reproduced by:
 * Be on dash
 * Open an app
 * Switch to the dash
 * Use kill -9 to kill the app

We're leaking an anon_inode:dmabuf fd each time that happens.
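To watch that specific fd type accumulate while repeating the steps above, one can resolve the /proc fd symlinks (a sketch; assumes a user that can read unity8's /proc entry):

```shell
# Count a process's fds whose link target matches a pattern,
# e.g. anon_inode:dmabuf.
count_fd_type() {  # usage: count_fd_type <pid> <pattern>
    total=0
    for fd in /proc/"$1"/fd/*; do
        target=$(readlink "$fd" 2>/dev/null) || continue
        case $target in *"$2"*) total=$((total + 1)) ;; esac
    done
    echo "$total"
}

# Watch the dmabuf count grow as apps are killed:
count_fd_type "$(pidof unity8)" 'anon_inode:dmabuf'
```

Running this before and after each kill -9 makes the one-fd-per-kill growth easy to see.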

Revision history for this message
kevin gunn (kgunn72) wrote :

removing media-hub, spawning bug 1496894

no longer affects: media-hub (Ubuntu)
Revision history for this message
Albert Astals Cid (aacid) wrote :

The anon_inode:dmabuf fd I mention in #18 is no longer leaked if we make ApplicationScreenshotProvider::requestImage return an empty QImage, and is still leaked if we return a QImage with junk data (without calling mir::screenshot), so it seems it's not Mir.

I'm now investigating whether it's qtmir, unity8, Qt, or not an fd leak at all but just Qt's image cache in use.

Changed in canonical-devices-system-image:
status: Confirmed → In Progress
Revision history for this message
Albert Astals Cid (aacid) wrote :

Adding qtmir since I can make the anon_inode:dmabuf fd not leak by making qtmir not free the session early when the app is killed.

Revision history for this message
Albert Astals Cid (aacid) wrote :

FWIW the ubuntu-app-launch fix has landed in the vivid image

kevin gunn (kgunn72)
Changed in ubuntu-app-launch (Ubuntu):
status: In Progress → Fix Released
Changed in qtmir (Ubuntu):
importance: Undecided → Critical
Revision history for this message
Albert Astals Cid (aacid) wrote :

I've found two ways (which are effectively the same) to work around the fd leak caused by killing an app (comment #18):
 * removing surfaceItem.surface = null; in unity8 SurfaceContainer.qml
 * removing the 3 deleteLater in qtmir session.cpp

Both have the same effect: qtmir's session and mirsurface won't be freed when the application is killed, only when the user actually swipes it down from the side spread; with that change, the fd leak is gone.

It has two problems:
 * We are delaying freeing memory resources, and since the most common occurrence of apps being killed is by the OOM killer, this is not good
 * If you try to restart the app without swiping down from the side spread, unity8 will crash because the mir surface item has bad data. I think this could be fixed, but the first problem seems strong enough to disqualify these solutions.

The workarounds seem to imply that something is wrong with the early free done for killed apps.

I've read the code and to my non-domain-expert eye it all seems good.

Valgrind could not find any memory or fd leak either.

Maybe someone from the mir/qtmir fields can have a look at the early free code for killed apps and see if they see anything wrong?

Revision history for this message
kevin gunn (kgunn72) wrote :

Unfortunately Gerry's probably the best expert but he's out; assigning to Daniel in hopes he can make some progress.

Changed in qtmir (Ubuntu):
assignee: nobody → Daniel van Vugt (vanvugt)
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Using the steps from comment #18 on arale, I find roughly three fds are leaked at a time:

> lrwx------ 1 phablet phablet 64 Sep 22 07:49 135 -> anon_inode:dmabuf
> lrwx------ 1 phablet phablet 64 Sep 22 07:49 138 -> /dev/pvrsrvkm
> lrwx------ 1 phablet phablet 64 Sep 22 07:49 139 -> /dev/pvrsrvkm

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Apparently we fixed a very similar bug on desktop two years ago --> bug 1185183

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Invalid for qtmir, but it does seem to be a Mir issue... See bug 1498361.

Changed in qtmir (Ubuntu):
status: New → Invalid
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Actually there might still be a qtmir-specific leak but we need to fix the leaks seen in plain Mir too --> bug 1498361

Changed in qtmir (Ubuntu):
status: Invalid → New
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Considering bug 1498816, and the new-ish bufferstream logic in Mir's SessionMediator, I'm suspicious that Mir might just be indefinitely bloated, never releasing some resources until shutdown. It's a bit scary that the bufferstream logic is passing around raw pointers without any clear ownership. We may well be holding some for too long. Which is exactly the kind of resource that would be seen as "anon_inode:dmabuf".

Revision history for this message
Alan Griffiths (alan-griffiths) wrote :

Mir does associate resources with a client socket that are not released until one of two things happen:

1. The client explicitly disconnects
2. An operation (like the outstanding read attempt) on the socket reports an error

AFAIK there is no *guarantee* that the latter happens in a timely manner when a client disconnects abruptly.

Changed in qtmir (Ubuntu):
status: New → In Progress
affects: qtmir (Ubuntu) → qtmir
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Note that tools like valgrind (memcheck) are unlikely to find the problem. Because of C++ and RAII, the offending resources get released on shutdown and are never actually leaked. So instead of leaks, you have to hunt for "bloat" and use a memory/resource profiler.

I got distracted for a long time finding lots of similar but unrelated erratic bloat in Mir demo servers on Android. But I think those are specific to the Mir demos and not Unity8. It seems plausible Unity8 is avoiding some Mir logic that's bloaty in its own right.

Next stop: rebuild a qtmir development environment and investigate the snapshot stuff. In theory, if the snapshot code is leaking a buffer, that would explain the anon_inode:dmabuf handle. However, it's not just the snapshot code but also bug 1498816 and how that affects SessionMediator that's concerning.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Handy tip:
You can see the call stack of (many but not all) fds unity8 has open while it's running. Just start unity8 under valgrind with --track-fds=yes. Then you can get the live list of fds with their backtraces:

    vgdb v.info open_fds

Annoyingly this doesn't show the dmabuf ones we're looking for.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Dropped severity because the bug requires significant time to occur, if at all.

Changed in qtmir:
importance: Critical → High
Revision history for this message
kevin gunn (kgunn72) wrote :

Just noting, I agree with the drop in priority, since we have a fix for u-a-l and that was the original culprit.
We do need to eventually get this nailed, so please keep this as a front-burner activity.

Changed in canonical-devices-system-image:
importance: Critical → High
Revision history for this message
Daniel van Vugt (vanvugt) wrote :

aacid:
Can you please test the attached proposed branches? I'm wondering if I'm imaging things when it seems fixed.

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

*imagining*

Revision history for this message
Albert Astals Cid (aacid) wrote :

I can confirm that the fd leaking on killing apps like explained in comment #18 also seems to stop for me when using https://code.launchpad.net/~dandrader/qtmir/textureCleanUp-lp1499388/+merge/272452

Revision history for this message
Daniel van Vugt (vanvugt) wrote :

Awesomtastic

Changed in qtmir:
assignee: Daniel van Vugt (vanvugt) → Daniel d'Andrada (dandrader)
assignee: Daniel d'Andrada (dandrader) → Daniel van Vugt (vanvugt)
assignee: Daniel van Vugt (vanvugt) → Daniel d'Andrada (dandrader)
Changed in qtmir (Ubuntu):
status: New → Triaged
importance: Undecided → High
Revision history for this message
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package qtmir - 0.4.6+15.10.20150930.1-0ubuntu1

---------------
qtmir (0.4.6+15.10.20150930.1-0ubuntu1) wily; urgency=medium

  [ CI Train Bot ]
  * New rebuild forced.

  [ Daniel d'Andrada ]
  * MirSurfaceItem: texture must be manipulated only from the scene
    graph thread (LP: #1499388, #1495871)

  [ Gerry Boland ]
  * Add "Closing" state to Application, use it to distinguish user-
    induced close from app-induced close. Don't clear QML cache if user-
    induced. (LP: #1500372)

 -- Michał Sawicz <email address hidden> Wed, 30 Sep 2015 10:08:37 +0000

Changed in qtmir (Ubuntu):
status: Triaged → Fix Released
kevin gunn (kgunn72)
Changed in qtmir:
status: In Progress → Fix Released
Changed in canonical-devices-system-image:
status: In Progress → Fix Committed
Changed in canonical-devices-system-image:
status: Fix Committed → Fix Released
Michał Sawicz (saviq)
Changed in qtmir (Ubuntu):
assignee: nobody → Daniel d'Andrada (dandrader)
no longer affects: qtmir