Ubuntu
upstart package

uevent spam causes libdbus client code in session upstart to consume massive amounts of memory on Ubuntu Touch

Bug #1235649 reported by Oliver Grawert on 2013-10-05

This bug affects 8 people

	Status	Importance	Assigned to	Milestone
Touch Landing Plan	Fix Released	Undecided	Unassigned
Unity	Confirmed	Undecided	Unassigned
upstart	Fix Released	Undecided	James Hunt
linux (Ubuntu)	Invalid	Critical	Unassigned
Saucy	Invalid	Critical	Unassigned
systemd (Ubuntu)	Invalid	Undecided	Dimitri John Ledkov	Ubuntu ubuntu-13.10
Saucy	Invalid	Undecided	Dimitri John Ledkov	Ubuntu ubuntu-13.10
unity (Ubuntu)	Confirmed	Undecided	Unassigned
Saucy	Invalid	Undecided	Unassigned
upstart (Ubuntu)	Fix Released	Critical	James Hunt
Saucy	Fix Released	Critical	James Hunt

Bug Description

using ubuntu touch image 82 i see the session init consume about 10MB per minute as long as the screen is on with Mir.
running the same session with surfaceflinger only consumes 1MB per minute.

in both cases the system starts to swap heavily at some point, making the UI unresponsive.

http://paste.ubuntu.com/6196223/ has the top output of a Mir session after 30min, the UI just got completely unresponsive when this snapshot was taken.

http://paste.ubuntu.com/6196332/ is the top output of a surfaceflinger session where the screen was off for about 10min

apparently the leak only occurs while the screen is on, it seems to be permanently there but in the case of surfaceflinger it hits less hard.

Tags:

Related branches

lp:~xnox/upstart/fix-1235649

Rejected for merging into lp:upstart

James Hunt: Disapprove on 2013-10-08

lp:ubuntu/saucy/upstart

lp:~vorlon/unity8/lp.1235649

Merged into lp:unity8

PS Jenkins bot (community): Needs Fixing (continuous-integration) on 2013-10-10

Thomas Voß (community): Approve on 2013-10-10

lp:~xnox/ubuntu/saucy/systemd/udevd-ignore

lp:~jamesodhunt/upstart/bugs-1235649+1203595

Rejected for merging into lp:upstart

Steve Langasek: Disapprove on 2013-10-23

lp:~jamesodhunt/upstart/bug-1235649

Merged into lp:upstart at revision 1554

Dimitri John Ledkov: Approve on 2013-11-04

lp:~jamesodhunt/ubuntu/trusty/upstart/1.11

lp:~jamesodhunt/ubuntu/trusty/dbus/bug-1252317

Revision history for this message

Oliver Grawert (ogra) wrote on 2013-10-05:

both of the above was observed on maguro btw.

Changed in upstart (Ubuntu):
importance:	Undecided → Critical

Alan Pope 🍺🐧🐱 🦄 (popey) on 2013-10-05

Changed in upstart (Ubuntu):
status:	New → Confirmed

Revision history for this message

Oliver Grawert (ogra) wrote on 2013-10-05:

after capturing the above i tried to upgrade to upstart 1.10-0ubuntu6 from the archive which exposes the same behavior.

Revision history for this message

Oliver Grawert (ogra) wrote on 2013-10-05:

i tried downgrading upstart version by version until 1.9.1-0ubuntu5 (which was the first one with user sessions enabled) and it seems all versions expose this behavior.

Revision history for this message

Jean-Baptiste Lallement (jibel) wrote on 2013-10-06:

RSS-usage-Mir_vs_SF.png Edit (46.6 KiB, image/png)

I confirm this behaviour with build 82 on mako.
To illustrate it I made the following test:
Start and stop dialer-app every 6s (reasonable delay to give time to the application to start and stop)
The application is started with: start application APP_ID=dialer-app
and stop with a kill.

I captured RSS measured by ps after each stop and made 2 runs with Mir and 1 with SF. The result is attached.
I shows that RSS usage of init with Mir is much higher the SF and growth is exponential while it is linear with Mir.

As a side note, the number of samples with Mir is limited to less than 50 because the phone stops responding at this moment while it continues to run fine with SF. This is reproduceable.

tags:

added: qa-manual-testing qa-touch rls-s-incoming

Revision history for this message

Jean-Baptiste Lallement (jibel) wrote on 2013-10-06:

replace:
I shows that RSS usage of init with Mir is much higher the SF and growth is exponential while it is linear with Mir.

by
It shows that RSS usage of init with Mir is much higher than SF and growth is exponential while it is linear with SF.

Revision history for this message

Oliver Grawert (ogra) wrote on 2013-10-06:

the growth in itself is an issue though, even SF will hit swap at some point, just later ...

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-07:

I don't have a maguro device so debugging is going to have to be by proxy :)

A few questions:

1) Does this only occur when the system is read-only?
2) Are logs growning fast in ~/.cache/upstart/ ?
3) What is the screen doing when it is on? Is it enough to run "powerd-cli display on" to see the issue?
4) Is the issue seen on any other hardware?
5) How do you toggle between Mir and SF?

Please could you:

a) Run the Session Init in debug mode by adding '--debug' to 'init --user' in /usr/bin/ubuntu-touch-session.
b) attach ~/.xsession-errors
c) Get an strace of the Session Init.

Sounds like this issue may be related to kernel bug 1234743.

Revision history for this message

Dimitri John Ledkov (xnox) wrote on 2013-10-07:

https://wiki.ubuntu.com/Touch/Testing/Mir

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-07:

Answering own question (5): https://wiki.ubuntu.com/Touch/Testing/Mir

Revision history for this message

Jean-Baptiste Lallement (jibel) wrote on 2013-10-07: Re: [Bug 1235649] Re: session upstart leaks massive amounts of memory on Ubuntu Touch

#10

ls-lSr_cache_upstart.mir.ro.log Edit (7.0 KiB, text/x-log; name="ls-lSr_cache_upstart.mir.ro.log")

Le 07/10/2013 11:12, James Hunt a écrit :
> I don't have a maguro device so debugging is going to have to be by
> proxy :)
>
> A few questions:
>
> 1) Does this only occur when the system is read-only?
No. init leaks the same amount of memory when the file system is RO or RW.

> 2) Are logs growning fast in ~/.cache/upstart/ ?
No. The biggest log file is application-legacy-dialer-app-.log (the
application I used for the test) and it is around 700kB. The size is
approximately the same with Mir or SF and in RO or RW mode. (ls -lSr
/home/phablet/.cache/upstart attached)

> 3) What is the screen doing when it is on? Is it enough to run "powerd-cli display on" to see the issue?
In my test, I was launching/stopping dialer app, so the screen was
switching between the dialer and the application scope.

Running powerd-cli or not doesn't make a difference (excepted that I
have to tap the screen regularly to keep the display alive without
powerd-cli :))

> 4) Is the issue seen on any other hardware?
Reproduced on maguro and mako

> Please could you:
>
> a) Run the Session Init in debug mode by adding '--debug' to 'init --user' in /usr/bin/ubuntu-touch-session.
> b) attach ~/.xsession-errors
> c) Get an strace of the Session Init.
In progress

--
Jean-Baptiste
irc: jibel

Revision history for this message

Jean-Baptiste Lallement (jibel) wrote on 2013-10-07: Re: session upstart leaks massive amounts of memory on Ubuntu Touch

#11

_home_phablet_.xsession-errors Edit (2.7 KiB, text/plain)

/home/phablet/.xsession-errors with init --user --debug

Revision history for this message

Jean-Baptiste Lallement (jibel) wrote on 2013-10-07:

#12

init_--user_--debug.strace Edit (3.4 MiB, text/html)

strace of session init

Revision history for this message

Jean-Baptiste Lallement (jibel) wrote on 2013-10-07:

#13

btw, .xession-errors shows 5 run of dialer, 1 run of shorts and 1 run of calculator, and strace shows 1 run of dialer.

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-07:

#14

jibel - thanks. It does indeed look like bug 1234743 as the trace shows a lot of the following events:

:sys:cpu-device-online
:sys:cpu-device-offline
:sys:power_supply-device-changed

This would explain the Session consuming a lot of CPU since those events are generated by the kernel, picked up by the upstart-udev-bridge which injects them into the system init event space. These events are then noticed by the upstart-event-bridge running as phablet and injected into the Session Init with the ":sys:" prefix.

Could you run that again as "strace -fFttv -s 1024 ..." so we can see some timestamps?

Also, please attach /var/log/udev and the output of 'ls -al /proc/$session_init_pid/fd/'.

Adding kernel since as this issue only affects particular hardward, it is highly unlikely to be an Upstart issue.

Changed in upstart (Ubuntu Saucy):
status:	Confirmed → Incomplete

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-07:

#15

Also, please could you run http://people.canonical.com/~jhunt/upstart/scripts/get_state.sh as user phablet and attach to this bug.

Revision history for this message

Brad Figg (brad-figg) wrote on 2013-10-07: Missing required logs.

#16

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1235649

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status:	New → Incomplete

Revision history for this message

Jean-Baptiste Lallement (jibel) wrote on 2013-10-07: Re: session upstart leaks massive amounts of memory on Ubuntu Touch

#17

strace_-fFvtt_-s.init_--user_--debug.strace.gz Edit (784.1 KiB, application/octet-stream)

more verbose strace during a start/stop of dialer-app

Revision history for this message

Jean-Baptiste Lallement (jibel) wrote on 2013-10-07:

#18

get_state.out Edit (284.1 KiB, text/plain)

output of get_state.sh after dialer-app has been stopped.

Revision history for this message

Jean-Baptiste Lallement (jibel) wrote on 2013-10-07:

#19

_var_log_udev Edit (343.3 KiB, text/plain)

/var/log/udev

Revision history for this message

Jean-Baptiste Lallement (jibel) wrote on 2013-10-07:

#20

ls_-al__proc_PID_fd_.out Edit (2.6 KiB, text/plain)

output of ls -al /proc/1467/fd/

Changed in upstart (Ubuntu Saucy):
status:	Incomplete → Confirmed
Changed in linux (Ubuntu Saucy):
status:	Incomplete → Confirmed
tags:	added: kernel-bot-stop-nagging
Changed in linux (Ubuntu Saucy):
importance:	Undecided → Critical

Revision history for this message

Oliver Grawert (ogra) wrote on 2013-10-07:

#21

i can definitely make it stop consuming more ram by stopping the upstart-event-bridge session job.
this works on both devices.
the ram does not get freed but the consumption does not keep rising either anymore then.

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-07:

#22

Apparently, the high cpu load stop when the upstart-event-bridge is stopped. This lends weight to the high cpu issue being caused the a kernel udev event issue (like bug 1234743).

However, what is not explained is why the Session Init retains the memory it consumes once the upstart-event-bridge is stopped.

Please would someone check to see if the issue is recreatable on grouper?

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-07:

#23

@jibel - that get_state.out appears to have either been run as root, or atleast outside the session init environment. Please could you re-run as user phablet?

Also, please could you attach /var/log/udev ?

Revision history for this message

Jean-Baptiste Lallement (jibel) wrote on 2013-10-07:

#24

get_state.out Edit (89.9 KiB, text/plain)

output of get_state.sh as user phablet

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-07:

#25

Can someone try running "/sbin/initctl notify-disk-writeable" and see if the session inits memory consumption drops? Alternatively, run the session init as "init --user --debug --no-log".

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-07:

#26

From info provided by jibel...

Forcing a log flush (#25) doesn't resolve the high memory usage (as we'd expect, but worth ruling out).

The state file in #24 looks sane to me; forcing a re-exec of the Session Init does cause the memory to fall to reasonable levels. This almost feels like a compiler bug where __attribute__ ((cleanup())) is failing for some reason.

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-07:

#27

To summarise: there are 2 problems here:

1) The kernel spamming userspace with udev messages on mako+maguro (grouper doesn't appear to be affected but not independently verified yet).

2) Upstart not releasing memory once the kernel has finished spamming userspace (artificially forced by running "stop upstart-event-bridge" as user 'phablet').

Revision history for this message

Jean-Baptiste Lallement (jibel) wrote on 2013-10-07:

#28

initctl_list.out Edit (1003 bytes, text/plain)

output of "initctl list" as user phablet.

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-07:

#29

Could someone run the "make check" tests for the libnih package on a mako/maguro device?

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-07:

#30

I can no recreate the issue with memory not being freed by running: http://paste.ubuntu.com/6205552/.

This is recreatable on i386 desktop images so although problem (1) in #27 is h/w specific, problem (2) is not.

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-07:

#31

Attempting to run the script in #30 against a session init as below *outside* of a the standard gui environment with no .conf files does not trigger the issue:

$ mkdir -p /tmp/conf /tmp/log
$ init --debug --no-startup-event --confdir /tmp/conf --logdir /tmp/log --user

Same outcome running like this:

$ init --debug --no-startup-event --user

So, Event objects themselves are not the issue. More likely a subtle ref-counting issue related to one of the other object types.

Joseph Salisbury (jsalisbury) on 2013-10-07

tags:

added: kernel-key

Brian Murray (brian-murray) on 2013-10-07

tags:

removed: rls-s-incoming

James Hunt (jamesodhunt) on 2013-10-08

Changed in upstart (Ubuntu Saucy):
assignee:	nobody → James Hunt (jamesodhunt)

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-08:

#32

valgrind session init consuming ~250MB (with suppressions) Edit (51.0 KiB, application/octet-stream)

Valgrinding the session init shows that most of the memory is correctly freed at quiesce time.

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-08:

#33

valgrind session init consuming ~500MB (no suppressions) Edit (63.6 KiB, application/octet-stream)

Valgrind run as:

valgrind -v -v -v -v --show-reachable=yes --track-origins=yes --leak-check=full init --debug --user

Dimitri John Ledkov (xnox) on 2013-10-08

summary:

- session upstart leaks massive amounts of memory on Ubuntu Touch
+ uevent spam causes session upstart to consume massive amounts of memory
+ on Ubuntu Touch

Joseph Salisbury (jsalisbury) on 2013-10-08

Changed in linux (Ubuntu Saucy):
status:	Confirmed → Incomplete
tags:	added: kernel-da-key removed: kernel-key

Revision history for this message

Dimitri John Ledkov (xnox) wrote on 2013-10-08: Re: uevent spam causes session upstart to consume massive amounts of memory on Ubuntu Touch

#34

After discussion on irc between ogra, pitti and I:
workaround #1 - blacklist that uevent in upstart-udev-bridge, thus neither system init event will not be emitted, nor re-emitted to session upstart.

workaround #2 - blacklist that uevent in systemd-udev (via socket filter) if not needed by Mir.

The first one is proposed in a branch already, and that still leaves 5% cpu usage by systemd-udev constantly whilst the screen is on.

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-08:

#35

The problem is triggered by init/control.c:control_emit_event_emitted(). It looks like Upstart itself is clearing up memory, but either dbus isn't, or upstart is missing a dbus unref call.

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-08:

#36

The working theory at this stage is that init/control.c:control_disconnected() is not cleaning up the dbus connections created by the upstart-udev-bridge for each event it requests upstart emit.

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-08:

#37

init/control.c:control_disconnected() could do with a call to dbus_connection_close(), but each connection is being unref'd correctly fwics.

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-08:

#38

To be clear, adding a call to dbus_connection_close() in control_disconnected() did not resolve the problem.

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-09:

#39

Persistent D-Bus connections are as expected (1 / bridge and 1 for unity-panel-service).
Ephemeral D-Bus connections are being handled correctly and removed from the control_conns list.

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-09:

#40

Still unclear why memory is being cached in Upstart address space, but I'm starting to think this might be a libdbus issue.

What is triggering the memory rise though is unity-panel-service: stopping this from running results in no memory increase within the Session Init.

Revision history for this message

Oliver Grawert (ogra) wrote on 2013-10-09:

#41

note that touch images do not use unity-panel-service anywhere

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-09:

#42

Right - recall that I am testing on i386 as I don't have mako/maguro :)

However, note that on Touch, the upstart code in unity8 (plugins/Unity/Indicators/indicatorsmanager.cpp) is almost identical to the upstart code in unity-panel-service (unity/services/panel-service.c).

Is it possible to somehow disable the loading of /usr/lib/i386-linux-gnu/unity8/qml/Unity/Indicators/libIndicatorsQml.so in the unity8-private package on mako/maguro to see if that stops the memory bloat?

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-09:

#43

Re #41, interesting that unity-panel-service *is* installed on Touch image though :)

Revision history for this message

Michał Sawicz (saviq) wrote on 2013-10-09:

#44

Download full text (3.9 KiB)

Sure:

=== modified file 'Panel/Panel.qml'
--- Panel/Panel.qml 2013-09-24 09:00:55 +0000
+++ Panel/Panel.qml 2013-10-09 14:14:02 +0000
@@ -71,7 +71,7 @@
}
}

- Indicators {
+ Showable {
id: indicatorsMenu
objectName: "indicators"

@@ -79,10 +79,10 @@
         y: panelBackground.y
         width: root.indicatorsMenuWidth
         shown: false
- hintValue: __panelMinusSeparatorLineHeight * 3
- panelHeight: __panelMinusSeparatorLineHeight
- openedHeight: parent.height + (pinnedMode ? 0 : root.panelHeight)
- pinnedMode: !fullscreenMode
+// hintValue: __panelMinusSeparatorLineHeight * 3
+// panelHeight: __panelMinusSeparatorLineHeight
+// openedHeight: parent.height + (pinnedMode ? 0 : root.panelHeight)
+// pinnedMode: !fullscreenMode

property real unitProgress: (height - panelHeight) / (openedHeight - panelHeight)
}

=== modified file 'Shell.qml'
--- Shell.qml 2013-10-08 19:11:11 +0000
+++ Shell.qml 2013-10-09 14:13:59 +0000
@@ -568,7 +568,7 @@ ...

Sure:

=== modified file 'Panel/Panel.qml'                                                                
--- Panel/Panel.qml     2013-09-24 09:00:55 +0000                                                  
+++ Panel/Panel.qml     2013-10-09 14:14:02 +0000                                                  
@@ -71,7 +71,7 @@                                                                                  
         }                                                                                         
     }                                                                                             
                                                                                                   
-    Indicators {                                                                                  
+    Showable {                                                                                    
         id: indicatorsMenu                                                                        
         objectName: "indicators"                                                                  
                                                                                                   
@@ -79,10 +79,10 @@                                                                                
         y: panelBackground.y                                                                      
         width: root.indicatorsMenuWidth                                                           
         shown: false                                                                              
-        hintValue: __panelMinusSeparatorLineHeight * 3                                            
-        panelHeight: __panelMinusSeparatorLineHeight                                              
-        openedHeight: parent.height + (pinnedMode ? 0 : root.panelHeight)                         
-        pinnedMode: !fullscreenMode                                                               
+//        hintValue: __panelMinusSeparatorLineHeight * 3                                          
+//        panelHeight: __panelMinusSeparatorLineHeight                                            
+//        openedHeight: parent.height + (pinnedMode ? 0 : root.panelHeight)                       
+//        pinnedMode: !fullscreenMode                                                             
                                                                                                   
         property real unitProgress: (height - panelHeight) / (openedHeight - panelHeight)         
     }                                                                                             
                                                                                                   
=== modified file 'Shell.qml'                                                                      
--- Shell.qml   2013-10-08 19:11:11 +0000                                                          
+++ Shell.qml   2013-10-09 14:13:59 +0000                                                          
@@ -568,7 +568,7 @@                                                                                
             indicators {                                                                          
                 hides: [launcher]                                                                 
                 available: edgeDemo.panelEnabled                                                  
-                contentEnabled: edgeDemo.panelContentEnabled                                      
+//                contentEnabled: edgeDemo.panelContentEnabled                                    
             }                                                                                     
             fullscreenMode: shell.fullscreenMode                                                  
             searchVisible: !greeter.shown && !lockscreen.shown

Revision history for this message

Michał Sawicz (saviq) wrote on 2013-10-09:

#45

The above can be applied straight on device, in /usr/share/unity8/.

Revision history for this message

Steve Langasek (vorlon) wrote on 2013-10-09:

#46

Workaround is in place in upstart package to reduce the flood of events that are causing this rapid memory growth, and we believe a fix to the unity upstart client will fix the memory leak entirely; but upstart needs to be more robust against misbehaving clients. Opening a task against upstart upstream.

Revision history for this message

Launchpad Janitor (janitor) wrote on 2013-10-09:

#47

This bug was fixed in the package upstart - 1.10-0ubuntu7

---------------
upstart (1.10-0ubuntu7) saucy; urgency=low

* Ignore emitting specific udev event spam on maguro, as a workaround
for LP: #1235649.
-- Dmitrijs Ledkovs <email address hidden> Wed, 09 Oct 2013 13:52:10 +0100

Changed in upstart (Ubuntu Saucy):
status:	Confirmed → Fix Released

Joseph Salisbury (jsalisbury) on 2013-10-09

Changed in linux (Ubuntu Saucy):
status:	Incomplete → Invalid

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-10:

#48

This problem can now be demonstrated with a single connection to upstart that makes the following calls:

dbus_connection_open()
nih_dbus_proxy_new ()

The problem is not seen if the app instead calls:

           - nih_dbus_connect (upstart_session, disconnect_handler)
           - nih_dbus_proxy_new ()
           - nih_main_loop()

Or,

           - nih_dbus_connect (upstart_session, NULL)
           - nih_dbus_proxy_new ()
           - nih_main_loop()

Specifically, the problem appears to be that when an app calls dbus_connection_open(), one or more of the functions in nih_dbus_setup() should be being called to allow libdbus to interrupt the apps main loop and request callbacks are run to clean up resources.

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-10:

#49

The problem is not observed when the only connections to upstart are the standard upstart bridges (since they use nih_dbus_connect()).

For reference, here is nih_dbus_setup() which includes one or more calls that unity seemingly should be calling itself:

http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/saucy/libnih/saucy/view/head:/nih-dbus/dbus_connection.c#L202

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-10:

#50

Note that disabling libIndicatorsQml.so with #44 did not resolve the problem. However, I am confident this is because there were other connections to the Session Init (probably resulting from libupstart-app-launch1).

This does need to be confirmed though.

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-10:

#51

trigger-bug-1235649.c Edit (2.6 KiB, text/plain)

C program to trigger this bug. Run it in an environment with UPSTART_SESSION set (such as a desktop/Touch environment), then run the about-to-be attached spam.sh script.

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-10:

#52

spam.sh Edit (697 bytes, text/x-sh)

Run as:

$ spam.sh <session_init_pid>

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-10:

#53

Program attached in #51 demonstrates the incorrect calling behaviour. To demonstrate the correct behaviour change the #if0/else/endif such that rather than calling dbus_connection_open(), the program calls nih_dbus_connect().

nih_dbus_connect(): No memory growth.
dbus_connection_open(): Memory growth.

summary:

- uevent spam causes session upstart to consume massive amounts of memory
- on Ubuntu Touch
+ uevent spam causes libdbus client code in session upstart to consume
+ massive amounts of memory on Ubuntu Touch

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-11:

#54

Note that the following call on the server side does not resolve the issue:

while (dbus_connection_dispatch (connection) == DBUS_DISPATCH_DATA_REMAINS)
;

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-11:

#55

Neither does calling the following help:

dbus_connection_flush (connection);

Revision history for this message

Steve Langasek (vorlon) wrote on 2013-10-12:

#56

I've tested on a mako device here, and the memory usage is not all due to a misbehaved client. I've tracked down all the clients of the session init; unity8 is the only long-lived client other than the bridges, and if I kill unity8 and let it respawn, upstart gains back some of the memory, but *not all*: if I feed events into the session init, bringing its memory usage up to 93.3MB, and then kill unity8 (and, in fact,if I go through one-by-one and kill *all* the processes related to the session), upstart frees *some* memory, but its total usage remains at 88MB - compared to 5MB when it starts out.

Another thing I've found out, though, is that if I force a re-exec of the session init with 'telinit u', the memory usage drops down... and never rises again, even if I spam it with more events. This suggests a possible workaround of just re-execing the session init immediately after startup. But that probably requires landing a fix for bug #1238078 first.

Revision history for this message

Steve Langasek (vorlon) wrote on 2013-10-12:

#57

I've also checked this memory usage with valgrind; valgrind doesn't think there's a leak, at the end of a valgrind session there are only 211,206 bytes "possibly lost" - nowhere near the many MB that we're seeing.

Revision history for this message

Steve Langasek (vorlon) wrote on 2013-10-12:

#58

So the reason the memory usage wasn't growing after re-exec was because unity8 didn't automatically reconnect when the dbus connection dropped. If I kill and restart unity8, the problem reappears.

So the memory usage is indeed associated with the dbus client, even though disconnecting the client doesn't free the memory, and valgrind doesn't consider the memory to have been lost.

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-14:

#59

vg.log Edit (1.6 MiB, text/plain)

Just got a much more interesting valgrind run. Juicy excerpt:

==9366== 147,560 bytes in 1,190 blocks are still reachable in loss record 187 of 188
==9366== at 0x402B965: calloc (in /usr/lib/valgrind/vgpreload_memcheck-x86-linux.so)
==9366== by 0x408E730: dbus_malloc0 (dbus-memory.c:572)
==9366== by 0x407983B: dbus_message_new_empty_header (dbus-message.c:1153)
==9366== by 0x4079D0F: dbus_message_new_method_return (dbus-message.c:1319)
==9366== by 0x80835F5: control_emit_event_reply (com.ubuntu.Upstart.c:2098)
==9366== by 0x806BC6C: event_finished (event.c:501)
==9366== by 0x806B26F: event_poll (event.c:259)
==9366== by 0x4044B63: nih_main_loop (main.c:609)
==9366== by 0x804D507: main (main.c:688)
==9366==
==9366== 228,394,752 bytes in 2,386 blocks are still reachable in loss record 188 of 188
==9366== at 0x402BB88: realloc (in /usr/lib/valgrind/vgpreload_memcheck-x86-linux.so)
==9366== by 0x408E778: dbus_realloc (dbus-memory.c:678)
==9366== by 0x408F08C: reallocate_for_length (dbus-string.c:349)
==9366== by 0x408F12B: set_length (dbus-string.c:390)
==9366== by 0x408F1A9: open_gap (dbus-string.c:411)
==9366== by 0x408FA67: copy (dbus-string.c:1198)
==9366== by 0x408FCCE: _dbus_string_copy_len (dbus-string.c:1368)
==9366== by 0x408DCF8: marshal_len_followed_by_bytes (dbus-marshal-basic.c:726)
==9366== by 0x408DD95: marshal_string (dbus-marshal-basic.c:758)
==9366== by 0x408DF60: _dbus_marshal_write_basic (dbus-marshal-basic.c:840)
==9366== by 0x4075CD7: _dbus_type_writer_write_basic_no_typecode (dbus-marshal-recursive.c:1601)
==9366== by 0x40768BB: _dbus_type_writer_write_basic (dbus-marshal-recursive.c:2323)
==9366==
==9366== LEAK SUMMARY:
==9366== definitely lost: 28 bytes in 1 blocks
==9366== indirectly lost: 24 bytes in 1 blocks
==9366== possibly lost: 12,907 bytes in 290 blocks
==9366== still reachable: 228,607,834 bytes in 3,732 blocks
==9366== suppressed: 0 bytes in 0 blocks

Just got a much more interesting valgrind run. Juicy excerpt:

==9366== 147,560 bytes in 1,190 blocks are still reachable in loss record 187 of 188
==9366==    at 0x402B965: calloc (in /usr/lib/valgrind/vgpreload_memcheck-x86-linux.so)
==9366==    by 0x408E730: dbus_malloc0 (dbus-memory.c:572)
==9366==    by 0x407983B: dbus_message_new_empty_header (dbus-message.c:1153)
==9366==    by 0x4079D0F: dbus_message_new_method_return (dbus-message.c:1319)
==9366==    by 0x80835F5: control_emit_event_reply (com.ubuntu.Upstart.c:2098)
==9366==    by 0x806BC6C: event_finished (event.c:501)
==9366==    by 0x806B26F: event_poll (event.c:259)
==9366==    by 0x4044B63: nih_main_loop (main.c:609)
==9366==    by 0x804D507: main (main.c:688)
==9366==
==9366== 228,394,752 bytes in 2,386 blocks are still reachable in loss record 188 of 188
==9366==    at 0x402BB88: realloc (in /usr/lib/valgrind/vgpreload_memcheck-x86-linux.so)
==9366==    by 0x408E778: dbus_realloc (dbus-memory.c:678)
==9366==    by 0x408F08C: reallocate_for_length (dbus-string.c:349)
==9366==    by 0x408F12B: set_length (dbus-string.c:390)
==9366==    by 0x408F1A9: open_gap (dbus-string.c:411)
==9366==    by 0x408FA67: copy (dbus-string.c:1198)
==9366==    by 0x408FCCE: _dbus_string_copy_len (dbus-string.c:1368)
==9366==    by 0x408DCF8: marshal_len_followed_by_bytes (dbus-marshal-basic.c:726)
==9366==    by 0x408DD95: marshal_string (dbus-marshal-basic.c:758)
==9366==    by 0x408DF60: _dbus_marshal_write_basic (dbus-marshal-basic.c:840)
==9366==    by 0x4075CD7: _dbus_type_writer_write_basic_no_typecode (dbus-marshal-recursive.c:1601)
==9366==    by 0x40768BB: _dbus_type_writer_write_basic (dbus-marshal-recursive.c:2323)
==9366==
==9366== LEAK SUMMARY:
==9366==    definitely lost: 28 bytes in 1 blocks
==9366==    indirectly lost: 24 bytes in 1 blocks
==9366==      possibly lost: 12,907 bytes in 290 blocks
==9366==    still reachable: 228,607,834 bytes in 3,732 blocks
==9366==         suppressed: 0 bytes in 0 blocks

Revision history for this message

Steve Langasek (vorlon) wrote on 2013-10-14:

#60

James, how was this valgrind trace produced? The stacktrace doesn't show anything above the low-level calls in dbus, making it difficult to pinpoint the actual source of the problem.

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-15:

#61

Hi Steve,

The trace in #59 was produced by:

- Running Upstart via valgrind which upstart linked against debug builds of nih and dbus.
- Running a single instance of the test program (#51).
- Running spam.sh (#52).
- Waiting until init was consuming ~200MB, then sending SIGTERM to the init/valgrind process.

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-15:

#62

vg-20131015.log.gz Edit (138.8 KiB, application/octet-stream)

Another valgrind log taken using the same precedure as above. However, this time I waited until the init process was consuming over 2GB of RAM. Top showed:

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15329 james 20 0 2463m 2,0g 12m R 100,0 26,1 6:51.59 memcheck-x86-li

Once again, the biggest offender by a huge margin is dbus:

==15329== 1,153,679,208 bytes in 12,025 blocks are still reachable in loss record 208 of 208
==15329== at 0x402BB88: realloc (in /usr/lib/valgrind/vgpreload_memcheck-x86-linux.so)
==15329== by 0x408E778: dbus_realloc (dbus-memory.c:678)
==15329== by 0x408F08C: reallocate_for_length (dbus-string.c:349)
==15329== by 0x408F12B: set_length (dbus-string.c:390)
==15329== by 0x408F1A9: open_gap (dbus-string.c:411)
==15329== by 0x408FA67: copy (dbus-string.c:1198)
==15329== by 0x408FCCE: _dbus_string_copy_len (dbus-string.c:1368)
==15329== by 0x408DCF8: marshal_len_followed_by_bytes (dbus-marshal-basic.c:726)
==15329== by 0x408DD95: marshal_string (dbus-marshal-basic.c:758)
==15329== by 0x408DF60: _dbus_marshal_write_basic (dbus-marshal-basic.c:840)
==15329== by 0x4075CD7: _dbus_type_writer_write_basic_no_typecode (dbus-marshal-recursive.c:1601)
==15329== by 0x40768BB: _dbus_type_writer_write_basic (dbus-marshal-recursive.c:2323)
==15329==
==15329== LEAK SUMMARY:
==15329== definitely lost: 28 bytes in 1 blocks
==15329== indirectly lost: 24 bytes in 1 blocks
==15329== possibly lost: 133,265 bytes in 315 blocks
==15329== still reachable: 1,154,753,166 bytes in 18,194 blocks
==15329== suppressed: 0 bytes in 0 blocks

Revision history for this message

Dimitri John Ledkov (xnox) wrote on 2013-10-15:

#63

systemd task assigned to me:
- make it possible for systemd-udevd to ignore kernel events from devices tagged "ignore", using a kernel netlink filter to completely avoid wakeups of systemd-udevd.

Changed in systemd (Ubuntu Saucy):
status:	New → In Progress
assignee:	nobody → Dmitrijs Ledkovs (xnox)
milestone:	none → ubuntu-13.10

Dimitri John Ledkov (xnox) on 2013-10-15

Changed in systemd (Ubuntu Saucy):
status:	In Progress → Invalid

Revision history for this message

Steve Langasek (vorlon) wrote on 2013-10-15: Re: [Bug 1235649] Re: uevent spam causes libdbus client code in session upstart to consume massive amounts of memory on Ubuntu Touch

#65

On Tue, Oct 15, 2013 at 08:19:53AM -0000, James Hunt wrote:
> Once again, the biggest offender by a huge margin is dbus:

> ==15329== 1,153,679,208 bytes in 12,025 blocks are still reachable in loss record 208 of 208
> ==15329== at 0x402BB88: realloc (in /usr/lib/valgrind/vgpreload_memcheck-x86-linux.so)
> ==15329== by 0x408E778: dbus_realloc (dbus-memory.c:678)
> ==15329== by 0x408F08C: reallocate_for_length (dbus-string.c:349)
> ==15329== by 0x408F12B: set_length (dbus-string.c:390)
> ==15329== by 0x408F1A9: open_gap (dbus-string.c:411)
> ==15329== by 0x408FA67: copy (dbus-string.c:1198)
> ==15329== by 0x408FCCE: _dbus_string_copy_len (dbus-string.c:1368)
> ==15329== by 0x408DCF8: marshal_len_followed_by_bytes (dbus-marshal-basic.c:726)
> ==15329== by 0x408DD95: marshal_string (dbus-marshal-basic.c:758)
> ==15329== by 0x408DF60: _dbus_marshal_write_basic (dbus-marshal-basic.c:840)
> ==15329== by 0x4075CD7: _dbus_type_writer_write_basic_no_typecode (dbus-marshal-recursive.c:1601)
> ==15329== by 0x40768BB: _dbus_type_writer_write_basic (dbus-marshal-recursive.c:2323)

Yet this backtrace is still deep in the dbus internals and shows us nothing
about who the actual caller is that's causing the leak. Can you please run
this with --num-callers=20 so we can see why these strings are actually
being allocated?

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-15:

#64

test_nih_dbus_server.c Edit (12.5 KiB, text/plain)

I can now recreate the memory bloat using a basic NIH D-Bus server program and associated client program.

1) Run the server as:

$ ./test_nih_dbus_server unix:abstract=/com/hunt/james/foo com.hunt.james.Foo /com/hunt/james/Foo

2) Run the client as:

$ ./test_nih_dbus_client unix:abstract=/com/hunt/james/foo com.hunt.james.Foo

3) Run the updated spammer as:

$ ./spam_dbus_server.sh

4) Run top and watch the memory used by test_nih_dbus_server creep up.

The problem seems to be caused by the call to notify_clients() - this sends a signal to every connected client every time the spam_dbus_server.sh sends its data to the server. Commenting out the call to notify_clients() stop the memory bloating issue.

The function echo_handled() that is called by notify_clients() is analogous to control_emit_event_emitted() in Upstart parlance.

Again, running valgrind on the server shows the same stack traces as above:

==30015== 91,750,175 bytes in 7,617 blocks are still reachable in loss record 164 of 164
==30015== at 0x402BB88: realloc (in /usr/lib/valgrind/vgpreload_memcheck-x86-linux.so)
==30015== by 0x408E778: dbus_realloc (dbus-memory.c:678)
==30015== by 0x408F08C: reallocate_for_length (dbus-string.c:349)
==30015== by 0x408F12B: set_length (dbus-string.c:390)
==30015== by 0x408F1A9: open_gap (dbus-string.c:411)
==30015== by 0x408FA67: copy (dbus-string.c:1198)
==30015== by 0x408FCCE: _dbus_string_copy_len (dbus-string.c:1368)
==30015== by 0x408DCF8: marshal_len_followed_by_bytes (dbus-marshal-basic.c:726)
==30015== by 0x408DD95: marshal_string (dbus-marshal-basic.c:758)
==30015== by 0x408DF60: _dbus_marshal_write_basic (dbus-marshal-basic.c:840)
==30015== by 0x4075CD7: _dbus_type_writer_write_basic_no_typecode (dbus-marshal-recursive.c:1601)
==30015== by 0x40768BB: _dbus_type_writer_write_basic (dbus-marshal-recursive.c:2323)

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-15:

#66

test_nih_dbus_client.c Edit (3.3 KiB, text/plain)

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-15:

#67

spam_dbus_server.sh Edit (807 bytes, text/x-sh)

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-15:

#68

vg-test_nih_dbus_server.log.gz Edit (60.1 KiB, application/octet-stream)

Valgrind log from test_nih_dbus_server.c run using 'valgrind --num-callers=100 ...'.

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-16:

#69

bug-1235649.diff Edit (830 bytes, text/plain)

FWICS D-Bus is behaving as designed: since the client has not called dbus_connection_set_wakeup_main_function(), and since D-Bus messaging is reliable, all libdbus is able to do when a signal is sent to a connection is cache any messages the client has not yet processed in the hope that at some future point it might consume them.

The key part of the D-Bus code that explains this is dbus_connection_send_preallocated_unlocked_no_update():

------------------------------

  /* Now we need to run an iteration to hopefully just write the messages
   * out immediately, and otherwise get them queued up
   */
  _dbus_connection_do_iteration_unlocked (connection,
                                          NULL,
                                          DBUS_ITERATION_DO_WRITING,
                                          -1);

  /* If stuff is still queued up, be sure we wake up the main loop */
  if (connection->n_outgoing > 0)
    _dbus_connection_wakeup_mainloop (connection);

------------------------------

As shown, the messages (signals in the case of Upstart) are added to the client connection queue, but since the client called
dbus_connection_open() rather than nih_dbus_connect(), no mainloop handler is registered to allow the client to deal with the signals sent to it.

From the upstart perspective, a fix is to call dbus_connection_has_messages_to_send() and ignore that connection if the
client still has unprocessed messages (patch attached).

From the D-Bus perspective, it would be useful to introduce a new public API:

dbus_bool_t
dbus_connection_has_wakeup_main_function (DBusConnection *connection);

This would allow Upstart and other D-Bus server applications to detect erroneous client connections.

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-16:

#70

TBC, using the patch on #69 results in no memory bloat.

Ubuntu Foundations Team Bug Bot (crichton) on 2013-10-16

tags:

added: patch

Revision history for this message

Steve Langasek (vorlon) wrote on 2013-10-17:

#71

After discussing this with Ted, I think the right answer architecturally is to move these clients to talking to the session bus, where filtering of messages will be done for us automatically, and have the session init register on the session bus, rather than deploying an unreliable change like having upstart only keep one message at a time in the queue for a client. This will require a few changes in upstart, including working out the correct way to get the session init to register with the session bus after dbus has been spawned (presumably we want to reuse SIGUSR1 for this, and maybe get the session bus address out of the global environment?)

Revision history for this message

James Hunt (jamesodhunt) wrote on 2013-10-17:

#72

We already have bug 1203595 for having the Session Init connect to the D-Bus session bus.

James Hunt (jamesodhunt) on 2013-10-18

Changed in upstart (Ubuntu):
status:	Fix Released → Incomplete
status:	Incomplete → In Progress

Michael Zanetti (mzanetti) on 2013-10-24

tags:

added: needs-ap-test

Revision history for this message

Launchpad Janitor (janitor) wrote on 2013-11-04:

#73

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in unity (Ubuntu Saucy):
status:	New → Confirmed
Changed in unity (Ubuntu):
status:	New → Confirmed

James Hunt (jamesodhunt) on 2013-11-12

Changed in upstart:
assignee:	nobody → James Hunt (jamesodhunt)
status:	New → Fix Committed

Revision history for this message

Launchpad Janitor (janitor) wrote on 2013-11-17:

#75

This bug was fixed in the package upstart - 1.11-0ubuntu1

---------------
upstart (1.11-0ubuntu1) trusty; urgency=low

[ Dmitrijs Ledkovs ]
* Run as much test-suites as possible, before exiting with failure.

  [ James Hunt ]
  * New upstream release.
    - the global environment is now serialized on re-exec. LP: #1238078.
    - fixes support for upstart-file-bridge detecting directory creation.
      LP: #1221466.
    - Fixes upstart user session to not override the user's configured
      umask. LP: #1240686.
    - Makes the session init available on the session dbus.
      LP: #1203595, #1235649.

  [ Steve Langasek ]
  * Document the authoritative Vcs branch in debian/control.
  * Update Standards-Version to 3.9.5, no changes required.
-- James Hunt <email address hidden> Fri, 15 Nov 2013 13:16:56 +0000

Changed in upstart (Ubuntu):
status:	In Progress → Fix Released

Robert Bruce Park (robru) on 2013-12-17

Changed in landing-plan:
status:	New → Fix Released

Marco Trevisan (Treviño) (3v1n0) on 2014-07-24

Changed in unity:
status:	New → Confirmed

Dimitri John Ledkov (xnox) on 2014-07-24

Changed in upstart:
status:	Fix Committed → Fix Released
status:	Fix Released → Confirmed

James Hunt (jamesodhunt) on 2014-07-24

Changed in upstart:
status:	Confirmed → Fix Released

Marco Trevisan (Treviño) (3v1n0) on 2014-08-05

Changed in unity (Ubuntu Saucy):
status:	Confirmed → Invalid

Revision history for this message

Ower (ower86-g) wrote on 2014-10-08:

#76

same problem.
Linux 3.14.0-generic
ubuntu 14.04
kdm

init --user use 1.8GB memory in 9 days

Report a bug

This report contains Public information

Everyone can see this information.

Duplicates of this bug

You are

Subscribing...

Edit bug mail

Other bug subscribers

Patches

bug-1235649.diff Edit

Add patch

Bug attachments

Add attachment

Remote bug watches

Bug watches keep track of this bug in other bug trackers.

Ubuntuupstart package

uevent spam causes libdbus client code in session upstart to consume massive amounts of memory on Ubuntu Touch

Bug Description

Related branches

Duplicates of this bug

Other bug subscribers

Patches

Bug attachments

Remote bug watches

Ubuntu
upstart package