uevent spam causes libdbus client code in session upstart to consume massive amounts of memory on Ubuntu Touch

Bug #1235649 reported by Oliver Grawert on 2013-10-05
52
This bug affects 8 people
Affects Status Importance Assigned to Milestone
Touch Landing Plan
Undecided
Unassigned
Unity
Confirmed
Undecided
Unassigned
upstart
Undecided
James Hunt
linux (Ubuntu)
Critical
Unassigned
Saucy
Critical
Unassigned
systemd (Ubuntu)
Undecided
Dimitri John Ledkov
Saucy
Undecided
Dimitri John Ledkov
unity (Ubuntu)
Undecided
Unassigned
Saucy
Undecided
Unassigned
upstart (Ubuntu)
Critical
James Hunt
Saucy
Critical
James Hunt

Bug Description

using ubuntu touch image 82 i see the session init consume about 10MB per minute as long as the screen is on with Mir.
running the same session with surfaceflinger only consumes 1MB per minute.

in both cases the system starts to swap heavily at some point, making the UI unresponsive.

http://paste.ubuntu.com/6196223/ has the top output of a Mir session after 30min, the UI just got completely unresponsive when this snapshot was taken.

http://paste.ubuntu.com/6196332/ is the top output of a surfaceflinger session where the screen was off for about 10min

apparently the leak only occurs while the screen is on, it seems to be permanently there but in the case of surfaceflinger it hits less hard.

Related branches

Oliver Grawert (ogra) wrote :

both of the above was observed on maguro btw.

Changed in upstart (Ubuntu):
importance: Undecided → Critical
Changed in upstart (Ubuntu):
status: New → Confirmed
Oliver Grawert (ogra) wrote :

after capturing the above i tried to upgrade to upstart 1.10-0ubuntu6 from the archive which exposes the same behavior.

Oliver Grawert (ogra) wrote :

i tried downgrading upstart version by version until 1.9.1-0ubuntu5 (which was the first one with user sessions enabled) and it seems all versions expose this behavior.

Jean-Baptiste Lallement (jibel) wrote :

I confirm this behaviour with build 82 on mako.
To illustrate it I made the following test:
Start and stop dialer-app every 6s (reasonable delay to give time to the application to start and stop)
The application is started with: start application APP_ID=dialer-app
and stop with a kill.

I captured RSS measured by ps after each stop and made 2 runs with Mir and 1 with SF. The result is attached.
I shows that RSS usage of init with Mir is much higher the SF and growth is exponential while it is linear with Mir.

As a side note, the number of samples with Mir is limited to less than 50 because the phone stops responding at this moment while it continues to run fine with SF. This is reproduceable.

tags: added: qa-manual-testing qa-touch rls-s-incoming
Jean-Baptiste Lallement (jibel) wrote :

replace:
I shows that RSS usage of init with Mir is much higher the SF and growth is exponential while it is linear with Mir.

by
It shows that RSS usage of init with Mir is much higher than SF and growth is exponential while it is linear with SF.

Oliver Grawert (ogra) wrote :

the growth in itself is an issue though, even SF will hit swap at some point, just later ...

James Hunt (jamesodhunt) wrote :

I don't have a maguro device so debugging is going to have to be by proxy :)

A few questions:

1) Does this only occur when the system is read-only?
2) Are logs growning fast in ~/.cache/upstart/ ?
3) What is the screen doing when it is on? Is it enough to run "powerd-cli display on" to see the issue?
4) Is the issue seen on any other hardware?
5) How do you toggle between Mir and SF?

Please could you:

a) Run the Session Init in debug mode by adding '--debug' to 'init --user' in /usr/bin/ubuntu-touch-session.
b) attach ~/.xsession-errors
c) Get an strace of the Session Init.

Sounds like this issue may be related to kernel bug 1234743.

James Hunt (jamesodhunt) wrote :

Answering own question (5): https://wiki.ubuntu.com/Touch/Testing/Mir

Le 07/10/2013 11:12, James Hunt a écrit :
> I don't have a maguro device so debugging is going to have to be by
> proxy :)
>
> A few questions:
>
> 1) Does this only occur when the system is read-only?
No. init leaks the same amount of memory when the file system is RO or RW.

> 2) Are logs growning fast in ~/.cache/upstart/ ?
No. The biggest log file is application-legacy-dialer-app-.log (the
application I used for the test) and it is around 700kB. The size is
approximately the same with Mir or SF and in RO or RW mode. (ls -lSr
/home/phablet/.cache/upstart attached)

> 3) What is the screen doing when it is on? Is it enough to run "powerd-cli display on" to see the issue?
In my test, I was launching/stopping dialer app, so the screen was
switching between the dialer and the application scope.

Running powerd-cli or not doesn't make a difference (excepted that I
have to tap the screen regularly to keep the display alive without
powerd-cli :))

> 4) Is the issue seen on any other hardware?
Reproduced on maguro and mako

> Please could you:
>
> a) Run the Session Init in debug mode by adding '--debug' to 'init --user' in /usr/bin/ubuntu-touch-session.
> b) attach ~/.xsession-errors
> c) Get an strace of the Session Init.
In progress

--
Jean-Baptiste
irc: jibel

/home/phablet/.xsession-errors with init --user --debug

strace of session init

btw, .xession-errors shows 5 run of dialer, 1 run of shorts and 1 run of calculator, and strace shows 1 run of dialer.

James Hunt (jamesodhunt) wrote :

jibel - thanks. It does indeed look like bug 1234743 as the trace shows a lot of the following events:

:sys:cpu-device-online
:sys:cpu-device-offline
:sys:power_supply-device-changed

This would explain the Session consuming a lot of CPU since those events are generated by the kernel, picked up by the upstart-udev-bridge which injects them into the system init event space. These events are then noticed by the upstart-event-bridge running as phablet and injected into the Session Init with the ":sys:" prefix.

Could you run that again as "strace -fFttv -s 1024 ..." so we can see some timestamps?

Also, please attach /var/log/udev and the output of 'ls -al /proc/$session_init_pid/fd/'.

Adding kernel since as this issue only affects particular hardward, it is highly unlikely to be an Upstart issue.

Changed in upstart (Ubuntu Saucy):
status: Confirmed → Incomplete
James Hunt (jamesodhunt) wrote :

Also, please could you run http://people.canonical.com/~jhunt/upstart/scripts/get_state.sh as user phablet and attach to this bug.

This bug is missing log files that will aid in diagnosing the problem. From a terminal window please run:

apport-collect 1235649

and then change the status of the bug to 'Confirmed'.

If, due to the nature of the issue you have encountered, you are unable to run this command, please add a comment stating that fact and change the bug status to 'Confirmed'.

This change has been made by an automated script, maintained by the Ubuntu Kernel Team.

Changed in linux (Ubuntu):
status: New → Incomplete

more verbose strace during a start/stop of dialer-app

output of get_state.sh after dialer-app has been stopped.

/var/log/udev

output of ls -al /proc/1467/fd/

Changed in upstart (Ubuntu Saucy):
status: Incomplete → Confirmed
Changed in linux (Ubuntu Saucy):
status: Incomplete → Confirmed
tags: added: kernel-bot-stop-nagging
Changed in linux (Ubuntu Saucy):
importance: Undecided → Critical
Oliver Grawert (ogra) wrote :

i can definitely make it stop consuming more ram by stopping the upstart-event-bridge session job.
this works on both devices.
the ram does not get freed but the consumption does not keep rising either anymore then.

James Hunt (jamesodhunt) wrote :

Apparently, the high cpu load stop when the upstart-event-bridge is stopped. This lends weight to the high cpu issue being caused the a kernel udev event issue (like bug 1234743).

However, what is not explained is why the Session Init retains the memory it consumes once the upstart-event-bridge is stopped.

Please would someone check to see if the issue is recreatable on grouper?

James Hunt (jamesodhunt) wrote :

@jibel - that get_state.out appears to have either been run as root, or atleast outside the session init environment. Please could you re-run as user phablet?

Also, please could you attach /var/log/udev ?

output of get_state.sh as user phablet

James Hunt (jamesodhunt) wrote :

Can someone try running "/sbin/initctl notify-disk-writeable" and see if the session inits memory consumption drops? Alternatively, run the session init as "init --user --debug --no-log".

James Hunt (jamesodhunt) wrote :

From info provided by jibel...

Forcing a log flush (#25) doesn't resolve the high memory usage (as we'd expect, but worth ruling out).

The state file in #24 looks sane to me; forcing a re-exec of the Session Init does cause the memory to fall to reasonable levels. This almost feels like a compiler bug where __attribute__ ((cleanup())) is failing for some reason.

James Hunt (jamesodhunt) wrote :

To summarise: there are 2 problems here:

1) The kernel spamming userspace with udev messages on mako+maguro (grouper doesn't appear to be affected but not independently verified yet).

2) Upstart not releasing memory once the kernel has finished spamming userspace (artificially forced by running "stop upstart-event-bridge" as user 'phablet').

output of "initctl list" as user phablet.

James Hunt (jamesodhunt) wrote :

Could someone run the "make check" tests for the libnih package on a mako/maguro device?

James Hunt (jamesodhunt) wrote :

I can no recreate the issue with memory not being freed by running: http://paste.ubuntu.com/6205552/.

This is recreatable on i386 desktop images so although problem (1) in #27 is h/w specific, problem (2) is not.

James Hunt (jamesodhunt) wrote :

Attempting to run the script in #30 against a session init as below *outside* of a the standard gui environment with no .conf files does not trigger the issue:

$ mkdir -p /tmp/conf /tmp/log
$ init --debug --no-startup-event --confdir /tmp/conf --logdir /tmp/log --user

Same outcome running like this:

$ init --debug --no-startup-event --user

So, Event objects themselves are not the issue. More likely a subtle ref-counting issue related to one of the other object types.

tags: added: kernel-key
tags: removed: rls-s-incoming
James Hunt (jamesodhunt) on 2013-10-08
Changed in upstart (Ubuntu Saucy):
assignee: nobody → James Hunt (jamesodhunt)
James Hunt (jamesodhunt) wrote :

Valgrinding the session init shows that most of the memory is correctly freed at quiesce time.

James Hunt (jamesodhunt) wrote :

Valgrind run as:

valgrind -v -v -v -v --show-reachable=yes --track-origins=yes --leak-check=full init --debug --user

summary: - session upstart leaks massive amounts of memory on Ubuntu Touch
+ uevent spam causes session upstart to consume massive amounts of memory
+ on Ubuntu Touch
Changed in linux (Ubuntu Saucy):
status: Confirmed → Incomplete
tags: added: kernel-da-key
removed: kernel-key

After discussion on irc between ogra, pitti and I:
workaround #1 - blacklist that uevent in upstart-udev-bridge, thus neither system init event will not be emitted, nor re-emitted to session upstart.

workaround #2 - blacklist that uevent in systemd-udev (via socket filter) if not needed by Mir.

The first one is proposed in a branch already, and that still leaves 5% cpu usage by systemd-udev constantly whilst the screen is on.

James Hunt (jamesodhunt) wrote :

The problem is triggered by init/control.c:control_emit_event_emitted(). It looks like Upstart itself is clearing up memory, but either dbus isn't, or upstart is missing a dbus unref call.

James Hunt (jamesodhunt) wrote :

The working theory at this stage is that init/control.c:control_disconnected() is not cleaning up the dbus connections created by the upstart-udev-bridge for each event it requests upstart emit.

James Hunt (jamesodhunt) wrote :

 init/control.c:control_disconnected() could do with a call to dbus_connection_close(), but each connection is being unref'd correctly fwics.

James Hunt (jamesodhunt) wrote :

To be clear, adding a call to dbus_connection_close() in control_disconnected() did not resolve the problem.

James Hunt (jamesodhunt) wrote :

Persistent D-Bus connections are as expected (1 / bridge and 1 for unity-panel-service).
Ephemeral D-Bus connections are being handled correctly and removed from the control_conns list.

James Hunt (jamesodhunt) wrote :

Still unclear why memory is being cached in Upstart address space, but I'm starting to think this might be a libdbus issue.

What is triggering the memory rise though is unity-panel-service: stopping this from running results in no memory increase within the Session Init.

Oliver Grawert (ogra) wrote :

note that touch images do not use unity-panel-service anywhere

James Hunt (jamesodhunt) wrote :

Right - recall that I am testing on i386 as I don't have mako/maguro :)

However, note that on Touch, the upstart code in unity8 (plugins/Unity/Indicators/indicatorsmanager.cpp) is almost identical to the upstart code in unity-panel-service (unity/services/panel-service.c).

Is it possible to somehow disable the loading of /usr/lib/i386-linux-gnu/unity8/qml/Unity/Indicators/libIndicatorsQml.so in the unity8-private package on mako/maguro to see if that stops the memory bloat?

James Hunt (jamesodhunt) wrote :

Re #41, interesting that unity-panel-service *is* installed on Touch image though :)

Michał Sawicz (saviq) wrote :
Download full text (3.9 KiB)

Sure:

=== modified file 'Panel/Panel.qml'
--- Panel/Panel.qml 2013-09-24 09:00:55 +0000
+++ Panel/Panel.qml 2013-10-09 14:14:02 +0000
@@ -71,7 +71,7 @@
         }
     }

- Indicators {
+ Showable {
         id: indicatorsMenu
         objectName: "indicators"

@@ -79,10 +79,10 @@
         y: panelBackground.y
         width: root.indicatorsMenuWidth
         shown: false
- hintValue: __panelMinusSeparatorLineHeight * 3
- panelHeight: __panelMinusSeparatorLineHeight
- openedHeight: parent.height + (pinnedMode ? 0 : root.panelHeight)
- pinnedMode: !fullscreenMode
+// hintValue: __panelMinusSeparatorLineHeight * 3
+// panelHeight: __panelMinusSeparatorLineHeight
+// openedHeight: parent.height + (pinnedMode ? 0 : root.panelHeight)
+// pinnedMode: !fullscreenMode

         property real unitProgress: (height - panelHeight) / (openedHeight - panelHeight)
     }

=== modified file 'Shell.qml'
--- Shell.qml 2013-10-08 19:11:11 +0000
+++ Shell.qml 2013-10-09 14:13:59 +0000
@@ -568,7 +568,7 @@ ...

Read more...

Michał Sawicz (saviq) wrote :

The above can be applied straight on device, in /usr/share/unity8/.

Steve Langasek (vorlon) wrote :

Workaround is in place in upstart package to reduce the flood of events that are causing this rapid memory growth, and we believe a fix to the unity upstart client will fix the memory leak entirely; but upstart needs to be more robust against misbehaving clients. Opening a task against upstart upstream.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package upstart - 1.10-0ubuntu7

---------------
upstart (1.10-0ubuntu7) saucy; urgency=low

  * Ignore emitting specific udev event spam on maguro, as a workaround
    for LP: #1235649.
 -- Dmitrijs Ledkovs <email address hidden> Wed, 09 Oct 2013 13:52:10 +0100

Changed in upstart (Ubuntu Saucy):
status: Confirmed → Fix Released
Changed in linux (Ubuntu Saucy):
status: Incomplete → Invalid
James Hunt (jamesodhunt) wrote :

This problem can now be demonstrated with a single connection to upstart that makes the following calls:

dbus_connection_open()
nih_dbus_proxy_new ()

The problem is not seen if the app instead calls:

           - nih_dbus_connect (upstart_session, disconnect_handler)
           - nih_dbus_proxy_new ()
           - nih_main_loop()

Or,

           - nih_dbus_connect (upstart_session, NULL)
           - nih_dbus_proxy_new ()
           - nih_main_loop()

Specifically, the problem appears to be that when an app calls dbus_connection_open(), one or more of the functions in nih_dbus_setup() should be being called to allow libdbus to interrupt the apps main loop and request callbacks are run to clean up resources.

James Hunt (jamesodhunt) wrote :

The problem is not observed when the only connections to upstart are the standard upstart bridges (since they use nih_dbus_connect()).

For reference, here is nih_dbus_setup() which includes one or more calls that unity seemingly should be calling itself:

http://bazaar.launchpad.net/~ubuntu-branches/ubuntu/saucy/libnih/saucy/view/head:/nih-dbus/dbus_connection.c#L202

James Hunt (jamesodhunt) wrote :

Note that disabling libIndicatorsQml.so with #44 did not resolve the problem. However, I am confident this is because there were other connections to the Session Init (probably resulting from libupstart-app-launch1).

This does need to be confirmed though.

James Hunt (jamesodhunt) wrote :

C program to trigger this bug. Run it in an environment with UPSTART_SESSION set (such as a desktop/Touch environment), then run the about-to-be attached spam.sh script.

James Hunt (jamesodhunt) wrote :

Run as:

$ spam.sh <session_init_pid>

James Hunt (jamesodhunt) wrote :

Program attached in #51 demonstrates the incorrect calling behaviour. To demonstrate the correct behaviour change the #if0/else/endif such that rather than calling dbus_connection_open(), the program calls nih_dbus_connect().

nih_dbus_connect(): No memory growth.
dbus_connection_open(): Memory growth.

summary: - uevent spam causes session upstart to consume massive amounts of memory
- on Ubuntu Touch
+ uevent spam causes libdbus client code in session upstart to consume
+ massive amounts of memory on Ubuntu Touch
James Hunt (jamesodhunt) wrote :

Note that the following call on the server side does not resolve the issue:

    while (dbus_connection_dispatch (connection) == DBUS_DISPATCH_DATA_REMAINS)
                       ;

James Hunt (jamesodhunt) wrote :

Neither does calling the following help:

    dbus_connection_flush (connection);

Steve Langasek (vorlon) wrote :

I've tested on a mako device here, and the memory usage is not all due to a misbehaved client. I've tracked down all the clients of the session init; unity8 is the only long-lived client other than the bridges, and if I kill unity8 and let it respawn, upstart gains back some of the memory, but *not all*: if I feed events into the session init, bringing its memory usage up to 93.3MB, and then kill unity8 (and, in fact,if I go through one-by-one and kill *all* the processes related to the session), upstart frees *some* memory, but its total usage remains at 88MB - compared to 5MB when it starts out.

Another thing I've found out, though, is that if I force a re-exec of the session init with 'telinit u', the memory usage drops down... and never rises again, even if I spam it with more events. This suggests a possible workaround of just re-execing the session init immediately after startup. But that probably requires landing a fix for bug #1238078 first.

Steve Langasek (vorlon) wrote :

I've also checked this memory usage with valgrind; valgrind doesn't think there's a leak, at the end of a valgrind session there are only 211,206 bytes "possibly lost" - nowhere near the many MB that we're seeing.

Steve Langasek (vorlon) wrote :

So the reason the memory usage wasn't growing after re-exec was because unity8 didn't automatically reconnect when the dbus connection dropped. If I kill and restart unity8, the problem reappears.

So the memory usage is indeed associated with the dbus client, even though disconnecting the client doesn't free the memory, and valgrind doesn't consider the memory to have been lost.

James Hunt (jamesodhunt) wrote :

Just got a much more interesting valgrind run. Juicy excerpt:

==9366== 147,560 bytes in 1,190 blocks are still reachable in loss record 187 of 188
==9366== at 0x402B965: calloc (in /usr/lib/valgrind/vgpreload_memcheck-x86-linux.so)
==9366== by 0x408E730: dbus_malloc0 (dbus-memory.c:572)
==9366== by 0x407983B: dbus_message_new_empty_header (dbus-message.c:1153)
==9366== by 0x4079D0F: dbus_message_new_method_return (dbus-message.c:1319)
==9366== by 0x80835F5: control_emit_event_reply (com.ubuntu.Upstart.c:2098)
==9366== by 0x806BC6C: event_finished (event.c:501)
==9366== by 0x806B26F: event_poll (event.c:259)
==9366== by 0x4044B63: nih_main_loop (main.c:609)
==9366== by 0x804D507: main (main.c:688)
==9366==
==9366== 228,394,752 bytes in 2,386 blocks are still reachable in loss record 188 of 188
==9366== at 0x402BB88: realloc (in /usr/lib/valgrind/vgpreload_memcheck-x86-linux.so)
==9366== by 0x408E778: dbus_realloc (dbus-memory.c:678)
==9366== by 0x408F08C: reallocate_for_length (dbus-string.c:349)
==9366== by 0x408F12B: set_length (dbus-string.c:390)
==9366== by 0x408F1A9: open_gap (dbus-string.c:411)
==9366== by 0x408FA67: copy (dbus-string.c:1198)
==9366== by 0x408FCCE: _dbus_string_copy_len (dbus-string.c:1368)
==9366== by 0x408DCF8: marshal_len_followed_by_bytes (dbus-marshal-basic.c:726)
==9366== by 0x408DD95: marshal_string (dbus-marshal-basic.c:758)
==9366== by 0x408DF60: _dbus_marshal_write_basic (dbus-marshal-basic.c:840)
==9366== by 0x4075CD7: _dbus_type_writer_write_basic_no_typecode (dbus-marshal-recursive.c:1601)
==9366== by 0x40768BB: _dbus_type_writer_write_basic (dbus-marshal-recursive.c:2323)
==9366==
==9366== LEAK SUMMARY:
==9366== definitely lost: 28 bytes in 1 blocks
==9366== indirectly lost: 24 bytes in 1 blocks
==9366== possibly lost: 12,907 bytes in 290 blocks
==9366== still reachable: 228,607,834 bytes in 3,732 blocks
==9366== suppressed: 0 bytes in 0 blocks

Steve Langasek (vorlon) wrote :

James, how was this valgrind trace produced? The stacktrace doesn't show anything above the low-level calls in dbus, making it difficult to pinpoint the actual source of the problem.

James Hunt (jamesodhunt) wrote :

Hi Steve,

The trace in #59 was produced by:

- Running Upstart via valgrind which upstart linked against debug builds of nih and dbus.
- Running a single instance of the test program (#51).
- Running spam.sh (#52).
- Waiting until init was consuming ~200MB, then sending SIGTERM to the init/valgrind process.

James Hunt (jamesodhunt) wrote :

Another valgrind log taken using the same precedure as above. However, this time I waited until the init process was consuming over 2GB of RAM. Top showed:

  PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
15329 james 20 0 2463m 2,0g 12m R 100,0 26,1 6:51.59 memcheck-x86-li

Once again, the biggest offender by a huge margin is dbus:

==15329== 1,153,679,208 bytes in 12,025 blocks are still reachable in loss record 208 of 208
==15329== at 0x402BB88: realloc (in /usr/lib/valgrind/vgpreload_memcheck-x86-linux.so)
==15329== by 0x408E778: dbus_realloc (dbus-memory.c:678)
==15329== by 0x408F08C: reallocate_for_length (dbus-string.c:349)
==15329== by 0x408F12B: set_length (dbus-string.c:390)
==15329== by 0x408F1A9: open_gap (dbus-string.c:411)
==15329== by 0x408FA67: copy (dbus-string.c:1198)
==15329== by 0x408FCCE: _dbus_string_copy_len (dbus-string.c:1368)
==15329== by 0x408DCF8: marshal_len_followed_by_bytes (dbus-marshal-basic.c:726)
==15329== by 0x408DD95: marshal_string (dbus-marshal-basic.c:758)
==15329== by 0x408DF60: _dbus_marshal_write_basic (dbus-marshal-basic.c:840)
==15329== by 0x4075CD7: _dbus_type_writer_write_basic_no_typecode (dbus-marshal-recursive.c:1601)
==15329== by 0x40768BB: _dbus_type_writer_write_basic (dbus-marshal-recursive.c:2323)
==15329==
==15329== LEAK SUMMARY:
==15329== definitely lost: 28 bytes in 1 blocks
==15329== indirectly lost: 24 bytes in 1 blocks
==15329== possibly lost: 133,265 bytes in 315 blocks
==15329== still reachable: 1,154,753,166 bytes in 18,194 blocks
==15329== suppressed: 0 bytes in 0 blocks

Dimitri John Ledkov (xnox) wrote :

systemd task assigned to me:
 - make it possible for systemd-udevd to ignore kernel events from devices tagged "ignore", using a kernel netlink filter to completely avoid wakeups of systemd-udevd.

Changed in systemd (Ubuntu Saucy):
status: New → In Progress
assignee: nobody → Dmitrijs Ledkovs (xnox)
milestone: none → ubuntu-13.10
Changed in systemd (Ubuntu Saucy):
status: In Progress → Invalid

On Tue, Oct 15, 2013 at 08:19:53AM -0000, James Hunt wrote:
> Once again, the biggest offender by a huge margin is dbus:

> ==15329== 1,153,679,208 bytes in 12,025 blocks are still reachable in loss record 208 of 208
> ==15329== at 0x402BB88: realloc (in /usr/lib/valgrind/vgpreload_memcheck-x86-linux.so)
> ==15329== by 0x408E778: dbus_realloc (dbus-memory.c:678)
> ==15329== by 0x408F08C: reallocate_for_length (dbus-string.c:349)
> ==15329== by 0x408F12B: set_length (dbus-string.c:390)
> ==15329== by 0x408F1A9: open_gap (dbus-string.c:411)
> ==15329== by 0x408FA67: copy (dbus-string.c:1198)
> ==15329== by 0x408FCCE: _dbus_string_copy_len (dbus-string.c:1368)
> ==15329== by 0x408DCF8: marshal_len_followed_by_bytes (dbus-marshal-basic.c:726)
> ==15329== by 0x408DD95: marshal_string (dbus-marshal-basic.c:758)
> ==15329== by 0x408DF60: _dbus_marshal_write_basic (dbus-marshal-basic.c:840)
> ==15329== by 0x4075CD7: _dbus_type_writer_write_basic_no_typecode (dbus-marshal-recursive.c:1601)
> ==15329== by 0x40768BB: _dbus_type_writer_write_basic (dbus-marshal-recursive.c:2323)

Yet this backtrace is still deep in the dbus internals and shows us nothing
about who the actual caller is that's causing the leak. Can you please run
this with --num-callers=20 so we can see why these strings are actually
being allocated?

James Hunt (jamesodhunt) wrote :

I can now recreate the memory bloat using a basic NIH D-Bus server program and associated client program.

1) Run the server as:

        $ ./test_nih_dbus_server unix:abstract=/com/hunt/james/foo com.hunt.james.Foo /com/hunt/james/Foo

2) Run the client as:

        $ ./test_nih_dbus_client unix:abstract=/com/hunt/james/foo com.hunt.james.Foo

3) Run the updated spammer as:

        $ ./spam_dbus_server.sh

4) Run top and watch the memory used by test_nih_dbus_server creep up.

The problem seems to be caused by the call to notify_clients() - this sends a signal to every connected client every time the spam_dbus_server.sh sends its data to the server. Commenting out the call to notify_clients() stop the memory bloating issue.

The function echo_handled() that is called by notify_clients() is analogous to control_emit_event_emitted() in Upstart parlance.

Again, running valgrind on the server shows the same stack traces as above:

==30015== 91,750,175 bytes in 7,617 blocks are still reachable in loss record 164 of 164
==30015== at 0x402BB88: realloc (in /usr/lib/valgrind/vgpreload_memcheck-x86-linux.so)
==30015== by 0x408E778: dbus_realloc (dbus-memory.c:678)
==30015== by 0x408F08C: reallocate_for_length (dbus-string.c:349)
==30015== by 0x408F12B: set_length (dbus-string.c:390)
==30015== by 0x408F1A9: open_gap (dbus-string.c:411)
==30015== by 0x408FA67: copy (dbus-string.c:1198)
==30015== by 0x408FCCE: _dbus_string_copy_len (dbus-string.c:1368)
==30015== by 0x408DCF8: marshal_len_followed_by_bytes (dbus-marshal-basic.c:726)
==30015== by 0x408DD95: marshal_string (dbus-marshal-basic.c:758)
==30015== by 0x408DF60: _dbus_marshal_write_basic (dbus-marshal-basic.c:840)
==30015== by 0x4075CD7: _dbus_type_writer_write_basic_no_typecode (dbus-marshal-recursive.c:1601)
==30015== by 0x40768BB: _dbus_type_writer_write_basic (dbus-marshal-recursive.c:2323)

James Hunt (jamesodhunt) wrote :
James Hunt (jamesodhunt) wrote :
James Hunt (jamesodhunt) wrote :

Valgrind log from test_nih_dbus_server.c run using 'valgrind --num-callers=100 ...'.

James Hunt (jamesodhunt) wrote :

FWICS D-Bus is behaving as designed: since the client has not called dbus_connection_set_wakeup_main_function(), and since D-Bus messaging is reliable, all libdbus is able to do when a signal is sent to a connection is cache any messages the client has not yet processed in the hope that at some future point it might consume them.

The key part of the D-Bus code that explains this is dbus_connection_send_preallocated_unlocked_no_update():

------------------------------

  /* Now we need to run an iteration to hopefully just write the messages
   * out immediately, and otherwise get them queued up
   */
  _dbus_connection_do_iteration_unlocked (connection,
                                          NULL,
                                          DBUS_ITERATION_DO_WRITING,
                                          -1);

  /* If stuff is still queued up, be sure we wake up the main loop */
  if (connection->n_outgoing > 0)
    _dbus_connection_wakeup_mainloop (connection);

------------------------------

As shown, the messages (signals in the case of Upstart) are added to the client connection queue, but since the client called
dbus_connection_open() rather than nih_dbus_connect(), no mainloop handler is registered to allow the client to deal with the signals sent to it.

From the upstart perspective, a fix is to call dbus_connection_has_messages_to_send() and ignore that connection if the
client still has unprocessed messages (patch attached).

From the D-Bus perspective, it would be useful to introduce a new public API:

dbus_bool_t
dbus_connection_has_wakeup_main_function (DBusConnection *connection);

This would allow Upstart and other D-Bus server applications to detect erroneous client connections.

James Hunt (jamesodhunt) wrote :

TBC, using the patch on #69 results in no memory bloat.

tags: added: patch
Steve Langasek (vorlon) wrote :

After discussing this with Ted, I think the right answer architecturally is to move these clients to talking to the session bus, where filtering of messages will be done for us automatically, and have the session init register on the session bus, rather than deploying an unreliable change like having upstart only keep one message at a time in the queue for a client. This will require a few changes in upstart, including working out the correct way to get the session init to register with the session bus after dbus has been spawned (presumably we want to reuse SIGUSR1 for this, and maybe get the session bus address out of the global environment?)

James Hunt (jamesodhunt) wrote :

We already have bug 1203595 for having the Session Init connect to the D-Bus session bus.

James Hunt (jamesodhunt) on 2013-10-18
Changed in upstart (Ubuntu):
status: Fix Released → Incomplete
status: Incomplete → In Progress
tags: added: needs-ap-test
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in unity (Ubuntu Saucy):
status: New → Confirmed
Changed in unity (Ubuntu):
status: New → Confirmed
James Hunt (jamesodhunt) on 2013-11-12
Changed in upstart:
assignee: nobody → James Hunt (jamesodhunt)
status: New → Fix Committed
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package upstart - 1.11-0ubuntu1

---------------
upstart (1.11-0ubuntu1) trusty; urgency=low

  [ Dmitrijs Ledkovs ]
  * Run as much test-suites as possible, before exiting with failure.

  [ James Hunt ]
  * New upstream release.
    - the global environment is now serialized on re-exec. LP: #1238078.
    - fixes support for upstart-file-bridge detecting directory creation.
      LP: #1221466.
    - Fixes upstart user session to not override the user's configured
      umask. LP: #1240686.
    - Makes the session init available on the session dbus.
      LP: #1203595, #1235649.

  [ Steve Langasek ]
  * Document the authoritative Vcs branch in debian/control.
  * Update Standards-Version to 3.9.5, no changes required.
 -- James Hunt <email address hidden> Fri, 15 Nov 2013 13:16:56 +0000

Changed in upstart (Ubuntu):
status: In Progress → Fix Released
Changed in landing-plan:
status: New → Fix Released
Changed in unity:
status: New → Confirmed
Changed in upstart:
status: Fix Committed → Fix Released
status: Fix Released → Confirmed
James Hunt (jamesodhunt) on 2014-07-24
Changed in upstart:
status: Confirmed → Fix Released
Changed in unity (Ubuntu Saucy):
status: Confirmed → Invalid
Ower (ower86-g) wrote :

same problem.
Linux 3.14.0-generic
ubuntu 14.04
kdm

init --user use 1.8GB memory in 9 days

To post a comment you must log in.