g_dbus memory leak in lrmd

Bug #1316970 reported by Greg Murphy on 2014-05-07
26
This bug affects 4 people
Affects Status Importance Assigned to Milestone
pacemaker (Ubuntu)
Undecided
Unassigned
Trusty
Medium
Guilherme G. Piccoli

Bug Description

[Impact]
lrmd daemon with upstart resource has memory leak in Trusty

affected to pacemaker 1.1.10.
affected to glib2.0 2.40.2-0ubuntu1 >> for glib2.0 created new lp [1]

Please note that patch for pacemaker is created myself.

[Test Case]

https://pastebin.ubuntu.com/p/fqK6Cx3SKK/
you can check memory leak with this script

[Regression]
Restarting daemon after upgrading this pkg will be needed. this patch adds free for non-freed dynamic allocated memory. so it solves memory leak.

[Others]

this patch is from my self with testing.

Please review carefully if it is ok.

[1] https://bugs.launchpad.net/ubuntu/+source/glib2.0/+bug/1750741

[Original Description]

I'm running Pacemaker 1.1.10+git20130802-1ubuntu1 on Ubuntu Saucy (13.10) and have encountered a memory leak in lrmd.

The details of the bug are covered here in this thread (http://oss.clusterlabs.org/pipermail/pacemaker/2014-May/021689.html) but to summarise, the Pacemaker developers believe the leak is caused by the g_dbus API, the use of which was removed in Pacemaker 1.11.

I've also attached the Valgrind output from the run that exposed the issue.

Given that this issue affects production stability (a periodic restart of Pacemaker is required), will a version of 1.11 be released for Trusty? (I'm happy to upgrade the OS to Trusty to get it).

If not, can you advise which version of the OS will be the first to take 1.11 please?

Greg Murphy (greg-murphy) wrote :
Launchpad Janitor (janitor) wrote :

Status changed to 'Confirmed' because the bug affects multiple users.

Changed in pacemaker (Ubuntu):
status: New → Confirmed
Seyeong Kim (xtrusia) on 2018-01-05
description: updated
Changed in pacemaker (Ubuntu):
assignee: nobody → Seyeong Kim (xtrusia)
assignee: Seyeong Kim (xtrusia) → nobody
tags: added: sts
Seyeong Kim (xtrusia) on 2018-01-05
tags: added: sts-sru-needed
Seyeong Kim (xtrusia) on 2018-01-05
description: updated
Eric Desrochers (slashd) wrote :

Hi Seyeong,

As we speak the LP build farm and autopkgtest request.cgi are disabled for maintenance; no ETA yet.
No new uploads are allowed during this temporary freeze cause by the maintenance.

We will gladly review your patch when everything will back to normal.

- Eric

description: updated
Eric Desrochers (slashd) on 2018-01-05
Changed in pacemaker (Ubuntu):
status: Confirmed → Fix Released
Changed in pacemaker (Ubuntu Trusty):
assignee: nobody → Seyeong Kim (xtrusia)
importance: Undecided → Medium
status: New → In Progress
Dan Streetman (ddstreet) on 2018-01-09
tags: added: sts-sponsor-ddstreet
Dan Streetman (ddstreet) wrote :

Hi Seyeong,

the LP build farm is still unavailable, but once it's back up I'll review and sponsor this.

Dan Streetman (ddstreet) wrote :

Seyeong,

I prepared a package with the debdiff for uploading, but it is failing to build during the autopkgtests; can you check that and update the debdiff please?

configure: creating ./config.status
config.status: creating Makefile
config.status: creating Doxyfile
config.status: creating coverage.sh
config.status: WARNING: 'coverage.sh.in' seems to ignore the --datarootdir setting
config.status: error: cannot find input file: `cts/Makefile.in'
make: *** [build-stamp] Error 1
dpkg-buildpackage: error: debian/rules build gave error exit status 2
blame: pacemaker_1.1.10+git20130802-1ubuntu2.5.dsc
badpkg: rules build failed with exit code 2

You can check this by creating the patched source files and then run autopkgtest on it, e.g.:

$ autopkgtest -U pacemaker_1.1.10+git20130802-1ubuntu2.5.dsc -- lxd ubuntu-daily:trusty

Seyeong Kim (xtrusia) wrote :

Hello Dan & Eric

I confirmed that the issue dan mentioned is happening with refactoring commit.

and also the other person reported me this pkg doesn't fix this issue.

it seemed fix this but actually it is not.

so I think I need more time to analyze this issue.

Could you please give me more time for this?

Thanks.

Dan Streetman (ddstreet) wrote :

Sure, I'll unsubscribe sts-sponsors while you update the patch, just re-subscribe us when it's ready for review. Thanks!

Doug Parrish (dparrish) wrote :

Tested pacemaker 1.1.10+git20130802-1ubuntu2.4+sf160627v20180123.4 from xtrusia PPA. RSS was observed to increase slowly over time (less than 20KB/hr), and the other memory metrics[0] I gathered did not increase at all; whereas, prior to the patch, all memory metrics were increasing and more quickly. If not fixed, it certainly appears to have been significantly mitigated.

[0] specified RSS, SZ, SIZE, and VSZ in ps -o option

Seyeong Kim (xtrusia) wrote :
Seyeong Kim (xtrusia) wrote :
Seyeong Kim (xtrusia) wrote :
Seyeong Kim (xtrusia) wrote :
Seyeong Kim (xtrusia) on 2018-02-02
description: updated
Changed in glib2.0 (Ubuntu Trusty):
assignee: nobody → Seyeong Kim (xtrusia)
Dan Streetman (ddstreet) wrote :
Download full text (4.6 KiB)

Hi Seyeong,

first, I commend you for going to the trouble of fixing tests in lp1316970_trusty_glib2.0.debdiff, but I think those are unrelated to this actual bug and so the patch is not required for this. I'll focus only on the first patch.

In that patch, I did a quick review of the changes, and I'm not entirely sure they are correct? LP bug comments are not a great format for reviewing patches, so please excuse formatting/line-wrapping:

> --- a/lib/services/upstart.c
> +++ b/lib/services/upstart.c
> @@ -70,6 +70,7 @@
> if (error) {
> crm_err("Can't connect obtain proxy to %s interface: %s", interface, error->message);
> g_error_free(error);
> + g_object_unref(proxy);

this doesn't seem right - if error is non-NULL, then proxy should be NULL, according to the docs:
https://developer.gnome.org/gio/stable/GDBusProxy.html#g-dbus-proxy-new-for-bus-sync

> proxy = NULL;
> }
> return proxy;
> @@ -107,7 +108,9 @@
> /*
> com.ubuntu.Upstart0_6.GetJobByName (in String name, out ObjectPath job)
> */
> - GVariant *_ret = g_dbus_proxy_call_sync(proxy, "GetJobByName", g_variant_new("(s)", arg_name),
> + GVariant *_ret = NULL;
> +
> + _ret = g_dbus_proxy_call_sync(proxy, "GetJobByName", g_variant_new("(s)", arg_name),
> G_DBUS_CALL_FLAGS_NONE, -1, cancellable, error);

unless i'm missing something, there is no need for this change?

>
> if (_ret) {
> @@ -200,6 +203,7 @@
>
> g_variant_iter_free(iter);
> g_variant_unref(_ret);
> + free(path);

this does not seem correct - per the docs, 'path' will be freed by each call in the while loop to g_variant_iter_loop(), and only needs to be manually freed after the while loop if the code breaks out of the loop manually. That doesn't appear to be the case, so path should not need freeing here.
https://developer.gnome.org/glib/stable/glib-GVariant.html#g-variant-iter-loop

> return units;
> }
>
> @@ -224,7 +228,7 @@
> } else if (pass) {
> crm_trace("Got %s", path);
> }
> - /* free(path) */
> + free(path);

technically this should be freed only if !error - the freeing should go into the else if (pass) block above, although free(NULL) will just do nothing

however, i believe this should use g_free() instead of free()

> return pass;
> }
>
> @@ -272,6 +276,8 @@
>
> g_object_unref(proxy);
> g_variant_unref(_ret);
> + g_variant_unref(value);
> + g_variant_unref(asv);

asv is returned from g_variant_get_child_value, which does state in its docs that the returned value should be unref'ed, so this looks correct for 'asv'.
https://developer.gnome.org/glib/stable/glib-GVariant.html#g-variant-get-child-value

However value is returned from g_variant_lookup_value, which returns a value from a key-value pair in asv (as best i can tell from the docs for the function), and it does not indicate that the value is newly allocated or that it should be freed. So it does not appear that 'value' should be unref'ed here.
https://developer.gnome.org/glib/stable/glib-GVariant.html#g-variant-lookup-value

> return output;
> }
>
> @@ -299,11 +...

Read more...

Seyeong Kim (xtrusia) wrote :
Seyeong Kim (xtrusia) wrote :

Hello ddstreet

Thanks for the review.

I changed some code for pacemaker. ( char -> gchar for matching type, with g_free)

and glib2.0 patch is needed in my test.

without it, memory leak is there. 4kB leak every 10 second at least.

Thanks.

Seyeong Kim (xtrusia) wrote :

hmm, this has leak more than before..

testing..

Seyeong Kim (xtrusia) wrote :
Seyeong Kim (xtrusia) wrote :

For V3.

It seems that g_variant_unref(value); part is needed.

Added it again and no memory leaks.

in g_variant_lookup_value, it calls g_variant_get_child_value

so as you said. it need to be unref value as well i think.

Thanks.

Seyeong Kim (xtrusia) wrote :

@ddstreet

Could you please review V3 again?

Thanks in advance!

Eric Desrochers (slashd) wrote :

Seyeong/ddstreet,

As long as this critical bug (LP: #1740892) affecting corosync/pacemaker relation is not fix, I would be reluctant to sponsor this bug for the following reasons :

- To make sure it won't block potential pacemaker sru to fix this critical bug by start this one here.
- To make sure to not force another package upgrade issue and impact other users until (LP: #1740892) is figure out and fix.

- Eric

Seyeong Kim (xtrusia) wrote :

Hello Eric, ddstreet

#1740892 seems starting to fix

Could you guys can sponsor this issue as well?

Thanks a lot

Eric Desrochers (slashd) wrote :

Seyeong,

The SRU for LP: #1740892 is not started yet, but base on nacc's comment, It should start soon :
https://bugs.launchpad.net/charm-hacluster/+bug/1740892/comments/47

I would suggest to wait for LP: #1740892 SRU to be completed first since it affects package upgrade (corosync/pacemaker relation).

Additionally, IMHO, it may be "safer" to not introduce too much change in one SRU, especially considering LP: #1740892 mark as Critical.

1) Please keep monitoring LP: #1740892 and let's wait for the SRU to turn "Fix Released" for Trusty.
2) Then let's start LP: #1316970 SRU right after using either "STS-Sponsor" or "Ubuntu Sponsors Team"

- Eric

Seyeong Kim (xtrusia) wrote :

sure, no problem

Thanks eric

Dan Streetman (ddstreet) wrote :

Seyeong, can you also clarify - are the glib2.0 patches also needed to fix memory leaks?

Sebastien Bacher (seb128) wrote :

the glib change seems buggy, the upstream bug corresponding to the first patch https://bugzilla.gnome.org/show_bug.cgi?id=758641 states that the issue was in code commited in 2.45 or trusty has 2.40 so it seems wrong to try to use it on that codebase (unsure about the other ones)

Seyeong Kim (xtrusia) wrote :

Hello ddstreet

I confirmed that glib2.0 patch is needed.

I installed patched_pacemaker only on my system and running just 1 hour.

it started with 4700(RES) but now it is 8000(RES).

after installing glib2.0

it started with 4612, now is is 4688

Thanks

[1] is script to reproduce this issue. need to adjust some code but just uploading

[1] https://pastebin.ubuntu.com/p/CGrGhjdr4g/

Dan Streetman (ddstreet) wrote :

Ok, in that case, can you open a separate lp bug for the glib2.0 memory leak? I think it will be easier to track just fixing the pacemaker mem leaks in this bug, and track fixing the glib2.0 mem leak in a separate bug.

Dan Streetman (ddstreet) wrote :

Seyeong,

can you check this updated debdiff to verify it still fixes the mem leak?

I reverted the free->g_free changes that I previously suggested, as you were correct - the existing code uses chars and normal free() so we should stick with that, i was wrong to suggest using g_free().

Also I removed one of the free() that did not seem correct per the docs.

Seyeong Kim (xtrusia) on 2018-02-21
description: updated
description: updated
Dan Streetman (ddstreet) wrote :

glib2.0 memleak patch moved over to bug 1750741

no longer affects: glib2.0 (Ubuntu)
no longer affects: glib2.0 (Ubuntu Trusty)
Changed in pacemaker (Ubuntu Trusty):
assignee: Seyeong Kim (xtrusia) → Guilherme G. Piccoli (gpiccoli)

Dan, thanks for your v4! I just tested it and the leak
seems much less intense - the patch mitigated it very
fine. But still...we're leaking 16K/hour.

I'll investigate this further.

Download full text (3.1 KiB)

Hi Dan, made some progress on the investigation (not definitive, but still it helps us to continue with the SRU process).

By using Valgrind memcheck analyzer I couldn't observe any non-constant leaks after Seyeong's patch gets applied. Then, I started to use two more analyzers in order to obtain more information of lrmd's memory behavior: DHAT and massif, both contained in Valgrind's pack of analyzers.

DHAT (dynamic heap analyzer tool) allowed me to confirm some hypothesis. By running the experiment 3 times, the first for 10 minutes, then for 20 minutes and finally for 30 minutes, I observed that:

1) lrmd allocated a total of 20,671,285 bytes during 10 minutes, 38,486,917 in the 20 minute run and 56,338,077 when running for 30 minutes;

2) In all the 3 cases, the total leaked memory (from Glib library) was 104,146 bytes, a constant value;

3) Also, it measured that in all cases 20,673 bytes have lived for more than half of the run, meaning the application is allocating and de-allocating memory in reduced intervals of time, not keeping allocated memory until its end (when it would free all chunks);

Number #1 above indicates the total memory allocated - it doesn't mean this entire amount was living at same time. It basically sums all the calls to malloc-like functions during the program execution.

Number #2 indicates we have a constant amount of leaked memory, that is not increasing and so is not responsible for the slow memory increase we're observing.

Finally, number #3 shows us that this is not a case of a program allocating memory constantly and only de-allocating all chunks in the end of application at once. This was one of my hypothesis, now proved false.

That said, it's clear that heap-wise the application is not leaking an increasing amount of memory. From the stack point-of-view, by running application through massif analyzer it's possible to observe the stack behavior - the maximum size from stack was 5008 bytes, the minimum was 744 bytes. It floated between those 2 limits in a non-constant ratio, meaning it had increased and decreased over time, multiple times. This proves the stack has not much influence in the issue.

So, after that I started observing the /proc/smaps of the application, and it showed an important data point: the "area" that is growing is an anonymous non-heap map, so it was allocated through the mmap() syscall. Valgrind cannot capture mmap() syscalls, so it's likely to miss a possible leak if the memory in question was allocated through mmap(). By "stracing" the application, I saw many mmap() calls, more then munmap(). And by inspecting GLib code, I could see mmap() calls there (whereas lrmd code has none itself). So, it could be a GLib wrapper causing this slow increase of memory.

My last hypothesis is memory fragmentation, but I'd like to first exclude or confirm the mmap() idea before going with memory fragmentation hypothesis.

That all said, I believe we should continue the SRU process since Seyeong's patch was proved a valid fix for the heap leaks we had. I intend to continue the investigation to understand exactly what kind of memory behavior lrmd has to justify this slow but steady memory growth...

Read more...

Hello Greg, or anyone else affected,

Accepted pacemaker into trusty-proposed. The package will build now and be available at https://launchpad.net/ubuntu/+source/pacemaker/1.1.10+git20130802-1ubuntu2.5 in a few hours, and then in the -proposed repository.

Please help us by testing this new package. See https://wiki.ubuntu.com/Testing/EnableProposed for documentation on how to enable and use -proposed.Your feedback will aid us getting this update out to other Ubuntu users.

If this package fixes the bug for you, please add a comment to this bug, mentioning the version of the package you tested and change the tag from verification-needed-trusty to verification-done-trusty. If it does not fix the bug for you, please add a comment stating that, and change the tag to verification-failed-trusty. In either case, without details of your testing we will not be able to proceed.

Further information regarding the verification process can be found at https://wiki.ubuntu.com/QATeam/PerformingSRUVerification . Thank you in advance!

Changed in pacemaker (Ubuntu Trusty):
status: In Progress → Fix Committed
tags: added: verification-needed verification-needed-trusty

Thanks Łukasz, I've tested the pacemaker packages in -proposed and they fixed the issue for me.
The version I've tested is: 1.1.10+git20130802-1ubuntu2.5 .

The test was performed like this: I have a test-case that I ran against the version in trusty-updates (call it baseline); then I performed the packages' update to -proposed version and re-ran the test-case (call it proposed).

I've observed a memory leak of 4.5MB during 1 hour in the baseline.

After update it, I've observed a 64K memory increase in 1 hour, but 60K of it was "leaked" in the first 30 minutes. I don't believe this is a real memory leak, as explained in my previous comments. Valgrind didn't show anything, seems a normal application behavior.

Thanks,

Guilherme

tags: added: verification-done verification-done-trusty
removed: sts-sru-needed verification-needed verification-needed-trusty
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package pacemaker - 1.1.10+git20130802-1ubuntu2.5

---------------
pacemaker (1.1.10+git20130802-1ubuntu2.5) trusty; urgency=medium

  * Fixing memory leak (LP: #1316970)
    - adding free function where is missing

 -- Seyeong Kim <email address hidden> Tue, 20 Feb 2018 19:36:53 +0900

Changed in pacemaker (Ubuntu Trusty):
status: Fix Committed → Fix Released

The verification of the Stable Release Update for pacemaker has completed successfully and the package has now been released to -updates. Subsequently, the Ubuntu Stable Release Updates Team is being unsubscribed and will not receive messages about this bug report. In the event that you encounter a regression using the package from -updates please report a new bug using ubuntu-bug and tag the bug report regression-update so we can easily find any regressions.

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.