
Drop raexecupstart.patch and fix_lrmd_leak.patch to not cause socket leak in lrmd.

Reported by Wolfgang Scherer on 2011-08-05
This bug affects 4 people
Affects                  Importance  Assigned to
cluster-glue (Ubuntu)    High        Andres Rodriguez
Oneiric                  High        Andres Rodriguez
Precise                  High        Andres Rodriguez

Bug Description

ii cluster-glue 1.0.7-3ubuntu2 The reusable cluster components for Linux HA

The commands `crm ra classes` and `crm ra list` cause a socket leak in the lrmd daemon.

When approx. 1024 sockets are allocated, the lrmd becomes unresponsive and must be killed.
The syslog then shows repeated entries:

  Aug 3 10:25:08 server lrmd: [1941]: ERROR: socket_accept_connection: accept(sock=6): Too many open files
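The descriptor growth can be watched directly under /proc while repeating the offending command. The sketch below is illustrative, not part of cluster-glue (`count_fds` is a made-up helper name); it assumes Linux and demonstrates on the current shell, since on an affected system the pid would come from `pidof lrmd`:

```shell
# Count the open file descriptors of a process via /proc (Linux only).
# count_fds is an illustrative helper, not part of any cluster-glue tool.
count_fds() {
    ls "/proc/$1/fd" | wc -l
}

# Demonstrated here on the current shell; on an affected system you would
# instead run something like:
#   while :; do crm ra classes >/dev/null; count_fds "$(pidof lrmd)"; done
# and watch the count climb toward the ~1024 descriptor limit.
count_fds $$
```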

While I only use these commands during development, it is still a nuisance.

The leak does not appear for other commands, e.g. `crm resource
list`, but I have not tested exhaustively.

I originally reported this bug at http://developerbugs.linux-foundation.org/show_bug.cgi?id=2626.

There I was informed that the behavior most likely stems from an
unsupported patch (raexecupstart.patch) in the Ubuntu package.
When I remove that patch, the socket leak does indeed go away.

Although I did not have any "deadlock" situations with the original
code, I replaced it with the attached patch which should prevent any
possible recursive calls of the `on_remove_client' function.

*******************************
After further investigation it was determined that the problem was in glib itself and the patch was not needed in the latest releases of Ubuntu; rather, these patches were themselves causing the socket leak.

Ante Karamatić (ivoks) on 2011-11-01
Changed in cluster-glue (Ubuntu):
status: New → Confirmed
assignee: nobody → Ante Karamatić (ivoks)
Ante Karamatić (ivoks) wrote :

Hi Wolfgang

I've tested your patch and I didn't have luck with it (while loop of crm ra classes still brings lrmd to its knees; socket count hits the limit). Does that patch work for you?

The attachment "avoid recursive invocation of on_remove_client" of this bug report has been identified as being a patch. The ubuntu-reviewers team has been subscribed to the bug report so that they can review the patch. In the event that this is in fact not a patch, you can resolve this situation by removing the tag 'patch' from the bug report and editing the attachment so that it is not flagged as a patch. Additionally, if you are a member of the ubuntu-sponsors team, please also unsubscribe the team from this bug report.

[This is an automated message performed by a Launchpad user owned by Brian Murray. Please contact him regarding any issues with the action taken in this bug report.]

tags: added: patch

Hi Ante,

the patch does indeed work for me.

Here is exactly what I did:

dpkg-source -x cluster-glue_1.0.7-3ubuntu2.dsc

> dpkg-source: info: extracting cluster-glue in cluster-glue-1.0.7
> dpkg-source: info: unpacking cluster-glue_1.0.7.orig.tar.bz2
> dpkg-source: info: unpacking cluster-glue_1.0.7-3ubuntu2.debian.tar.gz
> dpkg-source: info: applying raexecupstart.patch

patch -R -p 0 <cluster-glue-1.0.7/debian/patches/raexecupstart.patch
patch -p 0 <bug-check-lrmd.dif
cd cluster-glue-1.0.7
dpkg-buildpackage
dpkg -i cluster-glue_1.0.7-3ubuntu2_amd64.deb

Ante Karamatić (ivoks) wrote :

Which Ubuntu version is that?

I've noticed that with the same source built on Lucid and Oneiric I get different results. On Oneiric it works, on Lucid it doesn't.

Changed in cluster-glue (Ubuntu):
assignee: Ante Karamatić (ivoks) → Andres Rodriguez (andreserl)
Changed in cluster-glue (Ubuntu Oneiric):
assignee: nobody → Andres Rodriguez (andreserl)
Changed in cluster-glue (Ubuntu Precise):
importance: Undecided → High
Changed in cluster-glue (Ubuntu Oneiric):
importance: Undecided → High
status: New → Incomplete
status: Incomplete → Confirmed

Ubuntu Natty.
I have not checked Lucid.

Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cluster-glue - 1.0.8-2ubuntu1

---------------
cluster-glue (1.0.8-2ubuntu1) precise; urgency=low

  * debian/patches (LP: #821732):
    - raexecupstart.patch: Drop as this does not fix the leak issue.
    - fix_lrmd_leak.patch: Add new patch that correctly fixes the issue.
 -- Andres Rodriguez <email address hidden> Mon, 07 Nov 2011 14:49:50 -0500

Changed in cluster-glue (Ubuntu Precise):
status: Confirmed → Fix Released
Ante Karamatić (ivoks) wrote :

Maverick gives the same results as Lucid. I believe a change in glib between Maverick and Natty solved this problem.

Ante Karamatić (ivoks) wrote :

For Lucid and Maverick, Wolfgang's patch for cluster-glue isn't enough. The patch is good, but glib has an issue. There is an upstream fix for it:

https://mail.gnome.org/archives/commits-list/2010-November/msg01816.html

The attached patch is tested on Lucid. With Wolfgang's patch for cluster-glue, both the deadlock and the socket leaks are eliminated.

Ante Karamatić (ivoks) wrote :

How to test the bug and the fix

Install Lucid.
Add the ubuntu-ha-maintainers PPA and update the repo:
        apt-add-repository ppa:ubuntu-ha-maintainers/ppa ; apt-get update
Install pacemaker:
        apt-get -y install pacemaker
Enable corosync (/etc/default/corosync) and start it:
        sed -i -e 's/START=no/START=yes/' /etc/default/corosync ; \
        service corosync start
Open a few client->server connections:
        lrmadmin -C ; lrmadmin -C ; lrmadmin -C ; lrmadmin -C
Check number of open sockets:
        lsof -f | grep lrm_callback_sock | wc -l
The correct value is 2, but it will be 6 or 8. There's a socket leak.

Stop corosync:
        service corosync stop
Add ppa:ivoks/ha:
        apt-add-repository ppa:ivoks/ha ; apt-get update ; apt-get -y upgrade
Start corosync:
        service corosync start
Repeat the test with client->server connections:
        lrmadmin -C ; lrmadmin -C
It deadlocks on the second run.

Kill lrmd and stop corosync:
        killall -9 lrmd ; service corosync stop
Add ppa:ivoks/glib:
        apt-add-repository ppa:ivoks/glib ; apt-get update ; apt-get -y upgrade
Start corosync:
        service corosync start
Run the test again:
        lrmadmin -C ; lrmadmin -C ; lrmadmin -C ; lrmadmin -C
It doesn't deadlock.
Check the socket count:
        lsof -f | grep lrm_callback_sock | wc -l
It's 2. Sockets do not leak and glib doesn't deadlock.
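The pass/fail decision at the end of the procedure can be wrapped in a tiny helper. This is only a sketch of the manual check above (`check_leak` is a made-up name), fed with the number reported by `lsof -f | grep lrm_callback_sock | wc -l`, where 2 is the healthy value:

```shell
# check_leak: compare the callback-socket count against the expected
# healthy value of 2 from the test plan above. Illustrative helper only;
# the count would come from: lsof -f | grep lrm_callback_sock | wc -l
check_leak() {
    if [ "$1" -gt 2 ]; then
        echo "LEAK: $1 lrm_callback_sock sockets open"
        return 1
    fi
    echo "OK: $1 lrm_callback_sock sockets open"
}

check_leak 2          # healthy system
check_leak 8 || true  # leaking system, as seen before the fix
```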

Ante Karamatić (ivoks) wrote :

Actually, if both Wolfgang's patch and raexecupstart.patch are dropped, lrmd/lrmclient will work as expected in 11.04, 11.10 and Precise.

Once glib is fixed in 10.04 and 10.10, we can drop the raexecupstart patch in cluster-glue for those versions.

So, Andres, please drop all cluster-glue patches in Precise. Please remove the raexecupstart patch in cluster-glue for 11.04 and 11.10 and ask for SRUs.

For 10.04 and 10.10, we need to get glib fix SRU first and then cluster-glue SRU, removing raexecupstart patch.

summary: - socket leak in lrmd
+ Drop raexecupstart.patch and fix_lrmd_leak.patch to not cause socket
+ leak in lrmd.
description: updated
Changed in cluster-glue (Ubuntu Precise):
status: Fix Released → In Progress
Launchpad Janitor (janitor) wrote :

This bug was fixed in the package cluster-glue - 1.0.8-2ubuntu2

---------------
cluster-glue (1.0.8-2ubuntu2) precise; urgency=low

  * debian/patches/fix_lrmd_leak.patch: Drop as the issue was in glib and it
    is now fixed. (LP: #821732)
 -- Andres Rodriguez <email address hidden> Thu, 10 Nov 2011 13:22:47 -0500

Changed in cluster-glue (Ubuntu Precise):
status: In Progress → Fix Released

Not sure these symptoms would be the same on the later versions, but on Lucid I also encountered the following due to this bug:

service corosync stop - never completes; hangs waiting for crmd to shut down (over 6 hours without a change)
crm ra info ocf:xx:xx - hangs the crm shell
crm configure primitive p_test ocf: - hung when trying to use tab completion of ocf:<tab>
