dlm_controld.pcmk segfault

Bug #571612 reported by Oliver Heinz
This bug affects 3 people
Affects                  Status        Importance  Assigned to  Milestone
Red Hat Cluster          Fix Released  Medium
redhat-cluster (Ubuntu)  Invalid       High        Unassigned

Bug Description

Anyone who uses link aggregation (as I do), bridging, or VLANs is affected, due to the time required to bring up the network after a reboot. Corosync comes up and dlm segfaults. This has been fixed upstream, and the fix is included in Maverick and later.

Upstream bug report and patch [1]. Patch committed upstream [2]. Discussion of the issue [3].

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=586752
[2]: http://git.fedorahosted.org/git/?p=cluster.git;a=commitdiff;h=fa24b460c51aa0c47d0842703feea8bca0ed66b7
[3]: http://oss.clusterlabs.org/pipermail/pacemaker/2010-April/005954.html

Revision history for this message
Oliver (oliver-redhat-bugs) wrote:

Created attachment 409748
Andrew Beekhof's patch to fix this issue

Description of problem:
dlm_controld.pcmk segfaults on startup if the network uses VLANs, bonding, or bridging and corosync/pacemaker is started too early

Version-Release number of selected component (if applicable):
Bug and patch tested on the 3.0.7 Ubuntu Lucid packages.

How reproducible:
Configure any of the above on top of the raw interface and start corosync before the network settles.

Additional info:
The issue is discussed here: http://oss.clusterlabs.org/pipermail/pacemaker/2010-April/005954.html

Andrew Beekhof <email address hidden> posted the attached patch that fixes this issue.

gdb output is:
Core was generated by `dlm_controld.pcmk -q 0'.
Program terminated with signal 11, Segmentation fault.
#0 __strlen_sse2 () at ../sysdeps/x86_64/multiarch/../strlen.S:31
        in ../sysdeps/x86_64/multiarch/../strlen.S
#0 __strlen_sse2 () at ../sysdeps/x86_64/multiarch/../strlen.S:31
#1 0x00007f499565cd46 in *__GI___strdup (s=0x0) at strdup.c:42
#2 0x0000000000403f0c in dlm_process_node (key=<value optimized out>, value=0x1864a30, user_data=0x62a4f8) at /usr/src/packages/redhat-cluster/3.0.7/redhat-cluster-3.0.7/group/dlm_controld/pacemaker.c:136
#3 0x00007f4995cdbd73 in IA__g_hash_table_foreach (hash_table=0x1866050, func=0x403e40 <dlm_process_node>, user_data=0x62a4f8) at /build/buildd/glib2.0-2.24.0/glib/ghash.c:1325
#4 0x0000000000403c9e in update_cluster () at /usr/src/packages/redhat-cluster/3.0.7/redhat-cluster-3.0.7/group/dlm_controld/pacemaker.c:82
#5 0x0000000000415a4a in loop () at /usr/src/packages/redhat-cluster/3.0.7/redhat-cluster-3.0.7/group/dlm_controld/main.c:986
#6 0x000000000041659c in main (argc=<value optimized out>, argv=<value optimized out>) at /usr/src/packages/redhat-cluster/3.0.7/redhat-cluster-3.0.7/group/dlm_controld/main.c:1295
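
The backtrace pins the crash down: frame #1 shows glibc's strdup() being called with s=0x0 from dlm_process_node(). As a minimal illustration of that failure mode (not the actual dlm_controld source; the struct and names below are made up for the example):

#include <stdlib.h>
#include <string.h>

/* Illustrative stand-in for the per-node data handed to
 * dlm_process_node(); in the real code the address comes from
 * pacemaker's membership cache and is NULL until the network is up. */
struct node_info {
    const char *addr;
};

int main(void)
{
    struct node_info node = { .addr = NULL };

    /* strdup(NULL) is undefined behaviour: glibc first calls strlen()
     * on the pointer, which dereferences it and raises SIGSEGV,
     * matching frames #0 (__strlen_sse2) and #1 (__GI___strdup) above. */
    char *copy = strdup(node.addr);

    free(copy);
    return 0;
}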

hth,
Oliver

Revision history for this message
Andrew (andrew-redhat-bugs) wrote:

Patch fa24b46 resolving this issue has been committed in cluster.git
   http://git.fedorahosted.org/git/?p=cluster.git;a=commitdiff;h=fa24b460c51aa0c47d0842703feea8bca0ed66b7

Essentially, the dlm was trying to create a configfs entry for a node with no address.
This led to a NULL pointer being dereferenced and the dlm crashing.

The above-mentioned patch now checks for a valid address before continuing.
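
For illustration, the guard has roughly the following shape (a sketch under assumed names, not the literal upstream diff; see the commit linked above for the real change):

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

/* Stand-in for pacemaker's membership entry; field names here are
 * illustrative, not the real layout. */
struct member {
    const char *uname;  /* node name */
    const char *addr;   /* NULL until an address is known */
};

/* Called once per node on a membership update, the way
 * dlm_process_node() is invoked via g_hash_table_foreach()
 * in frame #3 of the backtrace. */
static void process_node(struct member *node)
{
    if (node->addr == NULL) {
        /* The fix: skip nodes with no address instead of passing
         * NULL on to strdup() and configfs; the node is picked up
         * again on a later membership update. */
        fprintf(stderr, "skipping %s: no address yet\n", node->uname);
        return;
    }

    char *addr_copy = strdup(node->addr);  /* guaranteed non-NULL now */
    /* ... create the configfs entry for this node ... */
    free(addr_copy);
}

int main(void)
{
    struct member pending = { "node2", NULL };
    struct member ready   = { "node1", "192.168.1.1" };

    process_node(&pending);  /* skipped safely */
    process_node(&ready);    /* processed */
    return 0;
}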

Revision history for this message
Andrew (andrew-redhat-bugs) wrote:

Sorry, set the wrong status.

Revision history for this message
Bug (bug-redhat-bugs) wrote:

This bug appears to have been reported against 'rawhide' during the Fedora 14 development cycle.
Changing version to '14'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Revision history for this message
Jacob Smith (jsmith-argotecinc) wrote:

Any update on this? It's been 8 months since this was found and reported. Any info would be great, as it is a real pain!

Anyone who uses link aggregation (as I do), bridging, or VLANs is affected, due to the time required to bring up the network after a reboot. Corosync comes up and dlm segfaults (something to do with the network not actually being completely started?). I want a node to be able to completely recover after being fenced or the like, and with this problem it won't start all its resources again after the segfault without restarting corosync.

I believe it is fixed in 3.0.12, which is in Maverick; maybe that could be backported?
Or the above-referenced patch applied...

I could figure out how to apply the patch myself, but I would rather stay in sync with the HA packages...

Revision history for this message
Andres Rodriguez (andreserl) wrote:

Hi Jacob,

I'll take a look at this in the next few days!

Thank you

Changed in redhat-cluster (Ubuntu):
importance: Undecided → Medium
status: New → Triaged
assignee: nobody → Andres Rodriguez (andreserl)
Revision history for this message
Jacob Smith (jsmith-argotecinc) wrote:

Great!

Changed in redhat-cluster (Ubuntu):
status: Triaged → In Progress
importance: Medium → High
description: updated
Revision history for this message
Andres Rodriguez (andreserl) wrote:

Jacob,

I have uploaded a test package to a PPA. However, I'm assuming that you are using redhat-cluster 3.0.2-2 in Ubuntu Lucid. Is this correct? If so, I believe that redhat-cluster does not have pacemaker support enabled in Lucid, which would make this bug report invalid for that release. Or are you using some other version of RHCS? If so, where did you obtain it?

Otherwise, you can test whether the segfault is still present:

sudo apt-get install python-software-properties
sudo add-apt-repository ppa:andreserl/ha
sudo apt-get update
sudo apt-get install redhat-cluster-suite

Best regards,

Revision history for this message
Jacob Smith (jsmith-argotecinc) wrote:

Andres,

Good point - I actually don't even have redhat-cluster installed...
What led me to this bug report was following the Pacemaker, DRBD8, and OCFS2 test case in the Ubuntu-HA wiki and other info online. I encountered the segfault and a little digging brought me to this bug.
I'm not sure, though, that not having redhat-cluster installed makes this bug invalid. I believe I hit the bug because the libdlm3-pacemaker package I installed (from ppa:ubuntu-ha/lucid-cluster) for DLM support in Pacemaker has this problem. Correct me if I'm wrong in that assumption. Here are what I think are the relevant packages I have installed:

libdlm3-pacemaker 3.0.7-0ubuntu0ppa2.2 RHCS compatibility package -- dlm_controld f
ocfs2-tools 1.4.3-1ubuntu0ppa4 tools for managing OCFS2 cluster
pacemaker 1.0.8+hg15494-2ubuntu2 HA cluster resource manager

I may be way off, but is it possible to apply the patch against the libdlm3-pacemaker package instead of the redhat-cluster package?
I am just in the testing phase right now with pacemaker/drbd/ocfs2 on a couple of servers, so I can definitely test a fix for you.

Thanks!

Revision history for this message
Andres Rodriguez (andreserl) wrote:

Jacob,

Well, if the affected package is in a PPA, that's a totally different thing, given that it does not directly affect the distribution, and of course it is easier to patch (there's no hassle of a review to see whether it affects the Ubuntu archive).

Because of this, I'll review the above package and provide a fix for it in the PPA you listed above.

Also note that this bug doesn't apply to Ubuntu Maverick either, so I'm marking it as invalid.

Revision history for this message
Andres Rodriguez (andreserl) wrote:

Thank you for reporting bugs and trying to make Ubuntu better.

I'm marking this bug report as invalid, given that it refers to a feature that is enabled in neither Lucid nor Maverick and has already been fixed upstream for Natty.

Regards

Changed in redhat-cluster (Ubuntu):
status: In Progress → Invalid
assignee: Andres Rodriguez (andreserl) → nobody
Changed in redhatcluster:
importance: Unknown → Medium
status: Unknown → Fix Released