So the cherry-pick (for version trusty_pacemaker_1.1.10+git20130802-1ubuntu2.2, based on a upstream commit) seems ok since it makes lrmd (services, services_linux) to avoid repeating a timer when the source was already removed from glib main loop context:
example:
+ if (op->opaque->repeat_timer) {
+ g_source_remove(op->opaque->repeat_timer);
++ op->opaque->repeat_timer = 0;
etc...
This actually solved lrmd crashes I was getting with the testcase (explained inside this bug summary).
Analyzing your crash file (from stonith and not lrm), it looks like we have the following scenario:
==============
exited = child_waitpid(child, WNOHANG);
|_> child->callback(child, child->pid, core, signo, exitcode);
|_> stonith_action_async_done (stack shows: stonith_action_destroy()) <----> call g_resource_remove 2 times
|_> stonith_action_clear_tracking_data(action);
|_> g_source_remove(action->timer_sigterm);
|_> g_critical ("Source ID %u was not found when attempting to remove it", tag);
WHERE
==============
Child here is the "monitor" (0x7f1f63a08b70 "monitor"): /usr/sbin/fence_legacy
"Helper that presents a RHCS-style interface for Linux-HA stonith plugins"
This is the script responsible to monitor a stonith resource and it has returned (triggering monitor
callback) with the following data on it:
OBS: This means that fence_legacy returned, after checking that
st_kjpnode2 was ok, and its cleanup operation (callback) caused
the problem we faced.
As soon as it dies, the callback for this process is called:
if (child->callback) { child->callback(child, child->pid, core, signo, exitcode);
In our case, callback is:
0x7f1f6189cec0 <stonith_action_async_done> which calls
0x7f1f6189af10 <stonith_action_destroy> and then
0x7f1f6189ae60 <stonith_action_clear_tracking_data> generating the 2nd removal (g_source_remove)
with the 2nd call to g_source_remove, after glib2.0 change explained before this comment, we get a
g_critical ("Source ID %u was not found when attempting to remove it", tag);
and this generates the crash (since g_glob is called with a critical log_level causing crm_abort to be called).
POSSIBLE CAUSE:
==============
Under <stonith_action_async_done> we have:
stonith_action_t *action = 0x7f1f639f5b50.
if (action->timer_sigterm > 0) { g_source_remove(action->timer_sigterm);
}
if (action->timer_sigkill > 0) { g_source_remove(action->timer_sigkill);
}
Under <stonith_action_destroy> we have stonith_action_t *action = 0x7f1f639f5b50.
and a call to: stonith_action_clear_tracking_data(action);
Under stonith_action_clear_tracking_data(stonith_action_t * action) we have AGAIN:
stonith_action_t *action = 0x7f1f639f5b50.
if (action->timer_sigterm > 0) { g_source_remove(action->timer_sigterm); action->timer_sigterm = 0;
}
if (action->timer_sigkill > 0) { g_source_remove(action->timer_sigkill); action->timer_sigkill = 0;
}
This logic probably triggered the same problem the cherry pick addressed for lrmd,
but now for stonith (calling g_source_remove 2 times for the same source after
glib2.0 was changed).
I'll try to discuss this with upstream and confirm that this is happening.
Okay,
So the cherry-pick (for version trusty_ pacemaker_ 1.1.10+ git20130802- 1ubuntu2. 2, based on a upstream commit) seems ok since it makes lrmd (services, services_linux) to avoid repeating a timer when the source was already removed from glib main loop context:
example:
+ if (op->opaque- >repeat_ timer) { remove( op->opaque- >repeat_ timer); >repeat_ timer = 0;
+ g_source_
++ op->opaque-
etc...
This actually solved lrmd crashes I was getting with the testcase (explained inside this bug summary).
=== oss.clusterlabs .org/pipermail/ pacemaker/ 2014-October/ 022690. html oss.clusterlabs .org/pipermail/ pacemaker/ 2014-October/ 022699. html
Explanation:
g_source_remove -> http://
libglib2 changes -> http://
===
Analyzing your crash file (from stonith and not lrm), it looks like we have the following scenario:
==============
exited = child_waitpid( child, WNOHANG); callback( child, child->pid, core, signo, exitcode); action_ async_done (stack shows: stonith_ action_ destroy( )) <----> call g_resource_remove 2 times action_ clear_tracking_ data(action) ; remove( action- >timer_ sigterm) ;
|_> child->
|_> stonith_
|_> stonith_
|_> g_source_
|_> g_critical ("Source ID %u was not found when attempting to remove it", tag);
WHERE
==============
Child here is the "monitor" (0x7f1f63a08b70 "monitor"): /usr/sbin/ fence_legacy
"Helper that presents a RHCS-style interface for Linux-HA stonith plugins"
This is the script responsible to monitor a stonith resource and it has returned (triggering monitor
callback) with the following data on it:
------ data (begin) ------ timeout= 20
agent=fence_legacy
action=monitor
plugin=external/ssh
hostlist=kjpnode2
timeout=20
async=1
tries=1
remaining_
timer_sigterm=13
timer_sigkill=14
max_retries=2
pid=1464
rc=0 (RETURN CODE)
string buffer: "Performing: stonith -t external/ssh -S\nsuccess: 0\n"
------ data (end) ------
OBS: This means that fence_legacy returned, after checking that
st_kjpnode2 was ok, and its cleanup operation (callback) caused
the problem we faced.
As soon as it dies, the callback for this process is called:
if (child->callback) {
child- >callback( child, child->pid, core, signo, exitcode);
In our case, callback is:
0x7f1f6189cec0 <stonith_ action_ async_done> which calls action_ destroy> and then action_ clear_tracking_ data> generating the 2nd removal (g_source_remove)
0x7f1f6189af10 <stonith_
0x7f1f6189ae60 <stonith_
with the 2nd call to g_source_remove, after glib2.0 change explained before this comment, we get a
g_critical ("Source ID %u was not found when attempting to remove it", tag);
and this generates the crash (since g_glob is called with a critical log_level causing crm_abort to be called).
POSSIBLE CAUSE:
==============
Under <stonith_ action_ async_done> we have:
stonith_action_t *action = 0x7f1f639f5b50.
if (action- >timer_ sigterm > 0) {
g_source_ remove( action- >timer_ sigterm) ; >timer_ sigkill > 0) {
g_source_ remove( action- >timer_ sigkill) ;
}
if (action-
}
Under <stonith_ action_ destroy> we have stonith_action_t *action = 0x7f1f639f5b50. action_ clear_tracking_ data(action) ;
and a call to: stonith_
Under stonith_ action_ clear_tracking_ data(stonith_ action_ t * action) we have AGAIN:
stonith_action_t *action = 0x7f1f639f5b50.
if (action- >timer_ sigterm > 0) {
g_source_ remove( action- >timer_ sigterm) ;
action- >timer_ sigterm = 0; >timer_ sigkill > 0) {
g_source_ remove( action- >timer_ sigkill) ;
action- >timer_ sigkill = 0;
}
if (action-
}
This logic probably triggered the same problem the cherry pick addressed for lrmd,
but now for stonith (calling g_source_remove 2 times for the same source after
glib2.0 was changed).
I'll try to discuss this with upstream and confirm that this is happening.
Thank you
Rafael Tinoco