Pacemaker (lrmd) can seg fault in Trusty and Utopic after the following message: Source ID XX was not found when attempting to remove it
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
pacemaker (Ubuntu) | Fix Released | High | Rafael David Tinoco |
Trusty | Fix Released | High | Unassigned |
Utopic | Fix Released | High | Unassigned |
Vivid | Fix Released | High | Rafael David Tinoco |
Bug Description
[IMPACT]
- Pacemaker can seg fault on repeated crm node online/standby because:
- Newer glib versions use a hash table to look up GSources
- glib can raise an assertion when the same source is removed more than once
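A common defensive idiom for this kind of double removal is to forget the source ID as soon as it has been removed, so that a second removal attempt becomes a no-op. The sketch below illustrates the idea only: the table and the names toy_source_add, toy_source_remove and remove_once are hypothetical stand-ins, not glib's or Pacemaker's API, and this is not necessarily how upstream commit 568e41d addresses the problem.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_SOURCES 16

/* Toy stand-in for an event-loop source table (NOT glib's implementation). */
static unsigned int live_sources[MAX_SOURCES];
static unsigned int next_id = 1;

/* Register a source and return its non-zero ID (0 means the table is full). */
unsigned int toy_source_add(void)
{
    for (size_t i = 0; i < MAX_SOURCES; i++) {
        if (live_sources[i] == 0) {
            live_sources[i] = next_id++;
            return live_sources[i];
        }
    }
    return 0;
}

/* Remove by ID. Returns false when the ID is unknown, which corresponds to
 * the "Source ID XX was not found" situation described in this report. */
bool toy_source_remove(unsigned int id)
{
    assert(id != 0); /* 0 is reserved for "no source" */
    for (size_t i = 0; i < MAX_SOURCES; i++) {
        if (live_sources[i] == id) {
            live_sources[i] = 0;
            return true;
        }
    }
    return false;
}

/* Guarded removal: remove at most once, then clear the caller's copy of the
 * ID so that a repeated call on the same handle does nothing. */
bool remove_once(unsigned int *id)
{
    if (id == NULL || *id == 0) {
        return false;
    }
    bool removed = toy_source_remove(*id);
    *id = 0;
    return removed;
}
```

With this pattern, calling remove_once() twice on the same variable removes the source the first time and returns false the second time without ever looking up a stale ID.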
[TEST CASE]
- Using the same configuration as the attached cib.xml:
#!/bin/bash
# Cycle each node between standby and online, 7 seconds apart, forever.
while true; do
    crm node standby clustertrusty01
    sleep 7
    crm node online clustertrusty01
    sleep 7
    crm node standby clustertrusty02
    sleep 7
    crm node online clustertrusty02
    sleep 7
    crm node standby clustertrusty03
    sleep 7
    crm node online clustertrusty03
    sleep 7
done
[REGRESSION POTENTIAL]
- Based on upstream commit 568e41d
- Test case ran for more than 7 hours with no problems
[OTHER INFO]
The following situation was brought to my attention:
"""
[Issue]
lrmd process crashed when repeating "crm node standby" and "crm node online"
----------------
# grep pacemakerd ha-log.k1pm101 | grep core
Aug 27 17:47:06 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 49275 (lrmd) dumped core
Aug 27 17:47:06 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=49275, core=1)
Aug 27 18:27:14 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 1471 (lrmd) dumped core
Aug 27 18:27:14 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=1471, core=1)
Aug 27 18:56:41 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 35771 (lrmd) dumped core
Aug 27 18:56:41 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=35771, core=1)
Aug 27 19:44:09 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 60709 (lrmd) dumped core
Aug 27 19:44:09 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=60709, core=1)
Aug 27 20:00:53 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 35838 (lrmd) dumped core
Aug 27 20:00:53 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=35838, core=1)
Aug 27 21:33:52 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 49249 (lrmd) dumped core
Aug 27 21:33:52 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=49249, core=1)
Aug 27 22:01:16 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 65358 (lrmd) dumped core
Aug 27 22:01:16 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=65358, core=1)
Aug 27 22:28:02 k1pm101 pacemakerd[49271]: error: child_waitpid: Managed process 22693 (lrmd) dumped core
Aug 27 22:28:02 k1pm101 pacemakerd[49271]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=22693, core=1)
----------------
----------------
# grep pacemakerd ha-log.k1pm102 | grep core
Aug 27 15:32:48 k1pm102 pacemakerd[5808]: error: child_waitpid: Managed process 5812 (lrmd) dumped core
Aug 27 15:32:48 k1pm102 pacemakerd[5808]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=5812, core=1)
Aug 27 15:52:52 k1pm102 pacemakerd[5808]: error: child_waitpid: Managed process 35781 (lrmd) dumped core
Aug 27 15:52:52 k1pm102 pacemakerd[5808]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=35781, core=1)
Aug 27 16:02:54 k1pm102 pacemakerd[5808]: error: child_waitpid: Managed process 51984 (lrmd) dumped core
Aug 27 16:02:54 k1pm102 pacemakerd[5808]: notice: pcmk_child_exit: Child process lrmd terminated with signal 11 (pid=51984, core=1)
"""
Analyzing the core file with debug symbols (dbgsym), I could see that:
#0 0x00007f7184a45983 in services_
434 crm_trace(" > stdout: %s", op->stdout_data);
is responsible for the core dump.
I've checked upstream code and there might be 2 important commits that could be cherry-picked to fix this behavior:
commit f2a637cc553cb7a
Author: Andrew Beekhof <email address hidden>
Date: Fri Sep 20 12:20:36 2013 +1000
Fix: services: Prevent use-of-NULL when executing service actions
commit 11473a5a8c88eb1
Author: Gao,Yan <email address hidden>
Date: Sun Sep 29 12:40:18 2013 +0800
Fix: services: Fix the executing of synchronous actions
The core can be caused by things such as this missing code:
if (op == NULL) {
    crm_trace("No operation to execute");
    return FALSE;
}
at the beginning of "services_
And improved by commit #11473a5.
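To show why the missing guard matters, here is a minimal, self-contained sketch of the same pattern. The struct toy_action and the function toy_action_execute are hypothetical simplifications made up for illustration; they are not Pacemaker's real types or functions, and fprintf stands in for crm_trace.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified action record; not Pacemaker's real type. */
struct toy_action {
    const char *id;          /* assumed to be always set for a valid action */
    const char *stdout_data; /* may be NULL if nothing was captured */
};

/* Returns false without dereferencing anything when there is no action. */
bool toy_action_execute(const struct toy_action *op)
{
    if (op == NULL) {
        fprintf(stderr, "No operation to execute\n");
        return false;
    }

    /* Past the guard, op is known to be non-NULL. */
    assert(op->id != NULL);
    fprintf(stderr, " > stdout: %s\n",
            op->stdout_data ? op->stdout_data : "(none)");
    return true;
}
```

Without the early return, the trace statement would dereference op even when it is NULL, which is consistent with the signal 11 seen in the logs above.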
tags: added: cts
Changed in pacemaker (Ubuntu):
status: Confirmed → In Progress
description: updated
summary:
- Pacemaker can seg fault on crm node online/standy
+ Pacemaker can seg fault on crm node online/standby
Changed in pacemaker (Ubuntu Vivid):
importance: Undecided → High
Changed in pacemaker (Ubuntu Utopic):
importance: Undecided → High
Changed in pacemaker (Ubuntu Trusty):
importance: Undecided → High
There is already a Fix Released for Utopic:
https://bugs.launchpad.net/ubuntu/+source/pacemaker/+bug/1353473
And Trusty's fix is waiting to get released.
For that reason, I'm building this patch on top of the other proposed SRU:
pacemaker (1.1.10+git20130802-1ubuntu3) trusty; urgency=medium
* Fix: services: Do not allow duplicate recurring op entries - 1/3 (LP: #1353473)
* High: lrmd: Merge duplicate recurring monitor operations - 2/3 (LP: #1353473)
* Fix: lrmd: Cancel recurring operations before stop action is executed - 3/3 (LP: #1353473)
-- Rafael David Tinoco <email address hidden> Wed, 06 Aug 2014 09:24:13 -0300