Okay, so here we had the following:
node-3: behaved OK, recovered at: 2015-12-29T13:41:22.853829
node-9: seg-faulted all the time, failing to join node-3 in an endless loop, even though its mnesia got reset every time after a join attempt failed. Its status in pacemaker was correct all the time according to the pengine logs. Note that join attempts also reported exit code 2, so there were not only segfaults. But it is not clear why it was not able to join node-3 and form a cluster.
node-4: the strangest behavior. The last join attempt failed and was followed by a stop with code 139 (a segfault) at 2015-12-29T13:41:05.873048.
After that it was kept down, but pacemaker reported it as a running Slave (and this is also the subject of the main bug 1472230).
There are no details as to why; the crmd log only contains this:
2015-12-29T13:38:13.842800+00:00 notice: notice: crm_update_peer_state: pcmk_quorum_notification: Node node-3.domain.tld[3] - state is now member (was lost)
2015-12-29T13:38:23.379517+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_monitor_30000: ok (node=node-4.domain.tld, call=160, rc=0, cib-update=285, confirmed=false)
2015-12-29T13:38:37.348010+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_notify_0: ok (node=node-4.domain.tld, call=289, rc=0, cib-update=0, confirmed=true)
2015-12-29T13:38:54.857193+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_monitor_30000: not running (node=node-4.domain.tld, call=160, rc=7, cib-update=287, confirmed=false)
2015-12-29T13:39:15.833534+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_monitor_103000: not running (node=node-4.domain.tld, call=159, rc=7, cib-update=288, confirmed=false)
2015-12-29T13:39:23.846079+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_notify_0: ok (node=node-4.domain.tld, call=293, rc=0, cib-update=0, confirmed=true)
2015-12-29T13:39:47.070403+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_notify_0: ok (node=node-4.domain.tld, call=294, rc=0, cib-update=0, confirmed=true)
(repeats)
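The monitor results above flip from rc=0 (ok) to rc=7, which in OCF terms means "not running", even while pacemaker kept the resource listed as a running Slave. A quick sketch for pulling those failed monitor results out of a crmd log; the regex assumes the exact process_lrm_event line format shown in the excerpt above:

```python
import re

# Sample crmd log lines, taken from the excerpt above.
LOG = """\
2015-12-29T13:38:23.379517+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_monitor_30000: ok (node=node-4.domain.tld, call=160, rc=0, cib-update=285, confirmed=false)
2015-12-29T13:38:54.857193+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_monitor_30000: not running (node=node-4.domain.tld, call=160, rc=7, cib-update=287, confirmed=false)
2015-12-29T13:39:15.833534+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_monitor_103000: not running (node=node-4.domain.tld, call=159, rc=7, cib-update=288, confirmed=false)
"""

# Matches process_lrm_event monitor operations; rc=7 is OCF_NOT_RUNNING.
PAT = re.compile(
    r'^(?P<ts>\S+) .*process_lrm_event: Operation '
    r'(?P<op>\S+_monitor_\d+): .*rc=(?P<rc>\d+)'
)

def failed_monitors(text):
    """Return (timestamp, operation, rc) for monitor ops with rc != 0."""
    out = []
    for line in text.splitlines():
        m = PAT.match(line)
        if m and m.group('rc') != '0':
            out.append((m.group('ts'), m.group('op'), int(m.group('rc'))))
    return out

for ts, op, rc in failed_monitors(LOG):
    print(ts, op, 'rc=%d' % rc)
```

Running this over the full crmd log on node-4 would show the monitor operations going to rc=7 while the CIB still advertised the Slave role.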