Okay, so here we had the following:
node-3: behaved OK, recovered at: 2015-12-29T13:41:22.853829
node-9: seg-faulted all the time, failing to join node-3 in an endless loop, even though its mnesia got reset every time after a join attempt failed. Its status in pacemaker was correct all the time according to the pengine logs. Note that join attempts also reported exit code 2, so there were not only segfaults. But it is not clear why it was not able to join node-3 and form a cluster.
node-4: the strangest behavior. The last join attempt failed and was followed by a stop with code 139 (a segfault) at 2015-12-29T13:41:05.873048.
After that it was kept down, but pacemaker reported it as a running Slave (and this is also the subject of the main bug 1472230).
There are no details as to why; the crmd log only contains this:
2015-12-29T13:38:13.842800+00:00 notice: notice: crm_update_peer_state: pcmk_quorum_notification: Node node-3.domain.tld[3] - state is now member (was lost)
2015-12-29T13:38:23.379517+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_monitor_30000: ok (node=node-4.domain.tld, call=160, rc=0, cib-update=285, confirmed=false)
2015-12-29T13:38:37.348010+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_notify_0: ok (node=node-4.domain.tld, call=289, rc=0, cib-update=0, confirmed=true)
2015-12-29T13:38:54.857193+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_monitor_30000: not running (node=node-4.domain.tld, call=160, rc=7, cib-update=287, confirmed=false)
2015-12-29T13:39:15.833534+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_monitor_103000: not running (node=node-4.domain.tld, call=159, rc=7, cib-update=288, confirmed=false)
2015-12-29T13:39:23.846079+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_notify_0: ok (node=node-4.domain.tld, call=293, rc=0, cib-update=0, confirmed=true)
2015-12-29T13:39:47.070403+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_notify_0: ok (node=node-4.domain.tld, call=294, rc=0, cib-update=0, confirmed=true)
(repeats)
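The monitor results above flip from rc=0 (ok) to rc=7, which in OCF terms means "not running", even while pacemaker kept the resource listed as a running Slave. A quick sketch for pulling those failed monitor results out of a crmd log; the regex assumes the exact process_lrm_event line format shown in the excerpt above:

```python
import re

# Sample crmd log lines, taken from the excerpt above.
LOG = """\
2015-12-29T13:38:23.379517+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_monitor_30000: ok (node=node-4.domain.tld, call=160, rc=0, cib-update=285, confirmed=false)
2015-12-29T13:38:54.857193+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_monitor_30000: not running (node=node-4.domain.tld, call=160, rc=7, cib-update=287, confirmed=false)
2015-12-29T13:39:15.833534+00:00 notice: notice: process_lrm_event: Operation p_rabbitmq-server_monitor_103000: not running (node=node-4.domain.tld, call=159, rc=7, cib-update=288, confirmed=false)
"""

# Matches process_lrm_event monitor operations; rc=7 is OCF_NOT_RUNNING.
PAT = re.compile(
    r'^(?P<ts>\S+) .*process_lrm_event: Operation '
    r'(?P<op>\S+_monitor_\d+): .*rc=(?P<rc>\d+)'
)

def failed_monitors(text):
    """Return (timestamp, operation, rc) for monitor ops with rc != 0."""
    out = []
    for line in text.splitlines():
        m = PAT.match(line)
        if m and m.group('rc') != '0':
            out.append((m.group('ts'), m.group('op'), int(m.group('rc'))))
    return out

for ts, op, rc in failed_monitors(LOG):
    print(ts, op, 'rc=%d' % rc)
```

Running this over the full crmd log on node-4 would show the monitor operations going to rc=7 while the CIB still advertised the Slave role.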