Mirantis OpenStack

Comment 4 for bug 1626933

Dmitry Mescheryakov (dmitrymex) wrote on 2016-09-28:

It seems that the root cause of the issue is that the RabbitMQ restart took too long on node-3: it went down at 00:17 and came back up only at 00:26, as can be seen in lrmd.log from node-3. The restart itself was triggered by an update of the host_ip OCF parameter.

The long restart seems to have been caused by a failed stop action:
2016-09-23T00:17:33.532795+00:00 err: ERROR: RMQ-runtime (beam) couldn't be stopped and will likely became unmanaged. Take care of it manually!
2016-09-23T00:17:33.538996+00:00 info: INFO: p_rabbitmq-server[10049]: stop: action end.

This led Pacemaker to consider the resource failed on node-3:
Sep 23 00:17:33 [9134] node-1.test.domain.local attrd: info: attrd_cib_callback: Update 151 for fail-count-p_rabbitmq-server[node-3.test.domain.local]=INFINITY: OK (0)

To sum up: we need to fix the stop action of the OCF script so that it does not fail sporadically. The fix will benefit the Mitaka code as well.
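
For illustration only, here is a rough sketch of one way a stop action could avoid giving up on a process that is slow to exit: send SIGTERM, wait for a grace period, then escalate to SIGKILL, and report failure only if the process survives even that. This is not the code from the actual p_rabbitmq-server OCF script. The function names, the 30-second default grace period, and the 5-second wait after SIGKILL are all assumptions. Only the return code values (0 for OCF_SUCCESS, 1 for OCF_ERR_GENERIC) follow the OCF convention.

```shell
#!/bin/sh
# Hypothetical helper, NOT taken from the real p_rabbitmq-server OCF script.
# Return codes follow the OCF convention.
OCF_SUCCESS=0
OCF_ERR_GENERIC=1

# wait_gone PID SECONDS
# Poll once per second until PID no longer exists or SECONDS have elapsed.
# Returns 0 if the process is gone, 1 if it is still there.
wait_gone() {
    pid=$1
    remaining=$2
    while kill -0 "$pid" 2>/dev/null; do
        [ "$remaining" -le 0 ] && return 1
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}

# stop_with_escalation PID [GRACE_SECONDS]
# Ask the process to exit with SIGTERM; if it is still alive after the
# grace period (assumed default: 30 seconds), force it down with SIGKILL.
stop_with_escalation() {
    pid=$1
    grace=${2:-30}

    # A process that is already gone counts as a successful stop.
    kill -0 "$pid" 2>/dev/null || return $OCF_SUCCESS

    kill -TERM "$pid" 2>/dev/null
    wait_gone "$pid" "$grace" && return $OCF_SUCCESS

    # Graceful stop timed out: escalate instead of reporting failure.
    kill -KILL "$pid" 2>/dev/null
    wait_gone "$pid" 5 && return $OCF_SUCCESS

    # The process survived SIGKILL (for example, stuck in uninterruptible I/O).
    return $OCF_ERR_GENERIC
}
```

With something along these lines, a stop would return an error only when the process really cannot be removed, so the fail-count would not jump to INFINITY just because beam was slow to exit. Whether a forced kill is acceptable for RabbitMQ in this deployment is a separate question that the sketch does not answer.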