Reviewed: https://review.openstack.org/521852
Committed: https://git.openstack.org/cgit/openstack/puppet-tripleo/commit/?id=4909df1567aed44eb4e46af2fa2e53aa8fa7847d
Submitter: Zuul
Branch: stable/pike
commit 4909df1567aed44eb4e46af2fa2e53aa8fa7847d
Author: Michele Baldessari <email address hidden>
Date: Mon Nov 20 17:06:09 2017 +0100
Fix ordering when pacemaker with bundles is being used
If you gracefully restart a node with pacemaker on it, the following can happen:
1) the docker service gets stopped first
2) pacemaker gets shut down
3) pacemaker tries to shut down the bundles but fails due to 1)
Because shutting down the services failed, two scenarios can take place after the reboot:
A) The node gets fenced (when stonith is configured) because it failed to stop a resource.
B) The state of the resource might be saved as Stopped, and when the node
comes back up (if multiple nodes were rebooted at the same time) the CIB
might have Stopped as the target state for the resource.
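The root cause is ordering: systemd stops units in the reverse of their start order, and docker.service carried no ordering relationship with pacemaker.service, so docker could be stopped first. As a minimal illustration of the kind of ordering that was missing, a drop-in like the following (hypothetical file name; the fix described further down uses resource-agents-deps.target instead) would express it:

```ini
# /etc/systemd/system/docker.service.d/pacemaker-ordering.conf (hypothetical)
# Before= means docker starts before pacemaker and, because shutdown
# ordering is the reverse of startup, is stopped only after pacemaker
# has finished shutting down (and with it the bundles).
[Unit]
Before=pacemaker.service
```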
In the case of B) we will see something like the following:
Online: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]

Full list of resources:

 Docker container set: rabbitmq-bundle [192.168.0.1:8787/rhosp12/openstack-rabbitmq-docker:pcmklatest]
   rabbitmq-bundle-0    (ocf::heartbeat:rabbitmq-cluster):   Stopped overcloud-controller-0
   rabbitmq-bundle-1    (ocf::heartbeat:rabbitmq-cluster):   Stopped overcloud-controller-0
   rabbitmq-bundle-2    (ocf::heartbeat:rabbitmq-cluster):   Stopped overcloud-controller-0
 Docker container set: galera-bundle [192.168.0.1:8787/rhosp12/openstack-mariadb-docker:pcmklatest]
   galera-bundle-0      (ocf::heartbeat:galera):             Stopped overcloud-controller-0
   galera-bundle-1      (ocf::heartbeat:galera):             Stopped overcloud-controller-0
   galera-bundle-2      (ocf::heartbeat:galera):             Stopped overcloud-controller-0
 Docker container set: redis-bundle [192.168.0.1:8787/rhosp12/openstack-redis-docker:pcmklatest]
   redis-bundle-0       (ocf::heartbeat:redis):              Stopped overcloud-controller-0
   redis-bundle-1       (ocf::heartbeat:redis):              Stopped overcloud-controller-0
   redis-bundle-2       (ocf::heartbeat:redis):              Stopped overcloud-controller-0
 ip-192.168.0.12        (ocf::heartbeat:IPaddr2):            Stopped
 ip-10.19.184.160       (ocf::heartbeat:IPaddr2):            Stopped
 ip-10.19.104.14        (ocf::heartbeat:IPaddr2):            Stopped
 ip-10.19.104.19        (ocf::heartbeat:IPaddr2):            Stopped
 ip-10.19.105.11        (ocf::heartbeat:IPaddr2):            Stopped
 ip-192.168.200.15      (ocf::heartbeat:IPaddr2):            Stopped
 Docker container set: haproxy-bundle [192.168.0.1:8787/rhosp12/openstack-haproxy-docker:pcmklatest]
   haproxy-bundle-docker-0  (ocf::heartbeat:docker):  FAILED (blocked) [ overcloud-controller-0 overcloud-controller-2 overcloud-controller-1 ]
   haproxy-bundle-docker-1  (ocf::heartbeat:docker):  FAILED (blocked) [ overcloud-controller-0 overcloud-controller-2 overcloud-controller-1 ]
   haproxy-bundle-docker-2  (ocf::heartbeat:docker):  FAILED (blocked) [ overcloud-controller-0 overcloud-controller-2 overcloud-controller-1 ]
 openstack-cinder-volume   (systemd:openstack-cinder-volume):  Started overcloud-controller-0

Failed Actions:
* rabbitmq-bundle-docker-0_stop_0 on overcloud-controller-0 'unknown error' (1): call=93, status=Timed Out, exitreason='none',
    last-rc-change='Fri Nov 17 13:55:35 2017', queued=0ms, exec=20023ms
* rabbitmq-bundle-docker-1_stop_0 on overcloud-controller-0 'unknown error' (1): call=94, status=Timed Out, exitreason='none',
    last-rc-change='Fri Nov 17 13:55:35 2017', queued=0ms, exec=20037ms
* galera-bundle-docker-0_stop_0 on overcloud-controller-0 'unknown error' (1): call=96, status=Timed Out, exitreason='none',
    last-rc-change='Fri Nov 17 13:55:35 2017', queued=0ms, exec=20035ms
We fix this by adding the docker service to resource-agents-deps.target.wants, which is the recommended method to make sure non-pacemaker-managed resources come up before pacemaker during a start and get stopped after pacemaker's service stop:
https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/high_availability_add-on_reference/s1-nonpacemakerstartup-haar
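The effect of that change can be sketched in plain shell. This is a hypothetical illustration of the symlink being managed, not the puppet code itself; a scratch root directory (an assumption, so the commands run unprivileged) stands in for `/`, and on a real node `systemctl daemon-reload` would follow:

```shell
# Link docker.service into resource-agents-deps.target.wants so that
# systemd starts docker before pacemaker and stops it only after
# pacemaker has shut down its resources.
ROOT="$(mktemp -d)"   # scratch prefix (assumption); empty on a real node
WANTS="$ROOT/etc/systemd/system/resource-agents-deps.target.wants"
mkdir -p "$WANTS"
ln -sf /usr/lib/systemd/system/docker.service "$WANTS/docker.service"
ls -l "$WANTS"
```

On a real node, pacemaker is ordered after resource-agents-deps.target, so any unit linked into that target's `.wants/` directory inherits the desired start-before / stop-after relationship without patching docker.service itself.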
We conditionalize this change on docker being enabled, and we make the change only after the docker package is installed, in order to cover split-stack deployments as well.
With this change we were able to restart nodes without observing any timeouts during stop, or stopped resources at startup.
Co-Authored-By: Damien Ciabrini <email address hidden>
Change-Id: I6a4dc3d4d4818f15e9b7e68da3eb07e54b0289fa
Closes-Bug: #1733348
(cherry picked from commit feca86b73028a81414c1c4ae93d56f18e84adbe6)