Restarting an HA node is potentially racy since pike
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
tripleo | Fix Released | Critical | Michele Baldessari |
Bug Description
If you gracefully restart a node with pacemaker on it, the following can happen:
1) the docker service gets stopped first
2) pacemaker gets shut down
3) pacemaker tries to shut down the bundles but fails because of 1)
Because the services failed to stop cleanly, two scenarios can take place after the reboot:
A) The node gets fenced (when stonith is configured) because it failed to stop a resource.
B) The state of the resource might be saved as Stopped, and when the node comes back up (if multiple nodes were rebooted at the same time) the CIB might still have Stopped as the target state for the resource.
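The race is in the shutdown ordering of docker.service and pacemaker.service. As a rough sketch only (the actual fix is in the review linked at the bottom of this report), one way to express the required ordering is a systemd drop-in on pacemaker.service; the file path and name below are illustrative:

  # /etc/systemd/system/pacemaker.service.d/docker-ordering.conf  (hypothetical drop-in)
  [Unit]
  # Start pacemaker after docker. systemd stops units in reverse start
  # order, so on shutdown pacemaker (and the bundles it manages) is
  # stopped while the docker daemon is still running.
  After=docker.service
  Wants=docker.service

After adding such a drop-in, run systemctl daemon-reload so systemd picks it up.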
In the case of B) we will see something like the following:
Online: [ overcloud-... ]

Full list of resources:

 Docker container set: rabbitmq-bundle [192.168...]
   rabbitmq-...
   rabbitmq-...
   rabbitmq-...
 Docker container set: galera-bundle [192.168...]
   galera-bundle-0   (ocf::heartbeat...
   galera-bundle-1   (ocf::heartbeat...
   galera-bundle-2   (ocf::heartbeat...
 Docker container set: redis-bundle [192.168...]
   redis-bundle-0    (ocf::heartbeat...
   redis-bundle-1    (ocf::heartbeat...
   redis-bundle-2    (ocf::heartbeat...
 ip-192.168.0.12    (ocf::heartbeat...
 ip-10.19.184.160   (ocf::heartbeat...
 ip-10.19.104.14    (ocf::heartbeat...
 ip-10.19.104.19    (ocf::heartbeat...
 ip-10.19.105.11    (ocf::heartbeat...
 ip-192.168.200.15  (ocf::heartbeat...
 Docker container set: haproxy-bundle [192.168...]
   haproxy-...
   haproxy-...
   haproxy-...
 openstack-...

Failed Actions:
 * rabbitmq-...
     last-...
 * rabbitmq-...
     last-...
 * galera-...
     last-...
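The report itself does not describe recovery steps; as a generic pacemaker hint only (not the fix proposed below), the recorded failed actions can usually be cleared so the cluster re-evaluates the affected resources:

  # clear the failure history for a specific bundle, or omit the
  # resource name to clean up every resource
  pcs resource cleanup rabbitmq-bundle
  pcs resource cleanup galera-bundle
  # verify the resulting cluster state
  pcs status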
tags: added: pike-backport-potential
Fix proposed to branch: master
Review: https://review.openstack.org/521570