Currently when we call the major-upgrade step we do the following:
"""
...
if [[ -n $(is_bootstrap_node) ]]; then
    check_clean_cluster
fi
...
if [[ -n $(is_bootstrap_node) ]]; then
    migrate_full_to_ng_ha
fi
...
for service in $(services_to_migrate); do
    manage_systemd_service stop "${service%%-clone}"
    ...
done
"""
The problem with the above code is that it is open to the following race
condition:
1. Code gets run first on a non-bootstrap controller node so we start
stopping a bunch of services
2. Pacemaker will notice that the services are down and will mark
them as stopped
3. Code gets run on the bootstrap node (controller-0) and the
check_clean_cluster function will fail and exit
4. Eventually the script on the non-bootstrap controller node will also
time out and exit because the cluster never shut down (it never actually
started the shutdown because we failed at step 3)
Let's make sure we first call only the HA NG migration step, as a
separate heat step. Only afterwards do we start shutting down the systemd
services on all nodes.
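A rough sketch of what that split could look like, using the function names from the snippet above. The function bodies here are stand-in stubs (with hypothetical service names) so that the example runs on its own; the real helpers are the ones in the upgrade scripts. The ordering between the two steps is assumed to be enforced by heat, not by the shell code itself.

```shell
#!/bin/bash
# Sketch only: every helper below is a stub for illustration.
LOG=""                                    # records call order (illustration only)
is_bootstrap_node()      { echo "${BOOTSTRAP:-}"; }  # stub: non-empty output means bootstrap node
check_clean_cluster()    { LOG+="check_clean_cluster "; }
migrate_full_to_ng_ha()  { LOG+="migrate_full_to_ng_ha "; }
services_to_migrate()    { echo "svc-a-clone svc-b-clone"; }  # hypothetical service names
manage_systemd_service() { LOG+="$1:$2 "; }

# Step 1 (its own heat step): cluster check and HA NG migration,
# run on the bootstrap node only. Nothing has been stopped yet, so
# check_clean_cluster sees an untouched cluster.
step_migrate() {
    if [[ -n $(is_bootstrap_node) ]]; then
        check_clean_cluster
        migrate_full_to_ng_ha
    fi
}

# Step 2 (a later heat step): stop the systemd services on every node.
step_stop_services() {
    local service
    for service in $(services_to_migrate); do
        manage_systemd_service stop "${service%%-clone}"
    done
}
```

With this layout a non-bootstrap node has nothing to do in step 1, so it cannot start stopping services before the bootstrap node has finished its check.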
We also need to move the STONITH_STATE variable into a file because it
is being used across two different scripts (1 and 2) and we need to
store that state.
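As an illustration of passing that state between the two scripts, something along these lines might work. The file path and the helper names are hypothetical and not taken from the actual change; how the state value is obtained from the cluster is left out.

```shell
#!/bin/bash
# Sketch only: path and function names are assumptions for illustration.
STONITH_STATE_FILE="${STONITH_STATE_FILE:-/tmp/stonith-state.example}"

# Script 1: record the state (passed in as $1) before changing anything.
save_stonith_state() {
    printf '%s\n' "$1" > "$STONITH_STATE_FILE"
}

# Script 2: read the state back. Prints nothing if script 1 never ran.
load_stonith_state() {
    [[ -r "$STONITH_STATE_FILE" ]] && cat "$STONITH_STATE_FILE"
    return 0
}
```

The second script would then do something like STONITH_STATE=$(load_stonith_state) instead of relying on a shell variable that does not survive across separate script runs.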
Reviewed: https://review.openstack.org/395454
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=dde12b075ff51d4def4f49e635dd390a7f1f2cac
Submitter: Jenkins
Branch: master
commit dde12b075ff51d4def4f49e635dd390a7f1f2cac
Author: Michele Baldessari <email address hidden>
Date: Wed Nov 9 09:05:08 2016 +0100
Fix race during major-upgrade-pacemaker step
Currently when we call the major-upgrade step we do the following:
"""
...
if [[ -n $(is_bootstrap_node) ]]; then
    check_clean_cluster
fi
...
if [[ -n $(is_bootstrap_node) ]]; then
    migrate_full_to_ng_ha
fi
...
for service in $(services_to_migrate); do
    manage_systemd_service stop "${service%%-clone}"
    ...
done
"""
The problem with the above code is that it is open to the following race
condition:
1. Code gets run first on a non-bootstrap controller node so we start
stopping a bunch of services
2. Pacemaker will notice that the services are down and will mark
them as stopped
3. Code gets run on the bootstrap node (controller-0) and the
check_clean_cluster function will fail and exit
4. Eventually the script on the non-bootstrap controller node will also
time out and exit because the cluster never shut down (it never actually
started the shutdown because we failed at step 3)
Let's make sure we first call only the HA NG migration step, as a
separate heat step. Only afterwards do we start shutting down the systemd
services on all nodes.
We also need to move the STONITH_STATE variable into a file because it
is being used across two different scripts (1 and 2) and we need to
store that state.
Co-Authored-By: Athlan-Guyot Sofer <email address hidden>
Closes-Bug: #1640407
Change-Id: Ifb9b9e633fcc77604cca2590071656f4b2275c60