Comment 4 for bug 1640407

OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/395454
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=dde12b075ff51d4def4f49e635dd390a7f1f2cac
Submitter: Jenkins
Branch: master

commit dde12b075ff51d4def4f49e635dd390a7f1f2cac
Author: Michele Baldessari <email address hidden>
Date: Wed Nov 9 09:05:08 2016 +0100

    Fix race during major-upgrade-pacemaker step

    Currently when we call the major-upgrade step we do the following:
    """
    ...
    if [[ -n $(is_bootstrap_node) ]]; then
        check_clean_cluster
    fi
    ...
    if [[ -n $(is_bootstrap_node) ]]; then
        migrate_full_to_ng_ha
    fi
    ...
    for service in $(services_to_migrate); do
        manage_systemd_service stop "${service%%-clone}"
        ...
    done
    """

    The problem with the above code is that it is open to the following race
    condition:
    1. Code gets run first on a non-bootstrap controller node so we start
    stopping a bunch of services
    2. Pacemaker notices that the services are down and marks them as
    stopped
    3. Code gets run on the bootstrap node (controller-0) and the
    check_clean_cluster function will fail and exit
    4. Eventually the script on the non-bootstrap controller node will also
    time out and exit because the cluster never shut down (it never actually
    started the shutdown because we failed at 3)

    Let's run the HA NG migration first, as a separate heat step, and only
    afterwards start shutting down the systemd services on all nodes.
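
    A rough sketch of the resulting ordering, for illustration only. The
    function names are taken from the snippet above, but their bodies here
    are placeholder stubs, and BOOTSTRAP, upgrade_step_1 and upgrade_step_2
    are hypothetical names. It also assumes that check_clean_cluster moves
    into the first step together with the migration:

```shell
#!/bin/bash
# Sketch only: the stub bodies below are placeholders, not the real helpers.

# Placeholder: prints something on the bootstrap node, nothing elsewhere.
# BOOTSTRAP is a hypothetical variable used only by this sketch.
is_bootstrap_node() {
    if [[ "${BOOTSTRAP:-0}" == "1" ]]; then echo "bootstrap"; fi
}
check_clean_cluster()    { echo "check_clean_cluster"; }
migrate_full_to_ng_ha()  { echo "migrate_full_to_ng_ha"; }
services_to_migrate()    { echo "svc-a-clone svc-b-clone"; }
manage_systemd_service() { echo "$1 $2"; }

# Step 1: bootstrap-only work. Nothing has been stopped on any node yet,
# so the cluster check cannot trip over half-stopped services.
upgrade_step_1() {
    if [[ -n $(is_bootstrap_node) ]]; then
        check_clean_cluster
        migrate_full_to_ng_ha
    fi
}

# Step 2: runs on every node, and only after step 1 has completed everywhere.
upgrade_step_2() {
    local service
    for service in $(services_to_migrate); do
        manage_systemd_service stop "${service%%-clone}"
    done
}
```

    Because heat finishes one step on all nodes before starting the next,
    no node can begin stopping services before the bootstrap node has
    finished its checks.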

    We also need to move the STONITH_STATE variable into a file because it
    is being used across two different scripts (1 and 2) and we need to
    store that state.
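
    Each script runs as its own process, so a shell variable set in the
    first one is gone by the time the second one starts. A minimal sketch
    of handing the value over through a file follows. STATE_FILE and both
    function names are hypothetical, and the path the real scripts use is
    not shown here:

```shell
#!/bin/bash
# Sketch only: STATE_FILE is a hypothetical path chosen for this example.
STATE_FILE="${STATE_FILE:-/tmp/stonith-state-example}"

# Script 1 would call this to record the value it observed.
save_stonith_state() {
    printf '%s\n' "$1" > "$STATE_FILE"
}

# Script 2 would call this to read the value back.
# Prints nothing if the first script never wrote the file.
load_stonith_state() {
    if [[ -f "$STATE_FILE" ]]; then
        cat "$STATE_FILE"
    fi
}
```

    The first script would call save_stonith_state "$STONITH_STATE" and
    the second would start with STONITH_STATE=$(load_stonith_state).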

    Co-Authored-By: Athlan-Guyot Sofer <email address hidden>

    Closes-Bug: #1640407
    Change-Id: Ifb9b9e633fcc77604cca2590071656f4b2275c60