M/N upgrades - Race during the upgrade step

Bug #1640407 reported by Michele Baldessari on 2016-11-09
Affects: tripleo
Importance: Critical
Assigned to: Michele Baldessari

Bug Description

Currently when we call the major-upgrade step we do the following:
"""
...
if [[ -n $(is_bootstrap_node) ]]; then
    check_clean_cluster
fi
...
if [[ -n $(is_bootstrap_node) ]]; then
    migrate_full_to_ng_ha
fi
...
for service in $(services_to_migrate); do
    manage_systemd_service stop "${service%%-clone}"
    ...
done
"""

The code above is open to the following race condition:
1. The code runs first on a non-bootstrap controller node, so that node starts stopping a bunch of services.
2. Pacemaker notices that the services are down and marks them as stopped.
3. The code then runs on the bootstrap node (controller-0), where the check_clean_cluster function fails and the script exits.
4. Eventually the script on the non-bootstrap controller node also times out and exits, because the cluster never shut down (the shutdown never actually started, since we failed at step 3).
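
The ordering problem in the steps above can be reproduced with a toy model. This is only a sketch: the temporary file stands in for Pacemaker's view of the cluster, and stop_services and check_clean are hypothetical stand-ins, not the real helpers from tripleo-heat-templates.

```shell
# Toy model of the race. The state file is an assumed stand-in for
# Pacemaker's view of the cluster; it is not how Pacemaker stores state.
state_file=$(mktemp)
echo "clean" > "$state_file"

# Hypothetical stand-in for a node stopping systemd services:
# afterwards the cluster reports stopped resources.
stop_services() {
    echo "stopped-resources" > "$state_file"
}

# Hypothetical stand-in for check_clean_cluster on the bootstrap node:
# succeeds only while the cluster still looks clean.
check_clean() {
    [ "$(cat "$state_file")" = "clean" ]
}

# Racy order: a non-bootstrap node stops services before the bootstrap check.
stop_services
if check_clean; then racy_result="ok"; else racy_result="failed"; fi

# Safe order: the bootstrap check completes before any node stops services.
echo "clean" > "$state_file"
if check_clean; then safe_result="ok"; else safe_result="failed"; fi
stop_services

echo "racy order: $racy_result"
echo "safe order: $safe_result"
rm -f "$state_file"
```

In the racy order the check fails because services were already stopped; in the safe order it passes because nothing has touched the cluster yet.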

Fix proposed to branch: master
Review: https://review.openstack.org/395454

Changed in tripleo:
assignee: nobody → Michele Baldessari (michele)
status: Triaged → In Progress
Marios Andreou (marios-b) wrote :

Reviewed: https://review.openstack.org/395454
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=dde12b075ff51d4def4f49e635dd390a7f1f2cac
Submitter: Jenkins
Branch: master

commit dde12b075ff51d4def4f49e635dd390a7f1f2cac
Author: Michele Baldessari <email address hidden>
Date: Wed Nov 9 09:05:08 2016 +0100

    Fix race during major-upgrade-pacemaker step

    Currently when we call the major-upgrade step we do the following:
    """
    ...
    if [[ -n $(is_bootstrap_node) ]]; then
        check_clean_cluster
    fi
    ...
    if [[ -n $(is_bootstrap_node) ]]; then
        migrate_full_to_ng_ha
    fi
    ...
    for service in $(services_to_migrate); do
        manage_systemd_service stop "${service%%-clone}"
        ...
    done
    """

    The problem with the above code is that it is open to the following race
    condition:
    1. Code gets run first on a non-bootstrap controller node so we start
    stopping a bunch of services
    2. Pacemaker will notice that services are down and will mark
    the service as stopped
    3. Code gets run on the bootstrap node (controller-0) and the
    check_clean_cluster function will fail and exit
    4. Eventually also the script on the non-bootstrap controller node will
    timeout and exit because the cluster never shut down (it never actually
    started the shutdown because we failed at 3)

    Let's make sure we first only call the HA NG migration step as a
    separate heat step. Only afterwards we start shutting down the systemd
    services on all nodes.

    We also need to move the STONITH_STATE variable into a file because it
    is being used across two different scripts (1 and 2) and we need to
    store that state.

    Co-Authored-By: Athlan-Guyot Sofer <email address hidden>

    Closes-Bug: #1640407
    Change-Id: Ifb9b9e633fcc77604cca2590071656f4b2275c60
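
The commit message above says STONITH_STATE has to live in a file because two separate scripts need it. A minimal sketch of that idea follows, under stated assumptions: the file path, the helper names save_stonith_state and load_stonith_state, and the stored value "true" are all hypothetical, and the actual patch may do this differently.

```shell
# Hypothetical state file location; the real patch may use another path.
# Defaults to a temporary file so the sketch runs without special privileges.
STONITH_STATE_FILE="${STONITH_STATE_FILE:-$(mktemp)}"

# First script: persist the value so it survives after this script exits.
save_stonith_state() {
    printf '%s\n' "$1" > "$STONITH_STATE_FILE"
}

# Second script: read the value back; fail if nothing was saved.
load_stonith_state() {
    [ -r "$STONITH_STATE_FILE" ] || return 1
    cat "$STONITH_STATE_FILE"
}

# Simulate the hand-off between the two scripts.
save_stonith_state "true"
restored=$(load_stonith_state)
echo "stonith state to restore: $restored"
rm -f "$STONITH_STATE_FILE"
```

A shell variable would be lost when the first script's process ends, which is why the value is written to disk before the second heat step runs.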

Changed in tripleo:
status: In Progress → Fix Released

Reviewed: https://review.openstack.org/395460
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=cf465d87b16d2865a0826b3a8c60370b8820c0e7
Submitter: Jenkins
Branch: stable/newton

commit cf465d87b16d2865a0826b3a8c60370b8820c0e7
Author: Michele Baldessari <email address hidden>
Date: Wed Nov 9 09:05:08 2016 +0100

    Fix race during major-upgrade-pacemaker step

    Currently when we call the major-upgrade step we do the following:
    """
    ...
    if [[ -n $(is_bootstrap_node) ]]; then
        check_clean_cluster
    fi
    ...
    if [[ -n $(is_bootstrap_node) ]]; then
        migrate_full_to_ng_ha
    fi
    ...
    for service in $(services_to_migrate); do
        manage_systemd_service stop "${service%%-clone}"
        ...
    done
    """

    The problem with the above code is that it is open to the following race
    condition:
    1. Code gets run first on a non-bootstrap controller node so we start
    stopping a bunch of services
    2. Pacemaker will notice that services are down and will mark
    the service as stopped
    3. Code gets run on the bootstrap node (controller-0) and the
    check_clean_cluster function will fail and exit
    4. Eventually also the script on the non-bootstrap controller node will
    timeout and exit because the cluster never shut down (it never actually
    started the shutdown because we failed at 3)

    Let's make sure we first only call the HA NG migration step as a
    separate heat step. Only afterwards we start shutting down the systemd
    services on all nodes.

    We also need to move the STONITH_STATE variable into a file because it
    is being used across two different scripts (1 and 2) and we need to
    store that state.

    Co-Authored-By: Athlan-Guyot Sofer <email address hidden>

    Closes-Bug: #1640407
    Change-Id: Ifb9b9e633fcc77604cca2590071656f4b2275c60
    (cherry picked from commit dde12b075ff51d4def4f49e635dd390a7f1f2cac)

tags: added: in-stable-newton

This issue was fixed in the openstack/tripleo-heat-templates 6.0.0.0b1 development milestone.

This issue was fixed in the openstack/tripleo-heat-templates 5.2.0 release.
