M/N upgrades - Race during the upgrade step

Bug #1640407 reported by Michele Baldessari on 2016-11-09

Affects		Status	Importance	Assigned to	Milestone
	tripleo	Fix Released	Critical	Michele Baldessari	tripleo ocata-1 "ocata-1"

Bug Description

Currently when we call the major-upgrade step we do the following:
"""
...
if [[ -n $(is_bootstrap_node) ]]; then
    check_clean_cluster
fi
...
if [[ -n $(is_bootstrap_node) ]]; then
    migrate_full_to_ng_ha
fi
...
for service in $(services_to_migrate); do
    manage_systemd_service stop "${service%%-clone}"
    ...
done
"""

The code above is open to the following race condition:
1. The code runs first on a non-bootstrap controller node, which starts stopping a number of services.
2. Pacemaker notices that the services are down and marks them as stopped.
3. The code then runs on the bootstrap node (controller-0), where the check_clean_cluster function fails and exits.
4. Eventually the script on the non-bootstrap controller node also times out and exits, because the cluster never shut down (the shutdown never actually started because of the failure in step 3).
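
For illustration only, the ordering above can be reproduced with a toy model in bash. Everything in it is an assumption made for the sketch: fake_check_clean_cluster, run_upgrade_step, the STOPPED_SERVICES array and the service names are hypothetical stand-ins, not the real tripleo-heat-templates helpers, and Pacemaker is modelled as nothing more than a list of stopped services.

```bash
#!/usr/bin/env bash
# Toy model of the race. Every name below is a hypothetical stand-in,
# not one of the real TripleO helpers.

STOPPED_SERVICES=()   # what the simulated Pacemaker has marked as stopped

# Stand-in for the cluster check: fails if any service is marked stopped.
fake_check_clean_cluster() {
    if (( ${#STOPPED_SERVICES[@]} > 0 )); then
        echo "cluster not clean: ${STOPPED_SERVICES[*]} stopped"
        return 1
    fi
    echo "cluster clean"
}

# Stand-in for the single combined upgrade step that runs on each node.
# $1 is non-empty on the bootstrap node and empty elsewhere; $2 is the node name.
run_upgrade_step() {
    local is_bootstrap=$1 node=$2
    if [[ -n $is_bootstrap ]]; then
        fake_check_clean_cluster || return 1
    fi
    # Stopping a service on this node is what makes the cluster "unclean".
    STOPPED_SERVICES+=("svc-on-${node}")
}

run_upgrade_step "" controller-1      # the non-bootstrap node runs first
run_upgrade_step yes controller-0 || echo "bootstrap node aborted"
```

If the two calls at the bottom are swapped, the check runs before anything has been stopped and passes, which is why the failure depends on which node happens to run first.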

OpenStack Infra (hudson-openstack) wrote on 2016-11-09: Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.openstack.org/395454

Changed in tripleo:
assignee:	nobody → Michele Baldessari (michele)
status:	Triaged → In Progress

OpenStack Infra (hudson-openstack) wrote on 2016-11-09: Fix proposed to tripleo-heat-templates (stable/newton)

Fix proposed to branch: stable/newton
Review: https://review.openstack.org/395460

Marios Andreou (marios-b) wrote on 2016-11-09:

Note this is first discussed at https://bugzilla.redhat.com/show_bug.cgi?id=1389040#c22

OpenStack Infra (hudson-openstack) wrote on 2016-11-10: Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.openstack.org/395454
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=dde12b075ff51d4def4f49e635dd390a7f1f2cac
Submitter: Jenkins
Branch: master

commit dde12b075ff51d4def4f49e635dd390a7f1f2cac
Author: Michele Baldessari <email address hidden>
Date: Wed Nov 9 09:05:08 2016 +0100

Fix race during major-upgrade-pacemaker step

    Currently when we call the major-upgrade step we do the following:
    """
    ...
    if [[ -n $(is_bootstrap_node) ]]; then
        check_clean_cluster
    fi
    ...
    if [[ -n $(is_bootstrap_node) ]]; then
        migrate_full_to_ng_ha
    fi
    ...
    for service in $(services_to_migrate); do
        manage_systemd_service stop "${service%%-clone}"
        ...
    done
    """

    The problem with the above code is that it is open to the following race
    condition:
    1. Code gets run first on a non-bootstrap controller node so we start
    stopping a bunch of services
    2. Pacemaker will notice that services are down and will mark
    the services as stopped
    3. Code gets run on the bootstrap node (controller-0) and the
    check_clean_cluster function will fail and exit
    4. Eventually also the script on the non-bootstrap controller node will
    timeout and exit because the cluster never shut down (it never actually
    started the shutdown because we failed at 3)

    Let's make sure we first only call the HA NG migration step as a
    separate heat step. Only afterwards we start shutting down the systemd
    services on all nodes.

    We also need to move the STONITH_STATE variable into a file because it
    is being used across two different scripts (1 and 2) and we need to
    store that state.

    Co-Authored-By: Athlan-Guyot Sofer <email address hidden>

    Closes-Bug: #1640407
    Change-Id: Ifb9b9e633fcc77604cca2590071656f4b2275c60
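
As a rough sketch of the last point in the commit message (keeping STONITH_STATE in a file so that a second, separate script can read it), something along these lines would work in bash. The file location, both function names and the value "true" are assumptions chosen for illustration; the actual change may store and read the state differently.

```bash
#!/usr/bin/env bash
# Sketch only: the state file path and both function names are hypothetical.

STATE_FILE=${STATE_FILE:-$(mktemp)}

# First script: record the value so that it survives after this script exits.
save_stonith_state() {
    printf '%s\n' "$1" > "$STATE_FILE"
}

# Second script: read back the value stored by the first one.
# Returns non-zero if the first script never wrote the file.
load_stonith_state() {
    [[ -r $STATE_FILE ]] || return 1
    STONITH_STATE=$(< "$STATE_FILE")
}

save_stonith_state "true"     # placeholder value
unset STONITH_STATE           # simulate starting a new, separate script
load_stonith_state && echo "restored STONITH_STATE=$STONITH_STATE"
```

A shell variable only lives as long as the process that set it, so once the work is split across two heat steps the value has to be written somewhere both scripts can reach.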

Changed in tripleo:
status:	In Progress → Fix Released

OpenStack Infra (hudson-openstack) wrote on 2016-11-11: Fix merged to tripleo-heat-templates (stable/newton)

Reviewed: https://review.openstack.org/395460
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=cf465d87b16d2865a0826b3a8c60370b8820c0e7
Submitter: Jenkins
Branch: stable/newton

commit cf465d87b16d2865a0826b3a8c60370b8820c0e7
Author: Michele Baldessari <email address hidden>
Date: Wed Nov 9 09:05:08 2016 +0100

Fix race during major-upgrade-pacemaker step

    Currently when we call the major-upgrade step we do the following:
    """
    ...
    if [[ -n $(is_bootstrap_node) ]]; then
        check_clean_cluster
    fi
    ...
    if [[ -n $(is_bootstrap_node) ]]; then
        migrate_full_to_ng_ha
    fi
    ...
    for service in $(services_to_migrate); do
        manage_systemd_service stop "${service%%-clone}"
        ...
    done
    """

    The problem with the above code is that it is open to the following race
    condition:
    1. Code gets run first on a non-bootstrap controller node so we start
    stopping a bunch of services
    2. Pacemaker will notice that services are down and will mark
    the services as stopped
    3. Code gets run on the bootstrap node (controller-0) and the
    check_clean_cluster function will fail and exit
    4. Eventually also the script on the non-bootstrap controller node will
    timeout and exit because the cluster never shut down (it never actually
    started the shutdown because we failed at 3)

    Let's make sure we first only call the HA NG migration step as a
    separate heat step. Only afterwards we start shutting down the systemd
    services on all nodes.

    We also need to move the STONITH_STATE variable into a file because it
    is being used across two different scripts (1 and 2) and we need to
    store that state.

    Co-Authored-By: Athlan-Guyot Sofer <email address hidden>

    Closes-Bug: #1640407
    Change-Id: Ifb9b9e633fcc77604cca2590071656f4b2275c60
    (cherry picked from commit dde12b075ff51d4def4f49e635dd390a7f1f2cac)

tags:

added: in-stable-newton

OpenStack Infra (hudson-openstack) wrote on 2016-11-17: Fix included in openstack/tripleo-heat-templates 6.0.0.0b1

This issue was fixed in the openstack/tripleo-heat-templates 6.0.0.0b1 development milestone.

OpenStack Infra (hudson-openstack) wrote on 2017-01-03: Fix included in openstack/tripleo-heat-templates 5.2.0

This issue was fixed in the openstack/tripleo-heat-templates 5.2.0 release.