Minor update of HA services doesn't restart containers on config change

Bug #1841629 reported by Damien Ciabrini
This bug affects 1 person
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Damien Ciabrini
Milestone: none listed

Bug Description

On HA overcloud, there are three different ways a pacemaker-managed container may need to be restarted:
  1. a container image update
  2. a tripleo service config change (in /var/lib/config-data/puppet-generated/<service>)
  3. a pacemaker resource config update (i.e. pcs resource update <...>)

Case 1 has to take place during a minor update workflow (i.e. not during a stack redeploy/update), because it requires a coordinated action. In this workflow, various ansible tasks are executed sequentially, one controller after the other. The sequential nature of the workflow ensures that the image update is coordinated across the entire cluster.

Cases 2 and 3 are handled a bit differently depending on the action (stack update or minor update workflow):

Stack update:
docker-puppet regenerates the service configs on all the controller nodes. Then:
  . a special transient container <service>_init_bundle is run on a single controller, and restarts the pacemaker resource on all nodes if the bundle config has changed.
  . another special container <service>_restart_bundle is run on a single controller, and it can also restart the pacemaker resource on all nodes if the tripleo service config has changed since last run.
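
As an illustration of the "restart only if the config has changed since last run" check, here is a minimal Python sketch. It assumes that a change is detected by hashing the content of the service's config files; the function names (config_hash, needs_restart) and the sample file are hypothetical and are not taken from tripleo-heat-templates or paunch.

```python
import hashlib
from typing import Dict, Optional


def config_hash(files: Dict[str, str]) -> str:
    """Hash file names and contents in a stable (sorted) order."""
    digest = hashlib.sha256()
    for name in sorted(files):
        digest.update(name.encode())
        digest.update(b"\0")  # separator so name/content boundaries are unambiguous
        digest.update(files[name].encode())
        digest.update(b"\0")
    return digest.hexdigest()


def needs_restart(previous_hash: Optional[str], files: Dict[str, str]) -> bool:
    """True when the config differs from what was recorded at the last run."""
    return previous_hash != config_hash(files)


# Hypothetical config content, for illustration only.
old = {"galera.cnf": "max_connections=100"}
new = {"galera.cnf": "max_connections=200"}
recorded = config_hash(old)
print(needs_restart(recorded, old))  # False: nothing changed
print(needs_restart(recorded, new))  # True: content changed
```

A first run with no recorded hash (None) always reports that a restart is needed.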

Minor update:
The pacemaker cluster is restarted on each controller node, sequentially. This guarantees that all pacemaker-managed containers are restarted unconditionally, and without service disruption (services restart on one node at a time, so there are always two controller nodes available).

Thanks to the unconditional restart, we avoid running the <service>_restart_bundle container during a minor update, because (1) we know that the service will get restarted anyway, and (2) running restart_bundle during the minor update would restart the service on _all_ nodes at once, which would break service availability.
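
The availability argument can be made concrete with a small Python model. This is only a sketch: it assumes three controllers named controller-0..2 (hypothetical names) and treats a restart as "node down, then node up again"; it does not model pacemaker itself.

```python
from typing import List


def min_available_rolling(nodes: List[str]) -> int:
    """Restart nodes one at a time; return the lowest number of nodes up."""
    up = set(nodes)
    lowest = len(up)
    for node in nodes:
        up.discard(node)             # this node goes down for its restart
        lowest = min(lowest, len(up))
        up.add(node)                 # it rejoins before the next node starts
    return lowest


def min_available_all_at_once(nodes: List[str]) -> int:
    """Restart every node in the same step; return the lowest number of nodes up."""
    up = set(nodes)
    up.difference_update(nodes)      # all nodes go down together
    return len(up)


controllers = ["controller-0", "controller-1", "controller-2"]
print(min_available_rolling(controllers))      # 2
print(min_available_all_at_once(controllers))  # 0
```

With the sequential restart, two of the three controllers stay up at every step; a restart on all nodes at once leaves none.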

However, the approach of "skipping <service>_restart_bundle in minor update" makes the underlying assumption that when the pacemaker cluster restarts on a node, the configs have already been regenerated by docker-puppet. But this assumption is false [1], so a restarted container won't pick up any config update that happens later during the minor update.

[1] https://review.opendev.org/#/c/635725/5/common/deploy-steps.j2@494 : ansible runs the update tasks _before_ the deploy tasks, one controller at a time. This means that when pacemaker is restarted during the update tasks, the config hasn't been regenerated yet.
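
The ordering problem can be reproduced with a toy Python model. It is a sketch under simplifying assumptions: each node has one config "on disk" and one config "loaded" by the running container, a pacemaker restart makes the container reload whatever is on disk, and the deploy tasks write the new config. The node names and the "old"/"new" labels are hypothetical.

```python
from typing import Dict, List, Tuple


def minor_update(nodes: List[str], new_config: str) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run update tasks, then deploy tasks, one node after the other."""
    on_disk = dict.fromkeys(nodes, "old")
    loaded = dict.fromkeys(nodes, "old")
    for node in nodes:
        # Update tasks: pacemaker restarts, the container reloads the on-disk config.
        loaded[node] = on_disk[node]
        # Deploy tasks: the config is regenerated only now, after the restart.
        on_disk[node] = new_config
    return on_disk, loaded


disk, running = minor_update(["controller-0", "controller-1", "controller-2"], "new")
print(disk)     # every node has "new" on disk
print(running)  # every container still runs with "old"
```

In this model every container ends the minor update with the stale config, even though it was restarted.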

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.opendev.org/679102

tags: added: queens-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/679102
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=7f785e87579e0beaaa325e479d8387dc288c08ea
Submitter: Zuul
Branch: master

commit 7f785e87579e0beaaa325e479d8387dc288c08ea
Author: Damien Ciabrini <email address hidden>
Date: Wed Aug 28 18:25:43 2019 +0200

    HA: fix <service>_restart_bundle with minor update workflow

    For each HA service we have a paunch container <service>_restart_bundle
    which is started by paunch whenever config files changes during stack
    deploy/update. This container runs a pcs command on a single node to
    restart all the service's containers (e.g. all galera on all controllers).
    By design, when it is run, configs have already been regenerated by the
    deploy tasks on all nodes.

    For minor updates, the workflow runs differently: all the steps of the
    deploy tasks are run one node after the other, so when
    <service>_restart_bundle is called, there is no guarantee that the
    service's configs have been regenerated on all the nodes yet.

    To fix the wrong restart behaviour, only restart local containers when
    running during a minor update. And run once per node. When the minor
    update workflow calls <service>_restart_container, we still have the
    guarantee that the config files are already regenerated locally.

    Co-Authored-By: Michele Baldessari <email address hidden>
    Co-Authored-By: Luca Miccini <email address hidden>

    Change-Id: I92d4ddf2feeac06ce14468ae928c283f3fd04f45
    Closes-Bug: #1841629
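
The behaviour described in this commit message can be sketched in Python as follows. This is not the code from the change; it is a toy model that assumes a restart reloads the on-disk config, and the names (restart_targets, minor_update_with_fix, controller-N) are hypothetical.

```python
from typing import Dict, List


def restart_targets(all_nodes: List[str], local_node: str, minor_update: bool) -> List[str]:
    """Nodes whose containers are restarted by one restart_bundle run."""
    if minor_update:
        return [local_node]   # only the local config is known to be regenerated
    return list(all_nodes)    # stack deploy/update: configs are ready on all nodes


def minor_update_with_fix(nodes: List[str], new_config: str) -> Dict[str, str]:
    """Per node: pacemaker restart, config regeneration, then a local-only restart."""
    on_disk = dict.fromkeys(nodes, "old")
    loaded = dict.fromkeys(nodes, "old")
    for node in nodes:
        loaded[node] = on_disk[node]   # update tasks: restart with the old config
        on_disk[node] = new_config     # deploy tasks: config regenerated locally
        for target in restart_targets(nodes, node, minor_update=True):
            loaded[target] = on_disk[target]   # restart picks up the on-disk config
    return loaded


nodes = ["controller-0", "controller-1", "controller-2"]
print(restart_targets(nodes, "controller-1", minor_update=True))   # ['controller-1']
print(minor_update_with_fix(nodes, "new"))  # every container runs with "new"
```

Because each run touches a single node, at most one node restarts at a time, and that node's config has already been regenerated when the restart happens.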

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/stein)

Fix proposed to branch: stable/stein
Review: https://review.opendev.org/680673

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/stein)

Reviewed: https://review.opendev.org/680673
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=2d7c6823422dec2c6c8b238bc528ec8b9c8f0251
Submitter: Zuul
Branch: stable/stein

commit 2d7c6823422dec2c6c8b238bc528ec8b9c8f0251
Author: Damien Ciabrini <email address hidden>
Date: Wed Aug 28 18:25:43 2019 +0200

    HA: fix <service>_restart_bundle with minor update workflow

    For each HA service we have a paunch container <service>_restart_bundle
    which is started by paunch whenever config files changes during stack
    deploy/update. This container runs a pcs command on a single node to
    restart all the service's containers (e.g. all galera on all controllers).
    By design, when it is run, configs have already been regenerated by the
    deploy tasks on all nodes.

    For minor updates, the workflow runs differently: all the steps of the
    deploy tasks are run one node after the other, so when
    <service>_restart_bundle is called, there is no guarantee that the
    service's configs have been regenerated on all the nodes yet.

    To fix the wrong restart behaviour, only restart local containers when
    running during a minor update. And run once per node. When the minor
    update workflow calls <service>_restart_container, we still have the
    guarantee that the config files are already regenerated locally.

    Co-Authored-By: Michele Baldessari <email address hidden>
    Co-Authored-By: Luca Miccini <email address hidden>

    Change-Id: I92d4ddf2feeac06ce14468ae928c283f3fd04f45
    Closes-Bug: #1841629
    (manually cherry picked from commit 7f785e87579e0beaaa325e479d8387dc288c08ea)

tags: added: in-stable-stein
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 10.6.1

This issue was fixed in the openstack/tripleo-heat-templates 10.6.1 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/rocky)

Fix proposed to branch: stable/rocky
Review: https://review.opendev.org/681782

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/rocky)

Reviewed: https://review.opendev.org/681782
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=44283e9222793a8b8d019a98868d932dccc2b253
Submitter: Zuul
Branch: stable/rocky

commit 44283e9222793a8b8d019a98868d932dccc2b253
Author: Damien Ciabrini <email address hidden>
Date: Thu Sep 12 17:30:08 2019 +0200

    HA: fix <service>_restart_bundle with minor update workflow

    For each HA service we have a paunch container <service>_restart_bundle
    which is started by paunch whenever config files changes during stack
    deploy/update. This container runs a pcs command on a single node to
    restart all the service's containers (e.g. all galera on all controllers).
    By design, when it is run, configs have already been regenerated by the
    deploy tasks on all nodes.

    For minor updates, the workflow runs differently: all the steps of the
    deploy tasks are run one node after the other, so when
    <service>_restart_bundle is called, there is no guarantee that the
    service's configs have been regenerated on all the nodes yet.

    To fix the wrong restart behaviour, only restart local containers when
    running during a minor update. And run once per node. When the minor
    update workflow calls <service>_restart_container, we still have the
    guarantee that the config files are already regenerated locally.

    Co-Authored-By: Michele Baldessari <email address hidden>
    Co-Authored-By: Luca Miccini <email address hidden>

    Change-Id: I92d4ddf2feeac06ce14468ae928c283f3fd04f45
    Closes-Bug: #1841629
    (manually cherry picked from commit 7f785e87579e0beaaa325e479d8387dc288c08ea
    and adapted for Rocky, with the appropriate name changes)

tags: added: in-stable-rocky
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/queens)

Fix proposed to branch: stable/queens
Review: https://review.opendev.org/682315

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 11.2.0

This issue was fixed in the openstack/tripleo-heat-templates 11.2.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/queens)

Reviewed: https://review.opendev.org/682315
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=12ec4ac7096a0b4fb57db378b07b03b3b51a6d6c
Submitter: Zuul
Branch: stable/queens

commit 12ec4ac7096a0b4fb57db378b07b03b3b51a6d6c
Author: Damien Ciabrini <email address hidden>
Date: Thu Sep 12 17:30:08 2019 +0200

    HA: fix <service>_restart_bundle with minor update workflow

    For each HA service we have a paunch container <service>_restart_bundle
    which is started by paunch whenever config files changes during stack
    deploy/update. This container runs a pcs command on a single node to
    restart all the service's containers (e.g. all galera on all controllers).
    By design, when it is run, configs have already been regenerated by the
    deploy tasks on all nodes.

    For minor updates, the workflow runs differently: all the steps of the
    deploy tasks are run one node after the other, so when
    <service>_restart_bundle is called, there is no guarantee that the
    service's configs have been regenerated on all the nodes yet.

    To fix the wrong restart behaviour, only restart local containers when
    running during a minor update. And run once per node. When the minor
    update workflow calls <service>_restart_container, we still have the
    guarantee that the config files are already regenerated locally.

    Co-Authored-By: Michele Baldessari <email address hidden>
    Co-Authored-By: Luca Miccini <email address hidden>

    (cherry picked from commit 44283e9222793a8b8d019a98868d932dccc2b253
    and removed services undefined in Queens)

    Change-Id: I92d4ddf2feeac06ce14468ae928c283f3fd04f45
    Closes-Bug: #1841629

tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates rocky-eol

This issue was fixed in the openstack/tripleo-heat-templates rocky-eol release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates queens-eol

This issue was fixed in the openstack/tripleo-heat-templates queens-eol release.
