Adding TLS external endpoint during a stack update breaks with HA overcloud

Bug #1839858 reported by Damien Ciabrini
This bug affects 1 person
Affects   Status         Importance   Assigned to       Milestone
tripleo   Fix Released   High         Damien Ciabrini

Bug Description

When an HA overcloud is deployed without any TLS and a stack update is run afterwards to enable TLS for external endpoints, the HAProxy service may fail to restart.
In that case, pacemaker also stops the VIPs automatically due to colocation constraints, and the stack update finishes in error.

This is due to the way the HAProxy container is restarted on update:

  . The haproxy pacemaker service file is designed to restart HAProxy when either the haproxy configuration or the pacemaker bundle configuration changes.

  . When the TLS endpoints are added, the haproxy config is regenerated and the pacemaker bundle needs to bind-mount a new pem file in the haproxy container.

  . So this triggers two restarts of the haproxy containers.

  . However, the first restart is the one triggered by the config change.

  . When the haproxy container is restarted, haproxy reads the new config and tries to load the certificates and keys for the TLS endpoints, but the pem file is not bind-mounted yet because the bundle configuration hasn't been updated yet.

  . So all haproxy containers fail to start.
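
The ordering problem described above can be illustrated with a small model. This is only a sketch, not TripleO code: the step names and the `haproxy_can_start` helper are hypothetical, and the model assumes nothing beyond what the description states, namely that the new config references a pem file that only becomes visible in the container once the bundle configuration has been updated.

```python
def haproxy_can_start(config_needs_pem: bool, pem_mounted: bool) -> bool:
    """HAProxy starts only if the pem file its config references is visible."""
    return (not config_needs_pem) or pem_mounted


def run_update(steps):
    """Apply update steps in order; return False at the first failed restart."""
    # The regenerated config (with TLS endpoints) is already on disk,
    # but the pem file is not bind-mounted in the container yet.
    config_needs_pem = True
    pem_mounted = False
    for step in steps:
        if step == "update_bundle":
            # Hypothetical step: the bundle is updated and the container is
            # recreated with the new bind-mount.
            pem_mounted = True
        elif step != "restart_on_config_change":
            raise ValueError(f"unknown step: {step}")
        # Both steps restart the container, so HAProxy must be able to start.
        if not haproxy_can_start(config_needs_pem, pem_mounted):
            return False
    return True


# Order described in this bug: the config-triggered restart comes first.
buggy_order = run_update(["restart_on_config_change", "update_bundle"])
# Reversed order: the bundle is updated before the config-triggered restart.
fixed_order = run_update(["update_bundle", "restart_on_config_change"])
```

In this model the first ordering fails at the very first restart, while updating the bundle first lets both restarts succeed.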

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (master)

Fix proposed to branch: master
Review: https://review.opendev.org/675993

Changed in tripleo:
assignee: Damien Ciabrini (dciabrin) → Michele Baldessari (michele)
Changed in tripleo:
milestone: train-3 → ussuri-1
Changed in tripleo:
assignee: Michele Baldessari (michele) → Damien Ciabrini (dciabrin)
Changed in tripleo:
milestone: ussuri-1 → ussuri-2
Changed in tripleo:
importance: Medium → High
tags: added: idempotency queens-backport-potential train-backport-potential
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/675993
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=3230f005c1d51863a2c2484fe4c05471f5dc25dc
Submitter: Zuul
Branch: master

commit 3230f005c1d51863a2c2484fe4c05471f5dc25dc
Author: Damien Ciabrini <email address hidden>
Date: Fri Nov 15 17:41:42 2019 +0100

    HA: reorder init_bundle and restart_bundle for improved updates

    A pacemaker bundle can be restarted either because:
      . a tripleo config has been updated (from /var/lib/config-data)
      . the bundle config has been updated (container image, bundle
        parameter,...)

    In HA services, special container "*_restart_bundle" is in charge
    of restarting the HA service on tripleo config change. Special
    container "*_init_bundle" handles restart on bundle config change.

    When both types of change occur at the same time, the bundle must
    be restarted first, so that the container has a chance to be
    recreated with all bind-mounts updated before it tries to reload
    the updated config.

    Implement the improvement with two changes:

    1. Make the "*_restart_bundle" start after the "*_init_bundle", and
    make sure "*_restart_bundle" is only enabled after the initial
    deployment.

    2. During minor update, make sure that the "*_restart_bundle" not
    only restarts the container, but also waits until the service
    is operational (e.g. galera fully promoted to Master). This forces
    the rolling restart to happen sequentially, and avoids service
    disruption in quorum-based clustered services like galera and
    rabbitmq.

    Tested the following update use cases:

    * minor update: ensure that *_restart_bundle restarts all types of
      resources (OCF, bundles, A/P, A/P Master/Slave).

    * minor update: ensure *_restart_bundle is not executed when no
      config or image update happened for a service.

    * restart_bundle: when resource (OCF or container) fails to
      restart, bail out early instead of waiting for nothing until
      timeout is reached.

    * restart_bundle: make sure a resource is restarted even when it
      is in failed state when *_restart_bundle is called.

    * restart_bundle: A/P can be restarted on any node, so watch
      restart globally. When the resource restarts as Slave, continue
      watching for a Master elsewhere in the cluster.

    * restart_bundle: if an A/P is not running locally, make sure it
      doesn't get restarted anywhere else in the cluster.

    * restart_bundle: do not try to restart stopped (disabled) or
      unmanaged resource. Bail out early instead, to not wait until
      timeout is reached.

    * stack update: make sure that running a stack update with no
      change does not trigger any *_restart_bundle, and does not
      restart any HA container either.

    * stack update: when bundle and config will change, ensure bundle
      is updated before HA containers are restarted (e.g. HAProxy
      migration to TLS everywhere)

    Change-Id: Ic41d4597e9033f9d7847bb6c10c25f443fbd5b0e
    Closes-Bug: #1839858
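
The "wait until operational, bail out early" behaviour listed in the commit message can be sketched as a polling loop. This is a minimal illustration under stated assumptions, not the script shipped in tripleo-heat-templates: the function name, the state strings and the `get_state` callback are all hypothetical stand-ins for whatever the real script queries from the cluster.

```python
import time
from typing import Callable

# Hypothetical state names: resources in these states will never become
# operational on their own, so waiting for the timeout would be pointless.
BAIL_OUT_STATES = {"failed", "stopped", "unmanaged"}


def wait_until_operational(
    get_state: Callable[[], str],
    timeout: float = 60.0,
    interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll a resource after a restart until it is operational.

    Returns "ok" when the resource is operational, "bail-out" as soon as it
    is seen in a state that cannot recover, and "timeout" otherwise.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        state = get_state()
        if state == "operational":
            return "ok"
        if state in BAIL_OUT_STATES:
            return "bail-out"
        # Still starting (or promoted elsewhere later): keep watching.
        sleep(interval)
    return "timeout"
```

Because the caller blocks until the function returns, running it on one node after each restart is one way to make a rolling restart sequential.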

Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/train)

Fix proposed to branch: stable/train
Review: https://review.opendev.org/707907

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 12.1.0

This issue was fixed in the openstack/tripleo-heat-templates 12.1.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/train)

Reviewed: https://review.opendev.org/707907
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=2bd4cdeb2f7887208b863e9c19ab136e3fbf4958
Submitter: Zuul
Branch: stable/train

commit 2bd4cdeb2f7887208b863e9c19ab136e3fbf4958
Author: Damien Ciabrini <email address hidden>
Date: Fri Nov 15 17:41:42 2019 +0100

    HA: reorder init_bundle and restart_bundle for improved updates

    Change-Id: Ic41d4597e9033f9d7847bb6c10c25f443fbd5b0e
    Closes-Bug: #1839858
    (cherry picke...

tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 11.4.0

This issue was fixed in the openstack/tripleo-heat-templates 11.4.0 release.
