concurrent restart bundles fail in composable HA

Bug #1892206 reported by Michele Baldessari
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Michele Baldessari

Bug Description

When a redeploy command is run in a composable HA environment and there are any configuration changes, the <bundle>_restart containers are kicked off. These restart containers then try to restart their bundles globally across the cluster.

These restarts are fired off in parallel from different nodes: haproxy-bundle is restarted from controller-0, the mysql bundle (galera-bundle in the output below) from database-0, and rabbitmq-bundle from messaging-0.
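
To illustrate the effect (these exact commands and the timeout are assumptions for the sketch, not the literal scripts shipped in tripleo-heat-templates), it is roughly as if the following ran at the same time:

    # on controller-0
    pcs resource restart haproxy-bundle --wait=600
    # on database-0
    pcs resource restart galera-bundle --wait=600
    # on messaging-0
    pcs resource restart rabbitmq-bundle --wait=600

Each "pcs resource restart" acts cluster-wide, so three nodes end up driving global restarts concurrently.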

This has proven problematic: very often (rhbz#1868113) the redeploy would fail with:
2020-08-11T13:40:25.996896822+00:00 stderr F Error: Could not complete shutdown of rabbitmq-bundle, 1 resources remaining
2020-08-11T13:40:25.996896822+00:00 stderr F Error performing operation: Timer expired
2020-08-11T13:40:25.996896822+00:00 stderr F Set 'rabbitmq-bundle' option: id=rabbitmq-bundle-meta_attributes-target-role set=rabbitmq-bundle-meta_attributes name=target-role value=stopped
2020-08-11T13:40:25.996896822+00:00 stderr F Waiting for 2 resources to stop:
2020-08-11T13:40:25.996896822+00:00 stderr F * galera-bundle
2020-08-11T13:40:25.996896822+00:00 stderr F * rabbitmq-bundle
2020-08-11T13:40:25.996896822+00:00 stderr F * galera-bundle
2020-08-11T13:40:25.996896822+00:00 stderr F Deleted 'rabbitmq-bundle' option: id=rabbitmq-bundle-meta_attributes-target-role name=target-role
2020-08-11T13:40:25.996896822+00:00 stderr F

or

2020-08-11T13:39:49.197487180+00:00 stderr F Waiting for 2 resources to start again:
2020-08-11T13:39:49.197487180+00:00 stderr F * galera-bundle
2020-08-11T13:39:49.197487180+00:00 stderr F * rabbitmq-bundle
2020-08-11T13:39:49.197487180+00:00 stderr F Could not complete restart of galera-bundle, 1 resources remaining
2020-08-11T13:39:49.197487180+00:00 stderr F * rabbitmq-bundle
2020-08-11T13:39:49.197487180+00:00 stderr F

After discussing it with kgaillot, it seems that concurrent restarts in Pacemaker (pcmk) are simply brittle:
"""
Sadly restarts are brittle, and they do in fact assume that nothing else is causing resources to start or stop. They work like this:

- Get the current configuration and state of the cluster, including a list of active resources (list #1)
- Set resource target-role to Stopped
- Get the current configuration and state of the cluster, including a list of which resources *should* be active (list #2)
- Compare lists #1 and #2, and the difference is the resources that should stop
- Periodically refresh the configuration and state until the list of active resources matches list #2
- Delete the target-role
- Periodically refresh the configuration and state until the list of active resources matches list #1
"""

So the suggestion is to replace the restarts with an enable/disable cycle of the resource.
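
A minimal sketch of what that enable/disable cycle could look like with plain pcs commands (the bundle name and timeout are just examples; the real change is in the tripleo-heat-templates reviews linked below):

    # Stop only this bundle and wait for it to be stopped...
    pcs resource disable rabbitmq-bundle --wait=600
    # ...then start it again and wait for it to be running.
    pcs resource enable rabbitmq-bundle --wait=600

Each wait here only has to see the single resource being cycled reach its target state, so bundles being stopped or started concurrently from other nodes do not invalidate the check the way the restart's list comparison does.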

Changed in tripleo:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/746936

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/746660
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=dcfc98d23606ddec7b0d4f91bb1dc1b3c2a7409c
Submitter: Zuul
Branch: master

commit dcfc98d23606ddec7b0d4f91bb1dc1b3c2a7409c
Author: Michele Baldessari <email address hidden>
Date: Tue Aug 18 10:29:19 2020 +0200

    Fix pcs restart in composable HA

    When a redeploy command is being run in a composable HA environment, if there
    are any configuration changes, the <bundle>_restart containers will be kicked
    off. These restart containers will then try and restart the bundles globally in
    the cluster.

    These restarts will be fired off in parallel from different nodes. So
    haproxy-bundle will be restarted from controller-0, mysql-bundle from
    database-0, rabbitmq-bundle from messaging-0.

    This has proven to be problematic and very often (rhbz#1868113) it would fail
    the redeploy with:
    2020-08-11T13:40:25.996896822+00:00 stderr F Error: Could not complete shutdown of rabbitmq-bundle, 1 resources remaining
    2020-08-11T13:40:25.996896822+00:00 stderr F Error performing operation: Timer expired
    2020-08-11T13:40:25.996896822+00:00 stderr F Set 'rabbitmq-bundle' option: id=rabbitmq-bundle-meta_attributes-target-role set=rabbitmq-bundle-meta_attributes name=target-role value=stopped
    2020-08-11T13:40:25.996896822+00:00 stderr F Waiting for 2 resources to stop:
    2020-08-11T13:40:25.996896822+00:00 stderr F * galera-bundle
    2020-08-11T13:40:25.996896822+00:00 stderr F * rabbitmq-bundle
    2020-08-11T13:40:25.996896822+00:00 stderr F * galera-bundle
    2020-08-11T13:40:25.996896822+00:00 stderr F Deleted 'rabbitmq-bundle' option: id=rabbitmq-bundle-meta_attributes-target-role name=target-role
    2020-08-11T13:40:25.996896822+00:00 stderr F

    or

    2020-08-11T13:39:49.197487180+00:00 stderr F Waiting for 2 resources to start again:
    2020-08-11T13:39:49.197487180+00:00 stderr F * galera-bundle
    2020-08-11T13:39:49.197487180+00:00 stderr F * rabbitmq-bundle
    2020-08-11T13:39:49.197487180+00:00 stderr F Could not complete restart of galera-bundle, 1 resources remaining
    2020-08-11T13:39:49.197487180+00:00 stderr F * rabbitmq-bundle
    2020-08-11T13:39:49.197487180+00:00 stderr F

    After discussing it with kgaillot it seems that concurrent restarts in pcmk are just brittle:
    """
    Sadly restarts are brittle, and they do in fact assume that nothing else is causing resources to start or stop. They work like this:

    - Get the current configuration and state of the cluster, including a list of active resources (list #1)
    - Set resource target-role to Stopped
    - Get the current configuration and state of the cluster, including a list of which resources *should* be active (list #2)
    - Compare lists #1 and #2, and the difference is the resources that should stop
    - Periodically refresh the configuration and state until the list of active resources matches list #2
    - Delete the target-role
    - Periodically refresh the configuration and state until the list of active resources matches list #1
    """

    So the suggestion is to replace the restarts with an enable/disable cycle of the resource.


Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/ussuri)

Reviewed: https://review.opendev.org/746936
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=9ee9b945f85a2d6e0e13015f9f32b0db7fb3b883
Submitter: Zuul
Branch: stable/ussuri

commit 9ee9b945f85a2d6e0e13015f9f32b0db7fb3b883
Author: Michele Baldessari <email address hidden>
Date: Tue Aug 18 10:29:19 2020 +0200

    Fix pcs restart in composable HA

    When a redeploy command is being run in a composable HA environment, if there
    are any configuration changes, the <bundle>_restart containers will be kicked
    off. These restart containers will then try and restart the bundles globally in
    the cluster.

    These restarts will be fired off in parallel from different nodes. So
    haproxy-bundle will be restarted from controller-0, mysql-bundle from
    database-0, rabbitmq-bundle from messaging-0.

    This has proven to be problematic and very often (rhbz#1868113) it would fail
    the redeploy with:
    2020-08-11T13:40:25.996896822+00:00 stderr F Error: Could not complete shutdown of rabbitmq-bundle, 1 resources remaining
    2020-08-11T13:40:25.996896822+00:00 stderr F Error performing operation: Timer expired
    2020-08-11T13:40:25.996896822+00:00 stderr F Set 'rabbitmq-bundle' option: id=rabbitmq-bundle-meta_attributes-target-role set=rabbitmq-bundle-meta_attributes name=target-role value=stopped
    2020-08-11T13:40:25.996896822+00:00 stderr F Waiting for 2 resources to stop:
    2020-08-11T13:40:25.996896822+00:00 stderr F * galera-bundle
    2020-08-11T13:40:25.996896822+00:00 stderr F * rabbitmq-bundle
    2020-08-11T13:40:25.996896822+00:00 stderr F * galera-bundle
    2020-08-11T13:40:25.996896822+00:00 stderr F Deleted 'rabbitmq-bundle' option: id=rabbitmq-bundle-meta_attributes-target-role name=target-role
    2020-08-11T13:40:25.996896822+00:00 stderr F

    or

    2020-08-11T13:39:49.197487180+00:00 stderr F Waiting for 2 resources to start again:
    2020-08-11T13:39:49.197487180+00:00 stderr F * galera-bundle
    2020-08-11T13:39:49.197487180+00:00 stderr F * rabbitmq-bundle
    2020-08-11T13:39:49.197487180+00:00 stderr F Could not complete restart of galera-bundle, 1 resources remaining
    2020-08-11T13:39:49.197487180+00:00 stderr F * rabbitmq-bundle
    2020-08-11T13:39:49.197487180+00:00 stderr F

    After discussing it with kgaillot it seems that concurrent restarts in pcmk are just brittle:
    """
    Sadly restarts are brittle, and they do in fact assume that nothing else is causing resources to start or stop. They work like this:

    - Get the current configuration and state of the cluster, including a list of active resources (list #1)
    - Set resource target-role to Stopped
    - Get the current configuration and state of the cluster, including a list of which resources *should* be active (list #2)
    - Compare lists #1 and #2, and the difference is the resources that should stop
    - Periodically refresh the configuration and state until the list of active resources matches list #2
    - Delete the target-role
    - Periodically refresh the configuration and state until the list of active resources matches list #1
    """

    So the suggestion is to replace the restarts with an enable/disable cycle of the resource.


tags: added: in-stable-ussuri
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/train)

Reviewed: https://review.opendev.org/746662
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=1fdfa3332321d9dd1246c7d13c77d93c10f7c3b3
Submitter: Zuul
Branch: stable/train

commit 1fdfa3332321d9dd1246c7d13c77d93c10f7c3b3
Author: Michele Baldessari <email address hidden>
Date: Tue Aug 18 10:29:19 2020 +0200

    Fix pcs restart in composable HA

    When a redeploy command is being run in a composable HA environment, if there
    are any configuration changes, the <bundle>_restart containers will be kicked
    off. These restart containers will then try and restart the bundles globally in
    the cluster.

    These restarts will be fired off in parallel from different nodes. So
    haproxy-bundle will be restarted from controller-0, mysql-bundle from
    database-0, rabbitmq-bundle from messaging-0.

    This has proven to be problematic and very often (rhbz#1868113) it would fail
    the redeploy with:
    2020-08-11T13:40:25.996896822+00:00 stderr F Error: Could not complete shutdown of rabbitmq-bundle, 1 resources remaining
    2020-08-11T13:40:25.996896822+00:00 stderr F Error performing operation: Timer expired
    2020-08-11T13:40:25.996896822+00:00 stderr F Set 'rabbitmq-bundle' option: id=rabbitmq-bundle-meta_attributes-target-role set=rabbitmq-bundle-meta_attributes name=target-role value=stopped
    2020-08-11T13:40:25.996896822+00:00 stderr F Waiting for 2 resources to stop:
    2020-08-11T13:40:25.996896822+00:00 stderr F * galera-bundle
    2020-08-11T13:40:25.996896822+00:00 stderr F * rabbitmq-bundle
    2020-08-11T13:40:25.996896822+00:00 stderr F * galera-bundle
    2020-08-11T13:40:25.996896822+00:00 stderr F Deleted 'rabbitmq-bundle' option: id=rabbitmq-bundle-meta_attributes-target-role name=target-role
    2020-08-11T13:40:25.996896822+00:00 stderr F

    or

    2020-08-11T13:39:49.197487180+00:00 stderr F Waiting for 2 resources to start again:
    2020-08-11T13:39:49.197487180+00:00 stderr F * galera-bundle
    2020-08-11T13:39:49.197487180+00:00 stderr F * rabbitmq-bundle
    2020-08-11T13:39:49.197487180+00:00 stderr F Could not complete restart of galera-bundle, 1 resources remaining
    2020-08-11T13:39:49.197487180+00:00 stderr F * rabbitmq-bundle
    2020-08-11T13:39:49.197487180+00:00 stderr F

    After discussing it with kgaillot it seems that concurrent restarts in pcmk are just brittle:
    """
    Sadly restarts are brittle, and they do in fact assume that nothing else is causing resources to start or stop. They work like this:

    - Get the current configuration and state of the cluster, including a list of active resources (list #1)
    - Set resource target-role to Stopped
    - Get the current configuration and state of the cluster, including a list of which resources *should* be active (list #2)
    - Compare lists #1 and #2, and the difference is the resources that should stop
    - Periodically refresh the configuration and state until the list of active resources matches list #2
    - Delete the target-role
    - Periodically refresh the configuration and state until the list of active resources matches list #1
    """

    So the suggestion is to replace the restarts with an enable/disable cycle of the resource.


tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/queens)

Reviewed: https://review.opendev.org/746665
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=a7e9730a8566cf57713a00bfc706596847bb5fbf
Submitter: Zuul
Branch: stable/queens

commit a7e9730a8566cf57713a00bfc706596847bb5fbf
Author: Michele Baldessari <email address hidden>
Date: Tue Aug 18 10:55:29 2020 +0200

    Fix pcs restart in composable HA

    When a redeploy command is being run in a composable HA environment, if there
    are any configuration changes, the <bundle>_restart containers will be kicked
    off. These restart containers will then try and restart the bundles globally in
    the cluster.

    These restarts will be fired off in parallel from different nodes. So
    haproxy-bundle will be restarted from controller-0, mysql-bundle from
    database-0, rabbitmq-bundle from messaging-0.

    This has proven to be problematic and very often (rhbz#1868113) it would fail
    the redeploy with:
    2020-08-11T13:40:25.996896822+00:00 stderr F Error: Could not complete shutdown of rabbitmq-bundle, 1 resources remaining
    2020-08-11T13:40:25.996896822+00:00 stderr F Error performing operation: Timer expired
    2020-08-11T13:40:25.996896822+00:00 stderr F Set 'rabbitmq-bundle' option: id=rabbitmq-bundle-meta_attributes-target-role set=rabbitmq-bundle-meta_attributes name=target-role value=stopped
    2020-08-11T13:40:25.996896822+00:00 stderr F Waiting for 2 resources to stop:
    2020-08-11T13:40:25.996896822+00:00 stderr F * galera-bundle
    2020-08-11T13:40:25.996896822+00:00 stderr F * rabbitmq-bundle
    2020-08-11T13:40:25.996896822+00:00 stderr F * galera-bundle
    2020-08-11T13:40:25.996896822+00:00 stderr F Deleted 'rabbitmq-bundle' option: id=rabbitmq-bundle-meta_attributes-target-role name=target-role
    2020-08-11T13:40:25.996896822+00:00 stderr F

    or

    2020-08-11T13:39:49.197487180+00:00 stderr F Waiting for 2 resources to start again:
    2020-08-11T13:39:49.197487180+00:00 stderr F * galera-bundle
    2020-08-11T13:39:49.197487180+00:00 stderr F * rabbitmq-bundle
    2020-08-11T13:39:49.197487180+00:00 stderr F Could not complete restart of galera-bundle, 1 resources remaining
    2020-08-11T13:39:49.197487180+00:00 stderr F * rabbitmq-bundle
    2020-08-11T13:39:49.197487180+00:00 stderr F

    After discussing it with kgaillot it seems that concurrent restarts in pcmk are just brittle:
    """
    Sadly restarts are brittle, and they do in fact assume that nothing else is causing resources to start or stop. They work like this:

    - Get the current configuration and state of the cluster, including a list of active resources (list #1)
    - Set resource target-role to Stopped
    - Get the current configuration and state of the cluster, including a list of which resources *should* be active (list #2)
    - Compare lists #1 and #2, and the difference is the resources that should stop
    - Periodically refresh the configuration and state until the list of active resources matches list #2
    - Delete the target-role
    - Periodically refresh the configuration and state until the list of active resources matches list #1
    """

    So the suggestion is to replace the restarts with an enable/disable cycle of the resource.


tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 11.4.0

This issue was fixed in the openstack/tripleo-heat-templates 11.4.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates queens-eol

This issue was fixed in the openstack/tripleo-heat-templates queens-eol release.
