concurrent restart bundles fail in composable HA

Bug #1892206 reported by Michele Baldessari
Affects: tripleo
Status: Fix Released
Importance: High
Assigned to: Michele Baldessari

Bug Description

When a redeploy command is run in a composable HA environment and there are any configuration changes, the <bundle>_restart containers are kicked off. These restart containers then try to restart their bundles globally across the cluster.

These restarts are fired off in parallel from different nodes: haproxy-bundle is restarted from controller-0, the mysql bundle (galera-bundle in the output below) from database-0, and rabbitmq-bundle from messaging-0.
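
To illustrate the effect (these exact commands and the timeout are assumptions for the sketch, not the literal scripts shipped in tripleo-heat-templates), it is roughly as if the following ran at the same time:

    # on controller-0
    pcs resource restart haproxy-bundle --wait=600
    # on database-0
    pcs resource restart galera-bundle --wait=600
    # on messaging-0
    pcs resource restart rabbitmq-bundle --wait=600

Each "pcs resource restart" acts cluster-wide, so three nodes end up driving global restarts concurrently.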

This has proven problematic: very often (rhbz#1868113) the redeploy would fail with:
2020-08-11T13:40:25.996896822+00:00 stderr F Error: Could not complete shutdown of rabbitmq-bundle, 1 resources remaining
2020-08-11T13:40:25.996896822+00:00 stderr F Error performing operation: Timer expired
2020-08-11T13:40:25.996896822+00:00 stderr F Set 'rabbitmq-bundle' option: id=rabbitmq-bundle-meta_attributes-target-role set=rabbitmq-bundle-meta_attributes name=target-role value=stopped
2020-08-11T13:40:25.996896822+00:00 stderr F Waiting for 2 resources to stop:
2020-08-11T13:40:25.996896822+00:00 stderr F * galera-bundle
2020-08-11T13:40:25.996896822+00:00 stderr F * rabbitmq-bundle
2020-08-11T13:40:25.996896822+00:00 stderr F * galera-bundle
2020-08-11T13:40:25.996896822+00:00 stderr F Deleted 'rabbitmq-bundle' option: id=rabbitmq-bundle-meta_attributes-target-role name=target-role
2020-08-11T13:40:25.996896822+00:00 stderr F

or

2020-08-11T13:39:49.197487180+00:00 stderr F Waiting for 2 resources to start again:
2020-08-11T13:39:49.197487180+00:00 stderr F * galera-bundle
2020-08-11T13:39:49.197487180+00:00 stderr F * rabbitmq-bundle
2020-08-11T13:39:49.197487180+00:00 stderr F Could not complete restart of galera-bundle, 1 resources remaining
2020-08-11T13:39:49.197487180+00:00 stderr F * rabbitmq-bundle
2020-08-11T13:39:49.197487180+00:00 stderr F

After discussing it with kgaillot, it seems that concurrent restarts in Pacemaker (pcmk) are simply brittle:
"""
Sadly restarts are brittle, and they do in fact assume that nothing else is causing resources to start or stop. They work like this:

- Get the current configuration and state of the cluster, including a list of active resources (list #1)
- Set resource target-role to Stopped
- Get the current configuration and state of the cluster, including a list of which resources *should* be active (list #2)
- Compare lists #1 and #2, and the difference is the resources that should stop
- Periodically refresh the configuration and state until the list of active resources matches list #2
- Delete the target-role
- Periodically refresh the configuration and state until the list of active resources matches list #1
"""

So the suggestion is to replace the restarts with an enable/disable cycle of the resource.
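
A minimal sketch of what that enable/disable cycle could look like with plain pcs commands (the bundle name and timeout are just examples; the real change is in the tripleo-heat-templates reviews linked below):

    # Stop only this bundle and wait for it to be stopped...
    pcs resource disable rabbitmq-bundle --wait=600
    # ...then start it again and wait for it to be running.
    pcs resource enable rabbitmq-bundle --wait=600

Each wait here only has to see the single resource being cycled reach its target state, so bundles being stopped or started concurrently from other nodes do not invalidate the check the way the restart's list comparison does.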

Changed in tripleo:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to tripleo-heat-templates (stable/ussuri)

Fix proposed to branch: stable/ussuri
Review: https://review.opendev.org/746936

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (master)

Reviewed: https://review.opendev.org/746660
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=dcfc98d23606ddec7b0d4f91bb1dc1b3c2a7409c
Submitter: Zuul
Branch: master

commit dcfc98d23606ddec7b0d4f91bb1dc1b3c2a7409c
Author: Michele Baldessari <email address hidden>
Date: Tue Aug 18 10:29:19 2020 +0200

    Fix pcs restart in composable HA

    When a redeploy command is being run in a composable HA environment, if there
    are any configuration changes, the <bundle>_restart containers will be kicked
    off. These restart containers will then try and restart the bundles globally in
    the cluster.

    These restarts will be fired off in parallel from different nodes. So
    haproxy-bundle will be restarted from controller-0, mysql-bundle from
    database-0, rabbitmq-bundle from messaging-0.

    This has proven to be problematic and very often (rhbz#1868113) it would fail
    the redeploy with:
    2020-08-11T13:40:25.996896822+00:00 stderr F Error: Could not complete shutdown of rabbitmq-bundle, 1 resources remaining
    2020-08-11T13:40:25.996896822+00:00 stderr F Error performing operation: Timer expired
    2020-08-11T13:40:25.996896822+00:00 stderr F Set 'rabbitmq-bundle' option: id=rabbitmq-bundle-meta_attributes-target-role set=rabbitmq-bundle-meta_attributes name=target-role value=stopped
    2020-08-11T13:40:25.996896822+00:00 stderr F Waiting for 2 resources to stop:
    2020-08-11T13:40:25.996896822+00:00 stderr F * galera-bundle
    2020-08-11T13:40:25.996896822+00:00 stderr F * rabbitmq-bundle
    2020-08-11T13:40:25.996896822+00:00 stderr F * galera-bundle
    2020-08-11T13:40:25.996896822+00:00 stderr F Deleted 'rabbitmq-bundle' option: id=rabbitmq-bundle-meta_attributes-target-role name=target-role
    2020-08-11T13:40:25.996896822+00:00 stderr F

    or

    2020-08-11T13:39:49.197487180+00:00 stderr F Waiting for 2 resources to start again:
    2020-08-11T13:39:49.197487180+00:00 stderr F * galera-bundle
    2020-08-11T13:39:49.197487180+00:00 stderr F * rabbitmq-bundle
    2020-08-11T13:39:49.197487180+00:00 stderr F Could not complete restart of galera-bundle, 1 resources remaining
    2020-08-11T13:39:49.197487180+00:00 stderr F * rabbitmq-bundle
    2020-08-11T13:39:49.197487180+00:00 stderr F

    After discussing it with kgaillot it seems that concurrent restarts in pcmk are just brittle:
    """
    Sadly restarts are brittle, and they do in fact assume that nothing else is causing resources to start or stop. They work like this:

    - Get the current configuration and state of the cluster, including a list of active resources (list #1)
    - Set resource target-role to Stopped
    - Get the current configuration and state of the cluster, including a list of which resources *should* be active (list #2)
    - Compare lists #1 and #2, and the difference is the resources that should stop
    - Periodically refresh the configuration and state until the list of active resources matches list #2
    - Delete the target-role
    - Periodically refresh the configuration and state until the list of active resources matches list #1
    """

    So the suggestion is to replace the restarts with an enable/disable cycle of the resource.


Changed in tripleo:
status: In Progress → Fix Released
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/ussuri)

Reviewed: https://review.opendev.org/746936
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=9ee9b945f85a2d6e0e13015f9f32b0db7fb3b883
Submitter: Zuul
Branch: stable/ussuri

commit 9ee9b945f85a2d6e0e13015f9f32b0db7fb3b883
Author: Michele Baldessari <email address hidden>
Date: Tue Aug 18 10:29:19 2020 +0200

    Fix pcs restart in composable HA

    When a redeploy command is being run in a composable HA environment, if there
    are any configuration changes, the <bundle>_restart containers will be kicked
    off. These restart containers will then try and restart the bundles globally in
    the cluster.

    These restarts will be fired off in parallel from different nodes. So
    haproxy-bundle will be restarted from controller-0, mysql-bundle from
    database-0, rabbitmq-bundle from messaging-0.

    This has proven to be problematic and very often (rhbz#1868113) it would fail
    the redeploy with:
    2020-08-11T13:40:25.996896822+00:00 stderr F Error: Could not complete shutdown of rabbitmq-bundle, 1 resources remaining
    2020-08-11T13:40:25.996896822+00:00 stderr F Error performing operation: Timer expired
    2020-08-11T13:40:25.996896822+00:00 stderr F Set 'rabbitmq-bundle' option: id=rabbitmq-bundle-meta_attributes-target-role set=rabbitmq-bundle-meta_attributes name=target-role value=stopped
    2020-08-11T13:40:25.996896822+00:00 stderr F Waiting for 2 resources to stop:
    2020-08-11T13:40:25.996896822+00:00 stderr F * galera-bundle
    2020-08-11T13:40:25.996896822+00:00 stderr F * rabbitmq-bundle
    2020-08-11T13:40:25.996896822+00:00 stderr F * galera-bundle
    2020-08-11T13:40:25.996896822+00:00 stderr F Deleted 'rabbitmq-bundle' option: id=rabbitmq-bundle-meta_attributes-target-role name=target-role
    2020-08-11T13:40:25.996896822+00:00 stderr F

    or

    2020-08-11T13:39:49.197487180+00:00 stderr F Waiting for 2 resources to start again:
    2020-08-11T13:39:49.197487180+00:00 stderr F * galera-bundle
    2020-08-11T13:39:49.197487180+00:00 stderr F * rabbitmq-bundle
    2020-08-11T13:39:49.197487180+00:00 stderr F Could not complete restart of galera-bundle, 1 resources remaining
    2020-08-11T13:39:49.197487180+00:00 stderr F * rabbitmq-bundle
    2020-08-11T13:39:49.197487180+00:00 stderr F

    After discussing it with kgaillot it seems that concurrent restarts in pcmk are just brittle:
    """
    Sadly restarts are brittle, and they do in fact assume that nothing else is causing resources to start or stop. They work like this:

    - Get the current configuration and state of the cluster, including a list of active resources (list #1)
    - Set resource target-role to Stopped
    - Get the current configuration and state of the cluster, including a list of which resources *should* be active (list #2)
    - Compare lists #1 and #2, and the difference is the resources that should stop
    - Periodically refresh the configuration and state until the list of active resources matches list #2
    - Delete the target-role
    - Periodically refresh the configuration and state until the list of active resources matches list #1
    """

    So the suggestion is to replace the restarts with an enable/disable cycle of the resource.


tags: added: in-stable-ussuri
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/train)

Reviewed: https://review.opendev.org/746662
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=1fdfa3332321d9dd1246c7d13c77d93c10f7c3b3
Submitter: Zuul
Branch: stable/train

commit 1fdfa3332321d9dd1246c7d13c77d93c10f7c3b3
Author: Michele Baldessari <email address hidden>
Date: Tue Aug 18 10:29:19 2020 +0200

    Fix pcs restart in composable HA

    When a redeploy command is being run in a composable HA environment, if there
    are any configuration changes, the <bundle>_restart containers will be kicked
    off. These restart containers will then try and restart the bundles globally in
    the cluster.

    These restarts will be fired off in parallel from different nodes. So
    haproxy-bundle will be restarted from controller-0, mysql-bundle from
    database-0, rabbitmq-bundle from messaging-0.

    This has proven to be problematic and very often (rhbz#1868113) it would fail
    the redeploy with:
    2020-08-11T13:40:25.996896822+00:00 stderr F Error: Could not complete shutdown of rabbitmq-bundle, 1 resources remaining
    2020-08-11T13:40:25.996896822+00:00 stderr F Error performing operation: Timer expired
    2020-08-11T13:40:25.996896822+00:00 stderr F Set 'rabbitmq-bundle' option: id=rabbitmq-bundle-meta_attributes-target-role set=rabbitmq-bundle-meta_attributes name=target-role value=stopped
    2020-08-11T13:40:25.996896822+00:00 stderr F Waiting for 2 resources to stop:
    2020-08-11T13:40:25.996896822+00:00 stderr F * galera-bundle
    2020-08-11T13:40:25.996896822+00:00 stderr F * rabbitmq-bundle
    2020-08-11T13:40:25.996896822+00:00 stderr F * galera-bundle
    2020-08-11T13:40:25.996896822+00:00 stderr F Deleted 'rabbitmq-bundle' option: id=rabbitmq-bundle-meta_attributes-target-role name=target-role
    2020-08-11T13:40:25.996896822+00:00 stderr F

    or

    2020-08-11T13:39:49.197487180+00:00 stderr F Waiting for 2 resources to start again:
    2020-08-11T13:39:49.197487180+00:00 stderr F * galera-bundle
    2020-08-11T13:39:49.197487180+00:00 stderr F * rabbitmq-bundle
    2020-08-11T13:39:49.197487180+00:00 stderr F Could not complete restart of galera-bundle, 1 resources remaining
    2020-08-11T13:39:49.197487180+00:00 stderr F * rabbitmq-bundle
    2020-08-11T13:39:49.197487180+00:00 stderr F

    After discussing it with kgaillot it seems that concurrent restarts in pcmk are just brittle:
    """
    Sadly restarts are brittle, and they do in fact assume that nothing else is causing resources to start or stop. They work like this:

    - Get the current configuration and state of the cluster, including a list of active resources (list #1)
    - Set resource target-role to Stopped
    - Get the current configuration and state of the cluster, including a list of which resources *should* be active (list #2)
    - Compare lists #1 and #2, and the difference is the resources that should stop
    - Periodically refresh the configuration and state until the list of active resources matches list #2
    - Delete the target-role
    - Periodically refresh the configuration and state until the list of active resources matches list #1
    """

    So the suggestion is to replace the restarts with an enable/disable cycle of the resource.


tags: added: in-stable-train
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to tripleo-heat-templates (stable/queens)

Reviewed: https://review.opendev.org/746665
Committed: https://git.openstack.org/cgit/openstack/tripleo-heat-templates/commit/?id=a7e9730a8566cf57713a00bfc706596847bb5fbf
Submitter: Zuul
Branch: stable/queens

commit a7e9730a8566cf57713a00bfc706596847bb5fbf
Author: Michele Baldessari <email address hidden>
Date: Tue Aug 18 10:55:29 2020 +0200

    Fix pcs restart in composable HA

    When a redeploy command is being run in a composable HA environment, if there
    are any configuration changes, the <bundle>_restart containers will be kicked
    off. These restart containers will then try and restart the bundles globally in
    the cluster.

    These restarts will be fired off in parallel from different nodes. So
    haproxy-bundle will be restarted from controller-0, mysql-bundle from
    database-0, rabbitmq-bundle from messaging-0.

    This has proven to be problematic and very often (rhbz#1868113) it would fail
    the redeploy with:
    2020-08-11T13:40:25.996896822+00:00 stderr F Error: Could not complete shutdown of rabbitmq-bundle, 1 resources remaining
    2020-08-11T13:40:25.996896822+00:00 stderr F Error performing operation: Timer expired
    2020-08-11T13:40:25.996896822+00:00 stderr F Set 'rabbitmq-bundle' option: id=rabbitmq-bundle-meta_attributes-target-role set=rabbitmq-bundle-meta_attributes name=target-role value=stopped
    2020-08-11T13:40:25.996896822+00:00 stderr F Waiting for 2 resources to stop:
    2020-08-11T13:40:25.996896822+00:00 stderr F * galera-bundle
    2020-08-11T13:40:25.996896822+00:00 stderr F * rabbitmq-bundle
    2020-08-11T13:40:25.996896822+00:00 stderr F * galera-bundle
    2020-08-11T13:40:25.996896822+00:00 stderr F Deleted 'rabbitmq-bundle' option: id=rabbitmq-bundle-meta_attributes-target-role name=target-role
    2020-08-11T13:40:25.996896822+00:00 stderr F

    or

    2020-08-11T13:39:49.197487180+00:00 stderr F Waiting for 2 resources to start again:
    2020-08-11T13:39:49.197487180+00:00 stderr F * galera-bundle
    2020-08-11T13:39:49.197487180+00:00 stderr F * rabbitmq-bundle
    2020-08-11T13:39:49.197487180+00:00 stderr F Could not complete restart of galera-bundle, 1 resources remaining
    2020-08-11T13:39:49.197487180+00:00 stderr F * rabbitmq-bundle
    2020-08-11T13:39:49.197487180+00:00 stderr F

    After discussing it with kgaillot it seems that concurrent restarts in pcmk are just brittle:
    """
    Sadly restarts are brittle, and they do in fact assume that nothing else is causing resources to start or stop. They work like this:

    - Get the current configuration and state of the cluster, including a list of active resources (list #1)
    - Set resource target-role to Stopped
    - Get the current configuration and state of the cluster, including a list of which resources *should* be active (list #2)
    - Compare lists #1 and #2, and the difference is the resources that should stop
    - Periodically refresh the configuration and state until the list of active resources matches list #2
    - Delete the target-role
    - Periodically refresh the configuration and state until the list of active resources matches list #1
    """

    So the suggestion is to replace the restarts with an enable/disable cycle of the resource.


tags: added: in-stable-queens
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates 11.4.0

This issue was fixed in the openstack/tripleo-heat-templates 11.4.0 release.

Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/tripleo-heat-templates queens-eol

This issue was fixed in the openstack/tripleo-heat-templates queens-eol release.
