concurrent restart bundles fail in composable HA
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
tripleo |
Fix Released
|
High
|
Michele Baldessari |
Bug Description
When a redeploy command is being run in a composable HA environment, if there are any configuration changes, the <bundle>_restart containers will be kicked off. These restart containers will then try and restart the bundles globally in the cluster.
These restarts will be fired off in parallel from different nodes. So haproxy-bundle will be restarted from controller-0, mysql-bundle from database-0, rabbitmq-bundle from messaging-0.
This has proven to be problematic and very often (rhbz#1868113) it would fail the redeploy with:
2020-08-
2020-08-
2020-08-
2020-08-
2020-08-
2020-08-
2020-08-
2020-08-
2020-08-
or
2020-08-
2020-08-
2020-08-
2020-08-
2020-08-
2020-08-
After discussing it with kgaillot it seems that concurrent restarts in pcmk are just brittle:
"""
Sadly restarts are brittle, and they do in fact assume that nothing else is causing resources to start or stop. They work like this:
- Get the current configuration and state of the cluster, including a list of active resources (list #1)
- Set resource target-role to Stopped
- Get the current configuration and state of the cluster, including a list of which resources *should* be active (list #2)
- Compare lists #1 and #2, and the difference is the resources that should stop
- Periodically refresh the configuration and state until the list of active resources matches list #2
- Delete the target-role
- Periodically refresh the configuration and state until the list of active resources matches list #1
"""
So the suggestion is to replace the restarts with an enable/disable cycle of the resource.
Changed in tripleo: | |
status: | Triaged → In Progress |
https:/ /review. opendev. org/746660