Rebooting an HA controller node can take around 20 minutes
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
tripleo | Fix Released | Critical | Michele Baldessari |
Bug Description
When issuing a normal reboot command on an overcloud node, the following stop sequence can take place:
------------- -------
| Pacemaker | | paunch-
------------- -------
\ /
\ /
\ /
| docker |
We currently have two issues:
A) paunch-
We want this for two reasons:
A.1) We want the normal openstack services to be able to stop before the DB/rabbit, so that they can correctly run any shutdown actions that need to store state.
A.2) paunch-
B) If there are docker plugins that are allowed to stop before both docker and pacemaker, they may go down while pacemaker is still stopping, which can cause a series of timeouts and a failure to stop containers:
Sep 13 17:53:00.821030 controller-
Sep 13 17:54:15.798026 controller-
Sep 13 17:54:15.799004 controller-
One of these plugins is 'rhel-push-
So B) is the root cause for the timeout, but it is quite desirable to fix A) as well.
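To illustrate the ordering concern in A), here is a minimal sketch in Python. It relies on one known systemd behaviour: units with an ordering dependency between them are stopped in the reverse of their start order. Everything else is an assumption made for illustration: the labels `docker`, `pacemaker` and `paunch_containers` are hypothetical placeholders rather than real unit names, and the `CURRENT` and `FIXED` graphs are only a reading of the diagram above, not the actual unit files or the actual fix.

```python
from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical ordering graphs: each key is started after the units in its set
# (the role played by After= in a systemd unit). Labels are placeholders.
CURRENT = {
    "docker": set(),
    "pacemaker": {"docker"},
    "paunch_containers": {"docker"},  # no ordering relative to pacemaker
}
FIXED = {
    "docker": set(),
    "pacemaker": {"docker"},
    "paunch_containers": {"docker", "pacemaker"},  # assumed extra ordering
}


def transitive_after(after, unit):
    """Return every unit that `unit` is ordered after, directly or indirectly."""
    seen = set()
    stack = list(after.get(unit, ()))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(after.get(dep, ()))
    return seen


def stops_before(after, first, second):
    """True if `first` is guaranteed to stop before `second`.

    Shutdown reverses start ordering, so this holds exactly when
    `first` is (transitively) started after `second`.
    """
    return second in transitive_after(after, first)


def stop_order(after):
    """Return one valid shutdown order: a valid start order, reversed."""
    return list(reversed(list(TopologicalSorter(after).static_order())))


if __name__ == "__main__":
    print(stops_before(CURRENT, "paunch_containers", "pacemaker"))
    print(stops_before(FIXED, "paunch_containers", "pacemaker"))
    print(stop_order(FIXED))
```

Under these assumptions, the `CURRENT` graph gives no guarantee that the paunch-managed containers stop before pacemaker, because the two are only ordered relative to docker. Adding the extra ordering in `FIXED` makes the guarantee hold.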
Changed in tripleo:
milestone: rocky-rc2 → stein-1
A) is fixed via https://review.rdoproject.org/r/16279