Comment 9 for bug 1855474

yong hu (yhu6) wrote :

The root cause has been dug out for why the mariadb-server cluster failed to recover after the active controller was brutally (forcibly) rebooted.

Basically, there are 2 mariadb-server instances in the mariadb cluster, and, as defined by the "readinessProbe", they periodically cross-check each other's status. In the failing case, whenever one StarlingX controller (on which one mariadb-server is running) is rebooted, the mariadb-server on the other controller fails to sync with the destroyed instance, and that readinessProbe failure in turn makes its own pod fail. So it is essentially a deadlock, and eventually neither of the 2 mariadb-servers comes back to life.
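
For reference, the probe on the mariadb-server container has roughly the following shape (a minimal sketch written as a Python dict in the form of the Kubernetes probe spec; the container name, script path and timings are assumptions for illustration, not the exact chart values):

```python
# Sketch of a mariadb-server container with its readinessProbe, as the chart
# would roughly render it. Script path, timings and container name are assumed.
mariadb_server_container = {
    "name": "mariadb",                      # assumed container name
    "readinessProbe": {
        "exec": {
            # Assumed helper script: checks local mysqld health and that the
            # node is synced with at least one other cluster member.
            "command": ["/tmp/readiness.sh"],
        },
        "initialDelaySeconds": 30,
        "periodSeconds": 30,                # cross-check interval (assumed)
    },
}
```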

This issue does not come from openstack-helm/mariadb upstream, which actually runs 3 instances. With 3 mariadb-server instances, the failure/death of one instance will not crash the cluster, because there are still 2 live instances left to cross-check with each other.
In StarlingX, we override the replica count from 3 to 2 so that the 2 mariadb-servers can be placed on the 2 controllers respectively. However, this change (having only 2 instances) introduces the deadlock described above.
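
The StarlingX-side override is conceptually just a replica-count change in the Helm values; a sketch of its shape is below (the exact key path "pod.replicas.server" is an assumption about the chart's values layout):

```python
# Sketch of the StarlingX value override that reduces the mariadb cluster from
# the upstream default of 3 replicas to 2 (one per controller). The key path
# pod.replicas.server is an assumption about the openstack-helm mariadb chart.
mariadb_overrides = {
    "pod": {
        "replicas": {
            "server": 2,   # upstream default is 3
        },
    },
}
```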

The solution is to disable the "readinessProbe", following the practice we took for Nova.
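
As an illustration only (the real change would go through the chart/value overrides, and the StatefulSet name, namespace and container name below are assumptions), the effect of disabling the probe is equivalent to the following patch with the kubernetes Python client:

```python
# Minimal sketch: remove the readinessProbe from the mariadb-server pods.
# Strategic-merge-patch semantics: setting a field to null removes it, and
# entries in the containers list are merged by "name".
from kubernetes import client, config

config.load_kube_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {"name": "mariadb", "readinessProbe": None},
                ],
            },
        },
    },
}

apps.patch_namespaced_stateful_set(
    name="mariadb-server",      # assumed StatefulSet name
    namespace="openstack",      # assumed namespace
    body=patch,
)
```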