OpenStack HA Cluster Charm

Comment 19 for bug 1903745

Trent Lloyd (lathiat) wrote on 2020-11-17:

With regard to Billy's Comment #18, my analysis of that bionic sosreport is in Comment #8. I found that the node in that specific sosreport didn't experience this issue itself; most likely the issue was occurring on the MySQL nodes it was connected to, and the service couldn't connect to MySQL as a result. We'd need the full logs (sosreport --all-logs) from all related keystone nodes and mysql nodes in the environment to be sure, but I am 95% sure that is the case there.

I think there is some argument to be made for improving the package restart process for the pacemaker package itself. However, based on the logs here and in a couple of environments I analysed, I am finding that the primary problem is specifically related to the reduced StopTimeout set by charm-hacluster. So I think we should focus on that issue here. If we decide it makes sense to improve the pacemaker package process itself, that should be opened as a separate bug, as I haven't seen any evidence of that issue in the logs here so far.

For anyone else experiencing this bug, please take a *full* copy of /var/log (or sosreport --all-logs) from -all- nodes in that specific pacemaker cluster and upload them, and I am happy to analyse them. If you need a non-public location to share the files, feel free to e-mail them to me. It would also be great to receive the same from any nodes that have already recovered, so we can ensure we fully understand all the cases that happened.
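
If sosreport isn't convenient, a full copy of /var/log could be gathered on each node with something like the sketch below. This is only an illustration: collect_logs is a hypothetical helper name (not part of any package), the /tmp output location is an assumption, and the exit-status handling assumes GNU tar.

```shell
#!/bin/sh
# Sketch: archive a full copy of a log directory on one cluster node.
# collect_logs is a hypothetical helper, not part of sosreport or the charm.
collect_logs() {
    src_dir="${1:-/var/log}"   # directory to copy (default: /var/log)
    out_dir="${2:-/tmp}"       # where to write the archive (assumed location)
    archive="${out_dir}/$(uname -n)-logs-$(date +%Y%m%d-%H%M%S).tar.gz"

    # -C stores paths relative to the parent of the log directory.
    tar -czf "$archive" -C "$(dirname "$src_dir")" "$(basename "$src_dir")"
    status=$?

    # GNU tar exits 1 when a file changed while being read, which is
    # common for live logs; only treat higher exit codes as fatal.
    [ "$status" -le 1 ] || return "$status"

    # Print the archive path so it can be uploaded or e-mailed.
    echo "$archive"
}
```

It would need to be run as root on every node in the pacemaker cluster, e.g. `collect_logs /var/log /tmp`, so that rotated and root-only log files are included. The node name and a timestamp go into the file name so archives from different nodes don't overwrite each other.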