With regard to Billy's Comment #18: my analysis of that bionic sosreport is in Comment #8. I found that the node in that specific sosreport didn't experience this issue itself; most likely it was suffering from the issue occurring on the MySQL nodes it was connected to, and the service couldn't connect to MySQL as a result. We'd need the full logs (sosreport --all-logs) from all related keystone nodes and mysql nodes in the environment to be sure, but I am 95% sure that is the case there.
I think there is some argument to be made for improving the package restart process for the pacemaker package itself. However, based on the logs here and in a couple of environments I analysed, I am finding that the primary problem is specifically related to the reduced StopTimeout set by charm-hacluster. So I think we should focus on that issue here. If we decide it makes sense to improve the pacemaker package restart process itself, that should be opened as a separate bug, as I haven't seen any evidence of that issue in the logs here so far.
For anyone else experiencing this bug, please take a *full* copy of /var/log (or sosreport --all-logs) from -all- nodes in that specific pacemaker cluster and upload them, and I am happy to analyse them. If you need a non-public location to share the files, feel free to e-mail them to me. It would also be great to receive these from any nodes that have already recovered, so we can ensure we fully understand all the cases that happened.
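As a rough sketch of the /var/log option (the sosreport route is simply `sosreport --all-logs`), something like the following could be run as root on each node in the cluster. The function name collect_logs, the default output path under /tmp and the archive naming are my own assumptions for illustration; adjust them to suit your environment.

```shell
# Hypothetical helper: archive a log directory into a single tarball.
# $1 = directory to archive (defaults to /var/log)
# $2 = output file (defaults to a hostname- and timestamp-based name in /tmp)
collect_logs() {
    src="${1:-/var/log}"
    out="${2:-/tmp/$(hostname)-logs-$(date +%Y%m%d%H%M%S).tar.gz}"

    # -C changes into the parent directory so the archive contains
    # relative paths (e.g. log/syslog) rather than absolute ones.
    tar -czf "$out" -C "$(dirname "$src")" "$(basename "$src")"

    # Print the archive path so it can be picked up for upload.
    echo "$out"
}
```

Calling `collect_logs` with no arguments archives /var/log and prints the path of the resulting tarball. On a busy node GNU tar may warn that a file changed while it was being read and exit with status 1; the archive is normally still usable for analysis.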