I reviewed the sosreports and provide some general analysis below.

[sosreport-juju-machine-2-lxc-1-2020-11-10-tayyude]

I don't see any sign in this log of package upgrades or VIP stop/starts, so I suspect this host may be unrelated.

[sosreport-juju-caae6f-19-lxd-6-20201110230352.tar.xz]

This is a charm-keystone node. Looking at this sosreport, my general finding is that everything worked correctly on this specific host.

unattended-upgrades.log: we can see the upgrade starts at 2020-11-10 06:17:03 and finishes at 2020-11-10 06:17:48.

syslog.1:
Nov 10 06:17:41 juju-caae6f-19-lxd-6 crmd[41203]: notice: Result of probe operation for res_ks_680cfdf_vip on juju-caae6f-19-lxd-6: 7 (not running)
Nov 10 06:19:44 juju-caae6f-19-lxd-6 crmd[41203]: notice: Result of start operation for res_ks_680cfdf_vip on juju-caae6f-19-lxd-6: 0 (ok)

We also see that the VIP moved around to different hosts a few times, likely as a result of each host upgrading in succession, which makes sense. I don't see any sign in this log of the mentioned lrmd issue.

[mysql issue]

What we do see, however, are "Too many connections" errors from MySQL in the keystone logs. This generally happens because when the VIP moves from one host to another, all of the old connections are left behind and go stale: once the VIP is removed, traffic for those connections is sent to the new VIP owner, which doesn't have those TCP connections, so the old host never sees the TCP reset when the remote end sends it. The stale connections sit there until wait_timeout is reached (typically either 180s/3 min or 3600s/1 hour in our deployments).

The problem occurs when the VIP fails *back* to a host it already failed away from: many of the connection slots are still held by the stale connections, and you run out of connections if your max_connections limit is not at least double your normal connection count. This will eventually resolve itself once the stale connections time out, but that may take up to an hour.

Note that this sosreport is from a keystone node that *also* has charm-hacluster/corosync/pacemaker, but the mysql issue discussed above would have occurred on the percona mysql nodes. To analyse the number of failovers we would need sosreports from the mysql node(s).

[summary]

From what has been described so far, I think we likely have 2 potential issues here. First, the networkd issue is likely not related to this specific case, as that happens specifically when systemd is upgraded and networkd is therefore restarted, which shouldn't have happened here.

(Issue 1) We hit max_connections due to the multiple successive MySQL VIP failovers, because max_connections is not at least 2x the steady-state connection count. It also seems possible that in some cases the VIP may shift back to the same host a 3rd time by chance, in which case you may need 3x. We could potentially improve this by modifying the pacemaker resource agent to kill active connections when the VIP departs, or by ensuring that max_connections is 2-3x the steady-state active connection count. That should go into a new bug, likely against charm-percona-cluster, as it ships its own resource agent. We could also potentially add a configurable nagios check for active connections in excess of 50% of max_connections (a rough sketch of such a check is included at the end of this comment).

(Issue 2) It was described that pacemaker got into a bad state during the restart: the lrmd didn't exit and didn't work correctly until it was manually killed and restarted.
I think we need to get more logs/sosreports from the nodes that hit that specific issue; it sounds like it may be a bug specific to a certain scenario, or perhaps to the older xenial version [the USN-4623-1 update happened for all LTS releases, 16.04/18.04/20.04].
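
To make the Issue 1 suggestion slightly more concrete, below is a minimal sketch of what a nagios-style connection-headroom check could look like. It is only illustrative: the pymysql dependency, the localhost/nagios credentials and the warning/critical thresholds are assumptions, and a real check would presumably be rendered and made configurable by charm-percona-cluster.

#!/usr/bin/env python3
# Sketch of a nagios-style check: warn when MySQL connection usage
# approaches max_connections. Host, credentials and thresholds below
# are placeholders, not anything the charm ships today.
import sys

import pymysql  # assumption: python3-pymysql is available on the unit

WARN_PCT = 50  # warn at 50% of max_connections, per the suggestion above
CRIT_PCT = 80  # critical threshold chosen arbitrarily for illustration


def main():
    try:
        conn = pymysql.connect(host="localhost", user="nagios", password="secret")
        with conn.cursor() as cur:
            cur.execute("SHOW GLOBAL VARIABLES LIKE 'max_connections'")
            max_conn = int(cur.fetchone()[1])
            cur.execute("SHOW GLOBAL STATUS LIKE 'Threads_connected'")
            threads = int(cur.fetchone()[1])
        conn.close()
    except Exception as exc:
        print("UNKNOWN: could not query MySQL: %s" % exc)
        sys.exit(3)

    pct = 100.0 * threads / max_conn
    msg = "%d/%d connections in use (%.0f%%)" % (threads, max_conn, pct)
    if pct >= CRIT_PCT:
        print("CRITICAL: " + msg)
        sys.exit(2)
    if pct >= WARN_PCT:
        print("WARNING: " + msg)
        sys.exit(1)
    print("OK: " + msg)
    sys.exit(0)


if __name__ == "__main__":
    main()

The 50% warning threshold matches the figure mentioned in the summary; the critical threshold is arbitrary and both would need to be configurable if this were added to the charm.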