upgrade-charm can cause full cluster outage
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
MySQL InnoDB Cluster Charm | Triaged | High | Unassigned |
Bug Description
I've observed that in some cases when running a juju refresh on charm-mysql-
As the logs show, the units run the upgrade-charm hook at practically the same time:
unit/0(leader)
https:/
unit/1
https:/
unit/2
https:/
Note how the journalctl logs line up with the times at which the units installed the packages and then restarted.
juju run -a mysql-innodb-
- Stdout: |
-- Logs begin at Thu 2023-10-26 03:01:35 UTC, end at Tue 2023-10-31 23:06:21 UTC. --
Oct 31 11:08:56 juju-f293f7-0-lxd-6 systemd[1]: Stopping MySQL Community Server...
Oct 31 11:10:03 juju-f293f7-0-lxd-6 systemd[1]: mysql.service: Succeeded.
Oct 31 11:10:03 juju-f293f7-0-lxd-6 systemd[1]: Stopped MySQL Community Server.
Oct 31 11:10:03 juju-f293f7-0-lxd-6 systemd[1]: Starting MySQL Community Server...
Oct 31 11:10:05 juju-f293f7-0-lxd-6 systemd[1]: Started MySQL Community Server.
UnitId: mysql-innodb-
- Stdout: |
-- Logs begin at Thu 2023-10-26 03:02:59 UTC, end at Tue 2023-10-31 23:06:18 UTC. --
Oct 31 11:08:47 juju-f293f7-1-lxd-7 systemd[1]: Stopping MySQL Community Server...
Oct 31 11:08:56 juju-f293f7-1-lxd-7 systemd[1]: mysql.service: Succeeded.
Oct 31 11:08:56 juju-f293f7-1-lxd-7 systemd[1]: Stopped MySQL Community Server.
Oct 31 11:08:56 juju-f293f7-1-lxd-7 systemd[1]: Starting MySQL Community Server...
Oct 31 11:08:58 juju-f293f7-1-lxd-7 systemd[1]: Started MySQL Community Server.
UnitId: mysql-innodb-
- Stdout: |
-- Logs begin at Thu 2023-10-26 03:05:30 UTC, end at Tue 2023-10-31 23:06:24 UTC. --
Oct 31 11:08:45 juju-f293f7-2-lxd-7 systemd[1]: Stopping MySQL Community Server...
Oct 31 11:08:54 juju-f293f7-2-lxd-7 systemd[1]: mysql.service: Succeeded.
Oct 31 11:08:54 juju-f293f7-2-lxd-7 systemd[1]: Stopped MySQL Community Server.
Oct 31 11:08:54 juju-f293f7-2-lxd-7 systemd[1]: Starting MySQL Community Server...
Oct 31 11:08:55 juju-f293f7-2-lxd-7 systemd[1]: Started MySQL Community Server.
UnitId: mysql-innodb-
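To make the overlap in the journal output above easier to see, here is a minimal sketch that counts how many units had mysql stopped at the same moment. The timestamps are copied from the logs above; the function names (`max_units_down`, `quorum_lost`) are made up for illustration, and the sketch assumes a unit counts as "down" from `Stopping` until `Started`. In practice a unit likely needs extra time after `Started` to rejoin the cluster, so this is an optimistic estimate.

```python
from datetime import datetime

# (stop began, service started again) per unit, copied from the journalctl output.
WINDOWS = {
    "juju-f293f7-0-lxd-6": ("11:08:56", "11:10:05"),
    "juju-f293f7-1-lxd-7": ("11:08:47", "11:08:58"),
    "juju-f293f7-2-lxd-7": ("11:08:45", "11:08:55"),
}


def parse(ts):
    """Parse an HH:MM:SS timestamp (all events are on the same day)."""
    return datetime.strptime(ts, "%H:%M:%S")


def max_units_down(windows):
    """Sweep over stop/start events and return the peak number of units down at once."""
    events = []
    for start, end in windows.values():
        events.append((parse(start), 1))   # unit goes down
        events.append((parse(end), -1))    # unit comes back
    # Sort by time; at equal times a unit coming back (-1) sorts before one going down (+1).
    events.sort(key=lambda e: (e[0], e[1]))
    down = peak = 0
    for _, delta in events:
        down += delta
        peak = max(peak, down)
    return peak


def quorum_lost(cluster_size, peak_down):
    """A majority of members must stay up; return True if that was violated."""
    return (cluster_size - peak_down) <= cluster_size // 2


peak = max_units_down(WINDOWS)
print(peak, quorum_lost(len(WINDOWS), peak))
```

With these windows the sweep reports a peak of two units down out of three, which leaves no majority of members running.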
I don't think it's easy to reproduce this bug. I've tried a couple of times, and the leader unit runs upgrade-charm approximately two minutes later than the non-leader units.
This one definitely fell into the 'unseen consequences' bucket. The way refresh/upgrade-charm is implemented means that all units receive it essentially simultaneously, so it would have to be up to the charm code to actually stagger restarts.
There are a few approaches that could be used, but the charm *already* has coordinated restarts! The `coordinator` module in charm-helpers was cargo-culted into the charm, and is used to stagger restarts when doing TLS certificate changes, etc.
What would be needed is to wire the restart for upgrade-charm into that system; e.g. from line 2025 onwards in src/lib/charm/openstack/mysql_innodb_cluster.py we have:
This is gated with `@when` so that the restarts of the services can be staggered.
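As a rough illustration of the idea (not the charm's actual code, and not the charm-helpers `coordinator` API), the sketch below simulates units that all receive upgrade-charm at once but only restart while holding a lock that is granted to one unit at a time. `SerialRestartLock`, `upgrade_all` and the `restart` callback are hypothetical names invented for this example.

```python
from collections import deque


class SerialRestartLock:
    """Hypothetical stand-in for a coordinator: grants the lock to one unit at a time, FIFO."""

    def __init__(self):
        self._queue = deque()

    def request(self, unit):
        # Queue the unit once; repeated requests are ignored.
        if unit not in self._queue:
            self._queue.append(unit)

    def granted(self, unit):
        # Only the unit at the head of the queue holds the lock.
        return bool(self._queue) and self._queue[0] == unit

    def release(self, unit):
        if self.granted(unit):
            self._queue.popleft()


def upgrade_all(units, restart):
    """Simulate every unit getting upgrade-charm together, restarting one by one."""
    lock = SerialRestartLock()
    for unit in units:
        lock.request(unit)  # all units ask for the restart lock up front

    order = []
    pending = set(units)
    while pending:
        for unit in units:
            # A unit restarts only while it holds the lock, then hands it on.
            if unit in pending and lock.granted(unit):
                restart(unit)
                order.append(unit)
                lock.release(unit)
                pending.discard(unit)
    return order


restarted = []
order = upgrade_all(["unit/0", "unit/1", "unit/2"], restarted.append)
print(order)
```

In a real charm the grant would have to be arbitrated across units (for example by the leader) and re-checked on later hook runs, rather than held in a single in-process queue as it is here.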