upgrade-charm can cause full cluster outage

Bug #2042368 reported by Gabriel Cocenza
This bug affects 5 people
Affects                     Status   Importance  Assigned to  Milestone
MySQL InnoDB Cluster Charm  Triaged  High        Unassigned

Bug Description

I've observed that, in some cases, running juju refresh on charm-mysql-innodb-cluster can cause a cluster outage. If I understand correctly, the function custom_upgrade_charm restarts the mysql service [0] when it needs to remove obsolete packages. This step must not run on all units at the same time; otherwise the effect is the same as a full-down event such as losing power.

As the logs show, the units ran upgrade-charm at practically the same time:

unit/0(leader)
https://pastebin.canonical.com/p/mMV8WprS7q/

unit/1
https://pastebin.canonical.com/p/XVp5MrcbRv/

unit/2
https://pastebin.canonical.com/p/p4KkP3m5WP/

Note how the journalctl timestamps match the times at which the units installed the packages and then restarted.

juju run -a mysql-innodb-cluster -- journalctl -n20 -u mysql
- Stdout: |
    -- Logs begin at Thu 2023-10-26 03:01:35 UTC, end at Tue 2023-10-31 23:06:21 UTC. --
    Oct 31 11:08:56 juju-f293f7-0-lxd-6 systemd[1]: Stopping MySQL Community Server...
    Oct 31 11:10:03 juju-f293f7-0-lxd-6 systemd[1]: mysql.service: Succeeded.
    Oct 31 11:10:03 juju-f293f7-0-lxd-6 systemd[1]: Stopped MySQL Community Server.
    Oct 31 11:10:03 juju-f293f7-0-lxd-6 systemd[1]: Starting MySQL Community Server...
    Oct 31 11:10:05 juju-f293f7-0-lxd-6 systemd[1]: Started MySQL Community Server.
  UnitId: mysql-innodb-cluster/0
- Stdout: |
    -- Logs begin at Thu 2023-10-26 03:02:59 UTC, end at Tue 2023-10-31 23:06:18 UTC. --
    Oct 31 11:08:47 juju-f293f7-1-lxd-7 systemd[1]: Stopping MySQL Community Server...
    Oct 31 11:08:56 juju-f293f7-1-lxd-7 systemd[1]: mysql.service: Succeeded.
    Oct 31 11:08:56 juju-f293f7-1-lxd-7 systemd[1]: Stopped MySQL Community Server.
    Oct 31 11:08:56 juju-f293f7-1-lxd-7 systemd[1]: Starting MySQL Community Server...
    Oct 31 11:08:58 juju-f293f7-1-lxd-7 systemd[1]: Started MySQL Community Server.
  UnitId: mysql-innodb-cluster/1
- Stdout: |
    -- Logs begin at Thu 2023-10-26 03:05:30 UTC, end at Tue 2023-10-31 23:06:24 UTC. --
    Oct 31 11:08:45 juju-f293f7-2-lxd-7 systemd[1]: Stopping MySQL Community Server...
    Oct 31 11:08:54 juju-f293f7-2-lxd-7 systemd[1]: mysql.service: Succeeded.
    Oct 31 11:08:54 juju-f293f7-2-lxd-7 systemd[1]: Stopped MySQL Community Server.
    Oct 31 11:08:54 juju-f293f7-2-lxd-7 systemd[1]: Starting MySQL Community Server...
    Oct 31 11:08:55 juju-f293f7-2-lxd-7 systemd[1]: Started MySQL Community Server.
  UnitId: mysql-innodb-cluster/2
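To make the overlap in the output above easier to see, here is a small sketch that computes when at least two of the three units were down at once. It treats a unit as "down" from its "Stopping" line to its "Started" line, which is an assumption: a started mysqld has not necessarily rejoined the cluster yet, so the real loss of majority is likely longer than these windows suggest. The helper name intervals_with_min_down is made up for this illustration.

```python
from datetime import datetime

# Stop/start times copied from the journalctl output above (Oct 31, UTC).
# Assumption: "down" means from "Stopping ..." until "Started ...".
WINDOWS = {
    "mysql-innodb-cluster/0": ("11:08:56", "11:10:05"),
    "mysql-innodb-cluster/1": ("11:08:47", "11:08:58"),
    "mysql-innodb-cluster/2": ("11:08:45", "11:08:55"),
}

FMT = "%H:%M:%S"


def intervals_with_min_down(windows, min_down):
    """Return (start, end) spans during which at least min_down units are down."""
    events = []
    for start, end in windows.values():
        events.append((datetime.strptime(start, FMT), 1))   # unit goes down
        events.append((datetime.strptime(end, FMT), -1))    # unit comes back
    # Sort by time; at equal times handle "back up" (-1) before "down" (+1).
    events.sort(key=lambda e: (e[0], e[1]))

    spans = []
    down = 0
    span_start = None
    for when, delta in events:
        before = down
        down += delta
        if before < min_down <= down:
            span_start = when                    # threshold just reached
        elif down < min_down <= before:
            spans.append((span_start.strftime(FMT), when.strftime(FMT)))
    return spans


if __name__ == "__main__":
    print("two or more down:", intervals_with_min_down(WINDOWS, 2))
    print("all three down:  ", intervals_with_min_down(WINDOWS, 3))
```

By this measure two of the three units were down from 11:08:47 to 11:08:55 and again from 11:08:56 to 11:08:58, so a three-member cluster had no majority for most of that period, even though the three restart windows never strictly overlapped all at once.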

I don't think it's easy to reproduce this bug. I've tried a couple of times, and the leader unit ran upgrade-charm approximately two minutes later than the non-leader units.

[0] https://github.com/openstack/charms.openstack/blob/0664de1344479610ab5739b7479081b504e58a32/charms_openstack/charm/core.py#L858-L872

Alex Kavanagh (ajkavanagh) wrote :

This one definitely fell into the 'unseen consequences' bucket. The way refresh/upgrade-charm is implemented means that all units do get it essentially simultaneously; thus it would have to be up to charm code to actually stagger restarts.

There are a few approaches that could be used, but the charm *already* has coordinated restarts! The `coordinator` module in charm-helpers was cargo-culted into the charm, and is used to stagger restarts when doing tls certificate changes, etc.

What would be needed is to wire in the restart for the upgrade-charm into that system; e.g. in lines 2025 onwards, in src/lib/charm/openstack/mysql_innodb_cluster.py we have:

                    ch_core.hookenv.log(
                        "Acquiring config-changed-restart lock for TLS change",
                        "DEBUG")
                    coordinator.acquire('config-changed-restart')

This lock is then `@when`-ed on in a handler, which is what allows the restarts of the services to be staggered.
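As a rough illustration of the idea (not the charm's code and not the charm-helpers `coordinator` API), the sketch below simulates a serial lock: every unit asks for the lock when upgrade-charm fires, but only the current holder restarts, and the lock then passes to the next unit in the queue. The names ToySerialLock and staggered_restart are hypothetical, and real units would wait across separate hook invocations rather than in one loop.

```python
from collections import deque


class ToySerialLock:
    """Hypothetical stand-in for a coordinator: one holder at a time."""

    def __init__(self):
        self._queue = deque()   # units waiting, in request order
        self._holder = None

    def acquire(self, unit):
        """Request the lock; return True only if this unit holds it now."""
        if self._holder is None and not self._queue:
            self._holder = unit
        elif unit != self._holder and unit not in self._queue:
            self._queue.append(unit)
        return self._holder == unit

    def release(self, unit):
        """Release the lock and hand it to the next waiting unit, if any."""
        if self._holder == unit:
            self._holder = self._queue.popleft() if self._queue else None


def staggered_restart(units, restart):
    """Restart units one at a time, in the order the lock is granted."""
    lock = ToySerialLock()
    for unit in units:              # all units get upgrade-charm at once
        lock.acquire(unit)

    order = []
    pending = set(units)
    while pending:
        for unit in units:
            # Only the unit that currently holds the lock may restart.
            if unit in pending and lock.acquire(unit):
                restart(unit)
                order.append(unit)
                pending.discard(unit)
                lock.release(unit)
    return order


if __name__ == "__main__":
    units = ["mysql-innodb-cluster/0", "mysql-innodb-cluster/1",
             "mysql-innodb-cluster/2"]
    staggered_restart(units, lambda u: print("restarting mysql on", u))
```

The point of the sketch is only that the restart in the upgrade-charm path would be gated on holding the lock, in the same way the TLS change path already is.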

Changed in charm-mysql-innodb-cluster:
importance: Undecided → High
status: New → Triaged