Scaling fails with MySQL InnoDB Cluster not healthy: None

Bug #1918322 reported by Jake Hill
This bug affects 2 people
Affects: charm-mysql-innodb-cluster
Status: Expired
Importance: Undecided
Assigned to: Unassigned

Bug Description

I read https://bugs.launchpad.net/charm-mysql-innodb-cluster/+bug/1889792, but this seems different: it does not directly follow a botched removal, and there is no current IP conflict.

Starting with an apparently healthy three-instance cluster:

Unit Workload Agent Machine Public address Ports Message
mysql-innodb-cluster/4* active idle 0/lxd/32 192.168.200.205 Unit is ready: Mode: R/O
mysql-innodb-cluster/5 active idle 38/lxd/34 192.168.200.189 Unit is ready: Mode: R/O
mysql-innodb-cluster/6 active idle 9/lxd/11 192.168.200.53 Unit is ready: Mode: R/W

$ juju run-action mysql-innodb-cluster/leader --wait cluster-status
unit-mysql-innodb-cluster-4:
  UnitId: mysql-innodb-cluster/4
  id: "44211"
  results:
    cluster-status: '{"clusterName": "jujuCluster", "defaultReplicaSet": {"name":
      "default", "primary": "192.168.200.53:3306", "ssl": "REQUIRED", "status": "OK",
      "statusText": "Cluster is ONLINE and can tolerate up to ONE failure.", "topology":
      {"192.168.200.189:3306": {"address": "192.168.200.189:3306", "mode": "R/O",
      "readReplicas": {}, "replicationLag": null, "role": "HA", "status": "ONLINE",
      "version": "8.0.22"}, "192.168.200.205:3306": {"address": "192.168.200.205:3306",
      "mode": "R/O", "readReplicas": {}, "replicationLag": null, "role": "HA", "status":
      "ONLINE", "version": "8.0.22"}, "192.168.200.53:3306": {"address": "192.168.200.53:3306",
      "mode": "R/W", "readReplicas": {}, "replicationLag": null, "role": "HA", "status":
      "ONLINE", "version": "8.0.22"}}, "topologyMode": "Single-Primary"}, "groupInformationSourceMember":
      "192.168.200.53:3306"}'

Attempt to scale:

$ juju add-unit mysql-innodb-cluster --to lxd:72

The new instance does not join correctly:

Unit Workload Agent Machine Public address Ports Message
mysql-innodb-cluster/4* active idle 0/lxd/32 192.168.200.205 Unit is ready: Mode: R/O
mysql-innodb-cluster/5 active idle 38/lxd/34 192.168.200.189 Unit is ready: Mode: R/O
mysql-innodb-cluster/6 active idle 9/lxd/11 192.168.200.53 Unit is ready: Mode: R/W
mysql-innodb-cluster/11 blocked idle 72/lxd/20 192.168.200.166 MySQL InnoDB Cluster not healthy: None
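
To see how the cluster itself views the new member, the group replication membership table can be queried directly on a unit (a sketch; how you authenticate to mysqld depends on the deployment's credentials):

$ juju ssh mysql-innodb-cluster/11 \
    "sudo mysql -e 'SELECT MEMBER_HOST, MEMBER_STATE FROM performance_schema.replication_group_members;'"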

$ juju run-action mysql-innodb-cluster/leader --wait cluster-status
unit-mysql-innodb-cluster-4:
  UnitId: mysql-innodb-cluster/4
  id: "44213"
  results:
    cluster-status: '{"clusterName": "jujuCluster", "defaultReplicaSet": {"name":
      "default", "primary": "192.168.200.53:3306", "ssl": "REQUIRED", "status": "OK_PARTIAL",
      "statusText": "Cluster is ONLINE and can tolerate up to ONE failure. 1 member
      is not active", "topology": {"192.168.200.166:3306": {"address": "192.168.200.166:3306",
      "mode": "R/O", "readReplicas": {}, "role": "HA", "status": "(MISSING)"}, "192.168.200.189:3306":
      {"address": "192.168.200.189:3306", "mode": "R/O", "readReplicas": {}, "replicationLag":
      null, "role": "HA", "status": "ONLINE", "version": "8.0.22"}, "192.168.200.205:3306":
      {"address": "192.168.200.205:3306", "mode": "R/O", "readReplicas": {}, "replicationLag":
      null, "role": "HA", "status": "ONLINE", "version": "8.0.22"}, "192.168.200.53:3306":
      {"address": "192.168.200.53:3306", "mode": "R/W", "readReplicas": {}, "replicationLag":
      null, "role": "HA", "status": "ONLINE", "version": "8.0.22"}}, "topologyMode":
      "Single-Primary"}, "groupInformationSourceMember": "192.168.200.53:3306"}'

I am unable to reproduce this on a fresh deploy of openstack-base, so something must be off in the initial state of this production model despite its apparent greenness. Please help me diagnose further.

Revision history for this message
David Ames (thedac) wrote :

@Jake,

Hi, thanks for filing the bug. Can I get the debug log from the leader node, mysql-innodb-cluster/4*? Or, better yet, from all of the instances:

juju debug-log --replay --include mysql-innodb-cluster/4 --include mysql-innodb-cluster/5 --include mysql-innodb-cluster/6 --include mysql-innodb-cluster/11

The leader node's debug log will give us most of the information we need to figure out what happened.
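
Redirecting that output to a file makes it easy to attach here, e.g.:

$ juju debug-log --replay --include mysql-innodb-cluster/4 > mysql-innodb-cluster-debug.log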

How did you get from unit number 6 to 11? Were there 5 attempts at scaling out?

Changed in charm-mysql-innodb-cluster:
status: New → Incomplete
Revision history for this message
Jake Hill (routergod) wrote :

@David, thanks for such a prompt response. Please find attached.

This cluster has some history, as you will see in the logs. Yes, the jump from unit 6 to 11 is me trying this a few times, adding either one or two instances at a time. Following each attempt I used the remove-instance action to clean up after remove-unit before trying again (sketched below).
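
A sketch of that cleanup sequence (addresses illustrative, matching the unit removed above):

$ juju remove-unit mysql-innodb-cluster/11
$ juju run-action mysql-innodb-cluster/leader --wait remove-instance address=192.168.200.166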

The unit numbers start at 4 because I had similar problems establishing the cluster initially. This was a migration from MySQL 5.x. Having deployed three units initially, I put one on the wrong host. I tried to correct that (adding what would have been unit 3), unsuccessfully. I then did remove-application and started again (units 4, 5 and 6).

Revision history for this message
David Ames (thedac) wrote :

@Jake,

It seems the debug-log is not getting us the full history; I suspect it is only capturing one day's logs. So I am still missing critical information about what happened.

Can you collect /var/log/juju and /var/log/mysql on each of the units, tar and gzip them, and attach the archive, please?
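
One way to collect those per unit (a sketch; repeat for each unit):

$ juju ssh mysql-innodb-cluster/4 'sudo tar czf /tmp/mic-4-logs.tar.gz /var/log/juju /var/log/mysql'
$ juju scp mysql-innodb-cluster/4:/tmp/mic-4-logs.tar.gz .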

Revision history for this message
Jake Hill (routergod) wrote :

@David,

Sorry, I should have noticed. There seems to be some log rotation in mysql (see ls-l.txt), so I am not sure even this is enough. If not, I can repeat what I've attempted previously.

Revision history for this message
Jake Hill (routergod) wrote :

Contents of /var/log/mysql and /var/log/juju from all mysql-innodb-cluster units

Revision history for this message
Alex Zero (citadelcore) wrote :

Same happens for me.

Revision history for this message
Launchpad Janitor (janitor) wrote :

[Expired for charm-mysql-innodb-cluster because there has been no activity for 60 days.]

Changed in charm-mysql-innodb-cluster:
status: Incomplete → Expired