Stuck on "Cluster has no quorum as visible from <leader_ip> and cannot process write transactions. 2 members are not active"

Bug #1917332 reported by Przemysław Lal
18
This bug affects 3 people
Affects Status Importance Assigned to Milestone
MySQL InnoDB Cluster Charm
Medium
Dariusz Smigiel

Bug Description

All mysql-innodb-cluster units entered blocked state after running fine for a couple of days.
This happened during the weekend, without any heavy load or operations going on in parallel.

ubuntu@przemeklal-bastion:~$ juju status mysql-innodb-cluster
Model Controller Cloud/Region Version SLA Timestamp
neutron-work przemeklal-serverstack serverstack/serverstack 2.8.8 unsupported 15:48:36Z

App Version Status Scale Charm Store Rev OS Notes
mysql-innodb-cluster 8.0.23 blocked 3 mysql-innodb-cluster jujucharms 5 ubuntu

Unit Workload Agent Machine Public address Ports Message
mysql-innodb-cluster/0* blocked idle 0 10.5.0.7 MySQL InnoDB Cluster not healthy: Cluster has no quorum as visible from '10.5.0.7:3306' and cannot process write transactions. 2 members are not active
mysql-innodb-cluster/1 blocked idle 1 10.5.0.18 MySQL InnoDB Cluster not healthy: None
mysql-innodb-cluster/2 blocked idle 2 10.5.0.9 MySQL InnoDB Cluster not healthy: None

Machine State DNS Inst id Series AZ Message
0 started 10.5.0.7 7268ef34-31d8-492d-af7b-950d8f48f156 focal nova ACTIVE
1 started 10.5.0.18 489ae28a-43e3-4386-a90b-24eed1e04d3a focal nova ACTIVE
2 started 10.5.0.9 6e9d1f71-5580-4ae5-8841-d1151ea8a7a5 focal nova ACTIVE

In juju debug-log there are repeated error messages for units /1 and /2 (every 5 mins):

unit-mysql-innodb-cluster-1: 2021-03-01 14:00:18 ERROR unit.mysql-innodb-cluster/1.juju-log Cluster is unavailable: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
Traceback (most recent call last):
  File "<string>", line 2, in <module>
SystemError: RuntimeError: Dba.get_cluster: This function is not available through a session to a standalone instance (metadata exists, instance belongs to that metadata, but GR is not active)

I suspected that it could be the same issue as described in bug reports lp:1889792, lp:1881735 or lp:1901771. However the error message in traceback is different and the issue didn't occur during deployment or heavy load. Still, there's a good chance that these are related. Also, running steps from this comment: https://bugs.launchpad.net/charm-mysql-innodb-cluster/+bug/1881735/comments/9 didn't resolve the problem: leader instance executes the action and shows that there's nothing to do:

unit-mysql-innodb-cluster-0: 2021-03-01 14:47:39 INFO unit.mysql-innodb-cluster/0.juju-log Configuring and adding instance to the cluster: 10.5.0.9.
unit-mysql-innodb-cluster-0: 2021-03-01 14:47:42 WARNING unit.mysql-innodb-cluster/0.juju-log User clusteruser exists.
unit-mysql-innodb-cluster-0: 2021-03-01 14:47:45 WARNING unit.mysql-innodb-cluster/0.juju-log User clusteruser exists.
unit-mysql-innodb-cluster-0: 2021-03-01 14:47:45 WARNING unit.mysql-innodb-cluster/0.juju-log Instance: 10.5.0.9, already configured.
unit-mysql-innodb-cluster-0: 2021-03-01 14:47:45 WARNING unit.mysql-innodb-cluster/0.juju-log Instance: 10.5.0.9, already clustered.
unit-mysql-innodb-cluster-0: 2021-03-01 14:47:45 INFO unit.mysql-innodb-cluster/0.juju-log DEPRECATION WARNING: Function action_set is being removed : moved to function_set()
unit-mysql-innodb-cluster-0: 2021-03-01 14:47:47 INFO unit.mysql-innodb-cluster/0.juju-log DEPRECATION WARNING: Function action_get is being removed : moved to function_get()
unit-mysql-innodb-cluster-0: 2021-03-01 14:47:47 INFO unit.mysql-innodb-cluster/0.juju-log Configuring and adding instance to the cluster: 10.5.0.18.

I attached mysql daemon error logs from all instances. They show transient connectivity errors, however leader instance 0 can access units 1 and 2 on port 3306:

root@juju-221534-neutron-work-0:/home/ubuntu# nmap -p 3306 10.5.0.9
Starting Nmap 7.80 ( https://nmap.org ) at 2021-03-01 16:14 UTC
Nmap scan report for juju-221534-neutron-work-2.project.serverstack (10.5.0.9)
Host is up (0.00096s latency).

PORT STATE SERVICE
3306/tcp open mysql
MAC Address: FA:16:3E:76:78:2D (Unknown)

Nmap done: 1 IP address (1 host up) scanned in 0.31 seconds
root@juju-221534-neutron-work-0:/home/ubuntu# nmap -p 3306 10.5.0.18
Starting Nmap 7.80 ( https://nmap.org ) at 2021-03-01 16:14 UTC
Nmap scan report for juju-221534-neutron-work-1.project.serverstack (10.5.0.18)
Host is up (0.0018s latency).

PORT STATE SERVICE
3306/tcp open mysql
MAC Address: FA:16:3E:8D:F1:5B (Unknown)

Nmap done: 1 IP address (1 host up) scanned in 0.28 seconds

Journal logs for mysql systemd service on units 1 and 2 show that service has been restarted 3 times, but didn't recover correctly afterwards. Restarting it manually doesn't help.

Revision history for this message
Przemysław Lal (przemeklal) wrote :
Revision history for this message
David Ames (thedac) wrote :

Przemysław,

Hi, the logs seem to indicate network connectivity problems. MySQL InnoDB cluster is fairly sensitive to connectivity failures and eventually gave up.

2021-02-27T22:25:12.339883Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 10.5.0.18:3306 has become unreachable.'
2021-02-27T22:25:14.808863Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 10.5.0.18:3306 is reachable again.'
2021-02-27T22:25:34.802640Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 10.5.0.18:3306 has become unreachable.'
2021-02-27T22:25:55.080743Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 10.5.0.18:3306 is reachable again.'
2021-02-27T22:26:25.070488Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 10.5.0.18:3306 has become unreachable.'
2021-02-27T22:26:27.034761Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 10.5.0.18:3306 is reachable again.'
2021-02-27T22:26:47.028794Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 10.5.0.18:3306 has become unreachable.'
2021-02-27T22:26:49.132067Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 10.5.0.18:3306 is reachable again.'
2021-02-27T22:26:55.134889Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 10.5.0.18:3306 has become unreachable.'
2021-02-27T22:27:00.961542Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 10.5.0.18:3306 is reachable again.'

To recover this cluster you can run the `reboot-cluster-from-complete-outage` action [0]. Note, if the output suggest the instance you have run the action on does not have the latest GTID state, run it on another until successful.

Clearly, we have some documentation bugs. I have already filed one on the ambiguity of "MySQL InnoDB Cluster not healthy: None" [1]. We may turn this bug into a documentation bug for the need to `reboot-cluster-from-complete-outage` when the cluster is fully stopped.

[0] https://github.com/openstack/charm-mysql-innodb-cluster/blob/master/src/actions.yaml#L28
[1] https://bugs.launchpad.net/charm-mysql-innodb-cluster/+bug/1917337

Revision history for this message
Przemysław Lal (przemeklal) wrote :

Thanks David, I tried to run the action you suggested, but with no success:

ubuntu@przemeklal-bastion:~$ juju run-action mysql-innodb-cluster/0 reboot-cluster-from-complete-outage --wait
unit-mysql-innodb-cluster-0:
  UnitId: mysql-innodb-cluster/0
  id: "92"
  message: Reboot cluster from complete outage failed.
  results:
    output: |+
      Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
      Restoring the default cluster from complete outage...

      Traceback (most recent call last):
        File "<string>", line 2, in <module>
      SystemError: RuntimeError: Dba.reboot_cluster_from_complete_outage: The MySQL instance '10.5.0.7:3306' belongs to an InnoDB Cluster and is reachable. Please use <Cluster>.force_quorum_using_partition_of() to restore from the quorum loss.

    traceback: |
      Traceback (most recent call last):
        File "/var/lib/juju/agents/unit-mysql-innodb-cluster-0/charm/actions/reboot-cluster-from-complete-outage", line 164, in reboot_cluster_from_complete_outage
          output = instance.reboot_cluster_from_complete_outage()
        File "/var/lib/juju/agents/unit-mysql-innodb-cluster-0/charm/lib/charm/openstack/mysql_innodb_cluster.py", line 798, in reboot_cluster_from_complete_outage
          raise e
        File "/var/lib/juju/agents/unit-mysql-innodb-cluster-0/charm/lib/charm/openstack/mysql_innodb_cluster.py", line 786, in reboot_cluster_from_complete_outage
          output = self.run_mysqlsh_script(_script).decode("UTF-8")
        File "/var/lib/juju/agents/unit-mysql-innodb-cluster-0/charm/lib/charm/openstack/mysql_innodb_cluster.py", line 1541, in run_mysqlsh_script
          return subprocess.check_output(cmd, stderr=subprocess.PIPE)
        File "/usr/lib/python3.8/subprocess.py", line 411, in check_output
          return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
        File "/usr/lib/python3.8/subprocess.py", line 512, in run
          raise CalledProcessError(retcode, process.args,
      subprocess.CalledProcessError: Command '['/snap/bin/mysqlsh', '--no-wizard', '--python', '-f', '/root/snap/mysql-shell/common/tmpaxf532am.py']' returned non-zero exit status 1.
  status: failed
  timing:
    completed: 2021-03-01 18:53:11 +0000 UTC
    enqueued: 2021-03-01 18:53:08 +0000 UTC
    started: 2021-03-01 18:53:10 +0000 UTC

I tried to run it a couple of times against all 3 units and I kept getting the same traceback as above.

Revision history for this message
David Ames (thedac) wrote :

Ironically, I had a cluster running on ServerStack that failed the same way over the weekend. Was your cluster also on ServerStack? In which case this bug becomes how to recover from:

"Cluster has no quorum as visible from '10.5.0.14:3306' and cannot process write transactions. 2 members are not active"

Which seems to happen only in extreme circumstances.

Revision history for this message
David Ames (thedac) wrote :
Revision history for this message
Przemysław Lal (przemeklal) wrote :

Yes, it's on ServerStack as well. I'll try to recover it manually and will report back in case of any progress.

Revision history for this message
Przemysław Lal (przemeklal) wrote :

I managed to restore the quorum manually using mysql-shell. Here are the steps:

1. juju ssh into the first non-leader instance

$ mysql-shell.mysqlsh
mysql-py> shell.connect('clusteruser:<cluster-password>@<leader-ip>')
mysql-py []> cluster = dba.get_cluster()
mysql-py []> cluster.force_quorum_using_partition_of('clusteruser:<cluster-password>@<leader-ip>')
mysql-py []> cluster.rejoin_instance('clusteruser:<cluster-password>@<leader-ip>')
<exit>

This restored the quorum. The only thing left was to rejoin instance on the second non-leader instance:

2. juju ssh into the second non-leader instance
$ mysql-shell.mysqlsh
mysql-py> shell.connect('clusteruser:<cluster-password>@<leader-ip>')
mysql-py []> cluster = dba.get_cluster()
mysql-py []> cluster.force_quorum_using_partition_of('clusteruser:<cluster-password>@<leader-ip>')
mysql-py []> cluster.rejoin_instance('clusteruser:<cluster-password>@<leader-ip>')
<exit>

3. After a couple of seconds the cluster is back up and running:
$ juju status mysql-innodb-cluster
Model Controller Cloud/Region Version SLA Timestamp
neutron-work przemeklal-serverstack serverstack/serverstack 2.8.8 unsupported 09:22:29Z

App Version Status Scale Charm Store Rev OS Notes
mysql-innodb-cluster 8.0.23 active 3 mysql-innodb-cluster jujucharms 5 ubuntu

Unit Workload Agent Machine Public address Ports Message
mysql-innodb-cluster/0* active idle 0 10.5.0.7 Unit is ready: Mode: R/W
mysql-innodb-cluster/1 active idle 1 10.5.0.18 Unit is ready: Mode: R/O
mysql-innodb-cluster/2 active idle 2 10.5.0.9 Unit is ready: Mode: R/O

Machine State DNS Inst id Series AZ Message
0 started 10.5.0.7 7268ef34-31d8-492d-af7b-950d8f48f156 focal nova ACTIVE
1 started 10.5.0.18 489ae28a-43e3-4386-a90b-24eed1e04d3a focal nova ACTIVE
2 started 10.5.0.9 6e9d1f71-5580-4ae5-8841-d1151ea8a7a5 focal nova ACTIVE

Note: cluster-password can be obtained from:
$ juju run --unit mysql-innodb-cluster/leader leader-get

Revision history for this message
Przemysław Lal (przemeklal) wrote :

Attached the crashdump tarball so that it might be easier to understand how this situation happened in the first place.

Changed in charm-mysql-innodb-cluster:
status: New → Confirmed
Revision history for this message
David Ames (thedac) wrote :

Thanks @Przemysław!

TRIAGE:

Create an action for force_quorum_using_partition_of per Comment #9.

 cluster.force_quorum_using_partition_of('clusteruser:<cluster-password>@<leader-ip>')
 cluster.rejoin_instance('clusteruser:<cluster-password>@<leader-ip>')

Document the usage of force_quorum_using_partition_of, and reboot-cluster-from-complete-outage

Note: It is my opinion that we cannot avoid getting into these various failure states as this is MySQL's territory. We can only identify the possible states and codifying the methods for recovering from them. I am open to alternative views.

Changed in charm-mysql-innodb-cluster:
status: Confirmed → Triaged
importance: Undecided → Medium
tags: added: onboarding
Changed in charm-mysql-innodb-cluster:
assignee: nobody → Dariusz Smigiel (smigiel-dariusz)
Changed in charm-mysql-innodb-cluster:
status: Triaged → In Progress
tags: added: good-first-bug
removed: onboarding
To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers