Bug #1917332 “Stuck on “Cluster has no quorum as visible from <l...” : Bugs : MySQL InnoDB Cluster Charm

Przemyslaw Lal (przemeklal) wrote on 2021-03-01:

#3

mysql_logs.tar.xz (4.7 KiB, application/x-tar)

David Ames (thedac) wrote on 2021-03-01:

#4

Przemysław,

Hi, the logs seem to indicate network connectivity problems. MySQL InnoDB cluster is fairly sensitive to connectivity failures and eventually gave up.

2021-02-27T22:25:12.339883Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 10.5.0.18:3306 has become unreachable.'
2021-02-27T22:25:14.808863Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 10.5.0.18:3306 is reachable again.'
2021-02-27T22:25:34.802640Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 10.5.0.18:3306 has become unreachable.'
2021-02-27T22:25:55.080743Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 10.5.0.18:3306 is reachable again.'
2021-02-27T22:26:25.070488Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 10.5.0.18:3306 has become unreachable.'
2021-02-27T22:26:27.034761Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 10.5.0.18:3306 is reachable again.'
2021-02-27T22:26:47.028794Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 10.5.0.18:3306 has become unreachable.'
2021-02-27T22:26:49.132067Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 10.5.0.18:3306 is reachable again.'
2021-02-27T22:26:55.134889Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 10.5.0.18:3306 has become unreachable.'
2021-02-27T22:27:00.961542Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 10.5.0.18:3306 is reachable again.'
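
To quantify the flapping in an excerpt like the one above, a small script can count how often each member was reported unreachable. This is only a sketch: the function name `count_flaps` and the regular expression are illustrative assumptions and cover just the two message formats quoted here.

```python
import re
from collections import Counter

# Matches only the two group_replication reachability warnings quoted above.
LINE_RE = re.compile(
    r"^(?P<ts>\S+) .*Member with address (?P<addr>\S+) "
    r"(?P<event>has become unreachable|is reachable again)\.'"
)


def count_flaps(log_text):
    """Return {member address: number of 'has become unreachable' events}."""
    flaps = Counter()
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if match and match.group("event") == "has become unreachable":
            flaps[match.group("addr")] += 1
    return dict(flaps)


# First three lines of the excerpt, copied verbatim.
SAMPLE = """\
2021-02-27T22:25:12.339883Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 10.5.0.18:3306 has become unreachable.'
2021-02-27T22:25:14.808863Z 0 [Warning] [MY-011494] [Repl] Plugin group_replication reported: 'Member with address 10.5.0.18:3306 is reachable again.'
2021-02-27T22:25:34.802640Z 0 [Warning] [MY-011493] [Repl] Plugin group_replication reported: 'Member with address 10.5.0.18:3306 has become unreachable.'
"""

print(count_flaps(SAMPLE))  # {'10.5.0.18:3306': 2}
```

A high count for a single address within a few minutes, as seen here for 10.5.0.18, points at the network path to that member rather than at MySQL itself.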

To recover this cluster, you can run the `reboot-cluster-from-complete-outage` action [0]. Note: if the output suggests that the instance you ran the action on does not have the latest GTID state, run it on another instance until it succeeds.

Clearly, we have some documentation bugs. I have already filed one on the ambiguity of "MySQL InnoDB Cluster not healthy: None" [1]. We may turn this bug into a documentation bug for the need to `reboot-cluster-from-complete-outage` when the cluster is fully stopped.

[0] https://github.com/openstack/charm-mysql-innodb-cluster/blob/master/src/actions.yaml#L28
[1] https://bugs.launchpad.net/charm-mysql-innodb-cluster/+bug/1917337

Przemyslaw Lal (przemeklal) wrote on 2021-03-01:

#5

Thanks David, I tried to run the action you suggested, but with no success:

ubuntu@przemeklal-bastion:~$ juju run-action mysql-innodb-cluster/0 reboot-cluster-from-complete-outage --wait
unit-mysql-innodb-cluster-0:
  UnitId: mysql-innodb-cluster/0
  id: "92"
  message: Reboot cluster from complete outage failed.
  results:
    output: |+
      Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
      Restoring the default cluster from complete outage...

      Traceback (most recent call last):
        File "<string>", line 2, in <module>
      SystemError: RuntimeError: Dba.reboot_cluster_from_complete_outage: The MySQL instance '10.5.0.7:3306' belongs to an InnoDB Cluster and is reachable. Please use <Cluster>.force_quorum_using_partition_of() to restore from the quorum loss.

    traceback: |
      Traceback (most recent call last):
        File "/var/lib/juju/agents/unit-mysql-innodb-cluster-0/charm/actions/reboot-cluster-from-complete-outage", line 164, in reboot_cluster_from_complete_outage
          output = instance.reboot_cluster_from_complete_outage()
        File "/var/lib/juju/agents/unit-mysql-innodb-cluster-0/charm/lib/charm/openstack/mysql_innodb_cluster.py", line 798, in reboot_cluster_from_complete_outage
          raise e
        File "/var/lib/juju/agents/unit-mysql-innodb-cluster-0/charm/lib/charm/openstack/mysql_innodb_cluster.py", line 786, in reboot_cluster_from_complete_outage
          output = self.run_mysqlsh_script(_script).decode("UTF-8")
        File "/var/lib/juju/agents/unit-mysql-innodb-cluster-0/charm/lib/charm/openstack/mysql_innodb_cluster.py", line 1541, in run_mysqlsh_script
          return subprocess.check_output(cmd, stderr=subprocess.PIPE)
        File "/usr/lib/python3.8/subprocess.py", line 411, in check_output
          return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
        File "/usr/lib/python3.8/subprocess.py", line 512, in run
          raise CalledProcessError(retcode, process.args,
      subprocess.CalledProcessError: Command '['/snap/bin/mysqlsh', '--no-wizard', '--python', '-f', '/root/snap/mysql-shell/common/tmpaxf532am.py']' returned non-zero exit status 1.
  status: failed
  timing:
    completed: 2021-03-01 18:53:11 +0000 UTC
    enqueued: 2021-03-01 18:53:08 +0000 UTC
    started: 2021-03-01 18:53:10 +0000 UTC

I tried to run it a couple of times against all 3 units and I kept getting the same traceback as above.
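
The error text in the action output already names the call that MySQL Shell recommends instead. As a rough sketch, a script wrapping the action could extract that hint. The helper name `suggested_call` and the pattern are assumptions for illustration and are not part of the charm.

```python
import re

# Pattern for the "Please use <Cluster>.xxx()" hint seen in the traceback above.
HINT_RE = re.compile(r"Please use (?P<call><Cluster>\.\w+\(\))")


def suggested_call(action_output):
    """Return the API call recommended by the error message, or None."""
    match = HINT_RE.search(action_output)
    return match.group("call") if match else None


# Error line copied from the action output above.
ERROR = (
    "SystemError: RuntimeError: Dba.reboot_cluster_from_complete_outage: "
    "The MySQL instance '10.5.0.7:3306' belongs to an InnoDB Cluster and is "
    "reachable. Please use <Cluster>.force_quorum_using_partition_of() to "
    "restore from the quorum loss."
)

print(suggested_call(ERROR))  # <Cluster>.force_quorum_using_partition_of()
```

In other words, the instance is still reachable and part of the cluster, so this is a quorum loss rather than a complete outage, and the reboot action does not apply.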

David Ames (thedac) wrote on 2021-03-01:

#6

Coincidentally, I had a cluster running on ServerStack that failed the same way over the weekend. Was your cluster also on ServerStack? If so, this bug becomes a question of how to recover from:

"Cluster has no quorum as visible from '10.5.0.14:3306' and cannot process write transactions. 2 members are not active"

Which seems to happen only in extreme circumstances.

David Ames (thedac) wrote on 2021-03-01:

#7

This may offer some insight: https://lefred.be/content/mysql-innodb-cluster-how-to-manage-a-split-brain-situation/

Przemyslaw Lal (przemeklal) wrote on 2021-03-01:

#8

Yes, it's on ServerStack as well. I'll try to recover it manually and will report back in case of any progress.

Przemyslaw Lal (przemeklal) wrote on 2021-03-02:

#9

I managed to restore the quorum manually using mysql-shell. Here are the steps:

1. juju ssh into the first non-leader instance

$ mysql-shell.mysqlsh
mysql-py> shell.connect('clusteruser:<cluster-password>@<leader-ip>')
mysql-py []> cluster = dba.get_cluster()
mysql-py []> cluster.force_quorum_using_partition_of('clusteruser:<cluster-password>@<leader-ip>')
mysql-py []> cluster.rejoin_instance('clusteruser:<cluster-password>@<leader-ip>')
<exit>

This restored the quorum. The only thing left was to rejoin the second non-leader instance:

2. juju ssh into the second non-leader instance
$ mysql-shell.mysqlsh
mysql-py> shell.connect('clusteruser:<cluster-password>@<leader-ip>')
mysql-py []> cluster = dba.get_cluster()
mysql-py []> cluster.force_quorum_using_partition_of('clusteruser:<cluster-password>@<leader-ip>')
mysql-py []> cluster.rejoin_instance('clusteruser:<cluster-password>@<leader-ip>')
<exit>

3. After a couple of seconds the cluster is back up and running:
$ juju status mysql-innodb-cluster  
Model         Controller              Cloud/Region             Version  SLA          Timestamp
neutron-work  przemeklal-serverstack  serverstack/serverstack  2.8.8    unsupported  09:22:29Z

App                   Version  Status  Scale  Charm                 Store       Rev  OS      Notes
mysql-innodb-cluster  8.0.23   active      3  mysql-innodb-cluster  jujucharms    5  ubuntu

Unit                     Workload  Agent  Machine  Public address  Ports  Message
mysql-innodb-cluster/0*  active    idle   0        10.5.0.7               Unit is ready: Mode: R/W
mysql-innodb-cluster/1   active    idle   1        10.5.0.18              Unit is ready: Mode: R/O
mysql-innodb-cluster/2   active    idle   2        10.5.0.9               Unit is ready: Mode: R/O

Machine  State    DNS        Inst id                               Series  AZ    Message
0        started  10.5.0.7   7268ef34-31d8-492d-af7b-950d8f48f156  focal   nova  ACTIVE
1        started  10.5.0.18  489ae28a-43e3-4386-a90b-24eed1e04d3a  focal   nova  ACTIVE
2        started  10.5.0.9   6e9d1f71-5580-4ae5-8841-d1151ea8a7a5  focal   nova  ACTIVE

Note: cluster-password can be obtained from:
$ juju run --unit mysql-innodb-cluster/leader leader-get
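
For anyone who wants to script the steps above rather than type them interactively, here is a minimal sketch that renders the same three calls into a file for `mysqlsh --no-wizard --python -f <file>` (the invocation visible in the traceback in comment #5). The function name `build_recovery_script` and the percent-encoding of the password are my own assumptions and not part of the original steps. The script mirrors the commands exactly as reported, including passing the leader URI to every call.

```python
from urllib.parse import quote


def build_recovery_script(password, leader_ip, user="clusteruser"):
    """Render the mysqlsh Python-mode commands reported in comment #9."""
    # Percent-encode the password so characters such as '@' or ':' do not
    # break the URI (an added precaution, not part of the original steps).
    uri = f"{user}:{quote(password, safe='')}@{leader_ip}"
    return "\n".join([
        f"shell.connect({uri!r})",
        "cluster = dba.get_cluster()",
        f"cluster.force_quorum_using_partition_of({uri!r})",
        f"cluster.rejoin_instance({uri!r})",
        "",
    ])


# Placeholder password; the real one comes from leader-get as noted above.
print(build_recovery_script("s3cret", "10.5.0.7"))
```

The output is plain text, so it can be reviewed before it is run on a unit. Forcing quorum is a destructive operation and should only be run once it is clear which partition holds the data to keep.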

Przemyslaw Lal (przemeklal) wrote on 2021-03-02:

#10

juju-crashdump-bcc70e12-f747-4579-866d-27777bacc442.tar.xz (118.2 MiB, application/x-tar)

Attached the crashdump tarball so that it might be easier to understand how this situation happened in the first place.

Alex Kavanagh (ajkavanagh) on 2021-03-02

Changed in charm-mysql-innodb-cluster:
status:	New → Confirmed

David Ames (thedac) wrote on 2021-03-02:

#11

Thanks @Przemysław!

TRIAGE:

Create an action for force_quorum_using_partition_of per Comment #9.

cluster.force_quorum_using_partition_of('clusteruser:<cluster-password>@<leader-ip>')
cluster.rejoin_instance('clusteruser:<cluster-password>@<leader-ip>')

Document the usage of force_quorum_using_partition_of, and reboot-cluster-from-complete-outage

Note: It is my opinion that we cannot avoid getting into these various failure states, as this is MySQL's territory. We can only identify the possible states and codify the methods for recovering from them. I am open to alternative views.
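
One way to picture "identify the states and codify the recovery" is a small lookup keyed on the status message. The sketch below recognises only the quorum-loss message quoted in comment #6. The names `recovery_for` and `NO_QUORUM_RE` are hypothetical, and the singular form "1 member is not active" is an assumption that does not appear in this thread.

```python
import re

# Quorum-loss status message as quoted in comment #6. The singular variant
# ("1 member is not active") is assumed, not observed.
NO_QUORUM_RE = re.compile(
    r"Cluster has no quorum as visible from '(?P<addr>[^']+)' and cannot "
    r"process write transactions\. "
    r"(?P<inactive>\d+) members? (?:is|are) not active"
)


def recovery_for(status_message):
    """Map a status message to the recovery method discussed in this bug."""
    match = NO_QUORUM_RE.search(status_message)
    if match:
        return {
            "state": "quorum-loss",
            "visible_from": match.group("addr"),
            "inactive": int(match.group("inactive")),
            "recovery": "cluster.force_quorum_using_partition_of()",
        }
    # Other states (for example a complete outage, which calls for the
    # reboot-cluster-from-complete-outage action) are not detected here.
    return {"state": "unknown", "recovery": None}


MESSAGE = (
    "Cluster has no quorum as visible from '10.5.0.14:3306' and cannot "
    "process write transactions. 2 members are not active"
)
print(recovery_for(MESSAGE)["recovery"])  # cluster.force_quorum_using_partition_of()
```

Detecting a complete outage would need its own pattern, which depends on resolving the ambiguous "not healthy: None" message tracked in the other bug.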

Changed in charm-mysql-innodb-cluster:
status:	Confirmed → Triaged
importance:	Undecided → Medium
tags:	added: onboarding

Dariusz Smigiel (smigiel-dariusz) on 2021-03-08

Changed in charm-mysql-innodb-cluster:
assignee:	nobody → Dariusz Smigiel (smigiel-dariusz)

OpenStack Infra (hudson-openstack) on 2021-04-20

Changed in charm-mysql-innodb-cluster:
status:	Triaged → In Progress

Corey Bryant (corey.bryant) on 2021-05-12

tags:

added: good-first-bug
removed: onboarding

Dariusz Smigiel (smigiel-dariusz) on 2021-12-07

Changed in charm-mysql-innodb-cluster:
assignee:	Dariusz Smigiel (smigiel-dariusz) → nobody

OpenStack Infra (hudson-openstack) wrote on 2022-01-05: Change abandoned on charm-mysql-innodb-cluster (master)

#12

Change abandoned by "dasm <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/charm-mysql-innodb-cluster/+/779427

Paulo Machado (paulomachado) on 2022-03-18

Changed in charm-mysql-innodb-cluster:
assignee:	nobody → Paulo Machado (paulomachado)

OpenStack Infra (hudson-openstack) wrote on 2022-03-24: Fix proposed to charm-mysql-innodb-cluster (master)

#13

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/charm-mysql-innodb-cluster/+/835106

OpenStack Infra (hudson-openstack) wrote on 2022-03-24: Change abandoned on charm-mysql-innodb-cluster (master)

#14

Change abandoned by "Paulo Silveira Machado <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/charm-mysql-innodb-cluster/+/835106
Reason: Bad formatted review

OpenStack Infra (hudson-openstack) wrote on 2022-03-24: Fix proposed to charm-mysql-innodb-cluster (master)

#15

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/charm-mysql-innodb-cluster/+/835117

Robert Gildein (rgildein) wrote on 2022-05-26:

#16

Hi Paulo, I ran into the same problem and tried your proposed fix, but it didn't help. Unfortunately, I was in a hurry, so I did not investigate further. This probably won't be a useful comment without any logs, but I wanted to share that I hit the same issue.

Paulo Machado (paulomachado) wrote on 2022-06-09:

#17

Thanks for the feedback, Robert. Indeed, this does not seem to cover all cases; it is a hard one to pin down.

OpenStack Infra (hudson-openstack) wrote on 2022-09-12: Change abandoned on charm-mysql-innodb-cluster (master)

#18

Change abandoned by "James Page <email address hidden>" on branch: master
Review: https://review.opendev.org/c/openstack/charm-mysql-innodb-cluster/+/835117
Reason: This review is > 12 weeks without comment, and failed testing the last time it was checked. We are abandoning this for now. Feel free to reactivate the review by pressing the restore button and leaving a 'recheck' comment to get fresh test results.

Alex Kavanagh (ajkavanagh) wrote on 2023-05-31:

#19

Note, I've re-opened the review and rebased it to the current master branch: https://review.opendev.org/c/openstack/charm-mysql-innodb-cluster/+/835117

Affects                     Status                   Importance  Assigned to
MySQL InnoDB Cluster Charm  Status tracked in Trunk
Jammy                       New                      Undecided   Unassigned
Trunk                       Triaged                  Medium      Unassigned

MySQL InnoDB Cluster Charm

Stuck on "Cluster has no quorum as visible from <leader_ip> and cannot process write transactions. 2 members are not active"
