cluster-status: "null" with cluster-status action

Bug #1907202 reported by Nobuto Murata
This bug affects 4 people
Affects: MySQL InnoDB Cluster Charm
Status: Triaged
Importance: Medium
Assigned to: Unassigned

Bug Description

After rebooting the units one by one, the MySQL cluster became unhealthy. However, the cluster-status action gave no informative message beyond cluster-status: "null", which is not helpful in this context.

$ juju run-action mysql-innodb-cluster/0 cluster-status --wait
unit-mysql-innodb-cluster-0:
  UnitId: mysql-innodb-cluster/0
  id: "390"
  results:
    cluster-status: "null"
  status: completed
  timing:
    completed: 2020-12-08 07:35:10 +0000 UTC
    enqueued: 2020-12-08 07:34:27 +0000 UTC
    started: 2020-12-08 07:34:41 +0000 UTC

mysql-innodb-cluster/0* blocked executing 3/lxd/8 10.130.11.56 (cluster-status) MySQL InnoDB Cluster not healthy: None
  filebeat/107 active idle 10.130.11.56 Filebeat ready.
  logrotate/95 active idle 10.130.11.56 Unit is ready.
  nrpe-container/68 active idle 10.130.11.56 icmp,5666/tcp ready
  telegraf/107 active idle 10.130.11.56 9103/tcp Monitoring mysql-innodb-cluster/0
mysql-innodb-cluster/1 blocked idle 4/lxd/8 10.130.12.59 MySQL InnoDB Cluster not healthy: None
  filebeat/93 active idle 10.130.12.59 Filebeat ready.
  logrotate/55 active idle 10.130.12.59 Unit is ready.
  nrpe-container/54 active idle 10.130.12.59 icmp,5666/tcp ready
  telegraf/33 active idle 10.130.12.59 9103/tcp Monitoring mysql-innodb-cluster/1
mysql-innodb-cluster/2 blocked idle 5/lxd/8 10.130.13.133 MySQL InnoDB Cluster not healthy: None
  filebeat/92 active idle 10.130.13.133 Filebeat ready.
  logrotate/64 active idle 10.130.13.133 Unit is ready.
  nrpe-container/55 active idle 10.130.13.133 icmp,5666/tcp ready
  telegraf/32 active idle 10.130.13.133 9103/tcp Monitoring mysql-innodb-cluster/2

Revision history for this message
Nobuto Murata (nobuto) wrote :

unit-mysql-innodb-cluster-0: 07:33:39 INFO unit.mysql-innodb-cluster/0.juju-log DEPRECATION WARNING: Function action_set is being removed : moved to function_set()
unit-mysql-innodb-cluster-0: 07:34:09 ERROR unit.mysql-innodb-cluster/0.juju-log Cluster is unavailable: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
Traceback (most recent call last):
  File "<string>", line 2, in <module>
SystemError: RuntimeError: Dba.get_cluster: This function is not available through a session to a standalone instance (metadata exists, instance belongs to that metadata, but GR is not active)

unit-mysql-innodb-cluster-0: 07:34:09 INFO unit.mysql-innodb-cluster/0.juju-log DEPRECATION WARNING: Function action_set is being removed : moved to function_set()
unit-mysql-innodb-cluster-0: 07:34:40 ERROR unit.mysql-innodb-cluster/0.juju-log Cluster is unavailable: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
Traceback (most recent call last):
  File "<string>", line 2, in <module>
SystemError: RuntimeError: Dba.get_cluster: This function is not available through a session to a standalone instance (metadata exists, instance belongs to that metadata, but GR is not active)

unit-mysql-innodb-cluster-0: 07:34:40 INFO unit.mysql-innodb-cluster/0.juju-log DEPRECATION WARNING: Function action_set is being removed : moved to function_set()
unit-mysql-innodb-cluster-0: 07:35:09 ERROR unit.mysql-innodb-cluster/0.juju-log Cluster is unavailable: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
Traceback (most recent call last):
  File "<string>", line 2, in <module>
SystemError: RuntimeError: Dba.get_cluster: This function is not available through a session to a standalone instance (metadata exists, instance belongs to that metadata, but GR is not active)

unit-mysql-innodb-cluster-0: 07:35:09 INFO unit.mysql-innodb-cluster/0.juju-log DEPRECATION WARNING: Function action_set is being removed : moved to function_set()
unit-mysql-innodb-cluster-0: 07:35:10 INFO unit.mysql-innodb-cluster/0.juju-log cluster:30: Reactive main running for hook cluster-relation-changed
unit-mysql-innodb-cluster-0: 07:35:12 INFO unit.mysql-innodb-cluster/0.juju-log cluster:30: Initializing Snap Layer
unit-mysql-innodb-cluster-0: 07:35:12 WARNING unit.mysql-innodb-cluster/0.cluster-relation-changed All snaps up to date.
unit-mysql-innodb-cluster-0: 07:35:12 INFO unit.mysql-innodb-cluster/0.juju-log cluster:30: Coordinator: Initializing coordinator layer
unit-mysql-innodb-cluster-0: 07:35:12 INFO unit.mysql-innodb-cluster/0.juju-log cluster:30: Initializing Leadership Layer (is leader)
unit-mysql-innodb-cluster-0: 07:35:12 INFO unit.mysql-innodb-cluster/0.juju-log cluster:30: Invoking reactive handler: reactive/mysql_innodb_cluster_handlers.py:274:db_router_respond
unit-mysql-innodb-cluster-0: 07:35:41 ERROR unit.mysql-innodb-cluster/0.juju-log cluster:30: Cluster is unavailable: Cannot set LC_ALL to locale en_US.UTF-8: No such file or directory
Traceback (most recent call last):
  File "<string>", line 2, in <module>
SystemError: RuntimeError: Dba.get_cluster: This function is not available through a session to a standalone instance (metada...


Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

I ran into this whilst doing charm upgrades from stable->next for 21.01 testing. It's possible that during the upgrade all the units were restarted, which means the cluster went unserviceable. I'll do some further testing to verify.

In order to get the cluster back, I needed to run:

cluster = dba.reboot_cluster_from_complete_outage()

from the mysqlsh utility (after connecting). Then the database came back.
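
Something like the following session (user and host are illustrative; use the cluster admin credentials that the charm configured):

$ sudo mysqlsh --py
mysql-py> shell.connect('clusteradmin@localhost:3306')  # illustrative credentials; prompts for the password
mysql-py> cluster = dba.reboot_cluster_from_complete_outage()
mysql-py> cluster.status()  # confirm the members rejoin and the cluster reports OK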

Note: the LC_ALL error was a red herring; it doesn't affect operations, it just generates the error message.
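
If the locale noise itself is unwanted, generating the missing locale on the unit should silence it (standard Ubuntu tooling; an assumption, not verified in these containers):

$ sudo locale-gen en_US.UTF-8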

Changed in charm-mysql-innodb-cluster:
status: New → Triaged
importance: Undecided → High
assignee: nobody → Alex Kavanagh (ajkavanagh)
tags: added: charm-upgrade
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

I'm tempted to close this as invalid, but will leave it as incomplete. I've done some testing with the -next charms (that are about to become the 21.01 charms), and the *only* way I can replicate this is to restart all the units at the same time. If I restart a single unit, wait for it to come back and the cluster to settle, I can't reproduce the problem. If I shut down two units (regardless of whether one of them is R/W) and then bring them back, the cluster recovers.

Could additional details be provided about how, when, and with what timing the units were shut down/restarted and failed to recover?

Thanks.

Changed in charm-mysql-innodb-cluster:
status: Triaged → Incomplete
importance: High → Undecided
assignee: Alex Kavanagh (ajkavanagh) → nobody
Revision history for this message
Billy Olsen (billy-olsen) wrote :

I'm going to reopen this one because an all-machines shutdown is a valid case that does occur during data-center power outages.

Changed in charm-mysql-innodb-cluster:
status: Incomplete → Triaged
importance: Undecided → Medium
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

> I'm going to reopen this one because an all-machines shutdown is a valid case that does occur during data-center power outages.

I'm not sure mysql8 *can* recover (or is safe to recover) in that state without some manual intervention. Perhaps what's needed is an action to trigger `reboot_cluster_from_complete_outage` and a workload message to indicate that the cluster is 'out'?
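
From the operator's side, the proposed action might look something like this (the action name and invocation are hypothetical until implemented):

$ juju run-action mysql-innodb-cluster/0 reboot-cluster-from-complete-outage --wait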

Revision history for this message
Billy Olsen (billy-olsen) wrote :

Agreed that mysql8 may have challenges recovering, and that will likely require administrator intervention. However, adding some capabilities to the charms to help operators/administrators with this would be desirable for the charms themselves. An action sounds completely reasonable, but note that some scenarios require more than just issuing reboot_cluster_from_complete_outage, and those should be worked through to determine what the experience actually is.

I think this bug should be about providing a status to the user that isn't just "null" or None; even something like "unable to detect cluster status" would go a long way towards making this more helpful. Further recovery enhancements can be made in appropriate feature work.
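
Purely as an illustration (not current charm behaviour), surfacing the underlying exception from the logs above would already be an improvement over "null":

  results:
    cluster-status: "Cluster is unavailable: Dba.get_cluster: This function is not available through a session to a standalone instance"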

Revision history for this message
Nobuto Murata (nobuto) wrote :

That's exactly the point. Even in the scenario where the "reboot-cluster-from-complete-outage" action needs to be run, a cluster-status action output of null is not helpful.
