Bug #1938708 “Unit stays blocked after machine powerdown and reb...” : Series jammy : Bugs : MySQL InnoDB Cluster Charm

Alexander Balderson (asbalderson) on 2021-08-02

description:

updated

Revision history for this message

Alex Kavanagh (ajkavanagh) wrote on 2021-08-11:

#1

There are a number of similar bugs to this one open for mysql-innodb-cluster:

- Deployment blocked with "Cluster is inaccessible from this instance" (https://bugs.launchpad.net/charm-mysql-innodb-cluster/+bug/1926449)
- Stuck on "Cluster has no quorum as visible from <leader_ip> and cannot process write transactions. 2 members are not active" (https://bugs.launchpad.net/charm-mysql-innodb-cluster/+bug/1917332)

See more at: https://bugs.launchpad.net/charm-mysql-innodb-cluster

I think it boils down to: the payload mysql-innodb-cluster is either a) currently not configured in a way that recovers from cluster outages, b) isn't actually very good at coming back from cluster outages (i.e. a problem in the product).

Looking at https://dev.mysql.com/doc/mysql-shell/8.0/en/troubleshooting-innodb-cluster.html it seems it might be tricky to actually fully automate this.

Revision history for this message

Dmitrii Shcherbakov (dmitriis) wrote on 2022-07-18:

#2

A reproducer for this is quite simple: forcefully powering of a follower (replica) unit and powering it back on results in a unit being blocked with mysql starting but reporting the following in the error.log:

* [Server] Error reading GTIDs from relaylog: -1
* [Repl] Error reading relay log event for channel 'group_replication_applier': corrupted data in log event
* [Repl] Slave SQL for channel 'group_replication_applier': Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, the server was unable to fetch a keyring key required to open an encrypted relay log file, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Error_code: MY-013121
* [Repl] Plugin group_replication reported: 'The applier thread execution was aborted. Unable to process more transactions, this member will now leave the group.'
* [Repl] Plugin group_replication reported: 'Fatal error during execution on the Applier process of Group Replication. The server will now leave the group.'
* [Repl] Plugin group_replication reported: '[GCS] The member is already leaving or joining a group.'
* [Repl] Plugin group_replication reported: 'Unable to confirm whether the server has left the group or not. Check performance_schema.replication_group_members to check group membership information.'
* [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
* [Repl] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'FIRST' position 0
2022-07-18T14:51:17.013502Z 0 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Shutting down an outgoing connection. This happens because something might be wrong on a bi-directional connection to node 10.10.20.32:33061. Please check the connection status to this member'
* [Repl] Plugin group_replication reported: 'There was a previous plugin error while the member joined the group. The member will now exit the group.'

https://paste.ubuntu.com/p/8xPkRDXHDB/ (juju status and error.log before and after power events)

https://paste.ubuntu.com/p/dq2kr7znzv/ (full juju show-status-log for the affected unit)

18 Jul 2022 17:51:23+03:00 juju-unit executing running start hook
18 Jul 2022 17:51:56+03:00 juju-unit executing running leader-settings-changed hook
18 Jul 2022 17:52:24+03:00 juju-unit idle
18 Jul 2022 18:21:55+03:00 workload blocked Cluster is inaccessible from this instance. Please check logs for details.

https://paste.ubuntu.com/p/H83HdMcVQK/ (cluster-status action output)

A reproducer for this is quite simple: forcefully powering of a follower (replica) unit and powering it back on results in a unit being blocked with mysql starting but reporting the following in the error.log:

* [Server] Error reading GTIDs from relaylog: -1
* [Repl] Error reading relay log event for channel 'group_replication_applier': corrupted data in log event
* [Repl] Slave SQL for channel 'group_replication_applier': Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, the server was unable to fetch a keyring key required to open an encrypted relay log file, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Error_code: MY-013121
* [Repl] Plugin group_replication reported: 'The applier thread execution was aborted. Unable to process more transactions, this member will now leave the group.'
* [Repl] Plugin group_replication reported: 'Fatal error during execution on the Applier process of Group Replication. The server will now leave the group.'
* [Repl] Plugin group_replication reported: '[GCS] The member is already leaving or joining a group.'
* [Repl] Plugin group_replication reported: 'Unable to confirm whether the server has left the group or not. Check performance_schema.replication_group_members to check group membership information.'
* [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
* [Repl] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'FIRST' position 0
2022-07-18T14:51:17.013502Z 0 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Shutting down an outgoing connection. This happens because something might be wrong on a bi-directional connection to node 10.10.20.32:33061. Please check the connection status to this member'
* [Repl] Plugin group_replication reported: 'There was a previous plugin error while the member joined the group. The member will now exit the group.'

https://paste.ubuntu.com/p/8xPkRDXHDB/ (juju status and error.log before and after power events)

https://paste.ubuntu.com/p/dq2kr7znzv/ (full juju show-status-log for the affected unit)

18 Jul 2022 17:51:23+03:00  juju-unit  executing    running start hook
18 Jul 2022 17:51:56+03:00  juju-unit  executing    running leader-settings-changed hook
18 Jul 2022 17:52:24+03:00  juju-unit  idle         
18 Jul 2022 18:21:55+03:00  workload   blocked      Cluster is inaccessible from this instance. Please check logs for details.

https://paste.ubuntu.com/p/H83HdMcVQK/ (cluster-status action output)

Revision history for this message

Dmitrii Shcherbakov (dmitriis) wrote on 2022-07-18:

#3

In order to fix this we can make the followers crash-tolerant which is currently not done:

https://dev.mysql.com/doc/refman/8.0/en/replication-solutions-unexpected-replica-halt.html
"In order for replication to be resilient to unexpected halts of the server (sometimes described as crash-safe) it must be possible for the replica to recover its state before halting. This section describes the impact of an unexpected halt of a replica during replication, and how to configure a replica for the best chance of recovery to continue replication."

The right option isn't explicitly enabled and is set to OFF by default:

"Set relay_log_recovery = ON, which enables automatic relay log recovery immediately following server startup. This global variable defaults to OFF and is read-only at runtime, but you can set it to ON with the --relay-log-recovery option at replica startup following an unexpected halt of a replica. Note that this setting ignores the existing relay log files, in case they are corrupted or inconsistent. The relay log recovery process starts a new relay log file and fetches transactions from the source beginning at the replication SQL thread position recorded in the applier metadata repository. The previous relay log files are removed over time by the replica's normal purge mechanism."

This could also be enabled to make the setup safer (but slower):

"Set sync_relay_log=1, which instructs the replication receiver thread to synchronize the relay log to disk after each received transaction is written to it.

This means the replica's record of the current position read from the source's binary log (in the applier metadata repository) is never ahead of the record of transactions saved in the relay log. Note that although this setting is the safest, it is also the slowest due to the number of disk writes involved."

Revision history for this message

Dmitrii Shcherbakov (dmitriis) wrote on 2022-07-18:

#4

Doing the following:

sudo bash -c 'echo "relay_log_recovery = ON" >> /etc/mysql/mysql.conf.d/mysqld.cnf'
sudo systemctl restart mysql.service

On an affected unit helps.

And then a unit properly comes up: https://paste.ubuntu.com/p/bFTQCBjPYs/

I suggest turning this feature on by default.

Revision history for this message

Russell Stather (russellstather) wrote on 2022-12-21:

#5

I agree, this fixed my issues perfectly.

Alex Kavanagh (ajkavanagh) on 2023-05-31

Changed in charm-mysql-innodb-cluster:
status:	New → Triaged
importance:	Undecided → High

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2023-05-31: Fix proposed to charm-mysql-innodb-cluster (master)

#6

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/charm-mysql-innodb-cluster/+/884871

Changed in charm-mysql-innodb-cluster:
status:	Triaged → In Progress

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2023-05-31: Fix merged to charm-mysql-innodb-cluster (master)

#7

Reviewed: https://review.opendev.org/c/openstack/charm-mysql-innodb-cluster/+/884871
Committed: https://opendev.org/openstack/charm-mysql-innodb-cluster/commit/21494229747a9d4013d360b57ccd3de7749f0ffb
Submitter: "Zuul (22348)"
Branch: master

commit 21494229747a9d4013d360b57ccd3de7749f0ffb
Author: Alex Kavanagh <email address hidden>
Date: Wed May 31 11:48:08 2023 +0100

Set relay_log_recovery = ON by default

    This is to aid in recoverying nodes from unexpected halts/power failures
    and ease their re-introduction into the cluster. It ensures that
    potentially corrupted relay log files are discarded.

Change-Id: Iaf6ba030001bc458abf5536454745cae905aceee
Closes-Bug: #1938708

Changed in charm-mysql-innodb-cluster:
status:	In Progress → Fix Committed

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2023-06-01: Fix proposed to charm-mysql-innodb-cluster (stable/jammy)

#8

Fix proposed to branch: stable/jammy
Review: https://review.opendev.org/c/openstack/charm-mysql-innodb-cluster/+/884826

Revision history for this message

OpenStack Infra (hudson-openstack) wrote on 2023-06-02: Fix merged to charm-mysql-innodb-cluster (stable/jammy)

#9

Reviewed: https://review.opendev.org/c/openstack/charm-mysql-innodb-cluster/+/884826
Committed: https://opendev.org/openstack/charm-mysql-innodb-cluster/commit/f24dfa710b97cd907cd10b7a3a803e9c05ce47b0
Submitter: "Zuul (22348)"
Branch: stable/jammy

commit f24dfa710b97cd907cd10b7a3a803e9c05ce47b0
Author: Alex Kavanagh <email address hidden>
Date: Wed May 31 11:48:08 2023 +0100

Set relay_log_recovery = ON by default

    This is to aid in recoverying nodes from unexpected halts/power failures
    and ease their re-introduction into the cluster. It ensures that
    potentially corrupted relay log files are discarded.

    Change-Id: Iaf6ba030001bc458abf5536454745cae905aceee
    Closes-Bug: #1938708
    (cherry picked from commit 21494229747a9d4013d360b57ccd3de7749f0ffb)

	Status	Importance	Assigned to
MySQL InnoDB Cluster Charm	Status tracked in Trunk
Jammy	Fix Committed	High	Unassigned
Trunk	Fix Committed	High	Unassigned

MySQL InnoDB Cluster Charm

Unit stays blocked after machine powerdown and reboot

Bug Description

Duplicates of this bug

Other bug subscribers

Remote bug watches