Unit stays blocked after machine powerdown and reboot

Bug #1938708 reported by Alexander Balderson
16
This bug affects 2 people
Affects Status Importance Assigned to Milestone
MySQL InnoDB Cluster Charm
Status tracked in Trunk
Jammy
Fix Committed
High
Unassigned
Trunk
Fix Committed
High
Unassigned

Bug Description

Solutions QA has started testing deployments which simulate sudden power-outages for an AZ.

During these tests we've found that mysql-innodb-cluster on the powered down machine fails to come back to "active" and instead goes into blocked, reporting "Cluster is inaccessible from this instance. Please check logs for details."

I didn't see anything that stood out in the logs about the inaccessibility, and a perhaps a more trained eye can identify where the short coming is.

Testruns which have hit this issue can be found at:
https://solutions.qa.canonical.com/bugs/bugs/bug/1938708

description: updated
Revision history for this message
Alex Kavanagh (ajkavanagh) wrote :

There are a number of similar bugs to this one open for mysql-innodb-cluster:

- Deployment blocked with "Cluster is inaccessible from this instance" (https://bugs.launchpad.net/charm-mysql-innodb-cluster/+bug/1926449)
- Stuck on "Cluster has no quorum as visible from <leader_ip> and cannot process write transactions. 2 members are not active" (https://bugs.launchpad.net/charm-mysql-innodb-cluster/+bug/1917332)

See more at: https://bugs.launchpad.net/charm-mysql-innodb-cluster

I think it boils down to: the payload mysql-innodb-cluster is either a) currently not configured in a way that recovers from cluster outages, b) isn't actually very good at coming back from cluster outages (i.e. a problem in the product).

Looking at https://dev.mysql.com/doc/mysql-shell/8.0/en/troubleshooting-innodb-cluster.html it seems it might be tricky to actually fully automate this.

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

A reproducer for this is quite simple: forcefully powering of a follower (replica) unit and powering it back on results in a unit being blocked with mysql starting but reporting the following in the error.log:

* [Server] Error reading GTIDs from relaylog: -1
* [Repl] Error reading relay log event for channel 'group_replication_applier': corrupted data in log event
* [Repl] Slave SQL for channel 'group_replication_applier': Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, the server was unable to fetch a keyring key required to open an encrypted relay log file, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Error_code: MY-013121
* [Repl] Plugin group_replication reported: 'The applier thread execution was aborted. Unable to process more transactions, this member will now leave the group.'
* [Repl] Plugin group_replication reported: 'Fatal error during execution on the Applier process of Group Replication. The server will now leave the group.'
* [Repl] Plugin group_replication reported: '[GCS] The member is already leaving or joining a group.'
* [Repl] Plugin group_replication reported: 'Unable to confirm whether the server has left the group or not. Check performance_schema.replication_group_members to check group membership information.'
* [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
* [Repl] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'FIRST' position 0
2022-07-18T14:51:17.013502Z 0 [Warning] [MY-011735] [Repl] Plugin group_replication reported: '[GCS] Shutting down an outgoing connection. This happens because something might be wrong on a bi-directional connection to node 10.10.20.32:33061. Please check the connection status to this member'
* [Repl] Plugin group_replication reported: 'There was a previous plugin error while the member joined the group. The member will now exit the group.'

https://paste.ubuntu.com/p/8xPkRDXHDB/ (juju status and error.log before and after power events)

https://paste.ubuntu.com/p/dq2kr7znzv/ (full juju show-status-log for the affected unit)

18 Jul 2022 17:51:23+03:00 juju-unit executing running start hook
18 Jul 2022 17:51:56+03:00 juju-unit executing running leader-settings-changed hook
18 Jul 2022 17:52:24+03:00 juju-unit idle
18 Jul 2022 18:21:55+03:00 workload blocked Cluster is inaccessible from this instance. Please check logs for details.

https://paste.ubuntu.com/p/H83HdMcVQK/ (cluster-status action output)

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

In order to fix this we can make the followers crash-tolerant which is currently not done:

https://dev.mysql.com/doc/refman/8.0/en/replication-solutions-unexpected-replica-halt.html
"In order for replication to be resilient to unexpected halts of the server (sometimes described as crash-safe) it must be possible for the replica to recover its state before halting. This section describes the impact of an unexpected halt of a replica during replication, and how to configure a replica for the best chance of recovery to continue replication."

The right option isn't explicitly enabled and is set to OFF by default:

"Set relay_log_recovery = ON, which enables automatic relay log recovery immediately following server startup. This global variable defaults to OFF and is read-only at runtime, but you can set it to ON with the --relay-log-recovery option at replica startup following an unexpected halt of a replica. Note that this setting ignores the existing relay log files, in case they are corrupted or inconsistent. The relay log recovery process starts a new relay log file and fetches transactions from the source beginning at the replication SQL thread position recorded in the applier metadata repository. The previous relay log files are removed over time by the replica's normal purge mechanism."

This could also be enabled to make the setup safer (but slower):

"Set sync_relay_log=1, which instructs the replication receiver thread to synchronize the relay log to disk after each received transaction is written to it.

This means the replica's record of the current position read from the source's binary log (in the applier metadata repository) is never ahead of the record of transactions saved in the relay log. Note that although this setting is the safest, it is also the slowest due to the number of disk writes involved."

Revision history for this message
Dmitrii Shcherbakov (dmitriis) wrote :

Doing the following:

sudo bash -c 'echo "relay_log_recovery = ON" >> /etc/mysql/mysql.conf.d/mysqld.cnf'
sudo systemctl restart mysql.service

On an affected unit helps.

And then a unit properly comes up: https://paste.ubuntu.com/p/bFTQCBjPYs/

I suggest turning this feature on by default.

Revision history for this message
Russell Stather (russellstather) wrote :

I agree, this fixed my issues perfectly.

Changed in charm-mysql-innodb-cluster:
status: New → Triaged
importance: Undecided → High
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-mysql-innodb-cluster (master)
Changed in charm-mysql-innodb-cluster:
status: Triaged → In Progress
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-mysql-innodb-cluster (master)

Reviewed: https://review.opendev.org/c/openstack/charm-mysql-innodb-cluster/+/884871
Committed: https://opendev.org/openstack/charm-mysql-innodb-cluster/commit/21494229747a9d4013d360b57ccd3de7749f0ffb
Submitter: "Zuul (22348)"
Branch: master

commit 21494229747a9d4013d360b57ccd3de7749f0ffb
Author: Alex Kavanagh <email address hidden>
Date: Wed May 31 11:48:08 2023 +0100

    Set relay_log_recovery = ON by default

    This is to aid in recoverying nodes from unexpected halts/power failures
    and ease their re-introduction into the cluster. It ensures that
    potentially corrupted relay log files are discarded.

    Change-Id: Iaf6ba030001bc458abf5536454745cae905aceee
    Closes-Bug: #1938708

Changed in charm-mysql-innodb-cluster:
status: In Progress → Fix Committed
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix proposed to charm-mysql-innodb-cluster (stable/jammy)
Revision history for this message
OpenStack Infra (hudson-openstack) wrote : Fix merged to charm-mysql-innodb-cluster (stable/jammy)

Reviewed: https://review.opendev.org/c/openstack/charm-mysql-innodb-cluster/+/884826
Committed: https://opendev.org/openstack/charm-mysql-innodb-cluster/commit/f24dfa710b97cd907cd10b7a3a803e9c05ce47b0
Submitter: "Zuul (22348)"
Branch: stable/jammy

commit f24dfa710b97cd907cd10b7a3a803e9c05ce47b0
Author: Alex Kavanagh <email address hidden>
Date: Wed May 31 11:48:08 2023 +0100

    Set relay_log_recovery = ON by default

    This is to aid in recoverying nodes from unexpected halts/power failures
    and ease their re-introduction into the cluster. It ensures that
    potentially corrupted relay log files are discarded.

    Change-Id: Iaf6ba030001bc458abf5536454745cae905aceee
    Closes-Bug: #1938708
    (cherry picked from commit 21494229747a9d4013d360b57ccd3de7749f0ffb)

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Duplicates of this bug

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.