MySQL InnoDB Cluster Charm

(latest/edge) prometheus-relation-joined hook can cause a mysql error during deployment

Bug #2018385 reported by Alex Kavanagh on 2023-05-03

Affects	Status	Importance	Assigned to	Milestone
MySQL InnoDB Cluster Charm	Fix Committed	Undecided	Unassigned
Jammy	New	Undecided	Unassigned

Bug Description

During the deployment of a cluster in the gate, a race condition can cause the prometheus-relation-joined hook to fail with a MySQL error in the `create_user` method.

The underlying issue is that if a transaction is committed whilst the cluster member is recovering Group Replication, the commit fails hard with the following error:

MySQLdb.OperationalError: (3100, "Error on observer while running replication hook 'before_commit'.")

The trace from the error.log file provides more details:

2023-04-22T07:59:58.834202Z 0 [System] [MY-011565] [Repl] Plugin group_replication reported: 'Setting super_read_only=ON.'
2023-04-22T07:59:58.834268Z 0 [Warning] [MY-013469] [Repl] Plugin group_replication reported: 'This member will start distributed recovery using clone. It is due to the number of missing transactions being higher than the configured threshold of 1.'
2023-04-22T07:59:59.836244Z 0 [System] [MY-013471] [Repl] Plugin group_replication reported: 'Distributed recovery will transfer data using: Cloning from a remote group donor.'
2023-04-22T07:59:59.837981Z 0 [System] [MY-011503] [Repl] Plugin group_replication reported: 'Group membership changed to 172.16.0.174:3306, 172.16.0.245:3306, 172.16.0.101:3306 on view 16821495149807797:5.'
2023-04-22T07:59:59.840913Z 38 [System] [MY-011566] [Repl] Plugin group_replication reported: 'Setting super_read_only=OFF.'
2023-04-22T08:00:00.285589Z 39 [Warning] [MY-013460] [InnoDB] Clone removing all user data for provisioning: Started
2023-04-22T08:00:00.550852Z 39 [Warning] [MY-013460] [InnoDB] Clone removing all user data for provisioning: Finished
2023-04-22T08:00:01.404748Z 41 [ERROR] [MY-011600] [Repl] Plugin group_replication reported: 'Transaction cannot be executed while Group Replication is recovering. Try again when the server is ONLINE.'

Essentially, what seems to be happening is that the prometheus-relation-joined hook fires shortly after the vault-relation-joined hook. The vault hook had switched the instance from non-TLS to TLS (using the cert from vault), which caused Group Replication to be restarted, so the commit arrives while the member is still recovering.
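
The error log's hint ("Try again when the server is ONLINE") suggests one way to observe the condition directly. The following is only a sketch, not what the charm does: it assumes a DB-API style cursor connected to the local instance, and the function name `local_member_is_online` is hypothetical. It reads the member state from the `performance_schema.replication_group_members` table.

```python
# Query the Group Replication state of the server we are connected to.
# MEMBER_STATE is reported as e.g. ONLINE or RECOVERING.
MEMBER_STATE_QUERY = (
    "SELECT MEMBER_STATE "
    "FROM performance_schema.replication_group_members "
    "WHERE MEMBER_ID = @@server_uuid"
)


def local_member_is_online(cursor):
    """Return True only if the local member reports MEMBER_STATE 'ONLINE'.

    `cursor` is assumed to be a DB-API cursor (execute/fetchone) on the
    local MySQL instance. A missing row (for example, Group Replication
    not running on this server) is treated as "not online".
    """
    cursor.execute(MEMBER_STATE_QUERY)
    row = cursor.fetchone()
    return row is not None and row[0] == "ONLINE"
```

A check like this is inherently racy (the state can change between the check and the commit), so it would complement rather than replace handling the 3100 error itself.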

Possible solution:
------------------

The solution is to retry the commit several times if the 3100 error occurs (to allow Group Replication to finish recovering) and, if it still fails, return False so that the handler will try again on the next hook execution. This would allow the unit to recover gracefully from the error.
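
As a rough illustration of this idea (not the charm's actual code), the sketch below retries an operation while it fails with error 3100 and returns False once the attempts are exhausted. The names `run_with_retry`, `is_before_commit_error` and `BEFORE_COMMIT_HOOK_ERROR` are hypothetical; the defaults mirror the 6 attempts at 10-second intervals described in the merged commit message further down. It assumes the exception carries the MySQL error code as its first argument, as in the `MySQLdb.OperationalError` shown above.

```python
import time

# Error code from the report: "Error on observer while running
# replication hook 'before_commit'." (constant name is hypothetical)
BEFORE_COMMIT_HOOK_ERROR = 3100


def is_before_commit_error(exc):
    """Return True if exc carries error code 3100 as its first argument."""
    return bool(exc.args) and exc.args[0] == BEFORE_COMMIT_HOOK_ERROR


def run_with_retry(operation, attempts=6, delay=10, sleep=time.sleep):
    """Call operation(), retrying while it fails with error 3100.

    Returns True on success and False if every attempt hit error 3100.
    Any other exception propagates unchanged. In a real charm the
    except clause would more likely name MySQLdb.OperationalError;
    Exception is used here so the sketch has no third-party imports.
    """
    for attempt in range(1, attempts + 1):
        try:
            operation()
            return True
        except Exception as exc:
            if not is_before_commit_error(exc):
                raise
            if attempt < attempts:
                # Give Group Replication time to finish recovering.
                sleep(delay)
    return False
```

A caller would wrap the statement execution and commit in `operation` and treat a False result as "user not created yet", leaving the handler incomplete so that it runs again on the next hook.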

Seyeong Kim (seyeongkim) on 2023-05-11

tags:

added: sts

Alex Kavanagh (ajkavanagh) on 2023-05-11

Changed in charm-mysql-innodb-cluster:
assignee:	nobody → Alex Kavanagh (ajkavanagh)

OpenStack Infra (hudson-openstack) wrote on 2023-05-16: Fix proposed to charm-mysql-innodb-cluster (master)

Fix proposed to branch: master
Review: https://review.opendev.org/c/openstack/charm-mysql-innodb-cluster/+/883300

Changed in charm-mysql-innodb-cluster:
status:	New → In Progress

OpenStack Infra (hudson-openstack) wrote on 2023-05-17: Fix merged to charm-mysql-innodb-cluster (master)

Reviewed: https://review.opendev.org/c/openstack/charm-mysql-innodb-cluster/+/883300
Committed: https://opendev.org/openstack/charm-mysql-innodb-cluster/commit/5fb5a05be128a2ec6a912394aacd14b44eb82998
Submitter: "Zuul (22348)"
Branch: master

commit 5fb5a05be128a2ec6a912394aacd14b44eb82998
Author: Alex Kavanagh <email address hidden>
Date: Tue May 16 20:21:43 2023 +0100

Wait for Group Replication to finish; 3100 before commit error

    The bug is triggered, as a race, usually by the
    prometheus-relation-joined hook, when it tries to create a user whilst
    Group Replication is recovering during a rolling restart. This patch
    alters the create_user() method so that it detects the failure condition
    and then retries for up to a minute (6 times, every 10 seconds) for the
    Group Replication to recover before giving up and returning False
    (indicating that the user was not created). This will usually result in
    the handler not completing during the hook, and then retrying on the
    next hook.

Change-Id: I5df4fd5ecbdd2b7bce525a9930dcffbc5868cbb8
Closes-Bug: #2018385

Changed in charm-mysql-innodb-cluster:
status:	In Progress → Fix Committed

Alex Kavanagh (ajkavanagh) on 2024-07-15

Changed in charm-mysql-innodb-cluster:
assignee:	Alex Kavanagh (ajkavanagh) → nobody
