promote_to_replica_source fails for MariaDB replica sets because of diverged slaves
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
OpenStack DBaaS (Trove) | Fix Released | Medium | Zhao Chao |
Bug Description
We experienced continuous failures in MariaDB replication promotion gate checking jobs. From guestagent.log we get:
2018-03-08 17:35:50.920302 | primary | 2018-03-08 17:35:50.919 | 2018-03-08 17:29:01.907 1252 ERROR oslo.service.
2018-03-08 17:35:50.921771 | primary | 2018-03-08 17:35:50.921 | 2018-03-08 17:29:01.907 1252 ERROR oslo.service.
2018-03-08 17:35:50.923789 | primary | 2018-03-08 17:35:50.923 | 2018-03-08 17:29:01.907 1252 ERROR oslo.service.
2018-03-08 17:35:50.925991 | primary | 2018-03-08 17:35:50.925 | 2018-03-08 17:29:01.907 1252 ERROR oslo.service.
2018-03-08 17:35:50.927792 | primary | 2018-03-08 17:35:50.927 | 2018-03-08 17:29:01.907 1252 ERROR oslo.service.
2018-03-08 17:35:50.929497 | primary | 2018-03-08 17:35:50.929 | 2018-03-08 17:29:01.907 1252 ERROR oslo.service.
2018-03-08 17:35:50.931278 | primary | 2018-03-08 17:35:50.930 | 2018-03-08 17:29:01.907 1252 ERROR oslo.service.
2018-03-08 17:35:50.932910 | primary | 2018-03-08 17:35:50.932 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagen
2018-03-08 17:35:50.934723 | primary | 2018-03-08 17:35:50.934 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagen
2018-03-08 17:35:50.936547 | primary | 2018-03-08 17:35:50.936 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagen
2018-03-08 17:35:50.938049 | primary | 2018-03-08 17:35:50.937 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagen
2018-03-08 17:35:50.939691 | primary | 2018-03-08 17:35:50.939 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagen
2018-03-08 17:35:50.941526 | primary | 2018-03-08 17:35:50.941 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagen
2018-03-08 17:35:50.943437 | primary | 2018-03-08 17:35:50.942 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagen
2018-03-08 17:35:50.945343 | primary | 2018-03-08 17:35:50.944 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagen
2018-03-08 17:35:50.946868 | primary | 2018-03-08 17:35:50.946 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagen
2018-03-08 17:35:50.949248 | primary | 2018-03-08 17:35:50.948 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagen
2018-03-08 17:35:50.950810 | primary | 2018-03-08 17:35:50.950 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagen
2018-03-08 17:35:50.952172 | primary | 2018-03-08 17:35:50.951 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagen
2018-03-08 17:35:50.953739 | primary | 2018-03-08 17:35:50.953 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagen
2018-03-08 17:35:50.955513 | primary | 2018-03-08 17:35:50.955 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagen
2018-03-08 17:35:50.957501 | primary | 2018-03-08 17:35:50.956 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagen
2018-03-08 17:35:50.959606 | primary | 2018-03-08 17:35:50.959 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagen
2018-03-08 17:35:50.962183 | primary | 2018-03-08 17:35:50.961 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagen
2018-03-08 17:35:50.965605 | primary | 2018-03-08 17:35:50.965 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagen
2018-03-08 17:35:50.967957 | primary | 2018-03-08 17:35:50.967 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagen
2018-03-08 17:35:50.970521 | primary | 2018-03-08 17:35:50.970 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagen
2018-03-08 17:35:50.973019 | primary | 2018-03-08 17:35:50.972 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagen
2018-03-08 17:35:50.975192 | primary | 2018-03-08 17:35:50.974 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagen
2018-03-08 17:35:50.977117 | primary | 2018-03-08 17:35:50.976 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagen
2018-03-08 17:35:50.979119 | primary | 2018-03-08 17:35:50.978 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagen
2018-03-08 17:35:50.981080 | primary | 2018-03-08 17:35:50.980 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagen
2018-03-08 17:35:50.983035 | primary | 2018-03-08 17:35:50.982 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagen
2018-03-08 17:35:50.985002 | primary | 2018-03-08 17:35:50.984 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagen
2018-03-08 17:35:50.986827 | primary | 2018-03-08 17:35:50.986 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagen
2018-03-08 17:35:50.988458 | primary | 2018-03-08 17:35:50.987 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagen
2018-03-08 17:35:50.990036 | primary | 2018-03-08 17:35:50.989 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagen
2018-03-08 17:35:50.991663 | primary | 2018-03-08 17:35:50.991 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagen
2018-03-08 17:35:50.993707 | primary | 2018-03-08 17:35:50.993 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagen
2018-03-08 17:35:50.996002 | primary | 2018-03-08 17:35:50.995 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagen
2018-03-08 17:35:50.998089 | primary | 2018-03-08 17:35:50.997 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagen
2018-03-08 17:35:51.000218 | primary | 2018-03-08 17:35:50.999 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagen
2018-03-08 17:35:51.002352 | primary | 2018-03-08 17:35:51.001 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagen
2018-03-08 17:35:51.003982 | primary | 2018-03-08 17:35:51.003 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagen
2018-03-08 17:35:51.008072 | primary | 2018-03-08 17:35:51.007 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagen
2018-03-08 17:35:51.009827 | primary | 2018-03-08 17:35:51.009 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagen
2018-03-08 17:35:51.011906 | primary | 2018-03-08 17:35:51.011 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagen
2018-03-08 17:35:51.014024 | primary | 2018-03-08 17:35:51.013 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagen
2018-03-08 17:35:51.015689 | primary | 2018-03-08 17:35:51.015 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagen
2018-03-08 17:35:51.017505 | primary | 2018-03-08 17:35:51.017 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagen
2018-03-08 17:35:51.019139 | primary | 2018-03-08 17:35:51.018 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagen
2018-03-08 17:35:51.021165 | primary | 2018-03-08 17:35:51.020 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagen
2018-03-08 17:35:51.022993 | primary | 2018-03-08 17:35:51.022 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagen
2018-03-08 17:35:51.031877 | primary | 2018-03-08 17:35:51.024 | 2018-03-08 17:29:02.009 1252 DEBUG trove.guestagen
However, this is not the root cause. Digging into the logs of the MariaDB service, we see something like:
Mar 08 06:16:00 zc-test-
ter_host='', master_port='3306', master_log_file='', master_log_pos='4'. New state master_
master_
Mar 08 06:16:00 zc-test-
s
Mar 08 06:16:00 zc-test-
ion in log 'FIRST' at position 4, relay log '/var/lib/
Mar 08 06:16:15 zc-test-
d0eae5@
Mar 08 06:16:16 zc-test-
Mar 08 06:16:16 zc-test-
Mar 08 06:16:16 zc-test-
Mar 08 08:21:04 zc-test-
Mar 08 08:21:04 zc-test-
So the problem is that slaves diverge while the master node is being changed. This happens because we attach the old master to the new master before attaching the other replicas to the new master. New GTIDs may be created on the old master after it is attached, and by chance they may be synced to some of the other replicas (the other replicas are still connected to the old master, and MariaDB allows an instance to be a master and a slave simultaneously).
This can be fixed by first attaching the other replicas to the new master, and then dealing with the old master.
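The effect of the ordering can be illustrated with a small toy model. This is only a sketch under stated assumptions: the `Node` class, its `attach_to`, `detach`, and `write` methods, the `simulate` function, and the GTID values are all hypothetical names made up for illustration and are not Trove or MariaDB APIs. The model treats a replica as diverged when it holds a GTID that the new master does not have, and it represents the extra GTID as a stray write on the old master after it has been re-attached.

```python
from typing import List, Optional, Set


class Node:
    """Toy replication node (hypothetical; not Trove or MariaDB code)."""

    def __init__(self, name: str, gtids: Set[str]) -> None:
        self.name = name
        self.gtids = set(gtids)               # GTIDs this node has executed
        self.source: Optional["Node"] = None  # node we replicate from
        self.replicas: List["Node"] = []      # nodes replicating from us

    def detach(self) -> None:
        """Stop replicating from the current source, if any."""
        if self.source is not None:
            self.source.replicas.remove(self)
            self.source = None

    def attach_to(self, source: "Node") -> bool:
        """Re-point this node at `source`; return False if it has diverged."""
        self.detach()
        if not self.gtids <= source.gtids:
            return False  # holds a GTID that the new source never had
        self.source = source
        source.replicas.append(self)
        return True

    def write(self, gtid: str) -> None:
        """Execute a local transaction and ship it to direct replicas."""
        self.gtids.add(gtid)
        for replica in self.replicas:
            replica.gtids.add(gtid)


def simulate(replicas_first: bool) -> bool:
    """Return True if the remaining replica attaches to the new master."""
    old, cand, other = (Node(n, {"0-1-1"}) for n in ("old", "cand", "other"))
    cand.attach_to(old)
    other.attach_to(old)

    cand.detach()  # the candidate becomes the new master
    if replicas_first:
        ok = other.attach_to(cand)  # move the other replica first
        old.attach_to(cand)
        old.write("0-1-2")          # stray GTID has no replica to reach
    else:
        old.attach_to(cand)
        old.write("0-1-2")          # stray GTID is synced to `other`
        ok = other.attach_to(cand)
    return ok
```

In this model, `simulate(replicas_first=False)` reproduces the failure described above, while `simulate(replicas_first=True)` lets the remaining replica attach cleanly because it has already left the old master when the extra GTID appears.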
Changed in trove:
assignee: nobody → Zhao Chao (zhaochao1984)
importance: Undecided → Medium
status: New → Confirmed
Changed in trove:
status: Confirmed → In Progress
Reviewed: https://review.openstack.org/550768
Committed: https://git.openstack.org/cgit/openstack/trove/commit/?id=5895cf0ee99022e2910ec8c72393fe998f5860f8
Submitter: Zuul
Branch: master
commit 5895cf0ee99022e2910ec8c72393fe998f5860f8
Author: Zhao Chao <email address hidden>
Date: Thu Mar 8 17:09:14 2018 +0800
Avoid diverged slave when migrating MariaDB master
When promoting one slave to the new master in a replication group,
previously the old master will be attached to the new one right after
the new master is on. For MariaDB, attaching the old master to the new
one, new GTID may be created on the old master and also may be synced
to some of the other replicas, as they're still connecting to the old
master. The new GTID does not exists in the new master, making these
slaves diverged from the master. After that, when the diverged slave
connects to the new master, 'START SLAVE' will fail with logs like:
[ERROR] Error reading packet from server: Error: connecting slave
requested to start from GTID X-XXXXXXXXXX-XX, which is not in the
master's binlog. Since the master's binlog contains GTIDs with
higher sequence numbers, it probably means that the slave has
diverged due to executing extra erroneous transactions
(server_errno=1236)
And these slaves will be left orphan and errored after
promote_to_replica_source finishs.
Attaching the other replicas to the new master before dealing with the
old master will fix this problem and the failure of the
trove-scenario-mariadb-multi Zuul job as well.
Closes-Bug: #1754539
Change-Id: Ib9c01b07c832f117f712fd613ae55c7de3561116
Signed-off-by: Zhao Chao <email address hidden>