promote_to_replica_source fails for MariaDB replica sets because of diverged slaves

Bug #1754539 reported by Zhao Chao
Affects: OpenStack DBaaS (Trove)
Status: Fix Released
Importance: Medium
Assigned to: Zhao Chao

Bug Description

We experienced continuous failures in the MariaDB replication promotion gate check jobs. From guestagent.log we get:

2018-03-08 17:35:50.920302 | primary | 2018-03-08 17:35:50.919 | 2018-03-08 17:29:01.907 1252 ERROR oslo.service.loopingcall Traceback (most recent call last):
2018-03-08 17:35:50.921771 | primary | 2018-03-08 17:35:50.921 | 2018-03-08 17:29:01.907 1252 ERROR oslo.service.loopingcall File "/usr/local/lib/python2.7/dist-packages/oslo_service/loopingcall.py", line 141, in _run_loop
2018-03-08 17:35:50.923789 | primary | 2018-03-08 17:35:50.923 | 2018-03-08 17:29:01.907 1252 ERROR oslo.service.loopingcall idle = idle_for_func(result, watch.elapsed())
2018-03-08 17:35:50.925991 | primary | 2018-03-08 17:35:50.925 | 2018-03-08 17:29:01.907 1252 ERROR oslo.service.loopingcall File "/usr/local/lib/python2.7/dist-packages/oslo_service/loopingcall.py", line 338, in _idle_for
2018-03-08 17:35:50.927792 | primary | 2018-03-08 17:35:50.927 | 2018-03-08 17:29:01.907 1252 ERROR oslo.service.loopingcall % self._error_time)
2018-03-08 17:35:50.929497 | primary | 2018-03-08 17:35:50.929 | 2018-03-08 17:29:01.907 1252 ERROR oslo.service.loopingcall LoopingCallTimeOut: Looping call timed out after 57.48 seconds
2018-03-08 17:35:50.931278 | primary | 2018-03-08 17:35:50.930 | 2018-03-08 17:29:01.907 1252 ERROR oslo.service.loopingcall
2018-03-08 17:35:50.932910 | primary | 2018-03-08 17:35:50.932 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagent.strategies.replication.mysql_base [-] Exception enabling guest as replica: LoopingCallTimeOut: Looping call timed out after 57.48 seconds
2018-03-08 17:35:50.934723 | primary | 2018-03-08 17:35:50.934 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagent.strategies.replication.mysql_base Traceback (most recent call last):
2018-03-08 17:35:50.936547 | primary | 2018-03-08 17:35:50.936 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagent.strategies.replication.mysql_base File "/home/ubuntu/trove/trove/guestagent/strategies/replication/mysql_base.py", line 138, in enable_as_slave
2018-03-08 17:35:50.938049 | primary | 2018-03-08 17:35:50.937 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagent.strategies.replication.mysql_base self.connect_to_master(service, snapshot)
2018-03-08 17:35:50.939691 | primary | 2018-03-08 17:35:50.939 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagent.strategies.replication.mysql_base File "/home/ubuntu/trove/trove/guestagent/strategies/replication/experimental/mariadb_gtid.py", line 67, in connect_to_master
2018-03-08 17:35:50.941526 | primary | 2018-03-08 17:35:50.941 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagent.strategies.replication.mysql_base service.start_slave()
2018-03-08 17:35:50.943437 | primary | 2018-03-08 17:35:50.942 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagent.strategies.replication.mysql_base File "/home/ubuntu/trove/trove/guestagent/datastore/mysql_common/service.py", line 937, in start_slave
2018-03-08 17:35:50.945343 | primary | 2018-03-08 17:35:50.944 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagent.strategies.replication.mysql_base self._wait_for_slave_status("ON", client, 60)
2018-03-08 17:35:50.946868 | primary | 2018-03-08 17:35:50.946 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagent.strategies.replication.mysql_base File "/home/ubuntu/trove/trove/guestagent/datastore/mysql_common/service.py", line 969, in _wait_for_slave_status
2018-03-08 17:35:50.949248 | primary | 2018-03-08 17:35:50.948 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagent.strategies.replication.mysql_base time_out=max_time)
2018-03-08 17:35:50.950810 | primary | 2018-03-08 17:35:50.950 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagent.strategies.replication.mysql_base File "/home/ubuntu/trove/trove/common/utils.py", line 214, in poll_until
2018-03-08 17:35:50.952172 | primary | 2018-03-08 17:35:50.951 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagent.strategies.replication.mysql_base sleep_time=sleep_time, time_out=time_out).wait()
2018-03-08 17:35:50.953739 | primary | 2018-03-08 17:35:50.953 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagent.strategies.replication.mysql_base File "/usr/local/lib/python2.7/dist-packages/eventlet/event.py", line 121, in wait
2018-03-08 17:35:50.955513 | primary | 2018-03-08 17:35:50.955 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagent.strategies.replication.mysql_base return hubs.get_hub().switch()
2018-03-08 17:35:50.957501 | primary | 2018-03-08 17:35:50.956 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagent.strategies.replication.mysql_base File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 294, in switch
2018-03-08 17:35:50.959606 | primary | 2018-03-08 17:35:50.959 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagent.strategies.replication.mysql_base return self.greenlet.switch()
2018-03-08 17:35:50.962183 | primary | 2018-03-08 17:35:50.961 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagent.strategies.replication.mysql_base File "/usr/local/lib/python2.7/dist-packages/oslo_service/loopingcall.py", line 141, in _run_loop
2018-03-08 17:35:50.965605 | primary | 2018-03-08 17:35:50.965 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagent.strategies.replication.mysql_base idle = idle_for_func(result, watch.elapsed())
2018-03-08 17:35:50.967957 | primary | 2018-03-08 17:35:50.967 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagent.strategies.replication.mysql_base File "/usr/local/lib/python2.7/dist-packages/oslo_service/loopingcall.py", line 338, in _idle_for
2018-03-08 17:35:50.970521 | primary | 2018-03-08 17:35:50.970 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagent.strategies.replication.mysql_base % self._error_time)
2018-03-08 17:35:50.973019 | primary | 2018-03-08 17:35:50.972 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagent.strategies.replication.mysql_base LoopingCallTimeOut: Looping call timed out after 57.48 seconds
2018-03-08 17:35:50.975192 | primary | 2018-03-08 17:35:50.974 | 2018-03-08 17:29:01.955 1252 ERROR trove.guestagent.strategies.replication.mysql_base
2018-03-08 17:35:50.977117 | primary | 2018-03-08 17:35:50.976 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagent.datastore.mysql_common.manager [-] Error enabling replication.: LoopingCallTimeOut: Looping call timed out after 57.48 seconds
2018-03-08 17:35:50.979119 | primary | 2018-03-08 17:35:50.978 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagent.datastore.mysql_common.manager Traceback (most recent call last):
2018-03-08 17:35:50.981080 | primary | 2018-03-08 17:35:50.980 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagent.datastore.mysql_common.manager File "/home/ubuntu/trove/trove/guestagent/datastore/mysql_common/manager.py", line 415, in attach_replica
2018-03-08 17:35:50.983035 | primary | 2018-03-08 17:35:50.982 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagent.datastore.mysql_common.manager self.replication.enable_as_slave(app, replica_info, slave_config)
2018-03-08 17:35:50.985002 | primary | 2018-03-08 17:35:50.984 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagent.datastore.mysql_common.manager File "/home/ubuntu/trove/trove/guestagent/strategies/replication/mysql_base.py", line 138, in enable_as_slave
2018-03-08 17:35:50.986827 | primary | 2018-03-08 17:35:50.986 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagent.datastore.mysql_common.manager self.connect_to_master(service, snapshot)
2018-03-08 17:35:50.988458 | primary | 2018-03-08 17:35:50.987 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagent.datastore.mysql_common.manager File "/home/ubuntu/trove/trove/guestagent/strategies/replication/experimental/mariadb_gtid.py", line 67, in connect_to_master
2018-03-08 17:35:50.990036 | primary | 2018-03-08 17:35:50.989 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagent.datastore.mysql_common.manager service.start_slave()
2018-03-08 17:35:50.991663 | primary | 2018-03-08 17:35:50.991 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagent.datastore.mysql_common.manager File "/home/ubuntu/trove/trove/guestagent/datastore/mysql_common/service.py", line 937, in start_slave
2018-03-08 17:35:50.993707 | primary | 2018-03-08 17:35:50.993 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagent.datastore.mysql_common.manager self._wait_for_slave_status("ON", client, 60)
2018-03-08 17:35:50.996002 | primary | 2018-03-08 17:35:50.995 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagent.datastore.mysql_common.manager File "/home/ubuntu/trove/trove/guestagent/datastore/mysql_common/service.py", line 969, in _wait_for_slave_status
2018-03-08 17:35:50.998089 | primary | 2018-03-08 17:35:50.997 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagent.datastore.mysql_common.manager time_out=max_time)
2018-03-08 17:35:51.000218 | primary | 2018-03-08 17:35:50.999 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagent.datastore.mysql_common.manager File "/home/ubuntu/trove/trove/common/utils.py", line 214, in poll_until
2018-03-08 17:35:51.002352 | primary | 2018-03-08 17:35:51.001 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagent.datastore.mysql_common.manager sleep_time=sleep_time, time_out=time_out).wait()
2018-03-08 17:35:51.003982 | primary | 2018-03-08 17:35:51.003 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagent.datastore.mysql_common.manager File "/usr/local/lib/python2.7/dist-packages/eventlet/event.py", line 121, in wait
2018-03-08 17:35:51.008072 | primary | 2018-03-08 17:35:51.007 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagent.datastore.mysql_common.manager return hubs.get_hub().switch()
2018-03-08 17:35:51.009827 | primary | 2018-03-08 17:35:51.009 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagent.datastore.mysql_common.manager File "/usr/local/lib/python2.7/dist-packages/eventlet/hubs/hub.py", line 294, in switch
2018-03-08 17:35:51.011906 | primary | 2018-03-08 17:35:51.011 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagent.datastore.mysql_common.manager return self.greenlet.switch()
2018-03-08 17:35:51.014024 | primary | 2018-03-08 17:35:51.013 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagent.datastore.mysql_common.manager File "/usr/local/lib/python2.7/dist-packages/oslo_service/loopingcall.py", line 141, in _run_loop
2018-03-08 17:35:51.015689 | primary | 2018-03-08 17:35:51.015 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagent.datastore.mysql_common.manager idle = idle_for_func(result, watch.elapsed())
2018-03-08 17:35:51.017505 | primary | 2018-03-08 17:35:51.017 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagent.datastore.mysql_common.manager File "/usr/local/lib/python2.7/dist-packages/oslo_service/loopingcall.py", line 338, in _idle_for
2018-03-08 17:35:51.019139 | primary | 2018-03-08 17:35:51.018 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagent.datastore.mysql_common.manager % self._error_time)
2018-03-08 17:35:51.021165 | primary | 2018-03-08 17:35:51.020 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagent.datastore.mysql_common.manager LoopingCallTimeOut: Looping call timed out after 57.48 seconds
2018-03-08 17:35:51.022993 | primary | 2018-03-08 17:35:51.022 | 2018-03-08 17:29:01.990 1252 ERROR trove.guestagent.datastore.mysql_common.manager
2018-03-08 17:35:51.031877 | primary | 2018-03-08 17:35:51.024 | 2018-03-08 17:29:02.009 1252 DEBUG trove.guestagent.datastore.service [-] Casting set_status message to conductor (status is 'failed to spawn'). set_status /home/ubuntu/trove/trove/guestagent/datastore/service.py:166

However, this is not the root cause. Diving into the MariaDB service logs, we can see something like:

Mar 08 06:16:00 zc-test-mariadb-replica2-2 mysqld[3598]: 2018-03-08 6:16:00 140044504475392 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='', master_port='3306', master_log_file='', master_log_pos='4'. New state master_host='10.0.0.70', master_port='3306', master_log_file='', master_log_pos='4'.
Mar 08 06:16:00 zc-test-mariadb-replica2-2 mysqld[3598]: 2018-03-08 6:16:00 140044504475392 [Note] Previous Using_Gtid=No. New Using_Gtid=Slave_Pos
Mar 08 06:16:00 zc-test-mariadb-replica2-2 mysqld[3598]: 2018-03-08 6:16:00 140044504073984 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 4, relay log '/var/lib/mysql/data/mysqld-relay-bin.000001' position: 4; GTID position '0-1923193259-41'
Mar 08 06:16:15 zc-test-mariadb-replica2-2 mysqld[3598]: 2018-03-08 6:16:15 140044504274688 [Note] Slave I/O thread: connected to master 'slave_dbd0eae5@10.0.0.70:3306',replication starts at GTID position '0-1923193259-41'
Mar 08 06:16:16 zc-test-mariadb-replica2-2 mysqld[3598]: 2018-03-08 6:16:16 140044504274688 [ERROR] Error reading packet from server: Error: connecting slave requested to start from GTID 0-1923193259-41, which is not in the master's binlog. Since the master's binlog contains GTIDs with higher sequence numbers, it probably means that the slave has diverged due to executing extra erroneous transactions (server_errno=1236)
Mar 08 06:16:16 zc-test-mariadb-replica2-2 mysqld[3598]: 2018-03-08 6:16:16 140044504274688 [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 0-1923193259-41, which is not in the master's binlog. Since the master's binlog contains GTIDs with higher sequence numbers, it probably means that the slave has diverged due to executing extra erroneous transactions', Internal MariaDB error code: 1236
Mar 08 06:16:16 zc-test-mariadb-replica2-2 mysqld[3598]: 2018-03-08 6:16:16 140044504274688 [Note] Slave I/O thread exiting, read up to log 'FIRST', position 4; GTID position 0-1923193259-41
Mar 08 08:21:04 zc-test-mariadb-replica2-2 mysqld[3598]: 2018-03-08 8:21:04 140044504073984 [Note] Error reading relay log event: slave SQL thread was killed
Mar 08 08:21:04 zc-test-mariadb-replica2-2 mysqld[3598]: 2018-03-08 8:21:04 140044504073984 [Note] Slave SQL thread exiting, replication stopped in log 'FIRST' at position 4; GTID position '0-1923193259-41'

So the problem is diverged slaves during the change of the master node. It is caused by attaching the old master to the new master before attaching the other replicas to it: new GTIDs may be created on the old master after it has been attached and, by chance, synced to some of the other replicas (which are still connected to the old master; MariaDB allows an instance to be a master and a slave simultaneously). Those GTIDs never reach the new master, so when the other replicas are later pointed at it, 'START SLAVE' fails with error 1236 as shown above.
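
For illustration only, the small sketch below (not Trove code) shows how replicas left in this diverged state could be spotted afterwards: MariaDB surfaces the condition in SHOW SLAVE STATUS as I/O error 1236, the same errno that appears in the log above. The host list and credentials are placeholders, and PyMySQL is just one convenient client.

import pymysql

DIVERGED_ERRNO = 1236  # "requested GTID ... not in the master's binlog"


def find_diverged_replicas(replica_hosts, user, password):
    """Return (host, Last_IO_Error) for replicas whose I/O thread hit 1236."""
    diverged = []
    for host in replica_hosts:
        conn = pymysql.connect(host=host, user=user, password=password)
        try:
            with conn.cursor(pymysql.cursors.DictCursor) as cur:
                cur.execute("SHOW SLAVE STATUS")
                status = cur.fetchone() or {}
        finally:
            conn.close()
        if int(status.get("Last_IO_Errno") or 0) == DIVERGED_ERRNO:
            diverged.append((host, status.get("Last_IO_Error")))
    return diverged

Running this against the replica set after a failed promote_to_replica_source should list exactly the instances stuck with the error above.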

This can be fixed by first attaching the other replicas to the new master, and then dealing with the old master.
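
As a rough sketch of that ordering (again not Trove's actual implementation, which goes through the taskmanager and guest agent RPC rather than talking to the databases directly), the steps could look like the following, issued as plain MariaDB SQL over PyMySQL; hosts, credentials and the replication account are placeholders:

import pymysql


def _run(host, *statements):
    # Execute a few administrative statements on one MariaDB instance.
    conn = pymysql.connect(host=host, user="root", password="secret",
                           autocommit=True)
    try:
        with conn.cursor() as cur:
            for sql in statements:
                cur.execute(sql)
    finally:
        conn.close()


def _attach_to(replica_host, master_host):
    # Re-point a replica at master_host using MariaDB GTID replication
    # (master_use_gtid=slave_pos, as in the CHANGE MASTER TO log lines above).
    _run(replica_host,
         "STOP SLAVE",
         "CHANGE MASTER TO master_host='%s', master_port=3306, "
         "master_user='repl', master_password='replpass', "
         "master_use_gtid=slave_pos" % master_host,
         "START SLAVE")


def promote(old_master, new_master, other_replicas):
    # 1. Detach the chosen replica from the old master and make it writable;
    #    it is now the replication source.
    _run(new_master, "STOP SLAVE", "RESET SLAVE ALL",
         "SET GLOBAL read_only = OFF")

    # 2. Attach every remaining replica to the new master FIRST, so no GTID
    #    created on the old master from here on can reach them.
    for replica in other_replicas:
        _attach_to(replica, new_master)

    # 3. Only then demote the old master and attach it as a replica of the
    #    new one; anything it still executes locally no longer fans out to
    #    the other replicas.
    _run(old_master, "SET GLOBAL read_only = ON")
    _attach_to(old_master, new_master)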

Zhao Chao (zhaochao1984)
Changed in trove:
assignee: nobody → Zhao Chao (zhaochao1984)
importance: Undecided → Medium
status: New → Confirmed
Changed in trove:
status: Confirmed → In Progress
OpenStack Infra (hudson-openstack) wrote : Fix merged to trove (master)

Reviewed: https://review.openstack.org/550768
Committed: https://git.openstack.org/cgit/openstack/trove/commit/?id=5895cf0ee99022e2910ec8c72393fe998f5860f8
Submitter: Zuul
Branch: master

commit 5895cf0ee99022e2910ec8c72393fe998f5860f8
Author: Zhao Chao <email address hidden>
Date: Thu Mar 8 17:09:14 2018 +0800

    Avoid diverged slave when migrating MariaDB master

    When promoting one slave to the new master in a replication group,
    previously the old master was attached to the new one right after the
    new master came up. For MariaDB, while the old master is being attached
    to the new one, new GTIDs may be created on the old master and synced
    to some of the other replicas, as they are still connected to the old
    master. These GTIDs do not exist on the new master, so those slaves
    diverge from it. Afterwards, when a diverged slave connects to the new
    master, 'START SLAVE' fails with logs like:

        [ERROR] Error reading packet from server: Error: connecting slave
        requested to start from GTID X-XXXXXXXXXX-XX, which is not in the
        master's binlog. Since the master's binlog contains GTIDs with
        higher sequence numbers, it probably means that the slave has
        diverged due to executing extra erroneous transactions
        (server_errno=1236)

    These slaves are left orphaned and in an error state after
    promote_to_replica_source finishes.

    Attaching the other replicas to the new master before dealing with the
    old master will fix this problem and the failure of the
    trove-scenario-mariadb-multi Zuul job as well.

    Closes-Bug: #1754539
    Change-Id: Ib9c01b07c832f117f712fd613ae55c7de3561116
    Signed-off-by: Zhao Chao <email address hidden>

Changed in trove:
status: In Progress → Fix Released
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/trove 10.0.0.0b1

This issue was fixed in the openstack/trove 10.0.0.0b1 development milestone.

OpenStack Infra (hudson-openstack) wrote : Fix merged to trove (stable/queens)

Reviewed: https://review.openstack.org/580584
Committed: https://git.openstack.org/cgit/openstack/trove/commit/?id=2133963b8975d4d5968b6559c9be36d0f992d8ab
Submitter: Zuul
Branch: stable/queens

commit 2133963b8975d4d5968b6559c9be36d0f992d8ab
Author: Zhao Chao <email address hidden>
Date: Tue Mar 6 18:08:59 2018 +0800

    Fix zuul check jobs

    This patch is a squash of three commits that are needed for a
    successful Zuul run:

    1. Use neutronclient for floatingip operations

    As add_floating_ip and remove_floating_ip were removed from novaclient,
    this patch migrates to neutronclient for floatingip operations.

    This is the first patch for fixing the unit tests. As the integration
    tests didn't fail, floatingip operations are not covered there, so it
    may be necessary to add new scenario tests for this in follow-up
    patches.

    NOTE(elod.illes): this backport is necessary due to the nova client
    constraint update [1]; otherwise (periodic stable) jobs fail in
    stable/queens

    [1] https://review.openstack.org/#/c/571540/

    Signed-off-by: Zhao Chao <email address hidden>

    2. Avoid diverged slave when migrating MariaDB master

    When promoting one slave to the new master in a replication group,
    previously the old master was attached to the new one right after the
    new master came up. For MariaDB, while the old master is being attached
    to the new one, new GTIDs may be created on the old master and synced
    to some of the other replicas, as they are still connected to the old
    master. These GTIDs do not exist on the new master, so those slaves
    diverge from it. Afterwards, when a diverged slave connects to the new
    master, 'START SLAVE' fails with logs like:

        [ERROR] Error reading packet from server: Error: connecting slave
        requested to start from GTID X-XXXXXXXXXX-XX, which is not in the
        master's binlog. Since the master's binlog contains GTIDs with
        higher sequence numbers, it probably means that the slave has
        diverged due to executing extra erroneous transactions
        (server_errno=1236)

    These slaves are left orphaned and in an error state after
    promote_to_replica_source finishes.

    Attaching the other replicas to the new master before dealing with the
    old master will fix this problem and the failure of the
    trove-scenario-mariadb-multi Zuul job as well.

    Closes-Bug: #1754539
    Signed-off-by: Zhao Chao <email address hidden>

    3. Add a hook for restore process to check if successful

    1. This is the same solution as the backup process. Previously we only
    checked the stderr of the restore command, but this method is not
    reliable, and the changes in percona-xtrabackup 2.4.11:
    https://jira.percona.com/browse/PXB-1542
    caused the gate jobs to fail when creating a slave MySQL instance.

    The new hook is currently only used by the InnoBackupEx restore runner,
    checking the exit status and stderr output of xbstream.

    NOTE(elod.illes): The following part is not backported because [1] is
    not backported e...


tags: added: in-stable-queens
OpenStack Infra (hudson-openstack) wrote : Fix included in openstack/trove queens-eol

This issue was fixed in the openstack/trove queens-eol release.
