Async slave fails after master full SST

Bug #1710297 reported by Juan Pablo Arruti
This bug affects 2 people
Affects                 Status   Importance  Assigned to  Milestone
Percona XtraDB Cluster  (moved to https://jira.percona.com/projects/PXC; status tracked in 5.6)
  5.6                   Invalid  Undecided   Unassigned
  5.7                   Invalid  Undecided   Unassigned

Bug Description

Tested on Percona XtraDB Cluster 5.6.36-82.0-56-log and 5.7.18-15-57-log.

Suppose the cluster node that acts as master for an asynchronous slave is shut down, writes occur on the other nodes, and the master node is then started again and rejoins the cluster through a full SST. The slave connected to that node then fails with the following error:

Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.'

Test case:

1) Bootstrap a master cluster (2 nodes) with the following settings:

server_id=200
log_bin=percona-bin
log_slave_updates
binlog_format=ROW
enforce_gtid_consistency=1
gtid_mode=on

2) Start a slave with the following settings:

server_id=201
log_bin=percona-bin
log_slave_updates
binlog_format=ROW
enforce_gtid_consistency=1
gtid_mode=on

3) Point the async slave (server 201) at node 1 of the master cluster (server 200).

4) Shut down node 1 of the master cluster.

5) Delete all files in the datadir of node 1 to force an SST.

6) Execute some writes on node 2 of the master cluster.

7) Start node 1 of the master cluster.

After several failed connection attempts, the slave I/O thread logs the following error in the MySQL error log:

2017-08-11T19:39:36.389754Z 14 [ERROR] Error reading packet from server for channel '': The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires. (server_errno=1236)
2017-08-11T19:39:36.389767Z 14 [ERROR] Slave I/O for channel '': Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.', Error_code: 1236
2017-08-11T19:39:36.389787Z 14 [Note] Slave I/O thread exiting for channel '', read up to log 'percona-bin.000012', position 194

Possible workarounds:

1) Avoid SST; IST does not show the same behavior.
2) Change master to the PXC node that generated the new rows.

Tags: i201425
Nickolay Ihalainen (ihanick) wrote:

Verified as described.

> 2) Change master to the PXC node that generated new rows.

Change master to any node except the one recovered by SST.
The slave can be re-attached to the original master node once it has received all the changes made while that node was offline.

GTID auto-positioning requires binary logs that contain the corresponding events.
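A minimal Python sketch of the GTID check behind error 1236, assuming the master refuses an auto-positioned slave whenever its `gtid_purged` set is not a subset of the slave's `gtid_executed` set. The function names and the source UUID are hypothetical, and the idea that node 1 reports the offline writes as purged after the SST is an assumption drawn from the test case, not something stated in the report.

```python
# Sketch: would a master refuse a MASTER_AUTO_POSITION = 1 slave?
# Assumption: the master rejects the connection (error 1236) when its
# gtid_purged set is not fully contained in the slave's gtid_executed set.
from collections import defaultdict


def parse_gtid_set(text):
    """Parse 'uuid:1-5:7,uuid2:1-3' into {uuid: set of transaction numbers}.

    Ranges are expanded into Python sets, which is fine for small
    illustrative examples but not for real servers.
    """
    result = defaultdict(set)
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        source_id, *intervals = part.split(":")
        for interval in intervals:
            if "-" in interval:
                start, end = interval.split("-")
                result[source_id.lower()].update(range(int(start), int(end) + 1))
            else:
                result[source_id.lower()].add(int(interval))
    return dict(result)


def is_subset(a, b):
    """True if every GTID in set a is also present in set b."""
    return all(txns <= b.get(source_id, set()) for source_id, txns in a.items())


def master_refuses_auto_position(master_gtid_purged, slave_gtid_executed):
    """True if the master has purged events the slave has not applied yet."""
    purged = parse_gtid_set(master_gtid_purged)
    executed = parse_gtid_set(slave_gtid_executed)
    return not is_subset(purged, executed)


if __name__ == "__main__":
    # Hypothetical source UUID used only for this example.
    src = "aaaaaaaa-bbbb-cccc-dddd-000000000001"
    slave_executed = f"{src}:1-100"
    # Node 1 after SST, assumed to report 1-130 as purged
    # (101-130 were written on node 2 while node 1 was down).
    print(master_refuses_auto_position(f"{src}:1-130", slave_executed))  # True
    # Node 2, assumed to still hold binlogs from transaction 51 onward.
    print(master_refuses_auto_position(f"{src}:1-50", slave_executed))   # False
```

Under this model, pointing the slave at node 2 works because node 2 can still serve transactions 101-130; once the slave has applied them, node 1's purged set is covered and the slave can return to it.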

Currently, PXC nodes can use different names and locations for their binary logs.
Binary logs can also be large, so transferring them would increase SST time.

tags: added: i201425
Krunal Bauskar (krunal-bauskar) wrote:

Not sure anything can be done on the PXC side.

The solution lies in the overall setup: ProxySQL, or whatever tool you use, should redirect the async slave to another member of the cluster if the existing async master fails or is shut down.
Redirecting back to the original master can be done only after it has returned to the SYNCED state.
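The redirection rule above, combined with Nickolay's note that the slave may return to the original master once it has received the offline changes, can be sketched as a small selection function. This is not ProxySQL's logic: `ClusterNode`, `pick_async_master` and the `slave_caught_up` flag are hypothetical, and node states are assumed to come from each node's `wsrep_local_state_comment` status variable.

```python
# Sketch of a failover rule for an async slave attached to a PXC cluster.
# Hypothetical: a monitoring layer supplies each node's
# wsrep_local_state_comment and whether the slave has caught up.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ClusterNode:
    host: str
    state: str  # wsrep_local_state_comment, e.g. "Synced" or "Joining"


def pick_async_master(nodes: List[ClusterNode], original_master: str,
                      slave_caught_up: bool) -> Optional[str]:
    """Choose which cluster node the async slave should replicate from.

    slave_caught_up: hypothetical flag, True once the slave has applied the
    writes made while original_master was offline (e.g. via another node).
    """
    synced = [n.host for n in nodes if n.state == "Synced"]
    # Return to the original master only if it is SYNCED again and the
    # slave no longer needs events that the master may have purged.
    if original_master in synced and slave_caught_up:
        return original_master
    # Otherwise use any other SYNCED member, never the rejoining node.
    for host in synced:
        if host != original_master:
            return host
    return None  # no safe master available right now
```

The actual switch would still be a `CHANGE MASTER TO ... MASTER_AUTO_POSITION = 1` issued against the slave; the sketch only decides the target host.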

Shahriyar Rzayev (rzayev-sehriyar) wrote:

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PXC-1998
