Connection issues after promoting slave to master

Bug #1435781 reported by Patrick McConnell
Affects: Percona Server (moved to https://jira.percona.com/projects/PS)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Percona server version: 5.6.21-69.0-675
OS version: Ubuntu 12.04.5

We have a MySQL host with 13 slaves. We switch masters by promoting one of the existing slaves. We switch using a script so it happens fairly quickly, and all slaves connect to the new master at the same time. All the slaves are caught up to the old master before their slave threads are stopped.
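The report does not include the actual switch script, but the repointing step it describes can be sketched roughly as follows. The host names and the bare CHANGE MASTER TO statement are illustrative assumptions only; a real script would also supply MASTER_LOG_FILE/MASTER_LOG_POS (or use GTIDs) and handle credentials.

```python
# Hypothetical illustration only: host names and the single CHANGE MASTER TO
# statement are assumptions, not the reporter's actual script.
NEW_MASTER = "db-new-master"
SLAVES = [f"db-slave{i:02d}" for i in range(1, 14)]  # the 13 slaves

def repoint_sql(master_host):
    # SQL each slave runs once it has applied everything from the old master.
    return (
        "STOP SLAVE; "
        f"CHANGE MASTER TO MASTER_HOST='{master_host}'; "
        "START SLAVE;"
    )

for host in SLAVES:
    print(f"{host}: {repoint_sql(NEW_MASTER)}")
```

Because the script repoints all 13 slaves in one pass, every slave issues its binlog_dump request to the new master at essentially the same moment.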

When all the slaves connect to the new master, the new master starts a binlog_dump to each slave. Shortly after the binlog_dump starts, the slave disconnects, waits one minute, reconnects, restarts the binlog_dump, then disconnects and reconnects again.

This happens continuously for around 10-15 minutes, with fewer slaves having to reconnect each time. After 10-15 minutes they seem to have finished downloading the binlog, and they start working properly without intervention.

Stopping and starting the slave thread during the first 10-15 minutes does not seem to change the behaviour.

The master host has six binary log files of 1 GB each.

While a slave is waiting to reconnect, we see this in SHOW SLAVE STATUS:

Slave_IO_State: Waiting to reconnect after a failed master event read
[...]
Slave_IO_Running: Connecting

In the attached log excerpt from the new master host (failover_log.txt) you can see the disconnect/reconnect pattern described above. The master switch occurred at 1:04 am.

Patrick McConnell (j-patrick) wrote :
Sveta Smirnova (svetasmirnova) wrote :

Thank you for the report.

This can be caused by a networking error if too many events are being sent at the same time right after the master comes up. Please check your network bandwidth and the amount of data each server is trying to receive. As a workaround, you can add the slaves one by one: start replication on slave 1, wait 15 minutes, then start slave 2, and so on.
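A dry-run sketch of this staggered approach; the host names, the helper, and the 15-minute gap are illustrative, and nothing here actually talks to MySQL:

```python
# Dry-run sketch of the staggered workaround; host names are hypothetical.
SLAVES = [f"slave{i:02d}" for i in range(1, 14)]  # 13 hypothetical hosts
STAGGER_SECONDS = 15 * 60                          # wait between hosts

def start_slave_command(host):
    # The shell command we would issue per host (hypothetical helper).
    return f'mysql -h {host} -e "START SLAVE;"'

commands = [start_slave_command(h) for h in SLAVES]
total_wait_hours = len(SLAVES) * STAGGER_SECONDS / 3600
print(len(commands), "hosts,", total_wait_hours, "hours of waiting")
```

With 13 slaves, the 15-minute waits alone add up to more than three hours of wall-clock time for a full master switch.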

Changed in percona-server:
status: New → Incomplete
Patrick McConnell (j-patrick) wrote :

Hi Sveta. Could you give me more detail on what sort of networking error you're referring to? I have worked with our networking team and we didn't find any problems on the servers or switches.

The workaround of waiting 15 minutes before starting replication on each slave will not work for us. We need a master switch to be as instant as possible - within a few seconds. If we wait 15 minutes between starting each of our 13 slaves, a master switch would take over three hours.

Each host (master and slaves) has a 1 Gb interface. The master has 6 x 1 GB binary logs. Assuming it is trying to transfer all binary logs to all slaves, that would be 78 GB (6 GB of log files x 13 slaves).

This should take at least 10 minutes to transfer over a 1 Gb interface, which corresponds well to the 10-15 minutes it took for the "Start binlog_dump"/"Aborted connection" messages to stop showing in the error log.
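A back-of-the-envelope check of the numbers above, assuming decimal GB/Gb units and that the master's 1 Gb/s interface is the shared bottleneck:

```python
# Sanity check of the transfer-time estimate: 6 GB of binlogs to 13 slaves
# over the master's single 1 Gb/s interface.
binlog_bytes = 6 * 10**9       # 6 x 1 GB binary logs
n_slaves = 13
link_bits_per_s = 10**9        # 1 Gb/s master NIC

seconds = binlog_bytes * n_slaves * 8 / link_bits_per_s
print(round(seconds / 60, 1), "minutes")  # ~10.4 minutes
```

That lines up with the observed 10-15 minutes before the reconnect storm dies down.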

I think it's fair that it took 10-15 minutes to transfer the binary logs, but I don't think it's right that the slave threads kept aborting and reconnecting. I've played around with some timeout-related variables, but they didn't seem to have any impact on the problem.

I don't see anything on our servers or our network that would prevent 78 GB of data from being transferred, which is why we suspected a bug in MySQL. Do you have any ideas as to what could have caused the slave threads to abort while they are transferring the binary logs? (Please see the previously-attached log for details)

Changed in percona-server:
status: Incomplete → New
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PS-3276
