heartbeats/fakerotate cause a forced sync_master_info
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
MySQL Server | Unknown | Unknown | |
Percona Server (moved to https://jira.percona.com/projects/PS) | Status tracked in 5.7 | | |
5.5 | Invalid | Undecided | Unassigned |
5.6 | Invalid | Undecided | Unassigned |
5.7 | Fix Released | Medium | Vlad Lesin |
Bug Description
Originally reported at https:/
Description:
With master_
However, there's a case where some code crosses paths. The code that skips already-executed GTID events uses a dummy heartbeat (treated as a fake rotate event) to signal the end of a skip period. That sounds fine until you reach the slave handling code, which takes that event and forces a master info sync on every heartbeat, even these dummy ones.
This can, even in a fairly simple setup, lead to a ~5X increase in write load on a slave: a huge waste of resources, and very limiting to replication throughput.
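To see how the amplification works, here is a rough model (plain Python; the event counts and skip fraction are illustrative assumptions, not measurements from this report): if every skipped GTID event is sent as a fake heartbeat, and every heartbeat forces a slave_master_info write, the table sees one write per skipped event instead of one write per sync_master_info period.

```python
# Illustrative model of slave_master_info write amplification.
# All numbers are assumptions for demonstration, not measured values.

def master_info_writes(events, skipped_fraction, sync_master_info,
                       force_flush_on_heartbeat):
    """Count slave_master_info writes for a stream of relay-log events.

    events: total events received on the channel
    skipped_fraction: fraction that are already-executed GTIDs
                      (delivered as fake heartbeats)
    sync_master_info: flush once every N non-forced events
    force_flush_on_heartbeat: the behavior this bug describes
    """
    heartbeats = int(events * skipped_fraction)
    normal = events - heartbeats
    writes = normal // sync_master_info
    if force_flush_on_heartbeat:
        writes += heartbeats  # one forced write per fake heartbeat
    return writes

events = 1_000_000
baseline = master_info_writes(events, 0.8, 10_000, force_flush_on_heartbeat=False)
buggy = master_info_writes(events, 0.8, 10_000, force_flush_on_heartbeat=True)
print(baseline, buggy)  # 20 vs 800020 writes for the same event stream
```

With the (default) sync_master_info of 10000, the forced path turns a handful of periodic flushes into hundreds of thousands of writes for the same stream.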
Fake heartbeat handling:
https:/
which in turn calls:
https:/
note "flush_
How to repeat:
Generate a lot of nonsequential skips to demonstrate. A very basic multi-source replication (MSR) setup will do that under load:
- GTID mode on, of course.
- Server A, writable master. msr replication from B and C.
- Servers B and C, backup masters. msr replication from A,C and A,B respectively.
- Server D, msr slave of all three, no log_bin (keep it fast).
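Why does this topology produce so many skips? D receives each transaction once per channel; whichever channel delivers a GTID first wins, and the copies arriving on the other channels are skipped as already executed (and, per this bug, each skip comes back as a fake heartbeat). A toy Python model of that, with made-up GTIDs and channel names:

```python
# Toy model: a multi-source slave receiving the same GTIDs on three channels.
# GTIDs and channel names are invented for illustration.

executed = set()        # plays the role of gtid_executed on slave D
skips_per_channel = {}

# Each of A, B, C relays the same 1000 transactions originated on A.
stream = [f"uuid_a:{n}" for n in range(1, 1001)]
for channel in ("channel_a", "channel_b", "channel_c"):
    skips = 0
    for gtid in stream:
        if gtid in executed:
            skips += 1  # already executed: skipped, surfaced as a fake heartbeat
        else:
            executed.add(gtid)
    skips_per_channel[channel] = skips

print(skips_per_channel)
# {'channel_a': 0, 'channel_b': 1000, 'channel_c': 1000}
```

Two out of every three deliveries are skips, so the heartbeat volume scales with the write rate times the number of redundant channels.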
Now, apply a significant write load to A (sysbench works fine). I found it easiest to lag B and C organically with the load on A, but you can stop their SQL threads for a period to simulate lag. If you do that, restart them both after a while.
I recommend stopping the writes to A here, so that D is caught up and not executing any row changes in the data - this helps to isolate load caused by these heartbeats.
Now, they'll be chugging through a lot of dupes and sending them down to D as heartbeats. In a pretty short test, I saw ~60 million heartbeats received on D (from p_s.replication
Now there's a lot of IO happening, considering D isn't actually executing any changes. Checking p_s.table_
Suggested fix:
There are two ways to go about this, depending on the intent of this process.
First, is it really necessary to send so many heartbeats just for GTID skips? I don't see that adding value, even if you do sync master info every transaction. They're skips; they're specifically intended to do nothing. Eliminating them (on the rpl_master.cc side) would be the best solution, but that code was written for a purpose and someone must want them, so a setting to control the behavior would be good. Or they might be very important for a reason I don't see in the code!
Second, if it's actually important, and even turning it off via a setting is impractical, take away the forced sync. Change that "true" in rpl_slave.cc to "false" so heartbeats obey the same rules other events do (the documented rules!).
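The second option can be sketched abstractly. This is a simplified Python model, not the actual rpl_slave.cc code, and the names are invented: the heartbeat path currently passes a force flag into the master-info flush, which bypasses the sync_master_info counter; passing false would make heartbeats obey the same periodic rule as every other event.

```python
# Simplified sketch of the flush decision, not actual MySQL source.
# 'sync_master_info' mimics the server variable of the same name.

class MasterInfo:
    def __init__(self, sync_master_info=10_000):
        self.sync_master_info = sync_master_info
        self.events_since_flush = 0
        self.flushes = 0

    def flush_info(self, force):
        """Flush only when forced or when the periodic counter expires."""
        self.events_since_flush += 1
        if force or self.events_since_flush >= self.sync_master_info:
            self.flushes += 1
            self.events_since_flush = 0

mi_forced, mi_relaxed = MasterInfo(), MasterInfo()
for _ in range(50_000):                 # 50k fake heartbeats arrive
    mi_forced.flush_info(force=True)    # current behavior: flush every heartbeat
    mi_relaxed.flush_info(force=False)  # proposed: obey sync_master_info
print(mi_forced.flushes, mi_relaxed.flushes)  # 50000 vs 5
```

The one-word change reduces the flush count from one per heartbeat to one per sync_master_info period, which is the documented cadence.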
Test case for MTR will be attached
tags: | added: performance |
Attached test case can be used to repeat the issue.
In the end it will print something like:
[connection server_1]
WAIT/1024/1024/1024 from performance_schema.table_io_waits_summary_by_table where OBJECT_SCHEMA='mysql' and OBJECT_NAME='slave_master_info';
WAIT/1024/1024/1024 table_statistics where table_name='slave_master_info';
schema.replication_connection_status;
HEARTBEATS LAST_HEARTBEAT_TIMESTAMP RECEIVED_TRANSACTION_SET LAST_ERROR_NUMBER LAST_ERROR_MESSAGE LAST_ERROR_TIMESTAMP
7701-11e7-8781-30b5c2208a0f 64 ON 4 2017-08-02 00:43:22 922a4d56-7701-11e7-8781-30b5c2208a0f:1-357 0 0000-00-00 00:00:00
7701-11e7-8691-30b5c2208a0f 70 ON 4 2017-08-02 00:43:24 922a4d56-7701-11e7-8781-30b5c2208a0f:1-53 0 0000-00-00 00:00:00
7701-11e7-86b1-30b5c2208a0f 76 ON 4 2017-08-02 00:43:26 922a4d56-7701-11e7-8781-30b5c2208a0f:1-57 0 0000-00-00 00:00:00
WAIT/1024/1024/1024 from performance_schema.table_io_waits_summary_by_table where OBJECT_SCHEMA='mysql' and OBJECT_NAME='slave_master_info';
WAIT/1024/1024/1024 table_statistics where table_name='slave_master_info';
schema.replication_connection_status;
HEARTBEATS LAST_HEARTBEAT_TIMESTAMP RECEIVED_TRANSACTION_SET LAST_ERROR_NUMBER LAST_ERROR_MESSAGE LAST_ERROR_TIMESTAMP
7701-11e7-8781-30b5c2208a0f 64 ON 7 2017-08-02 00:46:22 922a4d56-7701-11e7-8781-30b5c2208a0f:1-357 0 0000-00-00 00:00:00
7701-11e7-8691-30b5c2208a0f 70 ON 10 2017-08-02 00:45:29 922a4d56-7701-11e7-8781-30b5c2208a0f:1-53:58-81 0 0000-00-00 00:00:00
7701-11e7-86b1-30b5c2208a0f 76 ON 7 2017-08-02 00:45:33 922a4d56-7701-11e7-8781-30b5c2208a0f:1-57:59-80 0 0000-00-00 00:00:00
select count(*) from t1;
count(*)
1325 <<<<<<<< ==== Number of rows on master
[connection server_4]
start slave;
select OBJECT_NAME, count_star, AVG_TIMER_
OBJECT_NAME count_star AVG_TIMER_
slave_master_info 36 11.149680108763 <<<< Remember IO numbers
select table_name, io_write_requests, io_write from sys.schema_
table_name io_write_requests io_write
slave_master_info 11 176.00 KiB <<<< Remember IO numbers
select * from performance_
CHANNEL_NAME GROUP_NAME SOURCE_UUID THREAD_ID SERVICE_STATE COUNT_RECEIVED_
channel_1 922a4d56-
channel_2 925a4dbc-
channel_3 92a8be79-
select count(*) from t1;
count(*)
1325 <<<<<<<< ==== Slave D received all changes from A
[connection server_2]
start slave;
[connection server_3]
start slave;
[connection server_4]
select OBJECT_NAME, count_star, AVG_TIMER_
OBJECT_NAME count_star AVG_TIMER_
slave_master_info 46 22.199227280915 <<<<< 46 > 36
select table_name, io_write_requests, io_write from sys.schema_
table_name io_write_requests io_write
slave_master_info 15 240.00 KiB <<<<< 15 > 11
select * from performance_
CHANNEL_NAME GROUP_NAME SOURCE_UUID THREAD_ID SERVICE_STATE COUNT_RECEIVED_
channel_1 922a4d56-
channel_2 925a4dbc-
channel_3 92a8be79-
select count(*) from t1;
count(*)
1325 <<<<<<<< ==== No new rows added