Percona Server moved to https://jira.percona.com/projects/PS

Write/fsync amplification w/ duplicate GTIDs

Bug #1669928 reported by Sveta Smirnova on 2017-03-03

This bug affects 1 person

	Status	Importance	Assigned to	Milestone
MySQL Server	Unknown	Unknown	mysql-bugs #85141
Percona Server moved to https://jira.percona.com/projects/PS	Status tracked in 5.7
5.5	Invalid	Undecided	Unassigned
5.6	Invalid	Undecided	Unassigned
5.7	Fix Released	Medium	Yura Sorokin	Percona Server moved to https://jira.percona.com/projects/PS 5.7.18-14

Bug Description

Originally reported at https://bugs.mysql.com/bug.php?id=85141

Description:
The slave_worker_info/slave_relay_log_info tables don't need to update on no-ops, only on executed gtids. skips still update the respective table (which one is based on slave_parallel_workers). Say replication comes across a gtid that has already been executed - normally, it would not need to write any data. because of this, it needs to write a row, commit, and (presumably based on config) fsync. It's not a large amount of data byte-wise, but that's a lot of syncs.

Say you have a slave of an msr setup, with three masters configured, and each master sourcing from one writer and using log_slave_updates. The slave sees 3 incoming streams of transactions. That means 2 out of 3 incoming transactions (well "groups" in terms of replication code but that word is ambiguous now with group rep and such) are skipped. That means the host is doing 3x as many commits as it actually needs to to remain consistent. That's a lot of wasted effort!

The amplification is directly proportional to the count of channels in a situation like that, it gets impractical very quickly, even with nice drives and a big controller write cache.

How to repeat:
Keep an eye on p_s.table_io_waits_summary_by_table, and remember to enable 'mysql' schema in setup_objects. Also, I was able to get some additional data on this when I ran the test on a host with google/percona's userstat plugin, which directly counts row changes by table. You can look in event_waits etc as well but those two are good clear views into what's happening.
You can also look at stuff like the Handler_commit status variable for an even higher level view of just the commits/syncs.

Easiest way is to set up a slave (C) of two masters, having one as a writer (A) and one as a slave of that (B) with log_slave_updates. This ensures that all GTIDs come in twice.
- Start writes to A and observe all for a bit. Sysbench or similar is fine.
- Lag the intermediate, B, by stopping its sql thread or otherwise.
- Stop writes to A, allow C to catch up if it's not already.
- Start sql_thread on B (or remove the forced lag).
-- note: at this point, you know all gtid's that C will see are ones it has already executed. no more data changes will be made by replication outside of the system tables.
- Now, watch C carefully. Look at all the writes to the relevant slave_*_info table (can test with or without mts) that are still going on! But no value is added, as they don't correspond to anything the server's actually changing.

Suggested fix:
If a slave goes down and comes back up, and is started, and the last gtid seen was a dupe gtid which it didn't write to the table about - that's fine consistency wise, as it is still in the executed gtid set, and will be skipped again. However, this is a noticeable change in behavior, so a variable controlling it might be best. In some cases, people might want to pay the additional write cost to skip reading all of those logs.

This would only be relevant to gtid processing, but the problem is based around gtid processing...most cases where you'd see an already executed transaction depend on gtid. There's one case of circular replication between two masters without gtid - I haven't tested that setup to see if this is a problem there, but the code filtering same server id is not the same as the code which filters is_executed gtids. I also haven't tested items that come through replication like settings in headers, but that doesn't concern me as the volume would always be low, and if I recall, timestamp setting and anything else is contained within a gtid-defined group and so would not be an additional issue.

Tags: