Events are passed back and forth between dual masters if they originate from another mysqld

Bug #940404 reported by Hui Liu
This bug affects 1 person
Affects          Status     Importance  Assigned to  Milestone
Percona Server (moved to https://jira.percona.com/projects/PS), status tracked in 5.7:
  5.1            Won't Fix  Medium      Unassigned
  5.5            Triaged    Medium      Unassigned
  5.6            Triaged    Medium      Unassigned
  5.7            Triaged    Medium      Unassigned

Bug Description

In master fail-over scenarios, we have suffered the unnecessary cost of events
being passed back and forth between dual masters.

Take an example:
MySQL1 <---(dual master) ---> MySQL2 ---(rep)---> MySQL3

In the topology above, MySQL1 is readable and writable, while MySQL2 is
read-only. If MySQL1 goes down, MySQL2 takes over writes and makes MySQL3 its
dual master. If some relay log events have not yet been applied on MySQL2 when
MySQL1 goes down (with log_slave_updates=1), those events are then passed back
and forth between MySQL2 and MySQL3: they carry MySQL1's server_id, so neither
MySQL2 nor MySQL3 discards them as its own.
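The loop can be illustrated with a toy model. This is a sketch, not MySQL's replication code: the `Server` class and the `replicate`/`copies_after` helpers are hypothetical. It assumes only the standard loop guard (a slave skips events carrying its own server_id) and that log_slave_updates=1 re-logs applied events under their original server_id.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Server:
    """Toy mysqld: its server_id and the origin server_ids of its binlog events."""
    server_id: int
    binlog: List[int] = field(default_factory=list)


def replicate(master: Server, slave: Server, start: int) -> int:
    """Apply master.binlog[start:] on slave and return the new read position.

    A slave skips events carrying its own server_id (the loop guard). With
    log_slave_updates=1, each applied event is re-logged in the slave's binlog
    under its ORIGINAL server_id.
    """
    for origin_id in master.binlog[start:]:
        if origin_id != slave.server_id:
            slave.binlog.append(origin_id)
    return len(master.binlog)


def copies_after(origin_id: int, rounds: int) -> int:
    """Count copies of one event that starts in MySQL2's binlog.

    MySQL2 (server_id 2) and MySQL3 (server_id 3) are dual masters.
    """
    mysql2, mysql3 = Server(2), Server(3)
    mysql2.binlog.append(origin_id)
    pos3 = pos2 = 0  # read positions: MySQL3 <- MySQL2 and MySQL2 <- MySQL3
    for _ in range(rounds):
        pos3 = replicate(mysql2, mysql3, pos3)
        pos2 = replicate(mysql3, mysql2, pos2)
    return mysql2.binlog.count(origin_id) + mysql3.binlog.count(origin_id)


print(copies_after(origin_id=2, rounds=5))  # written on MySQL2: stops at 2
print(copies_after(origin_id=1, rounds=5))  # leftover from MySQL1: 11, and growing
```

An event written on MySQL2 returns to MySQL2 once and is dropped there, while an event with MySQL1's server_id gains a copy on each server every round.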

Once the problem is detected, it is easy to fix:
1. Stop replication on MySQL2 (the channel that receives changes from MySQL3).
2. Wait until the events from MySQL1 have been applied on MySQL3.
3. On MySQL2, change master to the new binary log position.
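The three steps above roughly correspond to the following statements. This is a hedged sketch: `failover_fix_plan` is a hypothetical helper, and it assumes the binlog coordinates (the position after the last MySQL1-origin event on MySQL2, and the position after MySQL3's re-logged copies of those events) have already been determined, for example by inspecting the binlogs with mysqlbinlog. Only standard statements are used: STOP SLAVE, MASTER_POS_WAIT(), CHANGE MASTER TO and START SLAVE.

```python
from typing import List, Tuple


def failover_fix_plan(mysql2_file: str, mysql2_pos: int,
                      mysql3_file: str, mysql3_pos: int) -> List[Tuple[str, str]]:
    """Return (host, SQL) pairs for the manual fix (hypothetical helper).

    mysql2_file/mysql2_pos: MySQL2 binlog coordinates just after the last
        event that came from MySQL1 (assumed to be known).
    mysql3_file/mysql3_pos: MySQL3 binlog coordinates just after MySQL3's
        re-logged copies of those events (assumed to be known).
    """
    return [
        # 1. Break the MySQL3 -> MySQL2 replication.
        ("MySQL2", "STOP SLAVE"),
        # 2. Block until MySQL3 has applied MySQL2's binlog up to that point.
        ("MySQL3", f"SELECT MASTER_POS_WAIT('{mysql2_file}', {mysql2_pos})"),
        # 3. Re-point MySQL2 past the copies so they are never read back.
        ("MySQL2", f"CHANGE MASTER TO MASTER_LOG_FILE='{mysql3_file}', "
                   f"MASTER_LOG_POS={mysql3_pos}"),
        ("MySQL2", "START SLAVE"),
    ]


# Illustrative (made-up) coordinates.
for host, sql in failover_fix_plan("mysql2-bin.000007", 4567,
                                   "mysql3-bin.000002", 8910):
    print(f"{host}: {sql};")
```

CHANGE MASTER TO keeps any option that is not specified, so only the file and position need to be given.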

However, when these events are applied very quickly on a heavily loaded MySQL
server, they are hard to notice, and they degrade master/slave performance. So
we try to detect these scenarios inside MySQL and alert the DBA.

An event is reported when:
1) its server_id differs from the local server_id,
2) its server_id is not in the list of slave server_ids, and
3) the replication topology contains a cycle.
In that case, an error message is written to error.log.
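The three conditions can be expressed as a small predicate. The patch itself is not reproduced in this report, so this is an independent sketch: `has_cycle` and `is_suspicious_event` are hypothetical names, and the topology is assumed to be available as a "master -> set of slaves" map keyed by server_id. Cycle detection uses a standard depth-first search with three node colors.

```python
from typing import Dict, Set


def has_cycle(master_of: Dict[int, Set[int]]) -> bool:
    """True if the directed 'master -> slaves' graph contains a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current DFS path, finished
    color: Dict[int, int] = {}

    def visit(node: int) -> bool:
        color[node] = GREY
        for nxt in master_of.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GREY:  # back edge: nxt is on the current path
                return True
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in list(master_of))


def is_suspicious_event(event_server_id: int, local_server_id: int,
                        slave_server_ids: Set[int],
                        master_of: Dict[int, Set[int]]) -> bool:
    """Conditions 1) to 3) from the bug description, all required."""
    return (event_server_id != local_server_id             # 1) not our own event
            and event_server_id not in slave_server_ids    # 2) not from a slave
            and has_cycle(master_of))                      # 3) topology has a cycle


dual = {2: {3}, 3: {2}}  # MySQL2 <-> MySQL3 after fail-over
print(is_suspicious_event(1, 2, {3}, dual))  # leftover MySQL1 event on MySQL2
print(is_suspicious_event(3, 2, {3}, dual))  # normal event from the other master
```

In the fail-over example, MySQL2's only slave is MySQL3, so regular events from MySQL3 pass condition 2, while leftover events with MySQL1's server_id satisfy all three conditions.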

A patch is attached.

Hui Liu (hickey) wrote :

Tweaked the test case to make it easier to understand.

Stewart Smith (stewart)
Changed in percona-server:
importance: Undecided → Medium
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

@hickey, I tried testing with rpl_circular_event_detect.result, rpl_circular_event_detect.test and rpl_circular_event_detect.cnf, and the tests passed without the patch. Here is the log: http://sprunge.us/PcjV

I have also attached the full log.

Can you let us know?

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

I have seen this happen in an older case (with RBR) where the following occurred:

M1 <----> M2 in dual replication topology.

However, a CHANGE MASTER was somehow executed that pointed M1 to M3, resulting in:

 M3 ---> M1 ----> M2 ('--->' is master-of relation).

After some time this was rectified; however, M1's binlog had recorded events with M3's server_id (let's call it 3).

After the rectification, the topology was again M1 <------> M2.

So, binlog events with server_id 3 ping-ponged between M1 and M2.

tags: added: i24444
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

@hickey, can you provide more details regarding #3?

Marking this as incomplete till then.

Changed in percona-server:
status: New → Incomplete
Hui Liu (hickey) wrote :

The test case passes even without the source code patch :)

$./mtr suite/rpl/t/rpl_circular_event_detect.test

==============================================================================

TEST RESULT TIME (ms) or COMMENT
--------------------------------------------------------------------------

worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 13000..13009
rpl.rpl_circular_event_detect 'mix' [ pass ] 1461
rpl.rpl_circular_event_detect 'row' [ pass ] 1366
rpl.rpl_circular_event_detect 'stmt' [ pass ] 1458
--------------------------------------------------------------------------

What our patch does for events ping-ponging between dual master instances is simply detect this case and print a warning/error to error.log, so that the DBA can handle it manually.

The main purpose is only to DETECT, not to FIX. We could simply delete such events before writing them to the relay log, but that is not what we want in a production environment.

We have updated the patch used on our production MySQL servers; it is clearer:

1. The rpl_circular_warning switch is turned on automatically once a slave is registered.
2. The rpl_circular_warning switch is turned off automatically once an abnormal event is detected.
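The switch behaviour described in this list can be modelled as a one-shot flag. This is a sketch under assumptions: the class and method names are hypothetical, the check itself is passed in as a callback, and the `error_log` list merely stands in for error.log. Presumably the switch is turned off after the first hit so that a busy server does not flood error.log with one line per looping event.

```python
from typing import Callable, List


class CircularWarningSwitch:
    """Hypothetical model of the one-shot rpl_circular_warning switch."""

    def __init__(self, is_suspicious: Callable[[int], bool]) -> None:
        self.rpl_circular_warning = False
        self._is_suspicious = is_suspicious
        self.error_log: List[str] = []  # stands in for error.log

    def on_slave_registered(self) -> None:
        # 1. Turned on automatically once a slave registers.
        self.rpl_circular_warning = True

    def on_event(self, event_server_id: int) -> bool:
        """Return True if this event produced a warning."""
        if not self.rpl_circular_warning:
            return False
        if not self._is_suspicious(event_server_id):
            return False
        self.error_log.append(
            f"possible circular replication: event with server_id {event_server_id}")
        # 2. Turned off automatically after the first detection.
        self.rpl_circular_warning = False
        return True


switch = CircularWarningSwitch(lambda sid: sid == 1)  # treat server_id 1 as foreign
switch.on_slave_registered()
print(switch.on_event(1), switch.on_event(1))  # only the first event is reported
```

Re-enabling the warning in this model requires another `on_slave_registered()` call, matching step 1.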

tags: added: contribution
Valerii Kravchuk (valerii-kravchuk) wrote :

I see that the patch is still not applied, but IMHO it should be considered for all versions.

Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PS-1241
