Slow slave causes the whole switch process to hang

Bug #1108688 reported by cenalulu
This bug affects 1 person
Affects      Status   Importance   Assigned to   Milestone
mysql-mmm    New      Undecided    Unassigned

Bug Description

Consider an environment like this: DB1 and DB2 are configured as masters, and DB3 as a slave. Initially DB1 is the "preferred" writer and holds the writer IP, VIP1. DB3 is a slave of DB1. Because of its poor performance, DB3 frequently lags behind the master. So we set max_backlog in "mmm_common.conf" as below to keep DB3 from flapping between "ONLINE" and "DELAYED":
<check rep_backlog>
        max_backlog 36000
</check>
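
To make the effect of this setting concrete, here is a minimal sketch. It assumes the check simply compares the slave's replication delay in seconds against max_backlog; the exact comparison MMM performs is not shown in this report, and the function name backlog_state is made up for illustration.

```python
# Hypothetical model of the rep_backlog check; not MMM's actual code.
MAX_BACKLOG = 36000  # seconds, the value from mmm_common.conf above (10 hours)


def backlog_state(seconds_behind_master, max_backlog=MAX_BACKLOG):
    """Return the state a slave would be in for a given replication delay.

    Assumption: the slave is marked "DELAYED" only when its delay exceeds
    max_backlog, and stays "ONLINE" otherwise.
    """
    if seconds_behind_master > max_backlog:
        return "DELAYED"
    return "ONLINE"


# With the default-sized threshold a 2-minute lag would already flip the state;
# with 36000 the slave stays ONLINE even when it is hours behind.
print(backlog_state(120, max_backlog=60))
print(backlog_state(4 * 3600))
```

Under this assumption, DB3 can be up to ten hours behind DB1 and still count as "ONLINE", which is what makes the wait described below potentially so long.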

Now, when DB1 goes down while DB3 is far behind in replication, the whole MMM switch process hangs until DB3's SQL thread has caught up with its relay log. In the meantime, VIP1 is attached to neither DB1 nor DB2, so the app servers cannot query VIP1 for a long time, which is awful.

After reading the source code, I found that this is caused by the switch sequence, which can be summarized as follows:
1. the monitor detects that DB1 is down
2. the monitor tries to kill DB1 using the configured kill script
3. the writer role is set as orphaned
4. the monitor detects that the orphaned writer role can be assigned to DB2
5. the monitor notifies all the slaves (only DB3 in this situation) that DB2 will be the new "writer"
6. the agent on DB3 tries to catch up with DB1 as far as possible using "select master_pos_wait(xxx)"
7. the agent on DB3 changes its master to DB2
8. the monitor notifies DB2 that, as "writer", it can now take VIP1
9. the agent on DB2 receives this message and assigns VIP1 on eth0

The whole process hangs at step 6, since all these steps are executed synchronously. VIP1 is inaccessible from step 1 through step 9, for a duration that depends on how far DB3 is behind DB1.
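
The sequence above can be modelled with a small sketch to show how the VIP1 outage grows with DB3's lag. This is only an illustration, not MMM code: the step durations are made-up placeholders, and the names SWITCH_STEPS, CATCH_UP_STEP and vip_downtime are hypothetical.

```python
# Illustrative model of the synchronous switch sequence; not MMM's actual code.
SWITCH_STEPS = [
    "monitor detects DB1 is down",
    "monitor kills DB1 via the configured kill script",
    "writer role is set as orphaned",
    "monitor assigns the orphaned writer role to DB2",
    "slaves are notified that DB2 is the new writer",
    "agent on DB3 waits in master_pos_wait() to catch up with DB1",
    "agent on DB3 changes its master to DB2",
    "DB2 is notified that it can take VIP1",
    "agent on DB2 assigns VIP1 on eth0",
]
CATCH_UP_STEP = 6  # the step that blocks on the slow slave


def vip_downtime(catch_up_seconds, other_step_seconds=1.0):
    """Seconds during which VIP1 is attached to no host.

    Because the steps run one after another, the outage is the sum of all
    step durations. Assumption: every step except the catch-up takes the
    same placeholder time, other_step_seconds.
    """
    total = 0.0
    for number, _description in enumerate(SWITCH_STEPS, start=1):
        if number == CATCH_UP_STEP:
            total += catch_up_seconds
        else:
            total += other_step_seconds
    return total


print(vip_downtime(0))       # slave already caught up
print(vip_downtime(3600))    # slave needs an hour to apply its relay log
```

In this model the outage is dominated by the catch-up time as soon as DB3 lags by more than a few seconds, because nothing after step 6 can start until it returns.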

Fortunately, this drawback can be fixed, and I'll attach a patch as soon as possible.

cenalulu (cenalulu)
summary: - Slow slave will hang the whole swith process
+ Slow slave cause the whole swith process hang
description: updated