Improper transition from REPLICATION_FAIL to ONLINE

Bug #708266 reported by Sasha Pachev
This bug affects 1 person
Affects | Status | Importance | Assigned to | Milestone
mysql-mmm | New | Undecided | Unassigned |

Bug Description

This is with version 2.2.1.

Suppose trap_period and max_backlog are both set to high values. We can then run into the following unpleasant surprise, which could charitably be called a feature but is really a bug:

- slave thread dies
- the host goes into REPLICATION_FAIL
- the slave thread is restarted
- the slave is far behind, but it is still put in ONLINE because the trap to go into REPLICATION_DELAY has not yet been triggered
- it stays long enough in ONLINE to get assigned the role of READER
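
The sequence above can be modeled with a small standalone sketch. It is not mysql-mmm code: the function names, the parameter names, and the numbers are hypothetical, and it assumes (as the report implies) that the backlog check only reports failure once the backlog has stayed above max_backlog for trap_period seconds.

```perl
use strict;
use warnings;

# Hypothetical model of a "trapped" check: it keeps reporting OK until the
# backlog has exceeded max_backlog for at least trap_period seconds.
sub rep_backlog_ok {
    my (%a) = @_;
    return 1 if $a{backlog} <= $a{max_backlog};
    return $a{seconds_over_limit} < $a{trap_period} ? 1 : 0;
}

# Simplified recovery transition as described in the report
# (flapping and peer state are ignored in this sketch).
sub next_state {
    my ($state, $rep_threads_ok, $rep_backlog_ok) = @_;
    if ($state eq 'REPLICATION_FAIL' && $rep_threads_ok && $rep_backlog_ok) {
        return 'ONLINE';
    }
    return $state;
}

# The slave thread was just restarted: far behind, but the trap has not fired.
my $ok = rep_backlog_ok(
    backlog            => 3600,   # illustrative values only
    max_backlog        => 600,
    seconds_over_limit => 5,
    trap_period        => 300,
);
my $state = next_state('REPLICATION_FAIL', 1, $ok);
print "$state\n";
```

Under these assumptions the host is reported as ONLINE although it is far behind, which is the window in which it can be handed the READER role.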

Suggestion for a fix:

In this code in lib/Monitor/Monitor.pm:

    if ($state eq 'REPLICATION_DELAY' || $state eq 'REPLICATION_FAIL') {
        if ($ping && $mysql && (($rep_backlog && $rep_threads) || $peer_state ne 'ONLINE')) {

            # REPLICATION_DELAY || REPLICATION_FAIL -> AWAITING_RECOVERY
            if ($agent->flapping) {
                FATAL "State of host '$host' changed from $state to AWAITING_RECOVERY (because it's flapping)";
                $agent->state('AWAITING_RECOVERY');
                $self->send_agent_status($host);
                next;
            }

            # REPLICATION_DELAY || REPLICATION_FAIL -> ONLINE
            FATAL "State of host '$host' changed from $state to ONLINE";
            $agent->state('ONLINE');
            $self->send_agent_status($host);
            next;
        }

If the previous state was REPLICATION_FAIL, do not put the host in ONLINE right away when there is any lag at all; alternatively, there is nothing wrong with assuming REPLICATION_DELAY unconditionally. Either way, put it in REPLICATION_DELAY first. Then, if the slave catches up quickly, the next iteration will put it in ONLINE.
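
A minimal standalone sketch of the first variant of this suggestion follows. It is not a patch against Monitor.pm: recovery_target and $lag_seconds are hypothetical names, and how the monitor would obtain a raw lag value is an assumption that the report does not cover.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of mysql-mmm: once the ping, mysql, and
# replication checks pass, choose the state to recover into from the
# current state and a raw replication lag in seconds.
sub recovery_target {
    my ($state, $lag_seconds) = @_;

    # Coming out of REPLICATION_FAIL with any lag: stop at REPLICATION_DELAY.
    return 'REPLICATION_DELAY'
        if $state eq 'REPLICATION_FAIL' && $lag_seconds > 0;

    # Otherwise recover straight to ONLINE, as the existing code does.
    return 'ONLINE';
}

print recovery_target('REPLICATION_FAIL',  3600), "\n";   # slave still lagging
print recovery_target('REPLICATION_FAIL',  0),    "\n";   # slave already caught up
print recovery_target('REPLICATION_DELAY', 0),    "\n";   # a later iteration
```

For the unconditional variant, the lag test would simply be dropped, so that every host leaving REPLICATION_FAIL passes through REPLICATION_DELAY once.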
