Asynchronous slave thread remains stopped after node re-joins the cluster

Bug #1288479 reported by markus_albe on 2014-03-06
This bug affects 1 person
Affects: MySQL patches by Codership (Importance: High; Assigned to: Seppo Jaakola)
Affects: Percona XtraDB Cluster (status tracked in 5.6)
  5.5 (Importance: Undecided; Assigned to: Unassigned)
  5.6 (Importance: Undecided; Assigned to: Unassigned)

Bug Description

When a node goes into non-primary state, the asynchronous slave thread is halted, but it then remains stopped after the node successfully re-joins the cluster.

Seppo mentioned that "node could restart slave thread after he joined back in cluster", but he also noted a caveat: "automatic slave restart can be dangerous also, DBA may have started slave in some other cluster node during the node absence".

Thus, the ideal fix would be a switch, "async_slave_auto_restart" or similarly named, that would give DBAs the freedom to have the thread started automatically when the node re-joins the cluster after going into non-primary state.

tags: added: i39956
Changed in codership-mysql:
assignee: nobody → Seppo Jaakola (seppo-jaakola)
importance: Undecided → High
status: New → In Progress
milestone: none → 5.5.36-25.10
Seppo Jaakola (seppo-jaakola) wrote :

The replication slave thread is not stopped automatically when the node changes to non-primary state; it stops when the next replication event is applied. A node in non-primary state returns 'Unknown command' (error code 1047) to the slave applier, and the slave thread then decides to stop. The error message in the log is:

140320 23:56:53 [ERROR] Slave SQL: Error 'Unknown command' on query. Default database: 'test'. Query: 'BEGIN', Error_code: 1047
140320 23:56:53 [Warning] Slave: Unknown command Error_code: 1047
140320 23:56:53 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysqlbin.000001' position 555
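
To tell this kind of stop apart from other slave failures, one could scan the error log for the code shown above. The following is only a rough sketch: the function name and regular expression are my own, and it assumes the log lines keep the "[ERROR] Slave SQL: ... Error_code: N" shape seen in the excerpt.

```python
import re

# Matches lines shaped like the "[ERROR] Slave SQL: ... Error_code: N" entry above.
SLAVE_SQL_ERROR = re.compile(r"\[ERROR\] Slave SQL: .*Error_code: (\d+)")

# Error code reported in the log excerpt when the node is non-primary.
NON_PRIMARY_ERROR_CODE = 1047


def stopped_by_non_primary(log_lines):
    """Return True if any Slave SQL error line carries error code 1047."""
    for line in log_lines:
        match = SLAVE_SQL_ERROR.search(line)
        if match and int(match.group(1)) == NON_PRIMARY_ERROR_CODE:
            return True
    return False


# The three log lines quoted in this comment.
sample = [
    "140320 23:56:53 [ERROR] Slave SQL: Error 'Unknown command' on query. "
    "Default database: 'test'. Query: 'BEGIN', Error_code: 1047",
    "140320 23:56:53 [Warning] Slave: Unknown command Error_code: 1047",
    "140320 23:56:53 [ERROR] Error running query, slave SQL thread aborted. "
    "Fix the problem, and restart the slave SQL thread with \"SLAVE START\". "
    "We stopped at log 'mysqlbin.000001' position 555",
]

if __name__ == "__main__":
    print(stopped_by_non_primary(sample))
```

Only the first of the three lines matches, since the pattern requires both the "Slave SQL:" prefix and an error code.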

Seppo Jaakola (seppo-jaakola) wrote :

Created an optional fix for this. If wsrep_restart_slave is set (default is unset), mysql will automatically restart a slave that was stopped due to the node going non-primary. The slave restart happens when the node joins back to the cluster or is bootstrapped to start a new cluster.
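
For readers who want the intended behaviour spelled out, here is a toy model of it. It is not the patch itself: the class, its methods and the `stopped_by_non_primary` flag are hypothetical names, and the only things taken from this report are the option name, its unset default, and the rule that a restart happens only for a slave that stopped because the node went non-primary.

```python
from dataclasses import dataclass


@dataclass
class NodeModel:
    """Toy model of one cluster node; all names here are illustrative."""
    restart_slave: bool = False           # stands in for wsrep_restart_slave (default unset)
    primary: bool = True                  # is the node part of the primary component?
    slave_running: bool = True            # is the async slave thread running?
    stopped_by_non_primary: bool = False  # hypothetical bookkeeping flag

    def lose_primary(self):
        # Losing primary state does not stop the slave by itself.
        self.primary = False

    def apply_event(self):
        # The next replication event fails on a non-primary node,
        # and the slave thread stops as a result.
        if self.slave_running and not self.primary:
            self.slave_running = False
            self.stopped_by_non_primary = True

    def dba_stop_slave(self):
        # A manual stop must never be undone automatically.
        self.slave_running = False
        self.stopped_by_non_primary = False

    def rejoin(self):
        # Node joins back to the cluster (or bootstraps a new one).
        self.primary = True
        if self.restart_slave and self.stopped_by_non_primary:
            self.slave_running = True
            self.stopped_by_non_primary = False


if __name__ == "__main__":
    node = NodeModel(restart_slave=True)
    node.lose_primary()
    node.apply_event()
    node.rejoin()
    print(node.slave_running)
```

With `restart_slave` left at its default, the same sequence leaves the slave stopped, which is the behaviour originally reported in this bug.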

Fix was pushed in wsrep-5.5 revision: http://bazaar.launchpad.net/~codership/codership-mysql/wsrep-5.5/revision/3967

Seppo Jaakola (seppo-jaakola) wrote :
Changed in codership-mysql:
status: In Progress → Fix Committed
no longer affects: galera
Seppo Jaakola (seppo-jaakola) wrote :

Pushed a tuned fix, which takes into consideration that the node state may have turned back to primary before the slave error is handled. Fixes were pushed in revisions:
wsrep-5.5: http://bazaar.launchpad.net/~codership/codership-mysql/wsrep-5.5/revision/3969
wsrep-5.6: http://bazaar.launchpad.net/~codership/codership-mysql/5.6/revision/4066

Changed in codership-mysql:
status: Fix Committed → Fix Released