PXC updates mysql.event table with same value which can break async slave with replication filters

Bug #1528020 reported by Sveta Smirnova on 2015-12-20
Affects: Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC); status tracked in 5.6
5.6 series: Fix Released, Importance: Undecided, Assigned to: Krunal Bauskar

Bug Description

During restart of a node that has events created on another node, PXC updates the mysql.event table so that the status of the event becomes SLAVESIDE_DISABLED, without checking whether this status was already set.

At first glance this update looks innocent, although unnecessary, but it can break an async slave of this node if that slave uses replication filters.

How to repeat:

- set up a 3-node cluster: node1, node2, node3
- additionally, enable GTIDs on all nodes
- point an async slave to node1 of this cluster
- have replication filters on the slave, such as --replicate-ignore-db=test, --replicate-wild-ignore-table=test.%
- create an event on node1 in database test
- wait some time, then redirect the async slave to node2
- check that replication is up and running
- restart node2
- replication will fail, because the slave receives an UPDATE on the mysql.event table for a record that does not exist on the slave (the replication filters prevented it from ever being created there)

Or read the code:

$ cat sql/events.cc    (around lines 1142-1155)
...
#ifdef WITH_WSREP
    // when SST from master node who initials event, the event status is ENABLED
    // this is problematic because there are two nodes with same events and both enabled.
    if (et->originator != thd->server_id)
    {
      store_record(table, record[1]);
      table->field[ET_FIELD_STATUS]->
        store((longlong) Event_parse_data::SLAVESIDE_DISABLED,
              TRUE);
      (void) table->file->ha_update_row(table->record[1], table->record[0]);
      delete et;
      continue;
    }
#endif
...

Suggested fix:

Disable the binlog before updating the mysql.event table on startup. Or, better, do this operation not on startup, but right after the record is replicated to node2 (with binlogging disabled there as well).
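
A minimal sketch of that suggestion, applied to the events.cc block quoted above and assuming the server's tmp_disable_binlog / reenable_binlog macros (sql_class.h) are usable at this point; this is an illustration, not the actual patch:

    if (et->originator != thd->server_id)
    {
      store_record(table, record[1]);
      table->field[ET_FIELD_STATUS]->
        store((longlong) Event_parse_data::SLAVESIDE_DISABLED, TRUE);
      /* Keep this startup-time maintenance write out of the binary log
         so it can never reach an async slave. */
      tmp_disable_binlog(thd);
      (void) table->file->ha_update_row(table->record[1], table->record[0]);
      reenable_binlog(thd);
      delete et;
      continue;
    }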

Krunal Bauskar (krunal-bauskar) wrote:

commit fe9f72fccfd70ab75fe536e8363a8613f33f7eda
Author: Krunal Bauskar <email address hidden>
Date: Thu Dec 31 08:51:01 2015 +0530
PXC#492: PXC updates mysql.event table with same value which can break async
slave with replication filters
Issue:
To understand the issue, consider the following topology:
- a Galera cluster with 2 nodes (each node having a unique server-id)
- an async slave replicating from one of the Galera nodes, with a
  replication filter configured to avoid replicating events created
  on the master
Now say an event is created on node#1, which is also acting as async
master. This event is not replicated to the slave because of the
replication filter, but it is replicated to galera-node#2 as per the
normal Galera replication protocol.
Suddenly node#1 goes off and the load balancer makes node#2 the master,
causing the async slave to switch to the new master.
Now, if node#2 is restarted for some reason, on restart it will read
events from its local mysql.event table, which already has
status = SLAVESIDE_DISABLED. Due to a bug in the code this status was
re-updated to the same value, but that action also generates an UPDATE
binlog statement, which is then replicated to the async slave. The
async slave does not have this mysql.event entry, so replication fails.
What are the issues?
a. Re-updating the status to the same value is a redundant action and
   should be avoided.
b. Even if the action is allowed, it should not generate a binlog entry.
In fact, the complete semantics around mysql.* replication should be
re-thought, but let's limit this issue to the stated problem for now.
Solution:
--------
Avoid null updates, as in "UPDATE ... SET a = x WHERE a = x".
Avoid writing such actions to the binlog.
This issue does not occur if the server-ids of the Galera nodes are the
same. Ideally that should be the case, since a Galera cluster is a
single atomic entity from the perspective of the bigger eco-system, but
to work around other issues users tend to set a unique server-id for
each Galera node.
---------------------------------------------------------------
There is still another leftover issue to be fixed as part of a
different tracking issue: what happens when the event is created on
node#1 before node#2 is allowed to boot up, and both nodes have the
same server-id? The existing logic will enable the event on both nodes.
So isn't it better to have a different server-id for each node? I would
still say no, given that Galera is a single atomic system.
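
The commit text describes the fix only in prose. A minimal sketch of the "avoid null updates" part, written against the events.cc block quoted earlier (field and enum names are taken from that block; the actual shipped diff may differ):

    /* Only touch the row if the stored status would actually change. */
    if (table->field[ET_FIELD_STATUS]->val_int() !=
        (longlong) Event_parse_data::SLAVESIDE_DISABLED)
    {
      store_record(table, record[1]);
      table->field[ET_FIELD_STATUS]->
        store((longlong) Event_parse_data::SLAVESIDE_DISABLED, TRUE);
      (void) table->file->ha_update_row(table->record[1], table->record[0]);
    }
    /* Otherwise the status is already SLAVESIDE_DISABLED: nothing to
       write, and therefore nothing to binlog to the async slave. */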

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PXC-1875
