Comment 3 for bug 1370532

Krunal Bauskar (krunal-bauskar) wrote :

- Tried the following use case:

- Started a 2-node cluster: node-1 and node-2

- node-1:
  create table t (i int) engine=innodb;
  insert into t values (1);

- node-2:
  select * from t; .... got 1 as expected

- node-1:
  flush table with read lock;
  set global wsrep_desync=1; .... this halted the cluster: both are local actions that need local commit ordering, but the latch is held by FTWRL.

----------------

Let's understand what FTWRL does:
- It pauses the cluster node. The node continues to receive write-sets, but they are not applied because the applier is paused.
Once the lock is released, all such queued events are applied.

Let's now understand what wsrep_desync does:
- It simply indicates that this node shouldn't be considered for flow control, but the node still continues to receive events.

If FTWRL has paused a cluster node without wsrep_desync, then within a short period of time (depending on configuration and workload) the node holding FTWRL will start to emit flow control, which will completely pause the cluster.
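
As a rough way to watch this build up, one could poll the receive queue and flow-control counters from a second session on the node holding FTWRL. This is only a sketch: it assumes the standard Galera status variables are exposed on this build, and the thresholds at which flow control kicks in depend on configuration.

```sql
-- Run from a second session on the node that holds FTWRL.
-- Number of write-sets received but not yet applied (grows while the applier is paused).
SHOW GLOBAL STATUS LIKE 'wsrep_local_recv_queue';

-- Number of flow-control pause messages this node has sent to the cluster.
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_sent';

-- Fraction of time replication has been paused due to flow control.
SHOW GLOBAL STATUS LIKE 'wsrep_flow_control_paused';
```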

-----------

When would a user normally use FTWRL? While taking a backup. And of course, if the user wants the node to avoid sending flow control, the user can set wsrep_desync before enabling FTWRL, so the sequence would be:

- wsrep_desync = 1
- FTWRL
- take backup
- unlock
- wsrep_desync = 0
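
The sequence above can be sketched as a single session on the node being backed up. This is a minimal illustration, not a tested backup procedure; the backup step itself is left as a comment because the tool used is not specified here.

```sql
-- 1. Take the node out of flow-control consideration first.
SET GLOBAL wsrep_desync = 1;

-- 2. Pause the node (the applier stops applying incoming write-sets).
FLUSH TABLES WITH READ LOCK;

-- 3. Take the backup here, while this session stays open and holds the lock.

-- 4. Release the lock so queued write-sets can be applied.
UNLOCK TABLES;

-- 5. Put the node back under flow control.
SET GLOBAL wsrep_desync = 0;
```

Note the ordering: wsrep_desync is changed only before FTWRL is taken and after it is released, never while the node is paused.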

-----------

With that flow clarified, I would propose blocking wsrep_desync toggling if the node is already in the paused state.
A node can be desynced only when it is unpaused.