Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC

Setting wsrep_desync=1 after FTWRL blocks a node

Bug #1370532 reported by Przemek on 2014-09-17

This bug affects 1 person

Affects	Status	Importance	Assigned to	Milestone
Percona XtraDB Cluster	Status tracked in 5.6
5.5	Won't Fix	Undecided	Unassigned
5.6	Fix Released	High	Unassigned	5.6.27-25.13
5.7	Fix Released	Undecided	Unassigned

Bug Description

Tested the problem below on PXC 5.5.37, 5.6.20 and MariaDB Galera Cluster 10.0.13.

percona33 mysql> flush tables with read lock;
Query OK, 0 rows affected (0.00 sec)
-- err log on the same node:
2014-09-17 12:46:25 16422 [Note] WSREP: Provider paused at c3b203a1-3435-11e4-aa44-9605577e3230:249 (3)

percona33 mysql> set global wsrep_desync=1;
... (waiting)
-- percona33 err log:
2014-09-17 12:47:35 16422 [Note] WSREP: Member 2.0 (percona33) desyncs itself from group
2014-09-17 12:47:35 16422 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 249)

-- remaining nodes err log:
2014-09-17 12:47:35 19275 [Note] WSREP: Member 2.0 (percona33) desyncs itself from group

The only way to unblock the node is to kill -9 mysqld.

Przemek (pmalkowski) wrote on 2014-09-17:

Corresponding bug filed at:
https://github.com/codership/galera/issues/131

Miguel Angel Nieto (miguelangelnieto) wrote on 2014-09-24:

The test case is easy to reproduce. Confirmed.

Krunal Bauskar (krunal-bauskar) wrote on 2015-12-09:

- Tried the following use case:

- Started a 2-node cluster: node-1 and node-2

- node-1:
create table t (i int) engine=innodb;
insert into t values (1);

- node-2:
select * from t; ... got 1, as expected

- node-1:
flush tables with read lock;
set global wsrep_desync=1; ... this halted the cluster, as both are local actions that need local commit ordering, but the latch is held by FTWRL.

----------------

Let's understand what FTWRL does:
- It pauses the cluster node. The node continues to receive write-sets, but they are not applied because the applier is paused.
Once the tables are unlocked, all queued write-sets are applied.

Let's now understand what wsrep_desync does:
- It simply indicates that this node should not be considered for flow control, but it, too, continues to receive events.

If FTWRL pauses a cluster node without wsrep_desync, then within a short period of time (depending on configuration and workload) the node holding FTWRL will start to emit flow control, which completely pauses the cluster.
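To make the interaction between FTWRL, wsrep_desync and flow control concrete, here is a toy Python model. It is not Galera code: it assumes a single receive queue and a hypothetical threshold `fc_limit` standing in for Galera's `gcs.fc_limit` provider option, and it simplifies an unpaused applier to drain each write-set immediately.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyNode:
    """Toy model of one node's receive queue; not Galera's implementation."""

    fc_limit: int           # hypothetical stand-in for Galera's gcs.fc_limit
    paused: bool = False    # True while FTWRL keeps the provider paused
    desynced: bool = False  # True after SET GLOBAL wsrep_desync=1
    queue_len: int = 0      # write-sets received but not yet applied

    def receive(self) -> None:
        # Write-sets keep arriving whether or not the node is paused or desynced.
        self.queue_len += 1
        if not self.paused:
            # Simplification: an unpaused applier drains each write-set at once.
            self.queue_len -= 1

    def emits_flow_control(self) -> bool:
        # A desynced node is excluded from flow control; otherwise the node
        # asks the cluster to pause once its backlog exceeds the limit.
        return not self.desynced and self.queue_len > self.fc_limit


def first_fc_event(node: ToyNode, incoming: int) -> Optional[int]:
    """Feed `incoming` write-sets; return the index that first triggers flow control."""
    for i in range(incoming):
        node.receive()
        if node.emits_flow_control():
            return i
    return None
```

With `ToyNode(fc_limit=16, paused=True)` the 17th write-set (index 16) triggers flow control, while the same node with `desynced=True` lets its backlog grow without throttling the rest of the cluster.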

-----------

A user would normally use FTWRL while taking a backup. If the user does not want the node to send flow control, wsrep_desync can be set before enabling FTWRL, so the sequence would be:

- wsrep_desync = 1
- FTWRL
- take backup
- unlock
- wsrep_desync = 0
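The sequence above can be scripted. Below is a minimal Python sketch, assuming a PEP 249 (DB-API 2.0) cursor from a MySQL driver; `backup_with_desync` and `take_backup` are hypothetical names. All statements must run in one session, because FTWRL is held by the session that issued it. The `try/finally` nesting is a design choice so that the lock is released and the node rejoins flow control even if the backup step fails.

```python
from typing import Callable, Protocol


class Cursor(Protocol):
    """Minimal subset of a PEP 249 (DB-API 2.0) cursor used below."""

    def execute(self, operation: str) -> object: ...


def backup_with_desync(cur: Cursor, take_backup: Callable[[], None]) -> None:
    """Run desync -> FTWRL -> backup -> unlock -> resync on a single session."""
    cur.execute("SET GLOBAL wsrep_desync=ON")  # 1. desync BEFORE taking the lock
    try:
        cur.execute("FLUSH TABLES WITH READ LOCK")  # 2. FTWRL pauses the provider
        try:
            take_backup()  # 3. caller-supplied backup step (hypothetical)
        finally:
            cur.execute("UNLOCK TABLES")  # 4. applier resumes queued write-sets
    finally:
        cur.execute("SET GLOBAL wsrep_desync=OFF")  # 5. rejoin flow control
```

If FLUSH TABLES WITH READ LOCK itself fails, UNLOCK TABLES is skipped, but wsrep_desync is still switched back off.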

-----------

With that flow clarified, I propose blocking wsrep_desync toggling if the node is already paused: a node can be desynced only when it is not paused.
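The proposed rule can be modeled as a small state machine. This is a hypothetical Python sketch, not PXC's actual C++ implementation; `ToyProvider` and `DesyncProhibited` are invented names. The point is that rejecting the request immediately turns an unkillable hang into an ordinary error, similar to the ERROR 1105 later reported on 5.6.37.

```python
class DesyncProhibited(Exception):
    """Hypothetical exception standing in for the server-side error."""


class ToyProvider:
    """Toy model of the proposed rule: wsrep_desync may not be toggled
    while the provider is paused by FTWRL."""

    def __init__(self) -> None:
        self.paused = False    # set by FLUSH TABLES WITH READ LOCK
        self.desynced = False  # set by SET GLOBAL wsrep_desync

    def flush_tables_with_read_lock(self) -> None:
        self.paused = True

    def unlock_tables(self) -> None:
        self.paused = False

    def set_desync(self, value: bool) -> None:
        if self.paused:
            # Fail fast instead of blocking on a latch held by FTWRL.
            raise DesyncProhibited("desync/resync of a paused node is prohibited")
        self.desynced = value
```

In this model the recommended order (desync, FTWRL, unlock, resync) succeeds, while FTWRL followed by `set_desync(True)` raises instead of blocking.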

Hrvoje Matijakovic (hrvojem) wrote on 2016-01-07:

Fixed here: https://github.com/percona/percona-xtradb-cluster/pull/70

Przemek (pmalkowski) wrote on 2017-03-22:

The problem is back in recent PXC versions. Tested on PXC 5.6.35 and 5.7.17:

mysql> flush tables with read lock; set global wsrep_desync=1;
Query OK, 0 rows affected (0.00 sec)

... hangs

In processlist, unkillable process:

The only way to unblock a node from permanent DESYNCED state is forcible kill.

Przemek (pmalkowski) wrote on 2017-11-07:

Just verified that 5.7.16-27 and 5.7.17-27 are affected, but versions from 5.7.17-29 up to 5.7.19 are fine.

Przemek (pmalkowski) wrote on 2017-11-07:

Also 5.6.37 is OK now:

percona1 mysql> select @@version,@@version_comment\G
*************************** 1. row ***************************
@@version: 5.6.37-82.2-56-log
@@version_comment: Percona XtraDB Cluster (GPL), Release rel82.2, Revision 114f2f2, WSREP version 26.21, wsrep_26.21
1 row in set (0.00 sec)

percona1 mysql> flush tables with read lock; set global wsrep_desync=1;
Query OK, 0 rows affected (0.00 sec)

ERROR 1105 (HY000): Explictly desync/resync of already desynced/paused node is prohibited

Shahriyar Rzayev (rzayev-sehriyar) wrote on 2018-01-18:

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-797
