Setting wsrep_desync=1 after FTWRL blocks a node

Bug #1370532 reported by Przemek on 2014-09-17
This bug affects 1 person
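Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC), status tracked in 5.6:
+---------+--------------+------------+-------------+-----------+
| Affects | Status       | Importance | Assigned to | Milestone |
+---------+--------------+------------+-------------+-----------+
| 5.5     | Won't Fix    | Undecided  | Unassigned  |           |
| 5.6     | Fix Released | High       | Unassigned  |           |
| 5.7     | Fix Released | Undecided  | Unassigned  |           |
+---------+--------------+------------+-------------+-----------+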

Bug Description

Tested the problem below on PXC 5.5.37, 5.6.20 and MariaDB Galera Cluster 10.0.13.

percona33 mysql> flush tables with read lock;
Query OK, 0 rows affected (0.00 sec)
-- err log on the same node:
2014-09-17 12:46:25 16422 [Note] WSREP: Provider paused at c3b203a1-3435-11e4-aa44-9605577e3230:249 (3)

percona33 mysql> set global wsrep_desync=1;
... (waiting)
-- percona33 err log:
2014-09-17 12:47:35 16422 [Note] WSREP: Member 2.0 (percona33) desyncs itself from group
2014-09-17 12:47:35 16422 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 249)

-- remaining nodes err log:
2014-09-17 12:47:35 19275 [Note] WSREP: Member 2.0 (percona33) desyncs itself from group

-- second mysql session, opened before the test:
percona33 mysql> show processlist;
+----+-------------+-----------+------+---------+------+--------------------+---------------------------+-----------+---------------+
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined |
+----+-------------+-----------+------+---------+------+--------------------+---------------------------+-----------+---------------+
| 1 | system user | | NULL | Sleep | 265 | NULL | NULL | 0 | 0 |
| 2 | system user | | NULL | Sleep | 265 | wsrep aborter idle | NULL | 0 | 0 |
| 3 | system user | | NULL | Sleep | 265 | NULL | NULL | 0 | 0 |
| 4 | root | localhost | NULL | Query | 74 | Opening tables | set global wsrep_desync=1 | 0 | 0 |
| 5 | root | localhost | NULL | Query | 0 | init | show processlist | 0 | 0 |
+----+-------------+-----------+------+---------+------+--------------------+---------------------------+-----------+---------------+
5 rows in set (0.00 sec)

No new connections are accepted on the percona33 node.
Ctrl+C does not work in the session where desync mode was turned on, and the statement stays in the processlist even after the session is killed:
percona33 mysql> show processlist;
+----+-------------+-----------+------+---------+------+--------------------+---------------------------+-----------+---------------+
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined |
+----+-------------+-----------+------+---------+------+--------------------+---------------------------+-----------+---------------+
| 1 | system user | | NULL | Sleep | 404 | NULL | NULL | 0 | 0 |
| 2 | system user | | NULL | Sleep | 404 | wsrep aborter idle | NULL | 0 | 0 |
| 3 | system user | | NULL | Sleep | 404 | NULL | NULL | 0 | 0 |
| 4 | root | localhost | NULL | Killed | 213 | Opening tables | set global wsrep_desync=1 | 0 | 0 |
| 5 | root | localhost | NULL | Query | 0 | init | show processlist | 0 | 0 |
+----+-------------+-----------+------+---------+------+--------------------+---------------------------+-----------+---------------+
5 rows in set (0.00 sec)

The only way to unblock the node is to kill -9 mysqld.

Przemek (pmalkowski) wrote :

The test case can be repeated very easily. Confirmed.

Krunal Bauskar (krunal-bauskar) wrote :

- Tried the following use case:

- Started 2 node cluster: node-1 and node-2

- node-1:
  create table t (i int) engine=innodb;
  insert into t values (1);

- node-2
  select * from t; .... got 1 as expected

- node-1
  flush tables with read lock;
  set global wsrep_desync=1; .... this halted the cluster: both statements are local actions that need local commit ordering, but the latch is held by FTWRL.

----------------

Let's understand what FTWRL does:
- It pauses the cluster node. The node continues to receive write-sets, but they are not applied because the applier is paused.
Once the tables are unlocked, all queued events are applied.

Let's now understand what wsrep_desync does:
- It simply indicates that this node should not be considered for flow control; the node still continues to receive events.

If FTWRL has paused a cluster node without wsrep_desync, then within a short period of time (depending on configuration and workload) the node holding FTWRL will start to emit flow control, which will completely pause the cluster.
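
The interaction described above can be illustrated with a toy model. This is not Galera code: the function name, the `fc_limit` threshold and the "one write-set per step" queue are assumptions made purely for illustration. It only mirrors the description: a paused node keeps receiving write-sets without applying them, and a node that is not desynced emits flow control once its queue outgrows a limit.

```python
from typing import Optional


def writesets_until_flow_control(
    paused: bool, desynced: bool, fc_limit: int, incoming: int
) -> Optional[int]:
    """Toy model (hypothetical, not the Galera algorithm).

    Returns the 1-based index of the received write-set at which the
    modelled node would emit flow control, or None if it never does.
    """
    queue = 0
    for n in range(1, incoming + 1):
        if paused:
            # Write-set is received but not applied, so it stays queued.
            queue += 1
        # Simplification: an unpaused node applies each write-set at once,
        # so its queue never grows.
        if not desynced and queue > fc_limit:
            return n
    return None


# Paused by FTWRL and not desynced: flow control after the limit is exceeded.
print(writesets_until_flow_control(True, False, fc_limit=3, incoming=10))
# Paused but desynced first: the queue grows, yet no flow control is emitted.
print(writesets_until_flow_control(True, True, fc_limit=3, incoming=10))
```

In this model the desynced node still accumulates a backlog that has to be applied after the lock is released; desync only stops it from holding back the rest of the cluster.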

-----------

When would a user normally use FTWRL? While taking a backup. If the user does not want the node to send flow control during the backup, wsrep_desync can be set before taking FTWRL, so the sequence would be:

- wsrep_desync = 1
- FTWRL
- take backup
- unlock
- wsrep_desync = 0
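
The sequence above can be sketched as a small helper. This is a minimal sketch under stated assumptions, not a tested backup tool: `run_backup_with_desync`, `execute` and `take_backup` are hypothetical names, and `execute` is assumed to send each statement over one and the same client session (FTWRL is released when its session ends). Only the SQL statements themselves come from the steps listed above.

```python
from typing import Callable


def run_backup_with_desync(
    execute: Callable[[str], None], take_backup: Callable[[], None]
) -> None:
    """Run the desync -> FTWRL -> backup -> unlock -> resync sequence.

    `execute` is a caller-supplied function (hypothetical) that runs one SQL
    statement on a single, persistent session to the node being backed up.
    """
    # Desync first, while the node is still unpaused.
    execute("SET GLOBAL wsrep_desync = 1")
    try:
        execute("FLUSH TABLES WITH READ LOCK")
        try:
            take_backup()
        finally:
            # Always release the lock, even if the backup fails.
            execute("UNLOCK TABLES")
    finally:
        # Resync only after the node has been unpaused again.
        execute("SET GLOBAL wsrep_desync = 0")


# Dry run: record the statements instead of sending them to a server.
log = []
run_backup_with_desync(log.append, lambda: log.append("-- backup runs here"))
print("\n".join(log))
```

The nested try/finally keeps the order symmetric, so wsrep_desync is never toggled while the node is paused, which is the situation this bug is about.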

-----------

With that flow clarified, I would propose blocking wsrep_desync toggling if the node is already in a paused state.
A node can be desynced only when it is unpaused.
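
The proposed guard can be modelled as a simple state check. This is a hypothetical sketch, not the PXC implementation: the class `NodeModel`, its method names and the error text are made up for illustration, and the model tracks only the two flags discussed in this report.

```python
class NodeModel:
    """Toy model of a cluster node with a paused flag and a desync flag."""

    def __init__(self) -> None:
        self.paused = False
        self.desynced = False

    def flush_tables_with_read_lock(self) -> None:
        # FTWRL pauses the node.
        self.paused = True

    def unlock_tables(self) -> None:
        self.paused = False

    def set_desync(self, value: bool) -> None:
        # Proposed rule: reject the toggle instead of blocking forever.
        if self.paused:
            raise RuntimeError("wsrep_desync toggle rejected: node is paused")
        self.desynced = value


node = NodeModel()
node.flush_tables_with_read_lock()
try:
    node.set_desync(True)
except RuntimeError as exc:
    print(exc)  # the statement fails fast rather than hanging
node.unlock_tables()
node.set_desync(True)  # allowed once the node is unpaused
```

Failing fast leaves the session usable, so the user can unlock the tables and retry in the intended order.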

Przemek (pmalkowski) wrote :

The problem is back in recent PXC versions. Tested on PXC 5.6.35 and 5.7.17:

mysql> flush tables with read lock; set global wsrep_desync=1;
Query OK, 0 rows affected (0.00 sec)

... hangs

In the processlist, there is an unkillable process:

mysql> select * from information_schema.processlist where state like 'op%';
+----+------+-----------+------+---------+-------+----------------+---------------------------+----------+-----------+---------------+
| ID | USER | HOST | DB | COMMAND | TIME | STATE | INFO | TIME_MS | ROWS_SENT | ROWS_EXAMINED |
+----+------+-----------+------+---------+-------+----------------+---------------------------+----------+-----------+---------------+
| 7 | root | localhost | NULL | Query | 16390 | Opening tables | set global wsrep_desync=1 | 16389992 | 0 | 0 |
+----+------+-----------+------+---------+-------+----------------+---------------------------+----------+-----------+---------------+
1 row in set (0.00 sec)

The only way to unblock a node from the permanent DESYNCED state is a forcible kill.

Przemek (pmalkowski) wrote :

Just verified that 5.7.16-27 and 5.7.17-27 are affected, but versions from 5.7.17-29 up to 5.7.19 are fine.

Przemek (pmalkowski) wrote :

Also 5.6.37 is OK now:

percona1 mysql> select @@version,@@version_comment\G
*************************** 1. row ***************************
        @@version: 5.6.37-82.2-56-log
@@version_comment: Percona XtraDB Cluster (GPL), Release rel82.2, Revision 114f2f2, WSREP version 26.21, wsrep_26.21
1 row in set (0.00 sec)

percona1 mysql> flush tables with read lock; set global wsrep_desync=1;
Query OK, 0 rows affected (0.00 sec)

ERROR 1105 (HY000): Explictly desync/resync of already desynced/paused node is prohibited

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PXC-797
