Removing a node from a cluster via SET GLOBAL wsrep_cluster_address hangs the node being removed

Bug #931713 reported by Patrick Zoblisein
Affects: MySQL patches by Codership
  Status: Fix Released | Importance: Low | Assigned to: Seppo Jaakola

Affects: Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC)
  Status: Fix Released | Importance: Undecided | Assigned to: Unassigned

Bug Description

Hi,

Trying to remove a single node from a 3-node cluster via:

SET GLOBAL wsrep_cluster_address='gcomm://'; (also tested with the same result: SET GLOBAL wsrep_cluster_address='';)

using Server version: 5.5.17-22.1-log Percona XtraDB Cluster (GPL), Release 22.1, Revision 3683 wsrep_22.3.r3683

The command hangs the server: it never returns, and the server is no longer accessible. I have to use kill -9 to get the instance usable again.
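
For reference, the check sequence I would expect to use around such a change (a minimal sketch; the wsrep status variables are standard, the counts are illustrative):

mysql> SHOW STATUS LIKE 'wsrep_cluster_size';    -- expect 3 before the change
mysql> SHOW STATUS LIKE 'wsrep_cluster_status';  -- 'Primary' on a healthy node
mysql> SET GLOBAL wsrep_cluster_address='gcomm://';
mysql> SHOW STATUS LIKE 'wsrep_cluster_status';  -- no longer 'Primary' once the node has left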

From the error log from the node being removed:
120213 22:15:09 [Note] WSREP: Stop replication
120213 22:15:09 [Note] WSREP: Closing send monitor...
120213 22:15:09 [Note] WSREP: Closed send monitor.
120213 22:15:09 [Note] WSREP: gcomm: terminating thread
120213 22:15:09 [Note] WSREP: gcomm: joining thread
120213 22:15:09 [Note] WSREP: gcomm: closing backend
120213 22:15:09 [Note] WSREP: evs::proto(e0c81b4c-567e-11e1-0800-85ed415644d4, LEAVING, view_id(REG,5d6a7cdc-565a-11e1-0800-7dba6589ad90,92)) uuid 5d6a7cdc-565a-11e1-0800-7dba6589ad90 missing from install message, assuming partitioned
120213 22:15:09 [Note] WSREP: evs::proto(e0c81b4c-567e-11e1-0800-85ed415644d4, LEAVING, view_id(REG,5d6a7cdc-565a-11e1-0800-7dba6589ad90,92)) uuid 6cd1e65e-567f-11e1-0800-377d6c6676ab missing from install message, assuming partitioned
120213 22:15:09 [Note] WSREP: GMCast::handle_stable_view: view(view_id(NON_PRIM,5d6a7cdc-565a-11e1-0800-7dba6589ad90,92) memb {
        e0c81b4c-567e-11e1-0800-85ed415644d4,
} joined {
} left {
} partitioned {
        5d6a7cdc-565a-11e1-0800-7dba6589ad90,
        6cd1e65e-567f-11e1-0800-377d6c6676ab,
})
120213 22:15:09 [Note] WSREP: New COMPONENT: primary = no, my_idx = 0, memb_num = 1
120213 22:15:09 [Note] WSREP: GMCast::handle_stable_view: view((empty))
120213 22:15:09 [Note] WSREP: gcomm: closed
120213 22:15:09 [Note] WSREP: Flow-control interval: [230, 256]
120213 22:15:09 [Note] WSREP: Received NON-PRIMARY.
120213 22:15:09 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 4046094)
120213 22:15:09 [Note] WSREP: Received self-leave message.
120213 22:15:09 [Note] WSREP: New cluster view: global state: 0ec148da-46c0-11e1-0800-e762c70be5e7:4046094, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 1
120213 22:15:09 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
120213 22:15:09 [Note] WSREP: Flow-control interval: [0, 0]
120213 22:15:09 [Note] WSREP: Received SELF-LEAVE. Closing connection.
120213 22:15:09 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 4046094)
120213 22:15:09 [Note] WSREP: RECV thread exiting 0: Success
120213 22:15:09 [Note] WSREP: recv_thread() joined.
120213 22:15:09 [Note] WSREP: Closing slave action queue.
120213 22:15:09 [Note] WSREP: New cluster view: global state: 0ec148da-46c0-11e1-0800-e762c70be5e7:4046094, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version 1
120213 22:15:09 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
120213 22:15:09 [Note] WSREP: applier thread exiting (code:0)
120213 22:15:09 [Note] WSREP: applier thread exiting (code:5)
120213 22:15:09 [Note] WSREP: applier thread exiting (code:5)
120213 22:15:09 [Note] WSREP: applier thread exiting (code:5)
120213 22:15:09 [Note] WSREP: applier thread exiting (code:5)
120213 22:15:09 [Note] WSREP: applier thread exiting (code:5)
120213 22:15:09 [Note] WSREP: applier thread exiting (code:5)
120213 22:15:09 [Note] WSREP: applier thread exiting (code:5)
120213 22:15:09 [Note] WSREP: applier thread exiting (code:5)
120213 22:15:09 [Note] WSREP: applier thread exiting (code:5)
120213 22:15:09 [Note] WSREP: applier thread exiting (code:5)
120213 22:15:09 [Note] WSREP: applier thread exiting (code:5)
120213 22:15:09 [Note] WSREP: applier thread exiting (code:5)
120213 22:15:09 [Note] WSREP: applier thread exiting (code:5)
120213 22:15:09 [Note] WSREP: applier thread exiting (code:5)
120213 22:15:09 [Note] WSREP: applier thread exiting (code:5)
120213 22:15:09 [Note] WSREP: applier thread exiting (code:5)
120213 22:15:09 [Note] WSREP: applier thread exiting (code:5)
120213 22:15:09 [Note] WSREP: applier thread exiting (code:5)
120213 22:15:09 [Note] WSREP: applier thread exiting (code:5)
120213 22:15:09 [Note] WSREP: applier thread exiting (code:5)
120213 22:15:09 [Note] WSREP: applier thread exiting (code:5)
120213 22:15:09 [Note] WSREP: applier thread exiting (code:5)
120213 22:15:09 [Note] WSREP: applier thread exiting (code:5)
120213 22:15:09 [Note] WSREP: applier thread exiting (code:5)
120213 22:15:11 [Note] WSREP: rollbacker thread exiting

Contrast this with a valid gcomm address (which works as expected, although it does not remove the node from the cluster):

mysql> set global wsrep_cluster_address='gcomm://12.23.34.45:1260';
Query OK, 0 rows affected (3.51 sec)

Thanks
Patrick

Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

Trying with lp:codership-mysql (wsrep-5.5), I cannot reproduce this issue. Tried with:

SET GLOBAL wsrep_cluster_address='gcomm://';

and then joining node back with:

SET GLOBAL wsrep_cluster_address='gcomm://127.0.0.1:4567';

and it works both ways with current wsrep-5.5 trunk version.
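
To confirm the round trip, a quick check sequence like the following can be used (a sketch; it relies only on standard wsrep status variables):

mysql> SHOW STATUS LIKE 'wsrep_ready';         -- OFF while disconnected
mysql> SET GLOBAL wsrep_cluster_address='gcomm://127.0.0.1:4567';
mysql> SHOW STATUS LIKE 'wsrep_ready';         -- ON again once the node has rejoined and synced
mysql> SHOW STATUS LIKE 'wsrep_cluster_size';  -- back to the full member count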

However, the command:

SET GLOBAL wsrep_cluster_address='';

...is rejected.

Revision history for this message
Alex Yurchenko (ayurchen) wrote :

Ok, that means the behaviour is different from 5.1, where

SET GLOBAL wsrep_cluster_address='';

passes and disconnects from cluster.

Changed in codership-mysql:
status: New → In Progress
assignee: nobody → Seppo Jaakola (seppo-jaakola)
milestone: none → 5.5.20-23.4
importance: Undecided → Low
Revision history for this message
Seppo Jaakola (seppo-jaakola) wrote :

A fix to accept an empty wsrep_cluster_address has been pushed to the codership-mysql wsrep-5.5 branch: http://bazaar.launchpad.net/~codership/codership-mysql/wsrep-5.5/revision/3708
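
With that fix, both disconnect forms should be accepted (a sketch of the expected statements, not verified output):

mysql> SET GLOBAL wsrep_cluster_address='gcomm://';  -- worked before the fix as well
mysql> SET GLOBAL wsrep_cluster_address='';          -- now also accepted, matching 5.1 behaviour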

Changed in codership-mysql:
status: In Progress → Fix Committed
Changed in codership-mysql:
status: Fix Committed → Fix Released
Revision history for this message
Patrick Zoblisein (patrickzoblisein) wrote :

Hi - I would like to provide some more information related to the behavior originally described here.

I've been doing some more testing of the `SET GLOBAL wsrep_cluster_address='<address>';` functionality, and I'm still able to hang my node - regardless of whether I'm changing the cluster address to point at another node in the cluster or removing the node from the cluster via gcomm://.

Performing the dynamic change in my lab environment works with no issues, as Seppo notes above. The problem only appears once I make a new connection to the node after the SET GLOBAL wsrep_cluster_address. Here is a lot of output showing no connections (other than my working one) while I bounce back and forth between cluster addresses:

[root@lab1 pxc]# mysql -u root -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 4
Server version: 5.5.20-beta23.4-log Percona XtraDB Cluster (GPL) 5.5.20-beta23.4, Revision 3724, wsrep_23.4.r3724

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> show processlist;
+----+-----------------+-----------+------+---------+------+------------------------+------------------+-----------+---------------+-----------+
| Id | User            | Host      | db   | Command | Time | State                  | Info             | Rows_sent | Rows_examined | Rows_read |
+----+-----------------+-----------+------+---------+------+------------------------+------------------+-----------+---------------+-----------+
|  1 | system user     |           | NULL | Sleep   |   13 | wsrep aborter idle     | NULL             |         0 |             0 |         1 |
|  2 | system user     |           | NULL | Sleep   |   13 | NULL                   | NULL             |         0 |             0 |         1 |
|  3 | event_scheduler | localhost | NULL | Daemon  |   13 | Waiting on empty queue | NULL             |         0 |             0 |         1 |
|  4 | root            | localhost | NULL | Query   |    0 | sleeping               | show processlist |         0 |             0 |         1 |
+----+-----------------+-----------+------+---------+------+------------------------+------------------+-----------+---------------+-----------+
4 rows in set (0.00 sec)

mysql> show variables like 'wsrep_cluster_address';
+-----------------------+----------------------------+
| Variable_name         | Value                      |
+-----------------------+----------------------------+
| wsrep_cluster_address | gcomm://1.2.3.4:1260       |
+-----------------------+----------------------------+
1 row in set (0.00 sec)

mysql> show processlist;
+----+-----------------+-----------+------+---------+------+------------------------+------------------+-----------+---------------+-----------+
| Id | User            | Host      | db   | Command | Time | State                  | Info             | Rows_sent | Rows_examined | Rows_read |
+----+-----------------+-----------+------+---------+------+------------------------+------------------+-----------+---------------+-----------+
|  1 | system user     |           | NULL | Sleep   |   38 | wsrep aborter idle     | NULL             |         0 |             0 |         1 |
|  2 | system user     | ...

Revision history for this message
Henrik Ingo (hingo) wrote :

Patrick: I once saw a connection hang in the login state like yours, although it was not related to changing the cluster address (that I know of). This was in a lab setup much like the one you are running, but with a pre-0.8 Galera OEM vendor build. The solution then was to use --skip-name-resolve.

Looking at this bug now, there's probably something deeper going on than just name resolution; it looks buggy. Anyway, just wanted to share.
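
For reference, that option is normally set in the server option file; a minimal sketch (note that with it enabled, account Host values must be IP addresses rather than hostnames):

# my.cnf (illustrative)
[mysqld]
skip-name-resolve    # skip reverse-DNS lookups when authenticating clients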

Revision history for this message
Alex Yurchenko (ayurchen) wrote :

Hi Patrick,

I'm not sure what you're trying to achieve there, changing wsrep_cluster_address randomly without regard to node state. Changing wsrep_cluster_address is intended for rational use; it is a human-driven event, not a random disaster, and from your report it looks like you were testing cluster resilience to human intervention. :)

Now, the wsrep_cluster_address value in itself is of little interest to us; it is just a string. Far more interesting is what changing it causes in Galera. The wsrep status variables and the error log would tell a lot about what is _really_ going on. You should always check cluster and node state before giving it an administration command.
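
A minimal pre-flight check along those lines, using the standard wsrep status variables, might be:

mysql> SHOW STATUS LIKE 'wsrep_connected';            -- expect 'ON'
mysql> SHOW STATUS LIKE 'wsrep_cluster_status';       -- expect 'Primary'
mysql> SHOW STATUS LIKE 'wsrep_local_state_comment';  -- expect 'Synced' before any address change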

I'm not denying that there could be a bug; however, 2.1 will see a lot of changes in this department, so I'd suggest returning to this after the 2.1 release.

Revision history for this message
Patrick Zoblisein (patrickzoblisein) wrote :

Hi Alex,

I was randomly changing the address to show that there could be an issue with running the command. I had noticed it once earlier, back in February, reported it as a bug, and was told it wasn't reproducible.

But on the occasions that I used it to remove a node from a cluster during my testing, it was still occasionally hosing my node.

The random changes in the latest test case were simply to show that, yes, the command works, but only as long as you don't get a hung connection after doing it.

If the command isn't to be used willy-nilly without a thought to cluster status, then maybe it should not be allowed to be dynamic? Only settable via the cnf file and a node restart, as sketched below?
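
For illustration, the static equivalent would be something along these lines in the cnf file, applied only at node startup (the hostname and port here are hypothetical):

# my.cnf (illustrative)
[mysqld]
wsrep_cluster_address=gcomm://db003:4307    # fixed at startup; no dynamic changes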

I will revisit the behavior once 2.1 is made available.

Thanks
Patrick

Revision history for this message
Patrick Zoblisein (patrickzoblisein) wrote :

Hi - I have found some time to revisit this using:

mysql> select @@version;
+-----------------+
| @@version       |
+-----------------+
| 5.5.24-23.6-log |
+-----------------+

Dynamically changing the gcomm value via set global wsrep_cluster_address='gcomm://'; no longer hangs; it completes as expected.

The only remaining issue (and I don't know if you want a new bug report or if this comment will suffice) is that after dynamically changing the wsrep_cluster_address, the very next connection to the node hangs as an "unauthenticated user". The only way to clear this is to shut down the node via kill -9.

The node is also using --skip-name-resolve, as suggested by Henrik above.

[zoblisei@db001 ~]$ sudo mysql -u dbadm -p -h127.0.0.1 --port=3307
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 373809
Server version: 5.5.24-23.6-log Percona XtraDB Cluster (GPL) 5.5.24-23.6, Revision 342, wsrep_23.6.r342

Copyright (c) 2000, 2011, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show processlist;
+--------+-------------+-----------------+------+---------+--------+--------------------+------------------+-----------+---------------+-----------+
| Id     | User        | Host            | db   | Command | Time   | State              | Info             | Rows_sent | Rows_examined | Rows_read |
+--------+-------------+-----------------+------+---------+--------+--------------------+------------------+-----------+---------------+-----------+
|      1 | system user |                 | NULL | Sleep   | 926519 | wsrep aborter idle | NULL             |         0 |             0 |         1 |
|      2 | system user |                 | NULL | Sleep   | 230573 | committed 288      | NULL             |         0 |             0 |         1 |
| 373809 | dbadm       | 127.0.0.1:44454 | NULL | Query   |      0 | sleeping           | show processlist |         0 |             0 |         1 |
+--------+-------------+-----------------+------+---------+--------+--------------------+------------------+-----------+---------------+-----------+
3 rows in set (0.00 sec)

mysql> show variables like 'wsrep%add%';
+-----------------------------+----------------------------------------------------+
| Variable_name               | Value                                              |
+-----------------------------+----------------------------------------------------+
| wsrep_cluster_address       | gcomm://db003:4307                                 |
| wsrep_node_address          |                                                    |
| wsrep_node_incoming_address | 1.2.3.4:3307                                       |
| wsrep_sst_receive_address   | db001:5307                                         |
+-----------------------------+----------------------------------------------------+
4 rows in set (0.00 sec)

mysql> set global wsrep_cluster_address='gcomm://';
Query OK, 0 rows affected (2.01 sec)

mysql> show variables like 'wsrep_cluster_address';
+-----------------------+-----...

Revision history for this message
Alex Yurchenko (ayurchen) wrote :

Confirmed. Submitted lp:1022250.

Changed in percona-xtradb-cluster:
status: New → Fix Released
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1208
