[ERROR] WSREP: FSM: no such a transition JOINED -> JOINED

Bug #1629566 reported by Jericho Rivera
6
This bug affects 1 person
Affects Status Importance Assigned to Milestone
MySQL patches by Codership
New
Undecided
Unassigned
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC
Status tracked in 5.6
5.6
Fix Released
Undecided
Unassigned

Bug Description

Stumbled upon this case when testing a similar customer test scenario but got a different result.

Crashed server after executing "SET GLOBAL WSREP_DESYNC=OFF;" when node was set to WSREP_DESYNC=ON state while SST was in progress.

2016-10-01 10:29:07 6621 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 104041)
2016-10-01 10:29:07 6621 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2016-10-01 10:29:07 6621 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'donor' --address '10.0.3.222:4444/xtrabackup_sst//1' --socket '/var/lib/mysql/mysql.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --binlog '/data/logs/pxc/mysql-bin' --gtid '021cfd28-86c2-11e6-849b-fb0597300eec:104041''
2016-10-01 10:29:07 6621 [Note] WSREP: sst_donor_thread signaled with 0
WSREP_SST: [INFO] Logging all stderr of SST/Innobackupex to syslog (20161001 10:29:07.221)
2016-10-01 10:29:09 6621 [Note] WSREP: (33d39fca, 'tcp://0.0.0.0:4567') turning message relay requesting off
2016-10-01 10:29:18 6621 [Note] WSREP: 1.0 (pxc2): State transfer to 2.0 (pxc3) complete.
2016-10-01 10:29:18 6621 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 104041)
2016-10-01 10:29:18 6621 [Note] WSREP: SYNC message ignored as node 1.0 (pxc2) was re-transitioned to DONOR mode before it synced.
2016-10-01 10:29:25 6621 [Note] WSREP: 2.0 (pxc3): State transfer from 1.0 (pxc2) complete.
2016-10-01 10:29:25 6621 [Note] WSREP: Member 2.0 (pxc3) synced with group.
2016-10-01 10:29:39 6621 [Note] WSREP: 1.0 (pxc2): State transfer to 2.0 (pxc3) complete.
2016-10-01 10:29:39 6621 [ERROR] WSREP: FSM: no such a transition JOINED -> JOINED
14:29:39 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona XtraDB Cluster better by reporting any
bugs at https://bugs.launchpad.net/percona-xtradb-cluster

key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=2
max_threads=153
thread_count=3
connection_count=1
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 69134 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7fb514000990
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7fb54c6f1998 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x35)[0x907295]
/usr/sbin/mysqld(handle_fatal_signal+0x4b4)[0x66b074]
/lib64/libpthread.so.0(+0xf7e0)[0x7fb569ca17e0]
/lib64/libc.so.6(gsignal+0x35)[0x7fb567eda5e5]
/lib64/libc.so.6(abort+0x175)[0x7fb567edbdc5]
/usr/lib64/libgalera_smm.so(_ZN6galera3FSMINS_10Replicator5StateENS_13ReplicatorSMM10TransitionENS_10EmptyGuardENS3_11StateActionEE8shift_toES2_+0x212)[0x7fb54ca6f302]
/usr/lib64/libgalera_smm.so(_ZN6galera13ReplicatorSMM12process_joinEll+0x178)[0x7fb54ca6a478]
/usr/lib64/libgalera_smm.so(_ZN6galera15GcsActionSource8dispatchEPvRK10gcs_actionRb+0x21d)[0x7fb54ca4587d]
/usr/lib64/libgalera_smm.so(_ZN6galera15GcsActionSource7processEPvRb+0x63)[0x7fb54ca45ed3]
/usr/lib64/libgalera_smm.so(_ZN6galera13ReplicatorSMM10async_recvEPv+0x73)[0x7fb54ca6a8a3]
/usr/lib64/libgalera_smm.so(galera_recv+0x24)[0x7fb54ca7ae14]
/usr/sbin/mysqld[0x5a21f1]
/usr/sbin/mysqld(start_wsrep_THD+0x36e)[0x58b83e]
/lib64/libpthread.so.0(+0x7aa1)[0x7fb569c99aa1]
/lib64/libc.so.6(clone+0x6d)[0x7fb567f90aad]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0): is an invalid pointer
Connection ID (thread ID): 2
Status: NOT_KILLED

You may download the Percona XtraDB Cluster operations manual by visiting
http://www.percona.com/software/percona-xtradb-cluster/. You may find information
in the manual which will help you identify the cause of the crash.
161001 10:29:39 mysqld_safe Number of processes running now: 0
161001 10:29:39 mysqld_safe WSREP: not restarting wsrep node automatically
161001 10:29:39 mysqld_safe mysqld from pid file /var/lib/mysql/pxc2.pid ended

How to repeat:
- Start joiner node to SST from donor node.
- As soon as donor node goes to DONOR\DESYNCED (Shifting SYNCED -> DONOR/DESYNCED) state execute:
  select @@global.wsrep_desync; set @@global.wsrep_desync=ON; select @@global.wsrep_desync;
+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wsrep_desync | OFF |
+---------------+-------+
1 row in set (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

+---------------+-------+
| Variable_name | Value |
+---------------+-------+
| wsrep_desync | ON |
+---------------+-------+
1 row in set (0.00 sec)
- Wait until node goes to JOINED (Shifting DONOR/DESYNCED -> JOINED) state and state transfer completes, execute:
  set @@global.wsrep_desync=OFF;
PXC:root@localhost((none))> set global wsrep_desync=OFF;
Query OK, 0 rows affected (0.00 sec)

[root@pxc2 ~]# rpm -qa|grep Percona-XtraDB
Percona-XtraDB-Cluster-client-56-5.6.30-25.16.3.el6.x86_64
Percona-XtraDB-Cluster-server-56-5.6.30-25.16.3.el6.x86_64
Percona-XtraDB-Cluster-galera-3-3.16-3.rhel6.x86_64
Percona-XtraDB-Cluster-shared-56-5.6.30-25.16.3.el6.x86_64
Percona-XtraDB-Cluster-56-5.6.30-25.16.3.el6.x86_64

Revision history for this message
Przemek (pmalkowski) wrote :
Download full text (6.8 KiB)

I can confirm the donor crashes when wsrep_desync is set during SST session, and later is tried to be turned off.

[7cf5e9886947] {root} ((none)) > select @@version,@@version_comment\G
*************************** 1. row ***************************
        @@version: 5.6.30-76.3-56-log
@@version_comment: Percona XtraDB Cluster (GPL), Release rel76.3, Revision 64742ac, WSREP version 25.16, wsrep_25.16
1 row in set (0.00 sec)

First, donor is chosen to serve full SST, and during the SST was in progress, on donor the explicit desync was executed:

[7cf5e9886947] {root} ((none)) > set global wsrep_desync = ON;
Query OK, 0 rows affected (0.00 sec)

Log from donor:

2016-10-02 13:10:47 10465 [Note] WSREP: Member 1.0 (node3) requested state transfer from 'node1'. Selected 0.0 (node1)(SYNCED) as donor.
2016-10-02 13:10:47 10465 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 26066)
2016-10-02 13:10:47 10465 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2016-10-02 13:10:47 10465 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'donor' --address '172.17.0.3:4444/xtrabackup_sst//1' --socket '/var/run/mysqld/mysqld.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' --binlog 'mysql-binlog' --gtid '3e6fc59a-75a5-11e6-8d27-c6583ca1a844:26066''
2016-10-02 13:10:47 10465 [Note] WSREP: sst_donor_thread signaled with 0
WSREP_SST: [INFO] Streaming with xbstream (20161002 13:10:47.426)
WSREP_SST: [INFO] Using socat as streamer (20161002 13:10:47.428)
WSREP_SST: [INFO] Using /tmp/tmp.85qn89WXgV as xtrabackup temporary directory (20161002 13:10:47.441)
WSREP_SST: [INFO] Using /tmp/tmp.vOdDwTkzdD as innobackupex temporary directory (20161002 13:10:47.443)
WSREP_SST: [INFO] Streaming GTID file before SST (20161002 13:10:47.447)
WSREP_SST: [INFO] Evaluating xbstream -c ${INFO_FILE} | socat -u stdio TCP:172.17.0.3:4444; RC=( ${PIPESTATUS[@]} ) (20161002 13:10:47.448)
WSREP_SST: [INFO] Sleeping before data transfer for SST (20161002 13:10:47.455)
2016-10-02 13:10:49 10465 [Note] WSREP: (864f40f1, 'tcp://0.0.0.0:4567') turning message relay requesting off
WSREP_SST: [INFO] Streaming the backup to joiner at 172.17.0.3 4444 (20161002 13:10:57.457)
WSREP_SST: [INFO] Evaluating innobackupex --defaults-file=/etc/my.cnf --defaults-group=mysqld --no-version-check $tmpopts $INNOEXTRA --galera-info --stream=$sfmt $itmpdir 2>${DATA}/innobackup.backup.log | socat -u stdio TCP:172.17.0.3:4444; RC=( ${PIPESTATUS[@]} ) (20161002 13:10:57.460)
2016-10-02 13:11:08 10465 [Note] WSREP: 0.0 (node1): State transfer to 1.0 (node3) complete.
2016-10-02 13:11:08 10465 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 26066)
2016-10-02 13:11:08 10465 [Note] WSREP: SYNC message ignored as node 0.0 (node1) was re-transitioned to DONOR mode before it synced.
WSREP_SST: [INFO] Total time on donor: 0 seconds (20161002 13:11:08.717)
WSREP_SST: [INFO] Cleaning up temporary directories (20161002 13:11:08.719)
2016-10-02 13:11:43 10465 [Note] WSREP: 1.0 (node3): State transfer from 0.0 (node1) complete.
2016-10-02 13:11:43 10465 [Note] WSREP: Member 1.0 (node3) synced with group.

After that, donor state:

[7cf5e9886947] {ro...

Read more...

Revision history for this message
Przemek (pmalkowski) wrote :

Error logs from donor and joiner.

Revision history for this message
Przemek (pmalkowski) wrote :
Download full text (5.9 KiB)

This happens also on Galera cluster binaries from Codership:

[root@83878f054a4c /]# rpm -qa|grep -i mysql
perl-DBD-MySQL-4.013-3.el6.x86_64
mysql-wsrep-client-5.6-5.6.33-25.17.el6.x86_64
mysql-wsrep-5.6-5.6.33-25.17.el6.x86_64
mysql-wsrep-libs-compat-5.6-5.6.33-25.17.el6.x86_64
mysql-wsrep-server-5.6-5.6.33-25.17.el6.x86_64

mysql> select @@version,@@version_comment;
+-----------+-------------------------------------------+
| @@version | @@version_comment |
+-----------+-------------------------------------------+
| 5.6.33 | MySQL Community Server (GPL), wsrep_25.17 |
+-----------+-------------------------------------------+
1 row in set (0.00 sec)

mysql> set global wsrep_desync=0;
Query OK, 0 rows affected (0.01 sec)

mysql> show status like 'ws%';
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (111)
ERROR:
Can't connect to the server

2016-10-10 07:05:04 868 [Note] WSREP: Member 1.0 (galera1) requested state transfer from '*any*'. Selected 0.0 (83878f054a4c)(SYNCED) as donor.
2016-10-10 07:05:04 868 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 68)
2016-10-10 07:05:04 868 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
2016-10-10 07:05:04 868 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'donor' --address '172.17.0.4:4444/xtrabackup_sst//1' --socket '/var/run/mysqld/mysqld.sock' --datadir '/var/lib/mysql/' --defaults-file '/etc/my.cnf' --defaults-group-suffix '' '' --gtid 'c8d6c713-89c1-11e6-af94-2b7d3c8cd95f:68''
2016-10-10 07:05:04 868 [Note] WSREP: sst_donor_thread signaled with 0
WSREP_SST: [INFO] Streaming with xbstream (20161010 07:05:04.234)
WSREP_SST: [INFO] Using socat as streamer (20161010 07:05:04.236)
WSREP_SST: [INFO] Using /tmp/tmp.XpXn6B67ok as xtrabackup temporary directory (20161010 07:05:04.250)
WSREP_SST: [INFO] Using /tmp/tmp.yY5HBbDHLi as innobackupex temporary directory (20161010 07:05:04.253)
WSREP_SST: [INFO] Streaming GTID file before SST (20161010 07:05:04.257)
WSREP_SST: [INFO] Evaluating xbstream -c ${INFO_FILE} | socat -u stdio TCP:172.17.0.4:4444; RC=( ${PIPESTATUS[@]} ) (20161010 07:05:04.258)
WSREP_SST: [INFO] Sleeping before data transfer for SST (20161010 07:05:04.263)
WSREP_SST: [INFO] Streaming the backup to joiner at 172.17.0.4 4444 (20161010 07:05:14.266)
WSREP_SST: [INFO] Evaluating innobackupex --defaults-file=/etc/my.cnf --defaults-group=mysqld --no-version-check $tmpopts $INNOEXTRA --galera-info --stream=$sfmt $itmpdir 2>${DATA}/innobackup.backup.log | socat -u stdio TCP:172.17.0.4:4444; RC=( ${PIPESTATUS[@]} ) (20161010 07:05:14.267)
2016-10-10 07:05:18 868 [Note] WSREP: Provider paused at c8d6c713-89c1-11e6-af94-2b7d3c8cd95f:68 (95)
2016-10-10 07:05:19 868 [Note] WSREP: resuming provider at 95
2016-10-10 07:05:19 868 [Note] WSREP: Provider resumed.
2016-10-10 07:05:19 868 [Note] WSREP: 0.0 (83878f054a4c): State transfer to 1.0 (galera1) complete.
2016-10-10 07:05:19 868 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 68)
2016-10-10 07:05:19 868 [Warning] WSREP: SYNC message...

Read more...

Revision history for this message
Krunal Bauskar (krunal-bauskar) wrote :

This issue got introduced while our desync_count logic was replaced by upstream desync_count logic which caused inconsistency in PXC.

It has been solved in 5.6.32.

---------------

I verified that 5.6.30 we hit the issue and also that 5.6.32 doesn't hit the issue.

Related checkin:

commit 4579c434498eaaff1feb40ece7b8dfad01b31915
Merge: b0a8801 c51109c
Author: Krunal Bauskar <email address hidden>
Date: Mon Aug 15 12:22:59 2016 +0530

    Merge pull request #80 from natsys/3.x-pxc-665

    - #PXC-665: SST donor node has status "JOINED" after successful resync

commit c51109c44cb39c94cfa9644334a810d3aeee66d2
Author: Julius Goryavsky <email address hidden>
Date: Fri Aug 12 19:34:08 2016 -0400

    - #PXC-665: SST donor node has status "JOINED" after successful resync

    Operations that use the global lock (such as FLUSH TABLES FOR READ LOCK) and DDL operations that are protected via the RSU can be performed during SST. At the same time, they come in
to interference with the SST process, which leads to a variety of errors, including hang or crash of the donor node after SST donation. Alternatively, the SST donor node stays in the "JO
INED" status, even after the completion of the SST.

Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1930

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Bug attachments

Remote bug watches

Bug watches keep track of this bug in other bug trackers.