WSREP_SST: [ERROR] with low data rate between DONOR and JOINER

Bug #1337919 reported by Michel Tougas
Affects: MySQL patches by Codership
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

Galera version: 5.5.37
Xtrabackup version: 1.5.1

In a three-node Galera cluster, one node goes down and is restarted.
The bandwidth between the JOINER (node 1) and the DONOR (node 2) is low (less than 10 MB/sec).
The other node (node 3) is SYNCED, and there is high bandwidth between nodes 2 and 3.
DML traffic is running on nodes 2 and 3 while node 1 tries to rejoin the cluster.
The SST provider is xtrabackup-v2.

When the bandwidth between the JOINER (node 1) and the DONOR (node 2) is higher, there is no issue.

The SST fails, and the mysql daemon on node 1 then stops.
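
For reproduction, the low-bandwidth condition can be simulated by shaping the donor's outbound interface with Linux traffic control (a sketch only; eth0 is an assumption, and 80 mbit is roughly 10 MB/sec):

# Assumes replication traffic leaves the DONOR via eth0; 80 mbit ~= 10 MB/sec
tc qdisc add dev eth0 root tbf rate 80mbit burst 256kb latency 400ms
# Run the join attempt, then remove the shaping:
tc qdisc del dev eth0 root
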
The JOINER logs show:

WSREP_SST: [INFO] Evaluating nc -dl 4444 | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20140626 10:09:56.988)
140626 10:22:46 [Note] WSREP: Created page /var/broadworks/persistent/gcache.page.000000 of size 134217728 bytes
WSREP_SST: [ERROR] xtrabackup process ended without creating '/var/broadworks/persistent//xtrabackup_galera_info' (20140626 10:25:59.236)
WSREP_SST: [INFO] Contents of datadir (20140626 10:25:59.238)
WSREP_SST: [INFO] -rw-rw---- 1 myadmin bwadmin 191 Jun 26 10:10 /var/broadworks/persistent//backup-my.cnf
-rw-rw---- 1 myadmin bwadmin 134219048 Jun 26 10:22 /var/broadworks/persistent//galera.cache
-rw------- 1 myadmin bwadmin 134217728 Jun 26 10:25 /var/broadworks/persistent//gcache.page.000000
-rw-rw---- 1 myadmin bwadmin 109 Jun 26 10:09 /var/broadworks/persistent//grastate.dat
-rw-rw---- 1 myadmin bwadmin 104857600 Jun 26 10:10 /var/broadworks/persistent//ibdata1
-rw-rw---- 1 myadmin bwadmin 60468 Jun 26 09:39 /var/broadworks/persistent//innobackup.prepare.log
-rw-rw---- 1 myadmin bwadmin 0 Jun 26 10:09 /var/broadworks/persistent//sst_in_progress

/var/broadworks/persistent//bwas:
total 25325756
-rw-rw---- 1 myadmin bwadmin 71303168 Jun 26 10:14 as_user_test2.ibd
-rw-rw---- 1 myadmin bwadmin 71303168 Jun 26 10:14 as_user_test3.ibd
-rw-rw---- 1 myadmin bwadmin 50331648 Jun 26 10:14 as_user_test.ibd
-rw-rw---- 1 myadmin bwadmin 9806282752 Jun 26 10:19 users_test2.ibd
-rw-rw---- 1 myadmin bwadmin 9667870720 Jun 26 10:25 users_test3.ibd
-rw-rw---- 1 myadmin bwadmin 6241124352 Jun 26 10:14 users_test.ibd (20140626 10:25:59.241)
WSREP_SST: [ERROR] Cleanup after exit with status:32 (20140626 10:25:59.242)
WSREP_SST: [INFO] Removing the sst_in_progress file (20140626 10:25:59.243)
140626 10:25:59 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'joiner' --address 'lin16-hs22' --auth '' --datadir '/var/broadworks/persistent/' --defaults-file '/usr/local/mariadb/mariadb_base/etc/my.cnf' --parent '22404': 32 (Broken pipe)
140626 10:25:59 [ERROR] WSREP: Failed to read uuid:seqno from joiner script.
140626 10:25:59 [ERROR] WSREP: SST failed: 32 (Broken pipe)
140626 10:25:59 [ERROR] Aborting

140626 10:25:59 [Warning] WSREP: 0.0 (sun92-x4170): State transfer to 1.0 (lin16-hs22) failed: -22 (Invalid argument)
140626 10:25:59 [ERROR] WSREP: gcs/src/gcs_group.c:gcs_group_handle_join_msg():723: Will never receive state. Need to abort.

The DONOR logs show:

WSREP_SST: [INFO] Sleeping before data transfer for SST (20140626 10:09:50.790)
WSREP_SST: [INFO] Streaming the backup to joiner at lin16-hs22 4444 (20140626 10:10:00.793)
WSREP_SST: [INFO] Evaluating innobackupex --defaults-file=/usr/local/mariadb/mariadb_base/etc/my.cnf --no-version-check $INNOEXTRA --galera-info --stream=$sfmt ${TMPDIR} 2>${DATA}/innobackup.backup.log | nc lin16-hs22 4444; RC=( ${PIPESTATUS[@]} ) (20140626 10:10:00.795)
140626 10:25:23 [Note] WSREP: Provider paused at 2b928f66-fc92-11e3-9cee-122a6cbbbcf8:2459656 (356321)
140626 10:25:53 [Note] WSREP: resuming provider at 356321
140626 10:25:53 [Note] WSREP: Provider resumed.
WSREP_SST: [ERROR] innobackupex finished with error: 9. Check /var/broadworks/persistent//innobackup.backup.log (20140626 10:25:54.120)
WSREP_SST: [ERROR] Cleanup after exit with status:22 (20140626 10:25:54.122)
140626 10:25:54 [ERROR] WSREP: Failed to read from: wsrep_sst_xtrabackup-v2 --role 'donor' --address 'lin16-hs22:4444/xtrabackup_sst' --auth '(null)' --socket '/var/broadworks/run/mysql.sock' --datadir '/var/broadworks/persistent/' --defaults-file '/usr/local/mariadb/mariadb_base/etc/my.cnf' --gtid '2b928f66-fc92-11e3-9cee-122a6cbbbcf8:2453164'
140626 10:25:54 [ERROR] WSREP: Process completed with error: wsrep_sst_xtrabackup-v2 --role 'donor' --address 'lin16-hs22:4444/xtrabackup_sst' --auth '(null)' --socket '/var/broadworks/run/mysql.sock' --datadir '/var/broadworks/persistent/' --defaults-file '/usr/local/mariadb/mariadb_base/etc/my.cnf' --gtid '2b928f66-fc92-11e3-9cee-122a6cbbbcf8:2453164': 22 (Invalid argument)
140626 10:25:54 [ERROR] WSREP: Command did not run: wsrep_sst_xtrabackup-v2 --role 'donor' --address 'lin16-hs22:4444/xtrabackup_sst' --auth '(null)' --socket '/var/broadworks/run/mysql.sock' --datadir '/var/broadworks/persistent/' --defaults-file '/usr/local/mariadb/mariadb_base/etc/my.cnf' --gtid '2b928f66-fc92-11e3-9cee-122a6cbbbcf8:2453164'
140626 10:25:54 [Warning] WSREP: 0.0 (sun92-x4170): State transfer to 1.0 (lin16-hs22) failed: -22 (Invalid argument)
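
Since the report hinges on the donor-to-joiner rate, the raw link throughput can be sanity-checked with the same nc transport the SST script uses (hostname and port taken from the logs above; dd prints the achieved rate when it finishes):

# On the JOINER (lin16-hs22): listen and discard, mirroring "nc -dl 4444"
nc -dl 4444 > /dev/null

# On the DONOR: push 1 GiB through the same path; dd's summary shows the MB/s
dd if=/dev/zero bs=1M count=1024 | nc lin16-hs22 4444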

Alex Yurchenko (ayurchen) wrote:

So xtrabackup failed on the donor side. It must have left a log file in the MySQL data dir. Could you post it?
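
Per the donor-side messages above, innobackupex itself names the file to check:

# Path as printed in the donor's WSREP_SST error message
less /var/broadworks/persistent/innobackup.backup.log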

Michel Tougas (mtougas) wrote:

I no longer have the logs from the original test, but I just ran it again; here it is. I am posting only the end of the log file, because there are a lot of "log scanned up to ..." lines. So I guess there is a timeout somewhere? Can it be adjusted depending on the bandwidth between the nodes in the cluster?

...
>> log scanned up to (31106649839)
>> log scanned up to (31106649839)
>> log scanned up to (31106649839)
>> log scanned up to (31106649839)
>> log scanned up to (31106649839)
>> log scanned up to (31106649839)
>> log scanned up to (31106649839)
DBD::mysql::db selectall_hashref failed: Lock wait timeout exceeded; try restarting transaction at /usr/local/mariadb/5.5.37//bin/innobackupex line 2996.
innobackupex: Error:
Error executing 'SHOW STATUS': DBD::mysql::db selectall_hashref failed: Lock wait timeout exceeded; try restarting transaction at /usr/local/mariadb/5.5.37//bin/innobackupex line 2996.
140707 15:37:25 innobackupex: Waiting for ibbackup (pid=27803) to finish

The JOINER logs show that it failed after about 14 minutes 30 seconds.

WSREP_SST: [INFO] Evaluating nc -dl 4444 | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20140707 15:23:06.252)
140707 15:35:04 [Note] WSREP: Created page /var/broadworks/persistent/gcache.page.000000 of size 134217728 bytes
WSREP_SST: [ERROR] xtrabackup process ended without creating '/var/broadworks/persistent//xtrabackup_galera_info' (20140707 15:37:31.607)
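
On the timeout question: "Lock wait timeout exceeded; try restarting transaction" is MySQL's generic lock-wait error; for InnoDB row locks it is governed by innodb_lock_wait_timeout (50 s by default), and for metadata locks by lock_wait_timeout. A possible mitigation sketch on the DONOR, assuming the InnoDB variable is the one being hit (not a verified fix; socket path taken from the donor logs above):

# Assumption: the SHOW STATUS failure inside innobackupex hit
# innodb_lock_wait_timeout (default 50 s) while the provider was paused.
# Give the donor more headroom for the duration of a slow SST:
mysql --socket=/var/broadworks/run/mysql.sock \
      -e "SET GLOBAL innodb_lock_wait_timeout = 600;"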
