3-node Debian 5.6 cluster crashes/freezes

Bug #1301616 reported by chris fortescue on 2014-04-02
This bug affects 1 person
Affects: Percona XtraDB Cluster (status tracked in 5.6)
  5.5 - Importance: Undecided - Assigned to: Unassigned
  5.6 - Importance: Undecided - Assigned to: Unassigned

Bug Description

Hiya,
3 nodes; new setup; latest packages

percona-toolkit 2.2.7
percona-xtrabackup 2.1.8-733-1.wheezy
percona-xtradb-cluster-client-5.6 5.6.15-25.5-759.wheezy
percona-xtradb-cluster-common-5.6 5.6.15-25.5-759.wheezy
percona-xtradb-cluster-galera-3.x 213.wheezy
percona-xtradb-cluster-garbd-3.x 213.wheezy
percona-xtradb-cluster-server-5.6 5.6.15-25.5-759.wheezy
percona-xtradb-cluster-test-5.6 5.6.15-25.5-759.wheezy

node1 - started with /etc/init.d/mysql bootstrap-pxc (full sequence sketched below)
node2 - mysqld crashed completely and couldn't be restarted
node3 - mysqld likewise crashed and couldn't be restarted
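For reference, the startup sequence used here is roughly the following (a sketch; the init script actions match the Debian packages listed above):

# On node1 only: bootstrap a new cluster
/etc/init.d/mysql bootstrap-pxc

# On node2 and node3: normal start; they join via wsrep_cluster_address
/etc/init.d/mysql start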

I've included the log from node3.
I loaded 1 million rows of data into each node (1..3) concurrently without a problem, so the cluster appeared to be working. We also restored an 80 GB MySQL dump, which appeared to work, but now I wonder...

Once I noticed it had hung (I'm testing it out), I resorted to kill -9 on the mysqld process on node1 (the one started with bootstrap-pxc), since node2 and node3 were both down and couldn't be restarted, and I couldn't connect to node1 with the mysql CLI either. Once I restarted node1, and then node2/3, everything seems OK, but...
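A quick way to confirm that all three nodes have rejoined after such a restart is to check the wsrep status variables on each node (a sketch, assuming the mysql client can connect locally):

mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';"
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"
# Expect wsrep_cluster_size = 3 and wsrep_local_state_comment = Synced on every node.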

I hope you can fix this because it is a bad crash and doesn't instill confidence. Let me know if there is anything else I can provide for forensics.

Below are the cnf files from the bootstrap node and node3. Strangely, no log was emitted on the bootstrap node (node1) or node2, but node3 shows a bad exception being thrown (apparently).

-Chris

>>>> Node1.cnf <<<<

[mysqld]
datadir=/var/lib/mysql
user=mysql
# Path to Galera library
wsrep_provider=/usr/lib/libgalera_smm.so
# Cluster connection URL contains the IPs of node#1, node#2 and node#3

wsrep_cluster_address=gcomm://10.66.2.51,10.66.2.52,10.66.2.53
#wsrep_cluster_address=gcomm://

# In order for Galera to work correctly binlog format should be ROW
binlog_format=ROW
# MyISAM storage engine has only experimental support
default_storage_engine=InnoDB
# This changes how InnoDB autoincrement locks are managed and is a requirement for Galera
innodb_autoinc_lock_mode=2
# Node #1 address
wsrep_node_address=10.66.2.51
# SST method
wsrep_sst_method=xtrabackup-v2
# Cluster name
wsrep_cluster_name=my_clf_cluster
# Authentication for SST method
wsrep_sst_auth="sstuser:s3cret"

>>>> Node3.cnf <<<<

[mysqld]
datadir=/var/lib/mysql
user=mysql
# Path to Galera library
wsrep_provider=/usr/lib/libgalera_smm.so
# Cluster connection URL contains the IPs of node#1, node#2 and node#3
wsrep_cluster_address=gcomm://10.66.2.51,10.66.2.52,10.66.2.53
# In order for Galera to work correctly binlog format should be ROW
binlog_format=ROW
# MyISAM storage engine has only experimental support
default_storage_engine=InnoDB
# This changes how InnoDB autoincrement locks are managed and is a requirement for Galera
innodb_autoinc_lock_mode=2
# Node #3 address
wsrep_node_address=10.66.2.53
# SST method
wsrep_sst_method=xtrabackup-v2
# Cluster name
wsrep_cluster_name=my_clf_cluster
# Authentication for SST method
wsrep_sst_auth="sstuser:s3cret"
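Both configs rely on wsrep_sst_auth for the xtrabackup-v2 SST; a minimal setup for that account on the donor node, along the lines of the standard PXC 5.6 instructions, looks roughly like this (the exact privilege list may vary between versions):

mysql -u root -p -e "CREATE USER 'sstuser'@'localhost' IDENTIFIED BY 's3cret';"
mysql -u root -p -e "GRANT RELOAD, LOCK TABLES, REPLICATION CLIENT ON *.* TO 'sstuser'@'localhost';"
mysql -u root -p -e "FLUSH PRIVILEGES;"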

chris fortescue (cfortescu) wrote :

I doubled the memory on the cluster to 2GB per node and it happened again. This time, it brought down all 3 nodes.

Alex Yurchenko (ayurchen) wrote :

"bad prefix" - looks like a gcache corruption fixed in Galera 3.5

chris fortescue (cfortescu) wrote :

OK, you say it's 'fixed', and I see a standalone package, galera-25.3.5-amd64.deb, that explicitly conflicts with the cluster packages (below). I did an apt-get update, but there was no update. Isn't this fix critical to anyone running a cluster? I must be missing something and humbly ask what I'm missing.

Here's what I have as of 4/23/2014, 8:51am PDT:
ii percona-toolkit 2.2.7 all Advanced MySQL and system command-line tools
ii percona-xtrabackup 2.1.8-733-1.wheezy amd64 Open source backup tool for InnoDB and XtraDB
ii percona-xtradb-cluster-client-5.6 5.6.15-25.5-759.wheezy amd64 Percona Server database client binaries
ii percona-xtradb-cluster-common-5.6 5.6.15-25.5-759.wheezy amd64 Percona Server database common files (e.g. /etc/mysql/my.cnf)
ii percona-xtradb-cluster-galera-3.x 213.wheezy amd64 Galera components of Percona XtraDB Cluster
ii percona-xtradb-cluster-garbd-3.x 213.wheezy amd64 Garbd components of Percona XtraDB Cluster
ii percona-xtradb-cluster-server-5.6 5.6.15-25.5-759.wheezy amd64 Percona Server database server binaries
ii percona-xtradb-cluster-test-5.6 5.6.15-25.5-759.wheezy amd64 Percona Server database test suite
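A quick way to see whether the configured repositories offer anything newer than the 213 build above is to check the candidate version, for example:

apt-cache policy percona-xtradb-cluster-galera-3.x
# "Candidate:" shows the newest version the configured repositories currently provide.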

@Chris,

The fix is committed (Fix Committed), but not yet released (Fix Released).

However, you can get it from TESTING:

http://www.percona.com/downloads/TESTING/Percona-XtraDB-Cluster-galera-56/galera-3.x/215/deb/

This has all the fixes of https://launchpad.net/percona-xtradb-cluster/+milestone/galera-3.5 in it.
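A rough install sketch for that TESTING build (the .deb filename below is a placeholder; use the actual name listed in the directory at the URL above):

# PACKAGE.deb is a placeholder for the actual file name in the TESTING directory
wget http://www.percona.com/downloads/TESTING/Percona-XtraDB-Cluster-galera-56/galera-3.x/215/deb/PACKAGE.deb
dpkg -i PACKAGE.deb
# Restart mysqld on one node at a time afterwards so the cluster keeps quorum.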

chris fortescue (cfortescu) wrote :

@raghavendra

That did the trick! Ran an 80G restore against a 3-node cluster.

Thanks a million!
