WSREP: unserialize error invalid flags 2: 71 (Protocol error)

Bug #1264225 reported by ceejayoz
This bug affects 1 person

Affects: Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC)
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

I'm getting what appears to be a repeat of https://bugs.launchpad.net/galera/+bug/1153727 in the latest XtraDB Cluster 5.5.34-25.9.

I have a four node Percona XtraDB cluster. Each night, one node shuts down for a backup and subsequently rejoins the cluster. This has run fine for quite some time now.

Last night, the backup node refused to rejoin the cluster. The other three nodes are humming along nicely, but every attempt by the failed node to rejoin is unsuccessful.

Luckily, we provision all of our nodes using Ansible, so we've got a repeatable recipe for blowing away a failed node and replacing it with a brand new, from scratch database node.

UNluckily, new nodes also fail. No changes to the Ansible scripts have been made recently and we've successfully launched instances since. This is all running on a two-AZ VPC in Amazon AWS, if that makes any difference.

The new server's MySQL log says:

    131226 04:44:50 mysqld_safe Starting mysqld daemon with databases from /data/mysql
    131226 04:44:50 mysqld_safe Skipping wsrep-recover for empty datadir: /data/mysql
    131226 04:44:50 mysqld_safe Assigning 00000000-0000-0000-0000-000000000000:-1 to wsrep_start_position
    131226 4:44:50 [Note] WSREP: wsrep_start_position var submitted: '00000000-0000-0000-0000-000000000000:-1'
    131226 4:44:50 [Note] WSREP: Read nil XID from storage engines, skipping position init
    131226 4:44:50 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/libgalera_smm.so'
    131226 4:44:50 [Note] WSREP: wsrep_load(): Galera 3.2(r170) by Codership Oy <email address hidden> loaded successfully.
    131226 4:44:50 [Note] WSREP: CRC-32C: using hardware acceleration.
    131226 4:44:50 [Warning] WSREP: Could not open saved state file for reading: /data/mysql//grastate.dat
    131226 4:44:50 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1
    131226 4:44:50 [Note] WSREP: Passing config to GCS: base_host = 10.0.1.226; base_port = 4567; cert.log_conflicts = no; gcache.dir = /data/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /data/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; repl.causal_read_timeout = PT30S; repl.commit_order = 3; repl.key_format = FLAT8; repl.proto_max = 5
    131226 4:44:50 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
    131226 4:44:50 [Note] WSREP: wsrep_sst_grab()
    131226 4:44:50 [Note] WSREP: Start replication
    131226 4:44:50 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
    131226 4:44:50 [Note] WSREP: protonet asio version 0
    131226 4:44:50 [Note] WSREP: Using CRC-32C (optimized) for message checksums.
    131226 4:44:50 [Note] WSREP: backend: asio
    131226 4:44:50 [Note] WSREP: GMCast version 0
    131226 4:44:50 [Note] WSREP: (7497477a-6de8-11e3-9a71-7af4d47d85be, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
    131226 4:44:50 [Note] WSREP: (7497477a-6de8-11e3-9a71-7af4d47d85be, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
    131226 4:44:50 [Note] WSREP: EVS version 0
    131226 4:44:50 [Note] WSREP: PC version 0
    131226 4:44:50 [Note] WSREP: gcomm: connecting to group 'my_wsrep_cluster', peer '10.0.2.180:,10.0.1.226:,10.0.1.21:,10.0.2.236:'
    131226 4:44:50 [Warning] WSREP: (7497477a-6de8-11e3-9a71-7af4d47d85be, 'tcp://0.0.0.0:4567') address 'tcp://10.0.1.226:4567' points to own listening address, blacklisting
    131226 4:44:53 [Warning] WSREP: no nodes coming from prim view, prim not possible
    131226 4:44:53 [Note] WSREP: view(view_id(NON_PRIM,7497477a-6de8-11e3-9a71-7af4d47d85be,1) memb {
     7497477a-6de8-11e3-9a71-7af4d47d85be,0
    } joined {
    } left {
    } partitioned {
    })
    131226 4:44:54 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.5049S), skipping check
    131226 4:45:23 [Note] WSREP: view((empty))
    131226 4:45:23 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
      at gcomm/src/pc.cpp:connect():141
    131226 4:45:23 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():196: Failed to open backend connection: -110 (Connection timed out)
    131226 4:45:23 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1291: Failed to open channel 'my_wsrep_cluster' at 'gcomm://10.0.2.180,10.0.1.226,10.0.1.21,10.0.2.236': -110 (Connection timed out)
    131226 4:45:23 [ERROR] WSREP: gcs connect failed: Connection timed out
    131226 4:45:23 [ERROR] WSREP: wsrep::connect() failed: 7
    131226 4:45:23 [ERROR] Aborting

    131226 4:45:23 [Note] WSREP: Service disconnected.
    131226 4:45:24 [Note] WSREP: Some threads may fail to exit.
    131226 4:45:24 [Note] /usr/sbin/mysqld: Shutdown complete

    131226 04:45:24 mysqld_safe mysqld from pid file /data/mysql/mysql.pid ended

The existing nodes in the cluster spit the following out whenever the new node attempts to join:

    131226 4:45:23 [Warning] WSREP: unserialize error invalid flags 2: 71 (Protocol error)
      at gcomm/src/gcomm/datagram.hpp:unserialize():101

affects: percona-server → percona-xtradb-cluster
Revision history for this message
Alex Yurchenko (ayurchen) wrote :

This is most likely a result of mixing Galera 2.x and 3.x. To be used in a mixed cluster, 3.x nodes must have socket.checksum set to 1.
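In practice that means setting the option via wsrep_provider_options on the 3.x nodes. A minimal my.cnf sketch, assuming the provider path from the log above; only socket.checksum=1 comes from the comment, the rest is illustrative:

```ini
# my.cnf on the Galera 3.x nodes (sketch, not a full config).
# socket.checksum=1 selects the CRC32 checksum that 2.x nodes expect;
# Galera 3.x defaults to a format 2.x cannot unserialize.
[mysqld]
wsrep_provider=/usr/lib64/libgalera_smm.so
wsrep_provider_options="socket.checksum=1"
```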

ceejayoz (ceejayoz) wrote :

Looks like you're right - same Percona XtraDB Cluster version, but the new nodes are getting the Galera 3.x package. THANK YOU!

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

@ceejayoz,

Did you end up with the Galera 3 package due to an upgrade or a fresh install? For an upgrade from Galera 2, I don't think it would have moved to Galera 3. For fresh installs, we have updated the installation docs.

Changed in percona-xtradb-cluster:
status: New → Invalid
ceejayoz (ceejayoz) wrote :

@raghavendra-prabhu We had an AWS instance crap out, so I replaced it with a new one (provisioned via Ansible). It'd have been a fresh install. Definitely not an upgrade - doing a `yum update` on the pre-existing instances keeps them on 2.x.
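One way to catch this kind of mismatch before a replacement node joins is to check which Galera major version the provider reports. A small helper sketch (the function name and log path are mine, not from this report) that parses the "Galera X.Y(rNNN)" string that wsrep_load() prints at startup:

```shell
# Hypothetical helper: extract the Galera major version from the
# "Galera X.Y(rNNN)" line that wsrep_load() writes to the mysqld log.
galera_major() {
  # e.g. "Galera 3.2(r170)" -> "3"
  printf '%s\n' "$1" | sed -n 's/.*Galera \([0-9][0-9]*\)\..*/\1/p'
}

# On a live node you might feed it the most recent provider line, e.g.:
#   galera_major "$(grep -o 'Galera [0-9.]*(r[0-9]*)' /var/log/mysqld.log | tail -n 1)"
```

Comparing the result across nodes (all "2" or all "3", or a mixed cluster with socket.checksum=1 on the 3.x side) would have flagged this before the join timed out.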

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

@ceejayoz,

Thanks for the confirmation. For new installations we have updated the installation docs (so you can update Ansible, Puppet, Chef, etc.) so that you can choose the appropriate Galera package:
PXC 5.5 - galera2
PXC 5.6 - galera3
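For provisioning tools like the Ansible setup described above, one way to keep a PXC 5.5 node from pulling in the 3.x provider is a yum exclude. A sketch, assuming the package name; verify the exact name in your repository (e.g. with `yum list available | grep -i galera`) before pinning:

```ini
# /etc/yum.conf (or the Percona repo section) -- keep a PXC 5.5 node
# on galera2. The package name pattern below is an assumption; check
# your repo before relying on it.
exclude=Percona-XtraDB-Cluster-galera-3*
```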

Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1564
