WSREP: unserialize error invalid flags 2: 71 (Protocol error)

Bug #1264225 reported by ceejayoz
This bug affects 1 person

Affects: Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC)
Status: Invalid
Importance: Undecided
Assigned to: Unassigned

Bug Description

I'm getting what appears to be a repeat of https://bugs.launchpad.net/galera/+bug/1153727 in the latest XtraDB Cluster 5.5.34-25.9.

I have a four node Percona XtraDB cluster. Each night, one node shuts down for a backup and subsequently rejoins the cluster. This has run fine for quite some time now.

Last night, the backup node refused to rejoin the cluster. The other three nodes are humming along nicely, but every attempt by the failed node to rejoin is unsuccessful.

Luckily, we provision all of our nodes using Ansible, so we've got a repeatable recipe for blowing away a failed node and replacing it with a brand new, from scratch database node.

UNluckily, new nodes also fail. No changes to the Ansible scripts have been made recently and we've successfully launched instances since. This is all running on a two-AZ VPC in Amazon AWS, if that makes any difference.

The new server's MySQL log says:

    131226 04:44:50 mysqld_safe Starting mysqld daemon with databases from /data/mysql
    131226 04:44:50 mysqld_safe Skipping wsrep-recover for empty datadir: /data/mysql
    131226 04:44:50 mysqld_safe Assigning 00000000-0000-0000-0000-000000000000:-1 to wsrep_start_position
    131226 4:44:50 [Note] WSREP: wsrep_start_position var submitted: '00000000-0000-0000-0000-000000000000:-1'
    131226 4:44:50 [Note] WSREP: Read nil XID from storage engines, skipping position init
    131226 4:44:50 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/libgalera_smm.so'
    131226 4:44:50 [Note] WSREP: wsrep_load(): Galera 3.2(r170) by Codership Oy <email address hidden> loaded successfully.
    131226 4:44:50 [Note] WSREP: CRC-32C: using hardware acceleration.
    131226 4:44:50 [Warning] WSREP: Could not open saved state file for reading: /data/mysql//grastate.dat
    131226 4:44:50 [Note] WSREP: Found saved state: 00000000-0000-0000-0000-000000000000:-1
    131226 4:44:50 [Note] WSREP: Passing config to GCS: base_host = 10.0.1.226; base_port = 4567; cert.log_conflicts = no; gcache.dir = /data/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /data/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; repl.causal_read_timeout = PT30S; repl.commit_order = 3; repl.key_format = FLAT8; repl.proto_max = 5
    131226 4:44:50 [Note] WSREP: Assign initial position for certification: -1, protocol version: -1
    131226 4:44:50 [Note] WSREP: wsrep_sst_grab()
    131226 4:44:50 [Note] WSREP: Start replication
    131226 4:44:50 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
    131226 4:44:50 [Note] WSREP: protonet asio version 0
    131226 4:44:50 [Note] WSREP: Using CRC-32C (optimized) for message checksums.
    131226 4:44:50 [Note] WSREP: backend: asio
    131226 4:44:50 [Note] WSREP: GMCast version 0
    131226 4:44:50 [Note] WSREP: (7497477a-6de8-11e3-9a71-7af4d47d85be, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
    131226 4:44:50 [Note] WSREP: (7497477a-6de8-11e3-9a71-7af4d47d85be, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
    131226 4:44:50 [Note] WSREP: EVS version 0
    131226 4:44:50 [Note] WSREP: PC version 0
    131226 4:44:50 [Note] WSREP: gcomm: connecting to group 'my_wsrep_cluster', peer '10.0.2.180:,10.0.1.226:,10.0.1.21:,10.0.2.236:'
    131226 4:44:50 [Warning] WSREP: (7497477a-6de8-11e3-9a71-7af4d47d85be, 'tcp://0.0.0.0:4567') address 'tcp://10.0.1.226:4567' points to own listening address, blacklisting
    131226 4:44:53 [Warning] WSREP: no nodes coming from prim view, prim not possible
    131226 4:44:53 [Note] WSREP: view(view_id(NON_PRIM,7497477a-6de8-11e3-9a71-7af4d47d85be,1) memb {
     7497477a-6de8-11e3-9a71-7af4d47d85be,0
    } joined {
    } left {
    } partitioned {
    })
    131226 4:44:54 [Warning] WSREP: last inactive check more than PT1.5S ago (PT3.5049S), skipping check
    131226 4:45:23 [Note] WSREP: view((empty))
    131226 4:45:23 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
      at gcomm/src/pc.cpp:connect():141
    131226 4:45:23 [ERROR] WSREP: gcs/src/gcs_core.c:gcs_core_open():196: Failed to open backend connection: -110 (Connection timed out)
    131226 4:45:23 [ERROR] WSREP: gcs/src/gcs.c:gcs_open():1291: Failed to open channel 'my_wsrep_cluster' at 'gcomm://10.0.2.180,10.0.1.226,10.0.1.21,10.0.2.236': -110 (Connection timed out)
    131226 4:45:23 [ERROR] WSREP: gcs connect failed: Connection timed out
    131226 4:45:23 [ERROR] WSREP: wsrep::connect() failed: 7
    131226 4:45:23 [ERROR] Aborting

    131226 4:45:23 [Note] WSREP: Service disconnected.
    131226 4:45:24 [Note] WSREP: Some threads may fail to exit.
    131226 4:45:24 [Note] /usr/sbin/mysqld: Shutdown complete

    131226 04:45:24 mysqld_safe mysqld from pid file /data/mysql/mysql.pid ended

The existing nodes in the cluster spit the following out whenever the new node attempts to join:

    131226 4:45:23 [Warning] WSREP: unserialize error invalid flags 2: 71 (Protocol error)
      at gcomm/src/gcomm/datagram.hpp:unserialize():101

affects: percona-server → percona-xtradb-cluster
Revision history for this message
Alex Yurchenko (ayurchen) wrote :

This is most likely a result of mixing Galera 2.x and 3.x. To be used in a mixed cluster, 3.x nodes must have socket.checksum set to 1.
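In practice that means setting the option via wsrep_provider_options on the 3.x nodes. A minimal my.cnf sketch, assuming the provider path from the log above; only socket.checksum=1 comes from the comment, the rest is illustrative:

```ini
# my.cnf on the Galera 3.x nodes (sketch, not a full config).
# socket.checksum=1 selects the CRC32 checksum that 2.x nodes expect;
# Galera 3.x defaults to a format 2.x cannot unserialize.
[mysqld]
wsrep_provider=/usr/lib64/libgalera_smm.so
wsrep_provider_options="socket.checksum=1"
```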

ceejayoz (ceejayoz) wrote :

Looks like you're right - same Percona XtraDB Cluster version, but the new nodes are getting the Galera 3.x package. THANK YOU!

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

@ceejayoz,

Did you end up with the Galera 3 package due to an upgrade or a fresh install? For an upgrade from Galera 2, I don't think it would have moved to Galera 3. For fresh installs, we have updated the installation docs.

Changed in percona-xtradb-cluster:
status: New → Invalid
ceejayoz (ceejayoz) wrote :

@raghavendra-prabhu We had an AWS instance crap out, so I replaced it with a new one (provisioned via Ansible). It'd have been a fresh install. Definitely not an upgrade - doing a `yum update` on the pre-existing instances keeps them on 2.x.
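One way to catch this kind of mismatch before a replacement node joins is to check which Galera major version the provider reports. A small helper sketch (the function name and log path are mine, not from this report) that parses the "Galera X.Y(rNNN)" string that wsrep_load() prints at startup:

```shell
# Hypothetical helper: extract the Galera major version from the
# "Galera X.Y(rNNN)" line that wsrep_load() writes to the mysqld log.
galera_major() {
  # e.g. "Galera 3.2(r170)" -> "3"
  printf '%s\n' "$1" | sed -n 's/.*Galera \([0-9][0-9]*\)\..*/\1/p'
}

# On a live node you might feed it the most recent provider line, e.g.:
#   galera_major "$(grep -o 'Galera [0-9.]*(r[0-9]*)' /var/log/mysqld.log | tail -n 1)"
```

Comparing the result across nodes (all "2" or all "3", or a mixed cluster with socket.checksum=1 on the 3.x side) would have flagged this before the join timed out.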

Raghavendra D Prabhu (raghavendra-prabhu) wrote :

@ceejayoz,

Thanks for the confirmation. For new installations we have updated the installation docs (so you can update Ansible, Puppet, Chef, etc.) so that you can choose the appropriate Galera package:
PXC 5.5 - galera2
PXC 5.6 - galera3
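For provisioning tools like the Ansible setup described above, one way to keep a PXC 5.5 node from pulling in the 3.x provider is a yum exclude. A sketch, assuming the package name; verify the exact name in your repository (e.g. with `yum list available | grep -i galera`) before pinning:

```ini
# /etc/yum.conf (or the Percona repo section) -- keep a PXC 5.5 node
# on galera2. The package name pattern below is an assumption; check
# your repo before relying on it.
exclude=Percona-XtraDB-Cluster-galera-3*
```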

Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1564
