3 node PXC killed after reboot of 1 node: exception from gcomm, backend must be restarted: aborting due to conflicting prims: older overrides (FATAL)

Bug #1571356 reported by Patrick Wagner
Affects: Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC)
Status: New
Importance: Undecided
Assigned to: Unassigned

Bug Description

We're running a 3 node PXC 5.6.28-76.1-56 cluster on CentOS 6 x64, with both ignore_sb=1 and ignore_quorum=1 set.

Last week, one of the nodes (node03) unexpectedly crashed and rebooted at about 00:10. On reboot, "something" happened: nothing relating to mysql was logged in syslog or mysql.log, but the other nodes reported an incoming WSREP connection. According to monitoring, the "bad" node03 reported a 3 node cluster (and wsrep_local_state = 4) and happily answered queries for about 20 seconds at 00:15, along with the other nodes. It then suddenly reported a cluster size of "1", while both of the other nodes stopped answering queries entirely. node01 and node02 logged "exception from gcomm, backend must be restarted: aborting due to conflicting prims: older overrides (FATAL)" in their logs.
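The monitoring view described above can be approximated with a small check over per-node status snapshots. This is only a sketch: it assumes each snapshot is a dictionary built from the Galera status variables wsrep_cluster_size and wsrep_local_state (where 4 means "Synced"), with None standing for a node that did not answer. The function name, the input layout and the sample values are hypothetical and are not taken from the attached logs.

```python
SYNCED = 4  # wsrep_local_state value that Galera reports for a synced node


def classify_nodes(snapshots, expected_size):
    """Sort nodes into healthy, degraded and unreachable.

    snapshots maps a node name to a dict of status variables,
    or to None if the node did not answer the monitoring query.
    """
    report = {"healthy": [], "degraded": [], "unreachable": []}
    for node, status in sorted(snapshots.items()):
        if status is None:
            report["unreachable"].append(node)
            continue
        size = int(status.get("wsrep_cluster_size", 0))
        state = int(status.get("wsrep_local_state", 0))
        if size == expected_size and state == SYNCED:
            report["healthy"].append(node)
        else:
            report["degraded"].append(node)
    return report


if __name__ == "__main__":
    # Hypothetical snapshot resembling the situation shortly after 00:15:
    # node03 sees a cluster of one, node01 and node02 no longer answer.
    after_split = {
        "node01": None,
        "node02": None,
        "node03": {"wsrep_cluster_size": "1", "wsrep_local_state": "4"},
    }
    print(classify_nodes(after_split, expected_size=3))
```

A check like this would have flagged the cluster as fully broken at 00:15 even though node03 still reported wsrep_local_state = 4, because the cluster size it saw no longer matched the expected three nodes.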

At 00:23, according to monitoring, node03 briefly remounted its root filesystem (/) read-only for about a minute, during which time mysqld on node03 died. The other nodes remained stuck, so the entire cluster was down.

At about 04:30, I began investigating. Shutting mysqld down on the 2 "stuck" nodes via "service stop" failed, so I had to SIGKILL mysqld on both of them. I then used one of them to bootstrap the cluster, and the other 2 joined just fine.
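The report does not say how the bootstrap node was chosen. A common approach with Galera is to compare the uuid and seqno recorded in each node's grastate.dat and bootstrap from the node with the highest seqno. The sketch below assumes the file contents have already been collected as text; the function names are hypothetical. After a SIGKILL the recorded seqno is typically -1, in which case the position has to be recovered first (for example with mysqld_safe --wsrep-recover), so the sketch refuses to choose rather than guess.

```python
def parse_grastate(text):
    """Extract (uuid, seqno) from the contents of a grastate.dat file."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and anything that is not "key: value".
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        state[key.strip()] = value.strip()
    return state["uuid"], int(state["seqno"])


def pick_bootstrap_node(grastates):
    """Return the node with the highest seqno.

    grastates maps a node name to the text of its grastate.dat.
    Raises ValueError if the nodes disagree on the cluster uuid or
    if any node has an unknown (negative) seqno.
    """
    parsed = {node: parse_grastate(text) for node, text in grastates.items()}
    uuids = {uuid for uuid, _ in parsed.values()}
    if len(uuids) != 1:
        raise ValueError("nodes report different cluster uuids")
    unknown = sorted(node for node, (_, seqno) in parsed.items() if seqno < 0)
    if unknown:
        raise ValueError("seqno unknown on: " + ", ".join(unknown))
    # On a tie, the alphabetically first node name wins.
    return max(sorted(parsed), key=lambda node: parsed[node][1])
```

On a cluster with rare writesets, as described in this report, all nodes would be expected to show the same seqno, so any of them would be an equally valid bootstrap candidate.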

Note: writesets happen rarely on this particular cluster (only through manual DDL/DML statements by administrators), so during the automatic re-join attempt at 00:15 the WSREP state of the crashed node should have been identical to the state on the 2 other nodes.

my.cnf and mysql.log of all nodes attached (containing all events logged on 2016-04-12)

Patrick Wagner (patrick-wagner-s) wrote :
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1898
