Galera

Whole Cluste become NOT-Primary after ASIO issue on a node

Series 3.x
Bug #1249805

Bug #1249805 reported by Marco Tusa on 2013-11-10

This bug affects 1 person

	Status	Importance	Assigned to	Milestone
Galera	Status tracked in 3.x
2.x	Fix Committed	Undecided	Unassigned
3.x	Fix Committed	Undecided	Unassigned
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC	Status tracked in 5.6
5.5	Fix Released	Medium	Unassigned	Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC 5.5.39-25.11
5.6	Fix Released	Medium	Unassigned	Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC 5.6.19-25.6

Bug Description

Hi folks.
we recently had an issue with a customer installation.
Customer has multiple clusters but the issue was present on on cluster only (at least that we know).

Customer starts to have issue on one node when all the machines on the cluster were starting to report a SCSI issue as follow:
Sense codes:
Nov 6 22:09:14 Server Administrator: Storage Service EventID: 2095 Unexpected sense. SCSI sense data: Sense key: 6 Sense code: 29 Sense qualifier: CD: Physical Disk 0:1:1 Controller 0, Connector 0

What happened was that ONLY ONE node was actually reading it as a network issue and then it starts to disconnect and rejoin the cluster.
The interruption was a little bit larger of the default suspect_timeout.
Customer obviously work on the issue to have it fix on the machines, but meanwhile another issue happened.

The cluster nodes are xdb1-5.
The xdb3 node was the one normally failing.

On the 131108 2:04:09 node xdb1 rejoin the cluster and xdb2 become is donor.

While performing the synchronization node xdb5 crashed with the error:
131108 2:04:26 [Note] WSREP: declaring b9f56651-4818-11e3-0800-42a32dd84cfd stable
terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check
02:04:26 UTC - mysqld got signal 6 ;

The error is related to the ASIO libraries and as such I think it should be associate to the network transfer.
Node xdb5 stops and MySQL crashed.

At the same time the node xdb1 seems having an issue, not clear to me because the xdb5 crashes or because same/another network issue happens:

131108 2:04:26 [ERROR] WSREP: exception caused by message: evs::msg{version=0,type=4,user_type=255,order=1,seq=-1,seq_range=-1,aru_seq=-1,flags=4,source=f9398cbc-4816-11e3-0800-dca12a279d32,source_view_id=view_id(REG,0d7bd986-481a-11e3-0800-1fea849b8d9b,35),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=263904,node_list=( 0d7bd986-481a-11e3-0800-1fea849b8d9b,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,00000000-0000-0000-0000-000000000000,0),safe_seq=-1,im_range=[-1,-1],}
35d7689b-4817-11e3-0800-67141291930b,node: {operational=0,suspected=1,leave_seq=-1,view_id=view_id(REG,0d7bd986-481a-11e3-0800-1fea849b8d9b,35),safe_seq=-1,im_range=[0,-1],}
692dff7e-4815-11e3-0800-0cec2658b42a,node: {operational=0,suspected=1,leave_seq=-1,view_id=view_id(REG,0d7bd986-481a-11e3-0800-1fea849b8d9b,35),safe_seq=-1,im_range=[0,-1],}
b9f56651-4818-11e3-0800-42a32dd84cfd,node: {operational=0,suspected=1,leave_seq=-1,view_id=view_id(REG,0d7bd986-481a-11e3-0800-1fea849b8d9b,35),safe_seq=-1,im_range=[0,-1],}
131108 2:04:26 [ERROR] WSREP: state after handling message: evs::proto(evs::proto(0d7bd986-481a-11e3-0800-1fea849b8d9b, GATHER, view_id(REG,0d7bd986-481a-11e3-0800-1fea849b8d9b,35)), GATHER) {
current_view=view(view_id(REG,0d7bd986-481a-11e3-0800-1fea849b8d9b,35) memb {
0d7bd986-481a-11e3-0800-1fea849b8d9b,
35d7689b-4817-11e3-0800-67141291930b,
692dff7e-4815-11e3-0800-0cec2658b42a,
b9f56651-4818-11e3-0800-42a32dd84cfd,
f9398cbc-4816-11e3-0800-dca12a279d32,
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=-1,safe_seq=-1,node_index=node: {idx=0,range=[7,6],safe_seq=-1} node: {idx=1,range=[0,-1],safe_seq=-1} node: {idx=2,range=[0,-1],safe_seq=-1} node: {idx=3,range=[0,-1],safe_seq=-1} node: {idx=4,range=[1,0],safe_seq=-1} ,msg_index= (0,0),evs::msg{version=0,type=1,user_type=255,order=4,seq=0,seq_range=0,aru_seq=-1,flags=0,source=0d7bd986-481a-11e3-0800-1fea849b8d9b,source_view_id=view_id(REG,0d7bd986-481a-11e3-0800-1fea849b8d9b,35),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq
131108 2:04:26 [ERROR] WSREP: exception from gcomm, backend must be restarted:nlself_i != same_view.end(): (FATAL) <------------------------------
at gcomm/src/evs_proto.cpp:handle_join():3529
131108 2:04:26 [Note] WSREP: Received self-leave message.
131108 2:04:26 [Note] WSREP: Flow-control interval: [0, 0]
131108 2:04:26 [Note] WSREP: Received SELF-LEAVE. Closing connection.
131108 2:04:26 [Note] WSREP: Shifting JOINER -> CLOSED (TO: 67975520)
131108 2:04:26 [Note] WSREP: RECV thread exiting 0: Success

After this also the other nodes get affected .

xdb2 seems to ne try to reconnect to the joiner but not successfully:
131108 2:04:26 [Note] WSREP: (692dff7e-4815-11e3-0800-0cec2658b42a, 'tcp://0.0.0.0:4567') turning message relay requesting off
131108 2:04:26 [Note] WSREP: (692dff7e-4815-11e3-0800-0cec2658b42a, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://172.24.0.254:4567
131108 2:04:28 [Note] WSREP: (692dff7e-4815-11e3-0800-0cec2658b42a, 'tcp://0.0.0.0:4567') reconnecting to f9398cbc-4816-11e3-0800-dca12a279d32 (tcp://172.24.0.254:4567), attempt 0
131108 2:04:30 [Note] WSREP: (692dff7e-4815-11e3-0800-0cec2658b42a, 'tcp://0.0.0.0:4567') reconnecting to 0d7bd986-481a-11e3-0800-1fea849b8d9b (tcp://172.24.0.86:4567), attempt 0
131108 2:05:00 [Note] WSREP: evs::proto(692dff7e-4815-11e3-0800-0cec2658b42a, GATHER, view_id(REG,0d7bd986-481a-11e3-0800-1fea849b8d9b,35)) suspecting node: 0d7bd986-481a-11e3-0800-1fea849b8d9b
131108 2:05:05 [Warning] WSREP: evs::proto(692dff7e-4815-11e3-0800-0cec2658b42a, GATHER, view_id(REG,0d7bd986-481a-11e3-0800-1fea849b8d9b,35)) install timer expired
evs::proto(evs::proto(692dff7e-4815-11e3-0800-0cec2658b42a, GATHER, view_id(REG,0d7bd986-481a-11e3-0800-1fea849b8d9b,35)), GATHER) {
current_view=view(view_id(REG,0d7bd986-481a-11e3-0800-1fea849b8d9b,35) memb {
0d7bd986-481a-11e3-0800-1fea849b8d9b,
35d7689b-4817-11e3-0800-67141291930b,
692dff7e-4815-11e3-0800-0cec2658b42a,
b9f56651-4818-11e3-0800-42a32dd84cfd,
f9398cbc-4816-11e3-0800-dca12a279d32,
} joined {
} left {
} partitioned {
}),

As well xdb3-4.

What happened after that is that it seems all the nodes received the request to change the state to become NON-Primary, at that point production get affected.

Summarizing.
Cluster of 5 nodes.
xdb1 was Joining so active 4 of 5.
xdb2 become Donor so active 3 of 5.
xdb5 crashed so active 2 of 5.

Would have be possible that the remaining 2 Nodes choose to become NON-Active because different state and see it as split brain?
Why was not possible for xdb2 to rejoin the cluster, and do IST Sync and participate to the quorum given the xdb1 stops and then xdb2 was free, before the possible quorum declaration?

We suspect a possible link with the issue below, but I would like to have a better understanding of the sequence of the events and the root cause.

Attached the whole set of logs relevant to the issue.

Suspected related Issues:
https://groups.google.com/forum/#!msg/codership-team/KJI1JyE9tuI/HCGb8ZBIbC8J
https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1232747

Revision history for this message

Marco Tusa (marcotusa) wrote on 2013-11-10:

all logs attached Edit (70.7 KiB, application/zip)

Revision history for this message

Teemu Ollakka (teemu-ollakka) wrote on 2013-11-15:

Log file analysis:

Firstly there were 4 nodes online, xdb2-5. Then node xdb1 started join process and requested SST from xdb2. Due to network conditions, xdb5 dropped from group briefly at least one time (xdb5 log 2:24:15 and onwards). Due to lp:1232747 nodes xdb1 and xdb5 crashed during group renegotiation, which caused remaining xdb2-4 to form singleton groups because of previous failed attempts making evs install_timeout_count_ counter to reach maximum value. At this point primary component was lost and could not be re-established because some nodes (probably xdb1, xdb5) from previous known primary component were not present.

So, there are at least two issues to be addressed:
* Obvious one lp:1232747 which should be fixed
* Make evs install_timeout_count_ counter maximum value higher or mark other nodes invalid one by one to avoid loosing too many nodes from group at once. One way would be to set max value to the size of last known group and mark nodes invalid only if they fail to reach consensus within install timeout period.

It might also make sense to isolate constantly failing nodes from the group for longer periods of time to avoid causing too much turbulence for the group.

Revision history for this message

Teemu Ollakka (teemu-ollakka) wrote on 2014-05-15:

Crash fixes and evs membership improvements committed in: https://github.com/codership/galera/issues/14