gcomm exception in PC validate_state_msgs()

Bug #1182367 reported by Teemu Ollakka on 2013-05-21
This bug affects 4 people
Affects: Galera (status tracked in 3.x)
    2.x — Importance: Undecided — Assigned to: Teemu Ollakka
    3.x — Importance: Undecided — Assigned to: Teemu Ollakka
Affects: Percona XtraDB Cluster (status tracked in 5.6)
    5.5 — Importance: Undecided — Unassigned
    5.6 — Importance: Undecided — Unassigned

Bug Description

Originally reported in: https://groups.google.com/forum/?fromgroups#!topic/percona-discussion/fLL1eDJ34Ts

After some partitionings/re-merges, an exception was thrown from pc::Proto::validate_state_msgs(). The relevant part of the log:

130517 18:21:36 [Note] WSREP: declaring 07015197-bbbe-11e2-0800-3e7a13126565 stable
130517 18:21:36 [Note] WSREP: Node 07015197-bbbe-11e2-0800-3e7a13126565 state prim
130517 18:21:36 [Note] WSREP: view(view_id(PRIM,07015197-bbbe-11e2-0800-3e7a13126565,68) memb {
    07015197-bbbe-11e2-0800-3e7a13126565,
    ec71ed3f-bbc1-11e2-0800-3622c3a697f8,
} joined {
} left {
} partitioned {
    7ac28eaf-bbb9-11e2-0800-1c760b64ad99,
})
130517 18:21:36 [Note] WSREP: forgetting 7ac28eaf-bbb9-11e2-0800-1c760b64ad99 (tcp://xx.xx..30.34:4567)
130517 18:21:36 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 1, memb_num = 2
130517 18:21:36 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
130517 18:21:36 [Note] WSREP: STATE EXCHANGE: sent state msg: 828d5335-bef0-11e2-0800-30030a942a4d
130517 18:21:36 [Note] WSREP: STATE EXCHANGE: got state msg: 828d5335-bef0-11e2-0800-30030a942a4d from 0 (db2)
130517 18:21:36 [Note] WSREP: STATE EXCHANGE: got state msg: 828d5335-bef0-11e2-0800-30030a942a4d from 1 (dbserver1)
130517 18:21:36 [Note] WSREP: Quorum results:
    version = 2,
    component = PRIMARY,
    conf_id = 23,
    members = 2/2 (joined/total),
    act_id = 5703326,
    last_appl. = 5703275,
    protocols = 0/4/2 (gcs/repl/appl),
    group UUID = 85b364e5-bb0c-11e2-0800-aa1ab3b9ca31
130517 18:21:36 [Note] WSREP: Flow-control interval: [23, 23]
130517 18:21:36 [Note] WSREP: New cluster view: global state: 85b364e5-bb0c-11e2-0800-aa1ab3b9ca31:5703326, view# 24: Primary, number of nodes: 2, my index: 1, protocol version 2
130517 18:21:36 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130517 18:21:36 [Note] WSREP: Assign initial position for certification: 5703326, protocol version: 2
130517 18:21:41 [Note] WSREP: cleaning up 7ac28eaf-bbb9-11e2-0800-1c760b64ad99 (tcp://xx.xx..30.34:4567)
130517 18:21:52 [Note] WSREP: (ec71ed3f-bbc1-11e2-0800-3622c3a697f8, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://xx.xx..30.34:4567
130517 18:21:53 [Note] WSREP: (ec71ed3f-bbc1-11e2-0800-3622c3a697f8, 'tcp://0.0.0.0:4567') reconnecting to 7ac28eaf-bbb9-11e2-0800-1c760b64ad99 (tcp://xx.xx..30.34:4567), attempt 0
130517 18:21:54 [Note] WSREP: (ec71ed3f-bbc1-11e2-0800-3622c3a697f8, 'tcp://0.0.0.0:4567') turning message relay requesting off
130517 18:21:58 [Note] WSREP: (ec71ed3f-bbc1-11e2-0800-3622c3a697f8, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://xx.xx..30.34:4567
130517 18:21:59 [Note] WSREP: (ec71ed3f-bbc1-11e2-0800-3622c3a697f8, 'tcp://0.0.0.0:4567') reconnecting to 7ac28eaf-bbb9-11e2-0800-1c760b64ad99 (tcp://xx.xx..30.34:4567), attempt 0
130517 18:22:18 [Note] WSREP: (ec71ed3f-bbc1-11e2-0800-3622c3a697f8, 'tcp://0.0.0.0:4567') turning message relay requesting off
130517 18:22:25 [Note] WSREP: (ec71ed3f-bbc1-11e2-0800-3622c3a697f8, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://xx.xx..30.34:4567
130517 18:22:26 [Note] WSREP: (ec71ed3f-bbc1-11e2-0800-3622c3a697f8, 'tcp://0.0.0.0:4567') reconnecting to 7ac28eaf-bbb9-11e2-0800-1c760b64ad99 (tcp://xx.xx..30.34:4567), attempt 0
130517 18:22:29 [Note] WSREP: (ec71ed3f-bbc1-11e2-0800-3622c3a697f8, 'tcp://0.0.0.0:4567') turning message relay requesting off
130517 18:22:30 [Note] WSREP: declaring 07015197-bbbe-11e2-0800-3e7a13126565 stable
130517 18:22:30 [Note] WSREP: declaring 7ac28eaf-bbb9-11e2-0800-1c760b64ad99 stable
130517 18:22:35 [ERROR] WSREP: caught exception in PC, state dump to stderr follows:
pc::Proto{uuid=ec71ed3f-bbc1-11e2-0800-3622c3a697f8,start_prim=0,npvo=0,ignore_sb=0,ignore_quorum=0,state=1,last_sent_seq=279,checksum=1,instances=
    07015197-bbbe-11e2-0800-3e7a13126565,prim=1,un=1,last_seq=370,last_prim=view_id(PRIM,07015197-bbbe-11e2-0800-3e7a13126565,68),to_seq=5873864,weight=1
    7ac28eaf-bbb9-11e2-0800-1c760b64ad99,prim=0,un=0,last_seq=1327602,last_prim=view_id(PRIM,07015197-bbbe-11e2-0800-3e7a13126565,23),to_seq=5872395,weight=1
    ec71ed3f-bbc1-11e2-0800-3622c3a697f8,prim=1,un=1,last_seq=279,last_prim=view_id(PRIM,07015197-bbbe-11e2-0800-3e7a13126565,68),to_seq=5873864,weight=1
,state_msgs=
    07015197-bbbe-11e2-0800-3e7a13126565,pcmsg{ type=STATE, seq=0, flags= 0, node_map { 07015197-bbbe-11e2-0800-3e7a13126565,prim=1,un=0,last_seq=370,last_prim=view_id(PRIM,07015197-bbbe-11e2-0800-3e7a13126565,68),to_seq=5873864,weight=1
    ec71ed3f-bbc1-11e2-0800-3622c3a697f8,prim=1,un=0,last_seq=279,last_prim=view_id(PRIM,07015197-bbbe-11e2-0800-3e7a13126565,68),to_seq=5873864,weight=1
}}
    7ac28eaf-bbb9-11e2-0800-1c760b64ad99,pcmsg{ type=STATE, seq=0, flags= 0, node_map { 07015197-bbbe-11e2-0800-3e7a13126565,prim=1,un=1,last_seq=1297805,last_prim=view_id(PRIM,07015197-bbbe-11e2-0800-3e7a13126565,23),to_seq=5872395,weight=1
    7ac28eaf-bbb9-11e2-0800-1c760b64ad99,prim=0,un=0,last_seq=1327602,last_prim=view_id(PRIM,07015197-bbbe-11e2-0800-3e7a13126565,23),to_seq=5872395,weight=1
    ec71ed3f-bbc1-11e2-0800-3622c3a697f8,prim=1,un=1,last_seq=1375500,last_prim=view_id(PRIM,07015197-bbbe-11e2-0800-3e7a13126565,23),to_seq=5872395,weight=1
}}
    ec71ed3f-bbc1-11e2-0800-3622c3a697f8,pcmsg{ type=STATE, seq=0, flags= 0, node_map { 07015197-bbbe-11e2-0800-3e7a13126565,prim=1,un=0,last_seq=370,last_prim=view_id(PRIM,07015197-bbbe-11e2-0800-3e7a13126565,68),to_seq=5873864,weight=1
    ec71ed3f-bbc1-11e2-0800-3622c3a697f8,prim=1,un=0,last_seq=279,last_prim=view_id(PRIM,07015197-bbbe-11e2-0800-3e7a13126565,68),to_seq=5873864,weight=1
}}
,current_view=view(view_id(REG,07015197-bbbe-11e2-0800-3e7a13126565,78) memb {
    07015197-bbbe-11e2-0800-3e7a13126565,
    7ac28eaf-bbb9-11e2-0800-1c760b64ad99,
    ec71ed3f-bbc1-11e2-0800-3622c3a697f8,
} joined {
    7ac28eaf-bbb9-11e2-0800-1c760b64ad99,
} left {
} partitioned {
}),pc_view=view(view_id(PRIM,07015197-bbbe-11e2-0800-3e7a13126565,68) memb {
    07015197-bbbe-11e2-0800-3e7a13126565,
    ec71ed3f-bbc1-11e2-0800-3622c3a697f8,
} joined {
} left {
} partitioned {
}),mtu=32636}
130517 18:22:35 [Note] WSREP: evs::msg{version=0,type=1,user_type=255,order=4,seq=0,seq_range=0,aru_seq=-1,flags=0,source=ec71ed3f-bbc1-11e2-0800-3622c3a697f8,source_view_id=view_id(REG,07015197-bbbe-11e2-0800-3e7a13126565,78),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=8372711,node_list=()
} 116
130517 18:22:35 [ERROR] WSREP: exception caused by message: evs::msg{version=0,type=3,user_type=255,order=1,seq=3,seq_range=-1,aru_seq=0,flags=4,source=7ac28eaf-bbb9-11e2-0800-1c760b64ad99,source_view_id=view_id(REG,07015197-bbbe-11e2-0800-3e7a13126565,78),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=8659774,node_list=()
}
130517 18:22:35 [ERROR] WSREP: state after handling message: evs::proto(evs::proto(ec71ed3f-bbc1-11e2-0800-3622c3a697f8, OPERATIONAL, view_id(REG,07015197-bbbe-11e2-0800-3e7a13126565,78)), OPERATIONAL) {
current_view=view(view_id(REG,07015197-bbbe-11e2-0800-3e7a13126565,78) memb {
    07015197-bbbe-11e2-0800-3e7a13126565,
    7ac28eaf-bbb9-11e2-0800-1c760b64ad99,
    ec71ed3f-bbc1-11e2-0800-3622c3a697f8,
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=3,safe_seq=0,node_index=node: {idx=0,range=[4,3],safe_seq=3} node: {idx=1,range=[4,3],safe_seq=0} node: {idx=2,range=[4,3],safe_seq=3} ,msg_index= (2,0),evs::msg{version=0,type=1,user_type=255,order=4,seq=0,seq_range=0,aru_seq=-1,flags=0,source=ec71ed3f-bbc1-11e2-0800-3622c3a697f8,source_view_id=view_id(REG,07015197-bbbe-11e2-0800-3e7a13126565,78),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=8372711,node_list=()
}
    (0,1),evs::msg{version=0,type=1,user_type=255,order=0,seq=1,seq_range=0,aru_seq=0,flags=4,source=07015197-bbbe-11e2-0800-3e7a13
130517 18:22:35 [ERROR] WSREP: exception from gcomm, backend must be restarted:msg_state == local_state: ec71ed3f-bbc1-11e2-0800-3622c3a697f8 node 07015197-bbbe-11e2-0800-3e7a13126565 prim state message and local states not consistent: msg node prim=1,un=0,last_seq=370,last_prim=view_id(PRIM,07015197-bbbe-11e2-0800-3e7a13126565,68),to_seq=5873864,weight=1 local state prim=1,un=1,last_seq=370,last_prim=view_id(PRIM,07015197-bbbe-11e2-0800-3e7a13126565,68),to_seq=5873864,weight=1 (FATAL)
     at gcomm/src/pc_proto.cpp:validate_state_msgs():606

The reason for this crash is that the message state is compared to the local state using the == operator, which also compares the un status flag — a flag that is not expected to stay consistent over partitionings. The operator should be changed to compare only the parts of the state that must stay globally consistent.

Changed in galera:
status: New → Confirmed
assignee: nobody → Teemu Ollakka (teemu-ollakka)
milestone: none → 23.2.6
importance: Undecided → High
Changed in galera:
milestone: 23.2.6 → 23.2.7
Changed in galera:
milestone: 23.2.7 → 24.2.8
Alex Yurchenko (ayurchen) wrote :

Another report at https://groups.google.com/d/msg/codership-team/CqmJVHylz4M/PjkovPEbW5cJ, logs attached. Note that the node clocks are not synced: according to the reporter, node2 is 176 seconds off and node3 is 120 seconds off.

Changed in galera:
milestone: 25.2.8 → 25.2.9
Victor Teichert (victor-2) wrote :

Hello,

I had a cluster lock up after one node was partitioned from the cluster. It looks like it is the same issue described here:

On the server that partitioned:
140120 19:49:25 [Warning] WSREP: last inactive check more than PT1.5S ago (PT7.11956S), skipping check

This resulted in the other nodes all failing with a message like the following:

140120 19:50:12 [ERROR] WSREP: caught exception in PC, state dump to stderr follows:

I have snipped this section and put all of the logs in the attached file
...
...
...

140120 19:50:13 [ERROR] WSREP: exception caused by message: evs::msg{version=0,type=2,user_type=255,order=1,seq=-1,seq_range=-1,aru_seq=-1,flags=4,source=b30e1037-74a5-11e3-b724-1f38224e5841,source_view_id=view_id(REG,3497e712-74a6-11e3-afef-63feb09277a0,400),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=30493889,node_list=()
 }140120 19:50:13 [ERROR] WSREP: exception from gcomm, backend must be restarted:msg_state == local_state: 97ea3a0a-6298-11e3-975a-07bbeda5973c node 97ea3a0a-6298-11e3-975a-07bbeda5973c prim state message and local states not consistent: msg node prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,97ea3a0a-6298-11e3-975a-07bbeda5973c,399),to_seq=2145834,weight=1 local state prim=1,un=1,last_seq=2,last_prim=view_id(PRIM,97ea3a0a-6298-11e3-975a-07bbeda5973c,399),to_seq=2145834,weight=1 (FATAL)

What is the reason that the entire cluster locked up? Is there anything that can be done to prevent this issue from recurring?

I can provide the logs from the other servers if needed.

Victor Teichert (victor-2) wrote :

I have attached what I believe to be a viable patch for this bug
galera 2.x gcomm/src/pc_proto.cpp r166

Victor Teichert (victor-2) wrote :

Here is what I believe to be a viable patch for
galera 3.x gcomm/src/pc_proto.cpp r170

Teemu Ollakka (teemu-ollakka) wrote :

Hi Victor,

Your patch will surely get rid of the exception, but I doubt it is the right thing to do. The thing is, in that code branch we are checking the states of two nodes that both claim to be coming from the primary component, so they should have seen exactly the same set of EVS views and messages and therefore have equal states. Getting this exception indicates that something is wrong with the state messages, and continuing might cause data corruption. In most cases things will probably go ok, but that cannot be guaranteed. So use it at your own risk.

It looks like one problem here is that, in case of a partitioning, the nodes remaining in the primary component mark the partitioned nodes as non-primary too late (after the message exchange). So if a partitioned node comes back to the group before the formation of a new primary component has finished, the state messages may not match.

A fix for this would be to mark partitioned nodes as non-primary right after the quorum computation, but that might require changes that are not backwards compatible.

Another thing worth inspecting is whether the check on this state message should be done at all: it may be that the failing case here involves a state message coming from a previously partitioned node (a node coming from non-prim). In that case the fix would simply be to skip state validation for such messages.

Teemu Ollakka (teemu-ollakka) wrote :

Further inspection revealed that the actual cause of this crash was setting the un flag also on nodes that were coming from prim. That is not necessary, as nodes in prim already know the status of the other prim nodes.

Fixes were pushed in revisions:
http://bazaar.launchpad.net/~codership/galera/2.x/revision/168
http://bazaar.launchpad.net/~codership/galera/3.x/revision/172

The fix was released in the 5.6.15-25.4 release of the 5.6 series. (Removed it from the milestone.)
