"backend must be restarted" message causes node hang

Bug #1212739 reported by Jay Janssen on 2013-08-15
Affects                 Status         Importance  Assigned to
Galera                                 Undecided   Unassigned
Percona XtraDB Cluster  Fix Committed  Medium      Unassigned

(Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC)

Bug Description

Scenario:

node1 up and running, 1 node cluster.

Node2 starts, but it has an older grastate. This seems to cause a 'prim' conflict and drops node1 into the Init state. Node2 fails, and node1 goes into this state:

130815 15:21:40 [Note] WSREP: declaring 620654b6-05be-11e3-9fed-cbbc588bf177 stable
130815 15:21:40 [Note] WSREP: declaring d7f678fa-05bd-11e3-b212-922779171530 stable
130815 15:21:40 [Warning] WSREP: 32e0c052-05be-11e3-a964-ee85c7ae2549 conflicting prims: my prim: view_id(PRIM,32e0c052-05be-11e3-a964-ee85c7ae2549,1) other prim: view_id(PRIM,d7f678fa-05bd-11e3-b212-922779171530,10)
130815 15:21:40 [ERROR] WSREP: caught exception in PC, state dump to stderr follows:
pc::Proto{uuid=32e0c052-05be-11e3-a964-ee85c7ae2549,start_prim=1,npvo=0,ignore_sb=0,ignore_quorum=0,state=1,last_sent_seq=3,checksum=1,instances=
 32e0c052-05be-11e3-a964-ee85c7ae2549,prim=1,un=0,last_seq=3,last_prim=view_id(PRIM,32e0c052-05be-11e3-a964-ee85c7ae2549,1),to_seq=2,weight=1
,state_msgs=
 32e0c052-05be-11e3-a964-ee85c7ae2549,pcmsg{ type=STATE, seq=0, flags= 0, node_map { 32e0c052-05be-11e3-a964-ee85c7ae2549,prim=1,un=0,last_seq=3,last_prim=view_id(PRIM,32e0c052-05be-11e3-a964-ee85c7ae2549,1),to_seq=2,weight=1
}}
 620654b6-05be-11e3-9fed-cbbc588bf177,pcmsg{ type=STATE, seq=0, flags= 0, node_map { 620654b6-05be-11e3-9fed-cbbc588bf177,prim=0,un=0,last_seq=4294967295,last_prim=view_id(NON_PRIM,00000000-0000-0000-0000-000000000000,0),to_seq=-1,weight=1
 d7f678fa-05bd-11e3-b212-922779171530,prim=1,un=0,last_seq=2,last_prim=view_id(PRIM,d7f678fa-05bd-11e3-b212-922779171530,10),to_seq=-1,weight=1
}}
,current_view=view(view_id(REG,32e0c052-05be-11e3-a964-ee85c7ae2549,12) memb {
 32e0c052-05be-11e3-a964-ee85c7ae2549,
 620654b6-05be-11e3-9fed-cbbc588bf177,
 d7f678fa-05bd-11e3-b212-922779171530,
} joined {
 620654b6-05be-11e3-9fed-cbbc588bf177,
 d7f678fa-05bd-11e3-b212-922779171530,
} left {
} partitioned {
}),pc_view=view(view_id(PRIM,32e0c052-05be-11e3-a964-ee85c7ae2549,1) memb {
 32e0c052-05be-11e3-a964-ee85c7ae2549,
} joined {
} left {
} partitioned {
}),mtu=32636}
130815 15:21:40 [Note] WSREP: evs::msg{version=0,type=1,user_type=255,order=4,seq=0,seq_range=0,aru_seq=-1,flags=4,source=d7f678fa-05bd-11e3-b212-922779171530,source_view_id=view_id(REG,32e0c052-05be-11e3-a964-ee85c7ae2549,12),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=271,node_list=()
} 116
130815 15:21:40 [ERROR] WSREP: exception caused by message: evs::msg{version=0,type=3,user_type=255,order=1,seq=0,seq_range=-1,aru_seq=0,flags=4,source=620654b6-05be-11e3-9fed-cbbc588bf177,source_view_id=view_id(REG,32e0c052-05be-11e3-a964-ee85c7ae2549,12),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=17,node_list=()
}
 state after handling message: evs::proto(evs::proto(32e0c052-05be-11e3-a964-ee85c7ae2549, OPERATIONAL, view_id(REG,32e0c052-05be-11e3-a964-ee85c7ae2549,12)), OPERATIONAL) {
current_view=view(view_id(REG,32e0c052-05be-11e3-a964-ee85c7ae2549,12) memb {
 32e0c052-05be-11e3-a964-ee85c7ae2549,
 620654b6-05be-11e3-9fed-cbbc588bf177,
 d7f678fa-05bd-11e3-b212-922779171530,
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=0,safe_seq=0,node_index=node: {idx=0,range=[1,0],safe_seq=0} node: {idx=1,range=[1,0],safe_seq=0} node: {idx=2,range=[1,0],safe_seq=0} },
fifo_seq=105,
last_sent=0,
known={
 32e0c052-05be-11e3-a964-ee85c7ae2549,evs::node{operational=1,suspected=0,installed=1,fifo_seq=-1,}
 620654b6-05be-11e3-9fed-cbbc588bf177,evs::node{operational=1,suspected=0,installed=1,fifo_seq=17,}
 d7f678fa-05bd-11e3-b212-922779171530,evs::node{operational=1,suspected=0,installed=1,fifo_seq=273,}
 }
 }130815 15:21:40 [ERROR] WSREP: exception from gcomm, backend must be restarted:32e0c052-05be-11e3-a964-ee85c7ae2549 aborting due to conflicting prims: older overrides (FATAL)
  at gcomm/src/pc_proto.cpp:handle_state():888
130815 15:21:40 [Note] WSREP: Received self-leave message.
130815 15:21:40 [Note] WSREP: Flow-control interval: [0, 0]
130815 15:21:40 [Note] WSREP: Received SELF-LEAVE. Closing connection.
130815 15:21:40 [Note] WSREP: Shifting SYNCED -> CLOSED (TO: 0)
130815 15:21:40 [Note] WSREP: RECV thread exiting 0: Success
130815 15:21:40 [Note] WSREP: New cluster view: global state: 32e1936e-05be-11e3-9b1b-56bdddb5fdd3:0, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version 2
130815 15:21:40 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130815 15:21:40 [Note] WSREP: applier thread exiting (code:0)

However, the daemon stays in the Init state. When I try to shut it down, it just hangs:

130815 15:22:31 [Note] /usr/sbin/mysqld: Normal shutdown

130815 15:22:31 [Note] WSREP: Stop replication
130815 15:22:31 [Note] WSREP: Closing send monitor...
130815 15:22:31 [Note] WSREP: Closed send monitor.

a) Adding the galera bug component since this is related to Galera.

b) Is it possible to restart the backend without restarting mysql?

There is a way: set wsrep_provider to the same value, or to 'none'
and back.
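For reference, the toggle described here would look roughly like this. This is a sketch only: it assumes wsrep_provider is dynamic on this build, and the provider library path is an assumption that varies by install.

```sql
-- On the affected node: detach the Galera backend, then reattach it.
SET GLOBAL wsrep_provider = 'none';
SET GLOBAL wsrep_provider = '/usr/lib64/libgalera_smm.so';
```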

But currently it hangs, possibly due to https://bugs.launchpad.net/codership-mysql/+bug/1208493

I tried with the fix for that bug, but I can still reproduce this.

c) Even if restarting mysql is an option, it looks like the restart
itself still hangs.

If it is possible to reproduce this, can you get a backtrace (from a
core dump)? You just need to send signal 11 (kill -11) to mysqld while
it is hung.
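Grabbing that core can be sketched as below. The snippet uses a dummy `sleep` process standing in for the hung mysqld so it is runnable anywhere; on the real box you would use the mysqld pid instead, core dumps must be enabled (`ulimit -c unlimited`), and the gdb paths shown in the comment are assumptions.

```shell
# Stand-in for the hung mysqld; on the real box use: pid=$(pidof mysqld)
sleep 300 &
pid=$!
kill -11 "$pid"            # SIGSEGV (-11) forces a core dump if ulimit allows
status=0
wait "$pid" || status=$?   # 139 = 128 + 11 (terminated by SIGSEGV)
echo "exit status: $status"
# Then pull a backtrace from the core, e.g.:
#   gdb /usr/sbin/mysqld /path/to/core -ex 'thread apply all bt' -ex quit
```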

I tested after the recent codership fix merge (from lp:1208493), the shutdown works fine now.

However,

a) As stated in https://bugs.launchpad.net/codership-mysql/+bug/1208493/comments/9, changing the provider from the library to none and back to the library leaves the node in an unknown state, so you cannot restart the backend this way.

b) Is the need for the backend restart also considered a bug here? If so, it will need to be handled separately from the hung-node issue.

Alex Yurchenko (ayurchen) wrote :

Raghu,

"Node2 starts, but it has an older grastate. This seems to cause a 'prim' conflict and drops node1 into the Init state. Node2 fails, and node1 goes into this state"

- I bet this is not an accurate or exact description. According to the log, both nodes 1 and 2 were running and in the primary component before they suddenly saw each other (the grastate file has nothing to do with it). Please describe how exactly you start and stop the nodes. In particular, how is this related to lp:1208493?

@Alex,

I was merely testing the shutdown part of it, not the bug that is causing the backend to hang. I guess only the latter can be dealt with here (the shutdown part is being partly handled in lp:1208493, so it should be fine).

Krunal Bauskar (krunal-bauskar) wrote :

Shutdown is working fine, which was the main reason behind this bug. Will mark it as closed.

"I tested after the recent codership fix merge (from lp:1208493), the shutdown works fine now."

Changed in percona-xtradb-cluster:
status: New → Fix Committed
importance: Undecided → Medium

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PXC-1078
