terminate called after throwing an instance of 'std::out_of_range'

Bug #1232747 reported by jolan on 2013-09-29
This bug affects 1 person
Affects (Status, Importance, Assigned to, Milestone):

Galera (status tracked in 3.x)
  2.x: importance High, assigned to Teemu Ollakka
  3.x: importance High, assigned to Teemu Ollakka
Percona XtraDB Cluster (status tracked in 5.6)
  5.5: importance Undecided, Unassigned
  5.6: importance Undecided, Unassigned

Bug Description

Hi,

All three nodes of our Percona Cluster received exceptions in the last few hours which caused all nodes on the cluster to shut down.

dbw0 was running percona-xtradb-cluster-server-5.5 5.5.33-23.7.6-495.raring
dbc0 was running percona-xtradb-cluster-server-5.5 5.5.33-23.7.6-496.raring
dbe0 was running percona-xtradb-cluster-server-5.5 5.5.31-23.7.5-438.raring

We were planning to upgrade all nodes to build 496 of 5.5.33, but this happened before we could schedule a time to do that.

dbw0 hit the exception in the summary.
dbc0 and dbe0 hit the WSREP exceptions shown below.

dbw0:

terminate called after throwing an instance of 'std::out_of_range'
  what(): vector::_M_range_check
04:14:30 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona Server better by reporting any
bugs at http://bugs.percona.com/

key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=140
max_threads=153
thread_count=136
connection_count=136
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 343054 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x2e)[0x7dc9ae]
/usr/sbin/mysqld(handle_fatal_signal+0x491)[0x6beed1]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfbd0)[0x7f8f9fcc8bd0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7f8f9f2f0037]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7f8f9f2f3698]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x11d)[0x7f8f9d585e8d]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5ef76)[0x7f8f9d583f76]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5efa3)[0x7f8f9d583fa3]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x5f1de)[0x7f8f9d5841de]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZSt20__throw_out_of_rangePKc+0x5d)[0x7f8f9d5d69ad]
/usr/lib/libgalera_smm.so(+0xd90f6)[0x7f8f9db090f6]
/usr/lib/libgalera_smm.so(_ZN5gcomm3evs5Proto19update_im_safe_seqsERKNS0_15MessageNodeListE+0x121)[0x7f8f9db09b41]
/usr/lib/libgalera_smm.so(_ZN5gcomm3evs5Proto11handle_joinERKNS0_11JoinMessageESt17_Rb_tree_iteratorISt4pairIKNS_4UUIDENS0_4NodeEEE+0xb60)[0x7f8f9db199a0]
/usr/lib/libgalera_smm.so(_ZN5gcomm3evs5Proto10handle_msgERKNS0_7MessageERKNS_8DatagramE+0x387)[0x7f8f9db22407]
/usr/lib/libgalera_smm.so(_ZN5gcomm3evs5Proto9handle_upEPKvRKNS_8DatagramERKNS_11ProtoUpMetaE+0x27b)[0x7f8f9db22dcb]
/usr/lib/libgalera_smm.so(_ZN5gcomm8Protolay7send_upERKNS_8DatagramERKNS_11ProtoUpMetaE+0x36)[0x7f8f9db24746]
/usr/lib/libgalera_smm.so(_ZN5gcomm6GMCast9handle_upEPKvRKNS_8DatagramERKNS_11ProtoUpMetaE+0x22a)[0x7f8f9db37e4a]
/usr/lib/libgalera_smm.so(_ZN5gcomm10Protostack8dispatchEPKvRKNS_8DatagramERKNS_11ProtoUpMetaE+0x58)[0x7f8f9db5ddd8]
/usr/lib/libgalera_smm.so(_ZN5gcomm12AsioProtonet8dispatchERKPKvRKNS_8DatagramERKNS_11ProtoUpMetaE+0x4b)[0x7f8f9db85e8b]
/usr/lib/libgalera_smm.so(_ZN5gcomm13AsioTcpSocket12read_handlerERKN4asio10error_codeEm+0x7a6)[0x7f8f9db688e6]
/usr/lib/libgalera_smm.so(_ZN4asio6detail7read_opINS_19basic_stream_socketINS_2ip3tcpENS_21stream_socket_serviceIS4_EEEEN5boost5arrayINS_14mutable_bufferELm1EEENS8_3_bi6bind_tImNS8_4_mfi3mf2ImN5gcomm13AsioTcpSocketERKNS_10error_codeEmEENSC_5list3INSC_5valueINS8_10shared_ptrISH_EEEEPFNS8_3argILi1EEEvEPFNSR_ILi2EEEvEEEEENSD_IvNSF_IvSH_SK_mEESY_EEEclESK_mi+0x94)[0x7f8f9db771c4]
/usr/lib/libgalera_smm.so(_ZN4asio6detail23reactive_socket_recv_opINS0_17consuming_buffersINS_14mutable_bufferEN5boost5arrayIS3_Lm1EEEEENS0_7read_opINS_19basic_stream_socketINS_2ip3tcpENS_21stream_socket_serviceISB_EEEES6_NS4_3_bi6bind_tImNS4_4_mfi3mf2ImN5gcomm13AsioTcpSocketERKNS_10error_codeEmEENSF_5list3INSF_5valueINS4_10shared_ptrISK_EEEEPFNS4_3argILi1EEEvEPFNSU_ILi2EEEvEEEEENSG_IvNSI_IvSK_SN_mEES11_EEEEE11do_completeEPNS0_15task_io_serviceEPNS0_25task_io_service_operationESL_m+0xdd)[0x7f8f9db7750d]
/usr/lib/libgalera_smm.so(_ZN4asio6detail15task_io_service3runERNS_10error_codeE+0x407)[0x7f8f9db89517]
/usr/lib/libgalera_smm.so(_ZN5gcomm12AsioProtonet10event_loopERKN2gu8datetime6PeriodE+0x1b0)[0x7f8f9db872f0]
/usr/lib/libgalera_smm.so(_ZN9GCommConn3runEv+0x5e)[0x7f8f9db9fe5e]
/usr/lib/libgalera_smm.so(_ZN9GCommConn6run_fnEPv+0x9)[0x7f8f9dba3139]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7f8e)[0x7f8f9fcc0f8e]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f8f9f3b2e1d]
You may download the Percona Server operations manual by visiting
http://www.percona.com/software/percona-server/. You may find information
in the manual which will help you identify the cause of the crash.
130929 04:14:30 mysqld_safe Number of processes running now: 0
130929 04:14:30 mysqld_safe WSREP: not restarting wsrep node automatically
130929 04:14:30 mysqld_safe mysqld from pid file /var/lib/mysql/dbw0.connected.cc.pid ended

dbc0:

130929 4:14:30 [ERROR] WSREP: exception caused by message: evs::msg{version=0,type=4,user_type=255,order=1,seq=3,seq_range=-1,aru_seq=3,flags=4,source=447a086c-2456-11e3-b1d8-bfcbd6dd35cd,source_view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=10620186,node_list=( 447a086c-2456-11e3-b1d8-bfcbd6dd35cd,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),safe_seq=3,im_range=[4,3],}
        e10fe36e-2559-11e3-8f39-661947538c85,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,00000000-0000-0000-0000-000000000000,0),safe_seq=-1,im_range=[-1,-1],}
)
}
 state after handling message: evs::proto(evs::proto(e10fe36e-2559-11e3-8f39-661947538c85, GATHER, view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94)), GATHER) {
current_view=view(view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94) memb {
        239e4407-0ce7-11e3-9bee-3be82ca03086,
        447a086c-2456-11e3-b1d8-bfcbd6dd35cd,
        e10fe36e-2559-11e3-8f39-661947538c85,
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=-1,safe_seq=-1,node_index=node: {idx=0,range=[6,5],safe_seq=-1} node: {idx=1,range=[0,3],safe_seq=3} node: {idx=2,range=[6,5],safe_seq=-1} },
fifo_seq=7949526,
last_sent=5,
known={
        239e4407-0ce7-11e3-9bee-3be82ca03086,evs::node{operational=1,suspected=0,installed=0,fifo_seq=66278528,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=-1,seq_range=-1,aru_seq=-1,flags=4,source=239e4407-0ce7-11e3-9bee-3be82ca03086,source_view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=66278527,node_list=( 239e4407-0ce7-11e3-9bee-3be82ca03086,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),safe_seq=-1,im_range=[6,5],}
        447a086c-2456-11e3-b1d8-bfcbd6dd35cd,node: {operational=1,suspected=1,leave_seq=-1,view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),safe_seq=-1,im_range=[0,-1],}
        e10fe36e-2559-11e3-8f39-661947538c85,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),safe_seq=-1,im_range=[6,5],}
)
},
}
        447a086c-2456-11e3-b1d8-bfcbd6dd35cd,evs::node{operational=1,suspected=0,installed=0,fifo_seq=10620186,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=3,seq_range=-1,aru_seq=3,flags=4,source=447a086c-2456-11e3-b1d8-bfcbd6dd35cd,source_view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=10620186,node_list=( 447a086c-2456-11e3-b1d8-bfcbd6dd35cd,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),safe_seq=3,im_range=[4,3],}
        e10fe36e-2559-11e3-8f39-661947538c85,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,00000000-0000-0000-0000-000000000000,0),safe_seq=-1,im_range=[-1,-1],}
)
},
}
        e10fe36e-2559-11e3-8f39-661947538c85,evs::node{operational=1,suspected=0,installed=0,fifo_seq=-1,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=-1,seq_range=-1,aru_seq=-1,flags=0,source=e10fe36e-2559-11e3-8f39-661947538c85,source_view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=7949526,node_list=( 239e4407-0ce7-11e3-9bee-3be82ca03086,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),safe_seq=-1,im_range=[6,5],}
        447a086c-2456-11e3-b1d8-bfcbd6dd35cd,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),safe_seq=3,im_range=[0,3],}
        e10fe36e-2559-11e3-8f39-661947538c85,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),safe_seq=-1,im_range=[6,5],}
)
},
}
 }
 }130929 4:14:30 [ERROR] WSREP: exception from gcomm, backend must be restarted:nlself_i != same_view.end(): (FATAL)
         at gcomm/src/evs_proto.cpp:handle_join():3530
130929 4:14:30 [Note] WSREP: Received self-leave message.
130929 4:14:30 [Note] WSREP: Flow-control interval: [0, 0]
130929 4:14:30 [Note] WSREP: Received SELF-LEAVE. Closing connection.
130929 4:14:30 [Note] WSREP: Shifting SYNCED -> CLOSED (TO: 27410502)
130929 4:14:30 [Note] WSREP: RECV thread exiting 0: Success
130929 4:14:30 [Note] WSREP: New cluster view: global state: 73a23f83-0c81-11e3-9fc4-a2a855d3a912:27410502, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version 2
130929 4:14:30 [Note] WSREP: Setting wsrep_ready to 0
130929 4:14:30 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130929 4:14:30 [Note] WSREP: applier thread exiting (code:0)
130929 4:14:30 [Note] WSREP: closing applier 6

dbe0:

130928 23:14:30 [Note] WSREP: evs::proto(239e4407-0ce7-11e3-9bee-3be82ca03086, GATHER, view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94)) suspecting node: 447a086c-2456-11e3-b1d8-bfcbd6dd35cd
130928 23:14:30 [ERROR] WSREP: exception caused by message: evs::msg{version=0,type=4,user_type=255,order=1,seq=3,seq_range=-1,aru_seq=3,flags=4,source=447a086c-2456-11e3-b1d8-bfcbd6dd35cd,source_view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=10620186,node_list=( 447a086c-2456-11e3-b1d8-bfcbd6dd35cd,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),safe_seq=3,im_range=[4,3],}
        e10fe36e-2559-11e3-8f39-661947538c85,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,00000000-0000-0000-0000-000000000000,0),safe_seq=-1,im_range=[-1,-1],}
)
}
 state after handling message: evs::proto(evs::proto(239e4407-0ce7-11e3-9bee-3be82ca03086, GATHER, view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94)), GATHER) {
current_view=view(view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94) memb {
        239e4407-0ce7-11e3-9bee-3be82ca03086,
        447a086c-2456-11e3-b1d8-bfcbd6dd35cd,
        e10fe36e-2559-11e3-8f39-661947538c85,
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=-1,safe_seq=-1,node_index=node: {idx=0,range=[6,5],safe_seq=-1} node: {idx=1,range=[0,3],safe_seq=3} node: {idx=2,range=[6,5],safe_seq=-1} },
fifo_seq=66278530,
last_sent=5,
known={
        239e4407-0ce7-11e3-9bee-3be82ca03086,evs::node{operational=1,suspected=0,installed=0,fifo_seq=-1,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=-1,seq_range=-1,aru_seq=-1,flags=0,source=239e4407-0ce7-11e3-9bee-3be82ca03086,source_view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=66278530,node_list=( 239e4407-0ce7-11e3-9bee-3be82ca03086,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),safe_seq=-1,im_range=[6,5],}
        447a086c-2456-11e3-b1d8-bfcbd6dd35cd,node: {operational=1,suspected=1,leave_seq=-1,view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),safe_seq=3,im_range=[0,3],}
        e10fe36e-2559-11e3-8f39-661947538c85,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),safe_seq=-1,im_range=[6,5],}
)
},
}
        447a086c-2456-11e3-b1d8-bfcbd6dd35cd,evs::node{operational=1,suspected=1,installed=0,fifo_seq=10620186,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=3,seq_range=-1,aru_seq=3,flags=4,source=447a086c-2456-11e3-b1d8-bfcbd6dd35cd,source_view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=10620186,node_list=( 447a086c-2456-11e3-b1d8-bfcbd6dd35cd,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),safe_seq=3,im_range=[4,3],}
        e10fe36e-2559-11e3-8f39-661947538c85,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,00000000-0000-0000-0000-000000000000,0),safe_seq=-1,im_range=[-1,-1],}
)
},
}
        e10fe36e-2559-11e3-8f39-661947538c85,evs::node{operational=1,suspected=0,installed=0,fifo_seq=7949526,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=-1,seq_range=-1,aru_seq=-1,flags=4,source=e10fe36e-2559-11e3-8f39-661947538c85,source_view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=7949526,node_list=( 239e4407-0ce7-11e3-9bee-3be82ca03086,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),safe_seq=-1,im_range=[6,5],}
        447a086c-2456-11e3-b1d8-bfcbd6dd35cd,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),safe_seq=3,im_range=[0,3],}
        e10fe36e-2559-11e3-8f39-661947538c85,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,239e4407-0ce7-11e3-9bee-3be82ca03086,94),safe_seq=-1,im_range=[6,5],}
)
},
}
 }
 }130928 23:14:30 [ERROR] WSREP: exception from gcomm, backend must be restarted:nlself_i != same_view.end(): (FATAL)
         at gcomm/src/evs_proto.cpp:handle_join():3530
130928 23:14:30 [Note] WSREP: Received self-leave message.
130928 23:14:30 [Note] WSREP: Flow-control interval: [0, 0]
130928 23:14:30 [Note] WSREP: Received SELF-LEAVE. Closing connection.
130928 23:14:30 [Note] WSREP: Shifting SYNCED -> CLOSED (TO: 27410502)
130928 23:14:30 [Note] WSREP: RECV thread exiting 0: Success
130928 23:14:30 [Note] WSREP: New cluster view: global state: 73a23f83-0c81-11e3-9fc4-a2a855d3a912:27410502, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version 2
130928 23:14:30 [Note] WSREP: Setting wsrep_ready to 0
130928 23:14:30 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130928 23:14:30 [Note] WSREP: applier thread exiting (code:0)
130928 23:14:30 [Note] WSREP: closing applier 10

Related branches

lp:galera
David Bennett: Pending requested 2014-07-25
Changed in galera:
assignee: nobody → Teemu Ollakka (teemu-ollakka)
Changed in percona-xtradb-cluster:
milestone: none → 5.5.34-23.7.6
Teemu Ollakka (teemu-ollakka) wrote :

Hi Jolan,

Could you provide a bit more context about what happened just before the crashes? How long had the nodes been running, had any of them been restarted recently, and were there any signs of network issues?

If there was any activity in the error logs just before the crashes (within a minute or so), it would be interesting to see that too.

jolan (jolan) wrote :

I attached the logs. There definitely was some sort of network event that preceded the crash and exceptions.

Also, all 3 nodes quit at the same time. dbe0's timezone was set to US Central instead of UTC so the hours don't match up but the minutes/seconds do.
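
The timestamp mismatch can be checked with a small sketch. It assumes that "US Central" on that date means Central Daylight Time (UTC-5); the helper name `mysql_log_time_to_utc` is made up for illustration, and the two timestamps are taken from the dbe0 and dbc0 excerpts in the bug description.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "US Central" in late September is Central Daylight Time (UTC-5).
CDT = timezone(timedelta(hours=-5), "CDT")


def mysql_log_time_to_utc(stamp: str, tz: timezone) -> datetime:
    """Parse a 'YYMMDD H:MM:SS' error-log timestamp recorded in tz; return it in UTC."""
    local = datetime.strptime(stamp, "%y%m%d %H:%M:%S").replace(tzinfo=tz)
    return local.astimezone(timezone.utc)


# Timestamps of the gcomm exception on dbe0 (US Central) and dbc0 (UTC).
dbe0 = mysql_log_time_to_utc("130928 23:14:30", CDT)
dbc0 = mysql_log_time_to_utc("130929 4:14:30", timezone.utc)

print(dbe0.isoformat(), dbe0 == dbc0)
```

Under that assumption both excerpts resolve to the same instant, which is consistent with all three nodes going down together.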

Our cluster consists of 3 WAN nodes hosted in separate data centers.

dbw0 - (Dallas, TX Linode) - does reads/writes, 20ms latency to dbc0, 40ms latency to dbe0
dbc0 - (Atlanta, GA Linode) - only does mysqldump backups, 20ms latency to both dbe0/dbw0
dbe0 - (Newark, NJ Linode) - does reads/writes, 20ms latency to dbc0, 40ms latency to dbw0

Cluster was created 5 weeks ago.

dbe0 had been running percona-xtradb-cluster-server-5.5 5.5.31-23.7.5-438.raring for the whole 5 weeks.
dbc0 had been running for 4.5 days (was upgraded and restarted)
dbw0 had been running for 5.5 days (was upgraded and restarted)

We have had network hiccups before but they usually result in one node disappearing for a short time and then re-joining without incident.

There are a couple of "[Warning] WSREP: Quorum: No node with complete state:" warnings in the logs I attached which we haven't seen before.

Changed in galera:
status: New → Confirmed
Teemu Ollakka (teemu-ollakka) wrote :

Thanks for the logs; they revealed the reason for the crash.

Apparently, if the network partitions at a certain point during group negotiation, one of the partitioned components may form a group with an invalid group id, which in turn causes the group ids of the partitioned components to be identical. The crash happens when network connectivity returns and the group tries to remerge.
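
As a rough illustration only (this is not Galera's code): the assertion text `nlself_i != same_view.end()` suggests that handle_join() looks for the receiving node's own entry among the node-list entries that carry its current view id. The sketch below assumes that reading. `ViewId` and `self_in_same_view` are hypothetical names, and the values are copied from the join message in the dbc0 log in the bug description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViewId:
    uuid: str
    seq: int


def self_in_same_view(self_uuid, current_view, node_list):
    """Return True if self_uuid is among the node_list entries that carry current_view."""
    same_view = {uuid for uuid, view in node_list.items() if view == current_view}
    return self_uuid in same_view


# View id shared by all three nodes in the logs: view_id(REG,239e4407-...,94).
CURRENT_VIEW = ViewId("239e4407-0ce7-11e3-9bee-3be82ca03086", 94)

# node_list of the join message sent by 447a086c-... (the message that
# triggered the exception). The receiver's entry carries an all-zero view id.
JOIN_NODE_LIST = {
    "447a086c-2456-11e3-b1d8-bfcbd6dd35cd": CURRENT_VIEW,
    "e10fe36e-2559-11e3-8f39-661947538c85": ViewId(
        "00000000-0000-0000-0000-000000000000", 0
    ),
}

sender_ok = self_in_same_view(
    "447a086c-2456-11e3-b1d8-bfcbd6dd35cd", CURRENT_VIEW, JOIN_NODE_LIST
)
receiver_ok = self_in_same_view(
    "e10fe36e-2559-11e3-8f39-661947538c85", CURRENT_VIEW, JOIN_NODE_LIST
)

print(sender_ok, receiver_ok)
```

In this model the receiver (e10fe36e-..., the node whose state dbc0 logged) is not found under its own view id, which is the condition the fatal check reports.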

Changed in percona-xtradb-cluster:
milestone: 5.5.34-23.7.6 → future-5.5
tags: added: i41204 i41255