Cluster crashes and can't shut down normally under very high load

Bug #1080539 reported by marsxu
This bug affects 2 people
Affects                   Status        Importance  Assigned to     Milestone
Galera                    Fix Released  Low         Teemu Ollakka
Percona XtraDB Cluster    Fix Released  Undecided   Unassigned
  (moved to https://jira.percona.com/projects/PXC)

Bug Description

I've got a three-node XtraDB cluster.

Sysbench command:
########sysbench######
for ((i=1;i<20;i++,i++,i++)); do sysbench --test=oltp --mysql-host=10.10.59.51 --mysql-port=5003 --mysql-user=testu --mysql-password=testp --mysql-db=sbtest --mysql-table-engine=innodb --mysql-engine-trx=yes --oltp-table-size=400000000 --num-threads=${i}00 --max-time=0 --max-requests=0 run;date; done
####################

When --num-threads > 500, the MySQL error log sometimes prints:
#######error.log#######
121112 17:15:22 [Note] WSREP: (52d1b2b7-2c9f-11e2-0800-f27a6d19a130, 'tcp://10.10.59.230:5030') turning message relay requesting off
121112 17:15:22 [Note] WSREP: (52d1b2b7-2c9f-11e2-0800-f27a6d19a130, 'tcp://10.10.59.230:5030') cleaning up duplicate 0x7f74b801b0e0 after established 0x2b71870
121112 17:15:24 [Warning] WSREP: readjusting seq range 801 to 255
121112 17:15:24 [ERROR] WSREP: exception caused by message: evs::msg{version=0,type=1,user_type=255,order=0,seq=975216,seq_range=0,aru_seq=974414,flags=4,source=ffc955be-2c92-11e2-0800-658df0cf61e1,source_view_id=view_id(REG,52d1b2b7-2c9f-11e2-0800-f27a6d19a130,5),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=2474613,node_list=()
}
121112 17:15:24 [ERROR] WSREP: state after handling message: evs::proto(evs::proto(52d1b2b7-2c9f-11e2-0800-f27a6d19a130, OPERATIONAL, view_id(REG,52d1b2b7-2c9f-11e2-0800-f27a6d19a130,5)), OPERATIONAL) {
current_view=view(view_id(REG,52d1b2b7-2c9f-11e2-0800-f27a6d19a130,5) memb {
        52d1b2b7-2c9f-11e2-0800-f27a6d19a130,
        54808247-2c93-11e2-0800-d94fc5e26d6e,
        ffc955be-2c92-11e2-0800-658df0cf61e1,
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=974410,safe_seq=974410,node_index=node: {idx=0,range=[974671,974670],safe_seq=974410} node: {idx=1,range=[974415,974414],safe_seq=974410} node: {idx=2,range=[974411,975216],safe_seq=974414} ,msg_index= (0,974414),evs::msg{version=0,type=1,user_type=255,order=0,seq=974414,seq_range=0,aru_seq=974410,flags=0,source=52d1b2b7-2c9f-11e2-0800-f27a6d19a130,source_view_id=view_id(REG,52d1b2b7-2c9f-11e2-0800-f27a6d19a130,5),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=1948755,node_list=()
}
        (1,974414),evs::msg{version=0,type=1,user_type=255,order=0,se
121112 17:15:24 [ERROR] WSREP: exception from gcomm, backend must be restarted:range.get_hs() == last_msg_seq: evs::msg{version=0,type=1,user_type=255,order=0,seq=974415,seq_range=255,aru_seq=974410,flags=0,source=52d1b2b7-2c9f-11e2-0800-f27a6d19a130,source_view_id=view_id(REG,52d1b2b7-2c9f-11e2-0800-f27a6d19a130,5),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=1948757,node_list=()
} evs::input_map: {aru_seq=974410,safe_seq=974410,node_index=node: {idx=0,range=[974671,974670],safe_seq=974410} node: {idx=1,range=[974415,974414],safe_seq=974410} node: {idx=2,range=[974411,975216],safe_seq=974414} ,msg_index= (0,974414),evs::msg{version=0,type=1,user_type=255,order=0,seq=974414,seq_range=0,aru_seq=974410,flags=0,source=52d1b2b7-2c9f-11e2-0800-f27a6d19a130,source_view_id=view_id(REG,52d1b2b7-2c9f-11e2-0800-f27a6d19a130,5),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=1948755,node_list=()
}
        (1,974414),evs::msg{version=0,type=1,user_type=255,order=0,seq=974414,seq_range=0,aru_seq=974410,
121112 17:15:24 [Note] WSREP: Received self-leave message.
121112 17:15:24 [Note] WSREP: Flow-control interval: [0, 0]
121112 17:15:24 [Note] WSREP: Received SELF-LEAVE. Closing connection.
121112 17:15:24 [Note] WSREP: Shifting SYNCED -> CLOSED (TO: 1229737)
121112 17:15:24 [Note] WSREP: RECV thread exiting 0: Success
#######error.log#######

All cluster nodes show status like this:
mysql> show status like 'wsrep%';
+----------------------------+--------------------------------------+
| Variable_name | Value |
+----------------------------+--------------------------------------+
| wsrep_local_state_uuid | 23b36a90-299b-11e2-0800-0494f50ee1f9 |
| wsrep_protocol_version | 4 |
| wsrep_last_committed | 2110735 |
| wsrep_replicated | 238039 |
| wsrep_replicated_bytes | 329257886 |
| wsrep_received | 770 |
| wsrep_received_bytes | 7307 |
| wsrep_local_commits | 238039 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_avg | 0.000000 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_avg | 0.000000 |
| wsrep_flow_control_paused | 0.000000 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 1793.307719 |
| wsrep_apply_oooe | 0.000000 |
| wsrep_apply_oool | 0.000000 |
| wsrep_apply_window | 0.000000 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 0.000000 |
| wsrep_local_state | 0 |
| wsrep_local_state_comment | Initialized (0) |
| wsrep_cert_index_size | 8627 |
| wsrep_causal_reads | 0 |
| wsrep_cluster_conf_id | 18446744073709551615 |
| wsrep_cluster_size | 2 |
| wsrep_cluster_state_uuid | 23b36a90-299b-11e2-0800-0494f50ee1f9 |
| wsrep_cluster_status | non-Primary |
| wsrep_connected | ON |
| wsrep_local_index | 0 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <email address hidden> |
| wsrep_provider_version | 2.2(r115) |
| wsrep_ready | OFF |
+----------------------------+--------------------------------------+

I can only shut down the write node normally.
Read node shutdown log:
#######error.log#######
121112 19:42:13 [Note] /usr/local/Percona-XtraDB-Cluster-5.5.27-23.6.356.Linux.x86_64/bin/mysqld: Normal shutdown

121112 19:42:13 [Note] WSREP: Stop replication
121112 19:42:13 [Note] WSREP: Provider disconnect
121112 19:42:13 [Note] WSREP: Closing send monitor...
121112 19:42:13 [Note] WSREP: Closed send monitor.
121112 19:42:13 [Note] WSREP: closing connection 74
121112 19:42:13 [Note] WSREP: Before Lock_thread_count
121112 19:42:15 [Note] WSREP: waiting for client connections to close: 33 (hangs here for a long time with no further output ...)
#######error.log#######

marsxu (xjxyxgq)
affects: codership-mysql → percona-xtradb-cluster
Teemu Ollakka (teemu-ollakka) wrote :

Hi,

Could you post the value of your wsrep_provider_options variable?

Alex Yurchenko (ayurchen) wrote :

Might be related to lp:1040108

marsxu (xjxyxgq) wrote :

to Teemu Ollakka:
wsrep_provider_options = "gmcast.listen_addr=tcp://10.10.59.230:5030;ist.recv_addr = tcp://10.10.59.230:5031;evs.keepalive_period = PT3S; evs.inactive_check_period = PT10S; evs.suspect_timeout = PT30S; evs.inactive_timeout = PT1M; evs.consensus_timeout = PT1M; gcache.size = 100000M; evs.send_window=1024; evs.user_send_window=1024; evs.debug_log_mask=EVS_DEBUG"

Alex Yurchenko (ayurchen) wrote :

Ooooh, evs.user_send_window should be strictly less than evs.send_window (roughly half of it), and both are better left at their defaults.

Teemu Ollakka (teemu-ollakka) wrote :

It seems that such high evs.send_window/evs.user_send_window values easily trigger a bug in gcomm. Better to keep them at their defaults, or if you absolutely have to tune them, set evs.send_window to at most 64 (see the configuration sketch below).

Anyway, there is a bug that needs to be fixed. Thanks for reporting!
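
As a rough illustration (not from the original report), a my.cnf fragment along these lines would satisfy that constraint; the 64/32 values are an assumption based on the "at most 64" guidance above, and omitting both options entirely falls back to the library defaults:

#######my.cnf sketch#######
[mysqld]
# Either omit these two options entirely to use the Galera defaults,
# or keep evs.user_send_window strictly below evs.send_window (here 64/32).
wsrep_provider_options = "gmcast.listen_addr=tcp://10.10.59.230:5030; evs.send_window=64; evs.user_send_window=32"
#######my.cnf sketch#######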

Changed in galera:
status: New → Confirmed
assignee: nobody → Teemu Ollakka (teemu-ollakka)
Changed in galera:
importance: Undecided → Low
Teemu Ollakka (teemu-ollakka) wrote :
Changed in galera:
status: Confirmed → Fix Committed
Changed in galera:
milestone: none → 23.2.4
Changed in galera:
status: Fix Committed → Fix Released
Dan Rogers (drogers-l) wrote :

The documentation at http://www.codership.com/wiki/doku.php?id=galera_parameters implies that you should set both of these values to 512 for WAN replication, so that documentation should probably also be updated.

http://www.slideshare.net/henrikingo/introduction-to-galera also gives the same values.

I had set these values based on this information and experienced a similar crash over the weekend.

Alex Yurchenko (ayurchen) wrote :

Dan, could you try with this release? It should fix it.

Dan Rogers (drogers-l) wrote :

We run Percona XtraDB Cluster, so their new release isn't out yet, but as soon as it is, I'll give it a try, thanks.

I also turned evs.send_window/evs.user_send_window down to 64/32 so that we shouldn't have the issue any more.

Alex Yurchenko (ayurchen) wrote :

Dan, no hurry. But just in case you didn't know, you don't have to wait for Percona release. Galera is a dlopenable plugin and as long as the API number (23 here) matches, you can use Codership's Galera package with Percona's XtraDB Cluster.
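
For reference, a minimal sketch of what Alex describes, assuming a typical install path for the Codership package (the exact location of libgalera_smm.so depends on the distribution and is an assumption here):

#######my.cnf sketch#######
[mysqld]
# Point the provider at the Codership-packaged Galera library instead of the one
# bundled with Percona XtraDB Cluster; the wsrep API number (23) must match.
wsrep_provider = /usr/lib64/galera/libgalera_smm.so
#######my.cnf sketch#######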

Dan Rogers (drogers-l) wrote :

I didn't know that, thanks for the tip!

Changed in percona-xtradb-cluster:
milestone: none → 5.5.30-23.7.4
Changed in percona-xtradb-cluster:
status: New → Fix Released
Marco Tusa (marcotusa) wrote :

Alex, I have just installed Percona Server Ver 5.5.30-23.7.4-log.

Brand new cluster, still a single node; I was just testing the installation procedure, and I got the issue of wsrep staying there waiting for connections to close.
What I have noticed is that I have 18 threads defined in the configuration:
wsrep_slave_threads=18
But it is still waiting for 19; if I set it to 20, it waits for 21, and so on.
It seems like a bug in the loop checking the running threads ... any idea?

130605 18:03:25 [Note] WSREP: closing applier 4
130605 18:03:25 [Note] WSREP: applier thread exiting (code:5)
130605 18:03:25 [Note] WSREP: closing applier 2
130605 18:03:27 [Note] WSREP: waiting for client connections to close: 19
130605 18:03:36 [Note] WSREP: Before Lock_thread_count

description: updated
Alex Yurchenko (ayurchen) wrote :

Marco, this is something new. If you could get a stack dump ( gdb -ex "set pagination 0" -ex "thread apply all bt" --batch -p $(pidof mysqld) > stack.traces ), that would be great.

Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PXC-1256
