Percona XtraDB Cluster - HA scalable solution for MySQL

Cluster crashes and cannot shut down normally under very high load

Reported by marsxu on 2012-11-19
This bug affects 2 people
Affects                   Status         Importance   Assigned to      Milestone
Galera                    Fix Released   Low          Teemu Ollakka    23.2.4
Percona XtraDB Cluster    Fix Released   Undecided    Unassigned       5.5.30-23.7.4

Bug Description

I've got a three-node XtraDB cluster.

Sysbench command:
########sysbench######
for ((i=1;i<20;i++,i++,i++)); do sysbench --test=oltp --mysql-host=10.10.59.51 --mysql-port=5003 --mysql-user=testu --mysql-password=testp --mysql-db=sbtest --mysql-table-engine=innodb --mysql-engine-trx=yes --oltp-table-size=400000000 --num-threads=${i}00 --max-time=0 --max-requests=0 run;date; done
####################

When --num-threads > 500, the MySQL error log sometimes prints:
#######error.log#######
121112 17:15:22 [Note] WSREP: (52d1b2b7-2c9f-11e2-0800-f27a6d19a130, 'tcp://10.10.59.230:5030') turning message relay requesting off
121112 17:15:22 [Note] WSREP: (52d1b2b7-2c9f-11e2-0800-f27a6d19a130, 'tcp://10.10.59.230:5030') cleaning up duplicate 0x7f74b801b0e0 after established 0x2b71870
121112 17:15:24 [Warning] WSREP: readjusting seq range 801 to 255
121112 17:15:24 [ERROR] WSREP: exception caused by message: evs::msg{version=0,type=1,user_type=255,order=0,seq=975216,seq_range=0,aru_seq=974414,flags=4,source=ffc955be-2c92-11e2-0800-658df0cf61e1,source_view_id=view_id(REG,52d1b2b7-2c9f-11e2-0800-f27a6d19a130,5),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=2474613,node_list=()
}
121112 17:15:24 [ERROR] WSREP: state after handling message: evs::proto(evs::proto(52d1b2b7-2c9f-11e2-0800-f27a6d19a130, OPERATIONAL, view_id(REG,52d1b2b7-2c9f-11e2-0800-f27a6d19a130,5)), OPERATIONAL) {
current_view=view(view_id(REG,52d1b2b7-2c9f-11e2-0800-f27a6d19a130,5) memb {
        52d1b2b7-2c9f-11e2-0800-f27a6d19a130,
        54808247-2c93-11e2-0800-d94fc5e26d6e,
        ffc955be-2c92-11e2-0800-658df0cf61e1,
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=974410,safe_seq=974410,node_index=node: {idx=0,range=[974671,974670],safe_seq=974410} node: {idx=1,range=[974415,974414],safe_seq=974410} node: {idx=2,range=[974411,975216],safe_seq=974414} ,msg_index= (0,974414),evs::msg{version=0,type=1,user_type=255,order=0,seq=974414,seq_range=0,aru_seq=974410,flags=0,source=52d1b2b7-2c9f-11e2-0800-f27a6d19a130,source_view_id=view_id(REG,52d1b2b7-2c9f-11e2-0800-f27a6d19a130,5),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=1948755,node_list=()
}
        (1,974414),evs::msg{version=0,type=1,user_type=255,order=0,se
121112 17:15:24 [ERROR] WSREP: exception from gcomm, backend must be restarted:range.get_hs() == last_msg_seq: evs::msg{version=0,type=1,user_type=255,order=0,seq=974415,seq_range=255,aru_seq=974410,flags=0,source=52d1b2b7-2c9f-11e2-0800-f27a6d19a130,source_view_id=view_id(REG,52d1b2b7-2c9f-11e2-0800-f27a6d19a130,5),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=1948757,node_list=()
} evs::input_map: {aru_seq=974410,safe_seq=974410,node_index=node: {idx=0,range=[974671,974670],safe_seq=974410} node: {idx=1,range=[974415,974414],safe_seq=974410} node: {idx=2,range=[974411,975216],safe_seq=974414} ,msg_index= (0,974414),evs::msg{version=0,type=1,user_type=255,order=0,seq=974414,seq_range=0,aru_seq=974410,flags=0,source=52d1b2b7-2c9f-11e2-0800-f27a6d19a130,source_view_id=view_id(REG,52d1b2b7-2c9f-11e2-0800-f27a6d19a130,5),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=1948755,node_list=()
}
        (1,974414),evs::msg{version=0,type=1,user_type=255,order=0,seq=974414,seq_range=0,aru_seq=974410,
121112 17:15:24 [Note] WSREP: Received self-leave message.
121112 17:15:24 [Note] WSREP: Flow-control interval: [0, 0]
121112 17:15:24 [Note] WSREP: Received SELF-LEAVE. Closing connection.
121112 17:15:24 [Note] WSREP: Shifting SYNCED -> CLOSED (TO: 1229737)
121112 17:15:24 [Note] WSREP: RECV thread exiting 0: Success
#######error.log#######

All cluster nodes show status like this:
mysql> show status like 'wsrep%';
+----------------------------+--------------------------------------+
| Variable_name | Value |
+----------------------------+--------------------------------------+
| wsrep_local_state_uuid | 23b36a90-299b-11e2-0800-0494f50ee1f9 |
| wsrep_protocol_version | 4 |
| wsrep_last_committed | 2110735 |
| wsrep_replicated | 238039 |
| wsrep_replicated_bytes | 329257886 |
| wsrep_received | 770 |
| wsrep_received_bytes | 7307 |
| wsrep_local_commits | 238039 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_avg | 0.000000 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_avg | 0.000000 |
| wsrep_flow_control_paused | 0.000000 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 1793.307719 |
| wsrep_apply_oooe | 0.000000 |
| wsrep_apply_oool | 0.000000 |
| wsrep_apply_window | 0.000000 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 0.000000 |
| wsrep_local_state | 0 |
| wsrep_local_state_comment | Initialized (0) |
| wsrep_cert_index_size | 8627 |
| wsrep_causal_reads | 0 |
| wsrep_cluster_conf_id | 18446744073709551615 |
| wsrep_cluster_size | 2 |
| wsrep_cluster_state_uuid | 23b36a90-299b-11e2-0800-0494f50ee1f9 |
| wsrep_cluster_status | non-Primary |
| wsrep_connected | ON |
| wsrep_local_index | 0 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <email address hidden> |
| wsrep_provider_version | 2.2(r115) |
| wsrep_ready | OFF |
+----------------------------+--------------------------------------+

I can only shut down the write node normally.
Shutdown log from a read node:
#######error.log#######
121112 19:42:13 [Note] /usr/local/Percona-XtraDB-Cluster-5.5.27-23.6.356.Linux.x86_64/bin/mysqld: Normal shutdown

121112 19:42:13 [Note] WSREP: Stop replication
121112 19:42:13 [Note] WSREP: Provider disconnect
121112 19:42:13 [Note] WSREP: Closing send monitor...
121112 19:42:13 [Note] WSREP: Closed send monitor.
121112 19:42:13 [Note] WSREP: closing connection 74
121112 19:42:13 [Note] WSREP: Before Lock_thread_count
121112 19:42:15 [Note] WSREP: waiting for client connections to close: 33 (hangs here for a long time with no further output ...)
#######error.log#######

marsxu (xjxyxgq) on 2012-11-19
affects: codership-mysql → percona-xtradb-cluster
Teemu Ollakka (teemu-ollakka) wrote :

Hi,

Could you post the value of your wsrep_provider_options variable?

Alex Yurchenko (ayurchen) wrote :

Might be related to lp:1040108

marsxu (xjxyxgq) wrote :

to Teemu Ollakka:
wsrep_provider_options = "gmcast.listen_addr=tcp://10.10.59.230:5030;ist.recv_addr = tcp://10.10.59.230:5031;evs.keepalive_period = PT3S; evs.inactive_check_period = PT10S; evs.suspect_timeout = PT30S; evs.inactive_timeout = PT1M; evs.consensus_timeout = PT1M; gcache.size = 100000M; evs.send_window=1024; evs.user_send_window=1024; evs.debug_log_mask=EVS_DEBUG"

Alex Yurchenko (ayurchen) wrote :

Ooooh, evs.user_send_window should be strictly less than evs.send_window (e.g. about half of it), and they are best left at their defaults.

Teemu Ollakka (teemu-ollakka) wrote :

It seems that such high evs.send_window/evs.user_send_window values easily trigger a bug in gcomm. Better to keep them at their defaults, or if you absolutely have to tune them, set evs.send_window to at most 64.
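
For concreteness, a minimal my.cnf sketch of the safer settings suggested above; the listen address below is a placeholder (not taken from this report), and only the two window options are the point of the example:

#######my.cnf#######
# Illustrative values only; the defaults are safest.
# evs.user_send_window must stay strictly below evs.send_window,
# and evs.send_window should not exceed 64 if it is tuned at all.
wsrep_provider_options = "gmcast.listen_addr=tcp://0.0.0.0:4567; evs.send_window=64; evs.user_send_window=32"
#######my.cnf#######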

Anyway, there is a bug that needs to be fixed. Thanks for reporting!

Changed in galera:
status: New → Confirmed
assignee: nobody → Teemu Ollakka (teemu-ollakka)
Changed in galera:
importance: Undecided → Low
Teemu Ollakka (teemu-ollakka) wrote :
Changed in galera:
status: Confirmed → Fix Committed
Changed in galera:
milestone: none → 23.2.4
Changed in galera:
status: Fix Committed → Fix Released
Dan Rogers (drogers-l) wrote :

The documentation at http://www.codership.com/wiki/doku.php?id=galera_parameters implies that you should set both these values to 512 for WAN replication, so this should probably also be updated.

http://www.slideshare.net/henrikingo/introduction-to-galera also gives the same values.

I had set these values based on this information and experienced a similar crash over the weekend.

Alex Yurchenko (ayurchen) wrote :

Dan, could you try with this release? It should fix it.

Dan Rogers (drogers-l) wrote :

We run Percona XtraDB Cluster, so their new release isn't out yet, but as soon as it is, I'll give it a try, thanks.

I also turned evs.send_window/evs.user_send_window down to 64/32, so we shouldn't hit the issue any more.

Alex Yurchenko (ayurchen) wrote :

Dan, no hurry. But just in case you didn't know, you don't have to wait for Percona release. Galera is a dlopenable plugin and as long as the API number (23 here) matches, you can use Codership's Galera package with Percona's XtraDB Cluster.
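
To illustrate (the path below is an assumption based on a typical install location, not something stated in this thread), switching to the Codership Galera package only means pointing wsrep_provider at its library in my.cnf:

#######my.cnf#######
# Assumed path; adjust to wherever your Galera package installs libgalera_smm.so.
wsrep_provider = /usr/lib64/galera/libgalera_smm.so
#######my.cnf#######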

Dan Rogers (drogers-l) wrote :

I didn't know that, thanks for the tip!

Changed in percona-xtradb-cluster:
milestone: none → 5.5.30-23.7.4
Changed in percona-xtradb-cluster:
status: New → Fix Released
Marco Tusa (marcotusa) wrote :

Alex, I have just installed Percona Server Ver 5.5.30-23.7.4-log.

Brand new cluster, still a single node (I was just testing the installation procedure), and I got the issue of wsrep sitting there waiting for connections to close.
What I have noticed is that I have 18 threads defined in the configuration:
wsrep_slave_threads=18
But it is still waiting for 19; if I set it to 20, it waits for 21, and so on.

It seems like a bug in the loop that checks the running threads ... any idea?

130605 18:03:25 [Note] WSREP: closing applier 4
130605 18:03:25 [Note] WSREP: applier thread exiting (code:5)
130605 18:03:25 [Note] WSREP: closing applier 2
130605 18:03:27 [Note] WSREP: waiting for client connections to close: 19
130605 18:03:36 [Note] WSREP: Before Lock_thread_count

description: updated
Alex Yurchenko (ayurchen) wrote :

Marco, this is something new. If you could get a stack dump ( gdb -ex "set pagination 0" -ex "thread apply all bt" --batch -p $(pidof mysqld) > stack.traces ), that would be great.
