Cluster stalls while distributing transaction
Affects | Status | Importance | Assigned to | Milestone |
---|---|---|---|---|
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC | Fix Committed | Medium | Unassigned | |
Bug Description
We have a 3-node cluster running 5.5.31 for Linux on x86_64 (Source distribution, wsrep_23.
And we have quite a simple sequence of SQL. After executing it (in fact, after the commit) on node A, nodes B and C stall within ~2 seconds and node A takes about 6 seconds to stall; then the whole cluster stops executing queries.
On each daemon we can see several "wsrep in pre-commit stage" processes (write queries) with the longest times, and many "statistics", "Sending data" and "Copying to tmp table" processes. It looks like these are pure SELECTs, no updates.
Needless to say, they all just sit in the processlist without any movement for at least 5 minutes. New connections hang as well.
In the previous release, 5.5.30, most of the queries hung in the "Query end" state; in the new version there is significant diversity in the state labels, but the server still hangs completely.
I attach the database dump and the exact SQL we execute (dumped from the program). Here is the output of "show status like '%wsrep%';":
+------
| Variable_name | Value |
+------
| wsrep_local_
| wsrep_protocol_
| wsrep_last_
| wsrep_replicated | 6543 |
| wsrep_replicate
| wsrep_received | 125479 |
| wsrep_received_
| wsrep_local_commits | 6542 |
| wsrep_local_
| wsrep_local_
| wsrep_local_replays | 0 |
| wsrep_local_
| wsrep_local_
| wsrep_local_
| wsrep_local_
| wsrep_flow_
| wsrep_flow_
| wsrep_flow_
| wsrep_cert_
| wsrep_apply_oooe | 0.203314 |
| wsrep_apply_oool | 0.003034 |
| wsrep_apply_window | 42.035215 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000102 |
| wsrep_commit_window | 36.303709 |
| wsrep_local_state | 4 |
| wsrep_local_
| wsrep_cert_
| wsrep_causal_reads | 0 |
| wsrep_incoming_
| wsrep_cluster_
| wsrep_cluster_size | 3 |
| wsrep_cluster_
| wsrep_cluster_
| wsrep_connected | ON |
| wsrep_local_index | 0 |
| wsrep_provider_name | Galera |
| wsrep_provider_
| wsrep_provider_
| wsrep_ready | ON |
+------
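For anyone comparing this output across nodes, here is a minimal sketch (not part of the original report) that parses the mysql client table format into a dictionary and checks a few of the values shown above. The function names `parse_status` and `looks_healthy` are hypothetical, and the sample is only an excerpt of the complete rows from the output; truncated rows without a value are skipped, not guessed.

```python
# Excerpt of the complete rows from the status output above (truncated rows kept as-is).
SAMPLE = """\
+------
| Variable_name | Value |
+------
| wsrep_local_
| wsrep_replicated | 6543 |
| wsrep_received | 125479 |
| wsrep_local_commits | 6542 |
| wsrep_local_state | 4 |
| wsrep_cluster_size | 3 |
| wsrep_connected | ON |
| wsrep_ready | ON |
+------
"""


def parse_status(text):
    """Parse mysql client table output into {variable_name: value}.

    Border lines, the header row, and rows truncated before the value
    column are skipped.
    """
    status = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # border line such as "+------"
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) != 2 or cells[0] == "Variable_name":
            continue  # truncated row or header row
        name, value = cells
        if value:
            status[name] = value
    return status


def looks_healthy(status, expected_size=3):
    """Hypothetical check: node reports Synced (state 4), ready, connected, full cluster."""
    return (
        status.get("wsrep_local_state") == "4"
        and status.get("wsrep_ready") == "ON"
        and status.get("wsrep_connected") == "ON"
        and status.get("wsrep_cluster_size") == str(expected_size)
    )


if __name__ == "__main__":
    status = parse_status(SAMPLE)
    print(status)
    print("status looks healthy:", looks_healthy(status))
```

On the values in this report the check passes: the node reports state 4 (Synced), ready and connected with all 3 members. These variables on their own therefore do not show the stall, and the processlist states described above remain the more useful symptom.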
P.S. The hardware is excellent, so there is no chance the issue depends on I/O or anything similar; comparable tasks usually complete instantly or close to it.
Sorry, it looks like I made the dump and the SQL log from different databases.
Here is the right dump with the matching killer SQL (double-checked this time).