Cluster stalls while distributing transaction

Bug #1197771 reported by Dmitry Gribov on 2013-07-04
This bug affects 5 people
Affects: Percona XtraDB Cluster
Importance: Medium
Assigned to: Unassigned

Bug Description

We have a 3-node cluster, 5.5.31 for Linux on x86_64 (Source distribution, wsrep_23.7.5.rXXXX), just updated.

And we have quite a simple sequence of SQL. After executing it (in fact, after the commit) on node A, nodes B and C stall in ~2 seconds; node A takes about 6 seconds to stall, and then the whole cluster stops executing queries.
On each daemon we can see several "wsrep in pre-commit stage" processes (writing queries) with the longest times, and many "statistics", "Sending data" and "Copying to tmp table" processes. These look like pure selects, no updates.

Needless to say, they all just sit in the processlist without any movement for at least 5 minutes. New connections hang as well.

In the previous release, 5.5.30, most of the queries hung in the "Query end" state; in the new version there is significant diversity in the states, but the server still hangs just as thoroughly.

I attach a database dump and the exact SQL we execute (dumped from the program). Here is the "show status like '%wsrep%';" output:
+----------------------------+-------------------------------------------------------------+
| Variable_name | Value |
+----------------------------+-------------------------------------------------------------+
| wsrep_local_state_uuid | e25852ac-d618-11e2-0800-6c53c8061134 |
| wsrep_protocol_version | 4 |
| wsrep_last_committed | 747713579 |
| wsrep_replicated | 6543 |
| wsrep_replicated_bytes | 3720579 |
| wsrep_received | 125479 |
| wsrep_received_bytes | 137181653 |
| wsrep_local_commits | 6542 |
| wsrep_local_cert_failures | 0 |
| wsrep_local_bf_aborts | 6 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_avg | 0.223533 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_avg | 440.600028 |
| wsrep_flow_control_paused | 0.000000 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 105.620690 |
| wsrep_apply_oooe | 0.203314 |
| wsrep_apply_oool | 0.003034 |
| wsrep_apply_window | 42.035215 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000102 |
| wsrep_commit_window | 36.303709 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_cert_index_size | 79 |
| wsrep_causal_reads | 0 |
| wsrep_incoming_addresses | 192.168.100.65:8661,192.168.100.66:8661,192.168.100.67:8661 |
| wsrep_cluster_conf_id | 11 |
| wsrep_cluster_size | 3 |
| wsrep_cluster_state_uuid | e25852ac-d618-11e2-0800-6c53c8061134 |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_index | 0 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <email address hidden> |
| wsrep_provider_version | 23.2.2rc2(r136) |
| wsrep_ready | ON |
+----------------------------+-------------------------------------------------------------+
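One detail worth flagging in the status above: wsrep_local_recv_queue_avg is 440.6, i.e. received write-sets were on average queuing far faster than the appliers drained them, which is consistent with the slave nodes stalling first. A minimal sketch of such a check (not from the report; the thresholds 1.0 and 0.1 are arbitrary examples, and in practice you would pipe `mysql -N -e "SHOW GLOBAL STATUS LIKE 'wsrep%'"` into it instead of the inline sample data):

```shell
# Flag wsrep status values that usually indicate applier lag.
# The sample data below reproduces two lines from the report's output.
check_wsrep() {
  awk '
    $1 == "wsrep_local_recv_queue_avg" && $2 + 0 > 1.0 {
      print "WARN: recv queue avg " $2 " -> appliers falling behind"
    }
    $1 == "wsrep_flow_control_paused" && $2 + 0 > 0.1 {
      print "WARN: flow control active " $2 " of the time"
    }
  '
}

printf '%s\t%s\n' \
  wsrep_local_recv_queue_avg 440.600028 \
  wsrep_flow_control_paused 0.000000 | check_wsrep
# prints: WARN: recv queue avg 440.600028 -> appliers falling behind
```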

PS: The hardware is more than adequate; there is no chance the issue depends on IO or anything similar. Comparable tasks usually complete instantly or nearly so.

Dmitry Gribov (grib-d) wrote :

Sorry, it looks like I made the dump and the SQL log from different databases.
Here is the right dump with the matching killer SQL (double-checked this time).

Alex Yurchenko (ayurchen) wrote :

Dmitry,
1) You're saying connections hang for at least 5 minutes. Does that mean they "unhang" eventually, or do you just not wait that long?
2) Have you noticed that you're using a terribly old Galera version, 2.2rc2? The latest one (2.6) fixes one case of software deadlock, among other things. So I'd suggest you upgrade that too and report whether you're still seeing this.

Dmitry Gribov (grib-d) wrote :

1. The first time we hit the issue, connections hung for about a minute, then the cluster unhung, yes. After a certain point (the number of inserts grows from day to day) it lasted 5 minutes or longer several times, and we had to restart nodes B and/or C; this releases the lock somehow. The problem happens on the production server, so unfortunately we can't wait an hour to see whether the hang would end on its own. I guess that after some time the server would recover.
2. Every Galera upgrade we have tested smashed the system to dust one way or another, so we stopped trying at some point. We could try playing with it if you really believe it may be involved.

Dmitry Gribov (grib-d) wrote :

Tried Galera 2.6 in the test environment; it seems to work. Tomorrow we will try it with the subject issue.

Dmitry Gribov (grib-d) wrote :

Updated; nothing changed: the cluster still hangs.
Assuming MySQL finishes the update transactions during a normal shutdown, I believe the "hang" would last ~6 minutes and then end by itself. And when I say "the number of inserts grows from day to day", I mean it is NOT growing fast: say, when it worked fine the volume was 100%; with a 1-minute hang it was 300%, and with the 6-minute hang it is about 330% of the base. There is some exponential dependency.

After the restart, both nodes (B, C) rejoined the cluster with a normal IST.
In the log I see this:

130705 17:02:03 [Note] /usr/local/mysql/bin/mysqld: Normal shutdown

130705 17:02:03 [Note] WSREP: Stop replication
130705 17:02:03 [Note] WSREP: Closing send monitor...
130705 17:02:03 [Note] WSREP: Closed send monitor.
130705 17:02:03 [Note] WSREP: gcomm: terminating thread
130705 17:02:03 [Note] WSREP: gcomm: joining thread
130705 17:02:03 [Note] WSREP: gcomm: closing backend
130705 17:02:03 [Note] WSREP: view(view_id(NON_PRIM,1f8117dc-e56e-11e2-b18b-be707def1a41,17) memb {
        46f7b72e-e56e-11e2-93b4-eb7ea19e049f,
} joined {
} left {
} partitioned {
        1f8117dc-e56e-11e2-b18b-be707def1a41,
        77acdf64-e56e-11e2-af2b-4771634ab74b,
})
130705 17:02:03 [Note] WSREP: view((empty))
130705 17:02:03 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
130705 17:02:03 [Note] WSREP: gcomm: closed
130705 17:02:03 [Note] WSREP: Flow-control interval: [16, 16]
130705 17:02:03 [Note] WSREP: Received NON-PRIMARY.
130705 17:02:03 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 753333185)
130705 17:02:03 [Note] WSREP: Received self-leave message.
130705 17:02:03 [Note] WSREP: Flow-control interval: [0, 0]
130705 17:02:03 [Note] WSREP: Received SELF-LEAVE. Closing connection.
130705 17:02:03 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 753333185)
130705 17:02:03 [Note] WSREP: RECV thread exiting 0: Success
130705 17:02:03 [Note] WSREP: recv_thread() joined.
130705 17:02:03 [Note] WSREP: Closing replication queue.
130705 17:02:03 [Note] WSREP: Closing slave action queue.
130705 17:02:03 [ERROR] /usr/local/mysql/bin/mysqld: Sort aborted: Server shutdown in progress
130705 17:02:05 [Note] WSREP: killing local connection: 6953
130705 17:02:05 [Note] WSREP: killing local connection: 6952
--//--
130705 17:02:05 [Note] WSREP: killing local connection: 402
130705 17:02:05 [Note] WSREP: killing local connection: 346
130705 17:02:05 [Note] WSREP: killing local connection: 258
130705 17:08:00 [Note] WSREP: New cluster view: global state: e25852ac-d618-11e2-0800-6c53c8061134:753333185, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 2
130705 17:08:00 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130705 17:08:00 [Note] WSREP: New cluster view: global state: e25852ac-d618-11e2-0800-6c53c8061134:753333185, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version 2
130705 17:08:00 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130705 17:08:00 [Note] WSREP: applier thread exiting (code:0)
130705 17:08:00 [Note] WSREP: applier thread exiting (code:5)
130705 17:08:00 [Note] WSREP: applier thread exiting (code:5)
130705 17:08:...


Dmitry Gribov (grib-d) wrote :

s/I mean it is now growing FAST./I mean it is _NOT_ growing FAST/

There is no way to edit or to preview; sorry for my spelling.

Dmitry Gribov (grib-d) wrote :

The issue is still here :(

@Dmitry,

Several stall-related issues have been fixed in PXC 5.5.33. It has not been released yet, but it has been pushed to the Percona experimental repo. Can you please test it and give us your feedback?

Dmitry Gribov (grib-d) wrote :

The issue is still here: the time to apply a transaction on the master and on the slaves differs drastically, and this "distribution" time grows exponentially.
The lock happens after a successful commit on the master node.

Alex Yurchenko (ayurchen) wrote :

Dmitry, I can't seem to find any mention of system limits here, so just in case I'm asking what the following report, first of all on the slaves:
- cat /proc/$(pidof mysqld)/limits | grep "open files"
- cat /proc/sys/vm/dirty_ratio
- cat /sys/block/<your data drive>/queue/scheduler

Dmitry Gribov (grib-d) wrote :

# cat /proc/$(pidof mysqld)/limits | grep "open files"
Max open files 268155 268155 files
# cat /proc/sys/vm/dirty_ratio
20
# cat /sys/block/sda/queue/scheduler
noop deadline [cfq]

Dmitry Gribov (grib-d) wrote :

PS: All servers are identical; "master" and "slave" only denote roles in a transaction. Any server may start a transaction (perform as master) and lock down the cluster.

Daniel Ylitalo (danigl) wrote :

I believe I'm hitting this bug too.

Although I'm running 5.5.33 as follows:

mysql> show status like '%wsrep%';
+----------------------------+-------------------------------------------+
| Variable_name | Value |
+----------------------------+-------------------------------------------+
| wsrep_local_state_uuid | 6296f7ce-0067-11e3-8c0b-87899827493a |
| wsrep_protocol_version | 4 |
| wsrep_last_committed | 1142049610 |
| wsrep_replicated | 197311 |
| wsrep_replicated_bytes | 148631191 |
| wsrep_received | 412172 |
| wsrep_received_bytes | 172662669 |
| wsrep_local_commits | 197308 |
| wsrep_local_cert_failures | 3 |
| wsrep_local_bf_aborts | 0 |
| wsrep_local_replays | 0 |
| wsrep_local_send_queue | 0 |
| wsrep_local_send_queue_avg | 0.000000 |
| wsrep_local_recv_queue | 0 |
| wsrep_local_recv_queue_avg | 0.022472 |
| wsrep_flow_control_paused | 0.000000 |
| wsrep_flow_control_sent | 0 |
| wsrep_flow_control_recv | 0 |
| wsrep_cert_deps_distance | 378.693525 |
| wsrep_apply_oooe | 0.083333 |
| wsrep_apply_oool | 0.000000 |
| wsrep_apply_window | 1.089744 |
| wsrep_commit_oooe | 0.000000 |
| wsrep_commit_oool | 0.000000 |
| wsrep_commit_window | 1.038462 |
| wsrep_local_state | 4 |
| wsrep_local_state_comment | Synced |
| wsrep_cert_index_size | 637 |
| wsrep_causal_reads | 0 |
| wsrep_incoming_addresses | 10.0.8.8:3306,10.0.8.9:3306,10.0.8.6:3306 |
| wsrep_cluster_conf_id | 9 |
| wsrep_cluster_size | 3 |
| wsrep_cluster_state_uuid | 6296f7ce-0067-11e3-8c0b-87899827493a |
| wsrep_cluster_status | Primary |
| wsrep_connected | ON |
| wsrep_local_index | 2 |
| wsrep_provider_name | Galera |
| wsrep_provider_vendor | Codership Oy <email address hidden> |
| wsrep_pro...


Alex Yurchenko (ayurchen) wrote :

Dmitry, the open-files limit must be fine, but the other two parameters are not.
IIRC, dirty_ratio=20 means that the kernel may use up to 20% of RAM for caching writes, and the CFQ scheduler is not recommended for database loads; use deadline. Not sure whether fixing those will help, but it's worth a try.
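For reference, applying those two suggestions looks roughly like this (a sketch, run as root; `sda` and the dirty_ratio value of 5 are example choices, not values prescribed in this thread):

```shell
# Switch the data drive's IO scheduler from cfq to deadline
echo deadline > /sys/block/sda/queue/scheduler

# Cap the dirty page write-back cache at 5% of RAM (example value)
sysctl -w vm.dirty_ratio=5

# To persist across reboots: put "vm.dirty_ratio = 5" in /etc/sysctl.conf
# and add "elevator=deadline" to the kernel boot parameters.
```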

Alex Yurchenko (ayurchen) wrote :

Daniel, the first thing to check is the presence of any table without a primary key. ROW-format DELETEs on such tables are notoriously slow. Yes, THAT slow: they do full table scans.
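A query along these lines can list such tables up front (a sketch built on the standard information_schema views, not something given in this thread); feed it to the server with `mysql -N -e "$Q"`:

```shell
# Q selects base tables that lack a PRIMARY KEY constraint, skipping
# the system schemas.
Q="SELECT t.table_schema, t.table_name
   FROM information_schema.tables t
   LEFT JOIN information_schema.table_constraints c
          ON  c.table_schema    = t.table_schema
         AND  c.table_name      = t.table_name
         AND  c.constraint_type = 'PRIMARY KEY'
   WHERE t.table_type = 'BASE TABLE'
     AND t.table_schema NOT IN ('mysql', 'information_schema', 'performance_schema')
     AND c.constraint_name IS NULL"
echo "$Q"   # inspect, or run: mysql -N -e "$Q"
```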

Daniel Ylitalo (danigl) wrote :

Alex, thank you!

That seems like a probable explanation, as the job that hangs the cluster does indeed try to delete rows from a table without a primary key.

If I may suggest a feature, a config option: something like "wsrep_deny_delete_without_index", which would reject DELETE queries against tables without a primary key.

Dmitry Gribov (grib-d) wrote :

> Not sure if fixing those will help, but worth a try.
We will check this, but if the problem were IO, I'd expect that a) the master would slow down exponentially as well, while it only slows linearly, and b) it would noticeably affect lots of other cases. Instead, only a very rare query is capable of producing this kind of hang. There is some hidden trigger; it is not just raw slowness. In general our load is only 15% CPU and 7% IO; everything is nice and fast.

Also, I believed that MySQL and its forks force direct IO on their file handles on Linux anyway, so with InnoDB I can safely ignore default IO caching, as MySQL will forcibly bypass it. Am I wrong?

PPS: One more (probably) important thing: we use a RAM disk for the MySQL temporary directory.

Alex Yurchenko (ayurchen) wrote :

> pps. One more (probably) important thing: we use ram disk for MySQL temporary drive.

Are you sure in this case that you're not simply running out of memory? Do you have swap enabled?

Alex Yurchenko (ayurchen) wrote :

> So with InnoDB I can safely ignore default IO caching as MySQL will forcibly bypass it. Am I wrong?

With InnoDB perhaps you can, but with the Galera cache file you can't. And to my understanding the scheduler affects non-cached IO as well.

Dmitry Gribov (grib-d) wrote :

No, there is enough memory/space. And yes, we will check the IO options.

Also, the latest release stalls (in addition to stalls on some big transactions) with a different error, and this stall looks "total": it is locked forever, and the error log just writes "BF lock wait long" + "INNODB MONITOR OUTPUT" over and over again.

Perhaps I should file a new ticket for this, but I'm still not sure how to reproduce it, or whether it is a separate bug or all about the same one.
_Perhaps_ this is caused by small but highly concurrent requests on several nodes, like "REPLACE INTO catalit_bookmark_locks ( user_id, art_id, lock_date, lock_id ) VALUES ( ?, ?, NOW(), UUID() )", where the table looks like this:
CREATE TABLE `catalit_bookmark_locks` (
  `user_id` int(10) unsigned NOT NULL,
  `art_id` varchar(100) NOT NULL DEFAULT '',
  `lock_date` datetime NOT NULL,
  `lock_id` varchar(36) NOT NULL,
  PRIMARY KEY (`user_id`,`art_id`),
  CONSTRAINT `catalit_bookmark_locks_ibfk_1` FOREIGN KEY (`user_id`) REFERENCES `users` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB DEFAULT CHARSET=utf8

@Dmitry,

 >>_Perhaps_ this is caused by small, but highly-concurrent requests on several nodes like "REPLACE INTO catalit_bookmark_locks ( user_id, art_id, lock_date, lock_id ) VALUES ( ?, ?, NOW(), UUID() )" where table looks like this:

Yes, from the fragment of the error it looks like you are seeing conflicts.

------------------------
LATEST DETECTED DEADLOCK
------------------------
131008 21:17:23
*** (1) TRANSACTION:
TRANSACTION 130A88E499, ACTIVE 0 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT 2 lock struct(s), heap size 376, 1 row lock(s)
MySQL thread id 224217, OS thread handle 0x7fe1cba0e700, query id 201781327 192.168.100.31 fbhub Updating
UPDATE users SET
                                utc_offset = IF(NULL IS NULL, utc_offset, NULL),
                                last_used=NOW(),
                                last_host_id = if(last_host_id = 19,19,'1735')
                        WHERE id='18700125' and last_used < now() - interval 30 minute
*** (1) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 0 page no 6423238 n bits 136 index `PRIMARY` of table `lib_area_100`.`users` trx id 130A88E499 lock_mode X locks rec but not gap waiting
*** (2) TRANSACTION:
TRANSACTION 130A88DEA3, ACTIVE 1 sec starting index read, thread declared inside InnoDB 500
mysql tables in use 1, locked 1
8 lock struct(s), heap size 1248, 3 row lock(s), undo log entries 1
MySQL thread id 225033, OS thread handle 0x7fe1db71c700, query id 201781913 192.168.100.31 fbhub Updating
UPDATE users SET
                                s_subscr_text_authors = '<87>\r\n',
                                subscr_type = IF(subscr_type = 1, 1, 2),
                                subscribe_new_buys = 1
                        WHERE id = '18700125'
*** (2) HOLDS THE LOCK(S):
RECORD LOCKS space id 0 page no 6423238 n bits 136 index `PRIMARY` of table `lib_area_100`.`users` trx id 130A88DEA3 lock mode S locks rec but not gap
*** (2) WAITING FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 0 page no 6423238 n bits 136 index `PRIMARY` of table `lib_area_100`.`users` trx id 130A88DEA3 lock_mode X locks rec but not gap waiting
*** WE ROLL BACK TRANSACTION (1)

============================================================================================

As you can see, both conflicting statements refer to the same id for update.
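Since InnoDB resolves the conflict by rolling one transaction back ("WE ROLL BACK TRANSACTION (1)" above), and Galera brute-force aborts surface to the client the same way, the usual client-side remedy is to retry the victim. A minimal retry wrapper as a sketch (the attempt count and one-second back-off are arbitrary examples, not values from this thread):

```shell
# retry N cmd [args...] : rerun cmd up to N times; return 0 on the first
# success, 1 if every attempt fails (e.g. a deadlock-rollback victim).
retry() {
  attempts=$1; shift
  i=1
  while true; do
    if "$@"; then
      return 0               # success: stop retrying
    fi
    if [ "$i" -ge "$attempts" ]; then
      return 1               # out of attempts
    fi
    i=$((i + 1))
    sleep 1                  # brief back-off between tries
  done
}

# Hypothetical usage:
#   retry 3 mysql -e "UPDATE users SET last_used = NOW() WHERE id = 18700125"
```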

Dmitry Gribov (grib-d) wrote :

So what do we do about that? New bug report?

@Dmitry,

Yes please.

Dmitry Gribov (grib-d) wrote :

We have now had a stall without "BF lock wait long", just as described in the original report: a stall without any visible symptoms in the log file. I'll assume there are at least two bugs, yes.

I filed ticket #1243156 about the conflicting statements. I hope it's clear enough; I'll add more details to it when (if) the problem happens again. Since we trimmed down our transactions, all kinds of stalls happen quite seldom now: once a week instead of once a day.

I now have pt-stalk output, containing some private data, captured during the stall. I can put it into the issue tracker on customers.percona.com or send it by e-mail.

Dmitry Gribov (grib-d) wrote :

In PXC 5.5.34 with Galera 23.2.8 the subject problem is not reproduced. Nor have we seen "BF lock wait long" for some time; hopefully it is gone as well.

There are still some stall cases with updates on tables with no primary key; Daniel hit that as well. I believe this should be handled somehow too, at least by logging warnings in detailed logging modes when such updates are made.

But all this has nothing to do with the subject issue anyway.

@Dmitry,

Thanks for the confirmation. Yes, having tables without a primary key can stress performance. Anyway, closing this issue for now.

Changed in percona-xtradb-cluster:
status: New → Fix Committed
Srand (cyril-scetbon) wrote :

We have the same issue. I'll let you know ASAP whether using the new version helps us avoid it.

Dmitry Yu Okunev (dyokunev) wrote :

I have a similar issue with different versions of Galera/MySQL and OS:
The working node has a lot of "wsrep in pre-commit stage" processes (without any CPU/HDD/network load), but the receiving node has a process in the "System lock" state with 100% CPU load.

I tried attaching gdb at the moment of the problem and got this backtrace:

#0 0x00007fa4c1bf34a3 in poll () from /lib/x86_64-linux-gnu/libc.so.6
#1 0x00000000005f92b2 in handle_connections_sockets () at /mnt/workspace/percona-xtradb-cluster-5.6-debs/label_exp/debian-wheezy-x64/target/Percona-XtraDB-Cluster-5.6.15/sql/mysqld.cc:7099
#2 0x00000000006001f6 in mysqld_main (argc=68, argv=0x137f028) at /mnt/workspace/percona-xtradb-cluster-5.6-debs/label_exp/debian-wheezy-x64/target/Percona-XtraDB-Cluster-5.6.15/sql/mysqld.cc:6494
#3 0x00007fa4c1b40ead in __libc_start_main () from /lib/x86_64-linux-gnu/libc.so.6
#4 0x00000000005f30fd in _start ()

I'll try to recompile without HAVE_POLL now.

Dmitry Yu Okunev (dyokunev) wrote :

As expected, the same problem with select() instead of poll() :)

Srand (cyril-scetbon) wrote :

Damn it, no workaround for now :(

Dmitry Gribov (grib-d) wrote :

We have set up pt-stalk to catch "BF lock wait long" in the log and restart the affected mysqld; that worked nicely as a workaround (except that "BF lock wait long" only shows up after the cluster has already been stopped for some time).

Dmitry Gribov (grib-d) wrote :

The "BF lock wait long" stall has been reproduced twice since then, so I suppose either it was not solved completely or there is another cause for it. It is now so rare that I have no idea how to find the reason (but we should have pt-stalk output for these cases).

Dmitry Gribov (grib-d) wrote :

Noted: there used to be many "BF lock wait long" messages before; now there is a single one in the log, then the node stalls, just as it did before. We restart it and all is fine. It happens less often, but it still happens.
And I feel I will soon be able to guess which query triggers it.

Dmitry Gribov (grib-d) wrote :

PS: I have never seen "BF lock wait long" in the log without the cluster hanging. Perhaps these waiting transactions should simply be handled as deadlocks, with an error returned to the application?

Things have improved in the meantime, as the related issues are now fixed.
Please upgrade to 5.6 for a better experience.
I will close this issue for now.

Changed in percona-xtradb-cluster:
importance: Undecided → Medium
Rodrigo (rodri-bernardo) wrote :

I have this issue in Percona 5.6. Is there a new bug report?
