Bug #1243156 “Cluster stalls while resolving statements conflict...” : Series 5.6 : Bugs : MySQL patches by Codership

Revision history for this message

Dmitry Gribov (grib-d) wrote on 2013-10-22:

#1

stall2.log Edit (8.3 KiB, text/plain)

Revision history for this message

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-10-23:

#2

@Dmitry,

Can you provide my.cnf - specifically do you have lock_wait_timeout set?

Revision history for this message

Dmitry Gribov (grib-d) wrote on 2013-10-23:

#3

my.cnf Edit (4.7 KiB, text/plain)

File in the attachment

lock_wait_timeout = 120

Revision history for this message

Dmitry Gribov (grib-d) wrote on 2013-10-23:

#4

sry, I mean innodb_lock_wait_timeout = 20. lock_wait_timeout is not set and it contains something strange:

mysql> show variables like "%lock_w%";
+--------------------------+----------+
| Variable_name | Value |
+--------------------------+----------+
| innodb_lock_wait_timeout | 120 |
| lock_wait_timeout | 31536000 |
+--------------------------+----------+

Revision history for this message

Przemek (pmalkowski) wrote on 2014-08-27:

#5

Dimitry, the lock_wait_timeout = 31536000 is just default value since MySQL 5.5+, nothing strange here (http://dev.mysql.com/doc/refman/5.5/en/server-system-variables.html#sysvar_lock_wait_timeout).

Is this issue maybe related to another bug report of yours: https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1197771 ?

What do you mean by "resolving two statements with the same ID"? Is this about updating rows with the same PK/UK id?

Does this problem still happen with latest PXC 5.5.37 or 5.6.19? If so, can you also attach full processlist and innodb engine status outputs from the time node or cluster is stalled? Also "show status like 'ws%';" from this and at least other node.

Changed in percona-xtradb-cluster:
status:	New → Incomplete

Revision history for this message

Dmitry Gribov (grib-d) wrote on 2014-09-01:

#6

I doubt this may be related to 1197771 in general, this all looks quite different. But on the level of code I can't say.
"resolving two statements with the same ID" - I mean "updating rows with the same PK/UK id", but this is only a hypothesis, and this is definitely not the only requirement, there is something more.
With 5.5.37 we still have the issue, I was thinking to try the latest dev release (can't see when it's going to be released).

Collecting processlist seemed impossible last time we tried it if "WSREP: BF lock wait long" appeared, we will try once again.
And, btw, now we only have ONE "WSREP: BF lock wait long" in the log - when the bug was first met they go in series (as you can see in the attachment).

Revision history for this message

Dmitry Gribov (grib-d) wrote on 2014-09-03:

#7

But this may be related to #1280896. Is there a test version with 1280896 fixed? I always have a hard time trying to get the test release of PXC... Шэв expect something like 5.5.39-25.11 there:
http://www.percona.com/downloads/TESTING/Percona-XtraDB-Cluster/
but no luck. Any ideas? We would test this issue against 5.5.39-25.11, for example.

Revision history for this message

Dmitry Gribov (grib-d) wrote on 2014-09-08:

#8

processlist Edit (10.9 KiB, application/zip)

Nope. We have tested with this:
Percona XtraDB Cluster 5.5.39-25.11 + Galera 2.11.2675

Variable_name Value
wsrep_local_state_uuid 2db76c9f-1717-11e3-b9d2-2f0c3a53a9e5
wsrep_protocol_version 4
wsrep_last_committed 12010943836
wsrep_replicated 49401
wsrep_replicated_bytes 1095127796
wsrep_received 1250506
wsrep_received_bytes 8274408205
wsrep_local_commits 48527
wsrep_local_cert_failures 869
wsrep_local_replays 0
wsrep_local_send_queue 0
wsrep_local_send_queue_avg 0
wsrep_local_recv_queue 0
wsrep_local_recv_queue_avg 0.318141
wsrep_flow_control_paused 0
wsrep_flow_control_sent 0
wsrep_flow_control_recv 0
wsrep_cert_deps_distance 57.492063
wsrep_apply_oooe 0.218834
wsrep_apply_oool 0
wsrep_apply_window 3.387444
wsrep_commit_oooe 0
wsrep_commit_oool 0
wsrep_commit_window 2.859617
wsrep_local_state 4
wsrep_local_state_comment Synced
wsrep_cert_index_size 66
wsrep_causal_reads 0
wsrep_incoming_addresses 192.168.0.212:8661,192.168.1.72:8661,192.168.0.214:8661,192.168.1.69:8661
wsrep_cluster_conf_id 30
wsrep_cluster_size 4
wsrep_cluster_state_uuid 2db76c9f-1717-11e3-b9d2-2f0c3a53a9e5
wsrep_cluster_status Primary
wsrep_connected ON
wsrep_local_bf_aborts 138
wsrep_local_index 0
wsrep_provider_name Galera
wsrep_provider_vendor Codership Oy <email address hidden>
wsrep_provider_version 2.11(r318911d)
wsrep_ready ON
wsrep_thread_count 49

it still hangs with "WSREP: BF lock wait long" (see attachment)

Revision history for this message

Dmitry Gribov (grib-d) wrote on 2014-09-08:

#9

+INNODB MONITOR OUTPUT

Revision history for this message

Dmitry Gribov (grib-d) wrote on 2014-09-08:

#10

dbp02_bflock.log Edit (119.0 KiB, text/plain)

Revision history for this message

Przemek (pmalkowski) wrote on 2014-09-09:

#11

Thank you Dmitry for the details.
I can see a different problem here. You are using INSERT IGNORE statements, like:
INSERT IGNORE INTO arts_deploy ...
INSERT IGNORE INTO users ...
Can you confirm any of these tables is MyISAM? If yes, then there is this bug - lp:1236378 which shows a similar problem. Basically after any INSERT IGNORE a node becomes completely blocked.

Revision history for this message

Dmitry Gribov (grib-d) wrote on 2014-09-10:

#12

Nope, we have no myisam tables at all.
And I'm getting a STRONG feeling the stalling queries always (or at least often) look like
insert ... on duplicate key update
where there IS duplicate and update works.

Revision history for this message

Przemek (pmalkowski) wrote on 2014-09-10:

#13

First let me apologize - I mixed INSERT IGNORE with INSERT DELAYED which was the case in lp:1236378, so please discard my last comment.

From the wsrep status on this node it is not clear why it is stuck. Can you enable the wsrep_log_conflicts and wsrep_debug to see if anything interesting gets logged?

Also the longest transaction is interesting:
---TRANSACTION 9DC96209B, ACTIVE 148 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 937 lock struct(s), heap size 129464, 15880 row lock(s), undo log entries 4078
MySQL thread id 2175, OS thread handle 0x7f50f510d700, query id 4179022 192.168.0.215 fbhub update
INSERT IGNORE INTO arts_deploy VALUES('119704','312','2013-01-01 00:00:00','0.99847')
------- TRX HAS BEEN WAITING 12 SEC FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 12856 page no 2176 n bits 520 index `PRIMARY` of table `lib_area_100`.`arts_deploy` trx id 9DC96209B lock mode S locks rec but not gap waiting
------------------
TABLE LOCK table `lib_area_100`.`tasty_arts` trx id 9DC96209B lock mode IX
RECORD LOCKS space id 474 page no 1523461 n bits 384 index `uaf` of table `lib_area_100`.`tasty_arts` trx id 9DC96209B lock_mode X locks rec but not gap
TABLE LOCK table `lib_area_100`.`spam_hits` trx id 9DC96209B lock mode IX
TABLE LOCK table `lib_area_100`.`spam_packs` trx id 9DC96209B lock mode IS
RECORD LOCKS space id 52 page no 4234 n bits 208 index `PRIMARY` of table `lib_area_100`.`spam_packs` trx id 9DC96209B lock mode S locks rec but not gap
TABLE LOCK table `lib_area_100`.`users` trx id 9DC96209B lock mode IS
RECORD LOCKS space id 12854 page no 5861 n bits 112 index `PRIMARY` of table `lib_area_100`.`users` trx id 9DC96209B lock mode S locks rec but not gap
TABLE LOCK table `lib_area_100`.`spam_hits_pj` trx id 9DC96209B lock mode IX
TABLE LOCK table `lib_area_100`.`spam_tickets` trx id 9DC96209B lock mode IX
RECORD LOCKS space id 10 page no 256253 n bits 296 index `PRIMARY` of table `lib_area_100`.`spam_tickets` trx id 9DC96209B lock_mode X locks rec but not gap

Can you outline what operations were made in this transaction prior to this stuck INSERT INGORE? And can you give all the involved tables definitions?

Also I would like to see "show status like 'ws%'; also from at least one other node during the problem is happening.
So you are saying the whole cluster gets stuck, but restarting just one node restores it? How do you know which node to restart then?

First let me apologize - I mixed INSERT IGNORE with INSERT DELAYED which was the case in lp:1236378, so please discard my last comment.

From the wsrep status on this node it is not clear why it is stuck. Can you enable the wsrep_log_conflicts and wsrep_debug to see if anything interesting gets logged?

Also the longest transaction is interesting:
---TRANSACTION 9DC96209B, ACTIVE 148 sec inserting
mysql tables in use 1, locked 1
LOCK WAIT 937 lock struct(s), heap size 129464, 15880 row lock(s), undo log entries 4078
MySQL thread id 2175, OS thread handle 0x7f50f510d700, query id 4179022 192.168.0.215 fbhub update
INSERT IGNORE INTO arts_deploy VALUES('119704','312','2013-01-01 00:00:00','0.99847')
------- TRX HAS BEEN WAITING 12 SEC FOR THIS LOCK TO BE GRANTED:
RECORD LOCKS space id 12856 page no 2176 n bits 520 index `PRIMARY` of table `lib_area_100`.`arts_deploy` trx id 9DC96209B lock mode S locks rec but not gap waiting
------------------
TABLE LOCK table `lib_area_100`.`tasty_arts` trx id 9DC96209B lock mode IX
RECORD LOCKS space id 474 page no 1523461 n bits 384 index `uaf` of table `lib_area_100`.`tasty_arts` trx id 9DC96209B lock_mode X locks rec but not gap
TABLE LOCK table `lib_area_100`.`spam_hits` trx id 9DC96209B lock mode IX
TABLE LOCK table `lib_area_100`.`spam_packs` trx id 9DC96209B lock mode IS
RECORD LOCKS space id 52 page no 4234 n bits 208 index `PRIMARY` of table `lib_area_100`.`spam_packs` trx id 9DC96209B lock mode S locks rec but not gap
TABLE LOCK table `lib_area_100`.`users` trx id 9DC96209B lock mode IS
RECORD LOCKS space id 12854 page no 5861 n bits 112 index `PRIMARY` of table `lib_area_100`.`users` trx id 9DC96209B lock mode S locks rec but not gap
TABLE LOCK table `lib_area_100`.`spam_hits_pj` trx id 9DC96209B lock mode IX
TABLE LOCK table `lib_area_100`.`spam_tickets` trx id 9DC96209B lock mode IX
RECORD LOCKS space id 10 page no 256253 n bits 296 index `PRIMARY` of table `lib_area_100`.`spam_tickets` trx id 9DC96209B lock_mode X locks rec but not gap

Can you outline what operations were made in this transaction prior to this stuck INSERT INGORE? And can you give all the involved tables definitions?

Also I would like to see "show status like 'ws%'; also from at least one other node during the problem is happening. 
So you are saying the whole cluster gets stuck, but restarting just one node restores it? How do you know which node to restart then?

Revision history for this message

Dmitry Gribov (grib-d) wrote on 2014-09-10:

#14

Download full text (7.8 KiB)

CREATE TABLE `arts_deploy` (
  `art` int(10) unsigned NOT NULL,
  `face` smallint(5) unsigned NOT NULL,
  `added` datetime NOT NULL,
  `is_hot` decimal(6,2) unsigned NOT NULL DEFAULT '0.00',
  PRIMARY KEY (`art`,`face`),
  KEY `fi` (`face`,`is_hot`),
  KEY `fa` (`face`,`added`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

CREATE TABLE `tasty_arts` (
  `user` int(10) unsigned NOT NULL,
  `art` int(10) unsigned NOT NULL,
  `face` smallint(5) unsigned NOT NULL,
  `updated` datetime NOT NULL,
  `rate` float unsigned NOT NULL DEFAULT '0',
  UNIQUE KEY `uaf` (`user`,`art`,`face`),
  KEY `ufra` (`user`,`face`,`rate`,`art`),
  KEY `userupd` (`user`,`face`,`updated`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

CREATE TABLE `spam_hits` (
  `user` int(10) unsigned NOT NULL,
  `spam` int(10) unsigned NOT NULL,
  `time` datetime NOT NULL,
  `type` tinyint(3) unsigned NOT NULL,
  PRIMARY KEY (`time`,`spam`,`user`,`type`),
  KEY `spam_type` (`spam`,`type`,`user`),
  KEY `user_type` (`user`,`type`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

CREATE TABLE `spam_packs` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `send_start` datetime DEFAULT NULL,
  `send_end` datetime DEFAULT NULL,
  `keyparam` varchar(50) DEFAULT NULL,
  `s_desc` varchar(1000) DEFAULT NULL,
  `type` tinyint(3) unsigned DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `index2` (`keyparam`,`type`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

CREATE TABLE `users` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `login` varchar(100) DEFAULT NULL,
  `pwd` varchar(100) DEFAULT NULL,
  `s_mail` varchar(50) DEFAULT NULL,
  `s_www` varchar(255) DEFAULT NULL,
  `s_inn` varchar(50) DEFAULT NULL,
  `s_descr` text,
  `s_phone` varchar(100) DEFAULT NULL,
  `offert_accepted` tinyint(3) unsigned NOT NULL DEFAULT '0',
  `s_full_name` varchar(255) DEFAULT NULL,
  `s_first_name` varchar(100) DEFAULT NULL,
  `s_middle_name` varchar(100) DEFAULT NULL,
  `s_last_name` varchar(100) DEFAULT NULL,
  `s_city` varchar(100) DEFAULT NULL,
  `s_address` varchar(255) DEFAULT NULL,
  `last_used` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `last_host_id` smallint(5) unsigned DEFAULT NULL,
  `mail_confirmed` tinyint(1) unsigned NOT NULL DEFAULT '0',
  `msisdn_confirmed` tinyint(1) unsigned NOT NULL DEFAULT '0',
  `msisdn_confirm_code` varchar(4) DEFAULT NULL,
  `msisdn_req_time` datetime DEFAULT NULL,
  `java_phone_model` varchar(30) DEFAULT NULL,
  `java_font` varchar(30) DEFAULT NULL,
  `java_safe_mode` tinyint(4) NOT NULL DEFAULT '0',
  `user_pic` varchar(16) DEFAULT NULL,
  `denied_libs` varchar(255) DEFAULT NULL,
  `recenser_type` int(11) DEFAULT NULL,
  `partner_id` int(11) DEFAULT NULL,
  `creat_date` datetime DEFAULT NULL,
  `partner` int(10) unsigned DEFAULT NULL,
  `partner_valid_till` date DEFAULT NULL,
  `partner_pin` int(10) unsigned DEFAULT NULL,
  `account` decimal(9,2) NOT NULL DEFAULT '0.00',
  `abonement_start` datetime DEFAULT NULL,
  `abonement_expires` date NOT NULL DEFAULT '2006-01-01',
  `abonement_period` smallint(6) DEFAULT NULL,
  `abonement_delay` tinyint(3) unsigned NOT NULL DEFAULT '0',
  `abonement_max_price` decimal(6,2) DEFAULT NULL,
  `abonement_downloads` tinyint(3) unsigned NOT NULL DEFAULT '0',
...

CREATE TABLE `arts_deploy` (
  `art` int(10) unsigned NOT NULL,
  `face` smallint(5) unsigned NOT NULL,
  `added` datetime NOT NULL,
  `is_hot` decimal(6,2) unsigned NOT NULL DEFAULT '0.00',
  PRIMARY KEY (`art`,`face`),
  KEY `fi` (`face`,`is_hot`),
  KEY `fa` (`face`,`added`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

CREATE TABLE `tasty_arts` (
  `user` int(10) unsigned NOT NULL,
  `art` int(10) unsigned NOT NULL,
  `face` smallint(5) unsigned NOT NULL,
  `updated` datetime NOT NULL,
  `rate` float unsigned NOT NULL DEFAULT '0',
  UNIQUE KEY `uaf` (`user`,`art`,`face`),
  KEY `ufra` (`user`,`face`,`rate`,`art`),
  KEY `userupd` (`user`,`face`,`updated`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

CREATE TABLE `spam_hits` (
  `user` int(10) unsigned NOT NULL,
  `spam` int(10) unsigned NOT NULL,
  `time` datetime NOT NULL,
  `type` tinyint(3) unsigned NOT NULL,
  PRIMARY KEY (`time`,`spam`,`user`,`type`),
  KEY `spam_type` (`spam`,`type`,`user`),
  KEY `user_type` (`user`,`type`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

CREATE TABLE `spam_packs` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `send_start` datetime DEFAULT NULL,
  `send_end` datetime DEFAULT NULL,
  `keyparam` varchar(50) DEFAULT NULL,
  `s_desc` varchar(1000) DEFAULT NULL,
  `type` tinyint(3) unsigned DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `index2` (`keyparam`,`type`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

CREATE TABLE `users` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `login` varchar(100) DEFAULT NULL,
  `pwd` varchar(100) DEFAULT NULL,
  `s_mail` varchar(50) DEFAULT NULL,
  `s_www` varchar(255) DEFAULT NULL,
  `s_inn` varchar(50) DEFAULT NULL,
  `s_descr` text,
  `s_phone` varchar(100) DEFAULT NULL,
  `offert_accepted` tinyint(3) unsigned NOT NULL DEFAULT '0',
  `s_full_name` varchar(255) DEFAULT NULL,
  `s_first_name` varchar(100) DEFAULT NULL,
  `s_middle_name` varchar(100) DEFAULT NULL,
  `s_last_name` varchar(100) DEFAULT NULL,
  `s_city` varchar(100) DEFAULT NULL,
  `s_address` varchar(255) DEFAULT NULL,
  `last_used` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `last_host_id` smallint(5) unsigned DEFAULT NULL,
  `mail_confirmed` tinyint(1) unsigned NOT NULL DEFAULT '0',
  `msisdn_confirmed` tinyint(1) unsigned NOT NULL DEFAULT '0',
  `msisdn_confirm_code` varchar(4) DEFAULT NULL,
  `msisdn_req_time` datetime DEFAULT NULL,
  `java_phone_model` varchar(30) DEFAULT NULL,
  `java_font` varchar(30) DEFAULT NULL,
  `java_safe_mode` tinyint(4) NOT NULL DEFAULT '0',
  `user_pic` varchar(16) DEFAULT NULL,
  `denied_libs` varchar(255) DEFAULT NULL,
  `recenser_type` int(11) DEFAULT NULL,
  `partner_id` int(11) DEFAULT NULL,
  `creat_date` datetime DEFAULT NULL,
  `partner` int(10) unsigned DEFAULT NULL,
  `partner_valid_till` date DEFAULT NULL,
  `partner_pin` int(10) unsigned DEFAULT NULL,
  `account` decimal(9,2) NOT NULL DEFAULT '0.00',
  `abonement_start` datetime DEFAULT NULL,
  `abonement_expires` date NOT NULL DEFAULT '2006-01-01',
  `abonement_period` smallint(6) DEFAULT NULL,
  `abonement_delay` tinyint(3) unsigned NOT NULL DEFAULT '0',
  `abonement_max_price` decimal(6,2) DEFAULT NULL,
  `abonement_downloads` tinyint(3) unsigned NOT NULL DEFAULT '0',
  `abonement_left_clicks` tinyint(3) unsigned NOT NULL DEFAULT '0',
  `abonement_left_summ` decimal(6,2) DEFAULT NULL,
  `user_pic_height` tinyint(3) unsigned DEFAULT NULL,
  `user_pic_width` tinyint(3) unsigned DEFAULT NULL,
  `show_pay_btn` tinyint(3) unsigned NOT NULL DEFAULT '0',
  `s_puid` varchar(255) DEFAULT NULL,
  `discount` decimal(4,4) NOT NULL DEFAULT '0.0000',
  `money_bonus` decimal(9,2) NOT NULL DEFAULT '0.00',
  `subscr_last_reminded` datetime DEFAULT NULL,
  `subscr_free_arts_given` datetime DEFAULT NULL,
  `subscr_type` tinyint(3) unsigned NOT NULL DEFAULT '2',
  `subscr_period` tinyint(3) unsigned DEFAULT '1',
  `subscr_content` tinyint(3) unsigned DEFAULT '2',
  `subscr_genres` text,
  `s_subscr_text_authors` text,
  `subscr_date` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `s_subscr_text_pattern` text,
  `subscr_languages` text,
  `prefered_currency` char(3) DEFAULT NULL,
  `last_paymethod` tinyint(3) unsigned DEFAULT NULL,
  `utc_offset` char(6) DEFAULT NULL,
  `last_ip` varchar(15) DEFAULT NULL,
  `socnet_last_reminded` datetime DEFAULT '0000-01-01 00:00:00',
  `moved_from` smallint(5) unsigned DEFAULT NULL,
  `moved_to` smallint(5) unsigned DEFAULT NULL,
  `subscribe_new_buys` tinyint(1) NOT NULL DEFAULT '1',
  PRIMARY KEY (`id`),
  KEY `partner` (`partner`),
  KEY `partner_valid_till` (`partner_valid_till`),
  KEY `abonement_expires` (`abonement_expires`,`abonement_left_clicks`),
  KEY `puid` (`s_puid`),
  KEY `s_phone` (`s_phone`),
  KEY `s_mail` (`s_mail`),
  KEY `login` (`login`)
) ENGINE=InnoDB AUTO_INCREMENT=46288879 DEFAULT CHARSET=utf8

CREATE TABLE `spam_tickets` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `user` int(10) unsigned NOT NULL,
  `channel` int(10) unsigned NOT NULL,
  `created` datetime NOT NULL,
  `valid_from` datetime NOT NULL,
  `valid_till` datetime NOT NULL,
  `last_update` datetime NOT NULL,
  `sent` datetime DEFAULT NULL,
  `status` enum('requested','rejected','waiting','rejected_on_lag','sending','sent','error','token_invalid','delivered','opened') DEFAULT NULL,
  `annoy` tinyint(3) unsigned NOT NULL DEFAULT '0',
  `active` tinyint(1) unsigned DEFAULT '1' COMMENT '1 - активно, NULL - неактивно',
  PRIMARY KEY (`id`),
  UNIQUE KEY `user_active` (`user`,`active`),
  KEY `channel` (`channel`),
  KEY `saac` (`status`,`active`,`annoy`,`created`),
  KEY `av` (`active`,`valid_till`),
  KEY `sal` (`status`,`active`,`last_update`),
  KEY `ucl` (`user`,`channel`,`last_update`),
  CONSTRAINT `spam_tickets_ibfk_1` FOREIGN KEY (`user`) REFERENCES `users` (`id`) ON DELETE CASCADE,
  CONSTRAINT `spam_tickets_ibfk_2` FOREIGN KEY (`channel`) REFERENCES `spam_channels` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=813435217 DEFAULT CHARSET=utf8

"INSERT IGNORE INTO arts_deploy" is executed with auto commit on, so it is not in transaction at all. But this query may be HIGHLY concurrent on the node level and the cluster level too, and there is a huge chance of a conflict (insertion of the same data from several threads on several nodes).

"restarting just one node restores it" - yes, but now you have to restart the right node, any random does not fit. We restart the one which has "WSREP: BF lock wait long" in the log (only one does, other nores hang silently).

And, btw, we now restart mysql by killing it with -9 as correct shutdown leads to SST in most cases - it shuts down slowly and seem to have some problems with those hanging queries termination or with something else, hard to say. While after -9 node recovers just fine (except the case when we kill this way one of the last two nodes, as I mentioned in #1258880 - it requires extra administration).

And I have commented out the peace of code that OFTEN hanged the cluster by making mass and higly-concurrent queries like this:

INSERT INTO banner_hits (baner,time,host,hits)
VALUES (?,?,?,?)
ON DUPLICATE KEY UPDATE hits = hits + ?

CREATE TABLE `banner_hits` (
  `baner` int(10) unsigned NOT NULL,
  `time` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
  `host` smallint(5) unsigned NOT NULL,
  `hits` smallint(5) unsigned NOT NULL DEFAULT '0',
  PRIMARY KEY (`baner`,`time`,`host`),
  KEY `host` (`host`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8

I am not sure this is the killer-SQL, it looks like the higher is the load the more chances to get "WSREP: BF lock wait long", no matter what the load is (BAD quaries only show their nature when the cluster is loaded). Anyway, with this code removed hang outs got seldom. This queries sometimes were in the processlist with "wsrep in pre-commit stage" state for 20s and more, several in the row for the same baner ID.

Hopу all this helps.

We will turn logging on, I'll share what we will get during the next "WSREP: BF lock wait long".

Przemek (pmalkowski) on 2014-09-15

Changed in percona-xtradb-cluster:
status:	Incomplete → New

Revision history for this message

Dmitry Gribov (grib-d) wrote on 2014-09-24:

#15

dbp07.2014-09-23_08-02-16.new2.tgz Edit (154.7 KiB, application/x-tar)

Here is detailed data. Node dbp07 was affected by "BF lock wait long", logs from neighbors provided (-2 minutes from the incident).

Revision history for this message

Dmitry Gribov (grib-d) wrote on 2014-09-26:

#16

dbh06.2014-09-26_14-41-48.tgz Edit (85.1 KiB, application/x-tar)

One more clean example: dbh06 hang

Revision history for this message

Dmitry Gribov (grib-d) wrote on 2014-12-29:

#17

Noted that mass "delete" statements increase "BF lock wait long" rate dramatically.

Revision history for this message

Dmitry Gribov (grib-d) wrote on 2015-02-03:

#18

We are testing current 5.6 build, it is finally working without core dumps and it seem to not have the issue. We have tested the "killing" load with it and not a single "WSREP: BF lock wait long" so far.

Revision history for this message

Dmitry Gribov (grib-d) wrote on 2015-03-02:

#19

Funny, we have no "WSREP: BF lock wait long" in the log and locks a seldom. But we already had a cluster stall twice with quite a similar sympthoms: all nodes have many "wsrep in pre-commit stage" threads, then the whole cluster stalls. If you restart the "bad node" cluster unlocks. Only two differences: it happens not so often and there is no way to identify the "bad" node. Looks like "BF lock wait long" logging was disabled|removed while the problem is still there.

Revision history for this message

Dmitry Gribov (grib-d) wrote on 2015-03-26:

#20

Catched "WSREP: BF lock wait long" in the exactly old form today, so it's seldom but it's still here

Krunal Bauskar (krunal-bauskar) on 2015-11-17

Changed in percona-xtradb-cluster:
status:	New → Confirmed
importance:	Undecided → High

Revision history for this message

Rodrigo (rodri-bernardo) wrote on 2016-09-29:

#21

We are experiencing this issue with 5.6.24-72.2-56-log. Fixed in later versions?

Revision history for this message

Dmitry Gribov (grib-d) wrote on 2016-09-29:

#22

Looks like the issue was fixed, haven't seen it for quite a long time.

Revision history for this message

Shahriyar Rzayev (rzayev-sehriyar) wrote on 2018-01-18:

#23

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-978

	Status	Importance	Assigned to
MySQL patches by Codership	Status tracked in 5.6
5.5	New	Undecided	Unassigned
5.6	New	Undecided	Unassigned
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC	Confirmed	High	Unassigned

MySQL patches by Codership

Cluster stalls while resolving statements conflict

Bug Description

Other bug subscribers

Bug attachments

Remote bug watches