Preparing for TO isolation and cluster locking

Bug #1236378 reported by Alekc
This bug affects 3 people
Affects (Status / Importance / Assigned to):
  MySQL patches by Codership (status tracked in 5.6)
    5.5: New / Undecided / Unassigned
    5.6: New / Undecided / Unassigned
  Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC; status tracked in 5.6)
    5.5: Confirmed / High / Unassigned
    5.6: Confirmed / High / Unassigned

Bug Description

Hello, recently I have had some weird issues with XtraDB Cluster.

Sometimes the cluster just gets locked: all new connections are put in a queue, so max_connections is reached very rapidly, and the only way to fix it is to reboot the affected nodes.

Some screenshots from phpmyadmin from recent issue:

Node A:
https://www.dropbox.com/s/pn9g0qgwyy694w2/Screenshot%202013-10-07%2015.24.26.png

Node B:
https://www.dropbox.com/s/972sbw3twkjwcgz/Screenshot%202013-10-07%2015.24.39.png

My environment:
3 physical nodes on Debian 7 running percona-xtradb-cluster-server-5.5 (5.5.33-23.7.6-496.wheezy) + 2 garbd nodes

Node configuration:
[client]
default-character-set = utf8mb4

# CLIENT #
port = 3306
socket = /var/run/mysqld/mysqld.sock
default-character-set = utf8mb4

[mysqld]
#character
character-set-client-handshake = FALSE
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci

server_id=1

# GENERAL #
user = mysql
default_storage_engine = InnoDB

# BINARY LOGGING #
log_bin = /var/lib/mysql/mysql-bin
expire_logs_days = 5
binlog_format=ROW

#log_bin=mysql-bin
datadir=/var/lib/mysql

wsrep_cluster_address=gcomm://192.168.0.211:4567,192.168.0.138:4567,192.168.0.140:4567
wsrep_provider=/usr/lib/libgalera_smm.so
wsrep_slave_threads=4
wsrep_cluster_name=ismgalaxy
wsrep_sst_method=xtrabackup
wsrep_node_name=sun
wsrep_sst_auth=xxxxxxxxxxxxxxxxxx
wsrep_replicate_myisam=ON
wsrep_notify_cmd=/usr/bin/wsrep_notify.sh
wsrep_sst_donor=earth
# Provider specific configuration options
wsrep_provider_options="gcache.size=1024M"

log_slave_updates

# SAFETY #
max_allowed_packet = 16M
max_connect_errors = 1000000
skip_name_resolve
sysdate_is_now = 1
innodb = FORCE
innodb_strict_mode = 1

# CACHES AND LIMITS #
tmp_table_size = 32M
max_heap_table_size = 32M
query_cache_type = 0
query_cache_size = 0
max_connections = 500
thread_cache_size = 50
open_files_limit = 65535
table_definition_cache = 4096
table_open_cache = 10000

# INNODB #
innodb_flush_method = O_DIRECT
innodb_log_files_in_group = 2
innodb_log_file_size = 512M
innodb_flush_log_at_trx_commit = 1
innodb_file_per_table = 1
innodb_buffer_pool_size = 15G
innodb_locks_unsafe_for_binlog=1
innodb_autoinc_lock_mode=2

# LOGGING #
log_error = /var/log/mysql/mysql-error.log
#log_queries_not_using_indexes = 1
slow_query_log = 1
slow_query_log_file = /var/log/mysql/mysql-slow.log

[sst]
#streamfmt=tar

Any suggestions? Thanks.

Tags: i53862
Raghavendra D Prabhu (raghavendra-prabhu) wrote:

@Alekc,

Do you have any DDL running during this? If yes, then it may be related to https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1212955

Alekc (alekcander) wrote:

I don't think so. Usually the blocked queries involved truncating tables (Cacti was doing that very often, more than twice a minute). For now I have removed the Cacti integration with the cluster; we'll see if that solves the problem.

Alekc (alekcander) wrote:

A side note: we have noticed that the database was locking up when one of our developers was making a mysqldump from their local machine over poor connectivity (so execution could take up to 30 minutes). Could this be the culprit? Although it's strange that other databases were involved...

Alex Yurchenko (ayurchen) wrote:

- if a read lock was involved, that could very much be the case. Check the wsrep_flow_control_paused status variable (a quick way to watch it is sketched below).
- you may be seeing a case of DELETE from a table without a primary key.
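
For reference, a minimal sketch of watching that counter from the shell while a stall is happening (assumes the mysql client can log in locally; the credentials are placeholders):

# Poll the flow-control counters once a second. wsrep_flow_control_paused
# reports the fraction of time replication has been paused by flow control;
# a value close to 1.0 means it is almost permanently stalled.
while true; do
  mysql -u root -p"$MYSQL_PASSWORD" -e "SHOW GLOBAL STATUS LIKE 'wsrep_flow_control%';"
  sleep 1
done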

Alex Yurchenko (ayurchen) wrote:

Correction to the above: it should read "Check the wsrep_flow_control_paused status variable when this happens again."

Alekc (alekcander) wrote:

I checked, and all my tables have a primary key. But I wonder how it's possible that a lock on one database on one of the cluster nodes can lock the entire cluster :|

Alex Yurchenko (ayurchen) wrote:

There is one global transaction order; it is not per database or table. If one transaction cannot be applied, everything ordered behind it cannot be applied either. You may find this post helpful: http://www.mysqlperformanceblog.com/2013/10/08/taking-backups-percona-xtradb-cluster-without-stalls-flow-control

Raghavendra D Prabhu (raghavendra-prabhu) wrote:

@Alekc,

mysqldump (depending on the locking used) can cause replication queues to back up. A solution for this is to use wsrep_desync (added in 5.5.33); there is a blog post on this too - http://www.mysqlperformanceblog.com/2013/10/08/taking-backups-percona-xtradb-cluster-without-stalls-flow-control/
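
A rough sketch of how wsrep_desync (PXC 5.5.33 and later) can wrap a blocking backup, run on the node being dumped; credentials are omitted for brevity:

# Take this node out of cluster flow control so the locks held by the
# long-running dump do not stall the other nodes.
mysql -e "SET GLOBAL wsrep_desync = ON;"

# ... run mysqldump against this node here ...

# Rejoin flow control; the node applies its queued write-sets and catches up.
mysql -e "SET GLOBAL wsrep_desync = OFF;"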

Alekc (alekcander) wrote:

Thanks for the hint (maybe this information should be added somewhere in the documentation :P).

I've also tried to make a mysqldump with the following flags:

# Assumes $MYSQL, $MYSQLDUMP, $MYSQL_USER, $MYSQL_PASSWORD and $BACKUP_DIR are
# defined earlier in the script; grep drops the header line and information_schema.
databases=`$MYSQL --user=$MYSQL_USER -p$MYSQL_PASSWORD -e "SHOW DATABASES;" | grep -Ev "(Database|information_schema)"`

for db in $databases; do
  $MYSQLDUMP --force --single-transaction --add-drop-table --create-options --set-charset --quick --extended-insert --user=$MYSQL_USER -p$MYSQL_PASSWORD --databases $db | gzip > "$BACKUP_DIR/$db.gz"
done

Especially --single-transaction seems to have been the key to successfully dumping about 40 GB of databases without issues (although it was executed directly on one of the nodes).

Alekc (alekcander) wrote:

I spoke too soon. My problem is still present.

Today the cluster went down again while trying to insert into MyISAM tables (unfortunately I can't convert them to InnoDB, since they are needed for full-text searching).

Here you can see a graph representing the number of active MySQL sessions:

https://www.dropbox.com/s/g9yc94vqdne7ffj/Screenshot%202013-10-11%2011.22.07.png

This time I was able to get some more data:

wsrep_local_state_uuid 3980909d-1550-11e3-9811-aecaa033192d
wsrep_protocol_version 4
wsrep_last_committed 30.3 M
wsrep_replicated 883.6 k
wsrep_replicated_bytes 5.8 G
wsrep_received 1.0 M
wsrep_received_bytes 8.1 G
wsrep_local_commits 871.8 k
wsrep_local_cert_failures 37
wsrep_local_bf_aborts 160
wsrep_local_replays 0
wsrep_local_send_queue 99
wsrep_local_send_queue_avg 0
wsrep_local_recv_queue 37
wsrep_local_recv_queue_avg 0
wsrep_flow_control_paused 1
wsrep_flow_control_sent 0
wsrep_flow_control_recv 0
wsrep_cert_deps_distance 4.7 k
wsrep_apply_oooe 0
wsrep_apply_oool 0
wsrep_apply_window 0
wsrep_commit_oooe 0
wsrep_commit_oool 0
wsrep_commit_window 0
wsrep_local_state 4
wsrep_local_state_comment Synced
wsrep_cert_index_size 87.3 k
wsrep_causal_reads 0
wsrep_incoming_addresses 192.168.0.212:3306,192.168.0.213:3306,,,192.168.0.211:3306
wsrep_cluster_conf_id 26
wsrep_cluster_size 5
wsrep_cluster_state_uuid 3980909d-1550-11e3-9811-aecaa033192d
wsrep_cluster_status Primary
wsrep_connected ON
wsrep_local_index 0
wsrep_provider_name Galera
wsrep_provider_vendor Codership Oy <email address hidden>
wsrep_provider_version 2.7(r157)
wsrep_ready ON

In fact, wsrep_flow_control_paused is at 1.

The query involved was:
Preparing for TO isolation INSERT INTO `catalogsearch_fulltext` (`product_id`,`store_id`,`data_index`) VALUES ('642', '1', '25|

Table catalogsearch_fulltext structure:
CREATE TABLE IF NOT EXISTS `catalogsearch_fulltext` (
  `fulltext_id` int(10) unsigned NOT NULL AUTO_INCREMENT COMMENT 'Entity ID',
  `product_id` int(10) unsigned NOT NULL COMMENT 'Product ID',
  `store_id` smallint(5) unsigned NOT NULL COMMENT 'Store ID',
  `data_index` longtext COMMENT 'Data index',
  PRIMARY KEY (`fulltext_id`),
  UNIQUE KEY `UNQ_CATALOGSEARCH_FULLTEXT_PRODUCT_ID_STORE_ID` (`product_id`,`store_id`),
  KEY `catalogsearch_fulltext_idx1` (`store_id`),
  FULLTEXT KEY `FTI_CATALOGSEARCH_FULLTEXT_DATA_INDEX` (`data_index`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COMMENT='Catalog search result table' AUTO_INCREMENT=180949905 ;
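
As a side note, InnoDB supports FULLTEXT indexes as of MySQL 5.6, so once the cluster is on PXC 5.6 this table would no longer have to stay MyISAM. A hypothetical conversion (the schema name "test" is a placeholder):

# 5.6+ only: rebuilds the table as InnoDB while keeping the FULLTEXT index.
mysql -e "ALTER TABLE test.catalogsearch_fulltext ENGINE=InnoDB;"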

To resolve it I had to kill the mysqld process on the node, since when I tried to restart it I received a lot of:
WSREP: BF lock wait long
WSREP: BF lock wait long
WSREP: BF lock wait long
WSREP: BF lock wait long
WSREP: BF lock wait long

Alex Yurchenko (ayurchen) wrote:

So what we have here is at least one node blocked doing something, and this blocks the whole cluster.

This block may be a deadlock, or it may be some operation simply taking too long. If it is the former, all nodes should be totally idle: no CPU or IO load. If it is the latter, at least one of the nodes should display CPU or IO load, or both. This is important to know, as internally these are two vastly different issues.

Unfortunately, the provided data is insufficient to diagnose anything further. To begin with, we need info from ALL nodes, to find out which one is really the culprit. The info includes:

1) output of the following (a small collection script is sketched after this list):
SHOW FULL PROCESSLIST;
SHOW GLOBAL VARIABLES\G
SHOW STATUS LIKE 'wsrep%';
SHOW ENGINE InnoDB STATUS\G

2) a stack dump:
gdb $(which mysqld) -p $(pidof mysqld) -batch -ex "thr apply all bt" > stack.dump

3) error log spanning sufficient time before and after the perceived issue. E.g. it should cover all the time during which there was an increased thread count on the supplied screenshot plus 1-2 hours before that.
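
For convenience, a sketch that gathers the item 1 output on a node into a single file (assumes the mysql client can log in via the local socket; adjust credentials as needed):

# Run on EVERY node; collects the four requested outputs into one file
# named after the host, so the nodes can be compared side by side.
{
  mysql -e "SHOW FULL PROCESSLIST;"
  mysql -e "SHOW GLOBAL VARIABLES\G"
  mysql -e "SHOW STATUS LIKE 'wsrep%';"
  mysql -e "SHOW ENGINE INNODB STATUS\G"
} > "wsrep-diag-$(hostname).txt" 2>&1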

Alex Yurchenko (ayurchen) wrote:

Raghu, what shall we make of this:

wsrep_last_committed 30.3 M

? I don't think it is what we want.

Przemek (pmalkowski) wrote:

I think an important fact missing here is that a *MyISAM* table was involved. Looking at lp:1249991, also submitted by Alekc, I can easily reproduce the problem when I execute an INSERT DELAYED statement.
Tested on both PXC 5.5.39 and 5.6.20. Note the order and node names in the following example:

percona33 mysql> create table myisam1 (id int) engine=myisam;
Query OK, 0 rows affected (0.02 sec)

percona22 mysql> insert into myisam1 values (1);
Query OK, 1 row affected (0.02 sec)

percona33 mysql> select * from myisam1;
+------+
| id   |
+------+
|    1 |
+------+
1 row in set (0.00 sec)

percona33 mysql> insert delayed into myisam1 values (2);
Query OK, 1 row affected, 1 warning (0.00 sec)

percona33 mysql> select * from myisam1;
+------+
| id   |
+------+
|    1 |
|    2 |
+------+
2 rows in set (0.00 sec)

percona33 mysql> show processlist;
+----+-------------+-----------+------+----------------+------+--------------------+------------------+-----------+---------------+
| Id | User        | Host      | db   | Command        | Time | State              | Info             | Rows_sent | Rows_examined |
+----+-------------+-----------+------+----------------+------+--------------------+------------------+-----------+---------------+
|  1 | system user |           | NULL | Sleep          |  149 | committed 12       | NULL             |         0 |             0 |
|  2 | system user |           | NULL | Sleep          |  795 | wsrep aborter idle | NULL             |         0 |             0 |
|  3 | system user |           | NULL | Sleep          |  774 | NULL               | NULL             |         0 |             0 |
|  4 | root        | localhost | test | Sleep          |   15 |                    | NULL             |         2 |             2 |
|  5 | DELAYED     | localhost | test | Delayed insert |  123 | Waiting for INSERT | NULL             |         0 |             0 |
|  6 | root        | localhost | test | Query          |    0 | init               | show processlist |         0 |             0 |
+----+-------------+-----------+------+----------------+------+--------------------+------------------+-----------+---------------+
6 rows in set (0.00 sec)

percona11 mysql> select * from myisam1;
+------+
| id   |
+------+
|    1 |
+------+
1 row in set (0.00 sec)

percona22 mysql> insert into myisam1 values (3);
Query OK, 1 row affected (0.00 sec)

percona22 mysql> insert into myisam1 values (4);
Query OK, 1 row affected (0.00 sec)

percona22 mysql> insert into myisam1 values (5);
Query OK, 1 row affected (0.01 sec)

percona22 mysql> insert into myisam1 values (5);
Query OK, 1 row affected (0.00 sec)

percona22 mysql> insert into myisam1 values (5);
Query OK, 1 row affected (0.00 sec)

percona22 mysql> insert into myisam1 values (5);
Query OK, 1 row affected (0.00 sec)

percona22 mysql> insert into myisam1 values (5);
Query OK, 1 row affected (0.00 sec)

percona22 mysql> insert into myisam1 values (5);
Query OK, 1 row affected (0.00 sec)

percona33 mysql> select * from myisam1;
+------+
| id   |
+------+
|    1 |
|    2 |
+------+
2 rows in set (0.00 sec)

-- after a while the DELAYED query disappears from node 33:

percona33 mysql> show ...


Changed in percona-xtradb-cluster:
status: New → Confirmed
tags: added: i53862
Shahriyar Rzayev (rzayev-sehriyar) wrote:

Percona now uses JIRA for bug reports, so this bug report has been migrated to: https://jira.percona.com/browse/PXC-977
