Hanging slave status and async replication threads

Bug #1308135 reported by Ovais Tariq
This bug affects 1 person
Affects                                             Status   Importance   Assigned to   Milestone
MySQL patches by Codership                          New      Undecided    Unassigned
Percona XtraDB Cluster                              New      Undecided    Unassigned
  (moved to https://jira.percona.com/projects/PXC)

Bug Description

I have a two node cluster running the following version:
Server version: 5.6.15-rel63.0-log 5.6.15-, Percona XtraDB Cluster - binary (GPL), Release: 25.5, Revision 759 wsrep_25.5.r4061, wsrep_25.5.r4061

One of the nodes acts as an async slave of the MySQL 5.5 master. Today, several threads trying to execute "SHOW SLAVE STATUS" hung in the init state, while the slave IO thread and the slave SQL thread also hung. The node is not processing any replication events.

Below is the output of "show processlist"
mysql> show processlist;
+--------+-------------+------------------+--------+---------+--------+----------------------------------------+-------------------+-----------+---------------+
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined |
+--------+-------------+------------------+--------+---------+--------+----------------------------------------+-------------------+-----------+---------------+
| 1 | system user | | NULL | Sleep | 285373 | NULL | NULL | 0 | 0 |
| 2 | system user | | NULL | Sleep | 285373 | wsrep aborter idle | NULL | 0 | 0 |
| 3 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 4 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 5 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 6 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 7 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 8 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 9 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 10 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 11 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 12 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 13 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 14 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 15 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 16 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 17 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 18 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 19 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 20 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 21 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 22 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 23 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 24 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 25 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 26 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 27 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 28 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 29 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 30 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 31 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 32 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 33 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 40 | system user | | NULL | Connect | 284967 | Queueing master event to the relay log | NULL | 0 | 0 |
| 41 | system user | | NULL | Connect | 56761 | wsrep in pre-commit stage | NULL | 0 | 0 |
| 43 | root | localhost | NULL | Query | 53891 | init | SHOW SLAVE STATUS | 0 | 0 |
| 26339 | cacti | xx.xx.xx.xx:52597 | NULL | Query | 53898 | init | SHOW SLAVE STATUS | 0 | 0 |
| 76345 | cacti | xx.xx.xx.xx:59872 | NULL | Query | 28542 | init | SHOW SLAVE STATUS | 0 | 0 |
| 76347 | cacti | xx.xx.xx.xx:59874 | NULL | Query | 28542 | init | SHOW SLAVE STATUS | 0 | 0 |
| 76636 | cacti | xx.xx.xx.xx:59961 | NULL | Query | 28397 | init | SHOW SLAVE STATUS | 0 | 0 |
| 114011 | root | localhost | db | Query | 9446 | init | show slave status | 0 | 0 |
| 123780 | cacti | xx.xx.xx.xx:40094 | NULL | Query | 4500 | init | SHOW SLAVE STATUS | 0 | 0 |
| 123781 | cacti | xx.xx.xx.xx:40095 | NULL | Query | 4500 | init | SHOW SLAVE STATUS | 0 | 0 |
| 128886 | root | localhost | db | Killed | 1884 | init | show slave status | 0 | 0 |
| 128983 | root | localhost | db | Query | 0 | init | show processlist | 0 | 0 |
+--------+-------------+------------------+--------+---------+--------+----------------------------------------+-------------------+-----------+---------------+
45 rows in set (0.00 sec)
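
For reference, the stuck sessions are easier to see without scrolling through the full processlist. A query along these lines against the standard INFORMATION_SCHEMA.PROCESSLIST table (the one-hour threshold is just an illustrative cutoff) lists only the long-stuck threads:

-- sessions that have been in the same state for over an hour
SELECT ID, USER, COMMAND, TIME, STATE, INFO
FROM INFORMATION_SCHEMA.PROCESSLIST
WHERE TIME > 3600
ORDER BY TIME DESC;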

The status variables show that no DML has been executed since the slave hung.
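
One way to confirm that is to sample the InnoDB row-operation counters a few seconds apart; if nothing changes between samples, no rows are being written (a quick generic check, not specific to this bug):

-- sample twice and compare; unchanged values mean no DML is being applied
SHOW GLOBAL STATUS WHERE Variable_name IN
  ('Innodb_rows_inserted', 'Innodb_rows_updated', 'Innodb_rows_deleted');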

The relevant wsrep variables are as follows:
wsrep_cluster_address = gcomm://192.168.0.10,192.168.0.11
wsrep_provider = /usr/local/mysql/lib/libgalera_smm.so
wsrep_slave_threads = 32
wsrep_cluster_name = xyz_cluster
wsrep_sst_method = xtrabackup-v2
wsrep_node_name = node1
wsrep_node_address = 192.168.0.10
wsrep_provider_options = "gcache.size=40G"
wsrep_max_ws_rows = 1048576
wsrep_max_ws_size = 4294901759
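
The write-set limits in effect (which turn out to be relevant to the error log entry further down) can be read back at runtime, for example:

-- configured write-set limits and the full provider options string
SHOW GLOBAL VARIABLES LIKE 'wsrep_max_ws%';
SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options';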

Tags: i38783
Revision history for this message
Ovais Tariq (ovais-tariq) wrote :

Attached is the output of "thread apply all bt"

Revision history for this message
Ovais Tariq (ovais-tariq) wrote :

Attached is the output of "thread apply all bt full"

Revision history for this message
Ovais Tariq (ovais-tariq) wrote :

At the time this happened I also saw the entry below in the error log, pointing to a write set that hit the write-set size limit:

2014-04-15 02:50:01 4631 [ERROR] WSREP: Maximum wirteset size exceeded by 1290961022: 90 (Message too long)
    at galera/src/write_set_ng.hpp:check_size():652
2014-04-15 02:50:01 4631 [ERROR] WSREP: unknown connection failure
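
This message is emitted when a single transaction's replication write set exceeds the provider's maximum write-set size. A minimal sketch of the pattern that keeps write sets small (table name, predicate, and batch size are hypothetical, not from this report) is to break large changes into bounded commits:

-- hypothetical example: purge in bounded batches instead of one huge DELETE,
-- so each commit produces a write set well under wsrep_max_ws_size / wsrep_max_ws_rows
DELETE FROM big_table WHERE created_at < '2014-01-01' LIMIT 10000;
SELECT ROW_COUNT();  -- repeat from the client until this returns 0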

Revision history for this message
Ovais Tariq (ovais-tariq) wrote :

Looking at the InnoDB status shows one big transaction, with more than 3.5 million undo log entries, stuck in the pre-commit stage:

---TRANSACTION 371192952674, ACTIVE 58170 sec
120727 lock struct(s), heap size 10762680, 3580305 row lock(s), undo log entries 3572731
MySQL thread id 41, OS thread handle 0x2ae998e37700, query id 1185851068 wsrep in pre-commit stage
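
Transactions of this size can also be spotted directly from INFORMATION_SCHEMA (standard 5.6 tables; a quick sketch):

-- active InnoDB transactions, largest modifiers first
SELECT trx_mysql_thread_id, trx_state, trx_started,
       trx_rows_modified, trx_lock_structs, trx_rows_locked
FROM INFORMATION_SCHEMA.INNODB_TRX
ORDER BY trx_rows_modified DESC;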

Revision history for this message
Alex Yurchenko (ayurchen) wrote :
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1667
