Hanging slave status and async replication threads

Bug #1308135 reported by Ovais Tariq
This bug affects 1 person
Affects                                             Status   Importance   Assigned to   Milestone
MySQL patches by Codership                          New      Undecided    Unassigned
Percona XtraDB Cluster                              New      Undecided    Unassigned
  (moved to https://jira.percona.com/projects/PXC)

Bug Description

I have a two node cluster running the following version:
Server version: 5.6.15-rel63.0-log 5.6.15-, Percona XtraDB Cluster - binary (GPL), Release: 25.5, Revision 759 wsrep_25.5.r4061, wsrep_25.5.r4061

One of the nodes acts as an async slave of the MySQL 5.5 master. Today, several threads trying to execute "SHOW SLAVE STATUS" hung in the init state, while the slave IO thread and the slave SQL thread also hung. The node is not processing any replication events.

Below is the output of "show processlist"
mysql> show processlist;
+--------+-------------+------------------+--------+---------+--------+----------------------------------------+-------------------+-----------+---------------+
| Id | User | Host | db | Command | Time | State | Info | Rows_sent | Rows_examined |
+--------+-------------+------------------+--------+---------+--------+----------------------------------------+-------------------+-----------+---------------+
| 1 | system user | | NULL | Sleep | 285373 | NULL | NULL | 0 | 0 |
| 2 | system user | | NULL | Sleep | 285373 | wsrep aborter idle | NULL | 0 | 0 |
| 3 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 4 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 5 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 6 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 7 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 8 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 9 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 10 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 11 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 12 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 13 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 14 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 15 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 16 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 17 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 18 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 19 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 20 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 21 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 22 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 23 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 24 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 25 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 26 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 27 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 28 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 29 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 30 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 31 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 32 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 33 | system user | | NULL | Sleep | 285098 | NULL | NULL | 0 | 0 |
| 40 | system user | | NULL | Connect | 284967 | Queueing master event to the relay log | NULL | 0 | 0 |
| 41 | system user | | NULL | Connect | 56761 | wsrep in pre-commit stage | NULL | 0 | 0 |
| 43 | root | localhost | NULL | Query | 53891 | init | SHOW SLAVE STATUS | 0 | 0 |
| 26339 | cacti | xx.xx.xx.xx:52597 | NULL | Query | 53898 | init | SHOW SLAVE STATUS | 0 | 0 |
| 76345 | cacti | xx.xx.xx.xx:59872 | NULL | Query | 28542 | init | SHOW SLAVE STATUS | 0 | 0 |
| 76347 | cacti | xx.xx.xx.xx:59874 | NULL | Query | 28542 | init | SHOW SLAVE STATUS | 0 | 0 |
| 76636 | cacti | xx.xx.xx.xx:59961 | NULL | Query | 28397 | init | SHOW SLAVE STATUS | 0 | 0 |
| 114011 | root | localhost | db | Query | 9446 | init | show slave status | 0 | 0 |
| 123780 | cacti | xx.xx.xx.xx:40094 | NULL | Query | 4500 | init | SHOW SLAVE STATUS | 0 | 0 |
| 123781 | cacti | xx.xx.xx.xx:40095 | NULL | Query | 4500 | init | SHOW SLAVE STATUS | 0 | 0 |
| 128886 | root | localhost | db | Killed | 1884 | init | show slave status | 0 | 0 |
| 128983 | root | localhost | db | Query | 0 | init | show processlist | 0 | 0 |
+--------+-------------+------------------+--------+---------+--------+----------------------------------------+-------------------+-----------+---------------+
45 rows in set (0.00 sec)
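
For reference, the stuck sessions are easier to see without scrolling through the full processlist. A query along these lines against the standard INFORMATION_SCHEMA.PROCESSLIST table (the one-hour threshold is just an illustrative cutoff) lists only the long-stuck threads:

-- sessions that have been in the same state for over an hour
SELECT ID, USER, COMMAND, TIME, STATE, INFO
FROM INFORMATION_SCHEMA.PROCESSLIST
WHERE TIME > 3600
ORDER BY TIME DESC;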

The status variables show that no DML has been executed since the slave hung.
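
One way to confirm that is to sample the InnoDB row-operation counters a few seconds apart; if nothing changes between samples, no rows are being written (a quick generic check, not specific to this bug):

-- sample twice and compare; unchanged values mean no DML is being applied
SHOW GLOBAL STATUS WHERE Variable_name IN
  ('Innodb_rows_inserted', 'Innodb_rows_updated', 'Innodb_rows_deleted');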

The relevant wsrep variables are as follows:
wsrep_cluster_address = gcomm://192.168.0.10,192.168.0.11
wsrep_provider = /usr/local/mysql/lib/libgalera_smm.so
wsrep_slave_threads = 32
wsrep_cluster_name = xyz_cluster
wsrep_sst_method = xtrabackup-v2
wsrep_node_name = node1
wsrep_node_address = 192.168.0.10
wsrep_provider_options = "gcache.size=40G"
wsrep_max_ws_rows = 1048576
wsrep_max_ws_size = 4294901759
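
The write-set limits in effect (which turn out to be relevant to the error log entry further down) can be read back at runtime, for example:

-- configured write-set limits and the full provider options string
SHOW GLOBAL VARIABLES LIKE 'wsrep_max_ws%';
SHOW GLOBAL VARIABLES LIKE 'wsrep_provider_options';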

Tags: i38783
Revision history for this message
Ovais Tariq (ovais-tariq) wrote :

Attached is the output of "thread apply all bt"

Revision history for this message
Ovais Tariq (ovais-tariq) wrote :

Attached is the output of "thread apply all bt full"

Revision history for this message
Ovais Tariq (ovais-tariq) wrote :

At the time this happened I also saw the entry below in the error log, pointing to a write set that hit the write-set size limit:

2014-04-15 02:50:01 4631 [ERROR] WSREP: Maximum wirteset size exceeded by 1290961022: 90 (Message too long)
    at galera/src/write_set_ng.hpp:check_size():652
2014-04-15 02:50:01 4631 [ERROR] WSREP: unknown connection failure
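
This message is emitted when a single transaction's replication write set exceeds the provider's maximum write-set size. A minimal sketch of the pattern that keeps write sets small (table name, predicate, and batch size are hypothetical, not from this report) is to break large changes into bounded commits:

-- hypothetical example: purge in bounded batches instead of one huge DELETE,
-- so each commit produces a write set well under wsrep_max_ws_size / wsrep_max_ws_rows
DELETE FROM big_table WHERE created_at < '2014-01-01' LIMIT 10000;
SELECT ROW_COUNT();  -- repeat from the client until this returns 0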

Revision history for this message
Ovais Tariq (ovais-tariq) wrote :

Looking at the InnoDB status shows one big transaction, with more than 3.5 million undo log entries, stuck in the pre-commit stage:

---TRANSACTION 371192952674, ACTIVE 58170 sec
120727 lock struct(s), heap size 10762680, 3580305 row lock(s), undo log entries 3572731
MySQL thread id 41, OS thread handle 0x2ae998e37700, query id 1185851068 wsrep in pre-commit stage
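
Transactions of this size can also be spotted directly from INFORMATION_SCHEMA (standard 5.6 tables; a quick sketch):

-- active InnoDB transactions, largest modifiers first
SELECT trx_mysql_thread_id, trx_state, trx_started,
       trx_rows_modified, trx_lock_structs, trx_rows_locked
FROM INFORMATION_SCHEMA.INNODB_TRX
ORDER BY trx_rows_modified DESC;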

Revision history for this message
Alex Yurchenko (ayurchen) wrote :
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1667
