5.7.14 Node hangs while executing "Drop database"

Bug #1648015 reported by Adrian Słowik
Affects:     Percona XtraDB Cluster (moved to https://jira.percona.com/projects/PXC)
Status:      Incomplete
Importance:  Undecided
Assigned to: Unassigned
Milestone:   (none set)

Bug Description

A PXC 5.7.14 cluster was created with db1, db2, and garbd as arbitrator.
During a DROP DATABASE statement the cluster hung and killed itself.
(Each DROP DATABASE statement was issued one at a time, and the bug occurred several times.)
We hit this bug several times and migrated the databases to PXC 5.6.

Now, with only db1 and garbd as arbitrator, I tried to reproduce the problem (there is no second node to replicate to, only the arbitrator). I created 3 loops to drop databases in parallel, and the node hung:

There are 7,000 databases in the cluster with 230 tables each, for a total of 1,610,000 tables.
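The drop loops themselves were not posted; a minimal sketch of how such parallel loops might look (the database-name prefixes s/z/t match the process list, but the gen_drops helper, counts, and file names are illustrative):

```shell
# Hypothetical sketch: generate DROP DATABASE statements per prefix;
# each file would then be fed to mysql in its own background loop.
gen_drops() {
  prefix="$1"; count="$2"
  for i in $(seq 1 "$count"); do
    printf 'DROP DATABASE %s%07d;\n' "$prefix" "$i"
  done > "drop_${prefix}.sql"
}
for p in s z t; do gen_drops "$p" 3; done
# Running the three streams in parallel (needs a live server):
# for p in s z t; do mysql < "drop_${p}.sql" & done; wait
```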

Process list:

| Id | User | Command | Time | State | Info |
| 1 | system user | Sleep | 162439 | wsrep: aborter idle | NULL |
| 2 | system user | Sleep | 75375 | wsrep: applier idle | NULL |
| 8218611 | root | Query | 843 | System lock | Drop database s0017522 |
| 8218614 | root | Query | 842 | wsrep: preparing for TO isolation | Drop database z0004779 |
| 8218622 | root | Query | 833 | wsrep: preparing for TO isolation | Drop database t0000391 |
| 8218666 | root | Query | 0 | starting | show full processlist |

Log file:

2016-12-07T09:37:16.688769Z 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
2016-12-07 10:37:16 0x7f61866e1700 InnoDB: Assertion failure in thread 140056843917056 in file ut0ut.cc line 917
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
09:37:16 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.
Please help us make Percona XtraDB Cluster better by reporting any
bugs at https://bugs.launchpad.net/percona-xtradb-cluster

key_buffer_size=8388608
read_buffer_size=262144
max_used_connections=1301
max_threads=1301
thread_count=6
connection_count=4
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1025588 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x40000
/opt/mysql/bin/mysqld(my_print_stacktrace+0x47)[0x562f40c89e08]
/opt/mysql/bin/mysqld(handle_fatal_signal+0x46c)[0x562f4047cd27]
/lib64/libpthread.so.0(+0x10ec0)[0x7f73029feec0]
/lib64/libc.so.6(gsignal+0x3b)[0x7f7301e475eb]
/lib64/libc.so.6(abort+0x180)[0x7f7301e48a10]
/opt/mysql/bin/mysqld(+0x62726f)[0x562f4043f26f]
/opt/mysql/bin/mysqld(_ZN2ib5fatalD1Ev+0x6f)[0x562f40efcaaf]
/opt/mysql/bin/mysqld(srv_error_monitor_thread+0xeea)[0x562f40e98236]
/lib64/libpthread.so.0(+0x74d9)[0x7f73029f54d9]
/lib64/libc.so.6(clone+0x6d)[0x7f7301f01a7d]
You may download the Percona XtraDB Cluster operations manual by visiting
http://www.percona.com/software/percona-xtradb-cluster/. You may find information
in the manual which will help you identify the cause of the crash.

Some data from monitoring:
- a lot of disk write operations (1230 ops/sec)
- a lot of innodb data fsyncs (491/sec)
- a lot of innodb pages read (1580/sec)
- a lot of opened files (729/sec)

Revision history for this message
Krunal Bauskar (krunal-bauskar) wrote :

An initial look at the process list suggests that a DROP is in progress.
The other two threads are waiting, because DROP/DDL actions are serialized in Galera.
(This is the classic Galera blocking-DDL issue.)

----------------------------------------------------------------------

The server got killed because an InnoDB semaphore wait exceeded the configured threshold:

2016-12-07T09:37:16.688769Z 0 [ERROR] [FATAL] InnoDB: Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.

----------------------------------------------------------------------

Did this workload work without any waits on PXC 5.6?

Revision history for this message
Adrian Słowik (slowik-adrian) wrote :

The server is not doing anything (it is just waiting on a mutex); see https://launchpadlibrarian.net/296964684/pm05-single.png. There is no "actual work done" for 600 s: just one peak, then the mutex lock.

On PXC 5.6 there is no delay; everything works perfectly.

Revision history for this message
Krunal Bauskar (krunal-bauskar) wrote :

I believe reproducing it is the only way to understand this better, as I cannot imagine any flow that could lead to this issue.

Do you have (or can you create) a short, reproducible test case that you can share?

Revision history for this message
Adrian Słowik (slowik-adrian) wrote :

Try creating a lot of databases (7,000 with 230 tables each), then drop them in parallel to speed up the effect. I can provide an SQL dump of an example database.

What is more, after mysqld killed itself I was unable to start it again; it hangs after recovery (log in attachment).
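The exact schema was not shared; below is a scaled-down sketch of building such a fixture (the table definition and the d%04d/t%03d naming are assumptions; raise N and M toward 7000 and 230 to approach the reported scale):

```shell
# Hypothetical sketch: emit CREATE statements for N databases with M
# tables each; piping the output to mysql builds the test fixture.
N=2 M=3   # the report used N=7000, M=230
for d in $(seq 1 "$N"); do
  printf 'CREATE DATABASE d%04d;\n' "$d"
  for t in $(seq 1 "$M"); do
    printf 'CREATE TABLE d%04d.t%03d (id INT PRIMARY KEY, v VARCHAR(64)) ENGINE=InnoDB;\n' "$d" "$t"
  done
done > create_all.sql
# mysql < create_all.sql   # needs a live server
```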

Revision history for this message
Adrian Słowik (slowik-adrian) wrote :

The other node (the one I lost during production traffic) also refused to start in the same way.

Revision history for this message
Adrian Słowik (slowik-adrian) wrote :

Maybe there's something wrong with the page cleaner refactoring in MySQL 5.7:

Log info:
2016-12-08T07:33:32.381816Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 875364ms. The settings might not be optimal. (flushed=0, during the time.)
2016-12-08T07:33:32.382092Z 0 [Note] InnoDB: Waiting for purge to start
<freezes here>

Also, while dropping databases there are a lot of dirty pages, so maybe the page cleaner is not tuned to handle this number of databases/tables.

Change in 5.7:
http://dev.mysql.com/worklog/task/?id=6642

Similar Problem:
https://bugs.mysql.com/bug.php?id=76661

Revision history for this message
Adrian Słowik (slowik-adrian) wrote :

Setting innodb_buffer_pool_load_at_startup=0 allowed mysqld to restart.
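The workaround can be made persistent in my.cnf (a sketch; the variable name is the real MySQL 5.7 option, the section placement is standard):

```ini
[mysqld]
# Workaround from this report: skip reloading the previously dumped
# buffer pool contents at startup, so recovery does not hang while
# repopulating a very large warm pool.
innodb_buffer_pool_load_at_startup = 0
```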

Revision history for this message
Adrian Słowik (slowik-adrian) wrote :

Dropping databases after a restart (empty buffer pool) works fine; no delays.

Revision history for this message
Adrian Słowik (slowik-adrian) wrote :

When I fill the buffer pool and then start dropping tables, mysqld freezes after several drops.

Monitoring of buffer pool data during the process:

[[root@pm05 :: (none)]] > show global status like "%buffer_pool_bytes%";
+--------------------------------+-------------+
| Variable_name | Value |
+--------------------------------+-------------+
| Innodb_buffer_pool_bytes_data | 60038922240 |
| Innodb_buffer_pool_bytes_dirty | 372752384 |
+--------------------------------+-------------+
2 rows in set (0.00 sec)

[[root@pm05 :: (none)]] > show global status like "%buffer_pool_bytes%";
+--------------------------------+-------------+
| Variable_name | Value |
+--------------------------------+-------------+
| Innodb_buffer_pool_bytes_data | 60038922240 |
| Innodb_buffer_pool_bytes_dirty | 339984384 |
+--------------------------------+-------------+
2 rows in set (0.00 sec)

[[root@pm05 :: (none)]] > show global status like "%buffer_pool_bytes%";
+--------------------------------+-------------+
| Variable_name | Value |
+--------------------------------+-------------+
| Innodb_buffer_pool_bytes_data | 60038922240 |
| Innodb_buffer_pool_bytes_dirty | 325713920 |
+--------------------------------+-------------+
2 rows in set (0.00 sec)

[[root@pm05 :: (none)]] > show global status like "%buffer_pool_bytes%";
+--------------------------------+-------------+
| Variable_name | Value |
+--------------------------------+-------------+
| Innodb_buffer_pool_bytes_data | 60038922240 |
| Innodb_buffer_pool_bytes_dirty | 288768000 |
+--------------------------------+-------------+
2 rows in set (0.01 sec)

[[root@pm05 :: (none)]] > show global status like "%buffer_pool_bytes%";
+--------------------------------+-------------+
| Variable_name | Value |
+--------------------------------+-------------+
| Innodb_buffer_pool_bytes_data | 60038922240 |
| Innodb_buffer_pool_bytes_dirty | 288768000 |
+--------------------------------+-------------+
2 rows in set (0.00 sec)

Looks like it's not a PXC issue, but a MySQL 5.7 bug.

Revision history for this message
Kenn Takara (kenn-takara) wrote :

Hi Adrian,

I've been trying to repro this with PXC 5.7 (on a smaller scale though) on a 2-node cluster.

I've tried it with 180 databases (each with 140 tables).
I then start up two scripts (one for each node), each deleting half of the databases.

Haven't been able to hang the cluster yet.

Is this similar to your scenario? If you can provide more info on your setup (OS info and example scripts), that would be great.

Thanks,
Kenn

Revision history for this message
Adrian Słowik (slowik-adrian) wrote :

You need to warm up the buffer pool first (SELECT * from all the tables you created).
Start dropping tables when the buffer pool is full; in my case that is about 56 GB.

"When I fill the buffer pool and then start dropping tables, mysql freezes"

I think there's something wrong with page cleaner refactoring in mysql 5.7

Dropping databases with empty buffer pool (just after mysql starts) works fine.

I will attach SQL scripts to recreate the databases (each one is an exact clone).
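A minimal sketch of generating such a warm-up pass (the emit_warmup helper and the sample names are hypothetical; in practice the db/table list would come from information_schema.tables):

```shell
# Hypothetical sketch: given "db table" pairs on stdin, emit one
# full-table scan per table; feeding the result to mysql (and
# discarding the rows) pulls each table's pages into the buffer pool.
emit_warmup() {
  while read -r db tbl; do
    printf 'SELECT * FROM `%s`.`%s`;\n' "$db" "$tbl"
  done
}
printf 's0017522 t001\ns0017522 t002\n' | emit_warmup > warmup.sql
# mysql -N < warmup.sql > /dev/null   # needs a live server
```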

Revision history for this message
Adrian Słowik (slowik-adrian) wrote :
  • data (335.6 KiB, application/x-sql)
Revision history for this message
Krunal Bauskar (krunal-bauskar) wrote :

Based on your analysis, it sounds like the issue is with a warmed-up buffer pool, which should then be present even in Percona Server. Can you try your test case on PS? If it reproduces there, the PS team can look at it and fix it in the PS code directly; PXC will inherit the fix.

Changed in percona-xtradb-cluster:
status: New → Incomplete
Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1943
