Percona Server moved to https://jira.percona.com/projects/PS

Innodb crashes during truncation of very huge table due to long semaphore wait and unable to recover afterwards

Bug #1633869 reported by Arturas Moskvinas on 2016-10-16

This bug report is a duplicate of: Bug #1241657: Truncate table server stall. Edit Remove

This bug affects 1 person

Affects		Status	Importance	Assigned to	Milestone
	Percona Server moved to https://jira.percona.com/projects/PS	New	Undecided	Unassigned

Bug Description

We were truncating very dense and large innodb table, ~3 billion records, 450gb in size. This table is use very frequently so during trucation there were a lot of metadata locks on table. Truncation was executed on master which executed it as expected without failure but immediate slave crashed afterwards, heavy read activity is mostly happening on slave. In the beginning innodb complained with such warning:

```
2016-10-11 18:07:36 7ffaa19c3700 InnoDB: ERROR: the age of the last checkpoint is 1932736089,
InnoDB: which exceeds the log group capacity 1932731597.
InnoDB: If you are using big BLOB or TEXT rows, you must set the
InnoDB: combined size of log files at least 10 times bigger than the
InnoDB: largest such row.
2016-10-11 18:07:52 7ffaa19c3700 InnoDB: ERROR: the age of the last checkpoint is 2168317344,
InnoDB: which exceeds the log group capacity 1932731597.
InnoDB: If you are using big BLOB or TEXT rows, you must set the
InnoDB: combined size of log files at least 10 times bigger than the
InnoDB: largest such row.
```
then warned about long semaphore wait:
```
InnoDB: Warning: a long semaphore wait:
--Thread 140611944437504 has waited at srv0srv.cc line 2653 for 241.00 seconds the semaphore:
X-lock on RW-latch at 0x133c180 '&dict_operation_lock'
a writer (thread id 140714429921024) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file row0purge.cc line 770
Last time write locked in file /mnt/workspace/percona-server-5.6-debian-binary/label_exp/debian-jessie-64bit/percona-server-5.6-5.6.31-77.0/storage/innobase/row/row0mysql.cc line 3309
InnoDB: ###### Starts InnoDB Monitor for 30 secs to print diagnostic info:
InnoDB: Pending preads 0, pwrites 0

=====================================
2016-10-11 18:10:31 7fe2c57fe700 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 14 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 5786999 srv_active, 0 srv_shutdown, 17326 srv_idle
srv_master_thread log flush and writes: 5804243
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 211326465
--Thread 140611944437504 has waited at srv0srv.cc line 2653 for 249.00 seconds the semaphore:
X-lock on RW-latch at 0x133c180 '&dict_operation_lock'
a writer (thread id 140714429921024) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file row0purge.cc line 770
Last time write locked in file /mnt/workspace/percona-server-5.6-debian-binary/label_exp/debian-jessie-64bit/percona-server-5.6-5.6.31-77.0/storage/innobase/row/row0mysql.cc line 3309
OS WAIT ARRAY INFO: signal count 898318718
Mutex spin waits 380720421, rounds 1968086313, OS waits 49624479
RW-shared spins 1323524916, rounds 8197195339, OS waits 132052227
RW-excl spins 44618552, rounds 1780713384, OS waits 16836802
Spin rounds per wait: 5.17 mutex, 6.19 RW-shared, 39.91 RW-excl
------------
TRANSACTIONS
------------
...
Lots of long transactions in progress or in metadata lock state...
```

Such warning repeated at least couple of times and finally DB crashed when semaphore wait reached >600s:
```
2016-10-11 18:22:07 7fe2bf582700 InnoDB: ERROR: the age of the last checkpoint is 3367022445,
InnoDB: which exceeds the log group capacity 1932731597.
InnoDB: If you are using big BLOB or TEXT rows, you must set the
InnoDB: combined size of log files at least 10 times bigger than the
InnoDB: largest such row.

=====================================
2016-10-11 18:22:07 7fe2c57fe700 INNODB MONITOR OUTPUT
=====================================
Per second averages calculated from the last 4 seconds
-----------------
BACKGROUND THREAD
-----------------
srv_master_thread loops: 5786999 srv_active, 0 srv_shutdown, 17326 srv_idle
srv_master_thread log flush and writes: 5804243
----------
SEMAPHORES
----------
OS WAIT ARRAY INFO: reservation count 211330118
--Thread 140611944437504 has waited at srv0srv.cc line 2653 for 945.00 seconds the semaphore:
X-lock on RW-latch at 0x133c180 '&dict_operation_lock'
a writer (thread id 140714429921024) has reserved it in mode exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file row0purge.cc line 770
Last time write locked in file /mnt/workspace/percona-server-5.6-debian-binary/label_exp/debian-jessie-64bit/percona-
server-5.6-5.6.31-77.0/storage/innobase/row/row0mysql.cc line 3309
OS WAIT ARRAY INFO: signal count 898322054
Mutex spin waits 380768612, rounds 1968301761, OS waits 49628050
RW-shared spins 1323525499, rounds 8197195971, OS waits 132052231
RW-excl spins 44618828, rounds 1780718830, OS waits 16836822
Spin rounds per wait: 5.17 mutex, 6.19 RW-shared, 39.91 RW-excl
...
END OF INNODB MONITOR OUTPUT
============================
InnoDB: ###### Diagnostic info printed to the standard error stream
InnoDB: Error: semaphore wait has lasted > 600 seconds
InnoDB: We intentionally crash the server, because it appears to be hung.
2016-10-11 18:22:15 7fe2debff700 InnoDB: Assertion failure in thread 140612376459008 in file srv0srv.cc line 2148
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.6/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
18:22:15 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona Server better by reporting any
bugs at http://bugs.percona.com/

key_buffer_size=16777216
read_buffer_size=131072
max_used_connections=2260
max_threads=4098
thread_count=787
connection_count=785
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1646765 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x30000
/usr/sbin/mysqld(my_print_stacktrace+0x2c)[0x8bd26c]
/usr/sbin/mysqld(handle_fatal_signal+0x469)[0x647d49]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf8f0)[0x7ffaf01be8f0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37)[0x7ffaee0d4077]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x148)[0x7ffaee0d5458]
/usr/sbin/mysqld[0xa24034]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x80ca)[0x7ffaf01b70ca]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7ffaee1878ad]
You may download the Percona Server operations manual by visiting
http://www.percona.com/software/percona-server/. You may find information
in the manual which will help you identify the cause of the crash.

```

After crash - innodb was unable to recover and we had to resnapshot from master. It seems like innodb data file was actually deleted by truncated but it was never recorded in innodb data dictionary.
```
2016-10-11 18:22:24 60933 [ERROR] InnoDB: In file './redacted/redacted.ibd', tablespace id and flags are 1973550 and 0, but in the InnoDB data dictionary they are 1973535 and 0. Have you moved InnoDB .ibd files around without using the commands DISCARD TABLESPACE and IMPORT TABLESPACE? Please refer to http://dev.mysql.com/doc/refman/5.6/en/innodb-troubleshooting-datadict.html for how to resolve the issue.
2016-10-11 18:22:24 7f3e4b929700 InnoDB: Operating system error number 2 in a file operation.
InnoDB: The error means the system cannot find the path specified.
2016-10-11 18:22:24 60933 [ERROR] InnoDB: Could not find a valid tablespace file for 'redacted/redacted'. See http://dev.mysql.com/doc/refman/5.6/en/innodb-troubleshooting-datadict.html for how to resolve the issue.
2016-10-11 18:22:24 7f3e4b929700 InnoDB: cannot calculate statistics for table "redacted"."redacted" because the .ibd file is missing. For help, please refer to http://dev.mysql.com/doc/refman/5.6/en/innodb-troubleshooting.html
2016-10-11 18:22:24 60933 [ERROR] Slave SQL: Error 'Table 'redacted.redacter' doesn't exist' on query. Default database: 'redacted'. Query: 'truncate table redacted', Error_code: 1146
2016-10-11 18:22:24 60933 [Warning] Slave: InnoDB: Tablespace is missing for table 'redacted/redacted' Error_code: 1812
2016-10-11 18:22:24 60933 [Warning] Slave: Table 'redacted.redacted' doesn't exist Error_code: 1146
2016-10-11 18:22:24 60933 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'mysql-bin.020690' position 965743976
2016-10-11 18:22:24 7f3dfcd37700 InnoDB: Error: page 1712910 log sequence number 25579950139333
```

At the moment we can't reproduce issue on testing server. But we will keep trying to reproduce it. If some extra information will help you diagnose

Table structure:
```
CREATE TABLE `large_table` (
  `primary_key_ms` bigint(20) NOT NULL AUTO_INCREMENT,
  `uuid1` binary(16) NOT NULL,
  `uuid2` binary(16) NOT NULL,
  PRIMARY KEY (`primary_key_ms`,`uuid1`,`uuid2`),
  KEY `idx` (`uuid1`,`uuid2`)
) ENGINE=InnoD
```