process lock on start results in db failure
Affects | Status | Importance | Assigned to | Milestone
---|---|---|---|---
Percona-XtraDB | New | Undecided | Unassigned |
Bug Description
I've installed Percona Server 5.7.16-10 and imported a lot of DBs.
The filesystem is XFS to allow for online growth; I've never had a problem with it before. The OS is Ubuntu 14.04.5 LTS. Date and time are correct on both slave and master.
The machine has 8 CPUs and 28 GB of RAM, in Azure. /proc/cpuinfo for the last processor:
processor : 7
vendor_id : GenuineIntel
cpu family : 6
model : 63
model name : Intel(R) Xeon(R) CPU E5-2673 v3 @ 2.40GHz
stepping : 2
microcode : 0xffffffff
cpu MHz : 2397.211
cache size : 30720 KB
physical id : 0
siblings : 8
core id : 7
cpu cores : 8
apicid : 7
initial apicid : 7
fpu : yes
fpu_exception : yes
cpuid level : 13
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx lm constant_tsc rep_good nopl eagerfpu pni pclmulqdq ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm abm fsgsbase bmi1 avx2 smep bmi2 erms xsaveopt
bugs :
bogomips : 4794.42
clflush size : 64
cache_alignment : 64
address sizes : 42 bits physical, 48 bits virtual
power management:
strace log
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[pid 5202] open(".
[ no more progress here ]
When running strace -f -p <pid>, the open() calls above repeat and then stop, with no further progress into opening, scanning or processing, and CPU usage drops. The first 8000 databases scan very quickly, then the rate drops suddenly to about one file operation per second, and then nothing happens beyond the last line above.
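To make it easier to see which threads are still issuing calls once the scan stalls, the strace output could be tallied per pid and syscall. The following is only a sketch: the function name tally_syscalls is made up for illustration, and it assumes the "[pid N] syscall(" line format shown in this report. Lines of the form "<... name resumed>" are skipped so that a call split across two lines is counted once.

```python
import re
from collections import Counter

# Matches the start of a call as printed by strace -f, e.g.
#   [pid 55067] nanosleep({0, 10000000}, NULL) = 0
# "<... name resumed>" lines do not match, so split calls are counted once.
CALL_START = re.compile(r"^\[pid\s+(\d+)\]\s+(\w+)\(")


def tally_syscalls(lines):
    """Return a Counter keyed by (pid, syscall name)."""
    counts = Counter()
    for line in lines:
        match = CALL_START.match(line)
        if match:
            counts[(int(match.group(1)), match.group(2))] += 1
    return counts
```

Passing sys.stdin to tally_syscalls while piping strace output into the script, then printing counts.most_common(), would show whether any thread is still calling open() or whether only sleeps and io_getevents remain.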
After the process stops opening files, the strace output for the pid is as below, repeated (the pid differs but the content is *exactly* the same):
[pid 55080] <... gettimeofday resumed> {1480683333, 879871}, NULL) = 0
[pid 55080] nanosleep({0, 999000000}, <unfinished ...>
[pid 55067] <... nanosleep resumed> NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 1000}, NULL) = 0
[pid 55067] nanosleep({0, 2000}, NULL) = 0
[pid 55067] nanosleep({0, 4000}, NULL) = 0
[pid 55067] nanosleep({0, 8000}, NULL) = 0
[pid 55067] nanosleep({0, 16000}, NULL) = 0
[pid 55067] nanosleep({0, 32000}, NULL) = 0
[pid 55067] nanosleep({0, 64000}, NULL) = 0
[pid 55067] nanosleep({0, 128000}, NULL) = 0
[pid 55067] nanosleep({0, 256000}, NULL) = 0
[pid 55067] nanosleep({0, 512000}, NULL) = 0
[pid 55067] nanosleep({0, 1024000}, NULL) = 0
[pid 55067] nanosleep({0, 2048000}, NULL) = 0
[pid 55067] nanosleep({0, 4096000}, NULL) = 0
[pid 55067] nanosleep({0, 8192000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 1000}, NULL) = 0
[pid 55067] nanosleep({0, 2000}, NULL) = 0
[pid 55067] nanosleep({0, 4000}, NULL) = 0
[pid 55067] nanosleep({0, 8000}, NULL) = 0
[pid 55067] nanosleep({0, 16000}, NULL) = 0
[pid 55067] nanosleep({0, 32000}, NULL) = 0
[pid 55067] nanosleep({0, 64000}, NULL) = 0
[pid 55067] nanosleep({0, 128000}, NULL) = 0
[pid 55067] nanosleep({0, 256000}, NULL) = 0
[pid 55067] nanosleep({0, 512000}, NULL) = 0
[pid 55067] nanosleep({0, 1024000}, NULL) = 0
[pid 55067] nanosleep({0, 2048000}, NULL) = 0
[pid 55067] nanosleep({0, 4096000}, NULL) = 0
[pid 55067] nanosleep({0, 8192000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, <unfinished ...>
[pid 55077] <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 55077] io_getevents(
[pid 55078] <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 55078] io_getevents(
[pid 55076] <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 55076] io_getevents(
[pid 55075] <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 55075] io_getevents(
[pid 55074] <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 55074] io_getevents(
[pid 55073] <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 55073] io_getevents(
[pid 55072] <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 55072] io_getevents(
[pid 55071] <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 55071] io_getevents(
[pid 55070] <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 55070] io_getevents(
[pid 55069] <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 55069] io_getevents(
[pid 55067] <... nanosleep resumed> NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 1000}, NULL) = 0
[pid 55067] nanosleep({0, 2000}, NULL) = 0
[pid 55067] nanosleep({0, 4000}, NULL) = 0
[pid 55067] nanosleep({0, 8000}, NULL) = 0
[pid 55067] nanosleep({0, 16000}, NULL) = 0
[pid 55067] nanosleep({0, 32000}, NULL) = 0
[pid 55067] nanosleep({0, 64000}, NULL) = 0
[pid 55067] nanosleep({0, 128000}, NULL) = 0
[pid 55067] nanosleep({0, 256000}, NULL) = 0
[pid 55067] nanosleep({0, 512000}, NULL) = 0
[pid 55067] nanosleep({0, 1024000}, NULL) = 0
[pid 55067] nanosleep({0, 2048000}, NULL) = 0
[pid 55067] nanosleep({0, 4096000}, NULL) = 0
[pid 55067] nanosleep({0, 8192000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
[pid 55067] nanosleep({0, 10000000}, NULL) = 0
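The nanosleep arguments from pid 55067 follow a recognisable schedule: the delay starts at 1000 ns, doubles up to 8192000 ns, then stays at 10000000 ns (10 ms) until it restarts at 1000 ns. That looks like a thread polling with capped exponential backoff while it waits for something; which code path produces it is not established here. A minimal sketch that reproduces one cycle of the schedule (the function name and parameters are illustrative, with the start and cap values read off the trace above):

```python
def backoff_delays_ns(start_ns=1_000, cap_ns=10_000_000, count=20):
    """Return `count` sleep durations: doubling from start_ns, capped at cap_ns."""
    delays = []
    delay = start_ns
    for _ in range(count):
        delays.append(delay)
        # Double the delay for the next round, but never exceed the cap.
        delay = min(delay * 2, cap_ns)
    return delays
```

With the defaults, the first 14 values run from 1000 to 8192000 and every later value is 10000000, which matches the trace between two restarts.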
The DB loads in recovery mode (innodb_force_recovery) 4, 5 or 6, but not 3.
sysctl.conf
fs.file-max = 20000000
net.ipv4.
kernel.pid_max = 65535
kernel.
net.core.
net.core.rmem_max = 8388608
net.core.somaxconn = 16384
net.core.wmem_max = 8388608
net.ipv4.
net.ipv4.
net.ipv4.
net.ipv4.
net.ipv4.
net.ipv4.
net.ipv4.
net.ipv4.
net.ipv4.
net.ipv4.
net.ipv4.
net.ipv4.
net.ipv4.
net.ipv4.
net.ipv4.
net.ipv4.tcp_rmem = 4096 87380 8388608
net.ipv4.
net.ipv4.
net.ipv4.
net.ipv4.
net.ipv4.
net.ipv4.tcp_wmem = 4096 87380 8388608
vm.overcommit_
vm.swappiness = 0
fs.aio-max-nr = 500000
limits.conf
mysql soft nofile 150000
mysql hard nofile 200000
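One thing that may be worth ruling out is whether the number of table files under the datadir approaches the open-file limit actually in effect. The sketch below is an assumption-laden check, not a diagnosis: the function names, the suffix list and the headroom value are all illustrative, and resource.getrlimit reports the limit of the Python process itself, not of mysqld. Entries in limits.conf are applied through PAM sessions and may not reach a daemon started by the init system, so the server's real limit should be read from /proc/<pid>/limits.

```python
import os
import resource


def count_table_files(datadir, suffixes=(".ibd", ".frm", ".MYD", ".MYI")):
    """Count files under datadir whose names end with one of the suffixes."""
    total = 0
    for _root, _dirs, files in os.walk(datadir):
        total += sum(1 for name in files if name.endswith(suffixes))
    return total


def exceeds_limit(n_files, soft_limit, headroom=1000):
    """True if n_files plus some headroom for sockets and logs passes the limit."""
    return n_files + headroom > soft_limit


def report(datadir):
    """Print the file count next to this process's own soft nofile limit."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    n_files = count_table_files(datadir)
    print(f"{n_files} table files, soft nofile limit {soft}, "
          f"over limit: {exceeds_limit(n_files, soft)}")
```

Calling report("/var/lib/mysql") as the mysql user would give a first comparison; substituting the soft limit from /proc/<pid>/limits into exceeds_limit gives the figure that applies to the running server.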
The error.log file is a non-stop flow of these:
2016-12-
2016-12-
2016-12-
2016-12-
2016-12-
2016-12-
2016-12-
2016-12-
2016-12-
2016-12-
mysqld.conf
[mysqld]
user = mysql
pid-file = /var/run/
socket = /var/run/
port = 3306
basedir = /usr
datadir = /var/lib/mysql
tmpdir = /tmp
lc-messages-dir = /usr/share/mysql
explicit_
max_connections
wait_timeout=5
interactive_
myisam_
sort_buffer_
innodb_
skip-name-resolve
default-
max_allowed_
expire_logs_days = 3
server-id = 7
innodb_
innodb_flush_method = O_DIRECT
innodb_
bind-address = 0.0.0.0
log-error = /var/log/
log_error_
#sql_mode=
sql_mode=""
symbolic-links=0
lower_case_
master-
relay-log-
replicate-do-table = account.
replicate-do-table = account.
replicate-do-table = account.
replicate-do-table = account.
replicate-
replicate-
replicate-
slave-skip-errors = 1062
open-files-
myisam-
innodb_
innodb_
affects: percona-xtradb-cluster → percona-xtradb
I should have added that this happened on two servers at the same time, both slaves of the same master.