MySQL WSREP server fails to stop
Affects | Status | Importance | Assigned to | Milestone | |
---|---|---|---|---|---|
MySQL patches by Codership |
New
|
Undecided
|
Unassigned |
Bug Description
Hi, guys
We are experiencing rather strange issue with mysql wsrep server. Sometimes when we run /etc/init.d/mysql stop it tries to stop but hangs due to unknown reason. The last message in log is:
130111 23:36:36 [Note] /usr/sbin/mysqld: Normal shutdown
130111 23:36:36 [Note] WSREP: rollbacker thread exiting
130111 23:36:36 [Note] Event Scheduler: Purging the queue. 0 events
strace -ttt -f -p `pidof mysqld` shows the following:
Process 4905 attached with 22 threads - interrupt to quit
[pid 4959] 1358242220.703215 futex(0x10550e4, FUTEX_WAIT_PRIVATE, 8, NULL <unfinished ...>
[pid 4928] 1358242220.703252 futex(0x21a358c, FUTEX_WAIT_PRIVATE, 6, NULL <unfinished ...>
[pid 4927] 1358242220.703265 futex(0x21a358c, FUTEX_WAIT_PRIVATE, 8, NULL <unfinished ...>
[pid 4925] 1358242220.703277 futex(0x3aa9ae4, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
[pid 4924] 1358242220.703290 gettimeofday( <unfinished ...>
[pid 4923] 1358242220.703302 futex(0x215404c, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
[pid 4922] 1358242220.703313 rt_sigtimedwait
[pid 4921] 1358242220.703348 futex(0x18a2b14, FUTEX_WAIT_PRIVATE, 1, NULL <unfinished ...>
[pid 4920] 1358242220.703359 restart_
[pid 4919] 1358242220.703368 restart_
[pid 4918] 1358242220.703378 restart_
[pid 4916] 1358242220.703388 io_getevents(
[pid 4915] 1358242220.703404 io_getevents(
[pid 4914] 1358242220.703416 io_getevents(
[pid 4913] 1358242220.703427 io_getevents(
[pid 4912] 1358242220.703438 io_getevents(
[pid 4911] 1358242220.703449 io_getevents(
[pid 4910] 1358242220.703460 io_getevents(
[pid 4909] 1358242220.703472 io_getevents(
[pid 4908] 1358242220.703482 io_getevents(
[pid 4907] 1358242220.703494 io_getevents(
[pid 4905] 1358242220.703505 futex(0x10550e4, FUTEX_WAIT_PRIVATE, 7, NULL <unfinished ...>
[pid 4928] 1358242220.703574 <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid 4924] 1358242220.703600 <... gettimeofday resumed> {1358242220, 703548}, NULL) = 0
[pid 4905] 1358242220.703616 <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable)
[pid 4928] 1358242220.703885 futex(0x21a358c, FUTEX_WAIT_PRIVATE, 8, NULL <unfinished ...>
[pid 4924] 1358242220.703932 gettimeofday( <unfinished ...>
[pid 4905] 1358242220.703943 futex(0x10550e4, FUTEX_WAIT_PRIVATE, 8, NULL <unfinished ...>
[pid 4924] 1358242220.703972 <... gettimeofday resumed> {1358242220, 703955}, NULL) = 0
[pid 4924] 1358242220.704003 epoll_wait(36, {}, 128, 21) = 0
[pid 4924] 1358242220.725173 gettimeofday(
[pid 4924] 1358242220.725217 gettimeofday(
[pid 4924] 1358242220.725256 epoll_wait(36, {}, 128, 1) = 0
[pid 4924] 1358242220.726417 gettimeofday(
[pid 4924] 1358242220.726471 clock_gettime(
[pid 4924] 1358242220.726509 clock_gettime(
[pid 4924] 1358242220.726546 clock_gettime(
[pid 4924] 1358242220.726582 clock_gettime(
[pid 4924] 1358242220.726618 clock_gettime(
[pid 4924] 1358242220.726770 clock_gettime(
[pid 4924] 1358242220.726806 gettimeofday(
[pid 4924] 1358242220.726846 epoll_ctl(36, EPOLL_CTL_MOD, 37, {EPOLLIN|
[pid 4924] 1358242220.726921 gettimeofday(
[pid 4924] 1358242220.726957 epoll_wait(36, {{EPOLLIN, {u32=61515008, u64=61515008}}}, 128, 227) = 1
[pid 4924] 1358242220.726992 gettimeofday(
[pid 4924] 1358242220.727025 gettimeofday(
[pid 4924] 1358242220.727059 epoll_wait(36, {}, 128, 227) = 0
[pid 4924] 1358242220.954503 gettimeofday(
[pid 4924] 1358242220.954595 gettimeofday(
[pid 4924] 1358242220.954890 epoll_wait(36, {}, 128, 0) = 0
[pid 4924] 1358242220.955002 gettimeofday(
[pid 4924] 1358242220.955081 clock_gettime(
[pid 4924] 1358242220.955152 clock_gettime(
[pid 4924] 1358242220.955227 clock_gettime(
[pid 4924] 1358242220.955299 clock_gettime(
[pid 4924] 1358242220.955373 clock_gettime(
[pid 4924] 1358242220.955441 clock_gettime(
[pid 4924] 1358242220.955510 clock_gettime(
[pid 4924] 1358242220.955583 clock_gettime(
[pid 4924] 1358242220.955824 clock_gettime(
[pid 4924] 1358242220.955939 gettimeofday(
[pid 4924] 1358242220.956016 epoll_ctl(36, EPOLL_CTL_MOD, 37, {EPOLLIN|
[pid 4924] 1358242220.956093 gettimeofday(
[pid 4924] 1358242220.956159 epoll_wait(36, {{EPOLLIN, {u32=61515008, u64=61515008}}}, 128, 271) = 1
[pid 4924] 1358242220.956236 gettimeofday(
[pid 4924] 1358242220.956310 gettimeofday(
[pid 4924] 1358242220.956386 epoll_wait(36, <unfinished ...>
[pid 4920] 1358242220.983830 <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 4920] 1358242220.983902 futex(0x1b61df0, FUTEX_WAKE_PRIVATE, 1) = 0
[pid 4920] 1358242220.983977 gettimeofday(
[pid 4920] 1358242220.984058 gettimeofday(
[pid 4920] 1358242220.984135 gettimeofday(
[pid 4920] 1358242220.984206 futex(0x1b61e34, FUTEX_WAIT_
[pid 4916] 1358242221.203892 <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 4915] 1358242221.203954 <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 4914] 1358242221.203994 <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 4913] 1358242221.204032 <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 4912] 1358242221.204070 <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 4911] 1358242221.204107 <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 4910] 1358242221.204144 <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 4909] 1358242221.204185 <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 4908] 1358242221.204226 <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 4907] 1358242221.204265 <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 4916] 1358242221.204338 io_getevents(
[pid 4915] 1358242221.204377 io_getevents(
[pid 4914] 1358242221.204410 io_getevents(
[pid 4913] 1358242221.204442 io_getevents(
[pid 4912] 1358242221.204475 io_getevents(
[pid 4911] 1358242221.204508 io_getevents(
[pid 4910] 1358242221.204544 io_getevents(
[pid 4909] 1358242221.204578 io_getevents(
[pid 4908] 1358242221.204614 io_getevents(
[pid 4907] 1358242221.204937 io_getevents(
[pid 4924] 1358242221.227790 <... epoll_wait resumed> {}, 128, 271) = 0
[pid 4924] 1358242221.227826 gettimeofday(
[pid 4924] 1358242221.227879 clock_gettime(
[pid 4924] 1358242221.227913 clock_gettime(
[pid 4924] 1358242221.227949 clock_gettime(
[pid 4924] 1358242221.227981 clock_gettime(
[pid 4924] 1358242221.228014 clock_gettime(
[pid 4924] 1358242221.228046 clock_gettime(
[pid 4924] 1358242221.228080 gettimeofday(
[pid 4924] 1358242221.228119 epoll_ctl(36, EPOLL_CTL_MOD, 37, {EPOLLIN|
[pid 4924] 1358242221.228160 gettimeofday(
[pid 4924] 1358242221.228194 epoll_wait(36, {{EPOLLIN, {u32=61515008, u64=61515008}}}, 128, 227) = 1
[pid 4924] 1358242221.228229 gettimeofday(
[pid 4924] 1358242221.228262 gettimeofday(
[pid 4924] 1358242221.228296 epoll_wait(36, <unfinished ...>
[pid 4918] 1358242221.437732 <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 4919] 1358242221.437795 <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 4918] 1358242221.437815 futex(0x1b61d60, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 4919] 1358242221.437840 futex(0x1b61cd0, FUTEX_WAKE_PRIVATE, 1 <unfinished ...>
[pid 4918] 1358242221.437855 <... futex resumed> ) = 0
[pid 4919] 1358242221.437927 <... futex resumed> ) = 0
[pid 4918] 1358242221.437947 gettimeofday( <unfinished ...>
[pid 4919] 1358242221.437966 gettimeofday( <unfinished ...>
[pid 4918] 1358242221.437979 <... gettimeofday resumed> {1358242221, 437953}, NULL) = 0
[pid 4919] 1358242221.438005 <... gettimeofday resumed> {1358242221, 437971}, NULL) = 0
[pid 4918] 1358242221.438071 futex(0x1b61da4, FUTEX_WAIT_
[pid 4919] 1358242221.438097 gettimeofday(
[pid 4919] 1358242221.438130 futex(0x1b61d14, FUTEX_WAIT_
[pid 4924] 1358242221.455653 <... epoll_wait resumed> {}, 128, 227) = 0
[pid 4924] 1358242221.455687 gettimeofday(
[pid 4924] 1358242221.455732 gettimeofday(
[pid 4924] 1358242221.455768 epoll_wait(36, {}, 128, 1) = 0
[pid 4924] 1358242221.456937 gettimeofday(
[pid 4924] 1358242221.456982 clock_gettime(
[pid 4924] 1358242221.457016 clock_gettime(
[pid 4924] 1358242221.457051 clock_gettime(
[pid 4924] 1358242221.457083 clock_gettime(
[pid 4924] 1358242221.457117 clock_gettime(
[pid 4924] 1358242221.457148 clock_gettime(
[pid 4924] 1358242221.457181 clock_gettime(
[pid 4924] 1358242221.457214 clock_gettime(
[pid 4924] 1358242221.457247 clock_gettime(
[pid 4924] 1358242221.457282 gettimeofday(
[pid 4924] 1358242221.457320 epoll_ctl(36, EPOLL_CTL_MOD, 37, {EPOLLIN|
[pid 4924] 1358242221.457391 gettimeofday(
[pid 4924] 1358242221.457438 epoll_wait(36, {{EPOLLIN, {u32=61515008, u64=61515008}}}, 128, 199) = 1
[pid 4924] 1358242221.457474 gettimeofday(
[pid 4924] 1358242221.457507 gettimeofday(
[pid 4924] 1358242221.457541 epoll_wait(36, {}, 128, 199) = 0
[pid 4924] 1358242221.656911 gettimeofday(
[pid 4924] 1358242221.656961 gettimeofday(
[pid 4924] 1358242221.656996 epoll_wait(36, {}, 128, 0) = 0
[pid 4924] 1358242221.657027 gettimeofday(
[pid 4924] 1358242221.657067 clock_gettime(
[pid 4924] 1358242221.657101 clock_gettime(
[pid 4924] 1358242221.657137 clock_gettime(
[pid 4924] 1358242221.657169 clock_gettime(
[pid 4924] 1358242221.657213 clock_gettime(
[pid 4924] 1358242221.657274 clock_gettime(
[pid 4924] 1358242221.657310 clock_gettime(
[pid 4924] 1358242221.657351 clock_gettime(
[pid 4924] 1358242221.657405 clock_gettime(
[pid 4924] 1358242221.657438 clock_gettime(
[pid 4924] 1358242221.657470 clock_gettime(
[pid 4924] 1358242221.657503 gettimeofday(
[pid 4924] 1358242221.657541 epoll_ctl(36, EPOLL_CTL_MOD, 37, {EPOLLIN|
[pid 4924] 1358242221.657582 gettimeofday(
[pid 4924] 1358242221.657616 epoll_wait(36, {{EPOLLIN, {u32=61515008, u64=61515008}}}, 128, 70) = 1
[pid 4924] 1358242221.657917 gettimeofday(
[pid 4924] 1358242221.658041 gettimeofday(
[pid 4924] 1358242221.658083 epoll_wait(36, <unfinished ...>
[pid 4916] 1358242221.704824 <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 4915] 1358242221.704862 <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 4914] 1358242221.704881 <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 4913] 1358242221.704900 <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 4912] 1358242221.704918 <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 4911] 1358242221.704936 <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 4910] 1358242221.704954 <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 4909] 1358242221.704972 <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 4916] 1358242221.705006 io_getevents(
[pid 4915] 1358242221.705026 io_getevents(
[pid 4914] 1358242221.705042 io_getevents(
[pid 4913] 1358242221.705058 io_getevents(
[pid 4912] 1358242221.705074 io_getevents(
[pid 4911] 1358242221.705089 io_getevents(
[pid 4910] 1358242221.705105 io_getevents(
[pid 4909] 1358242221.705120 io_getevents(
[pid 4908] 1358242221.705885 <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 4907] 1358242221.705922 <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 4908] 1358242221.706019 io_getevents(
[pid 4907] 1358242221.706043 io_getevents(
[pid 4924] 1358242221.728276 <... epoll_wait resumed> {}, 128, 70) = 0
[pid 4924] 1358242221.728311 gettimeofday(
[pid 4924] 1358242221.728361 clock_gettime(
[pid 4924] 1358242221.728396 clock_gettime(
[pid 4924] 1358242221.728431 clock_gettime(
[pid 4924] 1358242221.728463 clock_gettime(
[pid 4924] 1358242221.728495 clock_gettime(
[pid 4924] 1358242221.728527 clock_gettime(
[pid 4924] 1358242221.728561 gettimeofday(
[pid 4924] 1358242221.728599 epoll_ctl(36, EPOLL_CTL_MOD, 37, {EPOLLIN|
[pid 4924] 1358242221.728843 gettimeofday(
[pid 4924] 1358242221.728956 epoll_wait(36, {{EPOLLIN, {u32=61515008, u64=61515008}}}, 128, 228) = 1
[pid 4924] 1358242221.728998 gettimeofday(
[pid 4924] 1358242221.729033 gettimeofday(
[pid 4924] 1358242221.729067 epoll_wait(36, {}, 128, 228) = 0
[pid 4924] 1358242221.957501 gettimeofday(
[pid 4924] 1358242221.957553 clock_gettime(
[pid 4924] 1358242221.957587 clock_gettime(
[pid 4924] 1358242221.957622 clock_gettime(
[pid 4924] 1358242221.957837 clock_gettime(
[pid 4924] 1358242221.957873 clock_gettime(
[pid 4924] 1358242221.957942 clock_gettime(
[pid 4924] 1358242221.957975 clock_gettime(
[pid 4924] 1358242221.958009 clock_gettime(
[pid 4924] 1358242221.958042 clock_gettime(
[pid 4924] 1358242221.958079 gettimeofday(
[pid 4924] 1358242221.958118 epoll_ctl(36, EPOLL_CTL_MOD, 37, {EPOLLIN|
[pid 4924] 1358242221.958159 gettimeofday(
[pid 4924] 1358242221.958203 epoll_wait(36, {{EPOLLIN, {u32=61515008, u64=61515008}}}, 128, 270) = 1
[pid 4924] 1358242221.958275 gettimeofday(
[pid 4924] 1358242221.958319 gettimeofday(
[pid 4924] 1358242221.958353 epoll_wait(36, <unfinished ...>
[pid 4916] 1358242222.205852 <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 4915] 1358242222.205907 <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 4914] 1358242222.205940 <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 4913] 1358242222.205977 <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 4912] 1358242222.206009 <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 4911] 1358242222.206041 <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 4910] 1358242222.206076 <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 4909] 1358242222.206113 <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 4916] 1358242222.206176 io_getevents(
[pid 4915] 1358242222.206211 io_getevents(
[pid 4914] 1358242222.206241 io_getevents(
[pid 4913] 1358242222.206269 io_getevents(
[pid 4912] 1358242222.206298 io_getevents(
[pid 4911] 1358242222.206330 io_getevents(
[pid 4910] 1358242222.206358 io_getevents(
[pid 4909] 1358242222.206385 io_getevents(
[pid 4908] 1358242222.206938 <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 4907] 1358242222.206978 <... io_getevents resumed> {}{0, 500000000}) = 0
[pid 4908] 1358242222.207065 io_getevents(
[pid 4907] 1358242222.207103 io_getevents(
The third argument of futex call is permanently incrementing with the step of 2 except some TIDs.
Looks like there are some problems with race conditions.
I searched for a while and found some bugs caused by `leap second` insertion. But setting the time and restarting ntpd did not help. Also I tried to change innodb_
We experience this problem on both Centos 6.3 and Ubuntu 12.04.1 no matter if it is KVM VM (1GB of RAM) or real machine (96GB of RAM) . It is easily triggered when running deployment of our services using puppet which does some sequential restarts of mysql server.
The only thing that helps us is kill -9 <mysqld_pid>. But I do not think this is affordable solution.
I attached strace, mysqld and dmesg. Let me know if you need other logs.
Looking forward for your help.
Hi,
According to logs wsrep provider fails to even start shutting down for some reason. When this happens again, could you run
$ gdb --batch -ex "thr apply all bt" -p $(pidof mysqld) /path/to/mysqld > backtrace.out
and attach that output here.
Also, what is the version of mysql-server-wsrep?