Bug #1153727 “Graceful handling of possible network issues” : Bugs : Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC

Revision history for this message

Alex Yurchenko (ayurchen) wrote on 2013-03-11:

#1

Download full text (3.4 KiB)

Posting the stack here for convenience:

130311 4:02:40 [Warning] Aborted connection 31256926 to db: 'unconnected' user: 'admin' host: '10.2.46.110' (Got an error reading communication packets)
130311 4:02:40 [Warning] Aborted connection 31256936 to db: 'unconnected' user: 'admin' host: '10.2.46.110' (Got an error reading communication packets)
130311 4:02:40 [Warning] Aborted connection 31256942 to db: 'unconnected' user: 'admin' host: '10.2.46.110' (Got an error reading communication packets)
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<asio::system_error> >'
what(): Bad file descriptor
11:02:49 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona Server better by reporting any
bugs at http://bugs.percona.com/

key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=232
max_threads=15000
thread_count=6
connection_count=6
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 32837918 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x35)[0x7c9765]
/usr/sbin/mysqld(handle_fatal_signal+0x4a4)[0x6a2ce4]
/lib64/libpthread.so.0[0x3bbba0f500]
/lib64/libc.so.6(gsignal+0x35)[0x3bbb2328a5]
/lib64/libc.so.6(abort+0x175)[0x3bbb234085]
/usr/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x12d)[0x7f49483f2a5d]
/usr/lib64/libstdc++.so.6(+0x3bbdabcbe6)[0x7f49483f0be6]
/usr/lib64/libstdc++.so.6(+0x3bbdabcc13)[0x7f49483f0c13]
/usr/lib64/libstdc++.so.6(+0x3bbdabcd0e)[0x7f49483f0d0e]
/usr/lib64/libgalera_smm.so(_ZN5boost15throw_exceptionIN4asio12system_errorEEEvRKT_+0x18a)[0x7f494876c70a]
/usr/lib64/libgalera_smm.so(_ZN4asio6detail11throw_errorERKNS_10error_codeE+0x5b)[0x7f494876c82b]
/usr/lib64/libgalera_smm.so(_ZN5gcomm13AsioTcpSocket17assign_local_addrEv+0x707)[0x7f494875d137]
/usr/lib64/libgalera_smm.so(_ZN5gcomm13AsioTcpSocket15connect_handlerERKN4asio10error_codeE+0x67)[0x7f494875dda7]
/usr/lib64/libgalera_smm.so(_ZN4asio6detail26reactive_socket_connect_opIN5boost3_bi6bind_tIvNS2_4_mfi3mf1IvN5gcomm13AsioTcpSocketERKNS_10error_codeEEENS3_5list2INS3_5valueINS2_10shared_ptrIS8_EEEEPFNS2_3argILi1EEEvEEEEEE11do_completeEPNS0_15task_io_serviceEPNS0_25task_io_service_operationES9_m+0x10e)[0x7f494876822e]
/usr/lib64/libgalera_smm.so(_ZN4asio6detail15task_io_service3runERNS_10error_codeE+0x459)[0x7f4948788a49]
/usr/lib64/libgalera_smm.so(_ZN5gcomm12AsioProtonet10event_loopERKN2gu8datetime6PeriodE+0x1d6)...

Posting the stack here for convenience:

130311  4:02:40 [Warning] Aborted connection 31256926 to db: 'unconnected' user: 'admin' host: '10.2.46.110' (Got an error reading communication packets)
130311  4:02:40 [Warning] Aborted connection 31256936 to db: 'unconnected' user: 'admin' host: '10.2.46.110' (Got an error reading communication packets)
130311  4:02:40 [Warning] Aborted connection 31256942 to db: 'unconnected' user: 'admin' host: '10.2.46.110' (Got an error reading communication packets)
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<asio::system_error> >'
  what():  Bad file descriptor
11:02:49 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.
Please help us make Percona Server better by reporting any
bugs at http://bugs.percona.com/

key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=232
max_threads=15000
thread_count=6
connection_count=6
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 32837918 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x35)[0x7c9765]
/usr/sbin/mysqld(handle_fatal_signal+0x4a4)[0x6a2ce4]
/lib64/libpthread.so.0[0x3bbba0f500]
/lib64/libc.so.6(gsignal+0x35)[0x3bbb2328a5]
/lib64/libc.so.6(abort+0x175)[0x3bbb234085]
/usr/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x12d)[0x7f49483f2a5d]
/usr/lib64/libstdc++.so.6(+0x3bbdabcbe6)[0x7f49483f0be6]
/usr/lib64/libstdc++.so.6(+0x3bbdabcc13)[0x7f49483f0c13]
/usr/lib64/libstdc++.so.6(+0x3bbdabcd0e)[0x7f49483f0d0e]
/usr/lib64/libgalera_smm.so(_ZN5boost15throw_exceptionIN4asio12system_errorEEEvRKT_+0x18a)[0x7f494876c70a]
/usr/lib64/libgalera_smm.so(_ZN4asio6detail11throw_errorERKNS_10error_codeE+0x5b)[0x7f494876c82b]
/usr/lib64/libgalera_smm.so(_ZN5gcomm13AsioTcpSocket17assign_local_addrEv+0x707)[0x7f494875d137]
/usr/lib64/libgalera_smm.so(_ZN5gcomm13AsioTcpSocket15connect_handlerERKN4asio10error_codeE+0x67)[0x7f494875dda7]
/usr/lib64/libgalera_smm.so(_ZN4asio6detail26reactive_socket_connect_opIN5boost3_bi6bind_tIvNS2_4_mfi3mf1IvN5gcomm13AsioTcpSocketERKNS_10error_codeEEENS3_5list2INS3_5valueINS2_10shared_ptrIS8_EEEEPFNS2_3argILi1EEEvEEEEEE11do_completeEPNS0_15task_io_serviceEPNS0_25task_io_service_operationES9_m+0x10e)[0x7f494876822e]
/usr/lib64/libgalera_smm.so(_ZN4asio6detail15task_io_service3runERNS_10error_codeE+0x459)[0x7f4948788a49]
/usr/lib64/libgalera_smm.so(_ZN5gcomm12AsioProtonet10event_loopERKN2gu8datetime6PeriodE+0x1d6)[0x7f4948782d96]
/usr/lib64/libgalera_smm.so(_ZN9GCommConn3runEv+0x57)[0x7f494879b8a7]
/usr/lib64/libgalera_smm.so(_ZN9GCommConn6run_fnEPv+0x9)[0x7f494879f199]
/lib64/libpthread.so.0[0x3bbba07851]
/lib64/libc.so.6(clone+0x6d)[0x3bbb2e811d]

no longer affects:

codership-mysql

Raghavendra D Prabhu (raghavendra-prabhu) on 2013-03-13

tags:

added: i30151

Revision history for this message

Alex Yurchenko (ayurchen) wrote on 2013-03-26:

#2

fix committed in lp:galera/2.x r148

Changed in galera:
assignee:	nobody → Teemu Ollakka (teemu-ollakka)
status:	New → Fix Committed

Raghavendra D Prabhu (raghavendra-prabhu) on 2013-04-05

Changed in percona-xtradb-cluster:
milestone:	none → 5.5.30-23.7.4

Raghavendra D Prabhu (raghavendra-prabhu) on 2013-04-05

Changed in percona-xtradb-cluster:
status:	New → Fix Released

Revision history for this message

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-05-25:

#3

The issue seems to be recurring again here https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1184034

Revision history for this message

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-05-25:

#4

The difference I see is instead of

" what(): Bad file descriptor"

we see

" what(): Transport endpoint is not connected"

because of the additional code added.

This 'transport endpoint not connected' seems to mimic how nmap scans the port.

Now, we can say that cluster ports should be protected with iptables or so against this but again, it would be better if the server didn't crash on this.

Revision history for this message

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-05-25:

#5

Full backtrace here http://paste.wnohang.net/114621

and it is crashing here:

void gcomm::AsioTcpSocket::assign_remote_addr()
....
        remote_addr_ = uri_string(
            gcomm::TCP_SCHEME,
            gcomm::escape_addr(socket_.remote_endpoint().address()),
            gu::to_string(socket_.remote_endpoint().port())
            );

Revision history for this message

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-05-25:

#6

And, yes the error is

perror 107
OS error code 107: Transport endpoint is not connected

Related discussion here http://comments.gmane.org/gmane.comp.lib.boost.asio.user/4781 and http://boost.2283326.n4.nabble.com/boost-asio-exception-not-passed-to-async-callback-td2599023.html

Revision history for this message

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-05-25:

#7

So, https://bazaar.launchpad.net/~codership/galera/2.x/revision/148 add try/catch around connect_handler but not in accept_handler. So, adding try/catch + FAILED_HANDLER should help here right?

Revision history for this message

Teemu Ollakka (teemu-ollakka) wrote on 2013-05-25:

#8

Adding try/catch and deleting failed socket object + creating new object for acceptor should do it, calling FAILED_HANDLER is not necessary since upper layers haven't seen the socket yet.

Revision history for this message

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-05-25:

#9

With sendto() it rather emits: (which I believe are from connect_handler())

130525 18:35:53 [Warning] WSREP: unserialize error invalid protocol version 2: 71 (Protocol error)
         at gcomm/src/gcomm/datagram.hpp:unserialize():108
130525 18:35:57 [Warning] WSREP: unserialize error invalid protocol version 4: 71 (Protocol error)
         at gcomm/src/gcomm/datagram.hpp:unserialize():108
130525 18:35:57 [Warning] WSREP: unserialize error invalid protocol version 4: 71 (Protocol error)
         at gcomm/src/gcomm/datagram.hpp:unserialize():108
130525 18:35:57 [Warning] WSREP: unserialize error invalid protocol version 2: 71 (Protocol error)
         at gcomm/src/gcomm/datagram.hpp:unserialize():108
130525 18:35:57 [Warning] WSREP: unserialize error invalid flags 6: 71 (Protocol error)
         at gcomm/src/gcomm/datagram.hpp:unserialize():101

I tested with https://bazaar.launchpad.net/~raghavendra-prabhu/percona-xtradb-cluster/galera-bug1153727/revision/125 it doesn't crash (it just logs 130525 18:53:15 [ERROR] WSREP: Transport endpoint is not connected) but other nodes can't connect either, hence requires a restart of the node.

Revision history for this message

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-05-25:

#10

@Teemu, ack. yeah, failed_handler is not in this scope (got a compile error with that :)).

Revision history for this message

Laurent Minost (lolomin) wrote on 2013-05-28:

#11

Hi guys,

Thanks for these informations !
I've faced the same crash on one of my development PXDBC node using 5.5.30-23.7.4 so I can confirm that this bug is still present in latest release.
Can we FW both 4444 & 4567 ports (and maybe others?) from IPs that are not one of the cluster nodes to avoid this problem to occur again please ?

Regards,

Laurent

Alex Yurchenko (ayurchen) on 2013-06-01

Changed in galera:
milestone:	none → 23.2.6

Revision history for this message

Laurent Minost (lolomin) wrote on 2013-06-01:

#12

Hi Alexey,

Thanks for this fix !

Can you please tell if firewalling ports like described above is enough to avoid this problem occuring again or there is others things needed that need us to wait for a new release to be published ?

Regards,

Laurent

Revision history for this message

Alex Yurchenko (ayurchen) wrote on 2013-06-01:

#13

Laurent, this particular fix was already released by Percona with their latest Galera build. If you still have issues you should wait for another fix, tracked in https://bugs.launchpad.net/galera/+bug/1184034.

Firewalling the Galera ports should be enough - or at least will help.

Revision history for this message

Raghavendra D Prabhu (raghavendra-prabhu) wrote on 2013-06-02:

#14

@Laurent, as Alexey has mentioned, the fix for 'what(): Bad file
descriptor has b;een in latest PXC.

However, seeing your mail in the mailing list, it looks like for
you and few others it crashed with
" what(): Transport endpoint is not connected".

This is not fixed yet. It needs to be fixed in galera. Looks like
it will be fixed in Galera 23.2.6 in bug https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1184034 . You can look at 23.2.6 milestone here https://launchpad.net/galera/+milestone/23.2.6

You can look at https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1153727/comments/4 for more details.

Also, you can further track updates to this crash in https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1184034 itself. (since galera milestone is marked there)

Revision history for this message

Laurent Minost (lolomin) wrote on 2013-06-03:

#15

Hi Raghavendra & Alexey,

Thanks for these detailed informations, so I will wait for next release and will try to block concerned ports for the moment to avoid the problem to occur again.

Regards,

Laurent

Revision history for this message

Laurent Minost (lolomin) wrote on 2013-06-04:

#16

Download full text (11.4 KiB)

Hi all,

Bad news here, same crash today :

root@dedwen:/var/log/mysql# cat mysql-error.log
130602 6:31:51 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130602 7:12:25 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130602 7:13:50 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130602 10:18:11 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130603 7:28:16 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130603 7:28:17 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130603 12:03:43 [Warning] IP address '172.16.29.16' could not be resolved: Name or service not known
130604 5:25:34 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604 8:35:02 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604 8:40:28 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604 8:45:50 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604 8:48:28 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604 8:53:42 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604 9:40:29 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<asio::system_error> >'
what(): Transport endpoint is not connected
11:24:56 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona Server better by reporting any
bugs at http://bugs.percona.com/

key_buffer_size=268435456
read_buffer_size=131072
max_used_connections=53
max_threads=102
thread_count=27
connection_count=27
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1947646 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x30000
/usr/sbin/mysqld(my_print_stacktrace+0x35)[0x7ed245]
/usr/sbin/mysqld(handle_fatal_signal+0x4b4)[0x6ba864]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf030)[0x7f26d36e9030]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35)[0x7f26d22c5475]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x180)[0x7f26d22c86f0]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x11d)[0x7f26c877a89d]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x63996)[0x7f26c8778996]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x639c3)[0x7f26c87789c3]
/usr/lib/x86_64-linux-...

Hi all,

Bad news here, same crash today :

root@dedwen:/var/log/mysql# cat mysql-error.log
130602  6:31:51 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130602  7:12:25 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130602  7:13:50 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130602 10:18:11 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130603  7:28:16 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130603  7:28:17 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130603 12:03:43 [Warning] IP address '172.16.29.16' could not be resolved: Name or service not known
130604  5:25:34 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604  8:35:02 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604  8:40:28 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604  8:45:50 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604  8:48:28 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604  8:53:42 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604  9:40:29 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<asio::system_error> >'
  what():  Transport endpoint is not connected
11:24:56 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona Server better by reporting any
bugs at http://bugs.percona.com/

key_buffer_size=268435456
read_buffer_size=131072
max_used_connections=53
max_threads=102
thread_count=27
connection_count=27
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1947646 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x30000
/usr/sbin/mysqld(my_print_stacktrace+0x35)[0x7ed245]
/usr/sbin/mysqld(handle_fatal_signal+0x4b4)[0x6ba864]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf030)[0x7f26d36e9030]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35)[0x7f26d22c5475]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x180)[0x7f26d22c86f0]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x11d)[0x7f26c877a89d]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x63996)[0x7f26c8778996]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x639c3)[0x7f26c87789c3]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x63bee)[0x7f26c8778bee]
/usr/lib/mysql/plugin/libgalera_smm.so(_ZN5boost15throw_exceptionIN4asio12system_errorEEEvRKT_+0x18a)[0x7f26c8b4fbda]
/usr/lib/mysql/plugin/libgalera_smm.so(_ZNK4asio12basic_socketINS_2ip3tcpENS_21stream_socket_serviceIS2_EEE15remote_endpointEv+0x1e2)[0x7f26c8b52ea2]
/usr/lib/mysql/plugin/libgalera_smm.so(_ZN5gcomm13AsioTcpSocket18assign_remote_addrEv+0x5e7)[0x7f26c8b3f137]
/usr/lib/mysql/plugin/libgalera_smm.so(_ZN5gcomm15AsioTcpAcceptor14accept_handlerEN5boost10shared_ptrINS_6SocketEEERKN4asio10error_codeE+0xc3)[0x7f26c8b3fb83]
/usr/lib/mysql/plugin/libgalera_smm.so(_ZN4asio6detail25reactive_socket_accept_opINS_12basic_socketINS_2ip3tcpENS_21stream_socket_serviceIS4_EEEES4_N5boost3_bi6bind_tIvNS8_4_mfi3mf2IvN5gcomm15AsioTcpAcceptorENS8_10shared_ptrINSD_6SocketEEERKNS_10error_codeEEENS9_5list3INS9_5valueIPSE_EENSN_ISH_EEPFNS8_3argILi1EEEvEEEEEE11do_completeEPNS0_15task_io_serviceEPNS0_25task_io_service_operationESI_m+0x180)[0x7f26c8b46e30]
/usr/lib/mysql/plugin/libgalera_smm.so(_ZN4asio6detail15task_io_service3runERNS_10error_codeE+0x45a)[0x7f26c8b694fa]
/usr/lib/mysql/plugin/libgalera_smm.so(_ZN5gcomm12AsioProtonet10event_loopERKN2gu8datetime6PeriodE+0x1d6)[0x7f26c8b63126]
/usr/lib/mysql/plugin/libgalera_smm.so(_ZN9GCommConn3runEv+0x57)[0x7f26c8b7c847]
/usr/lib/mysql/plugin/libgalera_smm.so(_ZN9GCommConn6run_fnEPv+0x9)[0x7f26c8b7f7c9]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x6b50)[0x7f26d36e0b50]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7f26d236da7d]
You may download the Percona Server operations manual by visiting
http://www.percona.com/software/percona-server/. You may find information
in the manual which will help you identify the cause of the crash.
130604 13:24:57 mysqld_safe Number of processes running now: 0
130604 13:24:57 mysqld_safe WSREP: not restarting wsrep node automatically
130604 13:24:57 mysqld_safe mysqld from pid file /var/lib/mysql/dedwen.pid ended
130604 13:27:45 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
130604 13:27:45 mysqld_safe WSREP: Running position recovery with --log_error=/tmp/tmp.VHoWERXB12
130604 13:27:56 mysqld_safe WSREP: Recovered position e9355fb6-755b-11e2-0800-1fb4961d36c0:1058164
130604 13:27:56 [Note] WSREP: wsrep_start_position var submitted: 'e9355fb6-755b-11e2-0800-1fb4961d36c0:1058164'
130604 13:27:56 [Note] WSREP: Read nil XID from storage engines, skipping position init
130604 13:27:56 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/mysql/plugin/libgalera_smm.so'
130604 13:27:56 [Note] WSREP: wsrep_load(): Galera 2.5(r150) by Codership Oy <info@codership.com> loaded succesfully.
130604 13:27:56 [Note] WSREP: Found saved state: e9355fb6-755b-11e2-0800-1fb4961d36c0:-1
130604 13:27:56 [Note] WSREP: Reusing existing '/var/lib/mysql//galera.cache'.
130604 13:27:56 [Note] WSREP: Passing config to GCS: base_host = 172.16.29.232; base_port = 4567; cert.log_conflicts = no; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 2048M; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO; replicator.causal_read_timeout = PT30S; replicator.commit_order = 3
130604 13:27:56 [Note] WSREP: Assign initial position for certification: 1058164, protocol version: -1
130604 13:27:56 [Note] WSREP: wsrep_sst_grab()
130604 13:27:56 [Note] WSREP: Start replication
130604 13:27:56 [Note] WSREP: Setting initial position to e9355fb6-755b-11e2-0800-1fb4961d36c0:1058164
130604 13:27:56 [Note] WSREP: protonet asio version 0
130604 13:27:56 [Note] WSREP: backend: asio
130604 13:27:56 [Note] WSREP: GMCast version 0
130604 13:27:56 [Note] WSREP: (cdc89674-cd09-11e2-0800-808db1644387, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
130604 13:27:56 [Note] WSREP: (cdc89674-cd09-11e2-0800-808db1644387, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
130604 13:27:56 [Note] WSREP: EVS version 0
130604 13:27:56 [Note] WSREP: PC version 0
130604 13:27:56 [Note] WSREP: gcomm: connecting to group 'my_wsrep_cluster', peer ''
130604 13:27:56 [Note] WSREP: Node cdc89674-cd09-11e2-0800-808db1644387 state prim
130604 13:27:56 [Note] WSREP: view(view_id(PRIM,cdc89674-cd09-11e2-0800-808db1644387,1) memb {
        cdc89674-cd09-11e2-0800-808db1644387,
} joined {
} left {
} partitioned {
})
130604 13:27:56 [Note] WSREP: gcomm: connected
130604 13:27:56 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
130604 13:27:56 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
130604 13:27:56 [Note] WSREP: Opened channel 'my_wsrep_cluster'
130604 13:27:56 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 0, memb_num = 1
130604 13:27:56 [Note] WSREP: Waiting for SST to complete.
130604 13:27:56 [Note] WSREP: STATE_EXCHANGE: sent state UUID: cdc94ff0-cd09-11e2-0800-fc2e41fe8b87
130604 13:27:56 [Note] WSREP: STATE EXCHANGE: sent state msg: cdc94ff0-cd09-11e2-0800-fc2e41fe8b87
130604 13:27:56 [Note] WSREP: STATE EXCHANGE: got state msg: cdc94ff0-cd09-11e2-0800-fc2e41fe8b87 from 0 (dedwen)
130604 13:27:56 [Note] WSREP: Quorum results:
        version    = 2,
        component  = PRIMARY,
        conf_id    = 0,
        members    = 1/1 (joined/total),
        act_id     = 1058164,
        last_appl. = -1,
        protocols  = 0/4/2 (gcs/repl/appl),
        group UUID = e9355fb6-755b-11e2-0800-1fb4961d36c0
130604 13:27:56 [Note] WSREP: Flow-control interval: [16, 16]
130604 13:27:56 [Note] WSREP: Restored state OPEN -> JOINED (1058164)
130604 13:27:56 [Note] WSREP: Member 0 (dedwen) synced with group.
130604 13:27:56 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 1058164)
130604 13:27:56 [Note] WSREP: New cluster view: global state: e9355fb6-755b-11e2-0800-1fb4961d36c0:1058164, view# 1: Primary, number of nodes: 1, my index: 0, protocol version 2
130604 13:27:56 [Note] WSREP: SST complete, seqno: 1058164
130604 13:27:56 [Note] Plugin 'FEDERATED' is disabled.
130604 13:27:56 InnoDB: The InnoDB memory heap is disabled
130604 13:27:56 InnoDB: Mutexes and rw_locks use GCC atomic builtins
130604 13:27:56 InnoDB: Compressed tables use zlib 1.2.3
130604 13:27:56 InnoDB: Using Linux native AIO
130604 13:27:56 InnoDB: Initializing buffer pool, size = 1.8G
130604 13:27:57 InnoDB: Completed initialization of buffer pool
130604 13:27:57 InnoDB: highest supported file format is Barracuda.
130604 13:28:00  InnoDB: Waiting for the background threads to start
130604 13:28:01 Percona XtraDB (http://www.percona.com) 5.5.30-rel30.2 started; log sequence number 241569581381
130604 13:28:01 [ERROR] /usr/sbin/mysqld: Table './mysql/user' is marked as crashed and should be repaired
130604 13:28:01 [Warning] Checking table:   './mysql/user'
130604 13:28:01 [ERROR] 1 client is using or hasn't closed the table properly
130604 13:28:01 [ERROR] /usr/sbin/mysqld: Table './mysql/db' is marked as crashed and should be repaired
130604 13:28:01 [Warning] Checking table:   './mysql/db'
130604 13:28:01 [ERROR] 1 client is using or hasn't closed the table properly
130604 13:28:02 [Note] Event Scheduler: Loaded 0 events
130604 13:28:02 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130604 13:28:02 [Note] WSREP: Assign initial position for certification: 1058164, protocol version: 2
130604 13:28:02 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.5.30-30.2-log'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  Percona Server (GPL), Release 30.2, wsrep_23.7.4.r3843
130604 13:28:02 [Note] WSREP: Synchronized with group, ready for connections
130604 13:28:02 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
130604 13:28:26 [ERROR] /usr/sbin/mysqld: Table './mysql/proc' is marked as crashed and should be repaired
130604 13:28:26 [Warning] Checking table:   './mysql/proc'

I noticed that there is a lot of "WSREP: last inactive check more than PT1.5S ago" messages hours ago before the crash, could this be related or not ?

By the way, I did not had time to FWed ports on this server, so I will do it today, could you please tell :
- complete list of ports that should be blocked (4444, 4567, others ?)
- if this will totally or partially avoid the problem ?
- what is the ETA for a release with a fix for this problem ?

Thanks !

Regards,

Laurent

Alex Yurchenko (ayurchen) on 2013-06-28

Changed in galera:
status:	Fix Committed → Fix Released

Revision history for this message

Amol (ajkedar) wrote on 2014-02-14:

#17

Download full text (4.1 KiB)

i had 2 nodes crashed on ubuntu 10.04 LTS and percona-xtradb-cluster-common-5.5 5.5.30-23.7.4-403.lucid

140214 5:54:45 [Note] WSREP: Flow-control interval: [23, 23]
140214 5:54:45 [Note] WSREP: New cluster view: global state: af24cb2c-a76e-11e2-0800-982f50316bdc:13855971, view# 157: Primary, number of nodes: 2, my index: 0, protocol version 2
140214 5:54:45 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
140214 5:54:45 [Note] WSREP: Assign initial position for certification: 13855971, protocol version: 2
140214 5:54:51 [Note] WSREP: cleaning up 12caf6b8-b399-11e2-0800-85f0fe40ed2a (tcp://10.1.7.1:4567)
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<asio::system_error> >'
what(): Transport endpoint is not connected
06:30:44 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona Server better by reporting any
bugs at http://bugs.percona.com/

key_buffer_size=67108864
read_buffer_size=131072
max_used_connections=85
max_threads=153
thread_count=5
connection_count=5
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 400387 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x35)[0x7fb9f5]
/usr/sbin/mysqld(handle_fatal_signal+0x4b4)[0x6c4724]
/lib/libpthread.so.0(+0xf8f0)[0x7f1b8a4a58f0]
/lib/libc.so.6(gsignal+0x35)[0x7f1b89094b25]
/lib/libc.so.6(abort+0x180)[0x7f1b89098670]
/usr/lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x115)[0x7f1b8639b8c5]
/usr/lib/libstdc++.so.6(+0xcacf6)[0x7f1b86399cf6]
/usr/lib/libstdc++.so.6(+0xcad23)[0x7f1b86399d23]
/usr/lib/libstdc++.so.6(+0xcae1e)[0x7f1b86399e1e]
/usr/lib64/libgalera_smm.so(_ZN5boost15throw_exceptionIN4asio12system_errorEEEvRKT_+0x18a)[0x7f1b8671aeba]
/usr/lib64/libgalera_smm.so(_ZNK4asio12basic_socketINS_2ip3tcpENS_21stream_socket_serviceIS2_EEE15remote_endpointEv+0x1fe)[0x7f1b8671f85e]
/usr/lib64/libgalera_smm.so(_ZN5gcomm13AsioTcpSocket18assign_remote_addrEv+0x627)[0x7f1b86705c57]
/usr/lib64/libgalera_smm.so(_ZN5gcomm15AsioTcpAcceptor14accept_handlerEN5boost10shared_ptrINS_6SocketEEERKN4asio10error_codeE+0xf3)[0x7f1b86706733]
/usr/lib64/libgalera_smm.so(_ZN4asio6detail25reactive_socket_accept_opINS_12basic_socketINS_2ip3tcpENS_21stream_socket_serviceIS4_EEEES4_N5boost3_bi6bind_tIvNS8_4_mfi3mf2IvN5gcomm15AsioTcpAcceptorENS8_10shared_ptrINSD_6SocketEEERKNS_10error_codeEEENS9_5list3INS9_5valueIPSE_EENSN_ISH_EEPFNS8...

Affects		Status	Importance	Assigned to	Milestone
	Galera	Fix Released	Undecided	Teemu Ollakka	Galera 23.2.6
	Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC	Fix Released	Undecided	Unassigned	Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC 5.5.30-23.7.4

Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC

Graceful handling of possible network issues

Bug Description

Related branches

Other bug subscribers

Remote bug watches