Percona XtraDB Cluster - HA scalable solution for MySQL

Graceful handling of possible network issues

Reported by Raghavendra D Prabhu on 2013-03-11
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Galera
Undecided
Teemu Ollakka
Percona XtraDB Cluster
Undecided
Unassigned

Bug Description

Under certain conditions mysqld can crash in the manner described here https://groups.google.com/forum/#!msg/codership-team/ZJT_YHzZgc4/j5zZMMOXOcoJ

Also, as described here: http://sprunge.us/HhHZ

The network issues can present in any way, however, if galera handles them better without crashing the server it would be nice.

Alex Yurchenko (ayurchen) wrote :
Download full text (3.4 KiB)

Posting the stack here for convenience:

130311 4:02:40 [Warning] Aborted connection 31256926 to db: 'unconnected' user: 'admin' host: '10.2.46.110' (Got an error reading communication packets)
130311 4:02:40 [Warning] Aborted connection 31256936 to db: 'unconnected' user: 'admin' host: '10.2.46.110' (Got an error reading communication packets)
130311 4:02:40 [Warning] Aborted connection 31256942 to db: 'unconnected' user: 'admin' host: '10.2.46.110' (Got an error reading communication packets)
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<asio::system_error> >'
  what(): Bad file descriptor
11:02:49 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona Server better by reporting any
bugs at http://bugs.percona.com/

key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=232
max_threads=15000
thread_count=6
connection_count=6
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 32837918 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x35)[0x7c9765]
/usr/sbin/mysqld(handle_fatal_signal+0x4a4)[0x6a2ce4]
/lib64/libpthread.so.0[0x3bbba0f500]
/lib64/libc.so.6(gsignal+0x35)[0x3bbb2328a5]
/lib64/libc.so.6(abort+0x175)[0x3bbb234085]
/usr/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x12d)[0x7f49483f2a5d]
/usr/lib64/libstdc++.so.6(+0x3bbdabcbe6)[0x7f49483f0be6]
/usr/lib64/libstdc++.so.6(+0x3bbdabcc13)[0x7f49483f0c13]
/usr/lib64/libstdc++.so.6(+0x3bbdabcd0e)[0x7f49483f0d0e]
/usr/lib64/libgalera_smm.so(_ZN5boost15throw_exceptionIN4asio12system_errorEEEvRKT_+0x18a)[0x7f494876c70a]
/usr/lib64/libgalera_smm.so(_ZN4asio6detail11throw_errorERKNS_10error_codeE+0x5b)[0x7f494876c82b]
/usr/lib64/libgalera_smm.so(_ZN5gcomm13AsioTcpSocket17assign_local_addrEv+0x707)[0x7f494875d137]
/usr/lib64/libgalera_smm.so(_ZN5gcomm13AsioTcpSocket15connect_handlerERKN4asio10error_codeE+0x67)[0x7f494875dda7]
/usr/lib64/libgalera_smm.so(_ZN4asio6detail26reactive_socket_connect_opIN5boost3_bi6bind_tIvNS2_4_mfi3mf1IvN5gcomm13AsioTcpSocketERKNS_10error_codeEEENS3_5list2INS3_5valueINS2_10shared_ptrIS8_EEEEPFNS2_3argILi1EEEvEEEEEE11do_completeEPNS0_15task_io_serviceEPNS0_25task_io_service_operationES9_m+0x10e)[0x7f494876822e]
/usr/lib64/libgalera_smm.so(_ZN4asio6detail15task_io_service3runERNS_10error_codeE+0x459)[0x7f4948788a49]
/usr/lib64/libgalera_smm.so(_ZN5gcomm12AsioProtonet10event_loopERKN2gu8datetime6PeriodE+0x1d6)...

Read more...

no longer affects: codership-mysql
tags: added: i30151
Alex Yurchenko (ayurchen) wrote :

fix committed in lp:galera/2.x r148

Changed in galera:
assignee: nobody → Teemu Ollakka (teemu-ollakka)
status: New → Fix Committed
Changed in percona-xtradb-cluster:
milestone: none → 5.5.30-23.7.4
Changed in percona-xtradb-cluster:
status: New → Fix Released

The difference I see is instead of

" what(): Bad file descriptor"

we see

" what(): Transport endpoint is not connected"

because of the additional code added.

This 'transport endpoint not connected' seems to mimic how nmap scans the port.

Now, we can say that cluster ports should be protected with iptables or so against this but again, it would be better if the server didn't crash on this.

Full backtrace here http://paste.wnohang.net/114621

and it is crashing here:

void gcomm::AsioTcpSocket::assign_remote_addr()
....
        remote_addr_ = uri_string(
            gcomm::TCP_SCHEME,
            gcomm::escape_addr(socket_.remote_endpoint().address()),
            gu::to_string(socket_.remote_endpoint().port())
            );

And, yes the error is

perror 107
OS error code 107: Transport endpoint is not connected

Related discussion here http://comments.gmane.org/gmane.comp.lib.boost.asio.user/4781 and http://boost.2283326.n4.nabble.com/boost-asio-exception-not-passed-to-async-callback-td2599023.html

So, https://bazaar.launchpad.net/~codership/galera/2.x/revision/148 add try/catch around connect_handler but not in accept_handler. So, adding try/catch + FAILED_HANDLER should help here right?

Teemu Ollakka (teemu-ollakka) wrote :

Adding try/catch and deleting failed socket object + creating new object for acceptor should do it, calling FAILED_HANDLER is not necessary since upper layers haven't seen the socket yet.

With sendto() it rather emits: (which I believe are from connect_handler())

130525 18:35:53 [Warning] WSREP: unserialize error invalid protocol version 2: 71 (Protocol error)
         at gcomm/src/gcomm/datagram.hpp:unserialize():108
130525 18:35:57 [Warning] WSREP: unserialize error invalid protocol version 4: 71 (Protocol error)
         at gcomm/src/gcomm/datagram.hpp:unserialize():108
130525 18:35:57 [Warning] WSREP: unserialize error invalid protocol version 4: 71 (Protocol error)
         at gcomm/src/gcomm/datagram.hpp:unserialize():108
130525 18:35:57 [Warning] WSREP: unserialize error invalid protocol version 2: 71 (Protocol error)
         at gcomm/src/gcomm/datagram.hpp:unserialize():108
130525 18:35:57 [Warning] WSREP: unserialize error invalid flags 6: 71 (Protocol error)
         at gcomm/src/gcomm/datagram.hpp:unserialize():101

I tested with https://bazaar.launchpad.net/~raghavendra-prabhu/percona-xtradb-cluster/galera-bug1153727/revision/125 it doesn't crash (it just logs 130525 18:53:15 [ERROR] WSREP: Transport endpoint is not connected) but other nodes can't connect either, hence requires a restart of the node.

@Teemu, ack. yeah, failed_handler is not in this scope (got a compile error with that :)).

Laurent Minost (lolomin) wrote :

Hi guys,

Thanks for these informations !
I've faced the same crash on one of my development PXDBC node using 5.5.30-23.7.4 so I can confirm that this bug is still present in latest release.
Can we FW both 4444 & 4567 ports (and maybe others?) from IPs that are not one of the cluster nodes to avoid this problem to occur again please ?

Regards,

Laurent

Changed in galera:
milestone: none → 23.2.6
Laurent Minost (lolomin) wrote :

Hi Alexey,

Thanks for this fix !

Can you please tell if firewalling ports like described above is enough to avoid this problem occuring again or there is others things needed that need us to wait for a new release to be published ?

Regards,

Laurent

Alex Yurchenko (ayurchen) wrote :

Laurent, this particular fix was already released by Percona with their latest Galera build. If you still have issues you should wait for another fix, tracked in https://bugs.launchpad.net/galera/+bug/1184034.

Firewalling the Galera ports should be enough - or at least will help.

@Laurent, as Alexey has mentioned, the fix for 'what(): Bad file
descriptor has b;een in latest PXC.

However, seeing your mail in the mailing list, it looks like for
you and few others it crashed with
" what(): Transport endpoint is not connected".

This is not fixed yet. It needs to be fixed in galera. Looks like
it will be fixed in Galera 23.2.6 in bug https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1184034 . You can look at 23.2.6 milestone here https://launchpad.net/galera/+milestone/23.2.6

You can look at https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1153727/comments/4 for more details.

Also, you can further track updates to this crash in https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1184034 itself. (since galera milestone is marked there)

Laurent Minost (lolomin) wrote :

Hi Raghavendra & Alexey,

Thanks for these detailed informations, so I will wait for next release and will try to block concerned ports for the moment to avoid the problem to occur again.

Regards,

Laurent

Laurent Minost (lolomin) wrote :
Download full text (11.4 KiB)

Hi all,

Bad news here, same crash today :

root@dedwen:/var/log/mysql# cat mysql-error.log
130602 6:31:51 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130602 7:12:25 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130602 7:13:50 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130602 10:18:11 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130603 7:28:16 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130603 7:28:17 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130603 12:03:43 [Warning] IP address '172.16.29.16' could not be resolved: Name or service not known
130604 5:25:34 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604 8:35:02 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604 8:40:28 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604 8:45:50 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604 8:48:28 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604 8:53:42 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604 9:40:29 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<asio::system_error> >'
  what(): Transport endpoint is not connected
11:24:56 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona Server better by reporting any
bugs at http://bugs.percona.com/

key_buffer_size=268435456
read_buffer_size=131072
max_used_connections=53
max_threads=102
thread_count=27
connection_count=27
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1947646 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x30000
/usr/sbin/mysqld(my_print_stacktrace+0x35)[0x7ed245]
/usr/sbin/mysqld(handle_fatal_signal+0x4b4)[0x6ba864]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf030)[0x7f26d36e9030]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35)[0x7f26d22c5475]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x180)[0x7f26d22c86f0]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x11d)[0x7f26c877a89d]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x63996)[0x7f26c8778996]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x639c3)[0x7f26c87789c3]
/usr/lib/x86_64-linux-...

Changed in galera:
status: Fix Committed → Fix Released
Amol (ajkedar) wrote :
Download full text (4.1 KiB)

i had 2 nodes crashed on ubuntu 10.04 LTS and percona-xtradb-cluster-common-5.5 5.5.30-23.7.4-403.lucid

140214 5:54:45 [Note] WSREP: Flow-control interval: [23, 23]
140214 5:54:45 [Note] WSREP: New cluster view: global state: af24cb2c-a76e-11e2-0800-982f50316bdc:13855971, view# 157: Primary, number of nodes: 2, my index: 0, protocol version 2
140214 5:54:45 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
140214 5:54:45 [Note] WSREP: Assign initial position for certification: 13855971, protocol version: 2
140214 5:54:51 [Note] WSREP: cleaning up 12caf6b8-b399-11e2-0800-85f0fe40ed2a (tcp://10.1.7.1:4567)
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<asio::system_error> >'
  what(): Transport endpoint is not connected
06:30:44 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona Server better by reporting any
bugs at http://bugs.percona.com/

key_buffer_size=67108864
read_buffer_size=131072
max_used_connections=85
max_threads=153
thread_count=5
connection_count=5
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 400387 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x35)[0x7fb9f5]
/usr/sbin/mysqld(handle_fatal_signal+0x4b4)[0x6c4724]
/lib/libpthread.so.0(+0xf8f0)[0x7f1b8a4a58f0]
/lib/libc.so.6(gsignal+0x35)[0x7f1b89094b25]
/lib/libc.so.6(abort+0x180)[0x7f1b89098670]
/usr/lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x115)[0x7f1b8639b8c5]
/usr/lib/libstdc++.so.6(+0xcacf6)[0x7f1b86399cf6]
/usr/lib/libstdc++.so.6(+0xcad23)[0x7f1b86399d23]
/usr/lib/libstdc++.so.6(+0xcae1e)[0x7f1b86399e1e]
/usr/lib64/libgalera_smm.so(_ZN5boost15throw_exceptionIN4asio12system_errorEEEvRKT_+0x18a)[0x7f1b8671aeba]
/usr/lib64/libgalera_smm.so(_ZNK4asio12basic_socketINS_2ip3tcpENS_21stream_socket_serviceIS2_EEE15remote_endpointEv+0x1fe)[0x7f1b8671f85e]
/usr/lib64/libgalera_smm.so(_ZN5gcomm13AsioTcpSocket18assign_remote_addrEv+0x627)[0x7f1b86705c57]
/usr/lib64/libgalera_smm.so(_ZN5gcomm15AsioTcpAcceptor14accept_handlerEN5boost10shared_ptrINS_6SocketEEERKN4asio10error_codeE+0xf3)[0x7f1b86706733]
/usr/lib64/libgalera_smm.so(_ZN4asio6detail25reactive_socket_accept_opINS_12basic_socketINS_2ip3tcpENS_21stream_socket_serviceIS4_EEEES4_N5boost3_bi6bind_tIvNS8_4_mfi3mf2IvN5gcomm15AsioTcpAcceptorENS8_10shared_ptrINSD_6SocketEEERKNS_10error_codeEEENS9_5list3INS9_5valueIPSE_EENSN_ISH_EEPFNS8...

Read more...

Alex Yurchenko (ayurchen) wrote :

Amol, which Galera version are you using?

Amol (ajkedar) wrote :

It Is

ii percona-xtradb-cluster-galera-2.x 150.lucid Galera components of Percona XtraDB Cluster

@Amol,

Can you please upgrade to latest galera for this issue?

Amol (ajkedar) wrote :

are these versions ok?

||/ Name Version Description
+++-============================================-============================================-========================================================================================================
ii percona-toolkit 2.2.6 Advanced MySQL and system command-line tools
ii percona-xtrabackup 2.1.5-680-1.precise Open source backup tool for InnoDB and XtraDB
ii percona-xtradb-cluster-client-5.5 5.5.34-23.7.6-565.precise Percona Server database client binaries
ii percona-xtradb-cluster-common-5.5 5.5.34-23.7.6-565.precise Percona Server database common files
ii percona-xtradb-cluster-galera-2.x 162.precise Galera components of Percona XtraDB Cluster
ii percona-xtradb-cluster-server-5.5 5.5.34-23.7.6-565.precise Percona Server database server binaries

To post a comment you must log in.
This report contains Public information  Edit
Everyone can see this information.

Other bug subscribers