Graceful handling of possible network issues

Bug #1153727 reported by Raghavendra D Prabhu
14
This bug affects 2 people
Affects Status Importance Assigned to Milestone
Galera
Fix Released
Undecided
Teemu Ollakka
Percona XtraDB Cluster moved to https://jira.percona.com/projects/PXC
Fix Released
Undecided
Unassigned

Bug Description

Under certain conditions mysqld can crash in the manner described here https://groups.google.com/forum/#!msg/codership-team/ZJT_YHzZgc4/j5zZMMOXOcoJ

Also, as described here: http://sprunge.us/HhHZ

The network issues can present in any way, however, if galera handles them better without crashing the server it would be nice.

Tags: i30151
Revision history for this message
Alex Yurchenko (ayurchen) wrote :
Download full text (3.4 KiB)

Posting the stack here for convenience:

130311 4:02:40 [Warning] Aborted connection 31256926 to db: 'unconnected' user: 'admin' host: '10.2.46.110' (Got an error reading communication packets)
130311 4:02:40 [Warning] Aborted connection 31256936 to db: 'unconnected' user: 'admin' host: '10.2.46.110' (Got an error reading communication packets)
130311 4:02:40 [Warning] Aborted connection 31256942 to db: 'unconnected' user: 'admin' host: '10.2.46.110' (Got an error reading communication packets)
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<asio::system_error> >'
  what(): Bad file descriptor
11:02:49 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona Server better by reporting any
bugs at http://bugs.percona.com/

key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=232
max_threads=15000
thread_count=6
connection_count=6
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 32837918 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x35)[0x7c9765]
/usr/sbin/mysqld(handle_fatal_signal+0x4a4)[0x6a2ce4]
/lib64/libpthread.so.0[0x3bbba0f500]
/lib64/libc.so.6(gsignal+0x35)[0x3bbb2328a5]
/lib64/libc.so.6(abort+0x175)[0x3bbb234085]
/usr/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x12d)[0x7f49483f2a5d]
/usr/lib64/libstdc++.so.6(+0x3bbdabcbe6)[0x7f49483f0be6]
/usr/lib64/libstdc++.so.6(+0x3bbdabcc13)[0x7f49483f0c13]
/usr/lib64/libstdc++.so.6(+0x3bbdabcd0e)[0x7f49483f0d0e]
/usr/lib64/libgalera_smm.so(_ZN5boost15throw_exceptionIN4asio12system_errorEEEvRKT_+0x18a)[0x7f494876c70a]
/usr/lib64/libgalera_smm.so(_ZN4asio6detail11throw_errorERKNS_10error_codeE+0x5b)[0x7f494876c82b]
/usr/lib64/libgalera_smm.so(_ZN5gcomm13AsioTcpSocket17assign_local_addrEv+0x707)[0x7f494875d137]
/usr/lib64/libgalera_smm.so(_ZN5gcomm13AsioTcpSocket15connect_handlerERKN4asio10error_codeE+0x67)[0x7f494875dda7]
/usr/lib64/libgalera_smm.so(_ZN4asio6detail26reactive_socket_connect_opIN5boost3_bi6bind_tIvNS2_4_mfi3mf1IvN5gcomm13AsioTcpSocketERKNS_10error_codeEEENS3_5list2INS3_5valueINS2_10shared_ptrIS8_EEEEPFNS2_3argILi1EEEvEEEEEE11do_completeEPNS0_15task_io_serviceEPNS0_25task_io_service_operationES9_m+0x10e)[0x7f494876822e]
/usr/lib64/libgalera_smm.so(_ZN4asio6detail15task_io_service3runERNS_10error_codeE+0x459)[0x7f4948788a49]
/usr/lib64/libgalera_smm.so(_ZN5gcomm12AsioProtonet10event_loopERKN2gu8datetime6PeriodE+0x1d6)...

Read more...

no longer affects: codership-mysql
tags: added: i30151
Revision history for this message
Alex Yurchenko (ayurchen) wrote :

fix committed in lp:galera/2.x r148

Changed in galera:
assignee: nobody → Teemu Ollakka (teemu-ollakka)
status: New → Fix Committed
Changed in percona-xtradb-cluster:
milestone: none → 5.5.30-23.7.4
Changed in percona-xtradb-cluster:
status: New → Fix Released
Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

The issue seems to be recurring again here https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1184034

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

The difference I see is instead of

" what(): Bad file descriptor"

we see

" what(): Transport endpoint is not connected"

because of the additional code added.

This 'transport endpoint not connected' seems to mimic how nmap scans the port.

Now, we can say that cluster ports should be protected with iptables or so against this but again, it would be better if the server didn't crash on this.

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

Full backtrace here http://paste.wnohang.net/114621

and it is crashing here:

void gcomm::AsioTcpSocket::assign_remote_addr()
....
        remote_addr_ = uri_string(
            gcomm::TCP_SCHEME,
            gcomm::escape_addr(socket_.remote_endpoint().address()),
            gu::to_string(socket_.remote_endpoint().port())
            );

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

And, yes the error is

perror 107
OS error code 107: Transport endpoint is not connected

Related discussion here http://comments.gmane.org/gmane.comp.lib.boost.asio.user/4781 and http://boost.2283326.n4.nabble.com/boost-asio-exception-not-passed-to-async-callback-td2599023.html

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

So, https://bazaar.launchpad.net/~codership/galera/2.x/revision/148 add try/catch around connect_handler but not in accept_handler. So, adding try/catch + FAILED_HANDLER should help here right?

Revision history for this message
Teemu Ollakka (teemu-ollakka) wrote :

Adding try/catch and deleting failed socket object + creating new object for acceptor should do it, calling FAILED_HANDLER is not necessary since upper layers haven't seen the socket yet.

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

With sendto() it rather emits: (which I believe are from connect_handler())

130525 18:35:53 [Warning] WSREP: unserialize error invalid protocol version 2: 71 (Protocol error)
         at gcomm/src/gcomm/datagram.hpp:unserialize():108
130525 18:35:57 [Warning] WSREP: unserialize error invalid protocol version 4: 71 (Protocol error)
         at gcomm/src/gcomm/datagram.hpp:unserialize():108
130525 18:35:57 [Warning] WSREP: unserialize error invalid protocol version 4: 71 (Protocol error)
         at gcomm/src/gcomm/datagram.hpp:unserialize():108
130525 18:35:57 [Warning] WSREP: unserialize error invalid protocol version 2: 71 (Protocol error)
         at gcomm/src/gcomm/datagram.hpp:unserialize():108
130525 18:35:57 [Warning] WSREP: unserialize error invalid flags 6: 71 (Protocol error)
         at gcomm/src/gcomm/datagram.hpp:unserialize():101

I tested with https://bazaar.launchpad.net/~raghavendra-prabhu/percona-xtradb-cluster/galera-bug1153727/revision/125 it doesn't crash (it just logs 130525 18:53:15 [ERROR] WSREP: Transport endpoint is not connected) but other nodes can't connect either, hence requires a restart of the node.

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

@Teemu, ack. yeah, failed_handler is not in this scope (got a compile error with that :)).

Revision history for this message
Laurent Minost (lolomin) wrote :

Hi guys,

Thanks for these informations !
I've faced the same crash on one of my development PXDBC node using 5.5.30-23.7.4 so I can confirm that this bug is still present in latest release.
Can we FW both 4444 & 4567 ports (and maybe others?) from IPs that are not one of the cluster nodes to avoid this problem to occur again please ?

Regards,

Laurent

Changed in galera:
milestone: none → 23.2.6
Revision history for this message
Laurent Minost (lolomin) wrote :

Hi Alexey,

Thanks for this fix !

Can you please tell if firewalling ports like described above is enough to avoid this problem occuring again or there is others things needed that need us to wait for a new release to be published ?

Regards,

Laurent

Revision history for this message
Alex Yurchenko (ayurchen) wrote :

Laurent, this particular fix was already released by Percona with their latest Galera build. If you still have issues you should wait for another fix, tracked in https://bugs.launchpad.net/galera/+bug/1184034.

Firewalling the Galera ports should be enough - or at least will help.

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

@Laurent, as Alexey has mentioned, the fix for 'what(): Bad file
descriptor has b;een in latest PXC.

However, seeing your mail in the mailing list, it looks like for
you and few others it crashed with
" what(): Transport endpoint is not connected".

This is not fixed yet. It needs to be fixed in galera. Looks like
it will be fixed in Galera 23.2.6 in bug https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1184034 . You can look at 23.2.6 milestone here https://launchpad.net/galera/+milestone/23.2.6

You can look at https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1153727/comments/4 for more details.

Also, you can further track updates to this crash in https://bugs.launchpad.net/percona-xtradb-cluster/+bug/1184034 itself. (since galera milestone is marked there)

Revision history for this message
Laurent Minost (lolomin) wrote :

Hi Raghavendra & Alexey,

Thanks for these detailed informations, so I will wait for next release and will try to block concerned ports for the moment to avoid the problem to occur again.

Regards,

Laurent

Revision history for this message
Laurent Minost (lolomin) wrote :
Download full text (11.4 KiB)

Hi all,

Bad news here, same crash today :

root@dedwen:/var/log/mysql# cat mysql-error.log
130602 6:31:51 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130602 7:12:25 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130602 7:13:50 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130602 10:18:11 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130603 7:28:16 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130603 7:28:17 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130603 12:03:43 [Warning] IP address '172.16.29.16' could not be resolved: Name or service not known
130604 5:25:34 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604 8:35:02 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604 8:40:28 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604 8:45:50 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604 8:48:28 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604 8:53:42 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
130604 9:40:29 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<asio::system_error> >'
  what(): Transport endpoint is not connected
11:24:56 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona Server better by reporting any
bugs at http://bugs.percona.com/

key_buffer_size=268435456
read_buffer_size=131072
max_used_connections=53
max_threads=102
thread_count=27
connection_count=27
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1947646 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x30000
/usr/sbin/mysqld(my_print_stacktrace+0x35)[0x7ed245]
/usr/sbin/mysqld(handle_fatal_signal+0x4b4)[0x6ba864]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xf030)[0x7f26d36e9030]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35)[0x7f26d22c5475]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x180)[0x7f26d22c86f0]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x11d)[0x7f26c877a89d]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x63996)[0x7f26c8778996]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0x639c3)[0x7f26c87789c3]
/usr/lib/x86_64-linux-...

Changed in galera:
status: Fix Committed → Fix Released
Revision history for this message
Amol (ajkedar) wrote :
Download full text (4.1 KiB)

i had 2 nodes crashed on ubuntu 10.04 LTS and percona-xtradb-cluster-common-5.5 5.5.30-23.7.4-403.lucid

140214 5:54:45 [Note] WSREP: Flow-control interval: [23, 23]
140214 5:54:45 [Note] WSREP: New cluster view: global state: af24cb2c-a76e-11e2-0800-982f50316bdc:13855971, view# 157: Primary, number of nodes: 2, my index: 0, protocol version 2
140214 5:54:45 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
140214 5:54:45 [Note] WSREP: Assign initial position for certification: 13855971, protocol version: 2
140214 5:54:51 [Note] WSREP: cleaning up 12caf6b8-b399-11e2-0800-85f0fe40ed2a (tcp://10.1.7.1:4567)
terminate called after throwing an instance of 'boost::exception_detail::clone_impl<boost::exception_detail::error_info_injector<asio::system_error> >'
  what(): Transport endpoint is not connected
06:30:44 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.
Please help us make Percona Server better by reporting any
bugs at http://bugs.percona.com/

key_buffer_size=67108864
read_buffer_size=131072
max_used_connections=85
max_threads=153
thread_count=5
connection_count=5
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 400387 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x35)[0x7fb9f5]
/usr/sbin/mysqld(handle_fatal_signal+0x4b4)[0x6c4724]
/lib/libpthread.so.0(+0xf8f0)[0x7f1b8a4a58f0]
/lib/libc.so.6(gsignal+0x35)[0x7f1b89094b25]
/lib/libc.so.6(abort+0x180)[0x7f1b89098670]
/usr/lib/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x115)[0x7f1b8639b8c5]
/usr/lib/libstdc++.so.6(+0xcacf6)[0x7f1b86399cf6]
/usr/lib/libstdc++.so.6(+0xcad23)[0x7f1b86399d23]
/usr/lib/libstdc++.so.6(+0xcae1e)[0x7f1b86399e1e]
/usr/lib64/libgalera_smm.so(_ZN5boost15throw_exceptionIN4asio12system_errorEEEvRKT_+0x18a)[0x7f1b8671aeba]
/usr/lib64/libgalera_smm.so(_ZNK4asio12basic_socketINS_2ip3tcpENS_21stream_socket_serviceIS2_EEE15remote_endpointEv+0x1fe)[0x7f1b8671f85e]
/usr/lib64/libgalera_smm.so(_ZN5gcomm13AsioTcpSocket18assign_remote_addrEv+0x627)[0x7f1b86705c57]
/usr/lib64/libgalera_smm.so(_ZN5gcomm15AsioTcpAcceptor14accept_handlerEN5boost10shared_ptrINS_6SocketEEERKN4asio10error_codeE+0xf3)[0x7f1b86706733]
/usr/lib64/libgalera_smm.so(_ZN4asio6detail25reactive_socket_accept_opINS_12basic_socketINS_2ip3tcpENS_21stream_socket_serviceIS4_EEEES4_N5boost3_bi6bind_tIvNS8_4_mfi3mf2IvN5gcomm15AsioTcpAcceptorENS8_10shared_ptrINSD_6SocketEEERKNS_10error_codeEEENS9_5list3INS9_5valueIPSE_EENSN_ISH_EEPFNS8...

Read more...

Revision history for this message
Alex Yurchenko (ayurchen) wrote :

Amol, which Galera version are you using?

Revision history for this message
Amol (ajkedar) wrote :

It Is

ii percona-xtradb-cluster-galera-2.x 150.lucid Galera components of Percona XtraDB Cluster

Revision history for this message
Raghavendra D Prabhu (raghavendra-prabhu) wrote :

@Amol,

Can you please upgrade to latest galera for this issue?

Revision history for this message
Amol (ajkedar) wrote :

are these versions ok?

||/ Name Version Description
+++-============================================-============================================-========================================================================================================
ii percona-toolkit 2.2.6 Advanced MySQL and system command-line tools
ii percona-xtrabackup 2.1.5-680-1.precise Open source backup tool for InnoDB and XtraDB
ii percona-xtradb-cluster-client-5.5 5.5.34-23.7.6-565.precise Percona Server database client binaries
ii percona-xtradb-cluster-common-5.5 5.5.34-23.7.6-565.precise Percona Server database common files
ii percona-xtradb-cluster-galera-2.x 162.precise Galera components of Percona XtraDB Cluster
ii percona-xtradb-cluster-server-5.5 5.5.34-23.7.6-565.precise Percona Server database server binaries

Revision history for this message
Shahriyar Rzayev (rzayev-sehriyar) wrote :

Percona now uses JIRA for bug reports so this bug report is migrated to: https://jira.percona.com/browse/PXC-1309

To post a comment you must log in.
This report contains Public information  
Everyone can see this information.

Other bug subscribers

Remote bug watches

Bug watches keep track of this bug in other bug trackers.